* [PATCH v10 00/11] x86: Mitigations for SP2/CVE-2017-5715/Branch Target Injection
@ 2018-01-24 13:12 Andrew Cooper
  2018-01-24 13:12 ` [PATCH v10 01/11] x86/cpuid: Handling of IBRS/IBPB, STIBP and IBRS for guests Andrew Cooper
                   ` (10 more replies)
  0 siblings, 11 replies; 57+ messages in thread
From: Andrew Cooper @ 2018-01-24 13:12 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper

This series is available in git form from:

  http://xenbits.xen.org/gitweb/?p=people/andrewcoop/xen.git;a=shortlog;h=refs/heads/sp2-mitigations-v10

A copy of Intel's spec for IBRS/IBPB can be found here:

  https://software.intel.com/sites/default/files/managed/c5/63/336996-Speculative-Execution-Side-Channel-Mitigations.pdf

In addition to this software series, you will need the following:

  1) A compiler which understands -mindirect-branch=thunk-external and
     -mindirect-branch-register.  A GCC patch series implementing this should
     be available imminently.  In the meantime, a development branch can be
     obtained from:

     https://github.com/hjl-tools/gcc/commits/hjl/indirect/gcc-7-branch/master

  2) New microcode from Intel and AMD.  These provide new MSRs for Xen to use,
     and to virtualise for guest kernels to use.

There are some limitations, even with the work presented here.

  1) vCPU-to-vCPU SP2 attacks can only be mitigated at the hypervisor level
     with IBPB support which, for internal pipeline reasons, we do not expect
     to be made available on older processors.  For now, I will leave these
     details to the hardware vendors.

  2) Hardware lacking SMEP is in a worse position than hardware with SMEP.  If
     you have SMEP (Intel IvyBridge and later, some AMD Fam16h and all Fam17h
     and later), make absolutely sure it is enabled in the BIOS and working.

  3) On hardware lacking SMEP support, it is still an open question how to
     protect against RSB-to-SMM speculation.  Native operating systems can fix
     this by prohibiting userspace from mmap()'ing addresses which alias the
     SMM range, but Xen has no feasible way of enforcing this restriction on
     PV guests, even if we could tolerate the ABI breakage.  (However, see the
     forthcoming SP3 mitigation series for alternatives for untrusted PV
     guests).

~Andrew

Changes from v9:
  * Fix HVM emulation of MSR_SPEC_CTRL
  * Optimise the asm somewhat by reusing XPTI changes.
  * RFC patch for avoiding alternatives problems with NMI/#MC

Andrew Cooper (11):
  x86/cpuid: Handling of IBRS/IBPB, STIBP and IBRS for guests
  x86/msr: Emulation of MSR_{SPEC_CTRL,PRED_CMD} for guests
  x86/migrate: Move MSR_SPEC_CTRL on migrate
  x86/hvm: Permit guests direct access to MSR_{SPEC_CTRL,PRED_CMD}
  x86/entry: Organise the use of MSR_SPEC_CTRL at each entry/exit point
  x86/entry: Organise the clobbering of the RSB/RAS on entry to Xen
  x86/entry: Avoid using alternatives in NMI/#MC paths
  x86/boot: Calculate the most appropriate BTI mitigation to use
  x86/ctxt: Issue a speculation barrier between vcpu contexts
  x86/cpuid: Offer Indirect Branch Controls to guests
  x86/idle: Clear SPEC_CTRL while idle

 docs/misc/xen-command-line.markdown         |  13 +-
 tools/libxc/xc_cpuid_x86.c                  |   4 +-
 xen/arch/x86/acpi/cpu_idle.c                |  21 ++
 xen/arch/x86/cpu/mwait-idle.c               |   7 +
 xen/arch/x86/cpuid.c                        |  28 +++
 xen/arch/x86/domain.c                       |  37 +++
 xen/arch/x86/domctl.c                       |  21 ++
 xen/arch/x86/hvm/hvm.c                      |   2 +
 xen/arch/x86/hvm/svm/entry.S                |  11 +-
 xen/arch/x86/hvm/svm/svm.c                  |   5 +
 xen/arch/x86/hvm/vmx/entry.S                |  19 ++
 xen/arch/x86/hvm/vmx/vmx.c                  |  17 ++
 xen/arch/x86/msr.c                          |  45 ++++
 xen/arch/x86/setup.c                        |   1 +
 xen/arch/x86/smpboot.c                      |   2 +
 xen/arch/x86/spec_ctrl.c                    | 150 +++++++++++-
 xen/arch/x86/x86_64/asm-offsets.c           |   7 +
 xen/arch/x86/x86_64/compat/entry.S          |  14 ++
 xen/arch/x86/x86_64/entry.S                 |  48 +++-
 xen/include/asm-x86/asm_defns.h             |   3 +
 xen/include/asm-x86/cpufeatures.h           |   2 +
 xen/include/asm-x86/current.h               |   7 +
 xen/include/asm-x86/msr-index.h             |   2 +
 xen/include/asm-x86/msr.h                   |  10 +
 xen/include/asm-x86/nops.h                  |   7 +
 xen/include/asm-x86/spec_ctrl.h             |  47 ++++
 xen/include/asm-x86/spec_ctrl_asm.h         | 338 ++++++++++++++++++++++++++++
 xen/include/public/arch-x86/cpufeatureset.h |   6 +-
 28 files changed, 859 insertions(+), 15 deletions(-)
 create mode 100644 xen/include/asm-x86/spec_ctrl_asm.h

-- 
2.1.4


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH v10 01/11] x86/cpuid: Handling of IBRS/IBPB, STIBP and IBRS for guests
  2018-01-24 13:12 [PATCH v10 00/11] x86: Mitigations for SP2/CVE-2017-5715/Branch Target Injection Andrew Cooper
@ 2018-01-24 13:12 ` Andrew Cooper
  2018-02-01  9:06   ` Jan Beulich
  2018-01-24 13:12 ` [PATCH v10 02/11] x86/msr: Emulation of MSR_{SPEC_CTRL, PRED_CMD} " Andrew Cooper
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 57+ messages in thread
From: Andrew Cooper @ 2018-01-24 13:12 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper

Intel specifies IBRS/IBPB (combined, in a single bit) and STIBP as a separate
bit.  AMD specifies IBPB alone in a 3rd bit.

AMD's IBPB is a subset of Intel's combined IBRS/IBPB.  For performance
reasons, administrators might wish to express "IBPB only" even on Intel
hardware, so we allow the AMD bit to be used for this purpose.

The behaviour of STIBP is more complicated.

It is our current understanding that STIBP will be advertised on HT-capable
hardware irrespective of whether HT is enabled, but not advertised on
HT-incapable hardware.  However, for ease of virtualisation, STIBP's
functionality is ignored rather than reserved by microcode/hardware on
HT-incapable hardware.

For guest safety, we treat STIBP as special, always override the toolstack
choice, and always advertise STIBP if IBRS is available.  This removes the
corner case where STIBP is not advertised, but the guest is running on
HT-capable hardware where it does matter.

Finally as a bugfix, update the libxc CPUID logic to understand the e8b
feature leaf, which has the side effect of also offering CLZERO to guests on
applicable hardware.
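
For illustration only (not part of this patch), the sketch below shows how a
guest might detect these bits, using the architectural positions matching the
cpufeatureset.h entries in this patch: CPUID leaf 7 subleaf 0, %edx bits 26/27
for IBRSB/STIBP, and CPUID leaf 0x80000008, %ebx bit 12 for AMD's IBPB.

    /* Sketch: guest-side feature detection.  Not Xen code. */
    #include <cpuid.h>
    #include <stdbool.h>

    static bool guest_has_ibrsb(void)
    {
        unsigned int a, b, c, d;

        return __get_cpuid_count(7, 0, &a, &b, &c, &d) && (d & (1u << 26));
    }

    static bool guest_has_stibp(void)
    {
        unsigned int a, b, c, d;

        return __get_cpuid_count(7, 0, &a, &b, &c, &d) && (d & (1u << 27));
    }

    static bool guest_has_amd_ibpb(void)
    {
        unsigned int a, b, c, d;

        return __get_cpuid(0x80000008, &a, &b, &c, &d) && (b & (1u << 12));
    }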

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
v9:
 * New
 * Move libxc fix from a later patch, and associated toolstack ack.
---
 tools/libxc/xc_cpuid_x86.c                  |  4 +++-
 xen/arch/x86/cpuid.c                        | 28 ++++++++++++++++++++++++++++
 xen/include/public/arch-x86/cpufeatureset.h |  2 +-
 3 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/tools/libxc/xc_cpuid_x86.c b/tools/libxc/xc_cpuid_x86.c
index 25b922e..9fa2f7c 100644
--- a/tools/libxc/xc_cpuid_x86.c
+++ b/tools/libxc/xc_cpuid_x86.c
@@ -465,7 +465,9 @@ static void xc_cpuid_hvm_policy(xc_interface *xch,
 
     case 0x80000008:
         regs[0] &= 0x0000ffffu;
-        regs[1] = regs[3] = 0;
+        regs[1] = info->featureset[featureword_of(X86_FEATURE_CLZERO)];
+        /* regs[2] handled in the per-vendor logic. */
+        regs[3] = 0;
         break;
 
     case 0x00000002: /* Intel cache info (dumped by AMD policy) */
diff --git a/xen/arch/x86/cpuid.c b/xen/arch/x86/cpuid.c
index 2ef71d2..b3c9ac6 100644
--- a/xen/arch/x86/cpuid.c
+++ b/xen/arch/x86/cpuid.c
@@ -383,6 +383,16 @@ static void __init calculate_pv_max_policy(void)
     /* Unconditionally claim to be able to set the hypervisor bit. */
     __set_bit(X86_FEATURE_HYPERVISOR, pv_featureset);
 
+    /* On hardware with IBRS/IBPB support, there are further adjustments. */
+    if ( test_bit(X86_FEATURE_IBRSB, pv_featureset) )
+    {
+        /* Offer STIBP unconditionally.  It is a nop on non-HT hardware. */
+        __set_bit(X86_FEATURE_STIBP, pv_featureset);
+
+        /* AMD's IBPB is a subset of IBRS/IBPB. */
+        __set_bit(X86_FEATURE_IBPB, pv_featureset);
+    }
+
     sanitise_featureset(pv_featureset);
     cpuid_featureset_to_policy(pv_featureset, p);
     recalculate_xstate(p);
@@ -440,6 +450,16 @@ static void __init calculate_hvm_max_policy(void)
             __clear_bit(X86_FEATURE_XSAVES, hvm_featureset);
     }
 
+    /* On hardware with IBRS/IBPB support, there are further adjustments. */
+    if ( test_bit(X86_FEATURE_IBRSB, hvm_featureset) )
+    {
+        /* Offer STIBP unconditionally.  It is a nop on non-HT hardware. */
+        __set_bit(X86_FEATURE_STIBP, hvm_featureset);
+
+        /* AMD's IBPB is a subset of IBRS/IBPB. */
+        __set_bit(X86_FEATURE_IBPB, hvm_featureset);
+    }
+
     sanitise_featureset(hvm_featureset);
     cpuid_featureset_to_policy(hvm_featureset, p);
     recalculate_xstate(p);
@@ -581,6 +601,14 @@ void recalculate_cpuid_policy(struct domain *d)
     recalculate_xstate(p);
     recalculate_misc(p);
 
+    /*
+     * Override STIBP to match IBRS.  Guests can safely use STIBP
+     * functionality on non-HT hardware, but can't necessarily protect
+     * themselves from SP2/Spectre/Branch Target Injection if STIBP is hidden
+     * on HT-capable hardware.
+     */
+    p->feat.stibp = p->feat.ibrsb;
+
     for ( i = 0; i < ARRAY_SIZE(p->cache.raw); ++i )
     {
         if ( p->cache.subleaf[i].type >= 1 &&
diff --git a/xen/include/public/arch-x86/cpufeatureset.h b/xen/include/public/arch-x86/cpufeatureset.h
index e148755..0f21fed 100644
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -243,7 +243,7 @@ XEN_CPUFEATURE(IBPB,          8*32+12) /*   IBPB support only (no IBRS, used by
 XEN_CPUFEATURE(AVX512_4VNNIW, 9*32+ 2) /*A  AVX512 Neural Network Instructions */
 XEN_CPUFEATURE(AVX512_4FMAPS, 9*32+ 3) /*A  AVX512 Multiply Accumulation Single Precision */
 XEN_CPUFEATURE(IBRSB,         9*32+26) /*   IBRS and IBPB support (used by Intel) */
-XEN_CPUFEATURE(STIBP,         9*32+27) /*   STIBP */
+XEN_CPUFEATURE(STIBP,         9*32+27) /*!  STIBP */
 
 #endif /* XEN_CPUFEATURE */
 
-- 
2.1.4


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v10 02/11] x86/msr: Emulation of MSR_{SPEC_CTRL, PRED_CMD} for guests
  2018-01-24 13:12 [PATCH v10 00/11] x86: Mitigations for SP2/CVE-2017-5715/Branch Target Injection Andrew Cooper
  2018-01-24 13:12 ` [PATCH v10 01/11] x86/cpuid: Handling of IBRS/IBPB, STIBP and IBRS for guests Andrew Cooper
@ 2018-01-24 13:12 ` Andrew Cooper
  2018-01-25 12:25   ` Jan Beulich
  2018-01-24 13:12 ` [PATCH v10 03/11] x86/migrate: Move MSR_SPEC_CTRL on migrate Andrew Cooper
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 57+ messages in thread
From: Andrew Cooper @ 2018-01-24 13:12 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich

As per the spec currently available here:

https://software.intel.com/sites/default/files/managed/c5/63/336996-Speculative-Execution-Side-Channel-Mitigations.pdf

MSR_ARCH_CAPABILITIES will only come into existence on new hardware, but is
implemented as a straight #GP for now to avoid being leaky when new hardware
arrives.
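
For illustration only (not part of this patch), the sketch below shows the
guest-side usage being emulated, with the MSR indices/bits from the msr-index.h
and msr.h hunks below (SPEC_CTRL is MSR 0x48, PRED_CMD is MSR 0x49, IBPB is
bit 0).

    /* Sketch: how a guest kernel drives these MSRs.  Not Xen code. */
    static inline void wrmsr(unsigned int msr, unsigned int lo, unsigned int hi)
    {
        asm volatile ("wrmsr" :: "c" (msr), "a" (lo), "d" (hi) : "memory");
    }

    /* Flush indirect branch predictions, e.g. around an address space switch. */
    static void issue_ibpb(void)
    {
        wrmsr(0x49 /* MSR_PRED_CMD */, 1 /* PRED_CMD_IBPB */, 0);
    }

    /* Restrict indirect branch speculation while running sensitive code. */
    static void set_ibrs(void)
    {
        wrmsr(0x48 /* MSR_SPEC_CTRL */, 1 /* SPEC_CTRL_IBRS */, 0);
    }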

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>

v9:
 * Alter the STIBP handling to match Intel's latest spec
 * Drop spec_ctrl.guest as it is no longer needed
---
 xen/arch/x86/msr.c              | 45 +++++++++++++++++++++++++++++++++++++++++
 xen/include/asm-x86/msr-index.h |  2 ++
 xen/include/asm-x86/msr.h       | 10 +++++++++
 3 files changed, 57 insertions(+)

diff --git a/xen/arch/x86/msr.c b/xen/arch/x86/msr.c
index 187f862..7875d9c 100644
--- a/xen/arch/x86/msr.c
+++ b/xen/arch/x86/msr.c
@@ -120,11 +120,22 @@ int init_vcpu_msr_policy(struct vcpu *v)
 
 int guest_rdmsr(const struct vcpu *v, uint32_t msr, uint64_t *val)
 {
+    const struct cpuid_policy *cp = v->domain->arch.cpuid;
     const struct msr_domain_policy *dp = v->domain->arch.msr;
     const struct msr_vcpu_policy *vp = v->arch.msr;
 
     switch ( msr )
     {
+    case MSR_PRED_CMD:
+        /* Write-only */
+        goto gp_fault;
+
+    case MSR_SPEC_CTRL:
+        if ( !cp->feat.ibrsb )
+            goto gp_fault;
+        *val = vp->spec_ctrl.raw;
+        break;
+
     case MSR_INTEL_PLATFORM_INFO:
         if ( !dp->plaform_info.available )
             goto gp_fault;
@@ -132,6 +143,10 @@ int guest_rdmsr(const struct vcpu *v, uint32_t msr, uint64_t *val)
                _MSR_PLATFORM_INFO_CPUID_FAULTING;
         break;
 
+    case MSR_ARCH_CAPABILITIES:
+        /* Not implemented yet. */
+        goto gp_fault;
+
     case MSR_INTEL_MISC_FEATURES_ENABLES:
         if ( !vp->misc_features_enables.available )
             goto gp_fault;
@@ -153,14 +168,44 @@ int guest_wrmsr(struct vcpu *v, uint32_t msr, uint64_t val)
 {
     const struct vcpu *curr = current;
     struct domain *d = v->domain;
+    const struct cpuid_policy *cp = d->arch.cpuid;
     struct msr_domain_policy *dp = d->arch.msr;
     struct msr_vcpu_policy *vp = v->arch.msr;
 
     switch ( msr )
     {
     case MSR_INTEL_PLATFORM_INFO:
+    case MSR_ARCH_CAPABILITIES:
+        /* Read-only */
         goto gp_fault;
 
+    case MSR_SPEC_CTRL:
+        if ( !cp->feat.ibrsb )
+            goto gp_fault; /* MSR available? */
+
+        /*
+         * Note: SPEC_CTRL_STIBP is specified as safe to use (i.e. ignored)
+         * when STIBP isn't enumerated in hardware.
+         */
+
+        if ( val & ~(SPEC_CTRL_IBRS | SPEC_CTRL_STIBP) )
+            goto gp_fault; /* Rsvd bit set? */
+
+        vp->spec_ctrl.raw = val;
+        break;
+
+    case MSR_PRED_CMD:
+        if ( !cp->feat.ibrsb && !cp->extd.ibpb )
+            goto gp_fault; /* MSR available? */
+
+        /*
+         * The only defined behaviour is when writing PRED_CMD_IBPB.  In
+         * practice, real hardware accepts any value without faulting.
+         */
+        if ( v == curr && (val & PRED_CMD_IBPB) )
+            wrmsrl(MSR_PRED_CMD, PRED_CMD_IBPB);
+        break;
+
     case MSR_INTEL_MISC_FEATURES_ENABLES:
     {
         uint64_t rsvd = ~0ull;
diff --git a/xen/include/asm-x86/msr-index.h b/xen/include/asm-x86/msr-index.h
index a0aacfa..23ad743 100644
--- a/xen/include/asm-x86/msr-index.h
+++ b/xen/include/asm-x86/msr-index.h
@@ -39,6 +39,8 @@
 #define MSR_PRED_CMD			0x00000049
 #define PRED_CMD_IBPB			(_AC(1, ULL) << 0)
 
+#define MSR_ARCH_CAPABILITIES		0x0000010a
+
 /* Intel MSRs. Some also available on other CPUs */
 #define MSR_IA32_PERFCTR0		0x000000c1
 #define MSR_IA32_A_PERFCTR0		0x000004c1
diff --git a/xen/include/asm-x86/msr.h b/xen/include/asm-x86/msr.h
index 2fbed02..928f1cc 100644
--- a/xen/include/asm-x86/msr.h
+++ b/xen/include/asm-x86/msr.h
@@ -223,6 +223,16 @@ struct msr_domain_policy
 /* MSR policy object for per-vCPU MSRs */
 struct msr_vcpu_policy
 {
+    /* 0x00000048 - MSR_SPEC_CTRL */
+    struct {
+        /*
+         * Only the bottom two bits are defined, so no need to waste space
+         * with uint64_t at the moment, but use uint32_t for the convenience
+         * of the assembly code.
+         */
+        uint32_t raw;
+    } spec_ctrl;
+
     /* 0x00000140  MSR_INTEL_MISC_FEATURES_ENABLES */
     struct {
         bool available; /* This MSR is non-architectural */
-- 
2.1.4


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v10 03/11] x86/migrate: Move MSR_SPEC_CTRL on migrate
  2018-01-24 13:12 [PATCH v10 00/11] x86: Mitigations for SP2/CVE-2017-5715/Branch Target Injection Andrew Cooper
  2018-01-24 13:12 ` [PATCH v10 01/11] x86/cpuid: Handling of IBRS/IBPB, STIBP and IBRS for guests Andrew Cooper
  2018-01-24 13:12 ` [PATCH v10 02/11] x86/msr: Emulation of MSR_{SPEC_CTRL, PRED_CMD} " Andrew Cooper
@ 2018-01-24 13:12 ` Andrew Cooper
  2018-01-24 13:12 ` [PATCH v10 04/11] x86/hvm: Permit guests direct access to MSR_{SPEC_CTRL, PRED_CMD} Andrew Cooper
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 57+ messages in thread
From: Andrew Cooper @ 2018-01-24 13:12 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/domctl.c  | 2 ++
 xen/arch/x86/hvm/hvm.c | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
index 2585d4e..1a15a34 100644
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -1292,6 +1292,7 @@ long arch_do_domctl(
         struct xen_domctl_vcpu_msr msr = {};
         struct vcpu *v;
         static const uint32_t msrs_to_send[] = {
+            MSR_SPEC_CTRL,
             MSR_INTEL_MISC_FEATURES_ENABLES,
         };
         uint32_t nr_msrs = ARRAY_SIZE(msrs_to_send);
@@ -1415,6 +1416,7 @@ long arch_do_domctl(
 
                 switch ( msr.index )
                 {
+                case MSR_SPEC_CTRL:
                 case MSR_INTEL_MISC_FEATURES_ENABLES:
                     if ( guest_wrmsr(v, msr.index, msr.value) != X86EMUL_OKAY )
                         break;
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index cee2385..d0a7246 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -1327,6 +1327,7 @@ static int hvm_load_cpu_xsave_states(struct domain *d, hvm_domain_context_t *h)
 
 #define HVM_CPU_MSR_SIZE(cnt) offsetof(struct hvm_msr, msr[cnt])
 static const uint32_t msrs_to_send[] = {
+    MSR_SPEC_CTRL,
     MSR_INTEL_MISC_FEATURES_ENABLES,
 };
 static unsigned int __read_mostly msr_count_max = ARRAY_SIZE(msrs_to_send);
@@ -1462,6 +1463,7 @@ static int hvm_load_cpu_msrs(struct domain *d, hvm_domain_context_t *h)
         {
             int rc;
 
+        case MSR_SPEC_CTRL:
         case MSR_INTEL_MISC_FEATURES_ENABLES:
             rc = guest_wrmsr(v, ctxt->msr[i].index, ctxt->msr[i].val);
 
-- 
2.1.4


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v10 04/11] x86/hvm: Permit guests direct access to MSR_{SPEC_CTRL, PRED_CMD}
  2018-01-24 13:12 [PATCH v10 00/11] x86: Mitigations for SP2/CVE-2017-5715/Branch Target Injection Andrew Cooper
                   ` (2 preceding siblings ...)
  2018-01-24 13:12 ` [PATCH v10 03/11] x86/migrate: Move MSR_SPEC_CTRL on migrate Andrew Cooper
@ 2018-01-24 13:12 ` Andrew Cooper
  2018-01-24 13:12 ` [PATCH v10 05/11] x86/entry: Organise the use of MSR_SPEC_CTRL at each entry/exit point Andrew Cooper
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 57+ messages in thread
From: Andrew Cooper @ 2018-01-24 13:12 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper

For performance reasons, HVM guests should have direct access to these MSRs
when possible.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
---
v7:
 * Drop excess brackets
v9:
 * Re-implement in light of Intel's new spec.  Drop R-by's.
 * Spelling fixes
v10:
 * More spelling fixes
---
 xen/arch/x86/domctl.c      | 19 +++++++++++++++++++
 xen/arch/x86/hvm/svm/svm.c |  5 +++++
 xen/arch/x86/hvm/vmx/vmx.c | 17 +++++++++++++++++
 3 files changed, 41 insertions(+)

diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
index 1a15a34..2c4447b 100644
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -53,6 +53,7 @@ static int update_domain_cpuid_info(struct domain *d,
     struct cpuid_policy *p = d->arch.cpuid;
     const struct cpuid_leaf leaf = { ctl->eax, ctl->ebx, ctl->ecx, ctl->edx };
     int old_vendor = p->x86_vendor;
+    unsigned int old_7d0 = p->feat.raw[0].d, old_e8b = p->extd.raw[8].b;
     bool call_policy_changed = false; /* Avoid for_each_vcpu() unnecessarily */
 
     /*
@@ -218,6 +219,14 @@ static int update_domain_cpuid_info(struct domain *d,
 
             d->arch.pv_domain.cpuidmasks->_7ab0 = mask;
         }
+
+        /*
+         * If the IBRS/IBPB policy has changed, we need to recalculate the MSR
+         * interception bitmaps.
+         */
+        call_policy_changed = (is_hvm_domain(d) &&
+                               ((old_7d0 ^ p->feat.raw[0].d) &
+                                cpufeat_mask(X86_FEATURE_IBRSB)));
         break;
 
     case 0xa:
@@ -292,6 +301,16 @@ static int update_domain_cpuid_info(struct domain *d,
             d->arch.pv_domain.cpuidmasks->e1cd = mask;
         }
         break;
+
+    case 0x80000008:
+        /*
+         * If the IBPB policy has changed, we need to recalculate the MSR
+         * interception bitmaps.
+         */
+        call_policy_changed = (is_hvm_domain(d) &&
+                               ((old_e8b ^ p->extd.raw[8].b) &
+                                cpufeat_mask(X86_FEATURE_IBPB)));
+        break;
     }
 
     if ( call_policy_changed )
diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
index 6509b90..231074e 100644
--- a/xen/arch/x86/hvm/svm/svm.c
+++ b/xen/arch/x86/hvm/svm/svm.c
@@ -617,6 +617,7 @@ static void svm_cpuid_policy_changed(struct vcpu *v)
 {
     struct arch_svm_struct *arch_svm = &v->arch.hvm_svm;
     struct vmcb_struct *vmcb = arch_svm->vmcb;
+    const struct cpuid_policy *cp = v->domain->arch.cpuid;
     u32 bitmap = vmcb_get_exception_intercepts(vmcb);
 
     if ( opt_hvm_fep ||
@@ -626,6 +627,10 @@ static void svm_cpuid_policy_changed(struct vcpu *v)
         bitmap &= ~(1U << TRAP_invalid_op);
 
     vmcb_set_exception_intercepts(vmcb, bitmap);
+
+    /* Give access to MSR_PRED_CMD if the guest has been told about it. */
+    svm_intercept_msr(v, MSR_PRED_CMD,
+                      cp->extd.ibpb ? MSR_INTERCEPT_NONE : MSR_INTERCEPT_RW);
 }
 
 static void svm_sync_vmcb(struct vcpu *v)
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index e036303..1546c2a 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -656,6 +656,8 @@ void vmx_update_exception_bitmap(struct vcpu *v)
 
 static void vmx_cpuid_policy_changed(struct vcpu *v)
 {
+    const struct cpuid_policy *cp = v->domain->arch.cpuid;
+
     if ( opt_hvm_fep ||
          (v->domain->arch.cpuid->x86_vendor != boot_cpu_data.x86_vendor) )
         v->arch.hvm_vmx.exception_bitmap |= (1U << TRAP_invalid_op);
@@ -665,6 +667,21 @@ static void vmx_cpuid_policy_changed(struct vcpu *v)
     vmx_vmcs_enter(v);
     vmx_update_exception_bitmap(v);
     vmx_vmcs_exit(v);
+
+    /*
+     * We can safely pass MSR_SPEC_CTRL through to the guest, even if STIBP
+     * isn't enumerated in hardware, as SPEC_CTRL_STIBP is ignored.
+     */
+    if ( cp->feat.ibrsb )
+        vmx_clear_msr_intercept(v, MSR_SPEC_CTRL, VMX_MSR_RW);
+    else
+        vmx_set_msr_intercept(v, MSR_SPEC_CTRL, VMX_MSR_RW);
+
+    /* MSR_PRED_CMD is safe to pass through if the guest knows about it. */
+    if ( cp->feat.ibrsb || cp->extd.ibpb )
+        vmx_clear_msr_intercept(v, MSR_PRED_CMD,  VMX_MSR_RW);
+    else
+        vmx_set_msr_intercept(v, MSR_PRED_CMD,  VMX_MSR_RW);
 }
 
 int vmx_guest_x86_mode(struct vcpu *v)
-- 
2.1.4


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v10 05/11] x86/entry: Organise the use of MSR_SPEC_CTRL at each entry/exit point
  2018-01-24 13:12 [PATCH v10 00/11] x86: Mitigations for SP2/CVE-2017-5715/Branch Target Injection Andrew Cooper
                   ` (3 preceding siblings ...)
  2018-01-24 13:12 ` [PATCH v10 04/11] x86/hvm: Permit guests direct access to MSR_{SPEC_CTRL, PRED_CMD} Andrew Cooper
@ 2018-01-24 13:12 ` Andrew Cooper
  2018-01-25 13:08   ` Jan Beulich
  2018-01-25 16:52   ` [PATCH v11 5/11] " Andrew Cooper
  2018-01-24 13:12 ` [PATCH v10 06/11] x86/entry: Organise the clobbering of the RSB/RAS on entry to Xen Andrew Cooper
                   ` (5 subsequent siblings)
  10 siblings, 2 replies; 57+ messages in thread
From: Andrew Cooper @ 2018-01-24 13:12 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich

We need to be able to either set or clear IBRS in Xen context, as well as
restore appropriate guest values in guest context.  See the documentation in
asm-x86/spec_ctrl_asm.h for details.

With the contemporary microcode, writes to %cr3 are slower when SPEC_CTRL.IBRS
is set.  Therefore, the positioning of SPEC_CTRL_{ENTRY/EXIT}* is important.

Ideally, the IBRS_SET/IBRS_CLEAR hunks might be positioned either side of the
%cr3 change, but that is rather more complicated to arrange, and could still
result in a guest controlled value in SPEC_CTRL during the %cr3 change,
negating the saving if the guest chose to have IBRS set.

Therefore, we optimise for the pre-Skylake case (being far more common in the
field than Skylake and later, at the moment), where we have a Xen-preferred
value of IBRS clear when switching %cr3.

There is a semi-unrelated bugfix, where various asm_defns.h macros have a
hidden dependency on PAGE_SIZE, which results in an assembler error if used in
a .macro definition.
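
As an illustration only, here is a C rendering of the shadowing protocol
documented in the new spec_ctrl_asm.h below.  The function and helper names
are stand-ins for the assembly macros; only the ordering of the steps matters.

    /* Sketch of the MSR_SPEC_CTRL shadowing protocol.  Not the real code. */
    #include <stdbool.h>

    struct cpu_info_sketch {
        unsigned int shadow_spec_ctrl;
        bool use_shadow_spec_ctrl;
    };

    static void wrmsr_spec_ctrl(unsigned int val)  /* stand-in helper */
    {
        asm volatile ("wrmsr" :: "c" (0x48 /* MSR_SPEC_CTRL */),
                                 "a" (val), "d" (0) : "memory");
    }

    /* Exit to guest: set up the shadow, then load the guest's value. */
    static void exit_to_guest(struct cpu_info_sketch *info,
                              unsigned int guest_val)
    {
        info->shadow_spec_ctrl = guest_val;
        info->use_shadow_spec_ctrl = true;
        wrmsr_spec_ctrl(guest_val);
    }

    /* Entry from guest: stop shadowing, then load Xen's value. */
    static void entry_from_guest(struct cpu_info_sketch *info,
                                 unsigned int xen_val)
    {
        info->use_shadow_spec_ctrl = false;
        wrmsr_spec_ctrl(xen_val);
    }

    /* Exit back to Xen (e.g. from NMI/#MC): reload the shadow if active. */
    static void exit_to_xen(const struct cpu_info_sketch *info)
    {
        if ( info->use_shadow_spec_ctrl )
            wrmsr_spec_ctrl(info->shadow_spec_ctrl);
    }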

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>

v7:
 * Spelling fixes
 * Reposition the semicolon fix.
v9:
 * Rebase over XPTI.  Moderate changes in the exit-to-PV paths.
 * Swap cmpw for testb when looking for cpl0
v10:
 * Fix the restore of SPEC_CTRL on the exit-to-HVM paths.
 * Adjust comments and commit message
 * Pull the GET_STACK_END() out of DO_SPEC_CTRL_ENTRY and use %r14.
---
 xen/arch/x86/hvm/svm/entry.S        |  11 +-
 xen/arch/x86/hvm/vmx/entry.S        |  19 +++
 xen/arch/x86/setup.c                |   1 +
 xen/arch/x86/smpboot.c              |   2 +
 xen/arch/x86/x86_64/asm-offsets.c   |   6 +
 xen/arch/x86/x86_64/compat/entry.S  |  14 +++
 xen/arch/x86/x86_64/entry.S         |  48 +++++++-
 xen/include/asm-x86/asm_defns.h     |   3 +
 xen/include/asm-x86/current.h       |   6 +
 xen/include/asm-x86/nops.h          |   6 +
 xen/include/asm-x86/spec_ctrl.h     |   9 ++
 xen/include/asm-x86/spec_ctrl_asm.h | 227 ++++++++++++++++++++++++++++++++++++
 12 files changed, 346 insertions(+), 6 deletions(-)
 create mode 100644 xen/include/asm-x86/spec_ctrl_asm.h

diff --git a/xen/arch/x86/hvm/svm/entry.S b/xen/arch/x86/hvm/svm/entry.S
index df86da0..bf092fe 100644
--- a/xen/arch/x86/hvm/svm/entry.S
+++ b/xen/arch/x86/hvm/svm/entry.S
@@ -79,6 +79,12 @@ UNLIKELY_END(svm_trace)
         or   $X86_EFLAGS_MBS,%rax
         mov  %rax,VMCB_rflags(%rcx)
 
+        mov VCPU_arch_msr(%rbx), %rax
+        mov VCPUMSR_spec_ctrl_raw(%rax), %eax
+
+        /* WARNING! `ret`, `call *`, `jmp *` not safe beyond this point. */
+        SPEC_CTRL_EXIT_TO_GUEST /* Req: a=spec_ctrl %rsp=regs/cpuinfo, Clob: cd */
+
         pop  %r15
         pop  %r14
         pop  %r13
@@ -101,8 +107,11 @@ UNLIKELY_END(svm_trace)
         SAVE_ALL
 
         GET_CURRENT(bx)
-        mov  VCPU_svm_vmcb(%rbx),%rcx
 
+        SPEC_CTRL_ENTRY_FROM_VMEXIT /* Req: b=curr %rsp=regs/cpuinfo, Clob: acd */
+        /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */
+
+        mov  VCPU_svm_vmcb(%rbx),%rcx
         movb $0,VCPU_svm_vmcb_in_sync(%rbx)
         mov  VMCB_rax(%rcx),%rax
         mov  %rax,UREGS_rax(%rsp)
diff --git a/xen/arch/x86/hvm/vmx/entry.S b/xen/arch/x86/hvm/vmx/entry.S
index b2f98be..d9e9835 100644
--- a/xen/arch/x86/hvm/vmx/entry.S
+++ b/xen/arch/x86/hvm/vmx/entry.S
@@ -38,6 +38,9 @@ ENTRY(vmx_asm_vmexit_handler)
         movb $1,VCPU_vmx_launched(%rbx)
         mov  %rax,VCPU_hvm_guest_cr2(%rbx)
 
+        SPEC_CTRL_ENTRY_FROM_VMEXIT /* Req: b=curr %rsp=regs/cpuinfo, Clob: acd */
+        /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */
+
         mov  %rsp,%rdi
         call vmx_vmexit_handler
 
@@ -68,6 +71,13 @@ UNLIKELY_END(realmode)
         call vmx_vmenter_helper
         test %al, %al
         jz .Lvmx_vmentry_restart
+
+        mov VCPU_arch_msr(%rbx), %rax
+        mov VCPUMSR_spec_ctrl_raw(%rax), %eax
+
+        /* WARNING! `ret`, `call *`, `jmp *` not safe beyond this point. */
+        SPEC_CTRL_EXIT_TO_GUEST /* Req: a=spec_ctrl %rsp=regs/cpuinfo, Clob: cd */
+
         mov  VCPU_hvm_guest_cr2(%rbx),%rax
 
         pop  %r15
@@ -99,6 +109,15 @@ UNLIKELY_END(realmode)
 .Lvmx_vmentry_fail:
         sti
         SAVE_ALL
+
+        /*
+         * PV variant needed here as no guest code has executed (so
+         * MSR_SPEC_CTRL can't have changed value), and NMIs/MCEs are liable
+         * to hit (in which case the HVM variant might corrupt things).
+         */
+        SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo Clob: acd */
+        /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */
+
         call vmx_vmentry_failure
         BUG  /* vmx_vmentry_failure() shouldn't return. */
 
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 9407247..ac530ec 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -678,6 +678,7 @@ void __init noreturn __start_xen(unsigned long mbi_p)
     set_processor_id(0);
     set_current(INVALID_VCPU); /* debug sanity. */
     idle_vcpu[0] = current;
+    init_shadow_spec_ctrl_state();
 
     percpu_init_areas();
 
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index fe637da..2ebef03 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -41,6 +41,7 @@
 #include <asm/guest.h>
 #include <asm/msr.h>
 #include <asm/mtrr.h>
+#include <asm/spec_ctrl.h>
 #include <asm/time.h>
 #include <asm/tboot.h>
 #include <mach_apic.h>
@@ -309,6 +310,7 @@ void start_secondary(void *unused)
     set_current(idle_vcpu[cpu]);
     this_cpu(curr_vcpu) = idle_vcpu[cpu];
     rdmsrl(MSR_EFER, this_cpu(efer));
+    init_shadow_spec_ctrl_state();
 
     /*
      * Just as during early bootstrap, it is convenient here to disable
diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
index b1a4310..17f1d77 100644
--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -88,6 +88,7 @@ void __dummy__(void)
     OFFSET(VCPU_kernel_ss, struct vcpu, arch.pv_vcpu.kernel_ss);
     OFFSET(VCPU_iopl, struct vcpu, arch.pv_vcpu.iopl);
     OFFSET(VCPU_guest_context_flags, struct vcpu, arch.vgc_flags);
+    OFFSET(VCPU_arch_msr, struct vcpu, arch.msr);
     OFFSET(VCPU_nmi_pending, struct vcpu, nmi_pending);
     OFFSET(VCPU_mce_pending, struct vcpu, mce_pending);
     OFFSET(VCPU_nmi_old_mask, struct vcpu, nmi_state.old_mask);
@@ -139,6 +140,8 @@ void __dummy__(void)
     OFFSET(CPUINFO_cr4, struct cpu_info, cr4);
     OFFSET(CPUINFO_xen_cr3, struct cpu_info, xen_cr3);
     OFFSET(CPUINFO_pv_cr3, struct cpu_info, pv_cr3);
+    OFFSET(CPUINFO_shadow_spec_ctrl, struct cpu_info, shadow_spec_ctrl);
+    OFFSET(CPUINFO_use_shadow_spec_ctrl, struct cpu_info, use_shadow_spec_ctrl);
     DEFINE(CPUINFO_sizeof, sizeof(struct cpu_info));
     BLANK();
 
@@ -154,6 +157,9 @@ void __dummy__(void)
     OFFSET(TRAPBOUNCE_eip, struct trap_bounce, eip);
     BLANK();
 
+    OFFSET(VCPUMSR_spec_ctrl_raw, struct msr_vcpu_policy, spec_ctrl.raw);
+    BLANK();
+
 #ifdef CONFIG_PERF_COUNTERS
     DEFINE(ASM_PERFC_exceptions, PERFC_exceptions);
     BLANK();
diff --git a/xen/arch/x86/x86_64/compat/entry.S b/xen/arch/x86/x86_64/compat/entry.S
index e668f00..4190c73 100644
--- a/xen/arch/x86/x86_64/compat/entry.S
+++ b/xen/arch/x86/x86_64/compat/entry.S
@@ -18,6 +18,10 @@ ENTRY(entry_int82)
         pushq $0
         movl  $HYPERCALL_VECTOR, 4(%rsp)
         SAVE_ALL compat=1 /* DPL1 gate, restricted to 32bit PV guests only. */
+
+        SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo, Clob: acd */
+        /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */
+
         CR4_PV32_RESTORE
 
         GET_CURRENT(bx)
@@ -142,6 +146,13 @@ ENTRY(compat_restore_all_guest)
         .popsection
         or    $X86_EFLAGS_IF,%r11
         mov   %r11d,UREGS_eflags(%rsp)
+
+        mov VCPU_arch_msr(%rbx), %rax
+        mov VCPUMSR_spec_ctrl_raw(%rax), %eax
+
+        /* WARNING! `ret`, `call *`, `jmp *` not safe beyond this point. */
+        SPEC_CTRL_EXIT_TO_GUEST /* Req: a=spec_ctrl %rsp=regs/cpuinfo, Clob: cd */
+
         RESTORE_ALL adj=8 compat=1
 .Lft0:  iretq
         _ASM_PRE_EXTABLE(.Lft0, handle_exception)
@@ -200,6 +211,9 @@ ENTRY(cstar_enter)
         movl  $TRAP_syscall, 4(%rsp)
         SAVE_ALL
 
+        SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo, Clob: acd */
+        /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */
+
         GET_STACK_END(bx)
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
         neg   %rcx
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 710c061..73bd7ca 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -38,6 +38,10 @@ ENTRY(switch_to_kernel)
 restore_all_guest:
         ASSERT_INTERRUPTS_DISABLED
 
+        /* Stash guest SPEC_CTRL value while we can read struct vcpu. */
+        mov VCPU_arch_msr(%rbx), %rdx
+        mov VCPUMSR_spec_ctrl_raw(%rdx), %r15d
+
         /* Copy guest mappings and switch to per-CPU root page table. */
         mov   %cr3, %r9
         GET_STACK_END(dx)
@@ -65,6 +69,12 @@ restore_all_guest:
         write_cr3 rax, rdi, rsi
 .Lrag_keep_cr3:
 
+        /* Restore stashed SPEC_CTRL value. */
+        mov   %r15d, %eax
+
+        /* WARNING! `ret`, `call *`, `jmp *` not safe beyond this point. */
+        SPEC_CTRL_EXIT_TO_GUEST /* Req: a=spec_ctrl %rsp=regs/cpuinfo, Clob: cd */
+
         RESTORE_ALL
         testw $TRAP_syscall,4(%rsp)
         jz    iret_exit_to_guest
@@ -103,9 +113,9 @@ restore_all_xen:
          * Check whether we need to switch to the per-CPU page tables, in
          * case we return to late PV exit code (from an NMI or #MC).
          */
-        GET_STACK_END(ax)
-        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rax), %rdx
-        mov   STACK_CPUINFO_FIELD(pv_cr3)(%rax), %rax
+        GET_STACK_END(bx)
+        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rdx
+        mov   STACK_CPUINFO_FIELD(pv_cr3)(%rbx), %rax
         test  %rdx, %rdx
         /*
          * Ideally the condition would be "nsz", but such doesn't exist,
@@ -115,6 +125,9 @@ UNLIKELY_START(g, exit_cr3)
         write_cr3 rax, rdi, rsi
 UNLIKELY_END(exit_cr3)
 
+        /* WARNING! `ret`, `call *`, `jmp *` not safe beyond this point. */
+        SPEC_CTRL_EXIT_TO_XEN /* Req: %rbx=end, Clob: acd */
+
         RESTORE_ALL adj=8
         iretq
 
@@ -145,6 +158,9 @@ ENTRY(lstar_enter)
         movl  $TRAP_syscall, 4(%rsp)
         SAVE_ALL
 
+        SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo, Clob: acd */
+        /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */
+
         GET_STACK_END(bx)
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
         neg   %rcx
@@ -248,6 +264,9 @@ GLOBAL(sysenter_eflags_saved)
         movl  $TRAP_syscall, 4(%rsp)
         SAVE_ALL
 
+        SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo, Clob: acd */
+        /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */
+
         GET_STACK_END(bx)
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
         neg   %rcx
@@ -294,6 +313,9 @@ ENTRY(int80_direct_trap)
         movl  $0x80, 4(%rsp)
         SAVE_ALL
 
+        SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo, Clob: acd */
+        /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */
+
         GET_STACK_END(bx)
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
         neg   %rcx
@@ -469,6 +491,10 @@ ENTRY(common_interrupt)
         SAVE_ALL CLAC
 
         GET_STACK_END(14)
+
+        SPEC_CTRL_ENTRY_FROM_INTR /* Req: %rsp=regs, %r14=end, Clob: acd */
+        /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */
+
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rcx
         mov   %rcx, %r15
         neg   %rcx
@@ -507,6 +533,10 @@ GLOBAL(handle_exception)
         SAVE_ALL CLAC
 
         GET_STACK_END(14)
+
+        SPEC_CTRL_ENTRY_FROM_INTR /* Req: %rsp=regs, %r14=end, Clob: acd */
+        /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */
+
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rcx
         mov   %rcx, %r15
         neg   %rcx
@@ -700,8 +730,12 @@ ENTRY(double_fault)
         /* Set AC to reduce chance of further SMAP faults */
         SAVE_ALL STAC
 
-        GET_STACK_END(bx)
-        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rbx
+        GET_STACK_END(14)
+
+        SPEC_CTRL_ENTRY_FROM_INTR /* Req: %rsp=regs, %r14=end, Clob: acd */
+        /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */
+
+        mov   STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rbx
         test  %rbx, %rbx
         jz    .Ldblf_cr3_okay
         jns   .Ldblf_cr3_load
@@ -730,6 +764,10 @@ handle_ist_exception:
         SAVE_ALL CLAC
 
         GET_STACK_END(14)
+
+        SPEC_CTRL_ENTRY_FROM_INTR /* Req: %rsp=regs, %r14=end, Clob: acd */
+        /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */
+
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rcx
         mov   %rcx, %r15
         neg   %rcx
diff --git a/xen/include/asm-x86/asm_defns.h b/xen/include/asm-x86/asm_defns.h
index d2d91ca..b74ac81 100644
--- a/xen/include/asm-x86/asm_defns.h
+++ b/xen/include/asm-x86/asm_defns.h
@@ -7,6 +7,7 @@
 #include <asm/asm-offsets.h>
 #endif
 #include <asm/bug.h>
+#include <asm/page.h>
 #include <asm/processor.h>
 #include <asm/percpu.h>
 #include <xen/stringify.h>
@@ -386,4 +387,6 @@ static always_inline void stac(void)
 4:  .p2align 2                            ; \
     .popsection
 
+#include <asm/spec_ctrl_asm.h>
+
 #endif /* __X86_ASM_DEFNS_H__ */
diff --git a/xen/include/asm-x86/current.h b/xen/include/asm-x86/current.h
index b929c48..1009d05 100644
--- a/xen/include/asm-x86/current.h
+++ b/xen/include/asm-x86/current.h
@@ -53,6 +53,12 @@ struct cpu_info {
      */
     unsigned long xen_cr3;
     unsigned long pv_cr3;
+
+    /* See asm-x86/spec_ctrl_asm.h for usage. */
+    unsigned int shadow_spec_ctrl;
+    bool         use_shadow_spec_ctrl;
+
+    unsigned long __pad;
     /* get_stack_bottom() must be 16-byte aligned */
 };
 
diff --git a/xen/include/asm-x86/nops.h b/xen/include/asm-x86/nops.h
index 1a46b97..a35ef96 100644
--- a/xen/include/asm-x86/nops.h
+++ b/xen/include/asm-x86/nops.h
@@ -65,6 +65,12 @@
 #define ASM_NOP8 _ASM_MK_NOP(P6_NOP8)
 #define ASM_NOP9 _ASM_MK_NOP(P6_NOP9)
 
+#define ASM_NOP18 ASM_NOP8; ASM_NOP8; ASM_NOP2
+#define ASM_NOP22 ASM_NOP8; ASM_NOP8; ASM_NOP6
+#define ASM_NOP24 ASM_NOP8; ASM_NOP8; ASM_NOP8
+#define ASM_NOP29 ASM_NOP8; ASM_NOP8; ASM_NOP8; ASM_NOP5
+#define ASM_NOP32 ASM_NOP8; ASM_NOP8; ASM_NOP8; ASM_NOP8
+
 #define ASM_NOP_MAX 9
 
 #endif /* __X86_ASM_NOPS_H__ */
diff --git a/xen/include/asm-x86/spec_ctrl.h b/xen/include/asm-x86/spec_ctrl.h
index e088a55..b451250 100644
--- a/xen/include/asm-x86/spec_ctrl.h
+++ b/xen/include/asm-x86/spec_ctrl.h
@@ -20,8 +20,17 @@
 #ifndef __X86_SPEC_CTRL_H__
 #define __X86_SPEC_CTRL_H__
 
+#include <asm/current.h>
+
 void init_speculation_mitigations(void);
 
+static inline void init_shadow_spec_ctrl_state(void)
+{
+    struct cpu_info *info = get_cpu_info();
+
+    info->shadow_spec_ctrl = info->use_shadow_spec_ctrl = 0;
+}
+
 #endif /* !__X86_SPEC_CTRL_H__ */
 
 /*
diff --git a/xen/include/asm-x86/spec_ctrl_asm.h b/xen/include/asm-x86/spec_ctrl_asm.h
new file mode 100644
index 0000000..e0ddbd9
--- /dev/null
+++ b/xen/include/asm-x86/spec_ctrl_asm.h
@@ -0,0 +1,227 @@
+/******************************************************************************
+ * include/asm-x86/spec_ctrl_asm.h
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; If not, see <http://www.gnu.org/licenses/>.
+ *
+ * Copyright (c) 2017-2018 Citrix Systems Ltd.
+ */
+
+#ifndef __X86_SPEC_CTRL_ASM_H__
+#define __X86_SPEC_CTRL_ASM_H__
+
+#ifdef __ASSEMBLY__
+#include <asm/msr-index.h>
+
+/*
+ * Saving and restoring MSR_SPEC_CTRL state is a little tricky.
+ *
+ * We want the guest's choice of SPEC_CTRL while in guest context, and Xen's
+ * choice (set or clear, depending on the hardware) while running in Xen
+ * context.  Therefore, a simplistic algorithm is:
+ *
+ *  - Set/clear IBRS on entry to Xen
+ *  - Set the guests' choice on exit to guest
+ *  - Leave SPEC_CTRL unchanged on exit to xen
+ *
+ * There are two complicating factors:
+ *  1) HVM guests can have direct access to the MSR, so it can change
+ *     behind Xen's back.
+ *  2) An NMI or MCE can interrupt at any point, including early in the entry
+ *     path, or late in the exit path after restoring the guest value.  This
+ *     will corrupt the guest value.
+ *
+ * Factor 1 is dealt with by relying on NMIs/MCEs being blocked immediately
+ * after VMEXIT.  The VMEXIT-specific code reads MSR_SPEC_CTRL and updates
+ * current before loading Xen's MSR_SPEC_CTRL setting.
+ *
+ * Factor 2 is harder.  We maintain a shadow_spec_ctrl value, and
+ * use_shadow_spec_ctrl boolean per cpu.  The synchronous use is:
+ *
+ *  1) Store guest value in shadow_spec_ctrl
+ *  2) Set use_shadow_spec_ctrl boolean
+ *  3) Load guest value into MSR_SPEC_CTRL
+ *  4) Exit to guest
+ *  5) Entry from guest
+ *  6) Clear use_shadow_spec_ctrl boolean
+ *  7) Load Xen's value into MSR_SPEC_CTRL
+ *
+ * The asynchronous use for interrupts/exceptions is:
+ *  -  Set/clear IBRS on entry to Xen
+ *  -  On exit to Xen, check use_shadow_spec_ctrl
+ *  -  If set, load shadow_spec_ctrl
+ *
+ * Therefore, an interrupt/exception which hits the synchronous path between
+ * steps 2 and 6 will restore the shadow value rather than leaving Xen's value
+ * loaded and corrupting the value used in guest context.
+ *
+ * The following ASM fragments implement this algorithm.  See their local
+ * comments for further details.
+ *  - SPEC_CTRL_ENTRY_FROM_VMEXIT
+ *  - SPEC_CTRL_ENTRY_FROM_PV
+ *  - SPEC_CTRL_ENTRY_FROM_INTR
+ *  - SPEC_CTRL_EXIT_TO_XEN
+ *  - SPEC_CTRL_EXIT_TO_GUEST
+ */
+
+.macro DO_SPEC_CTRL_ENTRY_FROM_VMEXIT ibrs_val:req
+/*
+ * Requires %rbx=current, %rsp=regs/cpuinfo
+ * Clobbers %rax, %rcx, %rdx
+ *
+ * The common case is that a guest has direct access to MSR_SPEC_CTRL, at
+ * which point we need to save the guest value before setting IBRS for Xen.
+ * Unilaterally saving the guest value is shorter and faster than checking.
+ */
+    mov $MSR_SPEC_CTRL, %ecx
+    rdmsr
+
+    /* Stash the value from hardware. */
+    mov VCPU_arch_msr(%rbx), %rdx
+    mov %eax, VCPUMSR_spec_ctrl_raw(%rdx)
+    xor %edx, %edx
+
+    /* Clear SPEC_CTRL shadowing *before* loading Xen's value. */
+    movb %dl, CPUINFO_use_shadow_spec_ctrl(%rsp)
+
+    /* Load Xen's intended value. */
+    mov $\ibrs_val, %eax
+    wrmsr
+.endm
+
+.macro DO_SPEC_CTRL_ENTRY maybexen:req ibrs_val:req
+/*
+ * Requires %rsp=regs (also cpuinfo if !maybexen)
+ * Requires %r14=stack_end (if maybexen)
+ * Clobbers %rax, %rcx, %rdx
+ *
+ * PV guests can't update MSR_SPEC_CTRL behind Xen's back, so no need to read
+ * it back.  Entries from guest context need to clear SPEC_CTRL shadowing,
+ * while entries from Xen must leave shadowing in its current state.
+ */
+    mov $MSR_SPEC_CTRL, %ecx
+
+    .if \maybexen
+        testb $3, UREGS_cs(%rsp)
+        jz .L\@_entry_from_xen
+    .endif
+
+    /*
+     * Clear SPEC_CTRL shadowing *before* loading Xen's value.  If entering
+     * from a possibly-xen context, %rsp doesn't necessarily alias the cpuinfo
+     * block so calculate the position directly.
+     */
+    .if \maybexen
+        movb $0, STACK_CPUINFO_FIELD(use_shadow_spec_ctrl)(%r14)
+    .else
+        movb $0, CPUINFO_use_shadow_spec_ctrl(%rsp)
+    .endif
+
+.L\@_entry_from_xen:
+    /* Load Xen's intended value. */
+    mov $\ibrs_val, %eax
+    xor %edx, %edx
+    wrmsr
+.endm
+
+.macro DO_SPEC_CTRL_EXIT_TO_XEN
+/*
+ * Requires %rbx=stack_end
+ * Clobbers %rax, %rcx, %rdx
+ *
+ * When returning to Xen context, look to see whether SPEC_CTRL shadowing is
+ * in effect, and reload the shadow value.  This covers race conditions which
+ * exist with an NMI/MCE/etc hitting late in the return-to-guest path.
+ */
+    cmpb $0, STACK_CPUINFO_FIELD(use_shadow_spec_ctrl)(%rbx)
+    je .L\@_skip
+
+    mov STACK_CPUINFO_FIELD(shadow_spec_ctrl)(%rbx), %eax
+    mov $MSR_SPEC_CTRL, %ecx
+    xor %edx, %edx
+    wrmsr
+
+.L\@_skip:
+.endm
+
+.macro DO_SPEC_CTRL_EXIT_TO_GUEST
+/*
+ * Requires %eax=spec_ctrl, %rsp=regs/cpuinfo
+ * Clobbers %rcx, %rdx
+ *
+ * When returning to guest context, set up SPEC_CTRL shadowing and load the
+ * guest value.
+ */
+    /* Set up shadow value *before* enabling shadowing. */
+    mov %eax, CPUINFO_shadow_spec_ctrl(%rsp)
+
+    /* Set SPEC_CTRL shadowing *before* loading the guest value. */
+    movb $1, CPUINFO_use_shadow_spec_ctrl(%rsp)
+
+    mov $MSR_SPEC_CTRL, %ecx
+    xor %edx, %edx
+    wrmsr
+.endm
+
+/* Use after a VMEXIT from an HVM guest. */
+#define SPEC_CTRL_ENTRY_FROM_VMEXIT                                     \
+    ALTERNATIVE_2 __stringify(ASM_NOP32),                               \
+        __stringify(DO_SPEC_CTRL_ENTRY_FROM_VMEXIT                      \
+                    ibrs_val=SPEC_CTRL_IBRS),                           \
+        X86_FEATURE_XEN_IBRS_SET,                                       \
+        __stringify(DO_SPEC_CTRL_ENTRY_FROM_VMEXIT                      \
+                    ibrs_val=0),                                        \
+        X86_FEATURE_XEN_IBRS_CLEAR
+
+/* Use after an entry from PV context (syscall/sysenter/int80/int82/etc). */
+#define SPEC_CTRL_ENTRY_FROM_PV                                         \
+    ALTERNATIVE_2 __stringify(ASM_NOP22),                               \
+        __stringify(DO_SPEC_CTRL_ENTRY maybexen=0                       \
+                    ibrs_val=SPEC_CTRL_IBRS),                           \
+        X86_FEATURE_XEN_IBRS_SET,                                       \
+        __stringify(DO_SPEC_CTRL_ENTRY maybexen=0 ibrs_val=0),          \
+        X86_FEATURE_XEN_IBRS_CLEAR
+
+/* Use in interrupt/exception context.  May interrupt Xen or PV context. */
+#define SPEC_CTRL_ENTRY_FROM_INTR                                       \
+    ALTERNATIVE_2 __stringify(ASM_NOP29),                               \
+        __stringify(DO_SPEC_CTRL_ENTRY maybexen=1                       \
+                    ibrs_val=SPEC_CTRL_IBRS),                           \
+        X86_FEATURE_XEN_IBRS_SET,                                       \
+        __stringify(DO_SPEC_CTRL_ENTRY maybexen=1 ibrs_val=0),          \
+        X86_FEATURE_XEN_IBRS_CLEAR
+
+/* Use when exiting to Xen context. */
+#define SPEC_CTRL_EXIT_TO_XEN                                           \
+    ALTERNATIVE_2 __stringify(ASM_NOP18),                               \
+        DO_SPEC_CTRL_EXIT_TO_XEN, X86_FEATURE_XEN_IBRS_SET,             \
+        DO_SPEC_CTRL_EXIT_TO_XEN, X86_FEATURE_XEN_IBRS_CLEAR
+
+/* Use when exiting to guest context. */
+#define SPEC_CTRL_EXIT_TO_GUEST                                         \
+    ALTERNATIVE_2 __stringify(ASM_NOP24),                               \
+        DO_SPEC_CTRL_EXIT_TO_GUEST, X86_FEATURE_XEN_IBRS_SET,           \
+        DO_SPEC_CTRL_EXIT_TO_GUEST, X86_FEATURE_XEN_IBRS_CLEAR
+
+#endif /* __ASSEMBLY__ */
+#endif /* !__X86_SPEC_CTRL_ASM_H__ */
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
2.1.4


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v10 06/11] x86/entry: Organise the clobbering of the RSB/RAS on entry to Xen
  2018-01-24 13:12 [PATCH v10 00/11] x86: Mitigations for SP2/CVE-2017-5715/Branch Target Injection Andrew Cooper
                   ` (4 preceding siblings ...)
  2018-01-24 13:12 ` [PATCH v10 05/11] x86/entry: Organise the use of MSR_SPEC_CTRL at each entry/exit point Andrew Cooper
@ 2018-01-24 13:12 ` Andrew Cooper
  2018-01-25 13:19   ` Jan Beulich
  2018-01-25 16:54   ` [PATCH v11 6/11] " Andrew Cooper
  2018-01-24 13:12 ` [PATCH v10 07/11] x86/entry: Avoid using alternatives in NMI/#MC paths Andrew Cooper
                   ` (4 subsequent siblings)
  10 siblings, 2 replies; 57+ messages in thread
From: Andrew Cooper @ 2018-01-24 13:12 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich

ret instructions are speculated directly to values recorded in the Return
Stack Buffer/Return Address Stack, as there is no uncertainty in well-formed
code.  Guests can take advantage of this in two ways:

  1) If they can find a path in Xen which executes more ret instructions than
     call instructions.  (At least one in the waitqueue infrastructure,
     probably others.)

  2) Use the fact that the RSB/RAS in hardware is actually a circular stack
     without a concept of empty.  (When it logically empties, stale values
     will start being used.)

To mitigate, overwrite the RSB on entry to Xen with gadgets which will capture
and contain rogue speculation.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>

v7:
 * Rewritten almost from scratch.  See code comments for details.
v8:
 * Use jmp rather than call to contain speculation.  It doesn't affect the
   correctness of containment, but removes 6 bytes.
v10:
 * Spelling/comment improvements.
 * Split to fit around IST safety.  Other half of the patch moved into
   "x86/boot: Calculate the most appropriate BTI mitigation to use"
 * Avoid using numeric labels in DO_OVERWRITE_RSB
---
 xen/include/asm-x86/cpufeatures.h   |  2 ++
 xen/include/asm-x86/nops.h          |  1 +
 xen/include/asm-x86/spec_ctrl_asm.h | 43 +++++++++++++++++++++++++++++++++++++
 3 files changed, 46 insertions(+)

diff --git a/xen/include/asm-x86/cpufeatures.h b/xen/include/asm-x86/cpufeatures.h
index dd2388f..0ee4a1f 100644
--- a/xen/include/asm-x86/cpufeatures.h
+++ b/xen/include/asm-x86/cpufeatures.h
@@ -28,3 +28,5 @@ XEN_CPUFEATURE(IND_THUNK_JMP,   (FSCAPINTS+0)*32+14) /* Use IND_THUNK_JMP */
 XEN_CPUFEATURE(XEN_IBPB,        (FSCAPINTS+0)*32+15) /* IBRSB || IBPB */
 XEN_CPUFEATURE(XEN_IBRS_SET,    (FSCAPINTS+0)*32+16) /* IBRSB && IRBS set in Xen */
 XEN_CPUFEATURE(XEN_IBRS_CLEAR,  (FSCAPINTS+0)*32+17) /* IBRSB && IBRS clear in Xen */
+XEN_CPUFEATURE(RSB_NATIVE,      (FSCAPINTS+0)*32+18) /* RSB overwrite needed for native */
+XEN_CPUFEATURE(RSB_VMEXIT,      (FSCAPINTS+0)*32+20) /* RSB overwrite needed for vmexit */
diff --git a/xen/include/asm-x86/nops.h b/xen/include/asm-x86/nops.h
index a35ef96..9806010 100644
--- a/xen/include/asm-x86/nops.h
+++ b/xen/include/asm-x86/nops.h
@@ -70,6 +70,7 @@
 #define ASM_NOP24 ASM_NOP8; ASM_NOP8; ASM_NOP8
 #define ASM_NOP29 ASM_NOP8; ASM_NOP8; ASM_NOP8; ASM_NOP5
 #define ASM_NOP32 ASM_NOP8; ASM_NOP8; ASM_NOP8; ASM_NOP8
+#define ASM_NOP34 ASM_NOP8; ASM_NOP8; ASM_NOP8; ASM_NOP8; ASM_NOP2
 
 #define ASM_NOP_MAX 9
 
diff --git a/xen/include/asm-x86/spec_ctrl_asm.h b/xen/include/asm-x86/spec_ctrl_asm.h
index e0ddbd9..51d87a9 100644
--- a/xen/include/asm-x86/spec_ctrl_asm.h
+++ b/xen/include/asm-x86/spec_ctrl_asm.h
@@ -74,6 +74,43 @@
  *  - SPEC_CTRL_EXIT_TO_GUEST
  */
 
+.macro DO_OVERWRITE_RSB
+/*
+ * Requires nothing
+ * Clobbers %rax, %rcx
+ *
+ * Requires 256 bytes of stack space, but %rsp has no net change. Based on
+ * Google's performance numbers, the loop is unrolled to 16 iterations and two
+ * calls per iteration.
+ *
+ * The call filling the RSB needs a nonzero displacement.  A nop would do, but
+ * we use "1: pause, jmp 1b" to safely contains any ret-based speculation,
+ * even if the loop is speculatively executed prematurely.
+ *
+ * %rsp is preserved by using an extra GPR because a) we've got plenty spare,
+ * b) the two movs are shorter to encode than `add $32*8, %rsp`, and c) can be
+ * optimised with mov-elimination in modern cores.
+ */
+    mov $16, %ecx                   /* 16 iterations, two calls per loop */
+    mov %rsp, %rax                  /* Store the current %rsp */
+
+.L\@_fill_rsb_loop:
+
+    .irp n, 1, 2                    /* Unrolled twice. */
+    call .L\@_insert_rsb_entry_\n   /* Create an RSB entry. */
+
+.L\@_capture_speculation_\n:
+    pause
+    jmp .L\@_capture_speculation_\n /* Capture rogue speculation. */
+
+.L\@_insert_rsb_entry_\n:
+    .endr
+
+    sub $1, %ecx
+    jnz .L\@_fill_rsb_loop
+    mov %rax, %rsp                  /* Restore old %rsp */
+.endm
+
 .macro DO_SPEC_CTRL_ENTRY_FROM_VMEXIT ibrs_val:req
 /*
  * Requires %rbx=current, %rsp=regs/cpuinfo
@@ -175,6 +212,8 @@
 
 /* Use after a VMEXIT from an HVM guest. */
 #define SPEC_CTRL_ENTRY_FROM_VMEXIT                                     \
+    ALTERNATIVE __stringify(ASM_NOP34),                                 \
+        DO_OVERWRITE_RSB, X86_FEATURE_RSB_VMEXIT;                       \
     ALTERNATIVE_2 __stringify(ASM_NOP32),                               \
         __stringify(DO_SPEC_CTRL_ENTRY_FROM_VMEXIT                      \
                     ibrs_val=SPEC_CTRL_IBRS),                           \
@@ -185,6 +224,8 @@
 
 /* Use after an entry from PV context (syscall/sysenter/int80/int82/etc). */
 #define SPEC_CTRL_ENTRY_FROM_PV                                         \
+    ALTERNATIVE __stringify(ASM_NOP34),                                 \
+        DO_OVERWRITE_RSB, X86_FEATURE_RSB_NATIVE;                       \
     ALTERNATIVE_2 __stringify(ASM_NOP22),                               \
         __stringify(DO_SPEC_CTRL_ENTRY maybexen=0                       \
                     ibrs_val=SPEC_CTRL_IBRS),                           \
@@ -194,6 +235,8 @@
 
 /* Use in interrupt/exception context.  May interrupt Xen or PV context. */
 #define SPEC_CTRL_ENTRY_FROM_INTR                                       \
+    ALTERNATIVE __stringify(ASM_NOP34),                                 \
+        DO_OVERWRITE_RSB, X86_FEATURE_RSB_NATIVE;                       \
     ALTERNATIVE_2 __stringify(ASM_NOP29),                               \
         __stringify(DO_SPEC_CTRL_ENTRY maybexen=1                       \
                     ibrs_val=SPEC_CTRL_IBRS),                           \
-- 
2.1.4


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v10 07/11] x86/entry: Avoid using alternatives in NMI/#MC paths
  2018-01-24 13:12 [PATCH v10 00/11] x86: Mitigations for SP2/CVE-2017-5715/Branch Target Injection Andrew Cooper
                   ` (5 preceding siblings ...)
  2018-01-24 13:12 ` [PATCH v10 06/11] x86/entry: Organise the clobbering of the RSB/RAS on entry to Xen Andrew Cooper
@ 2018-01-24 13:12 ` Andrew Cooper
  2018-01-25 13:43   ` Jan Beulich
  2018-01-25 17:21   ` [PATCH v11 7/11] " Andrew Cooper
  2018-01-24 13:12 ` [PATCH v10 08/11] x86/boot: Calculate the most appropriate BTI mitigation to use Andrew Cooper
                   ` (3 subsequent siblings)
  10 siblings, 2 replies; 57+ messages in thread
From: Andrew Cooper @ 2018-01-24 13:12 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich

This patch is deliberately arranged to be easy to revert if/when alternatives
patching becomes NMI/#MC safe.

For safety, there must be a dispatch serialising instruction in (what is
logically) DO_SPEC_CTRL_ENTRY so that, in the case that Xen needs IBRS set in
its own context, an attacker can't speculate around the WRMSR and reach an
indirect branch within the speculation window.

Using conditionals opens this attack vector up, so the else clause gets an
LFENCE to force the pipeline to catch up before continuing.  This also covers
the safety of the RSB conditional, as execution is guaranteed to hit either
the WRMSR or the LFENCE.

One downside of not using alternatives is that there is unconditionally an
LFENCE in the IST path in cases where we are not using the features from
IBRS-capable microcode.
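
For reference, the shape of that IST-entry logic (omitting the RSB overwrite
and the shadow handling) is roughly as follows.  This is only an illustrative
C sketch of the assembly below, reusing the BTI_IST_* flags, struct cpu_info
and wrmsrl() from this series; the function name is invented:

    static void sketch_spec_ctrl_entry_ist(struct cpu_info *info)
    {
        if ( info->bti_ist_info & BTI_IST_WRMSR )
            /* WRMSR is dispatch serialising - nothing speculates past it. */
            wrmsrl(MSR_SPEC_CTRL, info->bti_ist_info & BTI_IST_IBRS);
        else
            /* The else clause needs its own dispatch serialising insn. */
            asm volatile ( "lfence" ::: "memory" );
    }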

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>

v10:
 * New

I'm not entirely convinced this is better than the risk of an NMI/#MC hitting
the critical patching region.
---
 xen/arch/x86/x86_64/asm-offsets.c   |  1 +
 xen/arch/x86/x86_64/entry.S         |  6 ++--
 xen/include/asm-x86/current.h       |  1 +
 xen/include/asm-x86/spec_ctrl_asm.h | 68 +++++++++++++++++++++++++++++++++++++
 4 files changed, 73 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
index 17f1d77..51be528 100644
--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -142,6 +142,7 @@ void __dummy__(void)
     OFFSET(CPUINFO_pv_cr3, struct cpu_info, pv_cr3);
     OFFSET(CPUINFO_shadow_spec_ctrl, struct cpu_info, shadow_spec_ctrl);
     OFFSET(CPUINFO_use_shadow_spec_ctrl, struct cpu_info, use_shadow_spec_ctrl);
+    OFFSET(CPUINFO_bti_ist_info, struct cpu_info, bti_ist_info);
     DEFINE(CPUINFO_sizeof, sizeof(struct cpu_info));
     BLANK();
 
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 73bd7ca..a5a6702 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -126,7 +126,7 @@ UNLIKELY_START(g, exit_cr3)
 UNLIKELY_END(exit_cr3)
 
         /* WARNING! `ret`, `call *`, `jmp *` not safe beyond this point. */
-        SPEC_CTRL_EXIT_TO_XEN /* Req: %rbx=end, Clob: acd */
+        SPEC_CTRL_EXIT_TO_XEN_IST /* Req: %rbx=end, Clob: acd */
 
         RESTORE_ALL adj=8
         iretq
@@ -732,7 +732,7 @@ ENTRY(double_fault)
 
         GET_STACK_END(14)
 
-        SPEC_CTRL_ENTRY_FROM_INTR /* Req: %rsp=regs, %r14=end, Clob: acd */
+        SPEC_CTRL_ENTRY_FROM_INTR_IST /* Req: %rsp=regs, %r14=end, Clob: acd */
         /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */
 
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rbx
@@ -765,7 +765,7 @@ handle_ist_exception:
 
         GET_STACK_END(14)
 
-        SPEC_CTRL_ENTRY_FROM_INTR /* Req: %rsp=regs, %r14=end, Clob: acd */
+        SPEC_CTRL_ENTRY_FROM_INTR_IST /* Req: %rsp=regs, %r14=end, Clob: acd */
         /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */
 
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rcx
diff --git a/xen/include/asm-x86/current.h b/xen/include/asm-x86/current.h
index 1009d05..4678a0f 100644
--- a/xen/include/asm-x86/current.h
+++ b/xen/include/asm-x86/current.h
@@ -57,6 +57,7 @@ struct cpu_info {
     /* See asm-x86/spec_ctrl_asm.h for usage. */
     unsigned int shadow_spec_ctrl;
     bool         use_shadow_spec_ctrl;
+    uint8_t      bti_ist_info;
 
     unsigned long __pad;
     /* get_stack_bottom() must be 16-byte aligned */
diff --git a/xen/include/asm-x86/spec_ctrl_asm.h b/xen/include/asm-x86/spec_ctrl_asm.h
index 51d87a9..f48f1ab 100644
--- a/xen/include/asm-x86/spec_ctrl_asm.h
+++ b/xen/include/asm-x86/spec_ctrl_asm.h
@@ -20,6 +20,11 @@
 #ifndef __X86_SPEC_CTRL_ASM_H__
 #define __X86_SPEC_CTRL_ASM_H__
 
+/* Encoding of the bottom bits in cpuinfo.bti_ist_info */
+#define BTI_IST_IBRS  (1 << 0)
+#define BTI_IST_WRMSR (1 << 1)
+#define BTI_IST_RSB   (1 << 2)
+
 #ifdef __ASSEMBLY__
 #include <asm/msr-index.h>
 
@@ -256,6 +261,69 @@
         DO_SPEC_CTRL_EXIT_TO_GUEST, X86_FEATURE_XEN_IBRS_SET,           \
         DO_SPEC_CTRL_EXIT_TO_GUEST, X86_FEATURE_XEN_IBRS_CLEAR
 
+/* TODO: Drop these when the alternatives infrastructure is NMI/#MC safe. */
+.macro SPEC_CTRL_ENTRY_FROM_INTR_IST
+/*
+ * Requires %rsp=regs, %r14=stack_end
+ * Clobbers %rax, %rcx, %rdx
+ *
+ * This is a logical merge of DO_OVERWRITE_RSB and DO_SPEC_CTRL_ENTRY
+ * maybexen=1, but with conditionals rather than alternatives.
+ */
+    movzbl STACK_CPUINFO_FIELD(bti_ist_info)(%r14), %eax
+
+    testb $BTI_IST_RSB, %al
+    jz .L\@_skip_rsb
+
+    DO_OVERWRITE_RSB
+
+.L\@_skip_rsb:
+
+    testb $BTI_IST_WRMSR, %al
+    jz .L\@_skip_wrmsr
+
+    testb $3, UREGS_cs(%rsp)
+    jz .L\@_entry_from_xen
+
+    movb $0, STACK_CPUINFO_FIELD(use_shadow_spec_ctrl)(%r14)
+
+.L\@_entry_from_xen:
+    /*
+     * Load Xen's intended value.  SPEC_CTRL_IBRS vs 0 is encoded in the
+     * bottom bit of bti_ist_info, via a deliberate alias with BTI_IST_IBRS.
+     */
+    mov $MSR_SPEC_CTRL, %ecx
+    and $BTI_IST_IBRS, %eax
+    xor %edx, %edx
+    wrmsr
+
+    /* Opencoded UNLIKELY_START() with no condition. */
+.Ldispatch.\@_serialise:
+    .subsection 1
+    /*
+     * In the case that we might need to write to MSR_SPEC_CTRL for safety, we
+     * need to ensure that an attacker can't poison the `jz .L\@_skip_wrmsr`
+     * to speculate around the WRMSR.  As a result, we need a dispatch
+     * serialising instruction in the else clause.
+     */
+.L\@_skip_wrmsr:
+    lfence
+    UNLIKELY_END(\@_serialise)
+.endm
+
+.macro SPEC_CTRL_EXIT_TO_XEN_IST
+/*
+ * Requires %rbx=stack_end
+ * Clobbers %rax, %rcx, %rdx
+ */
+    testb $BTI_IST_WRMSR, STACK_CPUINFO_FIELD(bti_ist_info)(%rbx)
+    jz .L\@_skip
+
+    DO_SPEC_CTRL_EXIT_TO_XEN
+
+.L\@_skip:
+.endm
+
 #endif /* __ASSEMBLY__ */
 #endif /* !__X86_SPEC_CTRL_ASM_H__ */
 
-- 
2.1.4



* [PATCH v10 08/11] x86/boot: Calculate the most appropriate BTI mitigation to use
  2018-01-24 13:12 [PATCH v10 00/11] x86: Mitigations for SP2/CVE-2017-5715/Branch Target Injection Andrew Cooper
                   ` (6 preceding siblings ...)
  2018-01-24 13:12 ` [PATCH v10 07/11] x86/entry: Avoid using alternatives in NMI/#MC paths Andrew Cooper
@ 2018-01-24 13:12 ` Andrew Cooper
  2018-01-25 13:52   ` Jan Beulich
  2018-02-01  8:41   ` Jan Beulich
  2018-01-24 13:12 ` [PATCH v10 09/11] x86/ctxt: Issue a speculation barrier between vcpu contexts Andrew Cooper
                   ` (2 subsequent siblings)
  10 siblings, 2 replies; 57+ messages in thread
From: Andrew Cooper @ 2018-01-24 13:12 UTC (permalink / raw)
  To: Xen-devel; +Cc: Asit K Mallick, Andrew Cooper, Jun Nakajima, Jan Beulich

See the logic and comments in init_speculation_mitigations() for further
details.

There are two controls for RSB overwriting, because in principle there are
cases where it might be safe to forego rsb_native (Off the top of my head,
SMEP active, no 32bit PV guests at all, no use of vmevent/paging subsystems
for HVM guests, but I make no guarantees that this list of restrictions is
exhaustive).
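
As a purely illustrative example of the resulting syntax (see the
xen-command-line.markdown hunk below), a host wanting retpoline but no RSB
overwriting on native entry might boot with something like:

    bti=thunk=retpoline,rsb_native=0

while `bti=ibrs=1` would force Xen to use IBRS for itself where the hardware
supports it.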

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Asit K Mallick <asit.k.mallick@intel.com>
CC: Jun Nakajima <jun.nakajima@intel.com>

TODO: Confirm Broadwell microcode details, and Atom safety WRT retpoline.
These details should not block this series.

v7:
 * static, and tweak comment
v10:
 * Simplify init_speculation_mitigations()
 * Merge half of the old "x86/entry: Clobber the Return Stack Buffer/Return
   Address Stack on entry to Xen" due to the IST safety work
---
 docs/misc/xen-command-line.markdown |  10 ++-
 xen/arch/x86/spec_ctrl.c            | 142 +++++++++++++++++++++++++++++++++++-
 xen/include/asm-x86/spec_ctrl.h     |   3 +
 3 files changed, 150 insertions(+), 5 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index f73990f..11399ce 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -246,7 +246,7 @@ enough. Setting this to a high value may cause boot failure, particularly if
 the NMI watchdog is also enabled.
 
 ### bti (x86)
-> `= List of [ thunk=retpoline|lfence|jmp ]`
+> `= List of [ thunk=retpoline|lfence|jmp, ibrs=<bool>, rsb_{vmexit,native}=<bool> ]`
 
 Branch Target Injection controls.  By default, Xen will pick the most
 appropriate BTI mitigations based on compiled in support, loaded microcode,
@@ -261,6 +261,14 @@ locations.  The default thunk is `retpoline` (generally preferred for Intel
 hardware), with the alternatives being `jmp` (a `jmp *%reg` gadget, minimal
 overhead), and `lfence` (an `lfence; jmp *%reg` gadget, preferred for AMD).
 
+On hardware supporting IBRS, the `ibrs=` option can be used to force or
+prevent Xen using the feature itself.  If Xen is not using IBRS itself,
+functionality is still set up so IBRS can be virtualised for guests.
+
+The `rsb_vmexit=` and `rsb_native=` options can be used to fine tune when the
+RSB gets overwritten.  There are individual controls for an entry from HVM
+context, and an entry from a native (PV or Xen) context.
+
 ### xenheap\_megabytes (arm32)
 > `= <size>`
 
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index 89e7287..de5ba1a 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -20,8 +20,10 @@
 #include <xen/init.h>
 #include <xen/lib.h>
 
+#include <asm/microcode.h>
 #include <asm/processor.h>
 #include <asm/spec_ctrl.h>
+#include <asm/spec_ctrl_asm.h>
 
 static enum ind_thunk {
     THUNK_DEFAULT, /* Decide which thunk to use at boot time. */
@@ -31,11 +33,15 @@ static enum ind_thunk {
     THUNK_LFENCE,
     THUNK_JMP,
 } opt_thunk __initdata = THUNK_DEFAULT;
+static int8_t __initdata opt_ibrs = -1;
+static bool __initdata opt_rsb_native = true;
+static bool __initdata opt_rsb_vmexit = true;
+uint8_t __read_mostly default_bti_ist_info;
 
 static int __init parse_bti(const char *s)
 {
     const char *ss;
-    int rc = 0;
+    int val, rc = 0;
 
     do {
         ss = strchr(s, ',');
@@ -55,6 +61,12 @@ static int __init parse_bti(const char *s)
             else
                 rc = -EINVAL;
         }
+        else if ( (val = parse_boolean("ibrs", s, ss)) >= 0 )
+            opt_ibrs = val;
+        else if ( (val = parse_boolean("rsb_native", s, ss)) >= 0 )
+            opt_rsb_native = val;
+        else if ( (val = parse_boolean("rsb_vmexit", s, ss)) >= 0 )
+            opt_rsb_vmexit = val;
         else
             rc = -EINVAL;
 
@@ -91,24 +103,84 @@ static void __init print_details(enum ind_thunk thunk)
         printk(XENLOG_DEBUG "  Compiled-in support: INDIRECT_THUNK\n");
 
     printk(XENLOG_INFO
-           "BTI mitigations: Thunk %s\n",
+           "BTI mitigations: Thunk %s, Others:%s%s%s\n",
            thunk == THUNK_NONE      ? "N/A" :
            thunk == THUNK_RETPOLINE ? "RETPOLINE" :
            thunk == THUNK_LFENCE    ? "LFENCE" :
-           thunk == THUNK_JMP       ? "JMP" : "?");
+           thunk == THUNK_JMP       ? "JMP" : "?",
+           boot_cpu_has(X86_FEATURE_XEN_IBRS_SET)    ? " IBRS+" :
+           boot_cpu_has(X86_FEATURE_XEN_IBRS_CLEAR)  ? " IBRS-"      : "",
+           boot_cpu_has(X86_FEATURE_RSB_NATIVE)      ? " RSB_NATIVE" : "",
+           boot_cpu_has(X86_FEATURE_RSB_VMEXIT)      ? " RSB_VMEXIT" : "");
+}
+
+/* Calculate whether Retpoline is known-safe on this CPU. */
+static bool __init retpoline_safe(void)
+{
+    unsigned int ucode_rev = this_cpu(ucode_cpu_info).cpu_sig.rev;
+
+    if ( boot_cpu_data.x86_vendor == X86_VENDOR_AMD )
+        return true;
+
+    if ( boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
+         boot_cpu_data.x86 != 6 )
+        return false;
+
+    switch ( boot_cpu_data.x86_model )
+    {
+    case 0x17: /* Penryn */
+    case 0x1d: /* Dunnington */
+    case 0x1e: /* Nehalem */
+    case 0x1f: /* Auburndale / Havendale */
+    case 0x1a: /* Nehalem EP */
+    case 0x2e: /* Nehalem EX */
+    case 0x25: /* Westmere */
+    case 0x2c: /* Westmere EP */
+    case 0x2f: /* Westmere EX */
+    case 0x2a: /* SandyBridge */
+    case 0x2d: /* SandyBridge EP/EX */
+    case 0x3a: /* IvyBridge */
+    case 0x3e: /* IvyBridge EP/EX */
+    case 0x3c: /* Haswell */
+    case 0x3f: /* Haswell EX/EP */
+    case 0x45: /* Haswell D */
+    case 0x46: /* Haswell H */
+        return true;
+
+        /*
+         * Broadwell processors are retpoline-safe after specific microcode
+         * versions.
+         */
+    case 0x3d: /* Broadwell */
+        return ucode_rev >= 0x28;
+    case 0x47: /* Broadwell H */
+        return ucode_rev >= 0x1b;
+    case 0x4f: /* Broadwell EP/EX */
+        return ucode_rev >= 0xb000025;
+    case 0x56: /* Broadwell D */
+        return false; /* TBD. */
+
+        /*
+         * Skylake and later processors are not retpoline-safe.
+         */
+    default:
+        return false;
+    }
 }
 
 void __init init_speculation_mitigations(void)
 {
     enum ind_thunk thunk = THUNK_DEFAULT;
+    bool ibrs = false;
 
     /*
      * Has the user specified any custom BTI mitigations?  If so, follow their
      * instructions exactly and disable all heuristics.
      */
-    if ( opt_thunk != THUNK_DEFAULT )
+    if ( opt_thunk != THUNK_DEFAULT || opt_ibrs != -1 )
     {
         thunk = opt_thunk;
+        ibrs  = !!opt_ibrs;
     }
     else
     {
@@ -124,7 +196,18 @@ void __init init_speculation_mitigations(void)
              */
             if ( cpu_has_lfence_dispatch )
                 thunk = THUNK_LFENCE;
+            /*
+             * On Intel hardware, we'd like to use retpoline in preference to
+             * IBRS, but only if it is safe on this hardware.
+             */
+            else if ( retpoline_safe() )
+                thunk = THUNK_RETPOLINE;
+            else
+                ibrs = true;
         }
+        /* Without compiler thunk support, use IBRS if available. */
+        else if ( boot_cpu_has(X86_FEATURE_IBRSB) )
+            ibrs = true;
     }
 
     /*
@@ -135,6 +218,13 @@ void __init init_speculation_mitigations(void)
         thunk = THUNK_NONE;
 
     /*
+     * If IBRS is in use and thunks are compiled in, there is no point
+     * suffering extra overhead.  Switch to the least-overhead thunk.
+     */
+    if ( ibrs && thunk == THUNK_DEFAULT )
+        thunk = THUNK_JMP;
+
+    /*
      * If there are still no thunk preferences, the compiled default is
      * actually retpoline, and it is better than nothing.
      */
@@ -147,6 +237,50 @@ void __init init_speculation_mitigations(void)
     else if ( thunk == THUNK_JMP )
         setup_force_cpu_cap(X86_FEATURE_IND_THUNK_JMP);
 
+    if ( boot_cpu_has(X86_FEATURE_IBRSB) )
+    {
+        /*
+         * Even if we've chosen to not have IBRS set in Xen context, we still
+         * need the IBRS entry/exit logic to virtualise IBRS support for
+         * guests.
+         */
+        if ( ibrs )
+            setup_force_cpu_cap(X86_FEATURE_XEN_IBRS_SET);
+        else
+            setup_force_cpu_cap(X86_FEATURE_XEN_IBRS_CLEAR);
+
+        default_bti_ist_info |= BTI_IST_WRMSR | ibrs;
+    }
+
+    /*
+     * PV guests can poison the RSB to any virtual address from which
+     * they can execute a call instruction.  This is necessarily outside
+     * of the Xen supervisor mappings.
+     *
+     * With SMEP enabled, the processor won't speculate into user mappings.
+     * Therefore, in this case, we don't need to worry about poisoned entries
+     * from 64bit PV guests.
+     *
+     * 32bit PV guest kernels run in ring 1, so use supervisor mappings.
+     * If a processor speculates to 32bit PV guest kernel mappings, it is
+     * speculating in 64bit supervisor mode, and can leak data.
+     */
+    if ( opt_rsb_native )
+    {
+        setup_force_cpu_cap(X86_FEATURE_RSB_NATIVE);
+        default_bti_ist_info |= BTI_IST_RSB;
+    }
+
+    /*
+     * HVM guests can always poison the RSB to point at Xen supervisor
+     * mappings.
+     */
+    if ( opt_rsb_vmexit )
+        setup_force_cpu_cap(X86_FEATURE_RSB_VMEXIT);
+
+    /* (Re)init BSP state how that default_bti_ist_info has been calculated. */
+    init_shadow_spec_ctrl_state();
+
     print_details(thunk);
 }
 
diff --git a/xen/include/asm-x86/spec_ctrl.h b/xen/include/asm-x86/spec_ctrl.h
index b451250..6120e4f 100644
--- a/xen/include/asm-x86/spec_ctrl.h
+++ b/xen/include/asm-x86/spec_ctrl.h
@@ -24,11 +24,14 @@
 
 void init_speculation_mitigations(void);
 
+extern uint8_t default_bti_ist_info;
+
 static inline void init_shadow_spec_ctrl_state(void)
 {
     struct cpu_info *info = get_cpu_info();
 
     info->shadow_spec_ctrl = info->use_shadow_spec_ctrl = 0;
+    info->bti_ist_info = default_bti_ist_info;
 }
 
 #endif /* !__X86_SPEC_CTRL_H__ */
-- 
2.1.4



* [PATCH v10 09/11] x86/ctxt: Issue a speculation barrier between vcpu contexts
  2018-01-24 13:12 [PATCH v10 00/11] x86: Mitigations for SP2/CVE-2017-5715/Branch Target Injection Andrew Cooper
                   ` (7 preceding siblings ...)
  2018-01-24 13:12 ` [PATCH v10 08/11] x86/boot: Calculate the most appropriate BTI mitigation to use Andrew Cooper
@ 2018-01-24 13:12 ` Andrew Cooper
  2018-01-24 13:34   ` Woodhouse, David
  2018-01-25 15:57   ` Jan Beulich
  2018-01-24 13:12 ` [PATCH v10 10/11] x86/cpuid: Offer Indirect Branch Controls to guests Andrew Cooper
  2018-01-24 13:12 ` [PATCH v10 11/11] x86/idle: Clear SPEC_CTRL while idle Andrew Cooper
  10 siblings, 2 replies; 57+ messages in thread
From: Andrew Cooper @ 2018-01-24 13:12 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, David Woodhouse, Jan Beulich

Issuing an IBPB command flushes the Branch Target Buffer, so that any poison
left by one vcpu won't remain when beginning to execute the next.

The cost of IBPB is substantial, so it is skipped on transitions to idle, as
Xen's idle code is already robust.  All transitions into vcpu context are
fully serialising in practice (and under consideration for being retroactively
declared architecturally serialising), so a cunning attacker cannot use SP1 to
try to skip the flush.
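
For concreteness, the comparison key used in the hunk below is just the domid
and vcpu id packed into a single integer.  A worked example, with invented
values:

    /* next_id = (domid << 16) | vcpu_id
     *
     *   dom 7 vcpu 2  -> 0x00070002
     *   dom 7 vcpu 3  -> 0x00070003  (key differs          => IBPB issued)
     *   dom 7 vcpu 2  -> idle -> dom 7 vcpu 2
     *                                 (idle skipped, key unchanged => no IBPB)
     */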

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: David Woodhouse <dwmw@amazon.co.uk>

v7:
 * Use the opt_ibpb boolean rather than using a cpufeature flag.
v9:
 * Expand the commit message.
 * Optimise the idle case, based on a suggestion from David.
v10:
 * More detailed comments, and an explicit idle check.
---
 docs/misc/xen-command-line.markdown |  5 ++++-
 xen/arch/x86/domain.c               | 29 +++++++++++++++++++++++++++++
 xen/arch/x86/spec_ctrl.c            | 10 +++++++++-
 xen/include/asm-x86/spec_ctrl.h     |  1 +
 4 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index 11399ce..9c10d3a 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -246,7 +246,7 @@ enough. Setting this to a high value may cause boot failure, particularly if
 the NMI watchdog is also enabled.
 
 ### bti (x86)
-> `= List of [ thunk=retpoline|lfence|jmp, ibrs=<bool>, rsb_{vmexit,native}=<bool> ]`
+> `= List of [ thunk=retpoline|lfence|jmp, ibrs=<bool>, ibpb=<bool>, rsb_{vmexit,native}=<bool> ]`
 
 Branch Target Injection controls.  By default, Xen will pick the most
 appropriate BTI mitigations based on compiled in support, loaded microcode,
@@ -265,6 +265,9 @@ On hardware supporting IBRS, the `ibrs=` option can be used to force or
 prevent Xen using the feature itself.  If Xen is not using IBRS itself,
 functionality is still set up so IBRS can be virtualised for guests.
 
+On hardware supporting IBPB, the `ibpb=` option can be used to prevent Xen
+from issuing Branch Prediction Barriers on vcpu context switches.
+
 The `rsb_vmexit=` and `rsb_native=` options can be used to fine tune when the
 RSB gets overwritten.  There are individual controls for an entry from HVM
 context, and an entry from a native (PV or Xen) context.
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 643628c..12f527b 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -65,6 +65,7 @@
 #include <asm/psr.h>
 #include <asm/pv/domain.h>
 #include <asm/pv/mm.h>
+#include <asm/spec_ctrl.h>
 
 DEFINE_PER_CPU(struct vcpu *, curr_vcpu);
 
@@ -1743,6 +1744,34 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
         }
 
         ctxt_switch_levelling(next);
+
+        if ( opt_ibpb && !is_idle_domain(nextd) )
+        {
+            static DEFINE_PER_CPU(unsigned int, last);
+            unsigned int *last_id = &this_cpu(last);
+
+            /*
+             * Squash the domid and vcpu id together for comparason
+             * efficiency.  We could in principle stash and compare the struct
+             * vcpu pointer, but this risks a false alias if a domain has died
+             * and the same 4k page gets reused for a new vcpu.
+             */
+            unsigned int next_id = (((unsigned int)nextd->domain_id << 16) |
+                                    (uint16_t)next->vcpu_id);
+            BUILD_BUG_ON(MAX_VIRT_CPUS > 0xffff);
+
+            /*
+             * When scheduling from a vcpu, to idle, and back to the same vcpu
+             * (which might be common in a lightly loaded system, or when
+             * using vcpu pinning), there is no need to issue IBPB, as we are
+             * returning to the same security context.
+             */
+            if ( *last_id != next_id )
+            {
+                wrmsrl(MSR_PRED_CMD, PRED_CMD_IBPB);
+                *last_id = next_id;
+            }
+        }
     }
 
     context_saved(prev);
diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index de5ba1a..3baad8a 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -36,6 +36,7 @@ static enum ind_thunk {
 static int8_t __initdata opt_ibrs = -1;
 static bool __initdata opt_rsb_native = true;
 static bool __initdata opt_rsb_vmexit = true;
+bool __read_mostly opt_ibpb = true;
 uint8_t __read_mostly default_bti_ist_info;
 
 static int __init parse_bti(const char *s)
@@ -63,6 +64,8 @@ static int __init parse_bti(const char *s)
         }
         else if ( (val = parse_boolean("ibrs", s, ss)) >= 0 )
             opt_ibrs = val;
+        else if ( (val = parse_boolean("ibpb", s, ss)) >= 0 )
+            opt_ibpb = val;
         else if ( (val = parse_boolean("rsb_native", s, ss)) >= 0 )
             opt_rsb_native = val;
         else if ( (val = parse_boolean("rsb_vmexit", s, ss)) >= 0 )
@@ -103,13 +106,14 @@ static void __init print_details(enum ind_thunk thunk)
         printk(XENLOG_DEBUG "  Compiled-in support: INDIRECT_THUNK\n");
 
     printk(XENLOG_INFO
-           "BTI mitigations: Thunk %s, Others:%s%s%s\n",
+           "BTI mitigations: Thunk %s, Others:%s%s%s%s\n",
            thunk == THUNK_NONE      ? "N/A" :
            thunk == THUNK_RETPOLINE ? "RETPOLINE" :
            thunk == THUNK_LFENCE    ? "LFENCE" :
            thunk == THUNK_JMP       ? "JMP" : "?",
            boot_cpu_has(X86_FEATURE_XEN_IBRS_SET)    ? " IBRS+" :
            boot_cpu_has(X86_FEATURE_XEN_IBRS_CLEAR)  ? " IBRS-"      : "",
+           opt_ibpb                                  ? " IBPB"       : "",
            boot_cpu_has(X86_FEATURE_RSB_NATIVE)      ? " RSB_NATIVE" : "",
            boot_cpu_has(X86_FEATURE_RSB_VMEXIT)      ? " RSB_VMEXIT" : "");
 }
@@ -278,6 +282,10 @@ void __init init_speculation_mitigations(void)
     if ( opt_rsb_vmexit )
         setup_force_cpu_cap(X86_FEATURE_RSB_VMEXIT);
 
+    /* Check we have hardware IBPB support before using it... */
+    if ( !boot_cpu_has(X86_FEATURE_IBRSB) && !boot_cpu_has(X86_FEATURE_IBPB) )
+        opt_ibpb = false;
+
     /* (Re)init BSP state how that default_bti_ist_info has been calculated. */
     init_shadow_spec_ctrl_state();
 
diff --git a/xen/include/asm-x86/spec_ctrl.h b/xen/include/asm-x86/spec_ctrl.h
index 6120e4f..e328b0f 100644
--- a/xen/include/asm-x86/spec_ctrl.h
+++ b/xen/include/asm-x86/spec_ctrl.h
@@ -24,6 +24,7 @@
 
 void init_speculation_mitigations(void);
 
+extern bool opt_ibpb;
 extern uint8_t default_bti_ist_info;
 
 static inline void init_shadow_spec_ctrl_state(void)
-- 
2.1.4



* [PATCH v10 10/11] x86/cpuid: Offer Indirect Branch Controls to guests
  2018-01-24 13:12 [PATCH v10 00/11] x86: Mitigations for SP2/CVE-2017-5715/Branch Target Injection Andrew Cooper
                   ` (8 preceding siblings ...)
  2018-01-24 13:12 ` [PATCH v10 09/11] x86/ctxt: Issue a speculation barrier between vcpu contexts Andrew Cooper
@ 2018-01-24 13:12 ` Andrew Cooper
  2018-01-24 13:12 ` [PATCH v10 11/11] x86/idle: Clear SPEC_CTRL while idle Andrew Cooper
  10 siblings, 0 replies; 57+ messages in thread
From: Andrew Cooper @ 2018-01-24 13:12 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper

With all infrastructure in place, it is now safe to let guests see and use
these features.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
---
v9:
 * Split patch in half with the libxc hunk moving earlier, and rebasing over
   the changed nature of STIBP
---
 xen/include/public/arch-x86/cpufeatureset.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/xen/include/public/arch-x86/cpufeatureset.h b/xen/include/public/arch-x86/cpufeatureset.h
index 0f21fed..fa81af1 100644
--- a/xen/include/public/arch-x86/cpufeatureset.h
+++ b/xen/include/public/arch-x86/cpufeatureset.h
@@ -237,13 +237,13 @@ XEN_CPUFEATURE(EFRO,          7*32+10) /*   APERF/MPERF Read Only interface */
 
 /* AMD-defined CPU features, CPUID level 0x80000008.ebx, word 8 */
 XEN_CPUFEATURE(CLZERO,        8*32+ 0) /*A  CLZERO instruction */
-XEN_CPUFEATURE(IBPB,          8*32+12) /*   IBPB support only (no IBRS, used by AMD) */
+XEN_CPUFEATURE(IBPB,          8*32+12) /*A  IBPB support only (no IBRS, used by AMD) */
 
 /* Intel-defined CPU features, CPUID level 0x00000007:0.edx, word 9 */
 XEN_CPUFEATURE(AVX512_4VNNIW, 9*32+ 2) /*A  AVX512 Neural Network Instructions */
 XEN_CPUFEATURE(AVX512_4FMAPS, 9*32+ 3) /*A  AVX512 Multiply Accumulation Single Precision */
-XEN_CPUFEATURE(IBRSB,         9*32+26) /*   IBRS and IBPB support (used by Intel) */
-XEN_CPUFEATURE(STIBP,         9*32+27) /*!  STIBP */
+XEN_CPUFEATURE(IBRSB,         9*32+26) /*A  IBRS and IBPB support (used by Intel) */
+XEN_CPUFEATURE(STIBP,         9*32+27) /*A! STIBP */
 
 #endif /* XEN_CPUFEATURE */
 
-- 
2.1.4



* [PATCH v10 11/11] x86/idle: Clear SPEC_CTRL while idle
  2018-01-24 13:12 [PATCH v10 00/11] x86: Mitigations for SP2/CVE-2017-5715/Branch Target Injection Andrew Cooper
                   ` (9 preceding siblings ...)
  2018-01-24 13:12 ` [PATCH v10 10/11] x86/cpuid: Offer Indirect Branch Controls to guests Andrew Cooper
@ 2018-01-24 13:12 ` Andrew Cooper
  10 siblings, 0 replies; 57+ messages in thread
From: Andrew Cooper @ 2018-01-24 13:12 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper

On contemporary hardware, setting IBRS/STIBP has a performance impact on
adjacent hyperthreads.  It is therefore recommended to clear the setting
before becoming idle, to avoid an idle core preventing adjacent userspace
execution from running at full performance.

Care must be taken to ensure there are no ret or indirect branch instructions
between spec_ctrl_{enter,exit}_idle() invocations, which are forced always
inline.  Care must also be taken to avoid using spec_ctrl_enter_idle() between
flushing caches and becoming idle, in cases where that matters.
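
A minimal usage sketch, mirroring the hunks below (get_cpu_info() and
safe_halt() are existing helpers):

    struct cpu_info *info = get_cpu_info();

    spec_ctrl_enter_idle(info); /* clear MSR_SPEC_CTRL (when Xen uses IBRS) */
    safe_halt();                /* no ret / indirect branch permitted here  */
    spec_ctrl_exit_idle(info);  /* restore Xen's intended MSR_SPEC_CTRL     */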

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/acpi/cpu_idle.c    | 21 +++++++++++++++++++++
 xen/arch/x86/cpu/mwait-idle.c   |  7 +++++++
 xen/arch/x86/domain.c           |  8 ++++++++
 xen/include/asm-x86/spec_ctrl.h | 34 ++++++++++++++++++++++++++++++++++
 4 files changed, 70 insertions(+)

diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c
index cb1c5da..3f72bda 100644
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -55,6 +55,7 @@
 #include <asm/mwait.h>
 #include <xen/notifier.h>
 #include <xen/cpu.h>
+#include <asm/spec_ctrl.h>
 
 /*#define DEBUG_PM_CX*/
 
@@ -414,8 +415,14 @@ void mwait_idle_with_hints(unsigned int eax, unsigned int ecx)
      */
     if ( (expires > NOW() || expires == 0) && !softirq_pending(cpu) )
     {
+        struct cpu_info *info = get_cpu_info();
+
         cpumask_set_cpu(cpu, &cpuidle_mwait_flags);
+
+        spec_ctrl_enter_idle(info);
         __mwait(eax, ecx);
+        spec_ctrl_exit_idle(info);
+
         cpumask_clear_cpu(cpu, &cpuidle_mwait_flags);
     }
 
@@ -430,6 +437,8 @@ static void acpi_processor_ffh_cstate_enter(struct acpi_processor_cx *cx)
 
 static void acpi_idle_do_entry(struct acpi_processor_cx *cx)
 {
+    struct cpu_info *info = get_cpu_info();
+
     switch ( cx->entry_method )
     {
     case ACPI_CSTATE_EM_FFH:
@@ -437,15 +446,19 @@ static void acpi_idle_do_entry(struct acpi_processor_cx *cx)
         acpi_processor_ffh_cstate_enter(cx);
         return;
     case ACPI_CSTATE_EM_SYSIO:
+        spec_ctrl_enter_idle(info);
         /* IO port based C-state */
         inb(cx->address);
         /* Dummy wait op - must do something useless after P_LVL2 read
            because chipsets cannot guarantee that STPCLK# signal
            gets asserted in time to freeze execution properly. */
         inl(pmtmr_ioport);
+        spec_ctrl_exit_idle(info);
         return;
     case ACPI_CSTATE_EM_HALT:
+        spec_ctrl_enter_idle(info);
         safe_halt();
+        spec_ctrl_exit_idle(info);
         local_irq_disable();
         return;
     }
@@ -573,7 +586,13 @@ static void acpi_processor_idle(void)
         if ( pm_idle_save )
             pm_idle_save();
         else
+        {
+            struct cpu_info *info = get_cpu_info();
+
+            spec_ctrl_enter_idle(info);
             safe_halt();
+            spec_ctrl_exit_idle(info);
+        }
         return;
     }
 
@@ -752,6 +771,7 @@ void acpi_dead_idle(void)
          * Otherwise, CPU may still hold dirty data, breaking cache coherency,
          * leading to strange errors.
          */
+        spec_ctrl_enter_idle(get_cpu_info());
         wbinvd();
 
         while ( 1 )
@@ -781,6 +801,7 @@ void acpi_dead_idle(void)
         u32 address = cx->address;
         u32 pmtmr_ioport_local = pmtmr_ioport;
 
+        spec_ctrl_enter_idle(get_cpu_info());
         wbinvd();
 
         while ( 1 )
diff --git a/xen/arch/x86/cpu/mwait-idle.c b/xen/arch/x86/cpu/mwait-idle.c
index 762dff1..e357f29 100644
--- a/xen/arch/x86/cpu/mwait-idle.c
+++ b/xen/arch/x86/cpu/mwait-idle.c
@@ -58,6 +58,7 @@
 #include <asm/hpet.h>
 #include <asm/mwait.h>
 #include <asm/msr.h>
+#include <asm/spec_ctrl.h>
 #include <acpi/cpufreq/cpufreq.h>
 
 #define MWAIT_IDLE_VERSION "0.4.1"
@@ -736,7 +737,13 @@ static void mwait_idle(void)
 		if (pm_idle_save)
 			pm_idle_save();
 		else
+		{
+			struct cpu_info *info = get_cpu_info();
+
+			spec_ctrl_enter_idle(info);
 			safe_halt();
+			spec_ctrl_exit_idle(info);
+		}
 		return;
 	}
 
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 12f527b..be6d15f 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -55,6 +55,7 @@
 #include <asm/hvm/viridian.h>
 #include <asm/debugreg.h>
 #include <asm/msr.h>
+#include <asm/spec_ctrl.h>
 #include <asm/traps.h>
 #include <asm/nmi.h>
 #include <asm/mce.h>
@@ -75,9 +76,15 @@ void (*dead_idle) (void) __read_mostly = default_dead_idle;
 
 static void default_idle(void)
 {
+    struct cpu_info *info = get_cpu_info();
+
     local_irq_disable();
     if ( cpu_is_haltable(smp_processor_id()) )
+    {
+        spec_ctrl_enter_idle(info);
         safe_halt();
+        spec_ctrl_exit_idle(info);
+    }
     else
         local_irq_enable();
 }
@@ -89,6 +96,7 @@ void default_dead_idle(void)
      * held by the CPUs spinning here indefinitely, and get discarded by
      * a subsequent INIT.
      */
+    spec_ctrl_enter_idle(get_cpu_info());
     wbinvd();
     for ( ; ; )
         halt();
diff --git a/xen/include/asm-x86/spec_ctrl.h b/xen/include/asm-x86/spec_ctrl.h
index e328b0f..5ab4ff3 100644
--- a/xen/include/asm-x86/spec_ctrl.h
+++ b/xen/include/asm-x86/spec_ctrl.h
@@ -20,7 +20,9 @@
 #ifndef __X86_SPEC_CTRL_H__
 #define __X86_SPEC_CTRL_H__
 
+#include <asm/alternative.h>
 #include <asm/current.h>
+#include <asm/msr-index.h>
 
 void init_speculation_mitigations(void);
 
@@ -35,6 +37,38 @@ static inline void init_shadow_spec_ctrl_state(void)
     info->bti_ist_info = default_bti_ist_info;
 }
 
+/* WARNING! `ret`, `call *`, `jmp *` not safe after this call. */
+static always_inline void spec_ctrl_enter_idle(struct cpu_info *info)
+{
+    uint32_t val = 0;
+
+    /*
+     * Latch the new shadow value, then enable shadowing, then update the MSR.
+     * There are no SMP issues here; only local processor ordering concerns.
+     */
+    info->shadow_spec_ctrl = val;
+    barrier();
+    info->use_shadow_spec_ctrl = true;
+    barrier();
+    asm volatile ( ALTERNATIVE(ASM_NOP3, "wrmsr", X86_FEATURE_XEN_IBRS_SET)
+                   :: "a" (val), "c" (MSR_SPEC_CTRL), "d" (0) : "memory" );
+}
+
+/* WARNING! `ret`, `call *`, `jmp *` not safe before this call. */
+static always_inline void spec_ctrl_exit_idle(struct cpu_info *info)
+{
+    uint32_t val = SPEC_CTRL_IBRS;
+
+    /*
+     * Disable shadowing before updating the MSR.  There are no SMP issues
+     * here; only local processor ordering concerns.
+     */
+    info->use_shadow_spec_ctrl = false;
+    barrier();
+    asm volatile ( ALTERNATIVE(ASM_NOP3, "wrmsr", X86_FEATURE_XEN_IBRS_SET)
+                   :: "a" (val), "c" (MSR_SPEC_CTRL), "d" (0) : "memory" );
+}
+
 #endif /* !__X86_SPEC_CTRL_H__ */
 
 /*
-- 
2.1.4



* Re: [PATCH v10 09/11] x86/ctxt: Issue a speculation barrier between vcpu contexts
  2018-01-24 13:12 ` [PATCH v10 09/11] x86/ctxt: Issue a speculation barrier between vcpu contexts Andrew Cooper
@ 2018-01-24 13:34   ` Woodhouse, David
  2018-01-24 13:49     ` Andrew Cooper
  2018-01-25 15:57   ` Jan Beulich
  1 sibling, 1 reply; 57+ messages in thread
From: Woodhouse, David @ 2018-01-24 13:34 UTC (permalink / raw)
  To: andrew.cooper3, xen-devel; +Cc: JBeulich



On Wed, 2018-01-24 at 13:12 +0000, Andrew Cooper wrote:
> +             * Squash the domid and vcpu id together for comparason

*comparison

> +             * efficiency.  We could in principle stash and compare the struct
> +             * vcpu pointer, but this risks a false alias if a domain has died
> +             * and the same 4k page gets reused for a new vcpu.
> +             */

Isn't that also true if the domain has died and its domain-id gets re-
used? 

> +            unsigned int next_id = (((unsigned int)nextd->domain_id << 16) |
> +                                    (uint16_t)next->vcpu_id);

I am loath to suggest *more* tweakables, but given the IBPB cost is
there any merit in having a mode which does it only if the *domain* is
different, regardless of vcpu_id?

If a given domain is running on HT siblings, it ought to be doing its
own mitigation — setting STIBP for userspace if it wants, ensuring its
own kernel is safe by having IBRS set or using retpoline, etc.


* Re: [PATCH v10 09/11] x86/ctxt: Issue a speculation barrier between vcpu contexts
  2018-01-24 13:34   ` Woodhouse, David
@ 2018-01-24 13:49     ` Andrew Cooper
  2018-01-24 14:31       ` David Woodhouse
  0 siblings, 1 reply; 57+ messages in thread
From: Andrew Cooper @ 2018-01-24 13:49 UTC (permalink / raw)
  To: Woodhouse, David, xen-devel; +Cc: JBeulich

On 24/01/18 13:34, Woodhouse, David wrote:
> On Wed, 2018-01-24 at 13:12 +0000, Andrew Cooper wrote:
>> +             * Squash the domid and vcpu id together for comparason
> *comparison
>
>> +             * efficiency.  We could in principle stash and compare the struct
>> +             * vcpu pointer, but this risks a false alias if a domain has died
>> +             * and the same 4k page gets reused for a new vcpu.
>> +             */
> Isn't that also true if the domain has died and its domain-id gets re-
> used?

In principle, yes.  However, a toolstack needs to have non-default
behaviour[1] to reuse a domid without wrapping around 32k.

>
>> +            unsigned int next_id = (((unsigned int)nextd->domain_id << 16) |
>> +                                    (uint16_t)next->vcpu_id);
> I am loath to suggest *more* tweakables, but given the IBPB cost is
> there any merit in having a mode which does it only if the *domain* is
> different, regardless of vcpu_id?

This would only be a win if you were regularly cross-scheduling vcpus
from the same domain, in which case you've probably got other issues to
be worried about.
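
(Purely for illustration, such a domain-only mode would amount to using
something like

    next_id = nextd->domain_id;   /* hypothetical: ignore the vcpu_id */

as the comparison key, instead of folding in the vcpu_id.)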

> If a given domain is running on HT siblings, it ought to be doing its
> own mitigation — setting STIBP for userspace if it wants, ensuring its
> own kernel is safe by having IBRS set or using retpoline, etc.

~Andrew

[1] Is this trying to be a subtle hint?


* Re: [PATCH v10 09/11] x86/ctxt: Issue a speculation barrier between vcpu contexts
  2018-01-24 13:49     ` Andrew Cooper
@ 2018-01-24 14:31       ` David Woodhouse
  2018-01-25 14:46         ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 57+ messages in thread
From: David Woodhouse @ 2018-01-24 14:31 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel; +Cc: JBeulich



On Wed, 2018-01-24 at 13:49 +0000, Andrew Cooper wrote:
> On 24/01/18 13:34, Woodhouse, David wrote:
> > I am loath to suggest *more* tweakables, but given the IBPB cost is
> > there any merit in having a mode which does it only if the *domain* is
> > different, regardless of vcpu_id?
>
> This would only be a win if you were regularly cross-scheduling vcpus
> from the same domain, which case you've probably other issues to be
> worried about.

Of course. If the guest *knows* about HT siblings, that kind of implies
you've told it about the topology and thus you're pinning vCPUs. I don't
think there *is* a world in which what I said makes sense.

> > 
> > If a given domain is running on HT siblings, it ought to be doing its
> > own mitigation — setting STIBP for userspace if it wants, ensuring its
> > own kernel is safe by having IBRS set or using retpoline, etc.
> ~Andrew
> 
> [1] Is this trying to be a subtle hint?

Heh, no. When I get to that bit, and the *reasons* we do that, it'll be
far from subtle. But as with so many other things, NOT THIS WEEK :)


* Re: [PATCH v10 02/11] x86/msr: Emulation of MSR_{SPEC_CTRL, PRED_CMD} for guests
  2018-01-24 13:12 ` [PATCH v10 02/11] x86/msr: Emulation of MSR_{SPEC_CTRL, PRED_CMD} " Andrew Cooper
@ 2018-01-25 12:25   ` Jan Beulich
  0 siblings, 0 replies; 57+ messages in thread
From: Jan Beulich @ 2018-01-25 12:25 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel

>>> On 24.01.18 at 14:12, <andrew.cooper3@citrix.com> wrote:
> As per the spec currently available here:
> 
> https://software.intel.com/sites/default/files/managed/c5/63/336996-Speculative-Execution-Side-Channel-Mitigations.pdf
> 
> MSR_ARCH_CAPABILITIES will only come into existence on new hardware, but is
> implemented as a straight #GP for now to avoid being leaky when new hardware
> arrives.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>




* Re: [PATCH v10 05/11] x86/entry: Organise the use of MSR_SPEC_CTRL at each entry/exit point
  2018-01-24 13:12 ` [PATCH v10 05/11] x86/entry: Organise the use of MSR_SPEC_CTRL at each entry/exit point Andrew Cooper
@ 2018-01-25 13:08   ` Jan Beulich
  2018-01-25 14:12     ` Andrew Cooper
  2018-01-25 16:52   ` [PATCH v11 5/11] " Andrew Cooper
  1 sibling, 1 reply; 57+ messages in thread
From: Jan Beulich @ 2018-01-25 13:08 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel

>>> On 24.01.18 at 14:12, <andrew.cooper3@citrix.com> wrote:
> We need to be able to either set or clear IBRS in Xen context, as well as
> restore appropriate guest values in guest context.  See the documentation in
> asm-x86/spec_ctrl_asm.h for details.
> 
> With the contemporary microcode, writes to %cr3 are slower when SPEC_CTRL.IBRS
> is set.  Therefore, the positioning of SPEC_CTRL_{ENTRY/EXIT}* is important.
> 
> Ideally, the IBRS_SET/IBRS_CLEAR hunks might be positioned either side of the
> %cr3 change, but that is rather more complicated to arrange, and could still
> result in a guest controlled value in SPEC_CTRL during the %cr3 change,
> negating the saving if the guest chose to have IBRS set.
> 
> Therefore, we optimise for the pre-Skylake case (being far more common in the
> field than Skylake and later, at the moment), where we have a Xen-preferred
> value of IBRS clear when switching %cr3.
> 
> There is a semi-unrelated bugfix, where various asm_defn.h macros have a
> hidden dependency on PAGE_SIZE, which results in an assembler error if used in
> a .macro definition.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>

with a spelling observation and a question / perhaps implied
suggestion:

> @@ -99,6 +109,15 @@ UNLIKELY_END(realmode)
>  .Lvmx_vmentry_fail:
>          sti
>          SAVE_ALL
> +
> +        /*
> +         * PV variant needed here as no guest code has executed (so
> +         * MSR_SPEC_CTRL can't have changed value), and NMIs/MCEs are liable
> +         * to hit (in which case the HVM varient might corrupt things).

variant

> +.macro DO_SPEC_CTRL_ENTRY maybexen:req ibrs_val:req
> +/*
> + * Requires %rsp=regs (also cpuinfo if !maybexen)
> + * Requires %r14=stack_end (if maybexen)
> + * Clobbers %rax, %rcx, %rdx
> + *
> + * PV guests can't update MSR_SPEC_CTRL behind Xen's back, so no need to read
> + * it back.  Entries from guest context need to clear SPEC_CTRL shadowing,
> + * while entries from Xen must leave shadowing in its current state.
> + */
> +    mov $MSR_SPEC_CTRL, %ecx
> +
> +    .if \maybexen
> +        testb $3, UREGS_cs(%rsp)
> +        jz .L\@_entry_from_xen
> +    .endif
> +
> +    /*
> +     * Clear SPEC_CTRL shadowing *before* loading Xen's value.  If entering
> +     * from a possibly-xen context, %rsp doesn't necessarily alias the cpuinfo
> +     * block so calculate the position directly.
> +     */
> +    .if \maybexen
> +        movb $0, STACK_CPUINFO_FIELD(use_shadow_spec_ctrl)(%r14)
> +    .else
> +        movb $0, CPUINFO_use_shadow_spec_ctrl(%rsp)
> +    .endif
> +
> +.L\@_entry_from_xen:
> +    /* Load Xen's intended value. */
> +    mov $\ibrs_val, %eax
> +    xor %edx, %edx
> +    wrmsr
> +.endm

Did you consider avoiding the conditional branch here altogether,
by doing something like

    .if \maybexen
        testb $3, UREGS_cs(%rsp)
        setz %al
        neg %al
        and %al, STACK_CPUINFO_FIELD(use_shadow_spec_ctrl)(%r14)
    .else

?

It may also be worthwhile again to pull up the zeroing of %edx,
using %dl instead of $0 in the movb (and maybe then also
similarly in DO_SPEC_CTRL_EXIT_TO_XEN, but there I'm less
certain it couldn't have a negative effect).

Jan



* Re: [PATCH v10 06/11] x86/entry: Organise the clobbering of the RSB/RAS on entry to Xen
  2018-01-24 13:12 ` [PATCH v10 06/11] x86/entry: Organise the clobbering of the RSB/RAS on entry to Xen Andrew Cooper
@ 2018-01-25 13:19   ` Jan Beulich
  2018-01-25 14:17     ` Andrew Cooper
  2018-01-25 16:54   ` [PATCH v11 6/11] " Andrew Cooper
  1 sibling, 1 reply; 57+ messages in thread
From: Jan Beulich @ 2018-01-25 13:19 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel

>>> On 24.01.18 at 14:12, <andrew.cooper3@citrix.com> wrote:
> --- a/xen/include/asm-x86/spec_ctrl_asm.h
> +++ b/xen/include/asm-x86/spec_ctrl_asm.h
> @@ -74,6 +74,43 @@
>   *  - SPEC_CTRL_EXIT_TO_GUEST
>   */
>  
> +.macro DO_OVERWRITE_RSB
> +/*
> + * Requires nothing
> + * Clobbers %rax, %rcx
> + *
> + * Requires 256 bytes of stack space, but %rsp has no net change. Based on
> + * Google's performance numbers, the loop is unrolled to 16 iterations and two
> + * calls per iteration.
> + *
> + * The call filling the RSB needs a nonzero displacement.  A nop would do, but
> + * we use "1: pause, jmp 1b" to safely contains any ret-based speculation,
> + * even if the loop is speculatively executed prematurely.
> + *
> + * %rsp is preserved by using an extra GPR because a) we've got plenty spare,
> + * b) the two movs are shorter to encode than `add $32*8, %rsp`, and c) can be
> + * optimised with mov-elimination in modern cores.
> + */
> +    mov $16, %ecx                   /* 16 iterations, two calls per loop */
> +    mov %rsp, %rax                  /* Store the current %rsp */
> +
> +.L\@_fill_rsb_loop:
> +
> +    .irp n, 1, 2                    /* Unrolled twice. */
> +    call .L\@_insert_rsb_entry_\n   /* Create an RSB entry. */
> +
> +.L\@_capture_speculation_\n:
> +    pause
> +    jmp .L\@_capture_speculation_\n /* Capture rogue speculation. */

Have you seen Linux commit 28d437d550e1e39f805d99f9f8ac399c778827b7
("x86/retpoline: Add LFENCE to the retpoline/RSB filling RSB
macros")? I think we want to have the same, to please AMD. I'd
suggest using alternative patching though (except again on the
IST paths), but then again maybe in a follow-up patch.

Jan



* Re: [PATCH v10 07/11] x86/entry: Avoid using alternatives in NMI/#MC paths
  2018-01-24 13:12 ` [PATCH v10 07/11] x86/entry: Avoid using alternatives in NMI/#MC paths Andrew Cooper
@ 2018-01-25 13:43   ` Jan Beulich
  2018-01-25 15:04     ` Andrew Cooper
  2018-01-25 17:21   ` [PATCH v11 7/11] " Andrew Cooper
  1 sibling, 1 reply; 57+ messages in thread
From: Jan Beulich @ 2018-01-25 13:43 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel

>>> On 24.01.18 at 14:12, <andrew.cooper3@citrix.com> wrote:
> --- a/xen/include/asm-x86/spec_ctrl_asm.h
> +++ b/xen/include/asm-x86/spec_ctrl_asm.h
> @@ -20,6 +20,11 @@
>  #ifndef __X86_SPEC_CTRL_ASM_H__
>  #define __X86_SPEC_CTRL_ASM_H__
>  
> +/* Encoding of the bottom bits in cpuinfo.bti_ist_info */
> +#define BTI_IST_IBRS  (1 << 0)

This should be annotated to make clear it can't change its value. Or
maybe have something BUILD_BUG_ON()-like elsewhere.

> @@ -256,6 +261,69 @@
>          DO_SPEC_CTRL_EXIT_TO_GUEST, X86_FEATURE_XEN_IBRS_SET,           \
>          DO_SPEC_CTRL_EXIT_TO_GUEST, X86_FEATURE_XEN_IBRS_CLEAR
>  
> +/* TODO: Drop these when the alternatives infrastructure is NMI/#MC safe. */
> +.macro SPEC_CTRL_ENTRY_FROM_INTR_IST
> +/*
> + * Requires %rsp=regs, %r14=stack_end
> + * Clobbers %rax, %rcx, %rdx
> + *
> + * This is a logical merge of DO_OVERWRITE_RSB and DO_SPEC_CTRL_ENTRY
> + * maybexen=1, but with conditionals rather than alternatives.
> + */
> +    movzbl STACK_CPUINFO_FIELD(bti_ist_info)(%r14), %eax
> +
> +    testb $BTI_IST_RSB, %al
> +    jz .L\@_skip_rsb
> +
> +    DO_OVERWRITE_RSB
> +
> +.L\@_skip_rsb:
> +
> +    testb $BTI_IST_WRMSR, %al
> +    jz .L\@_skip_wrmsr
> +
> +    testb $3, UREGS_cs(%rsp)
> +    jz .L\@_entry_from_xen
> +
> +    movb $0, STACK_CPUINFO_FIELD(use_shadow_spec_ctrl)(%r14)
> +
> +.L\@_entry_from_xen:
> +    /*
> +     * Load Xen's intended value.  SPEC_CTRL_IBRS vs 0 is encoded in the
> +     * bottom bit of bti_ist_info, via a deliberate alias with BTI_IST_IBRS.
> +     */
> +    mov $MSR_SPEC_CTRL, %ecx
> +    and $BTI_IST_IBRS, %eax
> +    xor %edx, %edx
> +    wrmsr
> +
> +    /* Opencoded UNLIKELY_START() with no condition. */
> +.Ldispatch.\@_serialise:
> +    .subsection 1

How about adding .ifnb to UNLIKELY_START(), instead of open coding?
Or wait - why can't you use it as-is above (instead of the
"jz .L\@_skip_wrmsr")?

> +    /*
> +     * In the case that we might need to write to MSR_SPEC_CTRL for safety, we
> +     * need to ensure that an attacker can't poison the `jz .L\@_skip_wrmsr`
> +     * to speculate around the WRMSR.  As a result, we need a dispatch
> +     * serialising instruction in the else clause.
> +     */
> +.L\@_skip_wrmsr:
> +    lfence
> +    UNLIKELY_END(\@_serialise)
> +.endm

Is LFENCE alone sufficient here for AMD in case "x86/amd: Try to
set lfence as being Dispatch Serialising" failed to achieve the intended
effect? In fact an LFENCE -> MFENCE conversion (through
alternatives patching) would even be safe on the IST paths, as the
two encodings differ by a single byte only.

> +.macro SPEC_CTRL_EXIT_TO_XEN_IST
> +/*
> + * Requires %rbx=stack_end
> + * Clobbers %rax, %rcx, %rdx
> + */
> +    testb $BTI_IST_WRMSR, STACK_CPUINFO_FIELD(bti_ist_info)(%rbx)
> +    jz .L\@_skip
> +
> +    DO_SPEC_CTRL_EXIT_TO_XEN
> +
> +.L\@_skip:
> +.endm

Why a new macro, instead of modifying the existing
SPEC_CTRL_EXIT_TO_XEN, the sole use site of which is being
changed? Just for the ease of eventually reverting?

And then, having reached the end of the patch - where is the
new struct cpu_info field written? Or is this again happening only in
later patches, with the one here just setting the stage? If so,
shouldn't you at least zero the field in init_shadow_spec_ctrl_state()?

Jan



* Re: [PATCH v10 08/11] x86/boot: Calculate the most appropriate BTI mitigation to use
  2018-01-24 13:12 ` [PATCH v10 08/11] x86/boot: Calculate the most appropriate BTI mitigation to use Andrew Cooper
@ 2018-01-25 13:52   ` Jan Beulich
  2018-02-01  8:41   ` Jan Beulich
  1 sibling, 0 replies; 57+ messages in thread
From: Jan Beulich @ 2018-01-25 13:52 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Asit K Mallick, Jun Nakajima, Xen-devel

>>> On 24.01.18 at 14:12, <andrew.cooper3@citrix.com> wrote:
> See the logic and comments in init_speculation_mitigations() for further
> details.
> 
> There are two controls for RSB overwriting, because in principle there are
> cases where it might be safe to forego rsb_native (Off the top of my head,
> SMEP active, no 32bit PV guests at all, no use of vmevent/paging subsystems
> for HVM guests, but I make no guarantees that this list of restrictions is
> exhaustive).
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

With the exception of the Intel model specific things
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Nevertheless ...

> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Asit K Mallick <asit.k.mallick@intel.com>
> CC: Jun Nakajima <jun.nakajima@intel.com>
> 
> TODO: Confirm Broadwell microcode details, and Atom safety WRT retpoline.
> These details should not block this series.

... I agree here, and hence the ack implied by the R-b extends to this.

> @@ -147,6 +237,50 @@ void __init init_speculation_mitigations(void)
>      else if ( thunk == THUNK_JMP )
>          setup_force_cpu_cap(X86_FEATURE_IND_THUNK_JMP);
>  
> +    if ( boot_cpu_has(X86_FEATURE_IBRSB) )
> +    {
> +        /*
> +         * Even if we've chosen to not have IBRS set in Xen context, we still
> +         * need the IBRS entry/exit logic to virtualise IBRS support for
> +         * guests.
> +         */
> +        if ( ibrs )
> +            setup_force_cpu_cap(X86_FEATURE_XEN_IBRS_SET);
> +        else
> +            setup_force_cpu_cap(X86_FEATURE_XEN_IBRS_CLEAR);
> +
> +        default_bti_ist_info |= BTI_IST_WRMSR | ibrs;
> +    }
> +
> +    /*
> +     * PV guests can poison the RSB to any virtual address from which
> +     * they can execute a call instruction.  This is necessarily outside
> +     * of the Xen supervisor mappings.
> +     *
> +     * With SMEP enabled, the processor won't speculate into user mappings.
> +     * Therefore, in this case, we don't need to worry about poisoned entries
> +     * from 64bit PV guests.
> +     *
> +     * 32bit PV guest kernels run in ring 1, so use supervisor mappings.
> +     * If a processor speculates to 32bit PV guest kernel mappings, it is
> +     * speculating in 64bit supervisor mode, and can leak data.
> +     */
> +    if ( opt_rsb_native )
> +    {
> +        setup_force_cpu_cap(X86_FEATURE_RSB_NATIVE);
> +        default_bti_ist_info |= BTI_IST_RSB;
> +    }
> +
> +    /*
> +     * HVM guests can always poison the RSB to point at Xen supervisor
> +     * mappings.
> +     */
> +    if ( opt_rsb_vmexit )
> +        setup_force_cpu_cap(X86_FEATURE_RSB_VMEXIT);
> +
> +    /* (Re)init BSP state how that default_bti_ist_info has been calculated. */

... now that ...

Jan



* Re: [PATCH v10 05/11] x86/entry: Organise the use of MSR_SPEC_CTRL at each entry/exit point
  2018-01-25 13:08   ` Jan Beulich
@ 2018-01-25 14:12     ` Andrew Cooper
  2018-01-25 14:36       ` Jan Beulich
  0 siblings, 1 reply; 57+ messages in thread
From: Andrew Cooper @ 2018-01-25 14:12 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Xen-devel

On 25/01/18 13:08, Jan Beulich wrote:
>>>> On 24.01.18 at 14:12, <andrew.cooper3@citrix.com> wrote:
>> We need to be able to either set or clear IBRS in Xen context, as well as
>> restore appropriate guest values in guest context.  See the documentation in
>> asm-x86/spec_ctrl_asm.h for details.
>>
>> With the contemporary microcode, writes to %cr3 are slower when SPEC_CTRL.IBRS
>> is set.  Therefore, the positioning of SPEC_CTRL_{ENTRY/EXIT}* is important.
>>
>> Ideally, the IBRS_SET/IBRS_CLEAR hunks might be positioned either side of the
>> %cr3 change, but that is rather more complicated to arrange, and could still
>> result in a guest controlled value in SPEC_CTRL during the %cr3 change,
>> negating the saving if the guest chose to have IBRS set.
>>
>> Therefore, we optimise for the pre-Skylake case (being far more common in the
>> field than Skylake and later, at the moment), where we have a Xen-preferred
>> value of IBRS clear when switching %cr3.
>>
>> There is a semi-unrelated bugfix, where various asm_defn.h macros have a
>> hidden dependency on PAGE_SIZE, which results in an assembler error if used in
>> a .macro definition.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
>
> with a spelling observation and a question / perhaps implied
> suggestion:
>
>> @@ -99,6 +109,15 @@ UNLIKELY_END(realmode)
>>  .Lvmx_vmentry_fail:
>>          sti
>>          SAVE_ALL
>> +
>> +        /*
>> +         * PV variant needed here as no guest code has executed (so
>> +         * MSR_SPEC_CTRL can't have changed value), and NMIs/MCEs are liable
>> +         * to hit (in which case the HVM varient might corrupt things).
> variant

Fixed.

>
>> +.macro DO_SPEC_CTRL_ENTRY maybexen:req ibrs_val:req
>> +/*
>> + * Requires %rsp=regs (also cpuinfo if !maybexen)
>> + * Requires %r14=stack_end (if maybexen)
>> + * Clobbers %rax, %rcx, %rdx
>> + *
>> + * PV guests can't update MSR_SPEC_CTRL behind Xen's back, so no need to read
>> + * it back.  Entries from guest context need to clear SPEC_CTRL shadowing,
>> + * while entries from Xen must leave shadowing in its current state.
>> + */
>> +    mov $MSR_SPEC_CTRL, %ecx
>> +
>> +    .if \maybexen
>> +        testb $3, UREGS_cs(%rsp)
>> +        jz .L\@_entry_from_xen
>> +    .endif
>> +
>> +    /*
>> +     * Clear SPEC_CTRL shadowing *before* loading Xen's value.  If entering
>> +     * from a possibly-xen context, %rsp doesn't necessarily alias the cpuinfo
>> +     * block so calculate the position directly.
>> +     */
>> +    .if \maybexen
>> +        movb $0, STACK_CPUINFO_FIELD(use_shadow_spec_ctrl)(%r14)
>> +    .else
>> +        movb $0, CPUINFO_use_shadow_spec_ctrl(%rsp)
>> +    .endif
>> +
>> +.L\@_entry_from_xen:
>> +    /* Load Xen's intended value. */
>> +    mov $\ibrs_val, %eax
>> +    xor %edx, %edx
>> +    wrmsr
>> +.endm
> Did you consider avoiding the conditional branch here altogether,
> by doing something like
>
>     .if \maybexen
>         testb $3, UREGS_cs(%rsp)
>         setz %al
>         neg %al
>         and %al, STACK_CPUINFO_FIELD(use_shadow_spec_ctrl)(%r14)
>     .else
>
> ?

That is quite subtle, although if you are looking to optimise further,
use_shadow_spec_ctrl being a bool means you can drop the neg.  An and
with the result of setz is good enough not to change the boolean value.
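
For illustration, a sketch of the branchless form being discussed (not the
actual patch; the labels and field names are the ones from the hunk above):

    .if \maybexen
        testb $3, UREGS_cs(%rsp)    /* ZF set when entering from Xen */
        setz %al                    /* %al = 1 from Xen, 0 from guest */
        /* The field is a 0/1 bool, so and-ing clears it only on guest entry. */
        and %al, STACK_CPUINFO_FIELD(use_shadow_spec_ctrl)(%r14)
    .else
        movb $0, CPUINFO_use_shadow_spec_ctrl(%rsp)
    .endif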

> It may also be worthwhile again to pull up the zeroing of %edx,
> using %dl instead of $0 in the movb (and maybe then also
> similarly in DO_SPEC_CTRL_EXIT_TO_XEN, but there I'm less
> certain it couldn't have a negative effect).

What negative effects are you worried about?  These macros are self
contained.

~Andrew


* Re: [PATCH v10 06/11] x86/entry: Organise the clobbering of the RSB/RAS on entry to Xen
  2018-01-25 13:19   ` Jan Beulich
@ 2018-01-25 14:17     ` Andrew Cooper
  2018-01-25 14:40       ` Jan Beulich
  0 siblings, 1 reply; 57+ messages in thread
From: Andrew Cooper @ 2018-01-25 14:17 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Xen-devel

On 25/01/18 13:19, Jan Beulich wrote:
>>>> On 24.01.18 at 14:12, <andrew.cooper3@citrix.com> wrote:
>> --- a/xen/include/asm-x86/spec_ctrl_asm.h
>> +++ b/xen/include/asm-x86/spec_ctrl_asm.h
>> @@ -74,6 +74,43 @@
>>   *  - SPEC_CTRL_EXIT_TO_GUEST
>>   */
>>  
>> +.macro DO_OVERWRITE_RSB
>> +/*
>> + * Requires nothing
>> + * Clobbers %rax, %rcx
>> + *
>> + * Requires 256 bytes of stack space, but %rsp has no net change. Based on
>> + * Google's performance numbers, the loop is unrolled to 16 iterations and two
>> + * calls per iteration.
>> + *
>> + * The call filling the RSB needs a nonzero displacement.  A nop would do, but
>> + * we use "1: pause, jmp 1b" to safely contain any ret-based speculation,
>> + * even if the loop is speculatively executed prematurely.
>> + *
>> + * %rsp is preserved by using an extra GPR because a) we've got plenty spare,
>> + * b) the two movs are shorter to encode than `add $32*8, %rsp`, and c) can be
>> + * optimised with mov-elimination in modern cores.
>> + */
>> +    mov $16, %ecx                   /* 16 iterations, two calls per loop */
>> +    mov %rsp, %rax                  /* Store the current %rsp */
>> +
>> +.L\@_fill_rsb_loop:
>> +
>> +    .irp n, 1, 2                    /* Unrolled twice. */
>> +    call .L\@_insert_rsb_entry_\n   /* Create an RSB entry. */
>> +
>> +.L\@_capture_speculation_\n:
>> +    pause
>> +    jmp .L\@_capture_speculation_\n /* Capture rogue speculation. */
> Have you seen Linux commit 28d437d550e1e39f805d99f9f8ac399c778827b7
> ("x86/retpoline: Add LFENCE to the retpoline/RSB filling RSB
> macros")?

I saw the patch, but hadn't realised it had gone in.  There was quite a
long discussion.
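
For reference, that Linux change boils down to an extra lfence in the
speculation-capture loop, i.e. roughly (illustrative, not a patch):

    .L\@_capture_speculation_\n:
        pause
        lfence
        jmp .L\@_capture_speculation_\n /* Capture rogue speculation. */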

>  I think we want to have the same, to please AMD. I'd
> suggest to use alternative patching though (except again on the
> IST paths), but then again maybe in a follow-up patch.

I trust the later patches are suitable in the IST regard?

~Andrew


* Re: [PATCH v10 05/11] x86/entry: Organise the use of MSR_SPEC_CTRL at each entry/exit point
  2018-01-25 14:12     ` Andrew Cooper
@ 2018-01-25 14:36       ` Jan Beulich
  2018-01-25 14:46         ` Andrew Cooper
  0 siblings, 1 reply; 57+ messages in thread
From: Jan Beulich @ 2018-01-25 14:36 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel

>>> On 25.01.18 at 15:12, <andrew.cooper3@citrix.com> wrote:
> On 25/01/18 13:08, Jan Beulich wrote:
>> It may also be worthwhile again to pull up the zeroing of %edx,
>> using %dl instead of $0 in the movb (and maybe then also
>> similarly in DO_SPEC_CTRL_EXIT_TO_XEN, but there I'm less
>> certain it couldn't have a negative effect).
> 
> What negative effects are you worried about?  These macros are self
> contained.

The result of the xor then is an input to the cmp, which may be
one cycle more latency than with the immediate zero.

Jan



* Re: [PATCH v10 06/11] x86/entry: Organise the clobbering of the RSB/RAS on entry to Xen
  2018-01-25 14:17     ` Andrew Cooper
@ 2018-01-25 14:40       ` Jan Beulich
  2018-01-25 14:44         ` Andrew Cooper
  0 siblings, 1 reply; 57+ messages in thread
From: Jan Beulich @ 2018-01-25 14:40 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel

>>> On 25.01.18 at 15:17, <andrew.cooper3@citrix.com> wrote:
> On 25/01/18 13:19, Jan Beulich wrote:
>>  I think we want to have the same, to please AMD. I'd
>> suggest to use alternative patching though (except again on the
>> IST paths), but then again maybe in a follow-up patch.
> 
> I trust the later patches are suitable in the IST regard?

I've responded to patch 7. What other later patches have anything
IST-related?

Jan



* Re: [PATCH v10 06/11] x86/entry: Organise the clobbering of the RSB/RAS on entry to Xen
  2018-01-25 14:40       ` Jan Beulich
@ 2018-01-25 14:44         ` Andrew Cooper
  2018-01-25 14:48           ` Jan Beulich
  0 siblings, 1 reply; 57+ messages in thread
From: Andrew Cooper @ 2018-01-25 14:44 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Xen-devel

On 25/01/18 14:40, Jan Beulich wrote:
>>>> On 25.01.18 at 15:17, <andrew.cooper3@citrix.com> wrote:
>> On 25/01/18 13:19, Jan Beulich wrote:
>>>  I think we want to have the same, to please AMD. I'd
>>> suggest to use alternative patching though (except again on the
>>> IST paths), but then again maybe in a follow-up patch.
>> I trust the later patches are suitable in the IST regard?
> I've responded to patch 7. What other later patches have anything
> IST-related?

None, but my point was to confirm that your IST concern doesn't affect
this patch.

~Andrew


* Re: [PATCH v10 09/11] x86/ctxt: Issue a speculation barrier between vcpu contexts
  2018-01-24 14:31       ` David Woodhouse
@ 2018-01-25 14:46         ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 57+ messages in thread
From: Konrad Rzeszutek Wilk @ 2018-01-25 14:46 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Andrew Cooper, JBeulich, xen-devel

On Wed, Jan 24, 2018 at 02:31:20PM +0000, David Woodhouse wrote:
> On Wed, 2018-01-24 at 13:49 +0000, Andrew Cooper wrote:
> > On 24/01/18 13:34, Woodhouse, David wrote:
> > > I am loath to suggest *more* tweakables, but given the IBPB cost is
> > > there any merit in having a mode which does it only if the *domain* is
> > > different, regardless of vcpu_id?
> >
> > This would only be a win if you were regularly cross-scheduling vcpus
> > from the same domain, which case you've probably other issues to be
> > worried about.
> 
> Of course. If the guest *knows* about HT siblings that kind of implies
> you've told it about the topology and thus you're pinning vCPU. I don't
> think there *is* a world in which what I said makes sense.

You can expose vNUMA and topology in the guest (and pin the vCPUs) so that
it will not be moving at all.

> 
> > > 
> > > If a given domain is running on HT siblings, it ought to be doing its
> > > own mitigation — setting STIBP for userspace if it wants, ensuring its
> > > own kernel is safe by having IBRS set or using retpoline, etc.
> > ~Andrew
> > 
> > [1] Is this trying to be a subtle hint?
> 
> Heh, no. When I get to that bit, and the *reasons* we do that, it'll be
> far from subtle. But as with so many other things, NOT THIS WEEK :)




* Re: [PATCH v10 05/11] x86/entry: Organise the use of MSR_SPEC_CTRL at each entry/exit point
  2018-01-25 14:36       ` Jan Beulich
@ 2018-01-25 14:46         ` Andrew Cooper
  2018-01-25 15:08           ` Jan Beulich
  0 siblings, 1 reply; 57+ messages in thread
From: Andrew Cooper @ 2018-01-25 14:46 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Xen-devel

On 25/01/18 14:36, Jan Beulich wrote:
>>>> On 25.01.18 at 15:12, <andrew.cooper3@citrix.com> wrote:
>> On 25/01/18 13:08, Jan Beulich wrote:
>>> It may also be worthwhile again to pull up the zeroing of %edx,
>>> using %dl instead of $0 in the movb (and maybe then also
>>> similarly in DO_SPEC_CTRL_EXIT_TO_XEN, but there I'm less
>>> certain it couldn't have a negative effect).
>> What negative effects are you worried about?  These macros are self
>> contained.
> The result of the xor then is an input to the cmp, which may be
> one cycle more latency than with the immediate zero.

Why?  Cmp writes all flags and reads none of them, so doesn't have any
flags dependency on earlier instructions.

It is only instructions which partially preserve older flag bits (and
where undefined behaviour is complicated) which suffer flags-merge
penalties.

~Andrew


* Re: [PATCH v10 06/11] x86/entry: Organise the clobbering of the RSB/RAS on entry to Xen
  2018-01-25 14:44         ` Andrew Cooper
@ 2018-01-25 14:48           ` Jan Beulich
  0 siblings, 0 replies; 57+ messages in thread
From: Jan Beulich @ 2018-01-25 14:48 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel

>>> On 25.01.18 at 15:44, <andrew.cooper3@citrix.com> wrote:
> On 25/01/18 14:40, Jan Beulich wrote:
>>>>> On 25.01.18 at 15:17, <andrew.cooper3@citrix.com> wrote:
>>> On 25/01/18 13:19, Jan Beulich wrote:
>>>>  I think we want to have the same, to please AMD. I'd
>>>> suggest to use alternative patching though (except again on the
>>>> IST paths), but then again maybe in a follow-up patch.
>>> I trust the later patches are suitable in the IST regard?
>> I've responded to patch 7. What other later patches have anything
>> IST-related?
> 
> None, but my point was to inquire that your IST concern doesn't affect
> this patch?

No, it was just to re-emphasize that I think we'd be better off not
to patch any of the IST related code paths.

Jan



* Re: [PATCH v10 07/11] x86/entry: Avoid using alternatives in NMI/#MC paths
  2018-01-25 13:43   ` Jan Beulich
@ 2018-01-25 15:04     ` Andrew Cooper
  2018-01-25 15:14       ` Jan Beulich
  0 siblings, 1 reply; 57+ messages in thread
From: Andrew Cooper @ 2018-01-25 15:04 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Xen-devel

On 25/01/18 13:43, Jan Beulich wrote:
>>>> On 24.01.18 at 14:12, <andrew.cooper3@citrix.com> wrote:
>> --- a/xen/include/asm-x86/spec_ctrl_asm.h
>> +++ b/xen/include/asm-x86/spec_ctrl_asm.h
>> @@ -20,6 +20,11 @@
>>  #ifndef __X86_SPEC_CTRL_ASM_H__
>>  #define __X86_SPEC_CTRL_ASM_H__
>>  
>> +/* Encoding of the bottom bits in cpuinfo.bti_ist_info */
>> +#define BTI_IST_IBRS  (1 << 0)
> This should be annotated to make clear it can't change its value. Or
> maybe have something BUILD_BUG_ON()-like elsewhere.

Ok, although I don't intend for this code to stay around for very long.

>
>> @@ -256,6 +261,69 @@
>>          DO_SPEC_CTRL_EXIT_TO_GUEST, X86_FEATURE_XEN_IBRS_SET,           \
>>          DO_SPEC_CTRL_EXIT_TO_GUEST, X86_FEATURE_XEN_IBRS_CLEAR
>>  
>> +/* TODO: Drop these when the alternatives infrastructure is NMI/#MC safe. */
>> +.macro SPEC_CTRL_ENTRY_FROM_INTR_IST
>> +/*
>> + * Requires %rsp=regs, %r14=stack_end
>> + * Clobbers %rax, %rcx, %rdx
>> + *
>> + * This is logical merge of DO_OVERWRITE_RSB and DO_SPEC_CTRL_ENTRY
>> + * maybexen=1, but with conditionals rather than alternatives.
>> + */
>> +    movzbl STACK_CPUINFO_FIELD(bti_ist_info)(%r14), %eax
>> +
>> +    testb $BTI_IST_RSB, %al
>> +    jz .L\@_skip_rsb
>> +
>> +    DO_OVERWRITE_RSB
>> +
>> +.L\@_skip_rsb:
>> +
>> +    testb $BTI_IST_WRMSR, %al
>> +    jz .L\@_skip_wrmsr
>> +
>> +    testb $3, UREGS_cs(%rsp)
>> +    jz .L\@_entry_from_xen
>> +
>> +    movb $0, STACK_CPUINFO_FIELD(use_shadow_spec_ctrl)(%r14)
>> +
>> +.L\@_entry_from_xen:
>> +    /*
>> +     * Load Xen's intended value.  SPEC_CTRL_IBRS vs 0 is encoded in the
>> +     * bottom bit of bti_ist_info, via a deliberate alias with BTI_IST_IBRS.
>> +     */
>> +    mov $MSR_SPEC_CTRL, %ecx
>> +    and $BTI_IST_IBRS, %eax
>> +    xor %edx, %edx
>> +    wrmsr
>> +
>> +    /* Opencoded UNLIKELY_START() with no condition. */
>> +.Ldispatch.\@_serialise:
>> +    .subsection 1
> How about adding .ifnb to UNLIKELY_START(), instead of open coding?
> Or wait - why can't you use it as-is above (instead of the
> "jz .L\@_skip_wrmsr")?

Because then the end label is wrong.  One way or another, we either need
to opencode something, or implement UNLIKELY_ELSE() which is actually
the likely case, and this is a substantially larger can of worms.

>
>> +    /*
>> +     * In the case that we might need to write to MSR_SPEC_CTRL for safety, we
>> +     * need to ensure that an attacker can't poison the `jz .L\@_skip_wrmsr`
>> +     * to speculate around the WRMSR.  As a result, we need a dispatch
>> +     * serialising instruction in the else clause.
>> +     */
>> +.L\@_skip_wrmsr:
>> +    lfence
>> +    UNLIKELY_END(\@_serialise)
>> +.endm
> Is LFENCE alone sufficient here for AMD in case "x86/amd: Try to
> set lfence as being Dispatch Serialising" failed to achieve the intended
> effect?

Technically, no.  In practice, yes.

The lfence is only needed when we require the WRMSR to set
SPEC_CTRL.IBRS to 1, and AMD don't implement IBRS yet.

>  In fact an LFENCE -> MFENCE conversion (through
> alternatives patching) would even be safe on the IST paths, as the
> two encodings differ by a single byte only.
>
>> +.macro SPEC_CTRL_EXIT_TO_XEN_IST
>> +/*
>> + * Requires %rbx=stack_end
>> + * Clobbers %rax, %rcx, %rdx
>> + */
>> +    testb $BTI_IST_WRMSR, STACK_CPUINFO_FIELD(bti_ist_info)(%rbx)
>> +    jz .L\@_skip
>> +
>> +    DO_SPEC_CTRL_EXIT_TO_XEN
>> +
>> +.L\@_skip:
>> +.endm
> Why a new macro, instead of modifying the existing
> SPEC_CTRL_EXIT_TO_XEN, the sole use site of which is being
> changed? Just for the ease of eventually reverting?

Correct.  Having the IST name also makes it more obvious when reading
entry.S

> And then, having reached the end of the patch - where is the
> new struct cpu_info field written? Or is this again happening only in
> later patches, with the one here just setting the stage? If so,
> shouldn't you at least zero the field in init_shadow_spec_ctrl_state()?

Hmm - I should set 0 in init_shadow_spec_ctrl_state().

All the other calculation logic needs to be deferred to the subsequent
patch for bisectability.

~Andrew


* Re: [PATCH v10 05/11] x86/entry: Organise the use of MSR_SPEC_CTRL at each entry/exit point
  2018-01-25 14:46         ` Andrew Cooper
@ 2018-01-25 15:08           ` Jan Beulich
  2018-01-25 15:10             ` Andrew Cooper
  0 siblings, 1 reply; 57+ messages in thread
From: Jan Beulich @ 2018-01-25 15:08 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel

>>> On 25.01.18 at 15:46, <andrew.cooper3@citrix.com> wrote:
> On 25/01/18 14:36, Jan Beulich wrote:
>>>>> On 25.01.18 at 15:12, <andrew.cooper3@citrix.com> wrote:
>>> On 25/01/18 13:08, Jan Beulich wrote:
>>>> It may also be worthwhile again to pull up the zeroing of %edx,
>>>> using %dl instead of $0 in the movb (and maybe then also
>>>> similarly in DO_SPEC_CTRL_EXIT_TO_XEN, but there I'm less
>>>> certain it couldn't have a negative effect).
>>> What negative effects are you worried about?  These macros are self
>>> contained.
>> The result of the xor then is an input to the cmp, which may be
>> one cycle more latency than with the immediate zero.
> 
> Why?  Cmp writes all flags and reads none of them, so doesn't have any
> flags dependency on earlier instructions.
> 
> It is only instructions which partially preserve older flag bits (and
> where undefined behaviour is complicated) which suffer flags-merge
> penalties.

I'm not talking about an eflags dependency, but that on the
register (%edx being written by xor, %dl used by cmp).

Jan



* Re: [PATCH v10 05/11] x86/entry: Organise the use of MSR_SPEC_CTRL at each entry/exit point
  2018-01-25 15:08           ` Jan Beulich
@ 2018-01-25 15:10             ` Andrew Cooper
  0 siblings, 0 replies; 57+ messages in thread
From: Andrew Cooper @ 2018-01-25 15:10 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Xen-devel

On 25/01/18 15:08, Jan Beulich wrote:
>>>> On 25.01.18 at 15:46, <andrew.cooper3@citrix.com> wrote:
>> On 25/01/18 14:36, Jan Beulich wrote:
>>>>>> On 25.01.18 at 15:12, <andrew.cooper3@citrix.com> wrote:
>>>> On 25/01/18 13:08, Jan Beulich wrote:
>>>>> It may also be worthwhile again to pull up the zeroing of %edx,
>>>>> using %dl instead of $0 in the movb (and maybe then also
>>>>> similarly in DO_SPEC_CTRL_EXIT_TO_XEN, but there I'm less
>>>>> certain it couldn't have a negative effect).
>>>> What negative effects are you worried about?  These macros are self
>>>> contained.
>>> The result of the xor then is an input to the cmp, which may be
>>> one cycle more latency than with the immediate zero.
>> Why?  Cmp writes all flags and reads none of them, so doesn't have any
>> flags dependency on earlier instructions.
>>
>> It is only instructions which partially preserve older flag bits (and
>> where undefined behaviour is complicated) which suffer flags-merge
>> penalties.
> I'm not talking about an eflags dependency, but that on the
> register (%edx being written by xor, %dl used by cmp).

xor %edx, %edx is a zeroing idiom and doesn't get executed.  Register
renaming will cause the %dl to be sourced from the microarchitectural
idea of zero, rather than from the real result of the xor.
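
Illustrative example only (the memory operand is a stand-in):

    xor  %edx, %edx     /* zeroing idiom: handled at rename, no execution uop */
    cmpb %dl, (%rax)    /* %dl sourced from the renamer's zero, no extra latency */

so it behaves no worse than using an immediate $0 in the cmp.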

~Andrew


* Re: [PATCH v10 07/11] x86/entry: Avoid using alternatives in NMI/#MC paths
  2018-01-25 15:04     ` Andrew Cooper
@ 2018-01-25 15:14       ` Jan Beulich
  2018-01-25 15:19         ` Andrew Cooper
  0 siblings, 1 reply; 57+ messages in thread
From: Jan Beulich @ 2018-01-25 15:14 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel

>>> On 25.01.18 at 16:04, <andrew.cooper3@citrix.com> wrote:
> On 25/01/18 13:43, Jan Beulich wrote:
>>>>> On 24.01.18 at 14:12, <andrew.cooper3@citrix.com> wrote:
>>> @@ -256,6 +261,69 @@
>>>          DO_SPEC_CTRL_EXIT_TO_GUEST, X86_FEATURE_XEN_IBRS_SET,           \
>>>          DO_SPEC_CTRL_EXIT_TO_GUEST, X86_FEATURE_XEN_IBRS_CLEAR
>>>  
>>> +/* TODO: Drop these when the alternatives infrastructure is NMI/#MC safe. */
>>> +.macro SPEC_CTRL_ENTRY_FROM_INTR_IST
>>> +/*
>>> + * Requires %rsp=regs, %r14=stack_end
>>> + * Clobbers %rax, %rcx, %rdx
>>> + *
>>> + * This is logical merge of DO_OVERWRITE_RSB and DO_SPEC_CTRL_ENTRY
>>> + * maybexen=1, but with conditionals rather than alternatives.
>>> + */
>>> +    movzbl STACK_CPUINFO_FIELD(bti_ist_info)(%r14), %eax
>>> +
>>> +    testb $BTI_IST_RSB, %al
>>> +    jz .L\@_skip_rsb
>>> +
>>> +    DO_OVERWRITE_RSB
>>> +
>>> +.L\@_skip_rsb:
>>> +
>>> +    testb $BTI_IST_WRMSR, %al
>>> +    jz .L\@_skip_wrmsr
>>> +
>>> +    testb $3, UREGS_cs(%rsp)
>>> +    jz .L\@_entry_from_xen
>>> +
>>> +    movb $0, STACK_CPUINFO_FIELD(use_shadow_spec_ctrl)(%r14)
>>> +
>>> +.L\@_entry_from_xen:
>>> +    /*
>>> +     * Load Xen's intended value.  SPEC_CTRL_IBRS vs 0 is encoded in the
>>> +     * bottom bit of bti_ist_info, via a deliberate alias with BTI_IST_IBRS.
>>> +     */
>>> +    mov $MSR_SPEC_CTRL, %ecx
>>> +    and $BTI_IST_IBRS, %eax
>>> +    xor %edx, %edx
>>> +    wrmsr
>>> +
>>> +    /* Opencoded UNLIKELY_START() with no condition. */
>>> +.Ldispatch.\@_serialise:
>>> +    .subsection 1
>> How about adding .ifnb to UNLIKELY_START(), instead of open coding?
>> Or wait - why can't you use it as-is above (instead of the
>> "jz .L\@_skip_wrmsr")?
> 
> Because then the end label is wrong.  One way of another, we either need
> to opencode something, or implement UNLIKELY_ELSE() which is actually
> the likely case, and this is a substantially larger can of worms.

I'm willing to trust you for now (until I get around to play with this
myself), but I don't follow: In an UNLIKELY_START() /
UNLIKELY_END() pair, how could the labels be wrong if both specify
the same tag?

>> And then, having reached the end of the patch - where is the
>> new struct cpu_info field written? Or is this again happening only in
>> later patches, with the one here just setting the stage? If so,
>> shouldn't you at least zero the field in init_shadow_spec_ctrl_state()?
> 
> Hmm - I should set 0 in init_shadow_spec_ctrl_state().
> 
> All the other calculation logic needs to be deferred to the subsequent
> patch for bisectiblaty.

Of course. With this initialization fixed and the connection made
between BTI_IST_IBRS and the MSR bit at the definition point of
the former,
Reviewed-by: Jan Beulich <jbeulich@suse.com>

Jan



* Re: [PATCH v10 07/11] x86/entry: Avoid using alternatives in NMI/#MC paths
  2018-01-25 15:14       ` Jan Beulich
@ 2018-01-25 15:19         ` Andrew Cooper
  2018-01-25 16:17           ` Jan Beulich
  0 siblings, 1 reply; 57+ messages in thread
From: Andrew Cooper @ 2018-01-25 15:19 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Xen-devel

On 25/01/18 15:14, Jan Beulich wrote:
>>>> On 25.01.18 at 16:04, <andrew.cooper3@citrix.com> wrote:
>> On 25/01/18 13:43, Jan Beulich wrote:
>>>>>> On 24.01.18 at 14:12, <andrew.cooper3@citrix.com> wrote:
>>>> @@ -256,6 +261,69 @@
>>>>          DO_SPEC_CTRL_EXIT_TO_GUEST, X86_FEATURE_XEN_IBRS_SET,           \
>>>>          DO_SPEC_CTRL_EXIT_TO_GUEST, X86_FEATURE_XEN_IBRS_CLEAR
>>>>  
>>>> +/* TODO: Drop these when the alternatives infrastructure is NMI/#MC safe. */
>>>> +.macro SPEC_CTRL_ENTRY_FROM_INTR_IST
>>>> +/*
>>>> + * Requires %rsp=regs, %r14=stack_end
>>>> + * Clobbers %rax, %rcx, %rdx
>>>> + *
>>>> + * This is logical merge of DO_OVERWRITE_RSB and DO_SPEC_CTRL_ENTRY
>>>> + * maybexen=1, but with conditionals rather than alternatives.
>>>> + */
>>>> +    movzbl STACK_CPUINFO_FIELD(bti_ist_info)(%r14), %eax
>>>> +
>>>> +    testb $BTI_IST_RSB, %al
>>>> +    jz .L\@_skip_rsb
>>>> +
>>>> +    DO_OVERWRITE_RSB
>>>> +
>>>> +.L\@_skip_rsb:
>>>> +
>>>> +    testb $BTI_IST_WRMSR, %al
>>>> +    jz .L\@_skip_wrmsr
>>>> +
>>>> +    testb $3, UREGS_cs(%rsp)
>>>> +    jz .L\@_entry_from_xen
>>>> +
>>>> +    movb $0, STACK_CPUINFO_FIELD(use_shadow_spec_ctrl)(%r14)
>>>> +
>>>> +.L\@_entry_from_xen:
>>>> +    /*
>>>> +     * Load Xen's intended value.  SPEC_CTRL_IBRS vs 0 is encoded in the
>>>> +     * bottom bit of bti_ist_info, via a deliberate alias with BTI_IST_IBRS.
>>>> +     */
>>>> +    mov $MSR_SPEC_CTRL, %ecx
>>>> +    and $BTI_IST_IBRS, %eax
>>>> +    xor %edx, %edx
>>>> +    wrmsr
>>>> +
>>>> +    /* Opencoded UNLIKELY_START() with no condition. */
>>>> +.Ldispatch.\@_serialise:
>>>> +    .subsection 1
>>> How about adding .ifnb to UNLIKELY_START(), instead of open coding?
>>> Or wait - why can't you use it as-is above (instead of the
>>> "jz .L\@_skip_wrmsr")?
>> Because then the end label is wrong.  One way of another, we either need
>> to opencode something, or implement UNLIKELY_ELSE() which is actually
>> the likely case, and this is a substantially larger can of worms.
> I'm willing to trust you for now (until I get around to play with this
> myself), but I don't follow: In an UNLIKELY_START() /
> UNLIKELY_END() pair, how could the labels be wrong if both specify
> the same tag?

The problem is that execution needs to jmp to after the wrmsr, rather
than before the testb $3, UREGS_cs(%rsp), which is where UNLIKELY_END()
would currently leave it.

We do not have suitable UNLIKELY() infrastructure to express a non-empty
likely basic block.

>
>>> And then, having reached the end of the patch - where is the
>>> new struct cpu_info field written? Or is this again happening only in
>>> later patches, with the one here just setting the stage? If so,
>>> shouldn't you at least zero the field in init_shadow_spec_ctrl_state()?
>> Hmm - I should set 0 in init_shadow_spec_ctrl_state().
>>
>> All the other calculation logic needs to be deferred to the subsequent
>> patch for bisectiblaty.
> Of course. With this initialization fixed and the connection made
> between BTI_IST_IBRS and the MSR bit at the definition point of
> the former,
> Reviewed-by: Jan Beulich <jbeulich@suse.com>

Thanks,

~Andrew


* Re: [PATCH v10 09/11] x86/ctxt: Issue a speculation barrier between vcpu contexts
  2018-01-24 13:12 ` [PATCH v10 09/11] x86/ctxt: Issue a speculation barrier between vcpu contexts Andrew Cooper
  2018-01-24 13:34   ` Woodhouse, David
@ 2018-01-25 15:57   ` Jan Beulich
  2018-01-25 16:09     ` Andrew Cooper
  1 sibling, 1 reply; 57+ messages in thread
From: Jan Beulich @ 2018-01-25 15:57 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: David Woodhouse, Xen-devel

>>> On 24.01.18 at 14:12, <andrew.cooper3@citrix.com> wrote:
> @@ -1743,6 +1744,34 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
>          }
>  
>          ctxt_switch_levelling(next);
> +
> +        if ( opt_ibpb && !is_idle_domain(nextd) )

Is the idle domain check here really useful?

> +        {
> +            static DEFINE_PER_CPU(unsigned int, last);
> +            unsigned int *last_id = &this_cpu(last);
> +
> +            /*
> +             * Squash the domid and vcpu id together for comparison
> +             * efficiency.  We could in principle stash and compare the struct
> +             * vcpu pointer, but this risks a false alias if a domain has died
> +             * and the same 4k page gets reused for a new vcpu.
> +             */
> +            unsigned int next_id = (((unsigned int)nextd->domain_id << 16) |
> +                                    (uint16_t)next->vcpu_id);
> +            BUILD_BUG_ON(MAX_VIRT_CPUS > 0xffff);
> +
> +            /*
> +             * When scheduling from a vcpu, to idle, and back to the same vcpu
> +             * (which might be common in a lightly loaded system, or when
> +             * using vcpu pinning), there is no need to issue IBPB, as we are
> +             * returning to the same security context.
> +             */
> +            if ( *last_id != next_id )
> +            {
> +                wrmsrl(MSR_PRED_CMD, PRED_CMD_IBPB);
> +                *last_id = next_id;
> +            }
> +        }
>      }
>  
>      context_saved(prev);

Short of any real numbers (or a proper explanation) having been
provided, I've done some measurements. Indeed I can see quite
high a rate of cases of execution coming this way despite the
vCPU not really changing during early boot of HVM guests. This
goes down quite a bit later on, but obviously that's also workload
dependent. But the number of cases where the barrier emission
could be avoided remains non-negligible, so I agree the extra
avoidance logic is probably warranted. On that basis (perhaps
with the idle check above removed)
Reviewed-by: Jan Beulich <jbeulich@suse.com>

For the record, the overwhelming majority of calls to
__sync_local_execstate() being responsible for the behavior
come from invalidate_interrupt(), which suggests to me that
there's a meaningful number of cases where a vCPU is migrated
to another CPU and then back, without another vCPU having
run on the original CPU in between. If I'm not wrong with this,
I have to question why the vCPU is migrated then in the first
place.

Jan



* Re: [PATCH v10 09/11] x86/ctxt: Issue a speculation barrier between vcpu contexts
  2018-01-25 15:57   ` Jan Beulich
@ 2018-01-25 16:09     ` Andrew Cooper
  2018-01-25 16:15       ` Andrew Cooper
                         ` (2 more replies)
  0 siblings, 3 replies; 57+ messages in thread
From: Andrew Cooper @ 2018-01-25 16:09 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Dario Faggioli, David Woodhouse, Xen-devel

On 25/01/18 15:57, Jan Beulich wrote:
>>>> On 24.01.18 at 14:12, <andrew.cooper3@citrix.com> wrote:
>> @@ -1743,6 +1744,34 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
>>          }
>>  
>>          ctxt_switch_levelling(next);
>> +
>> +        if ( opt_ibpb && !is_idle_domain(nextd) )
> Is the idle domain check here really useful?

Yes, because as you pointed out in v9, the outer condition isn't
sufficient to exclude nextd being idle.

>
>> +        {
>> +            static DEFINE_PER_CPU(unsigned int, last);
>> +            unsigned int *last_id = &this_cpu(last);
>> +
>> +            /*
>> +             * Squash the domid and vcpu id together for comparison
>> +             * efficiency.  We could in principle stash and compare the struct
>> +             * vcpu pointer, but this risks a false alias if a domain has died
>> +             * and the same 4k page gets reused for a new vcpu.
>> +             */
>> +            unsigned int next_id = (((unsigned int)nextd->domain_id << 16) |
>> +                                    (uint16_t)next->vcpu_id);
>> +            BUILD_BUG_ON(MAX_VIRT_CPUS > 0xffff);
>> +
>> +            /*
>> +             * When scheduling from a vcpu, to idle, and back to the same vcpu
>> +             * (which might be common in a lightly loaded system, or when
>> +             * using vcpu pinning), there is no need to issue IBPB, as we are
>> +             * returning to the same security context.
>> +             */
>> +            if ( *last_id != next_id )
>> +            {
>> +                wrmsrl(MSR_PRED_CMD, PRED_CMD_IBPB);
>> +                *last_id = next_id;
>> +            }
>> +        }
>>      }
>>  
>>      context_saved(prev);
> Short of any real numbers (or a proper explanation) having been
> provided, I've done some measurements. Indeed I can see quite
> high a rate of cases of execution coming this way despite the
> vCPU not really changing during early boot of HVM guests. This
> goes down quite a bit later on, but obviously that's also workload
> dependent. But the number of cases where the barrier emission
> could be avoided remains non-negligible, so I agree the extra
> avoidance logic is probably warranted. On that basis (perhaps
> with the idle check above removed)
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
>
> For the record, the overwhelming majority of calls to
> __sync_local_execstate() being responsible for the behavior
> come from invalidate_interrupt(), which suggests to me that
> there's a meaningful number of cases where a vCPU is migrated
> to another CPU and then back, without another vCPU having
> run on the original CPU in between. If I'm not wrong with this,
> I have to question why the vCPU is migrated then in the first
> place.

This is a known (mis)feature of the Credit scheduler.  When a PCPU goes
idle, it aggressively steals work from the adjacent cpus, even if the
adjacent cpu only has a single vcpu to run and is running it.

XenServer has this gross hack which makes aggregate networking
measurements far, far better:

https://github.com/xenserver/xen.pg/blob/XS-7.1.x/master/sched-credit1-use-per-pcpu-runqueue-count.patch

Dario made a different fix to Credit1 upstream which was supposed to
resolve this behaviour (although I can't locate the patch by a list of
titles), but based on these observations, I'd say the misfeature hasn't
been fixed.

~Andrew


* Re: [PATCH v10 09/11] x86/ctxt: Issue a speculation barrier between vcpu contexts
  2018-01-25 16:09     ` Andrew Cooper
@ 2018-01-25 16:15       ` Andrew Cooper
  2018-01-27  1:27         ` Dario Faggioli
  2018-01-25 16:31       ` Jan Beulich
  2018-01-25 18:49       ` Dario Faggioli
  2 siblings, 1 reply; 57+ messages in thread
From: Andrew Cooper @ 2018-01-25 16:15 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Dario Faggioli, David Woodhouse, Xen-devel

CC'ing Dario with a working email address this time...

On 25/01/18 16:09, Andrew Cooper wrote:
> On 25/01/18 15:57, Jan Beulich wrote:
>>>>> On 24.01.18 at 14:12, <andrew.cooper3@citrix.com> wrote:
>>> @@ -1743,6 +1744,34 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
>>>          }
>>>  
>>>          ctxt_switch_levelling(next);
>>> +
>>> +        if ( opt_ibpb && !is_idle_domain(nextd) )
>> Is the idle domain check here really useful?
> Yes, because as you pointed out in v9, the outer condition isn't
> sufficient to exclude nextd being idle.
>
>>> +        {
>>> +            static DEFINE_PER_CPU(unsigned int, last);
>>> +            unsigned int *last_id = &this_cpu(last);
>>> +
>>> +            /*
>>> +             * Squash the domid and vcpu id together for comparison
>>> +             * efficiency.  We could in principle stash and compare the struct
>>> +             * vcpu pointer, but this risks a false alias if a domain has died
>>> +             * and the same 4k page gets reused for a new vcpu.
>>> +             */
>>> +            unsigned int next_id = (((unsigned int)nextd->domain_id << 16) |
>>> +                                    (uint16_t)next->vcpu_id);
>>> +            BUILD_BUG_ON(MAX_VIRT_CPUS > 0xffff);
>>> +
>>> +            /*
>>> +             * When scheduling from a vcpu, to idle, and back to the same vcpu
>>> +             * (which might be common in a lightly loaded system, or when
>>> +             * using vcpu pinning), there is no need to issue IBPB, as we are
>>> +             * returning to the same security context.
>>> +             */
>>> +            if ( *last_id != next_id )
>>> +            {
>>> +                wrmsrl(MSR_PRED_CMD, PRED_CMD_IBPB);
>>> +                *last_id = next_id;
>>> +            }
>>> +        }
>>>      }
>>>  
>>>      context_saved(prev);
>> Short of any real numbers (or a proper explanation) having been
>> provided, I've done some measurements. Indeed I can see quite
>> high a rate of cases of execution coming this way despite the
>> vCPU not really changing during early boot of HVM guests. This
>> goes down quite a bit later on, but obviously that's also workload
>> dependent. But the number of cases where the barrier emission
>> could be avoided remains non-negligible, so I agree the extra
>> avoidance logic is probably warranted. On that basis (perhaps
>> with the idle check above removed)
>> Reviewed-by: Jan Beulich <jbeulich@suse.com>
>>
>> For the record, the overwhelming majority of calls to
>> __sync_local_execstate() being responsible for the behavior
>> come from invalidate_interrupt(), which suggests to me that
>> there's a meaningful number of cases where a vCPU is migrated
>> to another CPU and then back, without another vCPU having
>> run on the original CPU in between. If I'm not wrong with this,
>> I have to question why the vCPU is migrated then in the first
>> place.
> This is a known (mis)feature of the Credit scheduler.  When a PCPU goes
> idle, it aggressively steals work from the adjacent cpus, even if the
> adjacent cpu only has a single vcpu to run and is running it.
>
> XenServer has this gross hack which makes aggregate networking
> measurements far far better
>
> https://github.com/xenserver/xen.pg/blob/XS-7.1.x/master/sched-credit1-use-per-pcpu-runqueue-count.patch
>
> Dario made a different fix to Credit1 upstream which was supposed to
> resolve this behaviour (although I can't locate the patch by a list of
> titles), but based on these observations, I'd say the misfeature hasn't
> been fixed.
>
> ~Andrew
>

* Re: [PATCH v10 07/11] x86/entry: Avoid using alternatives in NMI/#MC paths
  2018-01-25 15:19         ` Andrew Cooper
@ 2018-01-25 16:17           ` Jan Beulich
  0 siblings, 0 replies; 57+ messages in thread
From: Jan Beulich @ 2018-01-25 16:17 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel

>>> On 25.01.18 at 16:19, <andrew.cooper3@citrix.com> wrote:
> On 25/01/18 15:14, Jan Beulich wrote:
>>>>> On 25.01.18 at 16:04, <andrew.cooper3@citrix.com> wrote:
>>> On 25/01/18 13:43, Jan Beulich wrote:
>>>>>>> On 24.01.18 at 14:12, <andrew.cooper3@citrix.com> wrote:
>>>>> @@ -256,6 +261,69 @@
>>>>>          DO_SPEC_CTRL_EXIT_TO_GUEST, X86_FEATURE_XEN_IBRS_SET,           \
>>>>>          DO_SPEC_CTRL_EXIT_TO_GUEST, X86_FEATURE_XEN_IBRS_CLEAR
>>>>>  
>>>>> +/* TODO: Drop these when the alternatives infrastructure is NMI/#MC safe. */
>>>>> +.macro SPEC_CTRL_ENTRY_FROM_INTR_IST
>>>>> +/*
>>>>> + * Requires %rsp=regs, %r14=stack_end
>>>>> + * Clobbers %rax, %rcx, %rdx
>>>>> + *
>>>>> + * This is logical merge of DO_OVERWRITE_RSB and DO_SPEC_CTRL_ENTRY
>>>>> + * maybexen=1, but with conditionals rather than alternatives.
>>>>> + */
>>>>> +    movzbl STACK_CPUINFO_FIELD(bti_ist_info)(%r14), %eax
>>>>> +
>>>>> +    testb $BTI_IST_RSB, %al
>>>>> +    jz .L\@_skip_rsb
>>>>> +
>>>>> +    DO_OVERWRITE_RSB
>>>>> +
>>>>> +.L\@_skip_rsb:
>>>>> +
>>>>> +    testb $BTI_IST_WRMSR, %al
>>>>> +    jz .L\@_skip_wrmsr
>>>>> +
>>>>> +    testb $3, UREGS_cs(%rsp)
>>>>> +    jz .L\@_entry_from_xen
>>>>> +
>>>>> +    movb $0, STACK_CPUINFO_FIELD(use_shadow_spec_ctrl)(%r14)
>>>>> +
>>>>> +.L\@_entry_from_xen:
>>>>> +    /*
>>>>> +     * Load Xen's intended value.  SPEC_CTRL_IBRS vs 0 is encoded in the
>>>>> +     * bottom bit of bti_ist_info, via a deliberate alias with BTI_IST_IBRS.
>>>>> +     */
>>>>> +    mov $MSR_SPEC_CTRL, %ecx
>>>>> +    and $BTI_IST_IBRS, %eax
>>>>> +    xor %edx, %edx
>>>>> +    wrmsr
>>>>> +
>>>>> +    /* Opencoded UNLIKELY_START() with no condition. */
>>>>> +.Ldispatch.\@_serialise:
>>>>> +    .subsection 1
>>>> How about adding .ifnb to UNLIKELY_START(), instead of open coding?
>>>> Or wait - why can't you use it as-is above (instead of the
>>>> "jz .L\@_skip_wrmsr")?
>>> Because then the end label is wrong.  One way of another, we either need
>>> to opencode something, or implement UNLIKELY_ELSE() which is actually
>>> the likely case, and this is a substantially larger can of worms.
>> I'm willing to trust you for now (until I get around to play with this
>> myself), but I don't follow: In an UNLIKELY_START() /
>> UNLIKELY_END() pair, how could the labels be wrong if both specify
>> the same tag?
> 
> The problem is that execution needs to jmp to after the wrmsr, rather
> than before the testb $3, UREGS_cs(%rsp), which is where UNLIKELY_END()
> would currently leave it.
> 
> We do not have suitable UNLIKELY() infrastructure to express a non-empty
> likely basic block.

At least we have UNLIKELY_DISPATCH_LABEL() - may I ask you
don't open-code that one then. UNLIKELY_DONE(), as it looks,
won't be of any help here, otoh.

Jan



* Re: [PATCH v10 09/11] x86/ctxt: Issue a speculation barrier between vcpu contexts
  2018-01-25 16:09     ` Andrew Cooper
  2018-01-25 16:15       ` Andrew Cooper
@ 2018-01-25 16:31       ` Jan Beulich
  2018-01-25 16:48         ` Andrew Cooper
  2018-01-25 18:49       ` Dario Faggioli
  2 siblings, 1 reply; 57+ messages in thread
From: Jan Beulich @ 2018-01-25 16:31 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Dario Faggioli, David Woodhouse, Xen-devel

>>> On 25.01.18 at 17:09, <andrew.cooper3@citrix.com> wrote:
> On 25/01/18 15:57, Jan Beulich wrote:
>>>>> On 24.01.18 at 14:12, <andrew.cooper3@citrix.com> wrote:
>>> @@ -1743,6 +1744,34 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
>>>          }
>>>  
>>>          ctxt_switch_levelling(next);
>>> +
>>> +        if ( opt_ibpb && !is_idle_domain(nextd) )
>> Is the idle domain check here really useful?
> 
> Yes, because as you pointed out in v9, the outer condition isn't
> sufficient to exclude nextd being idle.

True, but then again - what's wrong with an idle vCPU making it
into the block? It'll be a pointless barrier that you issue, but no
other harm afaics. Remember that I complained about the missing
check only because of the chosen variable naming, but you've
renamed the variables in question, so I don't see why you've also
added the extra condition.

Jan



* Re: [PATCH v10 09/11] x86/ctxt: Issue a speculation barrier between vcpu contexts
  2018-01-25 16:31       ` Jan Beulich
@ 2018-01-25 16:48         ` Andrew Cooper
  0 siblings, 0 replies; 57+ messages in thread
From: Andrew Cooper @ 2018-01-25 16:48 UTC (permalink / raw)
  To: Jan Beulich; +Cc: David Woodhouse, Xen-devel

On 25/01/18 16:31, Jan Beulich wrote:
>>>> On 25.01.18 at 17:09, <andrew.cooper3@citrix.com> wrote:
>> On 25/01/18 15:57, Jan Beulich wrote:
>>>>>> On 24.01.18 at 14:12, <andrew.cooper3@citrix.com> wrote:
>>>> @@ -1743,6 +1744,34 @@ void context_switch(struct vcpu *prev, struct vcpu *next)
>>>>          }
>>>>  
>>>>          ctxt_switch_levelling(next);
>>>> +
>>>> +        if ( opt_ibpb && !is_idle_domain(nextd) )
>>> Is the idle domain check here really useful?
>> Yes, because as you pointed out in v9, the outer condition isn't
>> sufficient to exclude nextd being idle.
> True, but then again - what's wrong with an idle vCPU making it
> into the block? It'll be a pointless barrier that you issue, but no
> other harm afaics. Remember that I complained about the missing
> check only because of the chosen variable naming, but you've
> renamed the variables in question, so I don't see why you've also
> added the extra condition.

The barrier would be pointless, yes, but at 8k cycles, the cost is massive.

~Andrew


* [PATCH v11 5/11] x86/entry: Organise the use of MSR_SPEC_CTRL at each entry/exit point
  2018-01-24 13:12 ` [PATCH v10 05/11] x86/entry: Organise the use of MSR_SPEC_CTRL at each entry/exit point Andrew Cooper
  2018-01-25 13:08   ` Jan Beulich
@ 2018-01-25 16:52   ` Andrew Cooper
  1 sibling, 0 replies; 57+ messages in thread
From: Andrew Cooper @ 2018-01-25 16:52 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich

We need to be able to either set or clear IBRS in Xen context, as well as
restore appropriate guest values in guest context.  See the documentation in
asm-x86/spec_ctrl_asm.h for details.

With the contemporary microcode, writes to %cr3 are slower when SPEC_CTRL.IBRS
is set.  Therefore, the positioning of SPEC_CTRL_{ENTRY/EXIT}* is important.

Ideally, the IBRS_SET/IBRS_CLEAR hunks might be positioned either side of the
%cr3 change, but that is rather more complicated to arrange, and could still
result in a guest controlled value in SPEC_CTRL during the %cr3 change,
negating the saving if the guest chose to have IBRS set.

Therefore, we optimise for the pre-Skylake case (being far more common in the
field than Skylake and later, at the moment), where we have a Xen-preferred
value of IBRS clear when switching %cr3.

There is a semi-unrelated bugfix, where various asm_defn.h macros have a
hidden dependency on PAGE_SIZE, which results in an assembler error if used in
a .macro definition.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
v10:
 * Fix the restore of SPEC_CTRL on the exit-to-HVM paths.
 * Adjust comments and commit message
 * Pull the GET_STACK_END() out of DO_SPEC_CTRL_ENTRY and use %r14.
v11:
 * Optimise ASM a bit more.  Remove a branch in DO_SPEC_CTRL_ENTRY for the
   maybexen case, and use %dl rather than $0
---
 xen/arch/x86/hvm/svm/entry.S        |  11 +-
 xen/arch/x86/hvm/vmx/entry.S        |  19 +++
 xen/arch/x86/setup.c                |   1 +
 xen/arch/x86/smpboot.c              |   2 +
 xen/arch/x86/x86_64/asm-offsets.c   |   6 +
 xen/arch/x86/x86_64/compat/entry.S  |  14 +++
 xen/arch/x86/x86_64/entry.S         |  48 +++++++-
 xen/include/asm-x86/asm_defns.h     |   3 +
 xen/include/asm-x86/current.h       |   6 +
 xen/include/asm-x86/nops.h          |   6 +
 xen/include/asm-x86/spec_ctrl.h     |   9 ++
 xen/include/asm-x86/spec_ctrl_asm.h | 225 ++++++++++++++++++++++++++++++++++++
 12 files changed, 344 insertions(+), 6 deletions(-)
 create mode 100644 xen/include/asm-x86/spec_ctrl_asm.h

diff --git a/xen/arch/x86/hvm/svm/entry.S b/xen/arch/x86/hvm/svm/entry.S
index df86da0..bf092fe 100644
--- a/xen/arch/x86/hvm/svm/entry.S
+++ b/xen/arch/x86/hvm/svm/entry.S
@@ -79,6 +79,12 @@ UNLIKELY_END(svm_trace)
         or   $X86_EFLAGS_MBS,%rax
         mov  %rax,VMCB_rflags(%rcx)
 
+        mov VCPU_arch_msr(%rbx), %rax
+        mov VCPUMSR_spec_ctrl_raw(%rax), %eax
+
+        /* WARNING! `ret`, `call *`, `jmp *` not safe beyond this point. */
+        SPEC_CTRL_EXIT_TO_GUEST /* Req: a=spec_ctrl %rsp=regs/cpuinfo, Clob: cd */
+
         pop  %r15
         pop  %r14
         pop  %r13
@@ -101,8 +107,11 @@ UNLIKELY_END(svm_trace)
         SAVE_ALL
 
         GET_CURRENT(bx)
-        mov  VCPU_svm_vmcb(%rbx),%rcx
 
+        SPEC_CTRL_ENTRY_FROM_VMEXIT /* Req: b=curr %rsp=regs/cpuinfo, Clob: acd */
+        /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */
+
+        mov  VCPU_svm_vmcb(%rbx),%rcx
         movb $0,VCPU_svm_vmcb_in_sync(%rbx)
         mov  VMCB_rax(%rcx),%rax
         mov  %rax,UREGS_rax(%rsp)
diff --git a/xen/arch/x86/hvm/vmx/entry.S b/xen/arch/x86/hvm/vmx/entry.S
index b2f98be..e750544 100644
--- a/xen/arch/x86/hvm/vmx/entry.S
+++ b/xen/arch/x86/hvm/vmx/entry.S
@@ -38,6 +38,9 @@ ENTRY(vmx_asm_vmexit_handler)
         movb $1,VCPU_vmx_launched(%rbx)
         mov  %rax,VCPU_hvm_guest_cr2(%rbx)
 
+        SPEC_CTRL_ENTRY_FROM_VMEXIT /* Req: b=curr %rsp=regs/cpuinfo, Clob: acd */
+        /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */
+
         mov  %rsp,%rdi
         call vmx_vmexit_handler
 
@@ -68,6 +71,13 @@ UNLIKELY_END(realmode)
         call vmx_vmenter_helper
         test %al, %al
         jz .Lvmx_vmentry_restart
+
+        mov VCPU_arch_msr(%rbx), %rax
+        mov VCPUMSR_spec_ctrl_raw(%rax), %eax
+
+        /* WARNING! `ret`, `call *`, `jmp *` not safe beyond this point. */
+        SPEC_CTRL_EXIT_TO_GUEST /* Req: a=spec_ctrl %rsp=regs/cpuinfo, Clob: cd */
+
         mov  VCPU_hvm_guest_cr2(%rbx),%rax
 
         pop  %r15
@@ -99,6 +109,15 @@ UNLIKELY_END(realmode)
 .Lvmx_vmentry_fail:
         sti
         SAVE_ALL
+
+        /*
+         * PV variant needed here as no guest code has executed (so
+         * MSR_SPEC_CTRL can't have changed value), and NMIs/MCEs are liable
+         * to hit (in which case the HVM variant might corrupt things).
+         */
+        SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo Clob: acd */
+        /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */
+
         call vmx_vmentry_failure
         BUG  /* vmx_vmentry_failure() shouldn't return. */
 
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 9407247..ac530ec 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -678,6 +678,7 @@ void __init noreturn __start_xen(unsigned long mbi_p)
     set_processor_id(0);
     set_current(INVALID_VCPU); /* debug sanity. */
     idle_vcpu[0] = current;
+    init_shadow_spec_ctrl_state();
 
     percpu_init_areas();
 
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index fe637da..2ebef03 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -41,6 +41,7 @@
 #include <asm/guest.h>
 #include <asm/msr.h>
 #include <asm/mtrr.h>
+#include <asm/spec_ctrl.h>
 #include <asm/time.h>
 #include <asm/tboot.h>
 #include <mach_apic.h>
@@ -309,6 +310,7 @@ void start_secondary(void *unused)
     set_current(idle_vcpu[cpu]);
     this_cpu(curr_vcpu) = idle_vcpu[cpu];
     rdmsrl(MSR_EFER, this_cpu(efer));
+    init_shadow_spec_ctrl_state();
 
     /*
      * Just as during early bootstrap, it is convenient here to disable
diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
index b1a4310..17f1d77 100644
--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -88,6 +88,7 @@ void __dummy__(void)
     OFFSET(VCPU_kernel_ss, struct vcpu, arch.pv_vcpu.kernel_ss);
     OFFSET(VCPU_iopl, struct vcpu, arch.pv_vcpu.iopl);
     OFFSET(VCPU_guest_context_flags, struct vcpu, arch.vgc_flags);
+    OFFSET(VCPU_arch_msr, struct vcpu, arch.msr);
     OFFSET(VCPU_nmi_pending, struct vcpu, nmi_pending);
     OFFSET(VCPU_mce_pending, struct vcpu, mce_pending);
     OFFSET(VCPU_nmi_old_mask, struct vcpu, nmi_state.old_mask);
@@ -139,6 +140,8 @@ void __dummy__(void)
     OFFSET(CPUINFO_cr4, struct cpu_info, cr4);
     OFFSET(CPUINFO_xen_cr3, struct cpu_info, xen_cr3);
     OFFSET(CPUINFO_pv_cr3, struct cpu_info, pv_cr3);
+    OFFSET(CPUINFO_shadow_spec_ctrl, struct cpu_info, shadow_spec_ctrl);
+    OFFSET(CPUINFO_use_shadow_spec_ctrl, struct cpu_info, use_shadow_spec_ctrl);
     DEFINE(CPUINFO_sizeof, sizeof(struct cpu_info));
     BLANK();
 
@@ -154,6 +157,9 @@ void __dummy__(void)
     OFFSET(TRAPBOUNCE_eip, struct trap_bounce, eip);
     BLANK();
 
+    OFFSET(VCPUMSR_spec_ctrl_raw, struct msr_vcpu_policy, spec_ctrl.raw);
+    BLANK();
+
 #ifdef CONFIG_PERF_COUNTERS
     DEFINE(ASM_PERFC_exceptions, PERFC_exceptions);
     BLANK();
diff --git a/xen/arch/x86/x86_64/compat/entry.S b/xen/arch/x86/x86_64/compat/entry.S
index e668f00..4190c73 100644
--- a/xen/arch/x86/x86_64/compat/entry.S
+++ b/xen/arch/x86/x86_64/compat/entry.S
@@ -18,6 +18,10 @@ ENTRY(entry_int82)
         pushq $0
         movl  $HYPERCALL_VECTOR, 4(%rsp)
         SAVE_ALL compat=1 /* DPL1 gate, restricted to 32bit PV guests only. */
+
+        SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo, Clob: acd */
+        /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */
+
         CR4_PV32_RESTORE
 
         GET_CURRENT(bx)
@@ -142,6 +146,13 @@ ENTRY(compat_restore_all_guest)
         .popsection
         or    $X86_EFLAGS_IF,%r11
         mov   %r11d,UREGS_eflags(%rsp)
+
+        mov VCPU_arch_msr(%rbx), %rax
+        mov VCPUMSR_spec_ctrl_raw(%rax), %eax
+
+        /* WARNING! `ret`, `call *`, `jmp *` not safe beyond this point. */
+        SPEC_CTRL_EXIT_TO_GUEST /* Req: a=spec_ctrl %rsp=regs/cpuinfo, Clob: cd */
+
         RESTORE_ALL adj=8 compat=1
 .Lft0:  iretq
         _ASM_PRE_EXTABLE(.Lft0, handle_exception)
@@ -200,6 +211,9 @@ ENTRY(cstar_enter)
         movl  $TRAP_syscall, 4(%rsp)
         SAVE_ALL
 
+        SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo, Clob: acd */
+        /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */
+
         GET_STACK_END(bx)
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
         neg   %rcx
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 710c061..73bd7ca 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -38,6 +38,10 @@ ENTRY(switch_to_kernel)
 restore_all_guest:
         ASSERT_INTERRUPTS_DISABLED
 
+        /* Stash guest SPEC_CTRL value while we can read struct vcpu. */
+        mov VCPU_arch_msr(%rbx), %rdx
+        mov VCPUMSR_spec_ctrl_raw(%rdx), %r15d
+
         /* Copy guest mappings and switch to per-CPU root page table. */
         mov   %cr3, %r9
         GET_STACK_END(dx)
@@ -65,6 +69,12 @@ restore_all_guest:
         write_cr3 rax, rdi, rsi
 .Lrag_keep_cr3:
 
+        /* Restore stashed SPEC_CTRL value. */
+        mov   %r15d, %eax
+
+        /* WARNING! `ret`, `call *`, `jmp *` not safe beyond this point. */
+        SPEC_CTRL_EXIT_TO_GUEST /* Req: a=spec_ctrl %rsp=regs/cpuinfo, Clob: cd */
+
         RESTORE_ALL
         testw $TRAP_syscall,4(%rsp)
         jz    iret_exit_to_guest
@@ -103,9 +113,9 @@ restore_all_xen:
          * Check whether we need to switch to the per-CPU page tables, in
          * case we return to late PV exit code (from an NMI or #MC).
          */
-        GET_STACK_END(ax)
-        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rax), %rdx
-        mov   STACK_CPUINFO_FIELD(pv_cr3)(%rax), %rax
+        GET_STACK_END(bx)
+        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rdx
+        mov   STACK_CPUINFO_FIELD(pv_cr3)(%rbx), %rax
         test  %rdx, %rdx
         /*
          * Ideally the condition would be "nsz", but such doesn't exist,
@@ -115,6 +125,9 @@ UNLIKELY_START(g, exit_cr3)
         write_cr3 rax, rdi, rsi
 UNLIKELY_END(exit_cr3)
 
+        /* WARNING! `ret`, `call *`, `jmp *` not safe beyond this point. */
+        SPEC_CTRL_EXIT_TO_XEN /* Req: %rbx=end, Clob: acd */
+
         RESTORE_ALL adj=8
         iretq
 
@@ -145,6 +158,9 @@ ENTRY(lstar_enter)
         movl  $TRAP_syscall, 4(%rsp)
         SAVE_ALL
 
+        SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo, Clob: acd */
+        /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */
+
         GET_STACK_END(bx)
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
         neg   %rcx
@@ -248,6 +264,9 @@ GLOBAL(sysenter_eflags_saved)
         movl  $TRAP_syscall, 4(%rsp)
         SAVE_ALL
 
+        SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo, Clob: acd */
+        /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */
+
         GET_STACK_END(bx)
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
         neg   %rcx
@@ -294,6 +313,9 @@ ENTRY(int80_direct_trap)
         movl  $0x80, 4(%rsp)
         SAVE_ALL
 
+        SPEC_CTRL_ENTRY_FROM_PV /* Req: %rsp=regs/cpuinfo, Clob: acd */
+        /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */
+
         GET_STACK_END(bx)
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rcx
         neg   %rcx
@@ -469,6 +491,10 @@ ENTRY(common_interrupt)
         SAVE_ALL CLAC
 
         GET_STACK_END(14)
+
+        SPEC_CTRL_ENTRY_FROM_INTR /* Req: %rsp=regs, %r14=end, Clob: acd */
+        /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */
+
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rcx
         mov   %rcx, %r15
         neg   %rcx
@@ -507,6 +533,10 @@ GLOBAL(handle_exception)
         SAVE_ALL CLAC
 
         GET_STACK_END(14)
+
+        SPEC_CTRL_ENTRY_FROM_INTR /* Req: %rsp=regs, %r14=end, Clob: acd */
+        /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */
+
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rcx
         mov   %rcx, %r15
         neg   %rcx
@@ -700,8 +730,12 @@ ENTRY(double_fault)
         /* Set AC to reduce chance of further SMAP faults */
         SAVE_ALL STAC
 
-        GET_STACK_END(bx)
-        mov   STACK_CPUINFO_FIELD(xen_cr3)(%rbx), %rbx
+        GET_STACK_END(14)
+
+        SPEC_CTRL_ENTRY_FROM_INTR /* Req: %rsp=regs, %r14=end, Clob: acd */
+        /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */
+
+        mov   STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rbx
         test  %rbx, %rbx
         jz    .Ldblf_cr3_okay
         jns   .Ldblf_cr3_load
@@ -730,6 +764,10 @@ handle_ist_exception:
         SAVE_ALL CLAC
 
         GET_STACK_END(14)
+
+        SPEC_CTRL_ENTRY_FROM_INTR /* Req: %rsp=regs, %r14=end, Clob: acd */
+        /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */
+
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rcx
         mov   %rcx, %r15
         neg   %rcx
diff --git a/xen/include/asm-x86/asm_defns.h b/xen/include/asm-x86/asm_defns.h
index 6a2b833..88b775b 100644
--- a/xen/include/asm-x86/asm_defns.h
+++ b/xen/include/asm-x86/asm_defns.h
@@ -7,6 +7,7 @@
 #include <asm/asm-offsets.h>
 #endif
 #include <asm/bug.h>
+#include <asm/page.h>
 #include <asm/processor.h>
 #include <asm/percpu.h>
 #include <xen/stringify.h>
@@ -386,4 +387,6 @@ static always_inline void stac(void)
 4:  .p2align 2                            ; \
     .popsection
 
+#include <asm/spec_ctrl_asm.h>
+
 #endif /* __X86_ASM_DEFNS_H__ */
diff --git a/xen/include/asm-x86/current.h b/xen/include/asm-x86/current.h
index b929c48..1009d05 100644
--- a/xen/include/asm-x86/current.h
+++ b/xen/include/asm-x86/current.h
@@ -53,6 +53,12 @@ struct cpu_info {
      */
     unsigned long xen_cr3;
     unsigned long pv_cr3;
+
+    /* See asm-x86/spec_ctrl_asm.h for usage. */
+    unsigned int shadow_spec_ctrl;
+    bool         use_shadow_spec_ctrl;
+
+    unsigned long __pad;
     /* get_stack_bottom() must be 16-byte aligned */
 };
 
diff --git a/xen/include/asm-x86/nops.h b/xen/include/asm-x86/nops.h
index 1a46b97..80126c1 100644
--- a/xen/include/asm-x86/nops.h
+++ b/xen/include/asm-x86/nops.h
@@ -65,6 +65,12 @@
 #define ASM_NOP8 _ASM_MK_NOP(P6_NOP8)
 #define ASM_NOP9 _ASM_MK_NOP(P6_NOP9)
 
+#define ASM_NOP17 ASM_NOP8; ASM_NOP7; ASM_NOP2
+#define ASM_NOP21 ASM_NOP8; ASM_NOP8; ASM_NOP5
+#define ASM_NOP24 ASM_NOP8; ASM_NOP8; ASM_NOP8
+#define ASM_NOP29 ASM_NOP8; ASM_NOP8; ASM_NOP8; ASM_NOP5
+#define ASM_NOP32 ASM_NOP8; ASM_NOP8; ASM_NOP8; ASM_NOP8
+
 #define ASM_NOP_MAX 9
 
 #endif /* __X86_ASM_NOPS_H__ */
diff --git a/xen/include/asm-x86/spec_ctrl.h b/xen/include/asm-x86/spec_ctrl.h
index e088a55..b451250 100644
--- a/xen/include/asm-x86/spec_ctrl.h
+++ b/xen/include/asm-x86/spec_ctrl.h
@@ -20,8 +20,17 @@
 #ifndef __X86_SPEC_CTRL_H__
 #define __X86_SPEC_CTRL_H__
 
+#include <asm/current.h>
+
 void init_speculation_mitigations(void);
 
+static inline void init_shadow_spec_ctrl_state(void)
+{
+    struct cpu_info *info = get_cpu_info();
+
+    info->shadow_spec_ctrl = info->use_shadow_spec_ctrl = 0;
+}
+
 #endif /* !__X86_SPEC_CTRL_H__ */
 
 /*
diff --git a/xen/include/asm-x86/spec_ctrl_asm.h b/xen/include/asm-x86/spec_ctrl_asm.h
new file mode 100644
index 0000000..ba55574
--- /dev/null
+++ b/xen/include/asm-x86/spec_ctrl_asm.h
@@ -0,0 +1,225 @@
+/******************************************************************************
+ * include/asm-x86/spec_ctrl_asm.h
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; If not, see <http://www.gnu.org/licenses/>.
+ *
+ * Copyright (c) 2017-2018 Citrix Systems Ltd.
+ */
+
+#ifndef __X86_SPEC_CTRL_ASM_H__
+#define __X86_SPEC_CTRL_ASM_H__
+
+#ifdef __ASSEMBLY__
+#include <asm/msr-index.h>
+
+/*
+ * Saving and restoring MSR_SPEC_CTRL state is a little tricky.
+ *
+ * We want the guest's choice of SPEC_CTRL while in guest context, and Xen's
+ * choice (set or clear, depending on the hardware) while running in Xen
+ * context.  Therefore, a simplistic algorithm is:
+ *
+ *  - Set/clear IBRS on entry to Xen
+ *  - Set the guests' choice on exit to guest
+ *  - Leave SPEC_CTRL unchanged on exit to xen
+ *
+ * There are two complicating factors:
+ *  1) HVM guests can have direct access to the MSR, so it can change
+ *     behind Xen's back.
+ *  2) An NMI or MCE can interrupt at any point, including early in the entry
+ *     path, or late in the exit path after restoring the guest value.  This
+ *     will corrupt the guest value.
+ *
+ * Factor 1 is dealt with by relying on NMIs/MCEs being blocked immediately
+ * after VMEXIT.  The VMEXIT-specific code reads MSR_SPEC_CTRL and updates
+ * current before loading Xen's MSR_SPEC_CTRL setting.
+ *
+ * Factor 2 is harder.  We maintain a shadow_spec_ctrl value, and
+ * use_shadow_spec_ctrl boolean per cpu.  The synchronous use is:
+ *
+ *  1) Store guest value in shadow_spec_ctrl
+ *  2) Set use_shadow_spec_ctrl boolean
+ *  3) Load guest value into MSR_SPEC_CTRL
+ *  4) Exit to guest
+ *  5) Entry from guest
+ *  6) Clear use_shadow_spec_ctrl boolean
+ *  7) Load Xen's value into MSR_SPEC_CTRL
+ *
+ * The asynchronous use for interrupts/exceptions is:
+ *  -  Set/clear IBRS on entry to Xen
+ *  -  On exit to Xen, check use_shadow_spec_ctrl
+ *  -  If set, load shadow_spec_ctrl
+ *
+ * Therefore, an interrupt/exception which hits the synchronous path between
+ * steps 2 and 6 will restore the shadow value rather than leaving Xen's value
+ * loaded and corrupting the value used in guest context.
+ *
+ * The following ASM fragments implement this algorithm.  See their local
+ * comments for further details.
+ *  - SPEC_CTRL_ENTRY_FROM_VMEXIT
+ *  - SPEC_CTRL_ENTRY_FROM_PV
+ *  - SPEC_CTRL_ENTRY_FROM_INTR
+ *  - SPEC_CTRL_EXIT_TO_XEN
+ *  - SPEC_CTRL_EXIT_TO_GUEST
+ */
+
+.macro DO_SPEC_CTRL_ENTRY_FROM_VMEXIT ibrs_val:req
+/*
+ * Requires %rbx=current, %rsp=regs/cpuinfo
+ * Clobbers %rax, %rcx, %rdx
+ *
+ * The common case is that a guest has direct access to MSR_SPEC_CTRL, at
+ * which point we need to save the guest value before setting IBRS for Xen.
+ * Unilaterally saving the guest value is shorter and faster than checking.
+ */
+    mov $MSR_SPEC_CTRL, %ecx
+    rdmsr
+
+    /* Stash the value from hardware. */
+    mov VCPU_arch_msr(%rbx), %rdx
+    mov %eax, VCPUMSR_spec_ctrl_raw(%rdx)
+    xor %edx, %edx
+
+    /* Clear SPEC_CTRL shadowing *before* loading Xen's value. */
+    movb %dl, CPUINFO_use_shadow_spec_ctrl(%rsp)
+
+    /* Load Xen's intended value. */
+    mov $\ibrs_val, %eax
+    wrmsr
+.endm
+
+.macro DO_SPEC_CTRL_ENTRY maybexen:req ibrs_val:req
+/*
+ * Requires %rsp=regs (also cpuinfo if !maybexen)
+ * Requires %r14=stack_end (if maybexen)
+ * Clobbers %rax, %rcx, %rdx
+ *
+ * PV guests can't update MSR_SPEC_CTRL behind Xen's back, so no need to read
+ * it back.  Entries from guest context need to clear SPEC_CTRL shadowing,
+ * while entries from Xen must leave shadowing in its current state.
+ */
+    mov $MSR_SPEC_CTRL, %ecx
+    xor %edx, %edx
+
+    /*
+     * Clear SPEC_CTRL shadowing *before* loading Xen's value.  If entering
+     * from a possibly-xen context, %rsp doesn't necessarily alias the cpuinfo
+     * block so calculate the position directly.
+     */
+    .if \maybexen
+        /* Branchless `if ( !xen ) clear_shadowing` */
+        testb $3, UREGS_cs(%rsp)
+        setz %al
+        and %al, STACK_CPUINFO_FIELD(use_shadow_spec_ctrl)(%r14)
+    .else
+        movb %dl, CPUINFO_use_shadow_spec_ctrl(%rsp)
+    .endif
+
+    /* Load Xen's intended value. */
+    mov $\ibrs_val, %eax
+    wrmsr
+.endm
+
+.macro DO_SPEC_CTRL_EXIT_TO_XEN
+/*
+ * Requires %rbx=stack_end
+ * Clobbers %rax, %rcx, %rdx
+ *
+ * When returning to Xen context, look to see whether SPEC_CTRL shadowing is
+ * in effect, and reload the shadow value.  This covers race conditions which
+ * exist with an NMI/MCE/etc hitting late in the return-to-guest path.
+ */
+    xor %edx, %edx
+
+    cmpb %dl, STACK_CPUINFO_FIELD(use_shadow_spec_ctrl)(%rbx)
+    je .L\@_skip
+
+    mov STACK_CPUINFO_FIELD(shadow_spec_ctrl)(%rbx), %eax
+    mov $MSR_SPEC_CTRL, %ecx
+    wrmsr
+
+.L\@_skip:
+.endm
+
+.macro DO_SPEC_CTRL_EXIT_TO_GUEST
+/*
+ * Requires %eax=spec_ctrl, %rsp=regs/cpuinfo
+ * Clobbers %rcx, %rdx
+ *
+ * When returning to guest context, set up SPEC_CTRL shadowing and load the
+ * guest value.
+ */
+    /* Set up shadow value *before* enabling shadowing. */
+    mov %eax, CPUINFO_shadow_spec_ctrl(%rsp)
+
+    /* Set SPEC_CTRL shadowing *before* loading the guest value. */
+    movb $1, CPUINFO_use_shadow_spec_ctrl(%rsp)
+
+    mov $MSR_SPEC_CTRL, %ecx
+    xor %edx, %edx
+    wrmsr
+.endm
+
+/* Use after a VMEXIT from an HVM guest. */
+#define SPEC_CTRL_ENTRY_FROM_VMEXIT                                     \
+    ALTERNATIVE_2 __stringify(ASM_NOP32),                               \
+        __stringify(DO_SPEC_CTRL_ENTRY_FROM_VMEXIT                      \
+                    ibrs_val=SPEC_CTRL_IBRS),                           \
+        X86_FEATURE_XEN_IBRS_SET,                                       \
+        __stringify(DO_SPEC_CTRL_ENTRY_FROM_VMEXIT                      \
+                    ibrs_val=0),                                        \
+        X86_FEATURE_XEN_IBRS_CLEAR
+
+/* Use after an entry from PV context (syscall/sysenter/int80/int82/etc). */
+#define SPEC_CTRL_ENTRY_FROM_PV                                         \
+    ALTERNATIVE_2 __stringify(ASM_NOP21),                               \
+        __stringify(DO_SPEC_CTRL_ENTRY maybexen=0                       \
+                    ibrs_val=SPEC_CTRL_IBRS),                           \
+        X86_FEATURE_XEN_IBRS_SET,                                       \
+        __stringify(DO_SPEC_CTRL_ENTRY maybexen=0 ibrs_val=0),          \
+        X86_FEATURE_XEN_IBRS_CLEAR
+
+/* Use in interrupt/exception context.  May interrupt Xen or PV context. */
+#define SPEC_CTRL_ENTRY_FROM_INTR                                       \
+    ALTERNATIVE_2 __stringify(ASM_NOP29),                               \
+        __stringify(DO_SPEC_CTRL_ENTRY maybexen=1                       \
+                    ibrs_val=SPEC_CTRL_IBRS),                           \
+        X86_FEATURE_XEN_IBRS_SET,                                       \
+        __stringify(DO_SPEC_CTRL_ENTRY maybexen=1 ibrs_val=0),          \
+        X86_FEATURE_XEN_IBRS_CLEAR
+
+/* Use when exiting to Xen context. */
+#define SPEC_CTRL_EXIT_TO_XEN                                           \
+    ALTERNATIVE_2 __stringify(ASM_NOP17),                               \
+        DO_SPEC_CTRL_EXIT_TO_XEN, X86_FEATURE_XEN_IBRS_SET,             \
+        DO_SPEC_CTRL_EXIT_TO_XEN, X86_FEATURE_XEN_IBRS_CLEAR
+
+/* Use when exiting to guest context. */
+#define SPEC_CTRL_EXIT_TO_GUEST                                         \
+    ALTERNATIVE_2 __stringify(ASM_NOP24),                               \
+        DO_SPEC_CTRL_EXIT_TO_GUEST, X86_FEATURE_XEN_IBRS_SET,           \
+        DO_SPEC_CTRL_EXIT_TO_GUEST, X86_FEATURE_XEN_IBRS_CLEAR
+
+#endif /* __ASSEMBLY__ */
+#endif /* !__X86_SPEC_CTRL_ASM_H__ */
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
2.1.4


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v11 6/11] x86/entry: Organise the clobbering of the RSB/RAS on entry to Xen
  2018-01-24 13:12 ` [PATCH v10 06/11] x86/entry: Organise the clobbering of the RSB/RAS on entry to Xen Andrew Cooper
  2018-01-25 13:19   ` Jan Beulich
@ 2018-01-25 16:54   ` Andrew Cooper
  2018-01-26 12:17     ` Jan Beulich
  1 sibling, 1 reply; 57+ messages in thread
From: Andrew Cooper @ 2018-01-25 16:54 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich

ret instructions are speculated directly to values recorded in the Return
Stack Buffer/Return Address Stack, as there is no uncertainty in well-formed
code.  Guests can take advantage of this in two ways:

  1) If they can find a path in Xen which executes more ret instructions than
     call instructions.  (At least one in the waitqueue infrastructure,
     probably others.)

  2) Use the fact that the RSB/RAS in hardware is actually a circular stack
     without a concept of empty.  (When it logically empties, stale values
     will start being used.)

To mitigate, overwrite the RSB on entry to Xen with gadgets which will capture
and contain rogue speculation.
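
As a rough illustration of the principle (the DO_OVERWRITE_RSB macro added
below is the real implementation; the labels here are purely illustrative),
each RSB entry is created by a call whose return address points at a gadget
which traps any speculation that later consumes it:

        call  1f        /* Records the address of 2: in the RSB (and on the */
                        /* stack), then continues architecturally at 1:.    */
2:      pause
        lfence
        jmp   2b        /* Any rogue `ret` speculation is captured here.    */
1:      add   $8, %rsp  /* Discard the return address pushed by the call.   */

The real macro repeats this 32 times (16 loop iterations, unrolled twice) and
restores %rsp from a saved copy rather than adjusting it after each call.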

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>

v10:
 * Spelling/comment improvements.
 * Split to fit around IST safety.  Other half of the patch moved into
   "x86/boot: Calculate the most appropriate BTI mitigation to use"
 * Avoid using numeric labels in DO_OVERWRITE_RSB
v11:
 * Use pause;lfence;jmp to capture speculation, for AMD processors.
---
 xen/include/asm-x86/cpufeatures.h   |  2 ++
 xen/include/asm-x86/nops.h          |  1 +
 xen/include/asm-x86/spec_ctrl_asm.h | 44 +++++++++++++++++++++++++++++++++++++
 3 files changed, 47 insertions(+)

diff --git a/xen/include/asm-x86/cpufeatures.h b/xen/include/asm-x86/cpufeatures.h
index dd2388f..0ee4a1f 100644
--- a/xen/include/asm-x86/cpufeatures.h
+++ b/xen/include/asm-x86/cpufeatures.h
@@ -28,3 +28,5 @@ XEN_CPUFEATURE(IND_THUNK_JMP,   (FSCAPINTS+0)*32+14) /* Use IND_THUNK_JMP */
 XEN_CPUFEATURE(XEN_IBPB,        (FSCAPINTS+0)*32+15) /* IBRSB || IBPB */
 XEN_CPUFEATURE(XEN_IBRS_SET,    (FSCAPINTS+0)*32+16) /* IBRSB && IRBS set in Xen */
 XEN_CPUFEATURE(XEN_IBRS_CLEAR,  (FSCAPINTS+0)*32+17) /* IBRSB && IBRS clear in Xen */
+XEN_CPUFEATURE(RSB_NATIVE,      (FSCAPINTS+0)*32+18) /* RSB overwrite needed for native */
+XEN_CPUFEATURE(RSB_VMEXIT,      (FSCAPINTS+0)*32+20) /* RSB overwrite needed for vmexit */
diff --git a/xen/include/asm-x86/nops.h b/xen/include/asm-x86/nops.h
index 80126c1..61319cc 100644
--- a/xen/include/asm-x86/nops.h
+++ b/xen/include/asm-x86/nops.h
@@ -70,6 +70,7 @@
 #define ASM_NOP24 ASM_NOP8; ASM_NOP8; ASM_NOP8
 #define ASM_NOP29 ASM_NOP8; ASM_NOP8; ASM_NOP8; ASM_NOP5
 #define ASM_NOP32 ASM_NOP8; ASM_NOP8; ASM_NOP8; ASM_NOP8
+#define ASM_NOP40 ASM_NOP8; ASM_NOP8; ASM_NOP8; ASM_NOP8; ASM_NOP8
 
 #define ASM_NOP_MAX 9
 
diff --git a/xen/include/asm-x86/spec_ctrl_asm.h b/xen/include/asm-x86/spec_ctrl_asm.h
index ba55574..e27ea2b 100644
--- a/xen/include/asm-x86/spec_ctrl_asm.h
+++ b/xen/include/asm-x86/spec_ctrl_asm.h
@@ -74,6 +74,44 @@
  *  - SPEC_CTRL_EXIT_TO_GUEST
  */
 
+.macro DO_OVERWRITE_RSB
+/*
+ * Requires nothing
+ * Clobbers %rax, %rcx
+ *
+ * Requires 256 bytes of stack space, but %rsp has no net change. Based on
+ * Google's performance numbers, the loop is unrolled to 16 iterations and two
+ * calls per iteration.
+ *
+ * The call filling the RSB needs a nonzero displacement.  A nop would do, but
+ * we use "1: pause; lfence; jmp 1b" to safely contains any ret-based
+ * speculation, even if the loop is speculatively executed prematurely.
+ *
+ * %rsp is preserved by using an extra GPR because a) we've got plenty spare,
+ * b) the two movs are shorter to encode than `add $32*8, %rsp`, and c) can be
+ * optimised with mov-elimination in modern cores.
+ */
+    mov $16, %ecx                   /* 16 iterations, two calls per loop */
+    mov %rsp, %rax                  /* Store the current %rsp */
+
+.L\@_fill_rsb_loop:
+
+    .irp n, 1, 2                    /* Unrolled twice. */
+    call .L\@_insert_rsb_entry_\n   /* Create an RSB entry. */
+
+.L\@_capture_speculation_\n:
+    pause
+    lfence
+    jmp .L\@_capture_speculation_\n /* Capture rogue speculation. */
+
+.L\@_insert_rsb_entry_\n:
+    .endr
+
+    sub $1, %ecx
+    jnz .L\@_fill_rsb_loop
+    mov %rax, %rsp                  /* Restore old %rsp */
+.endm
+
 .macro DO_SPEC_CTRL_ENTRY_FROM_VMEXIT ibrs_val:req
 /*
  * Requires %rbx=current, %rsp=regs/cpuinfo
@@ -173,6 +211,8 @@
 
 /* Use after a VMEXIT from an HVM guest. */
 #define SPEC_CTRL_ENTRY_FROM_VMEXIT                                     \
+    ALTERNATIVE __stringify(ASM_NOP40),                                 \
+        DO_OVERWRITE_RSB, X86_FEATURE_RSB_VMEXIT;                       \
     ALTERNATIVE_2 __stringify(ASM_NOP32),                               \
         __stringify(DO_SPEC_CTRL_ENTRY_FROM_VMEXIT                      \
                     ibrs_val=SPEC_CTRL_IBRS),                           \
@@ -183,6 +223,8 @@
 
 /* Use after an entry from PV context (syscall/sysenter/int80/int82/etc). */
 #define SPEC_CTRL_ENTRY_FROM_PV                                         \
+    ALTERNATIVE __stringify(ASM_NOP40),                                 \
+        DO_OVERWRITE_RSB, X86_FEATURE_RSB_NATIVE;                       \
     ALTERNATIVE_2 __stringify(ASM_NOP21),                               \
         __stringify(DO_SPEC_CTRL_ENTRY maybexen=0                       \
                     ibrs_val=SPEC_CTRL_IBRS),                           \
@@ -192,6 +234,8 @@
 
 /* Use in interrupt/exception context.  May interrupt Xen or PV context. */
 #define SPEC_CTRL_ENTRY_FROM_INTR                                       \
+    ALTERNATIVE __stringify(ASM_NOP40),                                 \
+        DO_OVERWRITE_RSB, X86_FEATURE_RSB_NATIVE;                       \
     ALTERNATIVE_2 __stringify(ASM_NOP29),                               \
         __stringify(DO_SPEC_CTRL_ENTRY maybexen=1                       \
                     ibrs_val=SPEC_CTRL_IBRS),                           \
-- 
2.1.4


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [PATCH v11 7/11] x86/entry: Avoid using alternatives in NMI/#MC paths
  2018-01-24 13:12 ` [PATCH v10 07/11] x86/entry: Avoid using alternatives in NMI/#MC paths Andrew Cooper
  2018-01-25 13:43   ` Jan Beulich
@ 2018-01-25 17:21   ` Andrew Cooper
  2018-01-26 12:23     ` Jan Beulich
  1 sibling, 1 reply; 57+ messages in thread
From: Andrew Cooper @ 2018-01-25 17:21 UTC (permalink / raw)
  To: Xen-devel; +Cc: Andrew Cooper, Jan Beulich

This patch is deliberately arranged to be easy to revert if/when alternatives
patching becomes NMI/#MC safe.

For safety, there must be a dispatch serialising instruction in (what is
logically) DO_SPEC_CTRL_ENTRY so that, in the case that Xen needs IBRS set in
context, an attacker can't speculate around the WRMSR and reach an indirect
branch within the speculation window.

Using conditionals opens this attack vector up, so the else clause gets an
LFENCE to force the pipeline to catch up before continuing.  This also covers
the safety of the RSB conditional, as execution is guaranteed to hit either the
WRMSR or the LFENCE.

One downside of not using alternatives is that there is an unconditional LFENCE
in the IST path in cases where we are not using the features from IBRS-capable
microcode.
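
In rough outline (illustrative only; the actual macro below uses \@-local
labels and places the LFENCE in a cold subsection via
UNLIKELY_DISPATCH_LABEL()), the shape being aimed for is:

        testb $BTI_IST_WRMSR, %al
        jz    .Lno_wrmsr        /* This branch can be speculated either way. */

        mov   $MSR_SPEC_CTRL, %ecx
        and   $BTI_IST_IBRS, %eax
        xor   %edx, %edx
        wrmsr                   /* Taken path: WRMSR is dispatch serialising. */
        jmp   .Ldone

.Lno_wrmsr:
        lfence                  /* Not-taken path must serialise as well, so  */
                                /* no indirect branch is reachable first.     */
.Ldone:

Either way, a dispatch serialising instruction is executed before any `jmp *`
or `call *` can be reached.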

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
v10:
 * New
v11:
 * Clear bti_ist_info in init_shadow_spec_ctrl_state() for bisectability
 * Introduce a BUILD_BUG_ON() to check the BTI_IST_IBRS == SPEC_CTRL_IBRS alias
 * Use UNLIKELY_DISPATCH_LABEL() in the opencoded UNLIKELY_START()
---
 xen/arch/x86/spec_ctrl.c            |  8 +++++
 xen/arch/x86/x86_64/asm-offsets.c   |  1 +
 xen/arch/x86/x86_64/entry.S         |  6 ++--
 xen/include/asm-x86/current.h       |  1 +
 xen/include/asm-x86/spec_ctrl.h     |  1 +
 xen/include/asm-x86/spec_ctrl_asm.h | 68 +++++++++++++++++++++++++++++++++++++
 6 files changed, 82 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c
index 89e7287..cc1c972 100644
--- a/xen/arch/x86/spec_ctrl.c
+++ b/xen/arch/x86/spec_ctrl.c
@@ -20,8 +20,10 @@
 #include <xen/init.h>
 #include <xen/lib.h>
 
+#include <asm/msr-index.h>
 #include <asm/processor.h>
 #include <asm/spec_ctrl.h>
+#include <asm/spec_ctrl_asm.h>
 
 static enum ind_thunk {
     THUNK_DEFAULT, /* Decide which thunk to use at boot time. */
@@ -150,6 +152,12 @@ void __init init_speculation_mitigations(void)
     print_details(thunk);
 }
 
+static void __init __maybe_unused build_assertions(void)
+{
+    /* The optimised assembly relies on this alias. */
+    BUILD_BUG_ON(BTI_IST_IBRS != SPEC_CTRL_IBRS);
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
index 17f1d77..51be528 100644
--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -142,6 +142,7 @@ void __dummy__(void)
     OFFSET(CPUINFO_pv_cr3, struct cpu_info, pv_cr3);
     OFFSET(CPUINFO_shadow_spec_ctrl, struct cpu_info, shadow_spec_ctrl);
     OFFSET(CPUINFO_use_shadow_spec_ctrl, struct cpu_info, use_shadow_spec_ctrl);
+    OFFSET(CPUINFO_bti_ist_info, struct cpu_info, bti_ist_info);
     DEFINE(CPUINFO_sizeof, sizeof(struct cpu_info));
     BLANK();
 
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index 73bd7ca..a5a6702 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -126,7 +126,7 @@ UNLIKELY_START(g, exit_cr3)
 UNLIKELY_END(exit_cr3)
 
         /* WARNING! `ret`, `call *`, `jmp *` not safe beyond this point. */
-        SPEC_CTRL_EXIT_TO_XEN /* Req: %rbx=end, Clob: acd */
+        SPEC_CTRL_EXIT_TO_XEN_IST /* Req: %rbx=end, Clob: acd */
 
         RESTORE_ALL adj=8
         iretq
@@ -732,7 +732,7 @@ ENTRY(double_fault)
 
         GET_STACK_END(14)
 
-        SPEC_CTRL_ENTRY_FROM_INTR /* Req: %rsp=regs, %r14=end, Clob: acd */
+        SPEC_CTRL_ENTRY_FROM_INTR_IST /* Req: %rsp=regs, %r14=end, Clob: acd */
         /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */
 
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rbx
@@ -765,7 +765,7 @@ handle_ist_exception:
 
         GET_STACK_END(14)
 
-        SPEC_CTRL_ENTRY_FROM_INTR /* Req: %rsp=regs, %r14=end, Clob: acd */
+        SPEC_CTRL_ENTRY_FROM_INTR_IST /* Req: %rsp=regs, %r14=end, Clob: acd */
         /* WARNING! `ret`, `call *`, `jmp *` not safe before this point. */
 
         mov   STACK_CPUINFO_FIELD(xen_cr3)(%r14), %rcx
diff --git a/xen/include/asm-x86/current.h b/xen/include/asm-x86/current.h
index 1009d05..4678a0f 100644
--- a/xen/include/asm-x86/current.h
+++ b/xen/include/asm-x86/current.h
@@ -57,6 +57,7 @@ struct cpu_info {
     /* See asm-x86/spec_ctrl_asm.h for usage. */
     unsigned int shadow_spec_ctrl;
     bool         use_shadow_spec_ctrl;
+    uint8_t      bti_ist_info;
 
     unsigned long __pad;
     /* get_stack_bottom() must be 16-byte aligned */
diff --git a/xen/include/asm-x86/spec_ctrl.h b/xen/include/asm-x86/spec_ctrl.h
index b451250..c454b02 100644
--- a/xen/include/asm-x86/spec_ctrl.h
+++ b/xen/include/asm-x86/spec_ctrl.h
@@ -29,6 +29,7 @@ static inline void init_shadow_spec_ctrl_state(void)
     struct cpu_info *info = get_cpu_info();
 
     info->shadow_spec_ctrl = info->use_shadow_spec_ctrl = 0;
+    info->bti_ist_info = 0;
 }
 
 #endif /* !__X86_SPEC_CTRL_H__ */
diff --git a/xen/include/asm-x86/spec_ctrl_asm.h b/xen/include/asm-x86/spec_ctrl_asm.h
index e27ea2b..25f94f3 100644
--- a/xen/include/asm-x86/spec_ctrl_asm.h
+++ b/xen/include/asm-x86/spec_ctrl_asm.h
@@ -20,6 +20,11 @@
 #ifndef __X86_SPEC_CTRL_ASM_H__
 #define __X86_SPEC_CTRL_ASM_H__
 
+/* Encoding of the bottom bits in cpuinfo.bti_ist_info */
+#define BTI_IST_IBRS  (1 << 0)
+#define BTI_IST_WRMSR (1 << 1)
+#define BTI_IST_RSB   (1 << 2)
+
 #ifdef __ASSEMBLY__
 #include <asm/msr-index.h>
 
@@ -255,6 +260,69 @@
         DO_SPEC_CTRL_EXIT_TO_GUEST, X86_FEATURE_XEN_IBRS_SET,           \
         DO_SPEC_CTRL_EXIT_TO_GUEST, X86_FEATURE_XEN_IBRS_CLEAR
 
+/* TODO: Drop these when the alternatives infrastructure is NMI/#MC safe. */
+.macro SPEC_CTRL_ENTRY_FROM_INTR_IST
+/*
+ * Requires %rsp=regs, %r14=stack_end
+ * Clobbers %rax, %rcx, %rdx
+ *
+ * This is a logical merge of DO_OVERWRITE_RSB and DO_SPEC_CTRL_ENTRY
+ * maybexen=1, but with conditionals rather than alternatives.
+ */
+    movzbl STACK_CPUINFO_FIELD(bti_ist_info)(%r14), %eax
+
+    testb $BTI_IST_RSB, %al
+    jz .L\@_skip_rsb
+
+    DO_OVERWRITE_RSB
+
+.L\@_skip_rsb:
+
+    testb $BTI_IST_WRMSR, %al
+    jz .L\@_skip_wrmsr
+
+    testb $3, UREGS_cs(%rsp)
+    jz .L\@_entry_from_xen
+
+    movb $0, STACK_CPUINFO_FIELD(use_shadow_spec_ctrl)(%r14)
+
+.L\@_entry_from_xen:
+    /*
+     * Load Xen's intended value.  SPEC_CTRL_IBRS vs 0 is encoded in the
+     * bottom bit of bti_ist_info, via a deliberate alias with BTI_IST_IBRS.
+     */
+    mov $MSR_SPEC_CTRL, %ecx
+    and $BTI_IST_IBRS, %eax
+    xor %edx, %edx
+    wrmsr
+
+    /* Opencoded UNLIKELY_START() with no condition. */
+UNLIKELY_DISPATCH_LABEL(\@_serialise):
+    .subsection 1
+    /*
+     * In the case that we might need to set SPEC_CTRL.IBRS for safety, we
+     * need to ensure that an attacker can't poison the `jz .L\@_skip_wrmsr`
+     * to speculate around the WRMSR.  As a result, we need a dispatch
+     * serialising instruction in the else clause.
+     */
+.L\@_skip_wrmsr:
+    lfence
+    UNLIKELY_END(\@_serialise)
+.endm
+
+.macro SPEC_CTRL_EXIT_TO_XEN_IST
+/*
+ * Requires %rbx=stack_end
+ * Clobbers %rax, %rcx, %rdx
+ */
+    testb $BTI_IST_WRMSR, STACK_CPUINFO_FIELD(bti_ist_info)(%rbx)
+    jz .L\@_skip
+
+    DO_SPEC_CTRL_EXIT_TO_XEN
+
+.L\@_skip:
+.endm
+
 #endif /* __ASSEMBLY__ */
 #endif /* !__X86_SPEC_CTRL_ASM_H__ */
 
-- 
2.1.4


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [PATCH v10 09/11] x86/ctxt: Issue a speculation barrier between vcpu contexts
  2018-01-25 16:09     ` Andrew Cooper
  2018-01-25 16:15       ` Andrew Cooper
  2018-01-25 16:31       ` Jan Beulich
@ 2018-01-25 18:49       ` Dario Faggioli
  2018-01-26  1:08         ` Dario Faggioli
  2 siblings, 1 reply; 57+ messages in thread
From: Dario Faggioli @ 2018-01-25 18:49 UTC (permalink / raw)
  To: Andrew Cooper, Jan Beulich; +Cc: Dario Faggioli, David Woodhouse, Xen-devel


[-- Attachment #1.1: Type: text/plain, Size: 1594 bytes --]

On Thu, 2018-01-25 at 16:09 +0000, Andrew Cooper wrote:
> On 25/01/18 15:57, Jan Beulich wrote:
> > > > > 
> > For the record, the overwhelming majority of calls to
> > __sync_local_execstate() being responsible for the behavior
> > come from invalidate_interrupt(), which suggests to me that
> > there's a meaningful number of cases where a vCPU is migrated
> > to another CPU and then back, without another vCPU having
> > run on the original CPU in between. If I'm not wrong with this,
> > I have to question why the vCPU is migrated then in the first
> > place.
> 
> This is a known (mis)feature of the Credit scheduler.  When a PCPU
> goes
> idle, it aggressively steals work from the adjacent cpus, even if the
> adjacent cpu only has a single vcpu to run and is running it.
> 
> XenServer has this gross hack which makes aggregate networking
> measurements far far better
> 
> https://github.com/xenserver/xen.pg/blob/XS-7.1.x/master/sched-credit
> 1-use-per-pcpu-runqueue-count.patch
> 
> Dario made a different fix to Credit1 upstream which was supposed to
> resolve this behaviour (although I can't locate the patch by a list
> of
> titles), but based on these observations, I'd say the misfeature
> hasn't
> been fixed.
> 
Yes, it's 341450eaf753 ("xen: credit1: increase efficiency and
scalability of load balancing."), and that commit and the XenServer
patch are functionally equivalent.

So, I'm not convinced about that being the actual cause of the
described behaviour. Tracing would be the way to tell (hopefully) for
sure.

Regards,
Dario

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

[-- Attachment #2: Type: text/plain, Size: 157 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v10 09/11] x86/ctxt: Issue a speculation barrier between vcpu contexts
  2018-01-25 18:49       ` Dario Faggioli
@ 2018-01-26  1:08         ` Dario Faggioli
  2018-01-26  9:43           ` Jan Beulich
  0 siblings, 1 reply; 57+ messages in thread
From: Dario Faggioli @ 2018-01-26  1:08 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, Dario Faggioli, David Woodhouse, Xen-devel


[-- Attachment #1.1: Type: text/plain, Size: 1927 bytes --]

On Thu, 2018-01-25 at 19:49 +0100, Dario Faggioli wrote:
> On Thu, 2018-01-25 at 16:09 +0000, Andrew Cooper wrote:
> > On 25/01/18 15:57, Jan Beulich wrote:
> > >
> > > For the record, the overwhelming majority of calls to
> > > __sync_local_execstate() being responsible for the behavior
> > > come from invalidate_interrupt(), which suggests to me that
> > > there's a meaningful number of cases where a vCPU is migrated
> > > to another CPU and then back, without another vCPU having
> > > run on the original CPU in between. If I'm not wrong with this,
> > > I have to question why the vCPU is migrated then in the first
> > > place.
> > 
> > Dario made a different fix to Credit1 upstream which was supposed
> > to
> > resolve this behaviour (although I can't locate the patch by a list
> > of
> > titles), but based on these observations, I'd say the misfeature
> > hasn't
> > been fixed.
> > 
> Yes, it's 341450eaf753 ("xen: credit1: increase efficiency and
> scalability of load balancing."), and that commit and the XenServer
> patch are functionally equivalent.
> 
> So, I'm not convinced about that being the actual cause of the
> described behaviour. Tracing would be the way to tell (hopefully) for
> sure.
> 
And in order to go and investigate this a bit further, Jan, what is it
that you were doing when you saw what you described above? AFAIUI,
that's booting an HVM guest, isn't it?

How many vCPUs on how many pCPUs? Basically, I would just need a
confirmation that the system was not oversubscribed, but if, while
we're here, you can tell me the details, I'll put together an as much
as possible similar scenario.

Thanks,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

[-- Attachment #2: Type: text/plain, Size: 157 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v10 09/11] x86/ctxt: Issue a speculation barrier between vcpu contexts
  2018-01-26  1:08         ` Dario Faggioli
@ 2018-01-26  9:43           ` Jan Beulich
  2018-01-26 11:13             ` Dario Faggioli
  0 siblings, 1 reply; 57+ messages in thread
From: Jan Beulich @ 2018-01-26  9:43 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: Andrew Cooper, Dario Faggioli, David Woodhouse, Xen-devel

>>> On 26.01.18 at 02:08, <dfaggioli@suse.com> wrote:
> On Thu, 2018-01-25 at 19:49 +0100, Dario Faggioli wrote:
>> On Thu, 2018-01-25 at 16:09 +0000, Andrew Cooper wrote:
>> > On 25/01/18 15:57, Jan Beulich wrote:
>> > >
>> > > For the record, the overwhelming majority of calls to
>> > > __sync_local_execstate() being responsible for the behavior
>> > > come from invalidate_interrupt(), which suggests to me that
>> > > there's a meaningful number of cases where a vCPU is migrated
>> > > to another CPU and then back, without another vCPU having
>> > > run on the original CPU in between. If I'm not wrong with this,
>> > > I have to question why the vCPU is migrated then in the first
>> > > place.
>> > 
>> > Dario made a different fix to Credit1 upstream which was supposed
>> > to
>> > resolve this behaviour (although I can't locate the patch by a list
>> > of
>> > titles), but based on these observations, I'd say the misfeature
>> > hasn't
>> > been fixed.
>> > 
>> Yes, it's 341450eaf753 ("xen: credit1: increase efficiency and
>> scalability of load balancing."), and that commit and the XenServer
>> patch are functionally equivalent.
>> 
>> So, I'm not convinced about that being the actual cause of the
>> described behaviour. Tracing would be the way to tell (hopefully) for
>> sure.
>> 
> And in order to go and investigate this a bit further, Jan, what is it
> that you were doing when you saw what you described above? AFAIUI,
> that's booting an HVM guest, isn't it?

Yes, plus then run some arbitrary work inside it.

> How many vCPUs on how many pCPUs? Basically, I would just need a
> confirmation that the system was not oversubscribed, but if, while
> we're here, you can tell me the details, I'll put together an as much
> as possible similar scenario.

12 pCPU-s, HVM guest has 8 vCPU-s, Dom0 has 12 pCPU-s (but
no other load than that caused by the guest). If you want, I could
bring the code I've used for monitoring into patch form and hand
it to you.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v10 09/11] x86/ctxt: Issue a speculation barrier between vcpu contexts
  2018-01-26  9:43           ` Jan Beulich
@ 2018-01-26 11:13             ` Dario Faggioli
  2018-01-26 11:38               ` Jan Beulich
  0 siblings, 1 reply; 57+ messages in thread
From: Dario Faggioli @ 2018-01-26 11:13 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, Dario Faggioli, David Woodhouse, Xen-devel


[-- Attachment #1.1: Type: text/plain, Size: 1343 bytes --]

On Fri, 2018-01-26 at 02:43 -0700, Jan Beulich wrote:
> > > > On 26.01.18 at 02:08, <dfaggioli@suse.com> wrote:
> > And in order to go and investigate this a bit further, Jan, what is
> > it
> > that you were doing when you saw what you described above? AFAIUI,
> > that's booting an HVM guest, isn't it?
> 
> Yes, plus then run some arbitrary work inside it.
> 
Ok. And you've seen the "spurious" migrations only/mostly during boot,
or even afterwards, while running this work?

> > How many vCPUs on how many pCPUs? Basically, I would just need a
> > confirmation that the system was not oversubscribed, but if, while
> > we're here, you can tell me the details, I'll put together an as
> > much
> > as possible similar scenario.
> 
> 12 pCPU-s, HVM guest has 8 vCPU-s, Dom0 has 12 pCPU-s (but
> no other load than that caused by the guest). 
>
Ok.

> If you want, I could
> bring the code I've used for monitoring into patch form and hand
> it to you.
> 
If it's not a problem, and whenever you have time, yes, that would be
useful, I think.

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

[-- Attachment #2: Type: text/plain, Size: 157 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v10 09/11] x86/ctxt: Issue a speculation barrier between vcpu contexts
  2018-01-26 11:13             ` Dario Faggioli
@ 2018-01-26 11:38               ` Jan Beulich
  0 siblings, 0 replies; 57+ messages in thread
From: Jan Beulich @ 2018-01-26 11:38 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: Andrew Cooper, Dario Faggioli, David Woodhouse, Xen-devel

>>> On 26.01.18 at 12:13, <dfaggioli@suse.com> wrote:
> On Fri, 2018-01-26 at 02:43 -0700, Jan Beulich wrote:
>> > > > On 26.01.18 at 02:08, <dfaggioli@suse.com> wrote:
>> > And in order to go and investigate this a bit further, Jan, what is
>> > it
>> > that you were doing when you saw what you described above? AFAIUI,
>> > that's booting an HVM guest, isn't it?
>> 
>> Yes, plus then run some arbitrary work inside it.
>> 
> Ok. And you've seen the "spurious" migrations only/mostly during boot,
> or even afterwords, while running this work?

The ratio of spurious moves was higher during guest boot, but
their total amount kept growing when the guest had some sort
of load later on - it was just that there were more "explainable"
moves then.

>> If you want, I could
>> bring the code I've used for monitoring into patch form and hand
>> it to you.
>> 
> If it's not a problem, and whenever you have time, yes, that would be
> useful, I think.

--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1691,6 +1691,7 @@ static void __context_switch(void)
 }
 
 
+static DEFINE_PER_CPU(const void *, last_sync);//temp
 void context_switch(struct vcpu *prev, struct vcpu *next)
 {
     unsigned int cpu = smp_processor_id();
@@ -1725,9 +1726,12 @@ void context_switch(struct vcpu *prev, s
          (is_idle_domain(nextd) && cpu_online(cpu)) )
     {
         local_irq_enable();
+per_cpu(last_sync, cpu) = NULL;//temp
     }
     else
     {
+static DEFINE_PER_CPU(const struct vcpu*, last_vcpu);//temp
+const struct vcpu*last_vcpu = per_cpu(curr_vcpu, cpu);//temp
         __context_switch();
 
         if ( is_pv_domain(nextd) &&
@@ -1750,6 +1754,31 @@ void context_switch(struct vcpu *prev, s
         }
 
         ctxt_switch_levelling(next);
+if(is_hvm_domain(nextd)) {//temp
+ static DEFINE_PER_CPU(unsigned long, good);
+ if(next != per_cpu(last_vcpu, cpu))
+  ++per_cpu(good, cpu);
+ else {
+  static DEFINE_PER_CPU(unsigned long, bad);
+  static DEFINE_PER_CPU(unsigned long, cnt);
+  static DEFINE_PER_CPU(unsigned long, thr);
+  static DEFINE_PER_CPU(domid_t, last_dom);
+  domid_t curr_dom = last_vcpu->domain->domain_id;
+  ++per_cpu(bad, cpu);
+  if(curr_dom < DOMID_FIRST_RESERVED && curr_dom > per_cpu(last_dom, cpu)) {
+   per_cpu(thr, cpu) = 0;
+   per_cpu(cnt, cpu) = 0;
+   per_cpu(last_dom, cpu) = curr_dom;
+  }
+  if(++per_cpu(cnt, cpu) > per_cpu(thr, cpu)) {
+   per_cpu(thr, cpu) |= per_cpu(cnt, cpu);
+   printk("%pv -> %pv -> %pv -> %pv %lu:%lu [%pS]\n",
+          per_cpu(last_vcpu, cpu), last_vcpu, prev, next,
+          per_cpu(good, cpu), per_cpu(bad, cpu), per_cpu(last_sync, cpu));
+  }
+ }
+}
+per_cpu(last_vcpu, cpu) = next;//temp
     }
 
     context_saved(prev);
@@ -1794,6 +1823,7 @@ int __sync_local_execstate(void)
     if ( switch_required )
     {
         ASSERT(current == idle_vcpu[smp_processor_id()]);
+this_cpu(last_sync) = __builtin_return_address(0);//temp
         __context_switch();
     }
 
Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v11 6/11] x86/entry: Organise the clobbering of the RSB/RAS on entry to Xen
  2018-01-25 16:54   ` [PATCH v11 6/11] " Andrew Cooper
@ 2018-01-26 12:17     ` Jan Beulich
  0 siblings, 0 replies; 57+ messages in thread
From: Jan Beulich @ 2018-01-26 12:17 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel

>>> On 25.01.18 at 17:54, <andrew.cooper3@citrix.com> wrote:
> ret instructions are speculated directly to values recorded in the Return
> Stack Buffer/Return Address Stack, as there is no uncertainty in well-formed
> code.  Guests can take advantage of this in two ways:
> 
>   1) If they can find a path in Xen which executes more ret instructions than
>      call instructions.  (At least one in the waitqueue infrastructure,
>      probably others.)
> 
>   2) Use the fact that the RSB/RAS in hardware is actually a circular stack
>      without a concept of empty.  (When it logically empties, stale values
>      will start being used.)
> 
> To mitigate, overwrite the RSB on entry to Xen with gadgets which will capture
> and contain rogue speculation.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Reviewed-by: Jan Beulich <jbeulich@suse.com>



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v11 7/11] x86/entry: Avoid using alternatives in NMI/#MC paths
  2018-01-25 17:21   ` [PATCH v11 7/11] " Andrew Cooper
@ 2018-01-26 12:23     ` Jan Beulich
  2018-01-26 12:28       ` Andrew Cooper
  0 siblings, 1 reply; 57+ messages in thread
From: Jan Beulich @ 2018-01-26 12:23 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel

>>> On 25.01.18 at 18:21, <andrew.cooper3@citrix.com> wrote:
> This patch is deliberately arranged to be easy to revert if/when alternatives
> patching becomes NMI/#MC safe.
> 
> For safety, there must be a dispatch serialising instruction in (what is
> logically) DO_SPEC_CTRL_ENTRY so that, in the case that Xen needs IBRS set in
> context, an attacker can't speculate around the WRMSR and reach an indirect
> branch within the speculation window.
> 
> Using conditionals opens this attack vector up, so the else clause gets an
> LFENCE to force the pipeline to catch up before continuing.  This also covers
> the safety of the RSB conditional, as execution is guaranteed to hit either the
> WRMSR or the LFENCE.
> 
> One downside of not using alternatives is that there is an unconditional LFENCE
> in the IST path in cases where we are not using the features from IBRS-capable
> microcode.
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>

Despite this yet another remark:

> @@ -255,6 +260,69 @@
>          DO_SPEC_CTRL_EXIT_TO_GUEST, X86_FEATURE_XEN_IBRS_SET,           \
>          DO_SPEC_CTRL_EXIT_TO_GUEST, X86_FEATURE_XEN_IBRS_CLEAR
>  
> +/* TODO: Drop these when the alternatives infrastructure is NMI/#MC safe. */
> +.macro SPEC_CTRL_ENTRY_FROM_INTR_IST
> +/*
> + * Requires %rsp=regs, %r14=stack_end
> + * Clobbers %rax, %rcx, %rdx
> + *
> + * This is a logical merge of DO_OVERWRITE_RSB and DO_SPEC_CTRL_ENTRY
> + * maybexen=1, but with conditionals rather than alternatives.
> + */
> +    movzbl STACK_CPUINFO_FIELD(bti_ist_info)(%r14), %eax
> +
> +    testb $BTI_IST_RSB, %al
> +    jz .L\@_skip_rsb
> +
> +    DO_OVERWRITE_RSB
> +
> +.L\@_skip_rsb:
> +
> +    testb $BTI_IST_WRMSR, %al
> +    jz .L\@_skip_wrmsr
> +
> +    testb $3, UREGS_cs(%rsp)
> +    jz .L\@_entry_from_xen
> +
> +    movb $0, STACK_CPUINFO_FIELD(use_shadow_spec_ctrl)(%r14)

I take it you deliberately didn't want to use the same setz/and
pair here to avoid at least the second branch here too, as we
very much hope to only rarely hit this code path?

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v11 7/11] x86/entry: Avoid using alternatives in NMI/#MC paths
  2018-01-26 12:23     ` Jan Beulich
@ 2018-01-26 12:28       ` Andrew Cooper
  0 siblings, 0 replies; 57+ messages in thread
From: Andrew Cooper @ 2018-01-26 12:28 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Xen-devel

On 26/01/18 12:23, Jan Beulich wrote:
>>>> On 25.01.18 at 18:21, <andrew.cooper3@citrix.com> wrote:
>> This patch is deliberately arranged to be easy to revert if/when alternatives
>> patching becomes NMI/#MC safe.
>>
>> For safety, there must be a dispatch serialising instruction in (what is
>> logically) DO_SPEC_CTRL_ENTRY so that, in the case that Xen needs IBRS set in
>> context, an attacker can't speculate around the WRMSR and reach an indirect
>> branch within the speculation window.
>>
>> Using conditionals opens this attack vector up, so the else clause gets an
>> LFENCE to force the pipeline to catch up before continuing.  This also covers
>> the safety of the RSB conditional, as execution is guaranteed to hit either the
>> WRMSR or the LFENCE.
>>
>> One downside of not using alternatives is that there is an unconditional LFENCE
>> in the IST path in cases where we are not using the features from IBRS-capable
>> microcode.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> Reviewed-by: Jan Beulich <jbeulich@suse.com>
> Despite this yet another remark:
>
>> @@ -255,6 +260,69 @@
>>          DO_SPEC_CTRL_EXIT_TO_GUEST, X86_FEATURE_XEN_IBRS_SET,           \
>>          DO_SPEC_CTRL_EXIT_TO_GUEST, X86_FEATURE_XEN_IBRS_CLEAR
>>  
>> +/* TODO: Drop these when the alternatives infrastructure is NMI/#MC safe. */
>> +.macro SPEC_CTRL_ENTRY_FROM_INTR_IST
>> +/*
>> + * Requires %rsp=regs, %r14=stack_end
>> + * Clobbers %rax, %rcx, %rdx
>> + *
>> + * This is a logical merge of DO_OVERWRITE_RSB and DO_SPEC_CTRL_ENTRY
>> + * maybexen=1, but with conditionals rather than alternatives.
>> + */
>> +    movzbl STACK_CPUINFO_FIELD(bti_ist_info)(%r14), %eax
>> +
>> +    testb $BTI_IST_RSB, %al
>> +    jz .L\@_skip_rsb
>> +
>> +    DO_OVERWRITE_RSB
>> +
>> +.L\@_skip_rsb:
>> +
>> +    testb $BTI_IST_WRMSR, %al
>> +    jz .L\@_skip_wrmsr
>> +
>> +    testb $3, UREGS_cs(%rsp)
>> +    jz .L\@_entry_from_xen
>> +
>> +    movb $0, STACK_CPUINFO_FIELD(use_shadow_spec_ctrl)(%r14)
> I take it you deliberately didn't want to use the same setz/and
> pair here to avoid at least the second branch here too, as we
> very much hope to only rarely hit this code path?

No - straight oversight.  I will fix up equivalently.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v10 09/11] x86/ctxt: Issue a speculation barrier between vcpu contexts
  2018-01-25 16:15       ` Andrew Cooper
@ 2018-01-27  1:27         ` Dario Faggioli
  2018-01-29  9:28           ` Jan Beulich
  0 siblings, 1 reply; 57+ messages in thread
From: Dario Faggioli @ 2018-01-27  1:27 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, David Woodhouse, Xen-devel


[-- Attachment #1.1: Type: text/plain, Size: 7842 bytes --]

> On 25/01/18 16:09, Andrew Cooper wrote:
> > On 25/01/18 15:57, Jan Beulich wrote:
> > > > > > 
> > > For the record, the overwhelming majority of calls to
> > > __sync_local_execstate() being responsible for the behavior
> > > come from invalidate_interrupt(), which suggests to me that
> > > there's a meaningful number of cases where a vCPU is migrated
> > > to another CPU and then back, without another vCPU having
> > > run on the original CPU in between. If I'm not wrong with this,
> > > I have to question why the vCPU is migrated then in the first
> > > place.
> > 
So, about this. I haven't applied Jan's measurement patch yet (I'm
doing some reshuffling of my dev and test hardware here), but I have
had a look at the traces.

So, Jan, a question: why are you saying "migrated to another CPU **and
then back**"? I'm asking because, AFAICT, the fact that
__sync_local_execstate() is called from invalidate_interrupt() means
that:
* a vCPU is running on a pCPU
* the vCPU is migrated, and the pCPU becomes idle
* the vCPU starts to run where it was migrated, while its 'original'
  pCPU is still idle ==> inv. IPI ==> sync state.

So there seems to me to be no need for the vCPU to actually "go back",
is there?

Anyway, looking at traces, I observed the following:

    28.371352689 --|------x------ d32767v9 csched:schedule cpu 9, idle
    28.371354095 --|------x------ d32767v9 sched_switch prev d32767v9, run for 3412.789us
    28.371354475 --|------x------ d32767v9 sched_switch next d3v8, was runnable for 59.917us, next slice 30000.0us
    28.371354752 --|------x------ d32767v9 sched_switch prev d32767v9 next d3v8
    28.371355267 --|------x------ d32767v9 runstate_change d32767v9 running->runnable
(1) 28.371355728 --|------x------ d?v? runstate_change d3v8 runnable->running
    ............ ................ ...
(2) 28.375501037 -----|||-x----|- d3v8 vcpu_wake d3v5
    28.375501540 -----|||-x----|- d3v8 runstate_change d3v5 blocked->runnable
(3) 28.375502300 -----|||-x----|- d3v8 csched:runq_tickle, cpu 8
    ............ ................ ...
    28.375509472 --|--|||x|----|- d32767v8 csched:schedule cpu 8, idle
    28.375510682 --|--|||x|----|- d32767v8 sched_switch prev d32767v8, run for 724.165us
    28.375511034 --|--|||x|----|- d32767v8 sched_switch next d3v5, was runnable for 7.396us, next slice 30000.0us
    28.375511300 --|--|||x|----|- d32767v8 sched_switch prev d32767v8 next d3v5
    28.375511640 --|--|||x|----|- d32767v8 runstate_change d32767v8 running->runnable
(4) 28.375512060 --|--|||x|----|- d?v? runstate_change d3v5 runnable->running
    ............ ................ ...
(5) 28.375624977 ----|-|||x----|- d3v8 csched: d3v8 unboosted
(6) 28.375628208 ----|-|||x----|- d3v8 csched:pick_cpu 11 for d3v8

At (1) d3v8 starts running on CPU 9. Then, at (2), d3v5 wakes up, and
at (3) CPU 8 (which is idle) is tickled, as a consequence of that. At
(4), CPU 8 picks up d3v5 and runs it (this may seem unrelated, but bear
with me a little).

At (5), a periodic tick arrives on CPU 9. Periodic ticks are a core
part of the Credit1 algorithm, and are used for accounting and load
balancing. In fact, csched_tick() calls csched_vcpu_acct() which, at
(6), calls _csched_cpu_pick().

Pick realizes that d3v8 is running on CPU 9, and that CPU 8 is also
busy. Now, since CPU 8 and 9 are hyperthreads of the same core, and
since there are fully idle cores, Credit1 decides that it's better to
kick d3v8 to one of those fully idle cores, so both d3v5 and d3v8
itself can run at full "core speed". In fact, we see that CPU 11 is
picked, as both the hyperthreads --CPU 10 and CPU 11 itself-- are idle.
(To be continued, below)

(7) 28.375630686 ----|-|||x----|- d3v8 csched:schedule cpu 9, busy
(*) 28.375631619 ----|-|||x----|- d3v8 csched:load_balance skipping 14
    28.375632094 ----|-|||x----|- d3v8 csched:load_balance skipping 8
    28.375632612 ----|-|||x----|- d3v8 csched:load_balance skipping 4
    28.375633004 ----|-|||x----|- d3v8 csched:load_balance skipping 6
    28.375633364 ----|-|||x----|- d3v8 csched:load_balance skipping 7
    28.375633960 ----|-|||x------ d3v8 csched:load_balance skipping 8
    28.375634470 ----|-|||x------ d3v8 csched:load_balance skipping 4
    28.375634820 ----|-|||x------ d3v8 csched:load_balance skipping 6
(**)28.375635067 ----|-|||x------ d3v8 csched:load_balance skipping 7
    28.375635560 ----|-|||x------ d3v8 sched_switch prev d3v8, run for 4288.140us
    28.375635988 ----|-|||x------ d3v8 sched_switch next d32767v9, was runnable for 4288.140us
    28.375636233 ----|-|||x------ d3v8 sched_switch prev d3v8 next d32767v9
    28.375636615 ----|-|||x------ d3v8 runstate_change d3v8 running->offline
(8) 28.375637015 ----|-|||x------ d?v? runstate_change d32767v9 runnable->running
    28.375638146 ----|-x||------- d3v2 vcpu_block d3v2
    ............ ................ ...
    28.375645627 ----|--||x------ d32767v9 csched:pick_cpu 11 for d3v8
    28.375647138 ----|--||x------ d32767v9 vcpu_wake d3v8
    28.375647640 ----|--||x------ d32767v9 runstate_change d3v8 offline->runnable
(9) 28.375648353 ----|--||x------ d32767v9 csched:runq_tickle, cpu 11
    ............ ................ ...
    28.375709505 ----|--||--x---- d32767v11 sched_switch prev d32767v11, run for 2320182.912us
    28.375709778 ----|--||--x---- d32767v11 sched_switch next d3v8, was runnable for 59.670us, next slice 30000.0us
    28.375710001 ----|--||--x---- d32767v11 sched_switch prev d32767v11 next d3v8
    28.375710501 ----|--||--x---- d32767v11 runstate_change d32767v11 running->runnable
(10)28.375710858 ----|--||--x---- d?v? runstate_change d3v8 runnable->running

At (7) we see that CPU 9 re-schedules, as a consequence of pick
deciding to migrate d3v8. As a side note, all the "load_balance
skipping xx" lines between (*) and (**) show that stealing work
attempts are actually prevented on all those CPUs, because they have
only 1 runnable (either running or ready to do so) vCPU. I.e., my patch
 works and achieves its goal of avoiding even trying to steal (which
means avoiding having to take a lock!), when there's no need. :-)

Anyway, at (8) d3v8 is gone, and CPU 9 eventually becomes idle. At (9),
another call to pick_cpu() confirms that d3v8 will land on CPU 11, and
at (10) we see it starting to run there. It should be at this point
that the invalidate IPI is sent, which causes the state sync request
(note, in fact, that CPU 8 is still idle).
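
(For completeness, the mechanism behind that sync request is,
conceptually, something like the sketch below. The helper names are
made up; this is not the actual Xen code, just the idea:)

#include <stdbool.h>

/* Made-up helpers, standing in for Xen's lazy context handling. */
bool this_cpu_has_lazy_vcpu_state(void); /* we left a vCPU's state loaded */
void save_lazy_vcpu_state(void);         /* write it back to the vCPU     */

/*
 * Runs on the *old* pCPU (CPU 9, in this trace) when the vCPU it last
 * ran gets picked up elsewhere (CPU 11): the register state that was
 * kept around lazily must be written out, so that the new pCPU can
 * load it. This is roughly the work __sync_local_execstate() ends up
 * doing.
 */
void invalidate_ipi_handler_sketch(void)
{
    if ( this_cpu_has_lazy_vcpu_state() )
        save_lazy_vcpu_state();
}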

Now, this is just _one_ example, but I am quite convinced that this may
actually be one of the most prominent causes of the behavior Jan
reported.

The problem, as I was expecting, is not work stealing, the problem is,
well... Credit1! :-/

In fact, when d3v5 wakes up, why is it CPU 8 that gets tickled at
point (3), instead of, for instance, CPU 10 (or 11, or 12, or 13)?
CPU 8 and CPU 9 are hyperthread siblings, and CPU 9 is busy, so it
would have been better to try to leave CPU 8 alone. And that would
have been possible, as both the core formed by CPUs 10 and 11 and the
one formed by CPUs 12 and 13 are fully idle.

Well, the point is that tickling in Credit1 does not check/consider
hyperthreading at all. Can it then start doing so? Not easily, IMO,
and only at an added cost --which would be paid on the vCPU wakeup
path (a path which is already quite convoluted and complex, in that
scheduler).
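
To give an idea of what that would mean, the wakeup path would have to
grow a pass roughly like the one below, on top of what it already does.
Again, this is only a sketch with invented helpers, not actual Credit1
code:

#include <stdbool.h>

#define NR_CPUS 16

/* Invented helpers, for illustration only. */
bool cpu_is_idle(unsigned int cpu);
bool cpu_in_affinity(unsigned int cpu); /* allowed for the waking vCPU */
bool core_fully_idle(unsigned int cpu); /* cpu and all its siblings idle */

/*
 * SMT-aware tickling: prefer an idle CPU whose whole core is idle, and
 * only fall back to an idle thread of a busy core if there is nothing
 * better.  Note this is an extra scan of the idlers on *every* wakeup.
 */
int pick_cpu_to_tickle(void)
{
    unsigned int cpu;
    int fallback = -1;

    for ( cpu = 0; cpu < NR_CPUS; cpu++ )
    {
        if ( !cpu_is_idle(cpu) || !cpu_in_affinity(cpu) )
            continue;
        if ( core_fully_idle(cpu) )
            return cpu;          /* e.g. CPU 10 or 11, rather than CPU 8 */
        if ( fallback == -1 )
            fallback = cpu;
    }

    return fallback;             /* may be -1: nobody idle to tickle */
}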

Credit2, for instance, does not suffer from this issue: there,
hyperthreading is already taken into account during wakeup/tickling.

Hope this helps clarify things a bit,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

[-- Attachment #2: Type: text/plain, Size: 157 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v10 09/11] x86/ctxt: Issue a speculation barrier between vcpu contexts
  2018-01-27  1:27         ` Dario Faggioli
@ 2018-01-29  9:28           ` Jan Beulich
  2018-02-05 11:37             ` George Dunlap
  0 siblings, 1 reply; 57+ messages in thread
From: Jan Beulich @ 2018-01-29  9:28 UTC (permalink / raw)
  To: Dario Faggioli; +Cc: Andrew Cooper, David Woodhouse, Xen-devel

>>> On 27.01.18 at 02:27, <dfaggioli@suse.com> wrote:
>>  On 25/01/18 16:09, Andrew Cooper wrote:
>> > On 25/01/18 15:57, Jan Beulich wrote:
>> > > > > > 
>> > > For the record, the overwhelming majority of calls to
>> > > __sync_local_execstate() being responsible for the behavior
>> > > come from invalidate_interrupt(), which suggests to me that
>> > > there's a meaningful number of cases where a vCPU is migrated
>> > > to another CPU and then back, without another vCPU having
>> > > run on the original CPU in between. If I'm not wrong with this,
>> > > I have to question why the vCPU is migrated then in the first
>> > > place.
>> > 
> So, about this. I haven't applied Jan's measurement patch yet (I'm
> doing some reshuffling of my dev and test hardware here), but I have
> given a look at traces.
> 
> So, Jan, a question: why are you saying "migrated to another CPU **and
> then back**"?

Because that's how I interpret some of the output from my
logging additions.

> I'm asking because, AFAICT, the fact that
> __sync_local_execstate() is called from invalidate_interrupt() means
> that:
> * a vCPU is running on a pCPU
> * the vCPU is migrated, and the pCPU became idle
> * the vCPU starts to run where it was migrated, while its 'original'
>   pCPU is still idle ==> inv. IPI ==> sync state.

This is just the first half of it. In some cases I then see the vCPU
go back without the pCPU having run anything else in between.

> So there seems to me to be no need for the vCPU to actually "go back",
> is there?

There is no need for it to go back, but it does. There's also no need
for it to be migrated in the first place if there are no more runnable
vCPU-s than there are pCPU-s.

> Anyway, looking at traces, I observed the following:
> [...]
> At (1) d3v8 starts running on CPU 9. Then, at (2), d3v5 wakes up, and
> at (3) CPU 8 (which is idle) is tickled, as a consequence of that. At
> (4), CPU 8 picks up d3v5 and runs it (this may seem unrelated, but bear
> with me a little).
> 
> At (5), a periodic tick arrives on CPU 9. Periodic ticks are a core
> part of the Credit1 algorithm, and are used for accounting and load
> balancing. In fact, csched_tick() calls csched_vcpu_acct() which, at
> (6), calls _csched_cpu_pick().
> 
> Pick realizes that d3v8 is running on CPU 9, and that CPU 8 is also
> busy. Now, since CPU 8 and 9 are hyperthreads of the same core, and
> since there are fully idle cores, Credit1 decides that it's better to
> kick d3v8 to one of those fully idle cores, so both d3v5 and d3v8
> itself can run at full "core speed". In fact, we see that CPU 11 is
> picked, as both the hyperthreads --CPU 10 and CPU 11 itself-- are idle.
> (To be continued, below)

Ah, I see. I did not look at the output with topology in mind, I
admit. Nevertheless I then still don't really understand why it
appears to be not uncommon for a vCPU to move back and
forth relatively rapidly (I take no other vCPU having run on a
pCPU as a sign that the period of time elapsed in between is
not very large). Please don't forget that the act of moving
a vCPU has a (performance) price, too.

> The problem, as I was expecting, is not work stealing, the problem is,
> well... Credit1! :-/
> [...]
> Credit2, for instance, does not suffer from this issue: there,
> hyperthreading is already taken into account during wakeup/tickling.

Well - when is Credit2 going to become the default?

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v10 08/11] x86/boot: Calculate the most appropriate BTI mitigation to use
  2018-01-24 13:12 ` [PATCH v10 08/11] x86/boot: Calculate the most appropriate BTI mitigation to use Andrew Cooper
  2018-01-25 13:52   ` Jan Beulich
@ 2018-02-01  8:41   ` Jan Beulich
  2018-02-01 13:58     ` Andrew Cooper
  1 sibling, 1 reply; 57+ messages in thread
From: Jan Beulich @ 2018-02-01  8:41 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Asit K Mallick, Jun Nakajima, Xen-devel

>>> On 24.01.18 at 14:12, <andrew.cooper3@citrix.com> wrote:
>  void __init init_speculation_mitigations(void)
>  {
>      enum ind_thunk thunk = THUNK_DEFAULT;
> +    bool ibrs = false;
>  
>      /*
>       * Has the user specified any custom BTI mitigations?  If so, follow their
>       * instructions exactly and disable all heuristics.
>       */
> -    if ( opt_thunk != THUNK_DEFAULT )
> +    if ( opt_thunk != THUNK_DEFAULT || opt_ibrs != -1 )
>      {
>          thunk = opt_thunk;
> +        ibrs  = !!opt_ibrs;
>      }

The comment above has become stale with later additions. The question
is what the intentions are, i.e. whether it is the comment or the code
that should be changed.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v10 01/11] x86/cpuid: Handling of IBRS/IBPB, STIBP and IBRS for guests
  2018-01-24 13:12 ` [PATCH v10 01/11] x86/cpuid: Handling of IBRS/IBPB, STIBP and IBRS for guests Andrew Cooper
@ 2018-02-01  9:06   ` Jan Beulich
  2018-02-01 13:53     ` Andrew Cooper
  0 siblings, 1 reply; 57+ messages in thread
From: Jan Beulich @ 2018-02-01  9:06 UTC (permalink / raw)
  To: Andrew Cooper; +Cc: Xen-devel

>>> On 24.01.18 at 14:12, <andrew.cooper3@citrix.com> wrote:
> --- a/xen/arch/x86/cpuid.c
> +++ b/xen/arch/x86/cpuid.c
> @@ -383,6 +383,16 @@ static void __init calculate_pv_max_policy(void)
>      /* Unconditionally claim to be able to set the hypervisor bit. */
>      __set_bit(X86_FEATURE_HYPERVISOR, pv_featureset);
>  
> +    /* On hardware with IBRS/IBPB support, there are further adjustments. */
> +    if ( test_bit(X86_FEATURE_IBRSB, pv_featureset) )
> +    {
> +        /* Offer STIBP unconditionally.  It is a nop on non-HT hardware. */
> +        __set_bit(X86_FEATURE_STIBP, pv_featureset);
> +
> +        /* AMD's IBPB is a subset of IBRS/IBPB. */
> +        __set_bit(X86_FEATURE_IBPB, pv_featureset);
> +    }
> +
>      sanitise_featureset(pv_featureset);
>      cpuid_featureset_to_policy(pv_featureset, p);
>      recalculate_xstate(p);
> @@ -440,6 +450,16 @@ static void __init calculate_hvm_max_policy(void)
>              __clear_bit(X86_FEATURE_XSAVES, hvm_featureset);
>      }
>  
> +    /* On hardware with IBRS/IBPB support, there are further adjustments. */
> +    if ( test_bit(X86_FEATURE_IBRSB, hvm_featureset) )
> +    {
> +        /* Offer STIBP unconditionally.  It is a nop on non-HT hardware. */
> +        __set_bit(X86_FEATURE_STIBP, hvm_featureset);
> +
> +        /* AMD's IBPB is a subset of IBRS/IBPB. */
> +        __set_bit(X86_FEATURE_IBPB, hvm_featureset);
> +    }

One more thing regarding these changes, and their implications
on migration: If a capable guest is started on a system without
hardware support and then migrated to a system with hardware
support, it will continue to run unprotected. I do realize that
reporting the features to guests (and ignoring the MSR accesses)
on incapable hardware is not nice either (as it will give the guest a
false sense of security), but was that option at least considered
(perhaps as non-default behavior), so that after migration such a
guest would then be secure?

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v10 01/11] x86/cpuid: Handling of IBRS/IBPB, STIBP and IBRS for guests
  2018-02-01  9:06   ` Jan Beulich
@ 2018-02-01 13:53     ` Andrew Cooper
  0 siblings, 0 replies; 57+ messages in thread
From: Andrew Cooper @ 2018-02-01 13:53 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Xen-devel

On 01/02/18 09:06, Jan Beulich wrote:
>>>> On 24.01.18 at 14:12, <andrew.cooper3@citrix.com> wrote:
>> --- a/xen/arch/x86/cpuid.c
>> +++ b/xen/arch/x86/cpuid.c
>> @@ -383,6 +383,16 @@ static void __init calculate_pv_max_policy(void)
>>      /* Unconditionally claim to be able to set the hypervisor bit. */
>>      __set_bit(X86_FEATURE_HYPERVISOR, pv_featureset);
>>  
>> +    /* On hardware with IBRS/IBPB support, there are further adjustments. */
>> +    if ( test_bit(X86_FEATURE_IBRSB, pv_featureset) )
>> +    {
>> +        /* Offer STIBP unconditionally.  It is a nop on non-HT hardware. */
>> +        __set_bit(X86_FEATURE_STIBP, pv_featureset);
>> +
>> +        /* AMD's IBPB is a subset of IBRS/IBPB. */
>> +        __set_bit(X86_FEATURE_IBPB, pv_featureset);
>> +    }
>> +
>>      sanitise_featureset(pv_featureset);
>>      cpuid_featureset_to_policy(pv_featureset, p);
>>      recalculate_xstate(p);
>> @@ -440,6 +450,16 @@ static void __init calculate_hvm_max_policy(void)
>>              __clear_bit(X86_FEATURE_XSAVES, hvm_featureset);
>>      }
>>  
>> +    /* On hardware with IBRS/IBPB support, there are further adjustments. */
>> +    if ( test_bit(X86_FEATURE_IBRSB, hvm_featureset) )
>> +    {
>> +        /* Offer STIBP unconditionally.  It is a nop on non-HT hardware. */
>> +        __set_bit(X86_FEATURE_STIBP, hvm_featureset);
>> +
>> +        /* AMD's IBPB is a subset of IBRS/IBPB. */
>> +        __set_bit(X86_FEATURE_IBPB, hvm_featureset);
>> +    }
> One more thing regarding these changes, and their implications
> on migration: If a capable guest is started on a system without
> hardware support and then migrated to a system with hardware
> support, it will continue to run unprotected. I do realize that
> reporting the features to guests (and ignoring the MSR accesses)
> on incapable hardware is not nice either (as it will give the guest a
> false sense of security), but was that option at least considered
> (perhaps as non-default behavior), so that after migration such a
> guest would then be secure?

Trapping and ignoring the MSRs (especially for HVM guests) is going to
be a far higher overhead than even using the MSRs, and using the MSRs
comes with a steep perf hit.

The far more subtle problem occurs when migrating between a Broadwell
and Skylake CPU, where your choice of which mitigations to use changes.

It would be good for VMs in general to have a way of knowing that they
have been migrated, and of re-evaluating their idea of the world.  That,
combined with an explicit administrator decision to increase the visible
featureset of the VM, would be a reasonable solution to this problem
(and to others).

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v10 08/11] x86/boot: Calculate the most appropriate BTI mitigation to use
  2018-02-01  8:41   ` Jan Beulich
@ 2018-02-01 13:58     ` Andrew Cooper
  0 siblings, 0 replies; 57+ messages in thread
From: Andrew Cooper @ 2018-02-01 13:58 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Asit K Mallick, Jun Nakajima, Xen-devel

On 01/02/18 08:41, Jan Beulich wrote:
>>>> On 24.01.18 at 14:12, <andrew.cooper3@citrix.com> wrote:
>>  void __init init_speculation_mitigations(void)
>>  {
>>      enum ind_thunk thunk = THUNK_DEFAULT;
>> +    bool ibrs = false;
>>  
>>      /*
>>       * Has the user specified any custom BTI mitigations?  If so, follow their
>>       * instructions exactly and disable all heuristics.
>>       */
>> -    if ( opt_thunk != THUNK_DEFAULT )
>> +    if ( opt_thunk != THUNK_DEFAULT || opt_ibrs != -1 )
>>      {
>>          thunk = opt_thunk;
>> +        ibrs  = !!opt_ibrs;
>>      }
> The comment above has become stale with later additions. Question
> is what the intentions are, i.e. whether it is the comment or the code
> that should be changed.

Hmm, true.  The comment used to be accurate, but I changed the
behaviour based on internal feedback.  The new logic means that
explicit choices only affect the defaults of their related area.

e.g. the choice of whether to use IBPB or not (which really is an
isolated decision here) shouldn't override the default choice for
whether to use IBRS/retpoline/etc.
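
I.e., conceptually, the handling wants to end up shaped something like
the sketch below (placeholder declarations and names, not the exact
code in the patch):

#include <stdbool.h>

enum ind_thunk { THUNK_DEFAULT /* , other values elided */ };

static enum ind_thunk opt_thunk = THUNK_DEFAULT;
static int opt_ibrs = -1;       /* -1 == not specified on the command line */

void init_sketch(void)
{
    enum ind_thunk thunk = THUNK_DEFAULT;
    bool ibrs = false;

    /* Each explicit choice only pins down its own area... */
    if ( opt_thunk != THUNK_DEFAULT )
        thunk = opt_thunk;

    if ( opt_ibrs >= 0 )
        ibrs = opt_ibrs;

    /*
     * ...and anything not chosen explicitly is still filled in by the
     * usual heuristics (not shown here).
     */
}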

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH v10 09/11] x86/ctxt: Issue a speculation barrier between vcpu contexts
  2018-01-29  9:28           ` Jan Beulich
@ 2018-02-05 11:37             ` George Dunlap
  0 siblings, 0 replies; 57+ messages in thread
From: George Dunlap @ 2018-02-05 11:37 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, Xen-devel, David Woodhouse, Dario Faggioli

On Mon, Jan 29, 2018 at 9:28 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 27.01.18 at 02:27, <dfaggioli@suse.com> wrote:
>>>  On 25/01/18 16:09, Andrew Cooper wrote:
>>> > On 25/01/18 15:57, Jan Beulich wrote:
>>> > > > > >
>>> > > For the record, the overwhelming majority of calls to
>>> > > __sync_local_execstate() being responsible for the behavior
>>> > > come from invalidate_interrupt(), which suggests to me that
>>> > > there's a meaningful number of cases where a vCPU is migrated
>>> > > to another CPU and then back, without another vCPU having
>>> > > run on the original CPU in between. If I'm not wrong with this,
>>> > > I have to question why the vCPU is migrated then in the first
>>> > > place.
>>> >
>> So, about this. I haven't applied Jan's measurement patch yet (I'm
>> doing some reshuffling of my dev and test hardware here), but I have
>> given a look at traces.
>>
>> So, Jan, a question: why are you saying "migrated to another CPU **and
>> then back**"?
>
> Because that's how I interpret some of the output from my
> logging additions.
>
>> I'm asking because, AFAICT, the fact that
>> __sync_local_execstate() is called from invalidate_interrupt() means
>> that:
>> * a vCPU is running on a pCPU
>> * the vCPU is migrated, and the pCPU became idle
>> * the vCPU starts to run where it was migrated, while its 'original'
>>   pCPU is still idle ==> inv. IPI ==> sync state.
>
> This is just the first half of it. In some cases I then see the vCPU
> go back without the pCPU having run anything else in between.
>
>> So there seems to me to be no need for the vCPU to actually "go back",
>> is there?
>
> There is no need for it to go back, but it does. There's also no need
> for it to be migrated in the first place if there are no more runnable
> vCPU-s than there are pCPU-s.
>
>> Anyway, looking at traces, I observed the following:
>> [...]
>> At (1) d3v8 starts running on CPU 9. Then, at (2), d3v5 wakes up, and
>> at (3) CPU 8 (which is idle) is tickled, as a consequence of that. At
>> (4), CPU 8 picks up d3v5 and runs it (this may seem unrelated, but bear
>> with me a little).
>>
>> At (5), a periodic tick arrives on CPU 9. Periodic ticks are a core
>> part of the Credit1 algorithm, and are used for accounting and load
>> balancing. In fact, csched_tick() calls csched_vcpu_acct() which, at
>> (6), calls _csched_cpu_pick().
>>
>> Pick realizes that d3v8 is running on CPU 9, and that CPU 8 is also
>> busy. Now, since CPU 8 and 9 are hyperthreads of the same core, and
>> since there are fully idle cores, Credit1 decides that it's better to
>> kick d3v8 to one of those fully idle cores, so both d3v5 and d3v8
>> itself can run at full "core speed". In fact, we see that CPU 11 is
>> picked, as both the hyperthreads --CPU 10 and CPU 11 itself-- are idle.
>> (To be continued, below)
>
> Ah, I see. I did not look at the output with topology in mind, I
> admit. Nevertheless I then still don't really understand why it
> appears to be not uncommon for a vCPU to move back and
> forth relatively rapidly (I take no other vCPU having run on a
> pCPU as a sign that the period of time elapsed in between to
> not be very large). Please don't forget that the act of moving
> a vCPU has a (performance) price, too.
>
>> The problem, as I was expecting, is not work stealing, the problem is,
>> well... Credit1! :-/
>> [...]
>> Credit2, for instance, does not suffer from this issue: there,
>> hyperthreading is already taken into account during wakeup/tickling.
>
> Well - when is Credit2 going to become the default?

I've just sent a patch.

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 57+ messages in thread

end of thread, other threads:[~2018-02-05 11:37 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-24 13:12 [PATCH v10 00/11] x86: Mitigations for SP2/CVE-2017-5715/Branch Target Injection Andrew Cooper
2018-01-24 13:12 ` [PATCH v10 01/11] x86/cpuid: Handling of IBRS/IBPB, STIBP and IBRS for guests Andrew Cooper
2018-02-01  9:06   ` Jan Beulich
2018-02-01 13:53     ` Andrew Cooper
2018-01-24 13:12 ` [PATCH v10 02/11] x86/msr: Emulation of MSR_{SPEC_CTRL, PRED_CMD} " Andrew Cooper
2018-01-25 12:25   ` Jan Beulich
2018-01-24 13:12 ` [PATCH v10 03/11] x86/migrate: Move MSR_SPEC_CTRL on migrate Andrew Cooper
2018-01-24 13:12 ` [PATCH v10 04/11] x86/hvm: Permit guests direct access to MSR_{SPEC_CTRL, PRED_CMD} Andrew Cooper
2018-01-24 13:12 ` [PATCH v10 05/11] x86/entry: Organise the use of MSR_SPEC_CTRL at each entry/exit point Andrew Cooper
2018-01-25 13:08   ` Jan Beulich
2018-01-25 14:12     ` Andrew Cooper
2018-01-25 14:36       ` Jan Beulich
2018-01-25 14:46         ` Andrew Cooper
2018-01-25 15:08           ` Jan Beulich
2018-01-25 15:10             ` Andrew Cooper
2018-01-25 16:52   ` [PATCH v11 5/11] " Andrew Cooper
2018-01-24 13:12 ` [PATCH v10 06/11] x86/entry: Organise the clobbering of the RSB/RAS on entry to Xen Andrew Cooper
2018-01-25 13:19   ` Jan Beulich
2018-01-25 14:17     ` Andrew Cooper
2018-01-25 14:40       ` Jan Beulich
2018-01-25 14:44         ` Andrew Cooper
2018-01-25 14:48           ` Jan Beulich
2018-01-25 16:54   ` [PATCH v11 6/11] " Andrew Cooper
2018-01-26 12:17     ` Jan Beulich
2018-01-24 13:12 ` [PATCH v10 07/11] x86/entry: Avoid using alternatives in NMI/#MC paths Andrew Cooper
2018-01-25 13:43   ` Jan Beulich
2018-01-25 15:04     ` Andrew Cooper
2018-01-25 15:14       ` Jan Beulich
2018-01-25 15:19         ` Andrew Cooper
2018-01-25 16:17           ` Jan Beulich
2018-01-25 17:21   ` [PATCH v11 7/11] " Andrew Cooper
2018-01-26 12:23     ` Jan Beulich
2018-01-26 12:28       ` Andrew Cooper
2018-01-24 13:12 ` [PATCH v10 08/11] x86/boot: Calculate the most appropriate BTI mitigation to use Andrew Cooper
2018-01-25 13:52   ` Jan Beulich
2018-02-01  8:41   ` Jan Beulich
2018-02-01 13:58     ` Andrew Cooper
2018-01-24 13:12 ` [PATCH v10 09/11] x86/ctxt: Issue a speculation barrier between vcpu contexts Andrew Cooper
2018-01-24 13:34   ` Woodhouse, David
2018-01-24 13:49     ` Andrew Cooper
2018-01-24 14:31       ` David Woodhouse
2018-01-25 14:46         ` Konrad Rzeszutek Wilk
2018-01-25 15:57   ` Jan Beulich
2018-01-25 16:09     ` Andrew Cooper
2018-01-25 16:15       ` Andrew Cooper
2018-01-27  1:27         ` Dario Faggioli
2018-01-29  9:28           ` Jan Beulich
2018-02-05 11:37             ` George Dunlap
2018-01-25 16:31       ` Jan Beulich
2018-01-25 16:48         ` Andrew Cooper
2018-01-25 18:49       ` Dario Faggioli
2018-01-26  1:08         ` Dario Faggioli
2018-01-26  9:43           ` Jan Beulich
2018-01-26 11:13             ` Dario Faggioli
2018-01-26 11:38               ` Jan Beulich
2018-01-24 13:12 ` [PATCH v10 10/11] x86/cpuid: Offer Indirect Branch Controls to guests Andrew Cooper
2018-01-24 13:12 ` [PATCH v10 11/11] x86/idle: Clear SPEC_CTRL while idle Andrew Cooper

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.