From: Jan Beulich <jbeulich@suse.com>
To: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Xen-devel <xen-devel@lists.xenproject.org>,
	"Wei Liu" <wl@xen.org>, "Roger Pau Monné" <roger.pau@citrix.com>
Subject: Re: [Xen-devel] [PATCH] x86/spec-ctrl: Scrub stale segment registers on leaky hardware
Date: Mon, 12 Aug 2019 10:00:10 +0200	[thread overview]
Message-ID: <87885176-366e-3eb1-4427-25c977694a0e@suse.com> (raw)
In-Reply-To: <20190809171623.25657-1-andrew.cooper3@citrix.com>

On 09.08.2019 19:16, Andrew Cooper wrote:
> --- a/docs/misc/xen-command-line.pandoc
> +++ b/docs/misc/xen-command-line.pandoc
> @@ -1914,7 +1914,7 @@ By default SSBD will be mitigated at runtime (i.e `ssbd=runtime`).
>   ### spec-ctrl (x86)
>   > `= List of [ <bool>, xen=<bool>, {pv,hvm,msr-sc,rsb,md-clear}=<bool>,
>   >              bti-thunk=retpoline|lfence|jmp, {ibrs,ibpb,ssbd,eager-fpu,
> ->              l1d-flush,l1tf-barrier}=<bool> ]`
> +>              l1d-flush,stale-seg-clear,l1tf-barrier}=<bool> ]`
>   
>   Controls for speculative execution sidechannel mitigations.  By default, Xen
>   will pick the most appropriate mitigations based on compiled in support,
> @@ -1986,6 +1986,12 @@ Irrespective of Xen's setting, the feature is virtualised for HVM guests to
>   use.  By default, Xen will enable this mitigation on hardware believed to be
>   vulnerable to L1TF.
>   
> +On all hardware, the `stale-seg-clear=` option can be used to force or prevent
> +Xen from clearing the microarchitectural segment register copies on context
> +switch.  By default, Xen will choose to use stale segment clearing on affected
> +hardware.  The clearing logic is tuned to microarchitectural details of the
> +affected CPUs.
> +
>   On hardware vulnerable to L1TF, the `l1tf-barrier=` option can be used to force
>   or prevent Xen from protecting evaluations inside the hypervisor with a barrier
>   instruction to not load potentially secret information into L1 cache.  By

Purely out of curiosity: Is there a reason both insertions happen between
two pre-existing sub-options, not after all of them?

> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -1328,6 +1328,29 @@ static void load_segments(struct vcpu *n)
>       dirty_segment_mask = per_cpu(dirty_segment_mask, cpu);
>       per_cpu(dirty_segment_mask, cpu) = 0;
>   
> +    /*
> +     * CPUs which suffer from stale segment register leakage have two copies
> +     * of each segment register [Correct at the time of writing - Aug 2019].
> +     *
> +     * We must write to both of them to scrub state from the previous vcpu.
> +     * However, two writes in quick succession stall the pipeline, as only one
> +     * write per segment can be speculatively outstanding.
> +     *
> +     * Split the two sets of writes to each register to maximise the chance
> +     * that these writes have retired before the second set starts, thus
> +     * reducing the chance of stalling.
> +     */
> +    if ( opt_stale_seg_clear )
> +    {
> +        asm volatile ( "mov %[sel], %%ds;"
> +                       "mov %[sel], %%es;"
> +                       "mov %[sel], %%fs;"
> +                       "mov %[sel], %%gs;"
> +                       :: [sel] "r" (0) );
> +        /* Force a reload of all segments to be the second scrubbing write. */
> +        dirty_segment_mask = ~0;
> +    }

Given that the command line option exists, do we care about people
actually using it on AMD hardware? For %ds and %es this would now lead
to up to three selector register loads each, with the one above being
completely pointless (due to not clearing the segment bases anyway).
The question, of course, is whether adding yet another conditional (to
the added code above or to preload_segment()) would be any better than
having this extra selector register write.

Furthermore, if we cared, shouldn't SVM code also honor the flag and
gain extra loads, albeit perhaps with unlikely() used in the if()?

As to your choice of loading a nul selector: For the VERW change, iirc
we've been told that using a nul selector is a bad choice from a
performance point of view. Are nul selectors treated better for plain
selector register writes?

> @@ -872,11 +873,38 @@ static void vmx_ctxt_switch_from(struct vcpu *v)
>   
>   static void vmx_ctxt_switch_to(struct vcpu *v)
>   {
> +    /*
> +     * CPUs which suffer from stale segment register leakage have two copies
> +     * of each segment register [Correct at the time of writing - Aug 2019].
> +     *
> +     * We must write to both of them to scrub state from the previous vcpu.
> +     * However, two writes in quick succession stall the pipeline, as only one
> +     * write per segment can be speculatively outstanding.
> +     *
> +     * Split the two sets of writes to each register to maximise the chance
> +     * that these writes have retired before the second set starts, thus
> +     * reducing the chance of stalling.
> +     */
> +    if ( opt_stale_seg_clear )
> +        asm volatile ( "mov %[sel], %%ds;"
> +                       "mov %[sel], %%es;"
> +                       "mov %[sel], %%fs;"
> +                       "mov %[sel], %%gs;"
> +                       :: [sel] "r" (0) );
> +
>       vmx_restore_guest_msrs(v);
>       vmx_restore_dr(v);
>   
>       if ( v->domain->arch.hvm.pi_ops.flags & PI_CSW_TO )
>           vmx_pi_switch_to(v);
> +
> +    /* Should be last in the function.  See above. */
> +    if ( opt_stale_seg_clear )
> +        asm volatile ( "mov %[sel], %%ds;"
> +                       "mov %[sel], %%es;"
> +                       "mov %[sel], %%fs;"
> +                       "mov %[sel], %%gs;"
> +                       :: [sel] "r" (0) );
>   }

Why two instances? Aren't the selector register loads during VM
entry sufficient to clear both instances? (The question is
rhetorical in a way, because I think I know the answer, but
neither the comment here nor the patch description provide it.)

> @@ -111,6 +114,7 @@ static int __init parse_spec_ctrl(const char *s)
>               opt_ibpb = false;
>               opt_ssbd = false;
>               opt_l1d_flush = 0;
> +            opt_stale_seg_clear = 0;
>           }

Is it really correct for this to be affected by both "spec-ctrl=no"
and "spec-ctrl=no-xen"? It would seem to me that this belongs above
the "disable_common" label, as the control is also meant to guard
against guest->guest leaks.

> @@ -864,6 +871,83 @@ static __init void mds_calculations(uint64_t caps)
>       }
>   }
>   
> +/* Calculate whether this CPU leaks segment registers between contexts. */
> +static void __init stale_segment_calculations(void)
> +{
> +    /*
> +     * Assume all unrecognised processors are ok.  This is only known to
> +     * affect Intel Family 6 processors.
> +     */
> +    if ( boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
> +         boot_cpu_data.x86 != 6 )
> +        return;

The comment above here and ...

> +    switch ( boot_cpu_data.x86_model )
> +    {
> +        /*
> +         * Core processors since at least Nehalem are vulnerable.
> +         */
> +    case 0x1e: /* Nehalem */
> +    case 0x1f: /* Auburndale / Havendale */
> +    case 0x1a: /* Nehalem EP */
> +    case 0x2e: /* Nehalem EX */
> +    case 0x25: /* Westmere */
> +    case 0x2c: /* Westmere EP */
> +    case 0x2f: /* Westmere EX */
> +    case 0x2a: /* SandyBridge */
> +    case 0x2d: /* SandyBridge EP/EX */
> +    case 0x3a: /* IvyBridge */
> +    case 0x3e: /* IvyBridge EP/EX */
> +    case 0x3c: /* Haswell */
> +    case 0x3f: /* Haswell EX/EP */
> +    case 0x45: /* Haswell D */
> +    case 0x46: /* Haswell H */
> +    case 0x3d: /* Broadwell */
> +    case 0x47: /* Broadwell H */
> +    case 0x4f: /* Broadwell EP/EX */
> +    case 0x56: /* Broadwell D */
> +    case 0x4e: /* Skylake M */
> +    case 0x55: /* Skylake X */
> +    case 0x5e: /* Skylake D */
> +    case 0x66: /* Cannonlake */
> +    case 0x67: /* Cannonlake? */
> +    case 0x8e: /* Kabylake M */
> +    case 0x9e: /* Kabylake D */
> +        cpu_has_bug_stale_seg = true;
> +        break;
> +
> +        /*
> +         * Atom processors are not vulnerable.
> +         */
> +    case 0x1c: /* Pineview */
> +    case 0x26: /* Lincroft */
> +    case 0x27: /* Penwell */
> +    case 0x35: /* Cloverview */
> +    case 0x36: /* Cedarview */
> +    case 0x37: /* Baytrail / Valleyview (Silvermont) */
> +    case 0x4d: /* Avaton / Rangely (Silvermont) */
> +    case 0x4c: /* Cherrytrail / Brasswell */
> +    case 0x4a: /* Merrifield */
> +    case 0x5a: /* Moorefield */
> +    case 0x5c: /* Goldmont */
> +    case 0x5f: /* Denverton */
> +    case 0x7a: /* Gemini Lake */
> +        break;
> +
> +        /*
> +         * Knights processors are not vulnerable.
> +         */
> +    case 0x57: /* Knights Landing */
> +    case 0x85: /* Knights Mill */
> +        break;
> +
> +    default:
> +        printk("Unrecognised CPU model %#x - assuming vulnerable to StaleSeg\n",
> +               boot_cpu_data.x86_model);
> +        break;

... the plain "break" here are not in line with the log message text.
Did you mean to add "not"?

Jan

Thread overview:
2019-08-09 17:16 [Xen-devel] [PATCH] x86/spec-ctrl: Scrub stale segment registers on leaky hardware Andrew Cooper
2019-08-12  8:00 ` Jan Beulich [this message]
2019-08-12 12:30   ` Andrew Cooper
