Re: [PATCH v3 03/22] x86/xstate: re-size save area when CPUID policy changes

From: Jan Beulich <jbeulich@suse.com>
To: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: "George Dunlap" <george.dunlap@citrix.com>,
	"Ian Jackson" <iwj@xenproject.org>,
	"Julien Grall" <julien@xen.org>,
	"Stefano Stabellini" <sstabellini@kernel.org>,
	"Wei Liu" <wl@xen.org>, "Roger Pau Monné" <roger.pau@citrix.com>,
	"xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>
Subject: Re: [PATCH v3 03/22] x86/xstate: re-size save area when CPUID policy changes
Date: Mon, 3 May 2021 16:22:38 +0200	[thread overview]
Message-ID: <f515fdfb-d1a6-56d8-5db3-ebddeed23806@suse.com> (raw)
In-Reply-To: <5a954be8-e213-36d8-27da-4c51243dc280@citrix.com>

On 03.05.2021 15:57, Andrew Cooper wrote:
> On 22/04/2021 15:44, Jan Beulich wrote:
>> vCPU-s get maximum size areas allocated initially. Hidden (and in
>> particular default-off) features may allow for a smaller size area to
>> suffice.
>>
>> Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> ---
>> v2: Use 1ul instead of 1ull. Re-base.
>> ---
>> This could be further shrunk if we used XSAVEC / if we really used
>> XSAVES, as then we don't need to also cover the holes. But since we
>> currently use neither of the two in reality, this would require more
>> work than just adding the alternative size calculation here.
>>
>> Seeing that both vcpu_init_fpu() and cpuid_policy_updated() get called
>> from arch_vcpu_create(), I'm not sure we really need this two-stage
>> approach - the slightly longer period of time during which
>> v->arch.xsave_area would remain NULL doesn't look all that problematic.
>> But since xstate_alloc_save_area() gets called for idle vCPU-s, it has
>> to stay anyway in some form, so the extra code churn may not be worth
>> it.
>>
>> Instead of cpuid_policy_xcr0_max(), cpuid_policy_xstates() may be the
>> interface to use here. But it remains to be determined whether the
>> xcr0_accum field is meant to be inclusive of XSS (in which case it would
>> better be renamed) or exclusive. Right now there's no difference as we
>> don't support any XSS-controlled features.
> 
> I've been figuring out what we need to for supervisors states.  The
> current code is not in a good shape, but I also think some of the
> changes in this series take us in an unhelpful direction.

From reading through the rest your reply I'm not sure I see what you
mean. ORing in host_xss at certain points shouldn't be a big deal.

> I've got a cleanup series which I will post shortly.  It interacts
> texturally although not fundamentally with this series, but does fix
> several issues.
> 
> For supervisor states, we need use XSAVES unilaterally, even for PV. 
> This is because XSS_CET_S needs to be the HVM kernel's context, or Xen's
> in PV context (specifically, MSR_PL0_SSP which is the shstk equivalent
> of TSS.RSP0).
> 
> 
> A consequence is that Xen's data handling shall use the compressed
> format, and include supervisor states.  (While in principle we could
> manage CET_S, CET_U, and potentially PT when vmtrace gets expanded, each
> WRMSR there is a similar order of magnitude to an XSAVES/XRSTORS
> instruction.)

I agree.

> I'm planning a host xss setting, similar to mmu_cr4_features, which
> shall be the setting in context for everything other than HVM vcpus
> (which need the guest setting in context, and/or the VT-x bodge to
> support host-only states).  Amongst other things, all context switch
> paths, including from-HVM, need to step XSS up to the host setting to
> let XSAVES function correctly.
> 
> However, a consequence of this is that the size of the xsave area needs
> deriving from host, as well as guest-max state.  i.e. even if some VMs
> aren't using CET, we still need space in the xsave areas to function
> correctly when a single VM is using CET.

Right - as said above, taking this into consideration here shouldn't
be overly problematic.

> Another consequence is that we need to rethink our hypercall behaviour. 
> There is no such thing as supervisor states in an uncompressed XSAVE
> image, which means we can't continue with that being the ABI.

I don't think the hypercall input / output blob needs to follow any
specific hardware layout.

> I've also found some substantial issues with how we handle
> xcr0/xcr0_accum and plan to address these.  There is no such thing as
> xcr0 without the bottom bit set, ever, and xcr0_accum needs to default
> to X87|SSE seeing as that's how we use it anyway.  However, in a context
> switch, I expect we'll still be using xcr0_accum | host_xss when it
> comes to the context switch path.

Right, and to avoid confusion I think we also want to move from
xcr0_accum to e.g. xstate_accum, covering both XCR0 and XSS parts
all in one go.

> In terms of actual context switching, we want to be using XSAVES/XRSTORS
> whenever it is available, even if we're not using supervisor states. 
> XSAVES has both the inuse and modified optimisations, without the broken
> consequence of XSAVEOPT (which is firmly in the "don't ever use this"
> bucket now).

The XSAVEOPT anomaly is affecting user mode only, isn't it? Or are
you talking of something I have forgot about?

> There's no point ever using XSAVEC.  There is no hardware where it
> exists in the absence of XSAVES, and can't even in theoretical
> circumstances due to (perhaps unintentional) linkage of the CPUID data. 
> XSAVEC also doesn't use the modified optimisation, and is therefore
> strictly worse than XSAVES, even when MSR_XSS is 0.
> 
> Therefore, our choice of instruction wants to be XSAVES, or XSAVE, or
> FXSAVE, depending on hardware capability.

Makes sense to me (perhaps - see above - minus your omission of
XSAVEOPT here).

Jan