Re: [PATCH 5/5] x86/cpuid: Fix handling of xsave dynamic leaves

From: Andrew Cooper <andrew.cooper3@citrix.com>
To: Jan Beulich <jbeulich@suse.com>
Cc: "Roger Pau Monné" <roger.pau@citrix.com>, "Wei Liu" <wl@xen.org>,
	Xen-devel <xen-devel@lists.xenproject.org>
Subject: Re: [PATCH 5/5] x86/cpuid: Fix handling of xsave dynamic leaves
Date: Wed, 5 May 2021 17:59:28 +0100	[thread overview]
Message-ID: <7ce89520-7cca-6c9f-f229-37515116d74c@citrix.com> (raw)
In-Reply-To: <d951dc24-e613-8a1d-13ea-b1e439048165@suse.com>

On 05/05/2021 09:33, Jan Beulich wrote:
> On 04.05.2021 16:17, Andrew Cooper wrote:
>> On 04/05/2021 13:56, Jan Beulich wrote:
>>> On 03.05.2021 17:39, Andrew Cooper wrote:
>>>> +unsigned int xstate_compressed_size(uint64_t xstates)
>>>> +{
>>>> +    unsigned int i, size = XSTATE_AREA_MIN_SIZE;
>>>> +
>>>> +    xstates &= ~XSTATE_FP_SSE;
>>>> +    for_each_set_bit ( i, &xstates, 63 )
>>>> +    {
>>>> +        if ( test_bit(i, &xstate_align) )
>>>> +            size = ROUNDUP(size, 64);
>>>> +
>>>> +        size += xstate_sizes[i];
>>>> +    }
>>>> +
>>>> +    /* In debug builds, cross-check our calculation with hardware. */
>>>> +    if ( IS_ENABLED(CONFIG_DEBUG) )
>>>> +    {
>>>> +        unsigned int hwsize;
>>>> +
>>>> +        xstates |= XSTATE_FP_SSE;
>>>> +        hwsize = hw_compressed_size(xstates);
>>>> +
>>>> +        if ( size != hwsize )
>>>> +            printk_once(XENLOG_ERR "%s(%#"PRIx64") size %#x != hwsize %#x\n",
>>>> +                        __func__, xstates, size, hwsize);
>>>> +        size = hwsize;
>>> To be honest, already on the earlier patch I was wondering whether
>>> it does any good to override size here: That'll lead to different
>>> behavior on debug vs release builds. If the log message is not
>>> paid attention to, we'd then end up with longer term breakage.
>> Well - our options are pass hardware size, or BUG(), because getting
>> this wrong will cause memory corruption.
> I'm afraid I'm lost: Neither passing hardware size nor BUG() would
> happen in a release build, so getting this wrong does mean memory
> corruption there. And I'm of the clear opinion that debug builds
> shouldn't differ in behavior in such regards.

The point of not cross-checking with hardware in release builds is to
remove a bunch of very expensive operations from path which is hit
several times on every fork(), so isn't exactly rare.

But yes - the consequence of being wrong, for whatever reason, is memory
corruption (especially as the obvious way it goes wrong is having an
xsave_size[] of 0, so we end up under-reporting).

So one way or another, we need to ensure that every newly exposed option
is tested in a debug build first.  The integration is either complete,
or it isn't, so I don't think this terribly onerous to do.

As for actually having a behaviour difference between debug and release,
I'm not concerned.

Hitting this failure means "there is definitely a serious error in Xen,
needing fixing", but it will only be encountered the during development
of a new feature, so is for all practical concerns, limited to
development of the new feature in question.

BUG() gets the point across very obviously, but is unhelpful when in
practice the test system wont explode because the dom0 kernel won't be
using this new state yet.

> If there wasn't an increasing number of possible combinations I
> would be inclined to suggest that in all builds we check during
> boot that calculation and hardware provided values match for all
> possible (valid) combinations.

I was actually considering an XTF test on the matter, which would be a
cleaning up of the one generating the example in the cover letter.

As above, logic is only liable to be wrong during development of support
for a new state component, which is why it is reasonable to take away
the overhead in release builds.

>> The BUG() option is a total pain when developing new support - the first
>> version of this patch did use BUG(), but after hitting it 4 times in a
>> row (caused by issues with constants elsewhere), I decided against it.
> I can fully understand this aspect. I'm not opposed to printk_once()
> at all. My comment was purely towards the override.
>
>> If we had something which was a mix between WARN_ONCE() and a proper
>> printk() explaining what was going on, then I would have used that. 
>> Maybe it's time to introduce one...
> I don't think there's a need here - what you have in terms of info
> put into the log is imo sufficient.

Well - it needs to be sufficiently obvious to people who aren't you and
me.  There are also other areas in Xen which would benefit from changing
their diagnostics to be as described.

~Andrew