Re: [Xen-devel] [PATCH v7 06/10] microcode: split out apply_microcode() from cpu_request_microcode()

From: "Jan Beulich" <JBeulich@suse.com>
To: "Chao Gao" <chao.gao@intel.com>
Cc: Sergey Dyasli <sergey.dyasli@citrix.com>,
	Ashok Raj <ashok.raj@intel.com>, WeiLiu <wl@xen.org>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	xen-devel <xen-devel@lists.xenproject.org>,
	Roger Pau Monne <roger.pau@citrix.com>
Subject: Re: [Xen-devel] [PATCH v7 06/10] microcode: split out apply_microcode() from cpu_request_microcode()
Date: Tue, 11 Jun 2019 01:08:36 -0600	[thread overview]
Message-ID: <5CFF53740200007800236D68@prv1-mh.provo.novell.com> (raw)
In-Reply-To: <20190611033240.GA26361@gao-cwp>

>>> On 11.06.19 at 05:32, <chao.gao@intel.com> wrote:
> On Wed, Jun 05, 2019 at 06:37:27AM -0600, Jan Beulich wrote:
>>>>> On 27.05.19 at 10:31, <chao.gao@intel.com> wrote:
>>> During late microcode update, apply_microcode() is invoked in
>>> cpu_request_microcode(). To make late microcode update more reliable,
>>> we want to put the apply_microcode() into stop_machine context. So
>>> we split out it from cpu_request_microcode(). As a consequence,
>>> apply_microcode() should be invoked explicitly in the common code.
>>> 
>>> Previously, apply_microcode() gets the microcode patch to be applied from
>>> the microcode cache. Now, the patch is passed as a function argument and
>>> a patch is cached for cpu-hotplug and cpu resuming, only after it has
>>> been loaded to a cpu without any error. As a consequence, the
>>> 'match_cpu' check in microcode_update_cache is removed, which otherwise
>>> would fail.
>>
>>The "only after it has been loaded to a cpu without any error" is a
>>problem, precisely for the case where ucode on the different cores
>>is not in sync initially. I would actually like to put up this question:
>>When a core has no ucode loaded at all yet and only strictly older
>>(than loaded on some other cores) ucode is found to be available,
>>whether then it wouldn't still be better to apply that ucode to
>>_at least_ the cores that have none loaded yet.
> 
> Yes, it is better for this special case. And I agree to support this case.
> 
> This in v7, a patch is loaded only if its revision is newer than that
> loaded to current CPU. And it is stored only if it has been loaded
> successfully. But, as you described, a broken bios might puts the system
> in an inconsistent state (multiple microcode revision in the system) and
> furthermore in this case, if no or an older microcode update is
> provided, early loading cannot get the system into sane state. So for
> both early and late microcode loading, we could face a situation that
> the patch to be loaded has equal or old revision than microcode of some
> CPUs.
> 
> Changes I plan to make in next version are:
> 1. For early microcode, a patch would be stored if it covers current CPU's
> signature. All CPUs would try to load from the cache.
> 2. For late microcode, a patch is loaded only if its revision is newer than
> *the patch cached*. And it is stored only if has been loaded without an
> "EIO" error.
> 3. Cache replacement remains the same.

Why the difference between early and late loading?

> But it is a temperary solution, especially for CSPs. A better way
> might be getting the newest ucode or upgrading to the newest bios,
> even downgrading the bios to an older version which wouldn't put
> the system into "insane" state.

On the quoted system, all BIOS versions I've ever been provided
had the same odd behavior.

>>> +static int microcode_update_cpu(struct microcode_patch *patch)
>>>  {
>>> -    int err;
>>> -    struct cpu_signature *sig = &this_cpu(cpu_sig);
>>> +    int ret = microcode_ops->collect_cpu_info(&this_cpu(cpu_sig));
>>>  
>>> -    if ( !microcode_ops )
>>> -        return 0;
>>> +    if ( unlikely(ret) )
>>> +        return ret;
>>>  
>>>      spin_lock(&microcode_mutex);
>>>  
>>> -    err = microcode_ops->collect_cpu_info(sig);
>>> -    if ( likely(!err) )
>>> -        err = microcode_ops->apply_microcode();
>>> -    spin_unlock(&microcode_mutex);
>>> +    if ( patch )
>>> +    {
>>> +        /*
>>> +         * If a patch is specified, it should has newer revision than
>>> +         * that of the patch cached.
>>> +         */
>>> +        if ( microcode_cache &&
>>> +             microcode_ops->compare_patch(patch, microcode_cache) != NEW_UCODE )
>>> +        {
>>> +            spin_unlock(&microcode_mutex);
>>> +            return -EINVAL;
>>> +        }
>>>  
>>> -    return err;
>>> -}
>>> +        ret = microcode_ops->apply_microcode(patch);
>>
>>There's no printk() here but ...
>>
>>> +    }
>>> +    else if ( microcode_cache )
>>> +    {
>>> +        ret = microcode_ops->apply_microcode(microcode_cache);
>>> +        if ( ret == -EIO )
>>> +            printk("Update failed. Reboot needed\n");
>>
>>... you emit a log message here. Why the difference? And wouldn't
>>it be better to have just a single call to ->apply anyway, by simply
>>assigning "microcode_cache" to "patch" and moving the call a little
>>further down?
> 
> -EIO means we loaded the patch but the revision didn't change. This
> error code indicates an error in the patch. It is unlikely to happen.
> And in this v7, a patch is stored after being loading successfully
> on a CPU. To simplify handling of load failure and avoid cleanup to the
> global cache (if a patch has a success rate, we need to consider which
> one to save when we have a new patch with 95% rate and an old one with
> 100% rate, it would be really complex), we assume that loading the cache
> (being loaded successfully on a CPU) shouldn't return EIO. Otherwise,
> an error message is prompted.

But then the log message itself needs to be more specific. That'll
then either allow to understand why one is wanted here but not
elsewhere, or additionally a comment should be attached.

>>> --- a/xen/arch/x86/smpboot.c
>>> +++ b/xen/arch/x86/smpboot.c
>>> @@ -363,10 +363,7 @@ void start_secondary(void *unused)
>>>  
>>>      initialize_cpu_data(cpu);
>>>  
>>> -    if ( system_state <= SYS_STATE_smp_boot )
>>> -        early_microcode_update_cpu(false);
>>> -    else
>>> -        microcode_resume_cpu();
>>> +    early_microcode_update_cpu();
>>
>>I'm struggling to understand how you get away without the "false"
>>argument that was passed here before. You look to now be calling
>>->start_update() unconditionally (so long as the hook is not NULL),
>>which I don't think is correct. This should be called only once by
>>the CPU _leading_ an update (the BSP during boot, and the CPU
>>the hypercall gets invoked on (or the first CPU an update gets
>>issued one) for a late update. Am I missing something?
> 
> BSP and APs call different functions, early_microcode_parse_and_update_cpu()
> and early_microcode_update_cpu() respectively. The latter won't call
> ->start_update().

Oh, I see - I'm sorry. I've been mislead by the patch context
indicators. But still, by their names there's not supposed to be
such a difference. I wonder whether we wouldn't better stick
to just one function (early_microcode_update_cpu()), having
it take a boolean as is the case now. That would control both
parsing and ->start_update() invocation.

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel