Re: [Xen-devel] [PATCH v7 08/10] x86/microcode: Synchronize late microcode loading

From: Chao Gao <chao.gao@intel.com>
To: Jan Beulich <JBeulich@suse.com>
Cc: Sergey Dyasli <sergey.dyasli@citrix.com>,
	Kevin Tian <kevin.tian@intel.com>,
	Ashok Raj <ashok.raj@intel.com>, WeiLiu <wl@xen.org>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	Jun Nakajima <jun.nakajima@intel.com>,
	xen-devel <xen-devel@lists.xenproject.org>,
	tglx@linutronix.de, Borislav Petkov <bp@suse.de>,
	Roger Pau Monne <roger.pau@citrix.com>
Subject: Re: [Xen-devel] [PATCH v7 08/10] x86/microcode: Synchronize late microcode loading
Date: Tue, 11 Jun 2019 20:36:17 +0800	[thread overview]
Message-ID: <20190611123615.GA22930@gao-cwp> (raw)
In-Reply-To: <5CF7CD2702000078002358F4@prv1-mh.provo.novell.com>

On Wed, Jun 05, 2019 at 08:09:43AM -0600, Jan Beulich wrote:
>>>> On 27.05.19 at 10:31, <chao.gao@intel.com> wrote:
>> This patch ports microcode improvement patches from linux kernel.
>> 
>> Before you read any further: the early loading method is still the
>> preferred one and you should always do that. The following patch is
>> improving the late loading mechanism for long running jobs and cloud use
>> cases.
>> 
>> Gather all cores and serialize the microcode update on them by doing it
>> one-by-one to make the late update process as reliable as possible and
>> avoid potential issues caused by the microcode update.
>> 
>> Signed-off-by: Chao Gao <chao.gao@intel.com>
>> Tested-by: Chao Gao <chao.gao@intel.com>
>> [linux commit: a5321aec6412b20b5ad15db2d6b916c05349dbff]
>> [linux commit: bb8c13d61a629276a162c1d2b1a20a815cbcfbb7]
>> Cc: Kevin Tian <kevin.tian@intel.com>
>> Cc: Jun Nakajima <jun.nakajima@intel.com>
>> Cc: Ashok Raj <ashok.raj@intel.com>
>> Cc: Borislav Petkov <bp@suse.de>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
>> Cc: Jan Beulich <jbeulich@suse.com>
>> ---
>> Changes in v7:
>>  - Check whether 'timeout' is 0 rather than "<=0" since it is unsigned int.
>>  - reword the comment above microcode_update_cpu() to clearly state that
>>  one thread per core should do the update.
>> 
>> Changes in v6:
>>  - Use one timeout period for rendezvous stage and another for update stage.
>>  - scale time to wait by the number of remaining cpus to respond.
>>    It helps to find something wrong earlier and thus we can reboot the
>>    system earlier.
>> ---
>>  xen/arch/x86/microcode.c | 171 ++++++++++++++++++++++++++++++++++++++++++-----
>>  1 file changed, 155 insertions(+), 16 deletions(-)
>> 
>> diff --git a/xen/arch/x86/microcode.c b/xen/arch/x86/microcode.c
>> index 23cf550..f4a417e 100644
>> --- a/xen/arch/x86/microcode.c
>> +++ b/xen/arch/x86/microcode.c
>> @@ -22,6 +22,7 @@
>>   */
>>  
>>  #include <xen/cpu.h>
>> +#include <xen/cpumask.h>
>
>It seems vanishingly unlikely that you would need this explicit #include
>here, but it certainly isn't wrong.
>
>> @@ -270,31 +296,90 @@ bool microcode_update_cache(struct microcode_patch *patch)
>>      return true;
>>  }
>>  
>> -static long do_microcode_update(void *patch)
>> +/* Wait for CPUs to rendezvous with a timeout (us) */
>> +static int wait_for_cpus(atomic_t *cnt, unsigned int expect,
>> +                         unsigned int timeout)
>>  {
>> -    int error, cpu;
>> -
>> -    error = microcode_update_cpu(patch);
>> -    if ( error )
>> +    while ( atomic_read(cnt) < expect )
>>      {
>> -        microcode_ops->free_patch(microcode_cache);
>> -        return error;
>> +        if ( !timeout )
>> +        {
>> +            printk("CPU%d: Timeout when waiting for CPUs calling in\n",
>> +                   smp_processor_id());
>> +            return -EBUSY;
>> +        }
>> +        udelay(1);
>> +        timeout--;
>>      }
>
>There's no comment here and nothing in the description: I don't
>recall clarification as to whether RDTSC is fine to be issued by a
>thread when ucode is being updated by another thread on the
>same core.

Yes. I think it is fine.

Ashok, could you share your opinion on this question?

>
>> +static int do_microcode_update(void *patch)
>> +{
>> +    unsigned int cpu = smp_processor_id();
>> +    unsigned int cpu_nr = num_online_cpus();
>> +    unsigned int finished;
>> +    int ret;
>> +    static bool error;
>>  
>> -    microcode_update_cache(patch);
>> +    atomic_inc(&cpu_in);
>> +    ret = wait_for_cpus(&cpu_in, cpu_nr, MICROCODE_CALLIN_TIMEOUT_US);
>> +    if ( ret )
>> +        return ret;
>>  
>> -    return error;
>> +    ret = microcode_ops->collect_cpu_info(&this_cpu(cpu_sig));
>> +    /*
>> +     * Load microcode update on only one logical processor per core.
>> +     * Here, among logical processors of a core, the one with the
>> +     * lowest thread id is chosen to perform the loading.
>> +     */
>> +    if ( !ret && (cpu == cpumask_first(per_cpu(cpu_sibling_mask, cpu))) )
>
>At the very least it's not obvious whether this hyper-threading-centric
>view ("logical processor") also applies to AMD's compute unit model
>(which reuses cpu_sibling_mask). It does, as the respective MSRs are
>per-compute-unit rather than per-core, but I'd appreciate if the
>wording could be adjusted to explicitly name both cases (multiple
>threads per core and multiple cores per CU).

OK. Will do

>
>> +    {
>> +        ret = microcode_ops->apply_microcode(patch);
>> +        if ( !ret )
>> +            atomic_inc(&cpu_updated);
>> +    }
>> +    /*
>> +     * Increase the wait timeout to a safe value here since we're serializing
>
>I'm struggling with the "increase": I don't see anything being increased
>here. You simply use a larger timeout than above.
>
>> +     * the microcode update and that could take a while on a large number of
>> +     * CPUs. And that is fine as the *actual* timeout will be determined by
>> +     * the last CPU finished updating and thus cut short
>> +     */
>> +    atomic_inc(&cpu_out);
>> +    finished = atomic_read(&cpu_out);
>> +    while ( !error && finished != cpu_nr )
>> +    {
>> +        /*
>> +         * During each timeout interval, at least a CPU is expected to
>> +         * finish its update. Otherwise, something goes wrong.
>> +         */
>> +        if ( wait_for_cpus(&cpu_out, finished + 1,
>> +                           MICROCODE_UPDATE_TIMEOUT_US) && !error )
>> +        {
>> +            error = true;
>> +            panic("Timeout when finishing updating microcode (finished %d/%d)",
>> +                  finished, cpu_nr);
>
>Why the setting of "error" when you panic anyway?
>
>And please use format specifiers matching the types of the
>further arguments (i.e. twice %u here, but please check other
>code as well).
>
>Furthermore (and I'm sure I've given this comment before) if
>you really hit the limit, how many panic() invocations are there
>going to be? You run this function on all CPUs after all.

"error" is to avoid calling of panic() on multiple CPUs simultaneously.
Roger is right: atomic primitives should be used here.

>
>On the whole, taking a 256-thread system as example, you
>allow the whole process to take over 4 min without calling
>panic().
>Leaving aside guests, I don't think Xen itself would
>survive this in all cases. We've found the need to process
>softirqs with far smaller delays, in particular from key handlers
>producing lots of output. At the very least there should be a
>bold warning logged if the system had been in stop-machine
>state for, say, longer than 100ms (value subject to discussion).
>

In theory, if you mean 256 cores, yes. Do you think a configurable and
run-time changeable upper bound for the whole process can address your
concern? The default value for this upper bound can be set to a large
value (for example, 1s * the number of online core) and the admin can
ajust/lower the upper bound according to the way (serial or parallel) to
perform the update and other requirements. Once the upper bound is
reached, we would call panic().

>> +        }
>> +
>> +        finished = atomic_read(&cpu_out);
>> +    }
>> +
>> +    /*
>> +     * Refresh CPU signature (revision) on threads which didn't call
>> +     * apply_microcode().
>> +     */
>> +    if ( cpu != cpumask_first(per_cpu(cpu_sibling_mask, cpu)) )
>> +        ret = microcode_ops->collect_cpu_info(&this_cpu(cpu_sig));
>
>Another option would be for the CPU doing the update to simply
>propagate the new value to all its siblings' cpu_sig values.

Will do.

>
>> @@ -337,12 +429,59 @@ int microcode_update(XEN_GUEST_HANDLE_PARAM(const_void) buf, unsigned long len)
>>          if ( patch )
>>              microcode_ops->free_patch(patch);
>>          ret = -EINVAL;
>> -        goto free;
>> +        goto put;
>>      }
>>  
>> -    ret = continue_hypercall_on_cpu(cpumask_first(&cpu_online_map),
>> -                                    do_microcode_update, patch);
>> +    atomic_set(&cpu_in, 0);
>> +    atomic_set(&cpu_out, 0);
>> +    atomic_set(&cpu_updated, 0);
>> +
>> +    /* Calculate the number of online CPU core */
>> +    nr_cores = 0;
>> +    for_each_online_cpu(cpu)
>> +        if ( cpu == cpumask_first(per_cpu(cpu_sibling_mask, cpu)) )
>> +            nr_cores++;
>> +
>> +    printk(XENLOG_INFO "%d cores are to update their microcode\n", nr_cores);
>> +
>> +    /*
>> +     * We intend to disable interrupt for long time, which may lead to
>> +     * watchdog timeout.
>> +     */
>> +    watchdog_disable();
>> +    /*
>> +     * Late loading dance. Why the heavy-handed stop_machine effort?
>> +     *
>> +     * - HT siblings must be idle and not execute other code while the other
>> +     *   sibling is loading microcode in order to avoid any negative
>> +     *   interactions cause by the loading.
>> +     *
>> +     * - In addition, microcode update on the cores must be serialized until
>> +     *   this requirement can be relaxed in the future. Right now, this is
>> +     *   conservative and good.
>> +     */
>> +    ret = stop_machine_run(do_microcode_update, patch, NR_CPUS);
>> +    watchdog_enable();
>> +
>> +    if ( atomic_read(&cpu_updated) == nr_cores )
>> +    {
>> +        spin_lock(&microcode_mutex);
>> +        microcode_update_cache(patch);
>> +        spin_unlock(&microcode_mutex);
>> +    }
>> +    else if ( atomic_read(&cpu_updated) == 0 )
>> +        microcode_ops->free_patch(patch);
>> +    else
>> +    {
>> +        printk("Updating microcode succeeded on part of CPUs and failed on\n"
>> +               "others due to an unknown reason. A system with different\n"
>> +               "microcode revisions is considered unstable. Please reboot and\n"
>> +               "do not load the microcode that triggers this warning\n");
>> +        microcode_ops->free_patch(patch);
>> +    }
>
>As said on an earlier patch, I think the cache can be updated if at
>least one CPU loaded the blob successfully. Additionally I'd like to
>ask that you log the number of successfully updated cores. And
>finally perhaps "differing" instead of "different" and omit "due to
>an unknown reason"?

Will do.

Thanks
Chao

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel