Re: [PATCH 1/2] x86/microcode: Synchronize late microcode loading

From: "Jan Beulich" <JBeulich@suse.com>
To: Chao Gao <chao.gao@intel.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>,
	Kevin Tian <kevin.tian@intel.com>,
	Ashok Raj <ashok.raj@intel.com>,
	Jun Nakajima <jun.nakajima@intel.com>,
	xen-devel@lists.xen.org
Subject: Re: [PATCH 1/2] x86/microcode: Synchronize late microcode loading
Date: Fri, 13 Apr 2018 09:49:17 -0600	[thread overview]
Message-ID: <5AD0D17D02000078001BB298@prv1-mh.provo.novell.com> (raw)
In-Reply-To: <1522393141-90344-1-git-send-email-chao.gao@intel.com>

>>> On 30.03.18 at 08:59, <chao.gao@intel.com> wrote:
> @@ -281,24 +287,52 @@ static int microcode_update_cpu(const void *buf, size_t size)
>      return err;
>  }
>  
> -static long do_microcode_update(void *_info)
> +static int __wait_for_cpus(atomic_t *cnt, int timeout)

No new double-underscore prefixed functions please.

>  {
> -    struct microcode_info *info = _info;
> -    int error;
> +    int cpus = num_online_cpus();

unsigned int

> -    BUG_ON(info->cpu != smp_processor_id());
> +    atomic_inc(cnt);
>  
> -    error = microcode_update_cpu(info->buffer, info->buffer_size);
> -    if ( error )
> -        info->error = error;
> +    while (atomic_read(cnt) != cpus)

There are a number of style issues in the patch, mostly (like here) missing
blanks.

> +    {
> +        if ( timeout <= 0 )
> +        {
> +            printk("Timeout when waiting for CPUs calling in\n");
> +            return -1;
> +        }
> +        udelay(1);
> +        timeout--;
> +    }
>  
> -    info->cpu = cpumask_next(info->cpu, &cpu_online_map);
> -    if ( info->cpu < nr_cpu_ids )
> -        return continue_hypercall_on_cpu(info->cpu, do_microcode_update, info);
> +    return 0;
> +}
>  
> -    error = info->error;
> -    xfree(info);
> -    return error;
> +static int do_microcode_update(void *_info)
> +{
> +    struct microcode_info *info = _info;
> +    int error, ret = 0;
> +
> +    error = __wait_for_cpus(&info->cpu_in, USEC_PER_SEC);

Why this long a timeout here?

> +    if ( error )
> +    {
> +        ret = -EBUSY;
> +        return ret;
> +    }
> +
> +    error = microcode_update_cpu(info->buffer, info->buffer_size);
> +    if ( error && !ret )
> +        ret = error;
> +    /*
> +     * Increase the wait timeout to a safe value here since we're serializing
> +     * the microcode update and that could take a while on a large number of
> +     * CPUs. And that is fine as the *actual* timeout will be determined by
> +     * the last CPU finished updating and thus cut short
> +     */
> +    error = __wait_for_cpus(&info->cpu_out, USEC_PER_SEC * num_online_cpus());

And this one's even worse, in particular on huge systems. I'm afraid such a long
period of time in stop-machine context is going to confuse most of the running
domains (including Dom0). There's nothing inherently wrong with e.g. processing
the updates on distinct cores (and even more so on distinct sockets) in parallel.
Therefore revising the locking in microcode_update_cpu() might be a necessary
prereq step. Or alternatively you may need to demand that no other running
domains exist besides Dom0 (and hope the best for Dom0 itself).

I also don't think there's any point invoking the operation on all HT threads on a
core, but I realize stop_machine_run() isn't flexible enough to allow such.

Jan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel