All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/2] x86/microcode: Synchronize late microcode loading
@ 2018-03-30  6:59 Chao Gao
  2018-03-30  6:59 ` [PATCH 2/2] x86/microcode: Do not upload microcode if CPUs are offline Chao Gao
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Chao Gao @ 2018-03-30  6:59 UTC (permalink / raw)
  To: xen-devel
  Cc: Kevin Tian, Ashok Raj, Andrew Cooper, Jan Beulich, Jun Nakajima,
	Thomas Gleixner, Borislav Petkov, Gao Chao

From: Gao Chao <chao.gao@intel.com>

This patch is to backport microcode improvement patches from linux
kernel. Below are the original patches description:

    commit a5321aec6412b20b5ad15db2d6b916c05349dbff
    Author: Ashok Raj <ashok.raj@intel.com>
    Date:   Wed Feb 28 11:28:46 2018 +0100

	x86/microcode: Synchronize late microcode loading

	Original idea by Ashok, completely rewritten by Borislav.

	Before you read any further: the early loading method is still the
	preferred one and you should always do that. The following patch is
	improving the late loading mechanism for long running jobs and cloud use
	cases.

	Gather all cores and serialize the microcode update on them by doing it
	one-by-one to make the late update process as reliable as possible and
	avoid potential issues caused by the microcode update.

	[ Borislav: Rewrite completely. ]

	Co-developed-by: Borislav Petkov <bp@suse.de>
	Signed-off-by: Ashok Raj <ashok.raj@intel.com>
	Signed-off-by: Borislav Petkov <bp@suse.de>
	Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
	Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
	Tested-by: Ashok Raj <ashok.raj@intel.com>
	Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
	Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com>
	Link: https://lkml.kernel.org/r/20180228102846.13447-8-bp@alien8.de

    commit bb8c13d61a629276a162c1d2b1a20a815cbcfbb7
    Author: Borislav Petkov <bp@suse.de>
    Date:   Wed Mar 14 19:36:15 2018 +0100

	x86/microcode: Fix CPU synchronization routine

	Emanuel reported an issue with a hang during microcode update because my
	dumb idea to use one atomic synchronization variable for both rendezvous
	- before and after update - was simply bollocks:

	  microcode: microcode_reload_late: late_cpus: 4
	  microcode: __reload_late: cpu 2 entered
	  microcode: __reload_late: cpu 1 entered
	  microcode: __reload_late: cpu 3 entered
	  microcode: __reload_late: cpu 0 entered
	  microcode: __reload_late: cpu 1 left
	  microcode: Timeout while waiting for CPUs rendezvous, remaining: 1

	CPU1 above would finish, leave and the others will still spin waiting for
	it to join.

	So do two synchronization atomics instead, which makes the code a lot more
	straightforward.

	Also, since the update is serialized and it also takes quite some time per
	microcode engine, increase the exit timeout by the number of CPUs on the
	system.

	That's ok because the moment all CPUs are done, that timeout will be cut
	short.

	Furthermore, panic when some of the CPUs timeout when returning from a
	microcode update: we can't allow a system with not all cores updated.

	Also, as an optimization, do not do the exit sync if microcode wasn't
	updated.

	Reported-by: Emanuel Czirai <xftroxgpx@protonmail.com>
	Signed-off-by: Borislav Petkov <bp@suse.de>
	Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
	Tested-by: Emanuel Czirai <xftroxgpx@protonmail.com>
	Tested-by: Ashok Raj <ashok.raj@intel.com>
	Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
	Link: https://lkml.kernel.org/r/20180314183615.17629-2-bp@alien8.de

This patch is also in accord with Andrew's suggestion,
"Rendezvous all online cpus in an IPI to apply the patch, and keep the
processors in until all have completed the patch.", in [1].

[1]:https://wiki.xenproject.org/wiki/XenParavirtOps/microcode_update#Run_time_microcode_updates

Signed-off-by: Chao Gao <chao.gao@intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jun Nakajima <jun.nakajima@intel.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 xen/arch/x86/microcode.c | 89 +++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 73 insertions(+), 16 deletions(-)

diff --git a/xen/arch/x86/microcode.c b/xen/arch/x86/microcode.c
index 4163f50..94c1ca2 100644
--- a/xen/arch/x86/microcode.c
+++ b/xen/arch/x86/microcode.c
@@ -22,6 +22,7 @@
  */
 
 #include <xen/cpu.h>
+#include <xen/cpumask.h>
 #include <xen/lib.h>
 #include <xen/kernel.h>
 #include <xen/init.h>
@@ -30,15 +31,20 @@
 #include <xen/smp.h>
 #include <xen/softirq.h>
 #include <xen/spinlock.h>
+#include <xen/stop_machine.h>
 #include <xen/tasklet.h>
 #include <xen/guest_access.h>
 #include <xen/earlycpio.h>
+#include <xen/watchdog.h>
 
+#include <asm/delay.h>
 #include <asm/msr.h>
 #include <asm/processor.h>
 #include <asm/setup.h>
 #include <asm/microcode.h>
 
+#define USEC_PER_SEC 1000000UL
+
 static module_t __initdata ucode_mod;
 static signed int __initdata ucode_mod_idx;
 static bool_t __initdata ucode_mod_forced;
@@ -187,7 +193,7 @@ static DEFINE_SPINLOCK(microcode_mutex);
 DEFINE_PER_CPU(struct ucode_cpu_info, ucode_cpu_info);
 
 struct microcode_info {
-    unsigned int cpu;
+    atomic_t cpu_in, cpu_out;
     uint32_t buffer_size;
     int error;
     char buffer[1];
@@ -281,24 +287,52 @@ static int microcode_update_cpu(const void *buf, size_t size)
     return err;
 }
 
-static long do_microcode_update(void *_info)
+static int __wait_for_cpus(atomic_t *cnt, int timeout)
 {
-    struct microcode_info *info = _info;
-    int error;
+    int cpus = num_online_cpus();
 
-    BUG_ON(info->cpu != smp_processor_id());
+    atomic_inc(cnt);
 
-    error = microcode_update_cpu(info->buffer, info->buffer_size);
-    if ( error )
-        info->error = error;
+    while (atomic_read(cnt) != cpus)
+    {
+        if ( timeout <= 0 )
+        {
+            printk("Timeout when waiting for CPUs calling in\n");
+            return -1;
+        }
+        udelay(1);
+        timeout--;
+    }
 
-    info->cpu = cpumask_next(info->cpu, &cpu_online_map);
-    if ( info->cpu < nr_cpu_ids )
-        return continue_hypercall_on_cpu(info->cpu, do_microcode_update, info);
+    return 0;
+}
 
-    error = info->error;
-    xfree(info);
-    return error;
+static int do_microcode_update(void *_info)
+{
+    struct microcode_info *info = _info;
+    int error, ret = 0;
+
+    error = __wait_for_cpus(&info->cpu_in, USEC_PER_SEC);
+    if ( error )
+    {
+        ret = -EBUSY;
+        return ret;
+    }
+
+    error = microcode_update_cpu(info->buffer, info->buffer_size);
+    if ( error && !ret )
+        ret = error;
+    /*
+     * Increase the wait timeout to a safe value here since we're serializing
+     * the microcode update and that could take a while on a large number of
+     * CPUs. And that is fine as the *actual* timeout will be determined by
+     * the last CPU finished updating and thus cut short
+     */
+    error = __wait_for_cpus(&info->cpu_out, USEC_PER_SEC * num_online_cpus());
+    if (error)
+        panic("Timeout when finishing updating microcode");
+    info->error = ret;
+    return ret;
 }
 
 int microcode_update(XEN_GUEST_HANDLE_PARAM(const_void) buf, unsigned long len)
@@ -325,7 +359,6 @@ int microcode_update(XEN_GUEST_HANDLE_PARAM(const_void) buf, unsigned long len)
 
     info->buffer_size = len;
     info->error = 0;
-    info->cpu = cpumask_first(&cpu_online_map);
 
     if ( microcode_ops->start_update )
     {
@@ -337,7 +370,31 @@ int microcode_update(XEN_GUEST_HANDLE_PARAM(const_void) buf, unsigned long len)
         }
     }
 
-    return continue_hypercall_on_cpu(info->cpu, do_microcode_update, info);
+    atomic_set(&info->cpu_in, 0);
+    atomic_set(&info->cpu_out, 0);
+
+    /*
+     * We intend to disable interrupt for long time, which may lead to
+     * watchdog timeout. Disable watchdog temporally.
+     */
+    watchdog_disable();
+    /*
+     * Late loading dance. Why the heavy-handed stop_machine effort?
+     *
+     * -HT siblings must be idle and not execute other code while the other
+     *  sibling is loading microcode in order to avoid any negative
+     *  interactions cause by the loading.
+     *
+     * -In addition, microcode update on the cores must be serialized until
+     *  this requirement can be relaxed in the feature. Right now, this is
+     *  conservative and good.
+     */
+    stop_machine_run(do_microcode_update, info, NR_CPUS);
+    watchdog_enable();
+
+    ret = info->error;
+    xfree(info);
+    return ret;
 }
 
 static int __init microcode_init(void)
-- 
1.8.3.1


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/2] x86/microcode: Do not upload microcode if CPUs are offline
  2018-03-30  6:59 [PATCH 1/2] x86/microcode: Synchronize late microcode loading Chao Gao
@ 2018-03-30  6:59 ` Chao Gao
  2018-04-13 15:57   ` Jan Beulich
  2018-04-12  6:31 ` [PATCH 1/2] x86/microcode: Synchronize late microcode loading Chao Gao
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 15+ messages in thread
From: Chao Gao @ 2018-03-30  6:59 UTC (permalink / raw)
  To: xen-devel
  Cc: Kevin Tian, Ashok Raj, Andrew Cooper, Jan Beulich, Jun Nakajima,
	Thomas Gleixner, Borislav Petkov, Chao Gao

This patch is to backport the patch below from linux kernel.

    commit 30ec26da9967d0d785abc24073129a34c3211777
    Author: Ashok Raj <ashok.raj@intel.com>
    Date:   Wed Feb 28 11:28:43 2018 +0100

        x86/microcode: Do not upload microcode if CPUs are offline

        Avoid loading microcode if any of the CPUs are offline, and issue a
        warning. Having different microcode revisions on the system at any time
        is outright dangerous.

        [ Borislav: Massage changelog. ]

        Signed-off-by: Ashok Raj <ashok.raj@intel.com>
        Signed-off-by: Borislav Petkov <bp@suse.de>
        Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
        Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
        Tested-by: Ashok Raj <ashok.raj@intel.com>
        Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
        Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com>
        Link: http://lkml.kernel.org/r/1519352533-15992-4-git-send-email-ashok.raj@intel.com
        Link: https://lkml.kernel.org/r/20180228102846.13447-5-bp@alien8.de

Signed-off-by: Chao Gao <chao.gao@intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Jun Nakajima <jun.nakajima@intel.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 xen/arch/x86/microcode.c | 30 +++++++++++++++++++++---------
 1 file changed, 21 insertions(+), 9 deletions(-)

diff --git a/xen/arch/x86/microcode.c b/xen/arch/x86/microcode.c
index 94c1ca2..25d9112 100644
--- a/xen/arch/x86/microcode.c
+++ b/xen/arch/x86/microcode.c
@@ -346,16 +346,27 @@ int microcode_update(XEN_GUEST_HANDLE_PARAM(const_void) buf, unsigned long len)
     if ( microcode_ops == NULL )
         return -EINVAL;
 
+    /* cpu_online_map must not change. */
+    if ( !get_cpu_maps() )
+        return -EBUSY;
+
+    if ( num_present_cpus() != num_online_cpus() )
+    {
+        ret = -EINVAL;
+        printk("Not all CPUs online, aborting microcode update\n");
+        goto put;
+    }
+
     info = xmalloc_bytes(sizeof(*info) + len);
     if ( info == NULL )
-        return -ENOMEM;
+    {
+        ret = -ENOMEM;
+        goto put;
+    }
 
     ret = copy_from_guest(info->buffer, buf, len);
     if ( ret != 0 )
-    {
-        xfree(info);
-        return ret;
-    }
+        goto err;
 
     info->buffer_size = len;
     info->error = 0;
@@ -364,10 +375,7 @@ int microcode_update(XEN_GUEST_HANDLE_PARAM(const_void) buf, unsigned long len)
     {
         ret = microcode_ops->start_update();
         if ( ret != 0 )
-        {
-            xfree(info);
-            return ret;
-        }
+            goto err;
     }
 
     atomic_set(&info->cpu_in, 0);
@@ -393,7 +401,11 @@ int microcode_update(XEN_GUEST_HANDLE_PARAM(const_void) buf, unsigned long len)
     watchdog_enable();
 
     ret = info->error;
+
+ err:
     xfree(info);
+ put:
+    put_cpu_maps();
     return ret;
 }
 
-- 
1.8.3.1


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] x86/microcode: Synchronize late microcode loading
  2018-03-30  6:59 [PATCH 1/2] x86/microcode: Synchronize late microcode loading Chao Gao
  2018-03-30  6:59 ` [PATCH 2/2] x86/microcode: Do not upload microcode if CPUs are offline Chao Gao
@ 2018-04-12  6:31 ` Chao Gao
  2018-04-12  6:38   ` Jan Beulich
  2018-04-12 16:29 ` Raj, Ashok
  2018-04-13 15:49 ` Jan Beulich
  3 siblings, 1 reply; 15+ messages in thread
From: Chao Gao @ 2018-04-12  6:31 UTC (permalink / raw)
  To: xen-devel
  Cc: Kevin Tian, Jan Beulich, Andrew Cooper, Jun Nakajima,
	Thomas Gleixner, Borislav Petkov, Ashok Raj

Ping...

Can someone help to review these two patches?

On Fri, Mar 30, 2018 at 02:59:00PM +0800, Chao Gao wrote:
>From: Gao Chao <chao.gao@intel.com>
>
>This patch is to backport microcode improvement patches from linux
>kernel. Below are the original patches description:
>
>    commit a5321aec6412b20b5ad15db2d6b916c05349dbff
>    Author: Ashok Raj <ashok.raj@intel.com>
>    Date:   Wed Feb 28 11:28:46 2018 +0100
>
>	x86/microcode: Synchronize late microcode loading
>
>	Original idea by Ashok, completely rewritten by Borislav.
>
>	Before you read any further: the early loading method is still the
>	preferred one and you should always do that. The following patch is
>	improving the late loading mechanism for long running jobs and cloud use
>	cases.
>
>	Gather all cores and serialize the microcode update on them by doing it
>	one-by-one to make the late update process as reliable as possible and
>	avoid potential issues caused by the microcode update.
>
>	[ Borislav: Rewrite completely. ]
>
>	Co-developed-by: Borislav Petkov <bp@suse.de>
>	Signed-off-by: Ashok Raj <ashok.raj@intel.com>
>	Signed-off-by: Borislav Petkov <bp@suse.de>
>	Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>	Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
>	Tested-by: Ashok Raj <ashok.raj@intel.com>
>	Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
>	Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com>
>	Link: https://lkml.kernel.org/r/20180228102846.13447-8-bp@alien8.de
>
>    commit bb8c13d61a629276a162c1d2b1a20a815cbcfbb7
>    Author: Borislav Petkov <bp@suse.de>
>    Date:   Wed Mar 14 19:36:15 2018 +0100
>
>	x86/microcode: Fix CPU synchronization routine
>
>	Emanuel reported an issue with a hang during microcode update because my
>	dumb idea to use one atomic synchronization variable for both rendezvous
>	- before and after update - was simply bollocks:
>
>	  microcode: microcode_reload_late: late_cpus: 4
>	  microcode: __reload_late: cpu 2 entered
>	  microcode: __reload_late: cpu 1 entered
>	  microcode: __reload_late: cpu 3 entered
>	  microcode: __reload_late: cpu 0 entered
>	  microcode: __reload_late: cpu 1 left
>	  microcode: Timeout while waiting for CPUs rendezvous, remaining: 1
>
>	CPU1 above would finish, leave and the others will still spin waiting for
>	it to join.
>
>	So do two synchronization atomics instead, which makes the code a lot more
>	straightforward.
>
>	Also, since the update is serialized and it also takes quite some time per
>	microcode engine, increase the exit timeout by the number of CPUs on the
>	system.
>
>	That's ok because the moment all CPUs are done, that timeout will be cut
>	short.
>
>	Furthermore, panic when some of the CPUs timeout when returning from a
>	microcode update: we can't allow a system with not all cores updated.
>
>	Also, as an optimization, do not do the exit sync if microcode wasn't
>	updated.
>
>	Reported-by: Emanuel Czirai <xftroxgpx@protonmail.com>
>	Signed-off-by: Borislav Petkov <bp@suse.de>
>	Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>	Tested-by: Emanuel Czirai <xftroxgpx@protonmail.com>
>	Tested-by: Ashok Raj <ashok.raj@intel.com>
>	Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
>	Link: https://lkml.kernel.org/r/20180314183615.17629-2-bp@alien8.de
>
>This patch is also in accord with Andrew's suggestion,
>"Rendezvous all online cpus in an IPI to apply the patch, and keep the
>processors in until all have completed the patch.", in [1].
>
>[1]:https://wiki.xenproject.org/wiki/XenParavirtOps/microcode_update#Run_time_microcode_updates
>
>Signed-off-by: Chao Gao <chao.gao@intel.com>
>Cc: Kevin Tian <kevin.tian@intel.com>
>Cc: Jun Nakajima <jun.nakajima@intel.com>
>Cc: Ashok Raj <ashok.raj@intel.com>
>Cc: Borislav Petkov <bp@suse.de>
>Cc: Thomas Gleixner <tglx@linutronix.de>
>---
> xen/arch/x86/microcode.c | 89 +++++++++++++++++++++++++++++++++++++++---------
> 1 file changed, 73 insertions(+), 16 deletions(-)
>
>diff --git a/xen/arch/x86/microcode.c b/xen/arch/x86/microcode.c
>index 4163f50..94c1ca2 100644
>--- a/xen/arch/x86/microcode.c
>+++ b/xen/arch/x86/microcode.c
>@@ -22,6 +22,7 @@
>  */
> 
> #include <xen/cpu.h>
>+#include <xen/cpumask.h>
> #include <xen/lib.h>
> #include <xen/kernel.h>
> #include <xen/init.h>
>@@ -30,15 +31,20 @@
> #include <xen/smp.h>
> #include <xen/softirq.h>
> #include <xen/spinlock.h>
>+#include <xen/stop_machine.h>
> #include <xen/tasklet.h>
> #include <xen/guest_access.h>
> #include <xen/earlycpio.h>
>+#include <xen/watchdog.h>
> 
>+#include <asm/delay.h>
> #include <asm/msr.h>
> #include <asm/processor.h>
> #include <asm/setup.h>
> #include <asm/microcode.h>
> 
>+#define USEC_PER_SEC 1000000UL
>+
> static module_t __initdata ucode_mod;
> static signed int __initdata ucode_mod_idx;
> static bool_t __initdata ucode_mod_forced;
>@@ -187,7 +193,7 @@ static DEFINE_SPINLOCK(microcode_mutex);
> DEFINE_PER_CPU(struct ucode_cpu_info, ucode_cpu_info);
> 
> struct microcode_info {
>-    unsigned int cpu;
>+    atomic_t cpu_in, cpu_out;
>     uint32_t buffer_size;
>     int error;
>     char buffer[1];
>@@ -281,24 +287,52 @@ static int microcode_update_cpu(const void *buf, size_t size)
>     return err;
> }
> 
>-static long do_microcode_update(void *_info)
>+static int __wait_for_cpus(atomic_t *cnt, int timeout)
> {
>-    struct microcode_info *info = _info;
>-    int error;
>+    int cpus = num_online_cpus();
> 
>-    BUG_ON(info->cpu != smp_processor_id());
>+    atomic_inc(cnt);
> 
>-    error = microcode_update_cpu(info->buffer, info->buffer_size);
>-    if ( error )
>-        info->error = error;
>+    while (atomic_read(cnt) != cpus)
>+    {
>+        if ( timeout <= 0 )
>+        {
>+            printk("Timeout when waiting for CPUs calling in\n");
>+            return -1;
>+        }
>+        udelay(1);
>+        timeout--;
>+    }
> 
>-    info->cpu = cpumask_next(info->cpu, &cpu_online_map);
>-    if ( info->cpu < nr_cpu_ids )
>-        return continue_hypercall_on_cpu(info->cpu, do_microcode_update, info);
>+    return 0;
>+}
> 
>-    error = info->error;
>-    xfree(info);
>-    return error;
>+static int do_microcode_update(void *_info)
>+{
>+    struct microcode_info *info = _info;
>+    int error, ret = 0;
>+
>+    error = __wait_for_cpus(&info->cpu_in, USEC_PER_SEC);
>+    if ( error )
>+    {
>+        ret = -EBUSY;
>+        return ret;
>+    }
>+
>+    error = microcode_update_cpu(info->buffer, info->buffer_size);
>+    if ( error && !ret )
>+        ret = error;
>+    /*
>+     * Increase the wait timeout to a safe value here since we're serializing
>+     * the microcode update and that could take a while on a large number of
>+     * CPUs. And that is fine as the *actual* timeout will be determined by
>+     * the last CPU finished updating and thus cut short
>+     */
>+    error = __wait_for_cpus(&info->cpu_out, USEC_PER_SEC * num_online_cpus());
>+    if (error)
>+        panic("Timeout when finishing updating microcode");
>+    info->error = ret;
>+    return ret;
> }
> 
> int microcode_update(XEN_GUEST_HANDLE_PARAM(const_void) buf, unsigned long len)
>@@ -325,7 +359,6 @@ int microcode_update(XEN_GUEST_HANDLE_PARAM(const_void) buf, unsigned long len)
> 
>     info->buffer_size = len;
>     info->error = 0;
>-    info->cpu = cpumask_first(&cpu_online_map);
> 
>     if ( microcode_ops->start_update )
>     {
>@@ -337,7 +370,31 @@ int microcode_update(XEN_GUEST_HANDLE_PARAM(const_void) buf, unsigned long len)
>         }
>     }
> 
>-    return continue_hypercall_on_cpu(info->cpu, do_microcode_update, info);
>+    atomic_set(&info->cpu_in, 0);
>+    atomic_set(&info->cpu_out, 0);
>+
>+    /*
>+     * We intend to disable interrupt for long time, which may lead to
>+     * watchdog timeout. Disable watchdog temporally.
>+     */
>+    watchdog_disable();
>+    /*
>+     * Late loading dance. Why the heavy-handed stop_machine effort?
>+     *
>+     * -HT siblings must be idle and not execute other code while the other
>+     *  sibling is loading microcode in order to avoid any negative
>+     *  interactions cause by the loading.
>+     *
>+     * -In addition, microcode update on the cores must be serialized until
>+     *  this requirement can be relaxed in the feature. Right now, this is
>+     *  conservative and good.
>+     */
>+    stop_machine_run(do_microcode_update, info, NR_CPUS);
>+    watchdog_enable();
>+
>+    ret = info->error;
>+    xfree(info);
>+    return ret;
> }
> 
> static int __init microcode_init(void)
>-- 
>1.8.3.1
>

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] x86/microcode: Synchronize late microcode loading
  2018-04-12  6:31 ` [PATCH 1/2] x86/microcode: Synchronize late microcode loading Chao Gao
@ 2018-04-12  6:38   ` Jan Beulich
  0 siblings, 0 replies; 15+ messages in thread
From: Jan Beulich @ 2018-04-12  6:38 UTC (permalink / raw)
  To: Chao Gao
  Cc: Kevin Tian, Ashok Raj, Andrew Cooper, xen-devel, Jun Nakajima,
	Thomas Gleixner, Borislav Petkov

>>> On 12.04.18 at 08:31, <chao.gao@intel.com> wrote:
> Ping...
> 
> Can someone help to review these two patches?

Sure, they're on my list of things to look at. Too many other things
going on.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] x86/microcode: Synchronize late microcode loading
  2018-03-30  6:59 [PATCH 1/2] x86/microcode: Synchronize late microcode loading Chao Gao
  2018-03-30  6:59 ` [PATCH 2/2] x86/microcode: Do not upload microcode if CPUs are offline Chao Gao
  2018-04-12  6:31 ` [PATCH 1/2] x86/microcode: Synchronize late microcode loading Chao Gao
@ 2018-04-12 16:29 ` Raj, Ashok
  2018-04-13  5:25   ` Chao Gao
  2018-04-13 15:49 ` Jan Beulich
  3 siblings, 1 reply; 15+ messages in thread
From: Raj, Ashok @ 2018-04-12 16:29 UTC (permalink / raw)
  To: Chao Gao
  Cc: Kevin Tian, Jan Beulich, Andrew Cooper, xen-devel, Jun Nakajima,
	Thomas Gleixner, Borislav Petkov, Ashok Raj

On Fri, Mar 30, 2018 at 02:59:00PM +0800, Chao Gao wrote:
> From: Gao Chao <chao.gao@intel.com>
> 
> This patch is to backport microcode improvement patches from linux
> kernel. Below are the original patches description:
> 
>     commit a5321aec6412b20b5ad15db2d6b916c05349dbff
>     Author: Ashok Raj <ashok.raj@intel.com>
>     Date:   Wed Feb 28 11:28:46 2018 +0100
> 
> 	x86/microcode: Synchronize late microcode loading
> 
> 	Original idea by Ashok, completely rewritten by Borislav.
> 
> 	Before you read any further: the early loading method is still the
> 	preferred one and you should always do that. The following patch is
> 	improving the late loading mechanism for long running jobs and cloud use
> 	cases.
> 
> 	Gather all cores and serialize the microcode update on them by doing it
> 	one-by-one to make the late update process as reliable as possible and
> 	avoid potential issues caused by the microcode update.
> 
> 	[ Borislav: Rewrite completely. ]
> 
> 	Co-developed-by: Borislav Petkov <bp@suse.de>
> 	Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> 	Signed-off-by: Borislav Petkov <bp@suse.de>
> 	Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> 	Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
> 	Tested-by: Ashok Raj <ashok.raj@intel.com>

The tested by tags were good for linux upstream. Can you make sure
you add your name under tested-by for xen?

> 	Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
> 	Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com>
> 	Link: https://lkml.kernel.org/r/20180228102846.13447-8-bp@alien8.de
> 
> +static int do_microcode_update(void *_info)
> +{
> +    struct microcode_info *info = _info;
> +    int error, ret = 0;
> +
> +    error = __wait_for_cpus(&info->cpu_in, USEC_PER_SEC);
> +    if ( error )
> +    {
> +        ret = -EBUSY;
> +        return ret;
> +    }
> +
> +    error = microcode_update_cpu(info->buffer, info->buffer_size);
> +    if ( error && !ret )
> +        ret = error;

Isn't this redundant? can ret become non-zero in the check above?

> +    /*
> +     * Increase the wait timeout to a safe value here since we're serializing
> +     * the microcode update and that could take a while on a large number of
> +     * CPUs. And that is fine as the *actual* timeout will be determined by
> +     * the last CPU finished updating and thus cut short
> +     */
> +    error = __wait_for_cpus(&info->cpu_out, USEC_PER_SEC * num_online_cpus());
> +    if (error)
> +        panic("Timeout when finishing updating microcode");
> +    info->error = ret;
> +    return ret;
>  }
>  



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] x86/microcode: Synchronize late microcode loading
  2018-04-12 16:29 ` Raj, Ashok
@ 2018-04-13  5:25   ` Chao Gao
  2018-04-13  8:20     ` Jan Beulich
  0 siblings, 1 reply; 15+ messages in thread
From: Chao Gao @ 2018-04-13  5:25 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Kevin Tian, Jan Beulich, Andrew Cooper, xen-devel, Jun Nakajima,
	Thomas Gleixner, Borislav Petkov

On Thu, Apr 12, 2018 at 09:29:34AM -0700, Raj, Ashok wrote:
>On Fri, Mar 30, 2018 at 02:59:00PM +0800, Chao Gao wrote:
>> From: Gao Chao <chao.gao@intel.com>
>> 
>> This patch is to backport microcode improvement patches from linux
>> kernel. Below are the original patches description:
>> 
>>     commit a5321aec6412b20b5ad15db2d6b916c05349dbff
>>     Author: Ashok Raj <ashok.raj@intel.com>
>>     Date:   Wed Feb 28 11:28:46 2018 +0100
>> 
>> 	x86/microcode: Synchronize late microcode loading
>> 
>> 	Original idea by Ashok, completely rewritten by Borislav.
>> 
>> 	Before you read any further: the early loading method is still the
>> 	preferred one and you should always do that. The following patch is
>> 	improving the late loading mechanism for long running jobs and cloud use
>> 	cases.
>> 
>> 	Gather all cores and serialize the microcode update on them by doing it
>> 	one-by-one to make the late update process as reliable as possible and
>> 	avoid potential issues caused by the microcode update.
>> 
>> 	[ Borislav: Rewrite completely. ]
>> 
>> 	Co-developed-by: Borislav Petkov <bp@suse.de>
>> 	Signed-off-by: Ashok Raj <ashok.raj@intel.com>
>> 	Signed-off-by: Borislav Petkov <bp@suse.de>
>> 	Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>> 	Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
>> 	Tested-by: Ashok Raj <ashok.raj@intel.com>
>
>The tested by tags were good for linux upstream. Can you make sure
>you add your name under tested-by for xen?

Tested-by doesn't mean they have tested this patch. I just put the
original commits description here.

If Xen permits such tested-by from the sender and author, I will add one.

>
>> 	Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
>> 	Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com>
>> 	Link: https://lkml.kernel.org/r/20180228102846.13447-8-bp@alien8.de
>> 
>> +static int do_microcode_update(void *_info)
>> +{
>> +    struct microcode_info *info = _info;
>> +    int error, ret = 0;
>> +
>> +    error = __wait_for_cpus(&info->cpu_in, USEC_PER_SEC);
>> +    if ( error )
>> +    {
>> +        ret = -EBUSY;
>> +        return ret;
>> +    }
>> +
>> +    error = microcode_update_cpu(info->buffer, info->buffer_size);
>> +    if ( error && !ret )
>> +        ret = error;
>
>Isn't this redundant? can ret become non-zero in the check above?

Yes, it is redundant. I will also remove 'error' field from struct
microcode_info  because stop_machine_run already has the ability to return the
first error code. Hence, don't need to implement it again here (this is the
reason why the above fragment is weird).

Thanks
Chao

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] x86/microcode: Synchronize late microcode loading
  2018-04-13  5:25   ` Chao Gao
@ 2018-04-13  8:20     ` Jan Beulich
  0 siblings, 0 replies; 15+ messages in thread
From: Jan Beulich @ 2018-04-13  8:20 UTC (permalink / raw)
  To: Ashok Raj, Chao Gao
  Cc: Kevin Tian, Andrew Cooper, xen-devel, Jun Nakajima, tglx,
	Borislav Petkov

>>> On 13.04.18 at 07:25, <chao.gao@intel.com> wrote:
> On Thu, Apr 12, 2018 at 09:29:34AM -0700, Raj, Ashok wrote:
>>On Fri, Mar 30, 2018 at 02:59:00PM +0800, Chao Gao wrote:
>>> From: Gao Chao <chao.gao@intel.com>
>>> 
>>> This patch is to backport microcode improvement patches from linux
>>> kernel. Below are the original patches description:
>>> 
>>>     commit a5321aec6412b20b5ad15db2d6b916c05349dbff
>>>     Author: Ashok Raj <ashok.raj@intel.com>
>>>     Date:   Wed Feb 28 11:28:46 2018 +0100
>>> 
>>> 	x86/microcode: Synchronize late microcode loading
>>> 
>>> 	Original idea by Ashok, completely rewritten by Borislav.
>>> 
>>> 	Before you read any further: the early loading method is still the
>>> 	preferred one and you should always do that. The following patch is
>>> 	improving the late loading mechanism for long running jobs and cloud use
>>> 	cases.
>>> 
>>> 	Gather all cores and serialize the microcode update on them by doing it
>>> 	one-by-one to make the late update process as reliable as possible and
>>> 	avoid potential issues caused by the microcode update.
>>> 
>>> 	[ Borislav: Rewrite completely. ]
>>> 
>>> 	Co-developed-by: Borislav Petkov <bp@suse.de>
>>> 	Signed-off-by: Ashok Raj <ashok.raj@intel.com>
>>> 	Signed-off-by: Borislav Petkov <bp@suse.de>
>>> 	Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>>> 	Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
>>> 	Tested-by: Ashok Raj <ashok.raj@intel.com>
>>
>>The tested by tags were good for linux upstream. Can you make sure
>>you add your name under tested-by for xen?
> 
> Tested-by doesn't mean they have tested this patch. I just put the
> original commits description here.

Tested-by being as meaningful in Xen as it is in Linux, retaining such tags (other
than authorship ones) is generally wrong, as it gives a wrong impression on
what testing the _Xen_ patch has seen.

> If Xen permits such tested-by from the sender and author, I will add one.

I think it is generally implied that the author has done some testing. In the
case here - with a ported over Linux commit by other than the original
author - I would view a T-b by the original author as meaningful though. It
should be made clear though that this is a ported Linux commit, which
generally we do by naming the Linux commit. See e.g. the history of
mwait-idle.c, where most of the commits are straight ports from Linux.

Jan



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] x86/microcode: Synchronize late microcode loading
  2018-03-30  6:59 [PATCH 1/2] x86/microcode: Synchronize late microcode loading Chao Gao
                   ` (2 preceding siblings ...)
  2018-04-12 16:29 ` Raj, Ashok
@ 2018-04-13 15:49 ` Jan Beulich
  2018-04-16  6:20   ` Chao Gao
  3 siblings, 1 reply; 15+ messages in thread
From: Jan Beulich @ 2018-04-13 15:49 UTC (permalink / raw)
  To: Chao Gao; +Cc: Andrew Cooper, Kevin Tian, Ashok Raj, Jun Nakajima, xen-devel

>>> On 30.03.18 at 08:59, <chao.gao@intel.com> wrote:
> @@ -281,24 +287,52 @@ static int microcode_update_cpu(const void *buf, size_t size)
>      return err;
>  }
>  
> -static long do_microcode_update(void *_info)
> +static int __wait_for_cpus(atomic_t *cnt, int timeout)

No new double-underscore prefixed functions please.

>  {
> -    struct microcode_info *info = _info;
> -    int error;
> +    int cpus = num_online_cpus();

unsigned int

> -    BUG_ON(info->cpu != smp_processor_id());
> +    atomic_inc(cnt);
>  
> -    error = microcode_update_cpu(info->buffer, info->buffer_size);
> -    if ( error )
> -        info->error = error;
> +    while (atomic_read(cnt) != cpus)

There are a number of style issues in the patch, mostly (like here) missing
blanks.

> +    {
> +        if ( timeout <= 0 )
> +        {
> +            printk("Timeout when waiting for CPUs calling in\n");
> +            return -1;
> +        }
> +        udelay(1);
> +        timeout--;
> +    }
>  
> -    info->cpu = cpumask_next(info->cpu, &cpu_online_map);
> -    if ( info->cpu < nr_cpu_ids )
> -        return continue_hypercall_on_cpu(info->cpu, do_microcode_update, info);
> +    return 0;
> +}
>  
> -    error = info->error;
> -    xfree(info);
> -    return error;
> +static int do_microcode_update(void *_info)
> +{
> +    struct microcode_info *info = _info;
> +    int error, ret = 0;
> +
> +    error = __wait_for_cpus(&info->cpu_in, USEC_PER_SEC);

Why this long a timeout here?

> +    if ( error )
> +    {
> +        ret = -EBUSY;
> +        return ret;
> +    }
> +
> +    error = microcode_update_cpu(info->buffer, info->buffer_size);
> +    if ( error && !ret )
> +        ret = error;
> +    /*
> +     * Increase the wait timeout to a safe value here since we're serializing
> +     * the microcode update and that could take a while on a large number of
> +     * CPUs. And that is fine as the *actual* timeout will be determined by
> +     * the last CPU finished updating and thus cut short
> +     */
> +    error = __wait_for_cpus(&info->cpu_out, USEC_PER_SEC * num_online_cpus());

And this one's even worse, in particular on huge systems. I'm afraid such a long
period of time in stop-machine context is going to confuse most of the running
domains (including Dom0). There's nothing inherently wrong with e.g. processing
the updates on distinct cores (and even more so on distinct sockets) in parallel.
Therefore revising the locking in microcode_update_cpu() might be a necessary
prereq step. Or alternatively you may need to demand that no other running
domains exist besides Dom0 (and hope the best for Dom0 itself).

I also don't think there's any point invoking the operation on all HT threads on a
core, but I realize stop_machine_run() isn't flexible enough to allow such.

Jan



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] x86/microcode: Do not upload microcode if CPUs are offline
  2018-03-30  6:59 ` [PATCH 2/2] x86/microcode: Do not upload microcode if CPUs are offline Chao Gao
@ 2018-04-13 15:57   ` Jan Beulich
  2018-04-13 17:57     ` Raj, Ashok
  0 siblings, 1 reply; 15+ messages in thread
From: Jan Beulich @ 2018-04-13 15:57 UTC (permalink / raw)
  To: Chao Gao, xen-devel; +Cc: Andrew Cooper, Kevin Tian, Ashok Raj, Jun Nakajima

>>> On 30.03.18 at 08:59, <chao.gao@intel.com> wrote:
> This patch is to backport the patch below from linux kernel.
> 
>     commit 30ec26da9967d0d785abc24073129a34c3211777
>     Author: Ashok Raj <ashok.raj@intel.com>
>     Date:   Wed Feb 28 11:28:43 2018 +0100
> 
>         x86/microcode: Do not upload microcode if CPUs are offline
> 
>         Avoid loading microcode if any of the CPUs are offline, and issue a
>         warning. Having different microcode revisions on the system at any time
>         is outright dangerous.

I'm afraid I don't fully agree - not applying an ucode update to the online
CPUs because some are offline isn't any better. Plus (while updating)
there's always going to be some discrepancy between ucode versions.
As long as we apply updates while bringing a CPU online, I think we're fine.

Also please consider valid cases you make not work anymore, like someone
having brought offline all sibling hyperthreads, with each core still having
one thread active. In that case an ucode update will implicitly update all
offline threads as well.

Jan



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] x86/microcode: Do not upload microcode if CPUs are offline
  2018-04-13 15:57   ` Jan Beulich
@ 2018-04-13 17:57     ` Raj, Ashok
  2018-04-13 18:38       ` Borislav Petkov
  2018-04-16 10:30       ` Jan Beulich
  0 siblings, 2 replies; 15+ messages in thread
From: Raj, Ashok @ 2018-04-13 17:57 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Kevin Tian, Ashok Raj, Andrew Cooper, xen-devel, Borislav Petkov,
	Jun Nakajima, Chao Gao

Hi Jan

+ Boris

On Fri, Apr 13, 2018 at 09:57:25AM -0600, Jan Beulich wrote:
> >>> On 30.03.18 at 08:59, <chao.gao@intel.com> wrote:
> > This patch is to backport the patch below from linux kernel.
> > 
> >     commit 30ec26da9967d0d785abc24073129a34c3211777
> >     Author: Ashok Raj <ashok.raj@intel.com>
> >     Date:   Wed Feb 28 11:28:43 2018 +0100
> > 
> >         x86/microcode: Do not upload microcode if CPUs are offline
> > 
> >         Avoid loading microcode if any of the CPUs are offline, and issue a
> >         warning. Having different microcode revisions on the system at any time
> >         is outright dangerous.
> 
> I'm afraid I don't fully agree - not applying an ucode update to the online
> CPUs because some are offline isn't any better. Plus (while updating)
> there's always going to be some discrepancy between ucode versions.
> As long as we apply updates while bringing a CPU online, I think we're fine.

This is the safest option. Microcode is considered part of the cpu. We don't
allow cpus with different capabilities in the same system.. yes they might
work, but not something we allow. 

In general we recommend early update. Earliest the best. 

- BIOS update (difficult to deploy, but some microcodes have to be done
  this way.)
- early update from initrd.. almost same as #1, since we apply at earliest
  chance that's the closest and most recommended method.
- late update. Before this procedure of stopping all cpus, we did have a 
  time when some are updated and some werent uptodate yet. This synchronized
  update is precicely to get as close as possible to updating all of them.
> 
> Also please consider valid cases you make not work anymore, like someone
> having brought offline all sibling hyperthreads, with each core still having
> one thread active. In that case an ucode update will implicitly update all
> offline threads as well.

Of cource we can tweak this to be much better, there are other ideas, but
this is an effort to keep this simple, and also address microcode requirements
post spectre for some processors.  In the grand scheme of things although its
interesting to allow such updates we think it may not be best practice.

We want to get this working right first before getting fancy.

Cheers,
Ashok

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] x86/microcode: Do not upload microcode if CPUs are offline
  2018-04-13 17:57     ` Raj, Ashok
@ 2018-04-13 18:38       ` Borislav Petkov
  2018-04-16 10:30       ` Jan Beulich
  1 sibling, 0 replies; 15+ messages in thread
From: Borislav Petkov @ 2018-04-13 18:38 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Kevin Tian, Jan Beulich, Andrew Cooper, xen-devel, Jun Nakajima,
	Chao Gao

On Fri, Apr 13, 2018 at 10:57:46AM -0700, Raj, Ashok wrote:
> > I'm afraid I don't fully agree - not applying an ucode update to the online
> > CPUs because some are offline isn't any better. Plus (while updating)
> > there's always going to be some discrepancy between ucode versions.
> > As long as we apply updates while bringing a CPU online, I think we're fine.

When a core is soft-offlined as we do it, it doesn't mean that it
doesn't execute instructions...

> This is the safest option. Microcode is considered part of the cpu. We don't
> allow cpus with different capabilities in the same system.. yes they might
> work, but not something we allow.

... so yes, we better be safe than having to debug some insane lockups.

> In general we recommend early update. Earliest the best. 
> 
> - BIOS update (difficult to deploy, but some microcodes have to be done
>   this way.)
> - early update from initrd.. almost same as #1, since we apply at earliest
>   chance that's the closest and most recommended method.
> - late update. Before this procedure of stopping all cpus, we did have a 
>   time when some are updated and some werent uptodate yet. This synchronized
>   update is precicely to get as close as possible to updating all of them.

Yap.

> > Also please consider valid cases you make not work anymore, like someone
> > having brought offline all sibling hyperthreads, with each core still having
> > one thread active. In that case an ucode update will implicitly update all
> > offline threads as well.

That's some strange use case which I can't imagine worked, like ever.
Because the microcode engine is shared between the HT threads so no
matter on which of the hyperthreads you update, the same, one and only
engine gets updated.

> Of cource we can tweak this to be much better, there are other ideas, but
> this is an effort to keep this simple, and also address microcode requirements
> post spectre for some processors.  In the grand scheme of things although its
> interesting to allow such updates we think it may not be best practice.
> 
> We want to get this working right first before getting fancy.

Ack.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] x86/microcode: Synchronize late microcode loading
  2018-04-13 15:49 ` Jan Beulich
@ 2018-04-16  6:20   ` Chao Gao
  2018-04-16 10:26     ` Jan Beulich
  0 siblings, 1 reply; 15+ messages in thread
From: Chao Gao @ 2018-04-16  6:20 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Andrew Cooper, Kevin Tian, Ashok Raj, Jun Nakajima, xen-devel

On Fri, Apr 13, 2018 at 09:49:17AM -0600, Jan Beulich wrote:
>>>> On 30.03.18 at 08:59, <chao.gao@intel.com> wrote:
>> @@ -281,24 +287,52 @@ static int microcode_update_cpu(const void *buf, size_t size)
>>      return err;
>>  }
>>  
>> -static long do_microcode_update(void *_info)
>> +static int __wait_for_cpus(atomic_t *cnt, int timeout)
>
>No new double-underscore prefixed functions please.
>
>>  {
>> -    struct microcode_info *info = _info;
>> -    int error;
>> +    int cpus = num_online_cpus();
>
>unsigned int
>
>> -    BUG_ON(info->cpu != smp_processor_id());
>> +    atomic_inc(cnt);
>>  
>> -    error = microcode_update_cpu(info->buffer, info->buffer_size);
>> -    if ( error )
>> -        info->error = error;
>> +    while (atomic_read(cnt) != cpus)
>
>There are a number of style issues in the patch, mostly (like here) missing
>blanks.

Will fix all issues pointed out by comments above.

>
>> +    {
>> +        if ( timeout <= 0 )
>> +        {
>> +            printk("Timeout when waiting for CPUs calling in\n");
>> +            return -1;
>> +        }
>> +        udelay(1);
>> +        timeout--;
>> +    }
>>  
>> -    info->cpu = cpumask_next(info->cpu, &cpu_online_map);
>> -    if ( info->cpu < nr_cpu_ids )
>> -        return continue_hypercall_on_cpu(info->cpu, do_microcode_update, info);
>> +    return 0;
>> +}
>>  
>> -    error = info->error;
>> -    xfree(info);
>> -    return error;
>> +static int do_microcode_update(void *_info)
>> +{
>> +    struct microcode_info *info = _info;
>> +    int error, ret = 0;
>> +
>> +    error = __wait_for_cpus(&info->cpu_in, USEC_PER_SEC);
>
>Why this long a timeout here?

I just use the same timeout as the patch on linux kernel side. As we
know if the timeout is too small, updating microcode may be likely to
failed even if other CPUs disabled interrupt temporally.

If you object to such a long timeout (for Xen may need much smaller
time to rendezvous all CPUs compared to linux kernel because Xen doesn't
have device drivers which may malfunction), how about just use the
default timeout, 30ms, used by live patching? if it is also not
good enough, then we make it an option which comes from callers.

>
>> +    if ( error )
>> +    {
>> +        ret = -EBUSY;
>> +        return ret;
>> +    }
>> +
>> +    error = microcode_update_cpu(info->buffer, info->buffer_size);
>> +    if ( error && !ret )
>> +        ret = error;
>> +    /*
>> +     * Increase the wait timeout to a safe value here since we're serializing
>> +     * the microcode update and that could take a while on a large number of
>> +     * CPUs. And that is fine as the *actual* timeout will be determined by
>> +     * the last CPU finished updating and thus cut short
>> +     */
>> +    error = __wait_for_cpus(&info->cpu_out, USEC_PER_SEC * num_online_cpus());
>
>And this one's even worse, in particular on huge systems. I'm afraid such a long
>period of time in stop-machine context is going to confuse most of the running
>domains (including Dom0). There's nothing inherently wrong with e.g. processing
>the updates on distinct cores (and even more so on distinct sockets) in parallel.

I cannot say for sure. But the original patch does want updating
microcode be performed one-by-one.

>Therefore revising the locking in microcode_update_cpu() might be a necessary
>prereq step.

Do you mean changing it to a per-core or per-socket lock?


>Or alternatively you may need to demand that no other running
>domains exist besides Dom0 (and hope the best for Dom0 itself).
>
>I also don't think there's any point invoking the operation on all HT threads on a
>core, but I realize stop_machine_run() isn't flexible enough to allow such.

Only one thread in a core will do the actual update. Other threads only
check the microcode version and find the version is already
update-to-date, then exit the critical region. Considering the check may
don't need much time, I wonder whether it can significantly benefit the
overall time to invoking the operation on all HT threads on a core?  

Thanks
Chao

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] x86/microcode: Synchronize late microcode loading
  2018-04-16  6:20   ` Chao Gao
@ 2018-04-16 10:26     ` Jan Beulich
  2018-04-19  8:38       ` Chao Gao
  0 siblings, 1 reply; 15+ messages in thread
From: Jan Beulich @ 2018-04-16 10:26 UTC (permalink / raw)
  To: Chao Gao; +Cc: Andrew Cooper, Kevin Tian, Ashok Raj, Jun Nakajima, xen-devel

>>> On 16.04.18 at 08:20, <chao.gao@intel.com> wrote:
> On Fri, Apr 13, 2018 at 09:49:17AM -0600, Jan Beulich wrote:
>>>>> On 30.03.18 at 08:59, <chao.gao@intel.com> wrote:
>>> +static int do_microcode_update(void *_info)
>>> +{
>>> +    struct microcode_info *info = _info;
>>> +    int error, ret = 0;
>>> +
>>> +    error = __wait_for_cpus(&info->cpu_in, USEC_PER_SEC);
>>
>>Why this long a timeout here?
> 
> I just use the same timeout as the patch on linux kernel side. As we
> know if the timeout is too small, updating microcode may be likely to
> failed even if other CPUs disabled interrupt temporally.
> 
> If you object to such a long timeout (for Xen may need much smaller
> time to rendezvous all CPUs compared to linux kernel because Xen doesn't
> have device drivers which may malfunction), how about just use the
> default timeout, 30ms, used by live patching? if it is also not
> good enough, then we make it an option which comes from callers.

Yes, 30ms is likely to be acceptable.

>>> +    if ( error )
>>> +    {
>>> +        ret = -EBUSY;
>>> +        return ret;
>>> +    }
>>> +
>>> +    error = microcode_update_cpu(info->buffer, info->buffer_size);
>>> +    if ( error && !ret )
>>> +        ret = error;
>>> +    /*
>>> +     * Increase the wait timeout to a safe value here since we're serializing
>>> +     * the microcode update and that could take a while on a large number of
>>> +     * CPUs. And that is fine as the *actual* timeout will be determined by
>>> +     * the last CPU finished updating and thus cut short
>>> +     */
>>> +    error = __wait_for_cpus(&info->cpu_out, USEC_PER_SEC * num_online_cpus());
>>
>>And this one's even worse, in particular on huge systems. I'm afraid such a long
>>period of time in stop-machine context is going to confuse most of the running
>>domains (including Dom0). There's nothing inherently wrong with e.g. processing
>>the updates on distinct cores (and even more so on distinct sockets) in parallel.
> 
> I cannot say for sure. But the original patch does want updating
> microcode be performed one-by-one.

And they don't restrict the "when" aspect in any way? I.e. all sorts of
applications (and perhaps even KVM guests) may be running?

>>Therefore revising the locking in microcode_update_cpu() might be a necessary
>>prereq step.
> 
> Do you mean changing it to a per-core or per-socket lock?

Either changing to such, or introducing a second, more fine grained lock.

>>Or alternatively you may need to demand that no other running
>>domains exist besides Dom0 (and hope the best for Dom0 itself).
>>
>>I also don't think there's any point invoking the operation on all HT threads on a
>>core, but I realize stop_machine_run() isn't flexible enough to allow such.
> 
> Only one thread in a core will do the actual update. Other threads only
> check the microcode version and find the version is already
> update-to-date, then exit the critical region. Considering the check may
> don't need much time, I wonder whether it can significantly benefit the
> overall time to invoking the operation on all HT threads on a core?  

With better parallelism this would be less of an issue. Plus, as said - the
infrastructure we have at present wouldn't allow anyway what I've
described above, and hand-rolling another "flavor" of the stop-machine
logic is quite certainly out of question. To answer your question though:
Taking as the example a 4-threads-per-core system with sufficiently
many cores, I'm afraid the overall time spent in handling the
"uninteresting" threads may become measurable. But of course, if actual
numbers said otherwise, I'd be fine.

Jan



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] x86/microcode: Do not upload microcode if CPUs are offline
  2018-04-13 17:57     ` Raj, Ashok
  2018-04-13 18:38       ` Borislav Petkov
@ 2018-04-16 10:30       ` Jan Beulich
  1 sibling, 0 replies; 15+ messages in thread
From: Jan Beulich @ 2018-04-16 10:30 UTC (permalink / raw)
  To: Ashok Raj
  Cc: Kevin Tian, Andrew Cooper, xen-devel, Borislav Petkov,
	Jun Nakajima, Chao Gao

>>> On 13.04.18 at 19:57, <ashok.raj@intel.com> wrote:
> On Fri, Apr 13, 2018 at 09:57:25AM -0600, Jan Beulich wrote:
>> >>> On 30.03.18 at 08:59, <chao.gao@intel.com> wrote:
>> > This patch is to backport the patch below from linux kernel.
>> > 
>> >     commit 30ec26da9967d0d785abc24073129a34c3211777
>> >     Author: Ashok Raj <ashok.raj@intel.com>
>> >     Date:   Wed Feb 28 11:28:43 2018 +0100
>> > 
>> >         x86/microcode: Do not upload microcode if CPUs are offline
>> > 
>> >         Avoid loading microcode if any of the CPUs are offline, and issue a
>> >         warning. Having different microcode revisions on the system at any time
>> >         is outright dangerous.
>> 
>> I'm afraid I don't fully agree - not applying an ucode update to the online
>> CPUs because some are offline isn't any better. Plus (while updating)
>> there's always going to be some discrepancy between ucode versions.
>> As long as we apply updates while bringing a CPU online, I think we're fine.
> 
> This is the safest option. Microcode is considered part of the cpu. We don't
> allow cpus with different capabilities in the same system.. yes they might
> work, but not something we allow. 

Please compare with the boot time CPU bringup cases: Just like when onlining
a CPU at run time, the new microcode is put in place early during bringing up
the CPU. I don't see the difference in risk, yet from what you say we should
disallow boot time applying ucode to the BSP (much earlier than the APs) as
well.

> In general we recommend early update. Earliest the best. 
> 
> - BIOS update (difficult to deploy, but some microcodes have to be done
>   this way.)
> - early update from initrd.. almost same as #1, since we apply at earliest
>   chance that's the closest and most recommended method.
> - late update. Before this procedure of stopping all cpus, we did have a 
>   time when some are updated and some werent uptodate yet. This synchronized
>   update is precicely to get as close as possible to updating all of them.
>> 
>> Also please consider valid cases you make not work anymore, like someone
>> having brought offline all sibling hyperthreads, with each core still having
>> one thread active. In that case an ucode update will implicitly update all
>> offline threads as well.
> 
> Of cource we can tweak this to be much better, there are other ideas, but
> this is an effort to keep this simple, and also address microcode requirements
> post spectre for some processors.  In the grand scheme of things although its
> interesting to allow such updates we think it may not be best practice.
> 
> We want to get this working right first before getting fancy.

Simplicity and correctness are very worthwhile goals, but please without
regressing valid scenarios.

Jan



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] x86/microcode: Synchronize late microcode loading
  2018-04-16 10:26     ` Jan Beulich
@ 2018-04-19  8:38       ` Chao Gao
  0 siblings, 0 replies; 15+ messages in thread
From: Chao Gao @ 2018-04-19  8:38 UTC (permalink / raw)
  To: Jan Beulich, Ashok Raj; +Cc: Andrew Cooper, Kevin Tian, Jun Nakajima, xen-devel

On Mon, Apr 16, 2018 at 04:26:09AM -0600, Jan Beulich wrote:
>>>> On 16.04.18 at 08:20, <chao.gao@intel.com> wrote:
>> On Fri, Apr 13, 2018 at 09:49:17AM -0600, Jan Beulich wrote:
>>>>>> On 30.03.18 at 08:59, <chao.gao@intel.com> wrote:
>>>> +static int do_microcode_update(void *_info)
>>>> +{
>>>> +    struct microcode_info *info = _info;
>>>> +    int error, ret = 0;
>>>> +
>>>> +    error = __wait_for_cpus(&info->cpu_in, USEC_PER_SEC);
>>>
>>>Why this long a timeout here?
>> 
>> I just use the same timeout as the patch on linux kernel side. As we
>> know if the timeout is too small, updating microcode may be likely to
>> failed even if other CPUs disabled interrupt temporally.
>> 
>> If you object to such a long timeout (for Xen may need much smaller
>> time to rendezvous all CPUs compared to linux kernel because Xen doesn't
>> have device drivers which may malfunction), how about just use the
>> default timeout, 30ms, used by live patching? if it is also not
>> good enough, then we make it an option which comes from callers.
>
>Yes, 30ms is likely to be acceptable.
>
>>>> +    if ( error )
>>>> +    {
>>>> +        ret = -EBUSY;
>>>> +        return ret;
>>>> +    }
>>>> +
>>>> +    error = microcode_update_cpu(info->buffer, info->buffer_size);
>>>> +    if ( error && !ret )
>>>> +        ret = error;
>>>> +    /*
>>>> +     * Increase the wait timeout to a safe value here since we're serializing
>>>> +     * the microcode update and that could take a while on a large number of
>>>> +     * CPUs. And that is fine as the *actual* timeout will be determined by
>>>> +     * the last CPU finished updating and thus cut short
>>>> +     */
>>>> +    error = __wait_for_cpus(&info->cpu_out, USEC_PER_SEC * num_online_cpus());
>>>
>>>And this one's even worse, in particular on huge systems. I'm afraid such a long
>>>period of time in stop-machine context is going to confuse most of the running
>>>domains (including Dom0). There's nothing inherently wrong with e.g. processing
>>>the updates on distinct cores (and even more so on distinct sockets) in parallel.
>> 
>> I cannot say for sure. But the original patch does want updating
>> microcode be performed one-by-one.
>
>And they don't restrict the "when" aspect in any way? I.e. all sorts of
>applications (and perhaps even KVM guests) may be running?

No any restriction on the timing I think. Ashok, could you confirm this?

>
>>>Therefore revising the locking in microcode_update_cpu() might be a necessary
>>>prereq step.
>> 
>> Do you mean changing it to a per-core or per-socket lock?
>
>Either changing to such, or introducing a second, more fine grained lock.

I will introduce a per-core lock. Then only one thread in a core will
try to update microcode as an optimization. For I am not sure whether it
is permitted to update distinct cores in parallel, I am inclined to keep
the original logic.

Thanks
Chao

>
>>>Or alternatively you may need to demand that no other running
>>>domains exist besides Dom0 (and hope the best for Dom0 itself).
>>>
>>>I also don't think there's any point invoking the operation on all HT threads on a
>>>core, but I realize stop_machine_run() isn't flexible enough to allow such.
>> 
>> Only one thread in a core will do the actual update. Other threads only
>> check the microcode version and find the version is already
>> update-to-date, then exit the critical region. Considering the check may
>> don't need much time, I wonder whether it can significantly benefit the
>> overall time to invoking the operation on all HT threads on a core?  
>
>With better parallelism this would be less of an issue. Plus, as said - the
>infrastructure we have at present wouldn't allow anyway what I've
>described above, and hand-rolling another "flavor" of the stop-machine
>logic is quite certainly out of question. To answer your question though:
>Taking as the example a 4-threads-per-core system with sufficiently
>many cores, I'm afraid the overall time spent in handling the
>"uninteresting" threads may become measurable. But of course, if actual
>numbers said otherwise, I'd be fine.
>
>

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2018-04-19  8:38 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-30  6:59 [PATCH 1/2] x86/microcode: Synchronize late microcode loading Chao Gao
2018-03-30  6:59 ` [PATCH 2/2] x86/microcode: Do not upload microcode if CPUs are offline Chao Gao
2018-04-13 15:57   ` Jan Beulich
2018-04-13 17:57     ` Raj, Ashok
2018-04-13 18:38       ` Borislav Petkov
2018-04-16 10:30       ` Jan Beulich
2018-04-12  6:31 ` [PATCH 1/2] x86/microcode: Synchronize late microcode loading Chao Gao
2018-04-12  6:38   ` Jan Beulich
2018-04-12 16:29 ` Raj, Ashok
2018-04-13  5:25   ` Chao Gao
2018-04-13  8:20     ` Jan Beulich
2018-04-13 15:49 ` Jan Beulich
2018-04-16  6:20   ` Chao Gao
2018-04-16 10:26     ` Jan Beulich
2018-04-19  8:38       ` Chao Gao

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.