* [PATCH v3 0/4] CPUID-less CPU/sync_core fixes and improvements
From: Andy Lutomirski @ 2016-12-05 21:32 UTC
  To: x86
  Cc: One Thousand Gnomes, Borislav Petkov, linux-kernel, Brian Gerst,
	Matthew Whitehead, Henrique de Moraes Holschuh, Peter Zijlstra,
	xen-devel, Juergen Gross, Boris Ostrovsky, Andrew Cooper,
	Andy Lutomirski

*** PATCHES 1 and 2 MAY BE 4.9 MATERIAL ***

Alan Cox pointed out that the 486 isn't the only supported CPU that
doesn't have CPUID.  Let's clean up the mess and make everything
faster while we're at it.

Patch 1 is intended to be an easy fix: it makes sync_core() work
without CPUID on all 32-bit kernels.  It should be quite safe.  This
will have a negligible performance cost during boot on kernels built
for newer CPUs.  With this in place, patch 2 reverts the buggy 486
check I added.

Patches 3-4 are meant to improve the situation.  Patch 3 cleans up
the Intel microcode loader, and patch 4 (which depends on patch 3 to
work correctly) stops using CPUID in sync_core() altogether.

Changes from v2:
 - Switch to IRET-to-self and get rid of all the paravirt code.
 - Further improve the sync_core() comment.

Changes from v1:
 - Fix Xen
 - Add timing info to the changelog (hint: 2x speedup)
 - Document patch 1 a bit better.

Andy Lutomirski (4):
  x86/asm/32: Make sync_core() handle missing CPUID on all 32-bit
    kernels
  Revert "x86/boot: Fail the boot if !M486 and CPUID is missing"
  x86/microcode/intel: Replace sync_core() with native_cpuid()
  x86/asm: Rewrite sync_core() to use IRET-to-self

 arch/x86/boot/cpu.c                   |  6 ---
 arch/x86/include/asm/processor.h      | 77 +++++++++++++++++++++++++----------
 arch/x86/kernel/cpu/microcode/intel.c | 26 ++++++++++--
 3 files changed, 78 insertions(+), 31 deletions(-)

-- 
2.9.3


* [PATCH v3 1/4] x86/asm/32: Make sync_core() handle missing CPUID on all 32-bit kernels
From: Andy Lutomirski @ 2016-12-05 21:32 UTC
  To: x86
  Cc: One Thousand Gnomes, Borislav Petkov, linux-kernel, Brian Gerst,
	Matthew Whitehead, Henrique de Moraes Holschuh, Peter Zijlstra,
	xen-devel, Juergen Gross, Boris Ostrovsky, Andrew Cooper,
	Andy Lutomirski

We support various non-Intel CPUs that don't have the CPUID
instruction, so the M486 test was wrong.  For now, fix it with a big
hammer: handle missing CPUID on all 32-bit CPUs.
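
For reference, the usual way to probe for CPUID on 32-bit is to check
whether the ID bit (bit 21) in EFLAGS can be toggled; CPUID exists
exactly when it can.  A minimal sketch, modeled on has_eflag() in the
boot code (the have_cpuid() name is made up; illustrative only, not
part of this patch):

	static int have_cpuid(void)
	{
		unsigned long f0, f1;

		/* Flip EFLAGS.ID and see whether the change sticks. */
		asm volatile("pushfl\n\t"		/* save EFLAGS */
			     "pushfl\n\t"
			     "popl %0\n\t"		/* f0 = EFLAGS */
			     "movl %0, %1\n\t"
			     "xorl %2, %1\n\t"		/* flip the ID bit */
			     "pushl %1\n\t"
			     "popfl\n\t"		/* write it back */
			     "pushfl\n\t"
			     "popl %1\n\t"		/* f1 = resulting EFLAGS */
			     "popfl"			/* restore EFLAGS */
			     : "=&r" (f0), "=&r" (f1)
			     : "ri" (1UL << 21));
		return !!((f0 ^ f1) & (1UL << 21));
	}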

Reported-by: One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/processor.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 984a7bf17f6a..64fbc937d586 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -595,7 +595,7 @@ static inline void sync_core(void)
 {
 	int tmp;
 
-#ifdef CONFIG_M486
+#ifdef CONFIG_X86_32
 	/*
 	 * Do a CPUID if available, otherwise do a jump.  The jump
 	 * can conveniently enough be the jump around CPUID.
-- 
2.9.3


* [PATCH v3 2/4] Revert "x86/boot: Fail the boot if !M486 and CPUID is missing"
From: Andy Lutomirski @ 2016-12-05 21:32 UTC
  To: x86
  Cc: One Thousand Gnomes, Borislav Petkov, linux-kernel, Brian Gerst,
	Matthew Whitehead, Henrique de Moraes Holschuh, Peter Zijlstra,
	xen-devel, Juergen Gross, Boris Ostrovsky, Andrew Cooper,
	Andy Lutomirski, Denys Vlasenko, H . Peter Anvin, Josh Poimboeuf,
	Linus Torvalds, Thomas Gleixner

This reverts commit ed68d7e9b9cfb64f3045ffbcb108df03c09a0f98.

The patch wasn't quite correct -- there are non-Intel (and hence
non-486) CPUs that we support that don't have CPUID.  Since we no
longer require CPUID for sync_core(), just revert the patch.

I think the relevant CPUs are Geode and Elan, but I'm not sure.

In principle, we should try to do better at identifying CPUID-less
CPUs in early boot, but that's more complicated.

Reported-by: One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk>
Cc: Matthew Whitehead <tedheadster@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/boot/cpu.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/arch/x86/boot/cpu.c b/arch/x86/boot/cpu.c
index 4224ede43b4e..26240dde081e 100644
--- a/arch/x86/boot/cpu.c
+++ b/arch/x86/boot/cpu.c
@@ -87,12 +87,6 @@ int validate_cpu(void)
 		return -1;
 	}
 
-	if (CONFIG_X86_MINIMUM_CPU_FAMILY <= 4 && !IS_ENABLED(CONFIG_M486) &&
-	    !has_eflag(X86_EFLAGS_ID)) {
-		printf("This kernel requires a CPU with the CPUID instruction.  Build with CONFIG_M486=y to run on this CPU.\n");
-		return -1;
-	}
-
 	if (err_flags) {
 		puts("This kernel requires the following features "
 		     "not present on the CPU:\n");
-- 
2.9.3


* [PATCH v3 3/4] x86/microcode/intel: Replace sync_core() with native_cpuid()
From: Andy Lutomirski @ 2016-12-05 21:32 UTC
  To: x86
  Cc: One Thousand Gnomes, Borislav Petkov, linux-kernel, Brian Gerst,
	Matthew Whitehead, Henrique de Moraes Holschuh, Peter Zijlstra,
	xen-devel, Juergen Gross, Boris Ostrovsky, Andrew Cooper,
	Andy Lutomirski

The Intel microcode driver is using sync_core() to mean "do CPUID
with EAX=1".  I want to rework sync_core(), but first the Intel
microcode driver needs to stop depending on its current behavior.
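
For reference, the SDM sequence the driver relies on around a
microcode revision read looks roughly like this (a sketch of the
pattern in the hunks below; MSR 0x8B is IA32_BIOS_SIGN_ID):

	unsigned int eax = 1, ebx, ecx = 0, edx;
	unsigned int rev_lo, rev_hi;

	/* Zero the revision MSR, then let CPUID(EAX=1) latch the update ID */
	native_wrmsrl(MSR_IA32_UCODE_REV, 0);
	native_cpuid(&eax, &ebx, &ecx, &edx);

	/* The current microcode revision is now readable from MSR 0x8B */
	native_rdmsr(MSR_IA32_UCODE_REV, rev_lo, rev_hi);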

Reported-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Acked-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/kernel/cpu/microcode/intel.c | 26 +++++++++++++++++++++++---
 1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
index cdc0deab00c9..e0981bb2a351 100644
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -356,6 +356,26 @@ get_matching_model_microcode(unsigned long start, void *data, size_t size,
 	return state;
 }
 
+static void cpuid_1(void)
+{
+	/*
+	 * According to the Intel SDM, Volume 3, 9.11.7:
+	 *
+	 *   CPUID returns a value in a model specific register in
+	 *   addition to its usual register return values. The
+	 *   semantics of CPUID cause it to deposit an update ID value
+	 *   in the 64-bit model-specific register at address 08BH
+	 *   (IA32_BIOS_SIGN_ID). If no update is present in the
+	 *   processor, the value in the MSR remains unmodified.
+	 *
+	 * Use native_cpuid -- this code runs very early and we don't
+	 * want to mess with paravirt.
+	 */
+	unsigned int eax = 1, ebx, ecx = 0, edx;
+
+	native_cpuid(&eax, &ebx, &ecx, &edx);
+}
+
 static int collect_cpu_info_early(struct ucode_cpu_info *uci)
 {
 	unsigned int val[2];
@@ -385,7 +405,7 @@ static int collect_cpu_info_early(struct ucode_cpu_info *uci)
 	native_wrmsrl(MSR_IA32_UCODE_REV, 0);
 
 	/* As documented in the SDM: Do a CPUID 1 here */
-	sync_core();
+	cpuid_1();
 
 	/* get the current revision from MSR 0x8B */
 	native_rdmsr(MSR_IA32_UCODE_REV, val[0], val[1]);
@@ -627,7 +647,7 @@ static int apply_microcode_early(struct ucode_cpu_info *uci, bool early)
 	native_wrmsrl(MSR_IA32_UCODE_REV, 0);
 
 	/* As documented in the SDM: Do a CPUID 1 here */
-	sync_core();
+	cpuid_1();
 
 	/* get the current revision from MSR 0x8B */
 	native_rdmsr(MSR_IA32_UCODE_REV, val[0], val[1]);
@@ -927,7 +947,7 @@ static int apply_microcode_intel(int cpu)
 	wrmsrl(MSR_IA32_UCODE_REV, 0);
 
 	/* As documented in the SDM: Do a CPUID 1 here */
-	sync_core();
+	cpuid_1();
 
 	/* get the current revision from MSR 0x8B */
 	rdmsr(MSR_IA32_UCODE_REV, val[0], val[1]);
-- 
2.9.3


* [PATCH v3 4/4] x86/asm: Rewrite sync_core() to use IRET-to-self
From: Andy Lutomirski @ 2016-12-05 21:32 UTC
  To: x86
  Cc: One Thousand Gnomes, Borislav Petkov, linux-kernel, Brian Gerst,
	Matthew Whitehead, Henrique de Moraes Holschuh, Peter Zijlstra,
	xen-devel, Juergen Gross, Boris Ostrovsky, Andrew Cooper,
	Andy Lutomirski, H. Peter Anvin

Aside from being excessively slow, CPUID is problematic: Linux runs
on a handful of CPUs that don't have CPUID.  Use IRET-to-self
instead.  IRET-to-self works everywhere, so it makes testing easy.

For reference, on my laptop, IRET-to-self is ~110ns,
CPUID(eax=1, ecx=0) is ~83ns on native and very very slow under KVM,
and MOV-to-CR2 is ~42ns.
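
For the curious, numbers like these can be gathered with a simple
rdtsc_ordered() loop in kernel context; a sketch, not the exact
benchmark used for the figures above:

	u64 t0, t1;
	int i;

	t0 = rdtsc_ordered();
	for (i = 0; i < 100000; i++)
		sync_core();
	t1 = rdtsc_ordered();
	/* average cycles per call; convert via the TSC frequency to ns */
	pr_info("sync_core(): %llu cycles/call\n", (t1 - t0) / 100000);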

While we're at it: sync_core() serves a very specific purpose.
Document it.

Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---
 arch/x86/include/asm/processor.h | 77 ++++++++++++++++++++++++++++------------
 1 file changed, 55 insertions(+), 22 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 64fbc937d586..201a956e345f 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -590,33 +590,66 @@ static __always_inline void cpu_relax(void)
 
 #define cpu_relax_lowlatency() cpu_relax()
 
-/* Stop speculative execution and prefetching of modified code. */
+/*
+ * This function forces the icache and prefetched instruction stream to
+ * catch up with reality in two very specific cases:
+ *
+ *  a) Text was modified using one virtual address and is about to be executed
+ *     from the same physical page at a different virtual address.
+ *
+ *  b) Text was modified on a different CPU, may subsequently be
+ *     executed on this CPU, and you want to make sure the new version
+ *     gets executed.  This generally means you're calling this in an IPI.
+ *
+ * If you're calling this for a different reason, you're probably doing
+ * it wrong.
+ */
 static inline void sync_core(void)
 {
-	int tmp;
-
-#ifdef CONFIG_X86_32
 	/*
-	 * Do a CPUID if available, otherwise do a jump.  The jump
-	 * can conveniently enough be the jump around CPUID.
+	 * There are quite a few ways to do this.  IRET-to-self is nice
+	 * because it works on every CPU, at any CPL (so it's compatible
+	 * with paravirtualization), and it never exits to a hypervisor.
+	 * The only down sides are that it's a bit slow (it seems to be
+	 * a bit more than 2x slower than the fastest options) and that
+	 * it unmasks NMIs.  The "push %cs" is needed because, in
+	 * paravirtual environments, __KERNEL_CS may not be a valid CS
+	 * value when we do IRET directly.
+	 *
+	 * In case NMI unmasking or performance ever becomes a problem,
+	 * the next best option appears to be MOV-to-CR2 and an
+	 * unconditional jump.  That sequence also works on all CPUs,
+	 * but it will fault at CPL3.
+	 *
+	 * CPUID is the conventional way, but it's nasty: it doesn't
+	 * exist on some 486-like CPUs, and it usually exits to a
+	 * hypervisor.
 	 */
-	asm volatile("cmpl %2,%1\n\t"
-		     "jl 1f\n\t"
-		     "cpuid\n"
-		     "1:"
-		     : "=a" (tmp)
-		     : "rm" (boot_cpu_data.cpuid_level), "ri" (0), "0" (1)
-		     : "ebx", "ecx", "edx", "memory");
+	register void *__sp asm(_ASM_SP);
+
+#ifdef CONFIG_X86_32
+	asm volatile (
+		"pushfl\n\t"
+		"pushl %%cs\n\t"
+		"pushl $1f\n\t"
+		"iret\n\t"
+		"1:"
+		: "+r" (__sp) : : "cc", "memory");
 #else
-	/*
-	 * CPUID is a barrier to speculative execution.
-	 * Prefetched instructions are automatically
-	 * invalidated when modified.
-	 */
-	asm volatile("cpuid"
-		     : "=a" (tmp)
-		     : "0" (1)
-		     : "ebx", "ecx", "edx", "memory");
+	unsigned long tmp;
+
+	asm volatile (
+		"movq %%ss, %0\n\t"
+		"pushq %0\n\t"
+		"pushq %%rsp\n\t"
+		"addq $8, (%%rsp)\n\t"
+		"pushfq\n\t"
+		"movq %%cs, %0\n\t"
+		"pushq %0\n\t"
+		"pushq $1f\n\t"
+		"iretq\n\t"
+		"1:"
+		: "=r" (tmp), "+r" (__sp) : : "cc", "memory");
 #endif
 }
 
-- 
2.9.3


* Re: [PATCH v3 4/4] x86/asm: Rewrite sync_core() to use IRET-to-self
From: Borislav Petkov @ 2016-12-06  7:52 UTC
  To: Andy Lutomirski
  Cc: x86, One Thousand Gnomes, linux-kernel, Brian Gerst,
	Matthew Whitehead, Henrique de Moraes Holschuh, Peter Zijlstra,
	xen-devel, Juergen Gross, Boris Ostrovsky, Andrew Cooper,
	H. Peter Anvin

On Mon, Dec 05, 2016 at 01:32:43PM -0800, Andy Lutomirski wrote:
> Aside from being excessively slow, CPUID is problematic: Linux runs
> on a handful of CPUs that don't have CPUID.  Use IRET-to-self
> instead.  IRET-to-self works everywhere, so it makes testing easy.
> 
> For reference, on my laptop, IRET-to-self is ~110ns,
> CPUID(eax=1, ecx=0) is ~83ns on native and very very slow under KVM,
> and MOV-to-CR2 is ~42ns.
> 
> While we're at it: sync_core() serves a very specific purpose.
> Document it.
> 
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
>  arch/x86/include/asm/processor.h | 77 ++++++++++++++++++++++++++++------------
>  1 file changed, 55 insertions(+), 22 deletions(-)
> 
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index 64fbc937d586..201a956e345f 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -590,33 +590,66 @@ static __always_inline void cpu_relax(void)
>  
>  #define cpu_relax_lowlatency() cpu_relax()
>  
> -/* Stop speculative execution and prefetching of modified code. */
> +/*
> + * This function forces the icache and prefetched instruction stream to
> + * catch up with reality in two very specific cases:
> + *
> + *  a) Text was modified using one virtual address and is about to be executed
> + *     from the same physical page at a different virtual address.
> + *
> + *  b) Text was modified on a different CPU, may subsequently be
> + *     executed on this CPU, and you want to make sure the new version
> + *     gets executed.  This generally means you're calling this in an IPI.
> + *
> + * If you're calling this for a different reason, you're probably doing
> + * it wrong.

"... and think hard before you call this - it is slow."

I'd add that, now that it is even slower than CPUID.

> + */
>  static inline void sync_core(void)
>  {
> -	int tmp;
> -
> -#ifdef CONFIG_X86_32
>  	/*
> -	 * Do a CPUID if available, otherwise do a jump.  The jump
> -	 * can conveniently enough be the jump around CPUID.
> +	 * There are quite a few ways to do this.  IRET-to-self is nice
> +	 * because it works on every CPU, at any CPL (so it's compatible
> +	 * with paravirtualization), and it never exits to a hypervisor.
> +	 * The only down sides are that it's a bit slow (it seems to be
> +	 * a bit more than 2x slower than the fastest options) and that
> +	 * it unmasks NMIs.

Ewww, I hadn't thought of that angle. Aren't we going to get into all
kinds of hard-to-debug issues due to that couple-of-cycles window of
unmasked NMIs?

We sync_core in some NMI handler and then right in the middle of it we
get another NMI. Yeah, we have the nested NMI stuff still but I'd like
to avoid complications if possible.

> The "push %cs" is needed because, in
> +	 * paravirtual environments, __KERNEL_CS may not be a valid CS
> +	 * value when we do IRET directly.
> +	 *
> +	 * In case NMI unmasking or performance ever becomes a problem,
> +	 * the next best option appears to be MOV-to-CR2 and an
> +	 * unconditional jump.  That sequence also works on all CPUs,
> +	 * but it will fault at CPL3.

Does it really have to be non-privileged?

If not, there are a couple more serializing insns:

"• Privileged serializing instructions — INVD, INVEPT, INVLPG,
INVVPID, LGDT, LIDT, LLDT, LTR, MOV (to control register, with the
exception of MOV CR8), MOV (to debug register), WBINVD, and WRMSR"

What about INVD? It is expensive too :-)

Can't we use, MOV %dr or so? With previously saving/restoring the reg?

WBINVD could be another candidate, albeit a big hammer.

WRMSR maybe too. If it faults, it's fine too because then you have the
IRET automagically. Hell, we could even make it fault on purpose by
writing into an invalid MSR but then we're back to the unmasking NMIs...
:-\

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


* Re: [Xen-devel] [PATCH v3 4/4] x86/asm: Rewrite sync_core() to use IRET-to-self
From: Jan Beulich @ 2016-12-06  8:46 UTC
  To: Andy Lutomirski
  Cc: Borislav Petkov, Andrew Cooper, Brian Gerst, Matthew Whitehead,
	Henrique de Moraes Holschuh, Peter Zijlstra, x86, xen-devel,
	One Thousand Gnomes, Boris Ostrovsky, Juergen Gross,
	linux-kernel, H. Peter Anvin

>>> On 05.12.16 at 22:32, <luto@kernel.org> wrote:
>  static inline void sync_core(void)
>  {
> -	int tmp;
> -
> -#ifdef CONFIG_X86_32
>  	/*
> -	 * Do a CPUID if available, otherwise do a jump.  The jump
> -	 * can conveniently enough be the jump around CPUID.
> +	 * There are quite a few ways to do this.  IRET-to-self is nice
> +	 * because it works on every CPU, at any CPL (so it's compatible
> +	 * with paravirtualization), and it never exits to a hypervisor.
> +	 * The only down sides are that it's a bit slow (it seems to be
> +	 * a bit more than 2x slower than the fastest options) and that
> +	 * it unmasks NMIs.  The "push %cs" is needed because, in
> +	 * paravirtual environments, __KERNEL_CS may not be a valid CS
> +	 * value when we do IRET directly.
> +	 *
> +	 * In case NMI unmasking or performance ever becomes a problem,
> +	 * the next best option appears to be MOV-to-CR2 and an
> +	 * unconditional jump.  That sequence also works on all CPUs,
> +	 * but it will fault at CPL3.

CPL > 0 I think.

> +	 * CPUID is the conventional way, but it's nasty: it doesn't
> +	 * exist on some 486-like CPUs, and it usually exits to a
> +	 * hypervisor.
>  	 */
> -	asm volatile("cmpl %2,%1\n\t"
> -		     "jl 1f\n\t"
> -		     "cpuid\n"
> -		     "1:"
> -		     : "=a" (tmp)
> -		     : "rm" (boot_cpu_data.cpuid_level), "ri" (0), "0" (1)
> -		     : "ebx", "ecx", "edx", "memory");
> +	register void *__sp asm(_ASM_SP);
> +
> +#ifdef CONFIG_X86_32
> +	asm volatile (
> +		"pushfl\n\t"
> +		"pushl %%cs\n\t"
> +		"pushl $1f\n\t"
> +		"iret\n\t"
> +		"1:"
> +		: "+r" (__sp) : : "cc", "memory");

I don't think EFLAGS (i.e. "cc") gets modified anywhere here. And
the memory clobber would perhaps better be pulled out into an
explicit barrier() invocation (making it more obvious what it's needed
for)?

>  #else
> -	/*
> -	 * CPUID is a barrier to speculative execution.
> -	 * Prefetched instructions are automatically
> -	 * invalidated when modified.
> -	 */
> -	asm volatile("cpuid"
> -		     : "=a" (tmp)
> -		     : "0" (1)
> -		     : "ebx", "ecx", "edx", "memory");
> +	unsigned long tmp;
> +
> +	asm volatile (
> +		"movq %%ss, %0\n\t"
> +		"pushq %0\n\t"
> +		"pushq %%rsp\n\t"
> +		"addq $8, (%%rsp)\n\t"
> +		"pushfq\n\t"
> +		"movq %%cs, %0\n\t"
> +		"pushq %0\n\t"
> +		"pushq $1f\n\t"
> +		"iretq\n\t"
> +		"1:"
> +		: "=r" (tmp), "+r" (__sp) : : "cc", "memory");

The first output needs to be "=&r". And is movq really a good
idea for selector reads? Why don't you make tmp unsigned int,
use plain mov, and use %q0 as pushq's operands?
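
I.e., something like this (an untested sketch of the suggested
variant, reusing the __sp register variable from the patch):

	unsigned int tmp;

	asm volatile (
		"mov %%ss, %0\n\t"	/* 32-bit write zero-extends to 64 bits */
		"pushq %q0\n\t"
		"pushq %%rsp\n\t"
		"addq $8, (%%rsp)\n\t"
		"pushfq\n\t"
		"mov %%cs, %0\n\t"
		"pushq %q0\n\t"
		"pushq $1f\n\t"
		"iretq\n\t"
		"1:"
		: "=&r" (tmp), "+r" (__sp) : : "memory");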

Jan


* Re: [Xen-devel] [PATCH v3 4/4] x86/asm: Rewrite sync_core() to use IRET-to-self
From: Peter Zijlstra @ 2016-12-06  9:25 UTC
  To: Jan Beulich
  Cc: Andy Lutomirski, Borislav Petkov, Andrew Cooper, Brian Gerst,
	Matthew Whitehead, Henrique de Moraes Holschuh, x86, xen-devel,
	One Thousand Gnomes, Boris Ostrovsky, Juergen Gross,
	linux-kernel, H. Peter Anvin

On Tue, Dec 06, 2016 at 01:46:37AM -0700, Jan Beulich wrote:
> > +	asm volatile (
> > +		"pushfl\n\t"
> > +		"pushl %%cs\n\t"
> > +		"pushl $1f\n\t"
> > +		"iret\n\t"
> > +		"1:"
> > +		: "+r" (__sp) : : "cc", "memory");
> 
> I don't think EFLAGS (i.e. "cc") gets modified anywhere here. And
> the memory clobber would perhaps better be pulled out into an
> explicit barrier() invocation (making it more obvious what it's needed
> for)?

Everything that implies a memory barrier (and I think serializing
instructions do that) also implies a compiler barrier.

Not doing the memory clobber gets you inconsistency wrt everything else.
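
For reference, barrier() is itself nothing more than an empty asm
statement with a memory clobber, so splitting the clobber out into a
barrier() call wouldn't change what the compiler sees:

	/* as defined (modulo whitespace) in include/linux/compiler-gcc.h */
	#define barrier() __asm__ __volatile__("" : : : "memory")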


* Re: [Xen-devel] [PATCH v3 4/4] x86/asm: Rewrite sync_core() to use IRET-to-self
From: Andy Lutomirski @ 2016-12-06 17:46 UTC
  To: Jan Beulich
  Cc: Peter Zijlstra, Borislav Petkov, Andrew Cooper, Brian Gerst,
	Matthew Whitehead, Henrique de Moraes Holschuh, Andy Lutomirski,
	X86 ML, xen-devel, One Thousand Gnomes, Boris Ostrovsky,
	Juergen Gross, linux-kernel, H. Peter Anvin

On Tue, Dec 6, 2016 at 1:49 AM, Jan Beulich <JBeulich@suse.com> wrote:
>>>> On 06.12.16 at 10:25, <peterz@infradead.org> wrote:
>> On Tue, Dec 06, 2016 at 01:46:37AM -0700, Jan Beulich wrote:
>>> > +  asm volatile (
>>> > +          "pushfl\n\t"
>>> > +          "pushl %%cs\n\t"
>>> > +          "pushl $1f\n\t"
>>> > +          "iret\n\t"
>>> > +          "1:"
>>> > +          : "+r" (__sp) : : "cc", "memory");
>>>
>>> I don't think EFLAGS (i.e. "cc") gets modified anywhere here. And
>>> the memory clobber would perhaps better be pulled out into an
>>> explicit barrier() invocation (making it more obvious what it's needed
>>> for)?
>>
>> Everything that implies a memory barrier (and I think serializing
>> instructions do that) also implies a compiler barrier.
>>
>> Not doing the memory clobber gets you inconsistency wrt everything else.
>
> Well, I didn't say to drop the memory clobber altogether, but to
> split it into a separate barrier() invocation (placed perhaps after
> the #endif).

I'll add a comment.  I'm fixing up the constraints, too.  (Although if
gcc allocated tmp into rsp, that would be very strange indeed.)

--Andy


* Re: [PATCH v3 4/4] x86/asm: Rewrite sync_core() to use IRET-to-self
From: Andy Lutomirski @ 2016-12-06 17:49 UTC
  To: Borislav Petkov
  Cc: Andy Lutomirski, X86 ML, One Thousand Gnomes, linux-kernel,
	Brian Gerst, Matthew Whitehead, Henrique de Moraes Holschuh,
	Peter Zijlstra, xen-devel, Juergen Gross, Boris Ostrovsky,
	Andrew Cooper, H. Peter Anvin

On Mon, Dec 5, 2016 at 11:52 PM, Borislav Petkov <bp@alien8.de> wrote:
> On Mon, Dec 05, 2016 at 01:32:43PM -0800, Andy Lutomirski wrote:
>> Aside from being excessively slow, CPUID is problematic: Linux runs
>> on a handful of CPUs that don't have CPUID.  Use IRET-to-self
>> instead.  IRET-to-self works everywhere, so it makes testing easy.
>>
>> For reference, on my laptop, IRET-to-self is ~110ns,
>> CPUID(eax=1, ecx=0) is ~83ns on native and very very slow under KVM,
>> and MOV-to-CR2 is ~42ns.
>>
>> While we're at it: sync_core() serves a very specific purpose.
>> Document it.
>>
>> Cc: "H. Peter Anvin" <hpa@zytor.com>
>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> ---
>>  arch/x86/include/asm/processor.h | 77 ++++++++++++++++++++++++++++------------
>>  1 file changed, 55 insertions(+), 22 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
>> index 64fbc937d586..201a956e345f 100644
>> --- a/arch/x86/include/asm/processor.h
>> +++ b/arch/x86/include/asm/processor.h
>> @@ -590,33 +590,66 @@ static __always_inline void cpu_relax(void)
>>
>>  #define cpu_relax_lowlatency() cpu_relax()
>>
>> -/* Stop speculative execution and prefetching of modified code. */
>> +/*
>> + * This function forces the icache and prefetched instruction stream to
>> + * catch up with reality in two very specific cases:
>> + *
>> + *  a) Text was modified using one virtual address and is about to be executed
>> + *     from the same physical page at a different virtual address.
>> + *
>> + *  b) Text was modified on a different CPU, may subsequently be
>> + *     executed on this CPU, and you want to make sure the new version
>> + *     gets executed.  This generally means you're calling this in an IPI.
>> + *
>> + * If you're calling this for a different reason, you're probably doing
>> + * it wrong.
>
> "... and think hard before you call this - it is slow."
>
> I'd add that, now that it is even slower than CPUID.

But only barely.  And it's hugely faster than CPUID under KVM or
similar.  And it works on all CPUs.

>
>> + */
>>  static inline void sync_core(void)
>>  {
>> -     int tmp;
>> -
>> -#ifdef CONFIG_X86_32
>>       /*
>> -      * Do a CPUID if available, otherwise do a jump.  The jump
>> -      * can conveniently enough be the jump around CPUID.
>> +      * There are quite a few ways to do this.  IRET-to-self is nice
>> +      * because it works on every CPU, at any CPL (so it's compatible
>> +      * with paravirtualization), and it never exits to a hypervisor.
>> +      * The only down sides are that it's a bit slow (it seems to be
>> +      * a bit more than 2x slower than the fastest options) and that
>> +      * it unmasks NMIs.
>
> Ewww, I hadn't thought of that angle. Aren't we going to get into all
> kinds of hard-to-debug issues due to that couple-of-cycles window of
> unmasked NMIs?
>
> We sync_core in some NMI handler and then right in the middle of it we
> get another NMI. Yeah, we have the nested NMI stuff still but I'd like
> to avoid complications if possible.

I'll counter with:

1. Why on earth would an NMI call sync_core()?

2. We have very careful code to handle this issue, and NMIs really do
cause page faults, which have exactly the same problem.

So I'm not too worried.

>
>> The "push %cs" is needed because, in
>> +      * paravirtual environments, __KERNEL_CS may not be a valid CS
>> +      * value when we do IRET directly.
>> +      *
>> +      * In case NMI unmasking or performance ever becomes a problem,
>> +      * the next best option appears to be MOV-to-CR2 and an
>> +      * unconditional jump.  That sequence also works on all CPUs,
>> +      * but it will fault at CPL3.
>
> Does it really have to be non-privileged?

Unless we want to declare lguest unsupported, delete it from the tree,
or, sigh, actually maintain it, then yes :(  native_write_cr2() would
work on Xen, but it's slow.

>
> If not, there are a couple more serializing insns:
>
> "• Privileged serializing instructions — INVD, INVEPT, INVLPG,
> INVVPID, LGDT, LIDT, LLDT, LTR, MOV (to control register, with the
> exception of MOV CR8), MOV (to debug register), WBINVD, and WRMSR"
>
> What about INVD? It is expensive too :-)

Only if you write the patch and label it:

Snickered-at-by: Andy Lutomirski <luto@kernel.org>

>
> Can't we use, MOV %dr or so? With previously saving/restoring the reg?
>
> WBINVD could be another candidate, albeit a big hammer.
>
> WRMSR maybe too. If it faults, it's fine too because then you have the
> IRET automagically. Hell, we could even make it fault on purpose by
> writing into an invalid MSR but then we're back to the unmasking NMIs...
> :-\

I still like MOV-to-CR2 better than all of these.
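
FWIW, a MOV-to-CR2 variant would look something like this (an untested
sketch; CR2 only holds the last page fault address, so saving it and
writing the same value back keeps it intact):

	unsigned long cr2;

	asm volatile (
		"mov %%cr2, %0\n\t"	/* save the last fault address */
		"mov %0, %%cr2\n\t"	/* serializing write of the same value */
		"jmp 1f\n\t"		/* plus the unconditional jump */
		"1:"
		: "=r" (cr2) : : "memory");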

--Andy


* Re: [Xen-devel] [PATCH v3 4/4] x86/asm: Rewrite sync_core() to use IRET-to-self
From: H. Peter Anvin @ 2016-12-06 19:32 UTC
  To: Jan Beulich, Andy Lutomirski
  Cc: Borislav Petkov, Andrew Cooper, Brian Gerst, Matthew Whitehead,
	Henrique de Moraes Holschuh, Peter Zijlstra, x86, xen-devel,
	One Thousand Gnomes, Boris Ostrovsky, Juergen Gross,
	linux-kernel

On 12/06/16 00:46, Jan Beulich wrote:
>> +
>> +#ifdef CONFIG_X86_32
>> +	asm volatile (
>> +		"pushfl\n\t"
>> +		"pushl %%cs\n\t"
>> +		"pushl $1f\n\t"
>> +		"iret\n\t"
>> +		"1:"
>> +		: "+r" (__sp) : : "cc", "memory");
> 
> I don't think EFLAGS (i.e. "cc") gets modified anywhere here. And
> the memory clobber would perhaps better be pulled out into an
> explicit barrier() invocation (making it more obvious what it's needed
> for)?
> 

Not to mention "cc" doesn't do anything on x86 at all.

	-hpa

