linux-kernel.vger.kernel.org archive mirror
* [RFC 00/10] Speculation Control feature support
@ 2018-01-20 19:22 KarimAllah Ahmed
  2018-01-20 19:22 ` [RFC 01/10] x86/speculation: Add basic support for IBPB KarimAllah Ahmed
                   ` (10 more replies)
  0 siblings, 11 replies; 143+ messages in thread
From: KarimAllah Ahmed @ 2018-01-20 19:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: KarimAllah Ahmed, Andi Kleen, Andrea Arcangeli, Andy Lutomirski,
	Arjan van de Ven, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, David Woodhouse, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Linus Torvalds,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm, x86

Start using the newly-added microcode features for speculation control on both
Intel and AMD CPUs to protect against Spectre v2.

This patch series covers interrupts, system calls, context switching between
processes, and context switching between VMs. It also exposes the Indirect
Branch Prediction Barrier (IBPB) MSR to KVM guests.

TODO:

- Introduce a microcode blacklist to disable the feature on broken microcode
  revisions.
- Restrict/Unrestrict the speculation (by toggling IBRS) around VMExit and
  VMEnter for KVM and expose IBRS to guests.

Ashok Raj (1):
  x86/kvm: Add IBPB support

David Woodhouse (1):
  x86/speculation: Add basic IBRS support infrastructure

KarimAllah Ahmed (1):
  x86: Simplify spectre_v2 command line parsing

Thomas Gleixner (4):
  x86/speculation: Add basic support for IBPB
  x86/speculation: Use Indirect Branch Prediction Barrier in context
    switch
  x86/speculation: Add inlines to control Indirect Branch Speculation
  x86/idle: Control Indirect Branch Speculation in idle

Tim Chen (3):
  x86/mm: Only flush indirect branches when switching into non dumpable
    process
  x86/enter: Create macros to restrict/unrestrict Indirect Branch
    Speculation
  x86/enter: Use IBRS on syscall and interrupts

 Documentation/admin-guide/kernel-parameters.txt |   1 +
 arch/x86/entry/calling.h                        |  73 ++++++++++
 arch/x86/entry/entry_64.S                       |  35 ++++-
 arch/x86/entry/entry_64_compat.S                |  21 ++-
 arch/x86/include/asm/cpufeatures.h              |   2 +
 arch/x86/include/asm/mwait.h                    |  14 ++
 arch/x86/include/asm/nospec-branch.h            |  54 ++++++-
 arch/x86/kernel/cpu/bugs.c                      | 183 +++++++++++++++---------
 arch/x86/kernel/process.c                       |  14 ++
 arch/x86/kvm/svm.c                              |  14 ++
 arch/x86/kvm/vmx.c                              |   4 +
 arch/x86/mm/tlb.c                               |  21 ++-
 12 files changed, 359 insertions(+), 77 deletions(-)


Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Asit Mallick <asit.k.mallick@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jun Nakajima <jun.nakajima@intel.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: x86@kernel.org

-- 
2.7.4


* [RFC 01/10] x86/speculation: Add basic support for IBPB
  2018-01-20 19:22 [RFC 00/10] Speculation Control feature support KarimAllah Ahmed
@ 2018-01-20 19:22 ` KarimAllah Ahmed
  2018-01-20 19:22 ` [RFC 02/10] x86/kvm: Add IBPB support KarimAllah Ahmed
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 143+ messages in thread
From: KarimAllah Ahmed @ 2018-01-20 19:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: KarimAllah Ahmed, Andi Kleen, Andrea Arcangeli, Andy Lutomirski,
	Arjan van de Ven, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, David Woodhouse, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Linus Torvalds,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm, x86

From: Thomas Gleixner <tglx@linutronix.de>

Expose indirect_branch_prediction_barrier() for use in subsequent patches.

[karahmed: remove the special-casing of skylake for using IBPB (wtf?),
           switch to using ALTERNATIVES instead of static_cpu_has]
[dwmw2:    set up ax/cx/dx in the asm too so it gets NOP'd out]
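
As a rough illustration of what the patched-in sequence loads before WRMSR,
the sketch below models the ECX/EAX/EDX operands in plain userspace C. It
does not execute WRMSR (which is privileged). The MSR index and bit value
match the kernel's msr-index.h; the struct and helper names are hypothetical
and exist only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Values as defined in the kernel's msr-index.h. */
#define MSR_IA32_PRED_CMD	0x00000049u
#define PRED_CMD_IBPB		(1u << 0)

/* Hypothetical container for the three registers WRMSR consumes. */
struct wrmsr_operands {
	uint32_t ecx;	/* MSR index */
	uint32_t eax;	/* low 32 bits of the value */
	uint32_t edx;	/* high 32 bits of the value */
};

/* Split a 64-bit MSR value into the ECX/EAX/EDX triple used by WRMSR. */
static struct wrmsr_operands wrmsr_split(uint32_t msr, uint64_t val)
{
	struct wrmsr_operands ops = {
		.ecx = msr,
		.eax = (uint32_t)val,
		.edx = (uint32_t)(val >> 32),
	};
	return ops;
}

/* Operands the IBPB sequence would load when X86_FEATURE_IBPB is set. */
static struct wrmsr_operands ibpb_operands(void)
{
	return wrmsr_split(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
}

/* Small self-check: IBPB is bit 0 of MSR 0x49, upper half zero. */
static void ibpb_operands_selftest(void)
{
	struct wrmsr_operands ops = ibpb_operands();

	assert(ops.ecx == 0x49);
	assert(ops.eax == 1);
	assert(ops.edx == 0);
}
```

Calling ibpb_operands_selftest() from a main() exercises the model. In the
real inline asm the register setup lives inside the ALTERNATIVE so that the
whole sequence, not just the WRMSR, is NOP'd out on CPUs without the feature.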

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/include/asm/cpufeatures.h   |  1 +
 arch/x86/include/asm/nospec-branch.h | 16 ++++++++++++++++
 arch/x86/kernel/cpu/bugs.c           |  7 +++++++
 3 files changed, 24 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 624d978..8ec9588 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -207,6 +207,7 @@
 #define X86_FEATURE_RETPOLINE_AMD	( 7*32+13) /* AMD Retpoline mitigation for Spectre variant 2 */
 #define X86_FEATURE_INTEL_PPIN		( 7*32+14) /* Intel Processor Inventory Number */
 
+#define X86_FEATURE_IBPB		( 7*32+16) /* Using Indirect Branch Prediction Barrier */
 #define X86_FEATURE_AMD_PRED_CMD	( 7*32+17) /* Prediction Command MSR (AMD) */
 #define X86_FEATURE_MBA			( 7*32+18) /* Memory Bandwidth Allocation */
 #define X86_FEATURE_RSB_CTXSW		( 7*32+19) /* Fill RSB on context switches */
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index 4ad4108..c333c95 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -218,5 +218,21 @@ static inline void vmexit_fill_RSB(void)
 #endif
 }
 
+static inline void indirect_branch_prediction_barrier(void)
+{
+	unsigned long ax, cx, dx;
+
+	asm volatile(ALTERNATIVE("",
+				 "movl %[msr], %%ecx\n\t"
+				 "movl %[val], %%eax\n\t"
+				 "movl $0, %%edx\n\t"
+				 "wrmsr",
+				 X86_FEATURE_IBPB)
+		     : "=a" (ax), "=c" (cx), "=d" (dx)
+		     : [msr] "i" (MSR_IA32_PRED_CMD),
+		       [val] "i" (PRED_CMD_IBPB)
+		     : "memory");
+}
+
 #endif /* __ASSEMBLY__ */
 #endif /* __NOSPEC_BRANCH_H__ */
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 390b3dc..96548ff 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -249,6 +249,13 @@ static void __init spectre_v2_select_mitigation(void)
 		setup_force_cpu_cap(X86_FEATURE_RSB_CTXSW);
 		pr_info("Filling RSB on context switch\n");
 	}
+
+	/* Initialize Indirect Branch Prediction Barrier if supported */
+	if (boot_cpu_has(X86_FEATURE_SPEC_CTRL) ||
+	    boot_cpu_has(X86_FEATURE_AMD_PRED_CMD)) {
+		setup_force_cpu_cap(X86_FEATURE_IBPB);
+		pr_info("Enabling Indirect Branch Prediction Barrier\n");
+	}
 }
 
 #undef pr_fmt
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 143+ messages in thread


* [RFC 02/10] x86/kvm: Add IBPB support
  2018-01-20 19:22 [RFC 00/10] Speculation Control feature support KarimAllah Ahmed
  2018-01-20 19:22 ` [RFC 01/10] x86/speculation: Add basic support for IBPB KarimAllah Ahmed
@ 2018-01-20 19:22 ` KarimAllah Ahmed
  2018-01-20 20:18   ` Woodhouse, David
  2018-01-22 18:56   ` Jim Mattson
  2018-01-20 19:22 ` [RFC 03/10] x86/speculation: Use Indirect Branch Prediction Barrier in context switch KarimAllah Ahmed
                   ` (8 subsequent siblings)
  10 siblings, 2 replies; 143+ messages in thread
From: KarimAllah Ahmed @ 2018-01-20 19:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: KarimAllah Ahmed, Andi Kleen, Andrea Arcangeli, Andy Lutomirski,
	Arjan van de Ven, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, David Woodhouse, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Linus Torvalds,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm, x86,
	Arjan Van De Ven

From: Ashok Raj <ashok.raj@intel.com>

Add MSR passthrough for MSR_IA32_PRED_CMD and place branch predictor
barriers on switching between VMs to avoid inter-VM Spectre-v2 attacks.
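
The "barrier only when the guest changes" idea can be sketched as a small
userspace model. This is not the KVM code: the per-CPU array stands in for
sd->current_vmcb, the counter stands in for the real
indirect_branch_prediction_barrier(), and every name ending in _sketch is
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS_SKETCH 4	/* hypothetical CPU count for the model */

/* Stand-in for a vCPU; only the control-block pointer matters here. */
struct vcpu_sketch {
	const void *vmcb;
};

static const void *current_vmcb[NR_CPUS_SKETCH];	/* last VMCB loaded per CPU */
static unsigned int barriers_issued;			/* number of modelled IBPBs */

/* Stand-in for indirect_branch_prediction_barrier(). */
static void barrier_sketch(void)
{
	barriers_issued++;
}

/* Models the check added to svm_vcpu_load(): barrier only on a guest change. */
static void vcpu_load_sketch(const struct vcpu_sketch *vcpu, int cpu)
{
	if (current_vmcb[cpu] != vcpu->vmcb) {
		current_vmcb[cpu] = vcpu->vmcb;
		barrier_sketch();
	}
}

/* Two distinct addresses standing in for two guests' control blocks. */
static int vmcb_a, vmcb_b;

static void vcpu_load_selftest(void)
{
	struct vcpu_sketch va = { &vmcb_a }, vb = { &vmcb_b };

	vcpu_load_sketch(&va, 0);	/* first load on CPU 0: barrier */
	vcpu_load_sketch(&va, 0);	/* same guest again: no barrier */
	vcpu_load_sketch(&vb, 0);	/* different guest: barrier */
	vcpu_load_sketch(&va, 1);	/* CPU 1 has not seen this guest: barrier */
	assert(barriers_issued == 3);
}
```

Because the comparison is by address, a freed and recycled control-block page
could compare equal to the previous guest's, which is why the patch also
issues an unconditional barrier in svm_free_vcpu().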

[peterz: rebase and changelog rewrite]
[dwmw2: fixes]
[karahmed: - vmx: expose PRED_CMD whenever it is available
	   - svm: only pass through IBPB if it is available]

Cc: Asit Mallick <asit.k.mallick@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jun Nakajima <jun.nakajima@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1515720739-43819-6-git-send-email-ashok.raj@intel.com

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
---
 arch/x86/kvm/svm.c | 14 ++++++++++++++
 arch/x86/kvm/vmx.c |  4 ++++
 2 files changed, 18 insertions(+)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 2744b973..cfdb9ab 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -529,6 +529,7 @@ struct svm_cpu_data {
 	struct kvm_ldttss_desc *tss_desc;
 
 	struct page *save_area;
+	struct vmcb *current_vmcb;
 };
 
 static DEFINE_PER_CPU(struct svm_cpu_data *, svm_data);
@@ -918,6 +919,9 @@ static void svm_vcpu_init_msrpm(u32 *msrpm)
 
 		set_msr_interception(msrpm, direct_access_msrs[i].index, 1, 1);
 	}
+
+	if (boot_cpu_has(X86_FEATURE_AMD_PRED_CMD))
+		set_msr_interception(msrpm, MSR_IA32_PRED_CMD, 1, 1);
 }
 
 static void add_msr_offset(u32 offset)
@@ -1706,11 +1710,17 @@ static void svm_free_vcpu(struct kvm_vcpu *vcpu)
 	__free_pages(virt_to_page(svm->nested.msrpm), MSRPM_ALLOC_ORDER);
 	kvm_vcpu_uninit(vcpu);
 	kmem_cache_free(kvm_vcpu_cache, svm);
+	/*
+	 * The vmcb page can be recycled, causing a false negative in
+	 * svm_vcpu_load(). So do a full IBPB now.
+	 */
+	indirect_branch_prediction_barrier();
 }
 
 static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
+	struct svm_cpu_data *sd = per_cpu(svm_data, cpu);
 	int i;
 
 	if (unlikely(cpu != vcpu->cpu)) {
@@ -1739,6 +1749,10 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	if (static_cpu_has(X86_FEATURE_RDTSCP))
 		wrmsrl(MSR_TSC_AUX, svm->tsc_aux);
 
+	if (sd->current_vmcb != svm->vmcb) {
+		sd->current_vmcb = svm->vmcb;
+		indirect_branch_prediction_barrier();
+	}
 	avic_vcpu_load(vcpu, cpu);
 }
 
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index d1e25db..3b64de2 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2279,6 +2279,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	if (per_cpu(current_vmcs, cpu) != vmx->loaded_vmcs->vmcs) {
 		per_cpu(current_vmcs, cpu) = vmx->loaded_vmcs->vmcs;
 		vmcs_load(vmx->loaded_vmcs->vmcs);
+		indirect_branch_prediction_barrier();
 	}
 
 	if (!already_loaded) {
@@ -6791,6 +6792,9 @@ static __init int hardware_setup(void)
 		kvm_tsc_scaling_ratio_frac_bits = 48;
 	}
 
+	if (boot_cpu_has(X86_FEATURE_SPEC_CTRL))
+		vmx_disable_intercept_for_msr(MSR_IA32_PRED_CMD, false);
+
 	vmx_disable_intercept_for_msr(MSR_FS_BASE, false);
 	vmx_disable_intercept_for_msr(MSR_GS_BASE, false);
 	vmx_disable_intercept_for_msr(MSR_KERNEL_GS_BASE, true);
-- 
2.7.4


* [RFC 03/10] x86/speculation: Use Indirect Branch Prediction Barrier in context switch
  2018-01-20 19:22 [RFC 00/10] Speculation Control feature support KarimAllah Ahmed
  2018-01-20 19:22 ` [RFC 01/10] x86/speculation: Add basic support for IBPB KarimAllah Ahmed
  2018-01-20 19:22 ` [RFC 02/10] x86/kvm: Add IBPB support KarimAllah Ahmed
@ 2018-01-20 19:22 ` KarimAllah Ahmed
  2018-01-20 19:22 ` [RFC 04/10] x86/mm: Only flush indirect branches when switching into non dumpable process KarimAllah Ahmed
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 143+ messages in thread
From: KarimAllah Ahmed @ 2018-01-20 19:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: KarimAllah Ahmed, Andi Kleen, Andrea Arcangeli, Andy Lutomirski,
	Arjan van de Ven, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, David Woodhouse, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Linus Torvalds,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm, x86

From: Thomas Gleixner <tglx@linutronix.de>

[peterz: comment]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/mm/tlb.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index a156195..304de7d 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -6,13 +6,14 @@
 #include <linux/interrupt.h>
 #include <linux/export.h>
 #include <linux/cpu.h>
+#include <linux/debugfs.h>
 
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
+#include <asm/nospec-branch.h>
 #include <asm/cache.h>
 #include <asm/apic.h>
 #include <asm/uv/uv.h>
-#include <linux/debugfs.h>
 
 /*
  *	TLB flushing, formerly SMP-only
@@ -220,6 +221,13 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		u16 new_asid;
 		bool need_flush;
 
+		/*
+		 * Avoid user/user BTB poisoning by flushing the branch predictor
+		 * when switching between processes. This stops one process from
+		 * doing Spectre-v2 attacks on another.
+		 */
+		indirect_branch_prediction_barrier();
+
 		if (IS_ENABLED(CONFIG_VMAP_STACK)) {
 			/*
 			 * If our current stack is in vmalloc space and isn't
-- 
2.7.4


* [RFC 04/10] x86/mm: Only flush indirect branches when switching into non dumpable process
  2018-01-20 19:22 [RFC 00/10] Speculation Control feature support KarimAllah Ahmed
                   ` (2 preceding siblings ...)
  2018-01-20 19:22 ` [RFC 03/10] x86/speculation: Use Indirect Branch Prediction Barrier in context switch KarimAllah Ahmed
@ 2018-01-20 19:22 ` KarimAllah Ahmed
  2018-01-20 21:06   ` Woodhouse, David
  2018-01-21 11:22   ` Peter Zijlstra
  2018-01-20 19:22 ` [RFC 05/10] x86/speculation: Add basic IBRS support infrastructure KarimAllah Ahmed
                   ` (6 subsequent siblings)
  10 siblings, 2 replies; 143+ messages in thread
From: KarimAllah Ahmed @ 2018-01-20 19:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: KarimAllah Ahmed, Andi Kleen, Andrea Arcangeli, Andy Lutomirski,
	Arjan van de Ven, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, David Woodhouse, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Linus Torvalds,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm, x86

From: Tim Chen <tim.c.chen@linux.intel.com>

Flush indirect branches when switching into a process that has marked
itself non-dumpable. This gives better protection to high-value processes
such as gpg without incurring too much performance overhead.
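
For context, a userspace process marks itself non-dumpable with
prctl(PR_SET_DUMPABLE, 0). The sketch below shows that call together with a
model of the predicate this patch adds to switch_mm_irqs_off(). The prctl()
options are the real Linux API; would_flush_sketch() and
SUID_DUMP_USER_SKETCH are hypothetical names for this example (the value 1
mirrors the kernel's SUID_DUMP_USER).

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/prctl.h>

/* Mirrors the kernel's SUID_DUMP_USER (the default, dumpable state). */
#define SUID_DUMP_USER_SKETCH 1

/* Mark the calling process non-dumpable; returns 0 on success, -1 on error. */
static int mark_non_dumpable(void)
{
	return prctl(PR_SET_DUMPABLE, 0, 0, 0, 0);
}

/* Return the calling process's current dumpable value. */
static int query_dumpable(void)
{
	return prctl(PR_GET_DUMPABLE, 0, 0, 0, 0);
}

/*
 * Model of the condition in the patch: flush only when the incoming task
 * has an mm (so not a kernel thread) and is not in the default dumpable state.
 */
static bool would_flush_sketch(bool has_mm, int dumpable)
{
	return has_mm && dumpable != SUID_DUMP_USER_SKETCH;
}

static void dumpable_selftest(void)
{
	assert(mark_non_dumpable() == 0);
	assert(query_dumpable() == 0);
	assert(would_flush_sketch(true, query_dumpable()));
	assert(!would_flush_sketch(true, SUID_DUMP_USER_SKETCH));
	assert(!would_flush_sketch(false, 0));	/* kernel thread: no mm */
}
```

Under this policy only processes that opt in (or that the kernel has already
made non-dumpable, for example after a setuid exec) pay for the barrier on
every switch-in.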

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
---
 arch/x86/mm/tlb.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 304de7d..f64e80c 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -225,8 +225,19 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		 * Avoid user/user BTB poisoning by flushing the branch predictor
 		 * when switching between processes. This stops one process from
 		 * doing Spectre-v2 attacks on another.
+		 *
+		 * As an optimization: Flush indirect branches only when
+		 * switching into processes that disable dumping.
+		 *
+		 * This will not flush when switching into kernel threads.
+		 * But it would flush when switching into idle and back
+		 *
+		 * It might be useful to have a one-off cache here
+		 * to also not flush the idle case, but we would need some
+		 * kind of stable sequence number to remember the previous mm.
 		 */
-		indirect_branch_prediction_barrier();
+		if (tsk && tsk->mm && get_dumpable(tsk->mm) != SUID_DUMP_USER)
+			indirect_branch_prediction_barrier();
 
 		if (IS_ENABLED(CONFIG_VMAP_STACK)) {
 			/*
-- 
2.7.4


* [RFC 05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-20 19:22 [RFC 00/10] Speculation Control feature support KarimAllah Ahmed
                   ` (3 preceding siblings ...)
  2018-01-20 19:22 ` [RFC 04/10] x86/mm: Only flush indirect branches when switching into non dumpable process KarimAllah Ahmed
@ 2018-01-20 19:22 ` KarimAllah Ahmed
  2018-01-21 14:31   ` Thomas Gleixner
                     ` (2 more replies)
  2018-01-20 19:22 ` [RFC 06/10] x86/speculation: Add inlines to control Indirect Branch Speculation KarimAllah Ahmed
                   ` (5 subsequent siblings)
  10 siblings, 3 replies; 143+ messages in thread
From: KarimAllah Ahmed @ 2018-01-20 19:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: KarimAllah Ahmed, Andi Kleen, Andrea Arcangeli, Andy Lutomirski,
	Arjan van de Ven, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, David Woodhouse, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Linus Torvalds,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm, x86

From: David Woodhouse <dwmw@amazon.co.uk>

Not functional yet; just add the handling for it in the Spectre v2
mitigation selection, and the X86_FEATURE_IBRS flag which will control
the code to be added in later patches.

Also remove the #ifdef CONFIG_RETPOLINE from around the RSB-stuffing; IBRS
mode will want that too.

For now we are auto-selecting IBRS on Skylake. We will probably end up
changing that but for now let's default to the safest option.
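
The auto-selection order described above can be modelled as a pure function.
This is a simplified sketch of the AUTO/FORCE path in the switch block below,
not the kernel code: it folds the "minimal" and "full" retpoline variants
together, ignores the explicit ibrs/retpoline command-line choices, and all
type and field names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum mode_sketch {
	MODE_NONE,
	MODE_IBRS,
	MODE_RETPOLINE_AMD,
	MODE_RETPOLINE_GENERIC,
};

/* Hypothetical summary of the inputs the selection logic looks at. */
struct cpu_sketch {
	bool has_spec_ctrl;		/* X86_FEATURE_SPEC_CTRL */
	bool skylake_era;		/* is_skylake_era() */
	bool is_amd;			/* vendor == X86_VENDOR_AMD */
	bool lfence_serializing;	/* X86_FEATURE_LFENCE_RDTSC */
	bool retpoline_built;		/* CONFIG_RETPOLINE */
	bool retpoline_compiler;	/* retp_compiler() */
};

/* Model of the AUTO path: IBRS first, then AMD retpoline, then generic. */
static enum mode_sketch auto_select_sketch(const struct cpu_sketch *c)
{
	if (c->has_spec_ctrl && (c->skylake_era || !c->retpoline_compiler))
		return MODE_IBRS;
	if (c->retpoline_built && c->is_amd && c->lfence_serializing)
		return MODE_RETPOLINE_AMD;
	if (c->retpoline_built)
		return MODE_RETPOLINE_GENERIC;
	return MODE_NONE;
}

static void auto_select_selftest(void)
{
	struct cpu_sketch skylake = { .has_spec_ctrl = true, .skylake_era = true,
				      .retpoline_built = true,
				      .retpoline_compiler = true };
	struct cpu_sketch older_intel = { .has_spec_ctrl = true,
					  .retpoline_built = true,
					  .retpoline_compiler = true };
	struct cpu_sketch amd = { .is_amd = true, .lfence_serializing = true,
				  .retpoline_built = true,
				  .retpoline_compiler = true };
	struct cpu_sketch nothing = { 0 };

	assert(auto_select_sketch(&skylake) == MODE_IBRS);
	assert(auto_select_sketch(&older_intel) == MODE_RETPOLINE_GENERIC);
	assert(auto_select_sketch(&amd) == MODE_RETPOLINE_AMD);
	assert(auto_select_sketch(&nothing) == MODE_NONE);
}
```

In this model a Skylake-era CPU with the microcode feature always gets IBRS,
while other CPUs get IBRS only when no retpoline-capable compiler was used.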

XX: Do we want a microcode blacklist?

[karahmed: simplify the switch block and get rid of all the magic]

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
---
 Documentation/admin-guide/kernel-parameters.txt |   1 +
 arch/x86/include/asm/cpufeatures.h              |   1 +
 arch/x86/include/asm/nospec-branch.h            |   2 -
 arch/x86/kernel/cpu/bugs.c                      | 108 +++++++++++++++---------
 4 files changed, 68 insertions(+), 44 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 8122b5f..e597650 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3932,6 +3932,7 @@
 			retpoline	  - replace indirect branches
 			retpoline,generic - google's original retpoline
 			retpoline,amd     - AMD-specific minimal thunk
+			ibrs		  - Intel: Indirect Branch Restricted Speculation
 
 			Not specifying this option is equivalent to
 			spectre_v2=auto.
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 8ec9588..ae86ad9 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -211,6 +211,7 @@
 #define X86_FEATURE_AMD_PRED_CMD	( 7*32+17) /* Prediction Command MSR (AMD) */
 #define X86_FEATURE_MBA			( 7*32+18) /* Memory Bandwidth Allocation */
 #define X86_FEATURE_RSB_CTXSW		( 7*32+19) /* Fill RSB on context switches */
+#define X86_FEATURE_IBRS		( 7*32+21) /* Use IBRS for Spectre v2 safety */
 
 /* Virtualization flags: Linux defined, word 8 */
 #define X86_FEATURE_TPR_SHADOW		( 8*32+ 0) /* Intel TPR Shadow */
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index c333c95..8759449 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -205,7 +205,6 @@ extern char __indirect_thunk_end[];
  */
 static inline void vmexit_fill_RSB(void)
 {
-#ifdef CONFIG_RETPOLINE
 	unsigned long loops;
 
 	asm volatile (ANNOTATE_NOSPEC_ALTERNATIVE
@@ -215,7 +214,6 @@ static inline void vmexit_fill_RSB(void)
 		      "910:"
 		      : "=r" (loops), ASM_CALL_CONSTRAINT
 		      : : "memory" );
-#endif
 }
 
 static inline void indirect_branch_prediction_barrier(void)
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 96548ff..1d5e12f 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -79,6 +79,7 @@ enum spectre_v2_mitigation_cmd {
 	SPECTRE_V2_CMD_RETPOLINE,
 	SPECTRE_V2_CMD_RETPOLINE_GENERIC,
 	SPECTRE_V2_CMD_RETPOLINE_AMD,
+	SPECTRE_V2_CMD_IBRS,
 };
 
 static const char *spectre_v2_strings[] = {
@@ -87,6 +88,7 @@ static const char *spectre_v2_strings[] = {
 	[SPECTRE_V2_RETPOLINE_MINIMAL_AMD]	= "Vulnerable: Minimal AMD ASM retpoline",
 	[SPECTRE_V2_RETPOLINE_GENERIC]		= "Mitigation: Full generic retpoline",
 	[SPECTRE_V2_RETPOLINE_AMD]		= "Mitigation: Full AMD retpoline",
+	[SPECTRE_V2_IBRS]			= "Mitigation: Indirect Branch Restricted Speculation",
 };
 
 #undef pr_fmt
@@ -132,9 +134,17 @@ static enum spectre_v2_mitigation_cmd __init spectre_v2_parse_cmdline(void)
 			spec2_print_if_secure("force enabled on command line.");
 			return SPECTRE_V2_CMD_FORCE;
 		} else if (match_option(arg, ret, "retpoline")) {
+			if (!IS_ENABLED(CONFIG_RETPOLINE)) {
+				pr_err("retpoline selected but not compiled in. Switching to AUTO select\n");
+				return SPECTRE_V2_CMD_AUTO;
+			}
 			spec2_print_if_insecure("retpoline selected on command line.");
 			return SPECTRE_V2_CMD_RETPOLINE;
 		} else if (match_option(arg, ret, "retpoline,amd")) {
+			if (!IS_ENABLED(CONFIG_RETPOLINE)) {
+				pr_err("retpoline,amd selected but not compiled in. Switching to AUTO select\n");
+				return SPECTRE_V2_CMD_AUTO;
+			}
 			if (boot_cpu_data.x86_vendor != X86_VENDOR_AMD) {
 				pr_err("retpoline,amd selected but CPU is not AMD. Switching to AUTO select\n");
 				return SPECTRE_V2_CMD_AUTO;
@@ -142,8 +152,19 @@ static enum spectre_v2_mitigation_cmd __init spectre_v2_parse_cmdline(void)
 			spec2_print_if_insecure("AMD retpoline selected on command line.");
 			return SPECTRE_V2_CMD_RETPOLINE_AMD;
 		} else if (match_option(arg, ret, "retpoline,generic")) {
+			if (!IS_ENABLED(CONFIG_RETPOLINE)) {
+				pr_err("retpoline,generic selected but not compiled in. Switching to AUTO select\n");
+				return SPECTRE_V2_CMD_AUTO;
+			}
 			spec2_print_if_insecure("generic retpoline selected on command line.");
 			return SPECTRE_V2_CMD_RETPOLINE_GENERIC;
+		} else if (match_option(arg, ret, "ibrs")) {
+			if (!boot_cpu_has(X86_FEATURE_SPEC_CTRL)) {
+				pr_err("IBRS selected but no CPU support. Switching to AUTO select\n");
+				return SPECTRE_V2_CMD_AUTO;
+			}
+			spec2_print_if_insecure("IBRS seleted on command line.");
+			return SPECTRE_V2_CMD_IBRS;
 		} else if (match_option(arg, ret, "auto")) {
 			return SPECTRE_V2_CMD_AUTO;
 		}
@@ -156,7 +177,7 @@ static enum spectre_v2_mitigation_cmd __init spectre_v2_parse_cmdline(void)
 	return SPECTRE_V2_CMD_NONE;
 }
 
-/* Check for Skylake-like CPUs (for RSB handling) */
+/* Check for Skylake-like CPUs (for RSB and IBRS handling) */
 static bool __init is_skylake_era(void)
 {
 	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
@@ -178,55 +199,58 @@ static void __init spectre_v2_select_mitigation(void)
 	enum spectre_v2_mitigation_cmd cmd = spectre_v2_parse_cmdline();
 	enum spectre_v2_mitigation mode = SPECTRE_V2_NONE;
 
-	/*
-	 * If the CPU is not affected and the command line mode is NONE or AUTO
-	 * then nothing to do.
-	 */
-	if (!boot_cpu_has_bug(X86_BUG_SPECTRE_V2) &&
-	    (cmd == SPECTRE_V2_CMD_NONE || cmd == SPECTRE_V2_CMD_AUTO))
-		return;
-
 	switch (cmd) {
 	case SPECTRE_V2_CMD_NONE:
+		if (boot_cpu_has_bug(X86_BUG_SPECTRE_V2))
+			pr_err("kernel not compiled with retpoline; no mitigation available!");
 		return;
-
-	case SPECTRE_V2_CMD_FORCE:
-		/* FALLTRHU */
-	case SPECTRE_V2_CMD_AUTO:
-		goto retpoline_auto;
-
-	case SPECTRE_V2_CMD_RETPOLINE_AMD:
-		if (IS_ENABLED(CONFIG_RETPOLINE))
-			goto retpoline_amd;
-		break;
-	case SPECTRE_V2_CMD_RETPOLINE_GENERIC:
-		if (IS_ENABLED(CONFIG_RETPOLINE))
-			goto retpoline_generic;
+	case SPECTRE_V2_CMD_IBRS:
+		mode = SPECTRE_V2_IBRS;
+		setup_force_cpu_cap(X86_FEATURE_IBRS);
 		break;
+	case SPECTRE_V2_CMD_AUTO:
+		if (!boot_cpu_has_bug(X86_BUG_SPECTRE_V2))
+			return;
+		/* Fall through */
+	case SPECTRE_V2_CMD_FORCE:
+		/*
+		 * If we have IBRS support, and either Skylake or !RETPOLINE,
+		 * then that's what we do.
+		 */
+		if (boot_cpu_has(X86_FEATURE_SPEC_CTRL) &&
+		    (is_skylake_era() || !retp_compiler())) {
+			mode = SPECTRE_V2_IBRS;
+			setup_force_cpu_cap(X86_FEATURE_IBRS);
+			break;
+		}
+		/* Fall through */
 	case SPECTRE_V2_CMD_RETPOLINE:
-		if (IS_ENABLED(CONFIG_RETPOLINE))
-			goto retpoline_auto;
-		break;
-	}
-	pr_err("kernel not compiled with retpoline; no mitigation available!");
-	return;
+	case SPECTRE_V2_CMD_RETPOLINE_AMD:
+		if (IS_ENABLED(CONFIG_RETPOLINE) &&
+		    boot_cpu_data.x86_vendor == X86_VENDOR_AMD) {
+			if (boot_cpu_has(X86_FEATURE_LFENCE_RDTSC)) {
+				mode = retp_compiler() ? SPECTRE_V2_RETPOLINE_AMD :
+							 SPECTRE_V2_RETPOLINE_MINIMAL_AMD;
+				setup_force_cpu_cap(X86_FEATURE_RETPOLINE_AMD);
+				setup_force_cpu_cap(X86_FEATURE_RETPOLINE);
+				break;
+			}
 
-retpoline_auto:
-	if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) {
-	retpoline_amd:
-		if (!boot_cpu_has(X86_FEATURE_LFENCE_RDTSC)) {
 			pr_err("LFENCE not serializing. Switching to generic retpoline\n");
-			goto retpoline_generic;
 		}
-		mode = retp_compiler() ? SPECTRE_V2_RETPOLINE_AMD :
-					 SPECTRE_V2_RETPOLINE_MINIMAL_AMD;
-		setup_force_cpu_cap(X86_FEATURE_RETPOLINE_AMD);
-		setup_force_cpu_cap(X86_FEATURE_RETPOLINE);
-	} else {
-	retpoline_generic:
-		mode = retp_compiler() ? SPECTRE_V2_RETPOLINE_GENERIC :
-					 SPECTRE_V2_RETPOLINE_MINIMAL;
-		setup_force_cpu_cap(X86_FEATURE_RETPOLINE);
+		/* Fall through */
+	case SPECTRE_V2_CMD_RETPOLINE_GENERIC:
+		if (IS_ENABLED(CONFIG_RETPOLINE)) {
+			mode = retp_compiler() ? SPECTRE_V2_RETPOLINE_GENERIC :
+						 SPECTRE_V2_RETPOLINE_MINIMAL;
+			setup_force_cpu_cap(X86_FEATURE_RETPOLINE);
+			break;
+		}
+		/* Fall through */
+	default:
+		if (boot_cpu_has_bug(X86_BUG_SPECTRE_V2))
+			pr_err("kernel not compiled with retpoline; no mitigation available!");
+		return;
 	}
 
 	spectre_v2_enabled = mode;
-- 
2.7.4


* [RFC 06/10] x86/speculation: Add inlines to control Indirect Branch Speculation
  2018-01-20 19:22 [RFC 00/10] Speculation Control feature support KarimAllah Ahmed
                   ` (4 preceding siblings ...)
  2018-01-20 19:22 ` [RFC 05/10] x86/speculation: Add basic IBRS support infrastructure KarimAllah Ahmed
@ 2018-01-20 19:22 ` KarimAllah Ahmed
  2018-01-20 19:22 ` [RFC 07/10] x86: Simplify spectre_v2 command line parsing KarimAllah Ahmed
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 143+ messages in thread
From: KarimAllah Ahmed @ 2018-01-20 19:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: KarimAllah Ahmed, Andi Kleen, Andrea Arcangeli, Andy Lutomirski,
	Arjan van de Ven, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, David Woodhouse, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Linus Torvalds,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm, x86

From: Thomas Gleixner <tglx@linutronix.de>

XX: I am utterly unconvinced that having "friendly, self-explanatory"
    names for the IBRS-frobbing inlines is useful. There be dragons
    here for anyone who isn't intimately familiar with what's going
    on, and it's almost better to just call it IBRS, put a reference
    to the spec, and have a clear "you must be →this← tall to ride."

[karahmed: switch to using ALTERNATIVES instead of static_cpu_has]
[dwmw2: wrmsr args inside the ALTERNATIVE again, bikeshed naming]

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/include/asm/nospec-branch.h | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index 8759449..5be3443 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -232,5 +232,41 @@ static inline void indirect_branch_prediction_barrier(void)
 		     : "memory");
 }
 
+/*
+ * This also performs a barrier, and setting it again when it was already
+ * set is NOT a no-op.
+ */
+static inline void restrict_branch_speculation(void)
+{
+	unsigned long ax, cx, dx;
+
+	asm volatile(ALTERNATIVE("",
+				 "movl %[msr], %%ecx\n\t"
+				 "movl %[val], %%eax\n\t"
+				 "movl $0, %%edx\n\t"
+				 "wrmsr",
+				 X86_FEATURE_IBRS)
+		     : "=a" (ax), "=c" (cx), "=d" (dx)
+		     : [msr] "i" (MSR_IA32_SPEC_CTRL),
+		       [val] "i" (SPEC_CTRL_IBRS)
+		     : "memory");
+}
+
+static inline void unrestrict_branch_speculation(void)
+{
+	unsigned long ax, cx, dx;
+
+	asm volatile(ALTERNATIVE("",
+				 "movl %[msr], %%ecx\n\t"
+				 "movl %[val], %%eax\n\t"
+				 "movl $0, %%edx\n\t"
+				 "wrmsr",
+				 X86_FEATURE_IBRS)
+		     : "=a" (ax), "=c" (cx), "=d" (dx)
+		     : [msr] "i" (MSR_IA32_SPEC_CTRL),
+		       [val] "i" (0)
+		     : "memory");
+}
+
 #endif /* __ASSEMBLY__ */
 #endif /* __NOSPEC_BRANCH_H__ */
-- 
2.7.4


* [RFC 07/10] x86: Simplify spectre_v2 command line parsing
  2018-01-20 19:22 [RFC 00/10] Speculation Control feature support KarimAllah Ahmed
                   ` (5 preceding siblings ...)
  2018-01-20 19:22 ` [RFC 06/10] x86/speculation: Add inlines to control Indirect Branch Speculation KarimAllah Ahmed
@ 2018-01-20 19:22 ` KarimAllah Ahmed
  2018-01-20 19:22 ` [RFC 08/10] x86/idle: Control Indirect Branch Speculation in idle KarimAllah Ahmed
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 143+ messages in thread
From: KarimAllah Ahmed @ 2018-01-20 19:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: KarimAllah Ahmed, Andi Kleen, Andrea Arcangeli, Andy Lutomirski,
	Arjan van de Ven, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, David Woodhouse, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Linus Torvalds,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm, x86

Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
---
 arch/x86/kernel/cpu/bugs.c | 106 +++++++++++++++++++++++++--------------------
 1 file changed, 58 insertions(+), 48 deletions(-)

diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 1d5e12f..349c7f4 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -99,13 +99,13 @@ static enum spectre_v2_mitigation spectre_v2_enabled = SPECTRE_V2_NONE;
 static void __init spec2_print_if_insecure(const char *reason)
 {
 	if (boot_cpu_has_bug(X86_BUG_SPECTRE_V2))
-		pr_info("%s\n", reason);
+		pr_info("%s selected on command line.\n", reason);
 }
 
 static void __init spec2_print_if_secure(const char *reason)
 {
 	if (!boot_cpu_has_bug(X86_BUG_SPECTRE_V2))
-		pr_info("%s\n", reason);
+		pr_info("%s selected on command line.\n", reason);
 }
 
 static inline bool retp_compiler(void)
@@ -120,61 +120,71 @@ static inline bool match_option(const char *arg, int arglen, const char *opt)
 	return len == arglen && !strncmp(arg, opt, len);
 }
 
+static struct {
+	char *option;
+	enum spectre_v2_mitigation_cmd cmd;
+	bool secure;
+} mitigation_options[] = {
+	{ "off",               SPECTRE_V2_CMD_NONE,              false },
+	{ "on",                SPECTRE_V2_CMD_FORCE,             true },
+	{ "retpoline",         SPECTRE_V2_CMD_RETPOLINE,         false },
+	{ "retpoline,amd",     SPECTRE_V2_CMD_RETPOLINE_AMD,     false },
+	{ "retpoline,generic", SPECTRE_V2_CMD_RETPOLINE_GENERIC, false },
+	{ "ibrs",              SPECTRE_V2_CMD_IBRS,              false },
+	{ "auto",              SPECTRE_V2_CMD_AUTO,              false },
+};
+
+static const int mitigation_options_count = sizeof(mitigation_options) /
+					    sizeof(mitigation_options[0]);
+
 static enum spectre_v2_mitigation_cmd __init spectre_v2_parse_cmdline(void)
 {
 	char arg[20];
-	int ret;
+	int ret, i;
+	enum spectre_v2_mitigation_cmd cmd = SPECTRE_V2_CMD_AUTO;
+
+	if (cmdline_find_option_bool(boot_command_line, "nospectre_v2"))
+		return SPECTRE_V2_CMD_NONE;
 
 	ret = cmdline_find_option(boot_command_line, "spectre_v2", arg,
 				  sizeof(arg));
-	if (ret > 0)  {
-		if (match_option(arg, ret, "off")) {
-			goto disable;
-		} else if (match_option(arg, ret, "on")) {
-			spec2_print_if_secure("force enabled on command line.");
-			return SPECTRE_V2_CMD_FORCE;
-		} else if (match_option(arg, ret, "retpoline")) {
-			if (!IS_ENABLED(CONFIG_RETPOLINE)) {
-				pr_err("retpoline selected but not compiled in. Switching to AUTO select\n");
-				return SPECTRE_V2_CMD_AUTO;
-			}
-			spec2_print_if_insecure("retpoline selected on command line.");
-			return SPECTRE_V2_CMD_RETPOLINE;
-		} else if (match_option(arg, ret, "retpoline,amd")) {
-			if (!IS_ENABLED(CONFIG_RETPOLINE)) {
-				pr_err("retpoline,amd selected but not compiled in. Switching to AUTO select\n");
-				return SPECTRE_V2_CMD_AUTO;
-			}
-			if (boot_cpu_data.x86_vendor != X86_VENDOR_AMD) {
-				pr_err("retpoline,amd selected but CPU is not AMD. Switching to AUTO select\n");
-				return SPECTRE_V2_CMD_AUTO;
-			}
-			spec2_print_if_insecure("AMD retpoline selected on command line.");
-			return SPECTRE_V2_CMD_RETPOLINE_AMD;
-		} else if (match_option(arg, ret, "retpoline,generic")) {
-			if (!IS_ENABLED(CONFIG_RETPOLINE)) {
-				pr_err("retpoline,generic selected but not compiled in. Switching to AUTO select\n");
-				return SPECTRE_V2_CMD_AUTO;
-			}
-			spec2_print_if_insecure("generic retpoline selected on command line.");
-			return SPECTRE_V2_CMD_RETPOLINE_GENERIC;
-		} else if (match_option(arg, ret, "ibrs")) {
-			if (!boot_cpu_has(X86_FEATURE_SPEC_CTRL)) {
-				pr_err("IBRS selected but no CPU support. Switching to AUTO select\n");
-				return SPECTRE_V2_CMD_AUTO;
-			}
-			spec2_print_if_insecure("IBRS seleted on command line.");
-			return SPECTRE_V2_CMD_IBRS;
-		} else if (match_option(arg, ret, "auto")) {
-			return SPECTRE_V2_CMD_AUTO;
-		}
+	if (ret < 0)
+		return SPECTRE_V2_CMD_AUTO;
+
+	for (i = 0; i < mitigation_options_count; i++) {
+		if (!match_option(arg, ret, mitigation_options[i].option))
+			continue;
+		cmd = mitigation_options[i].cmd;
+		break;
 	}
 
-	if (!cmdline_find_option_bool(boot_command_line, "nospectre_v2"))
+	if (i >= mitigation_options_count) {
+		pr_err("unknown option (%s). Switching to AUTO select\n",
+		       arg);
 		return SPECTRE_V2_CMD_AUTO;
-disable:
-	spec2_print_if_insecure("disabled on command line.");
-	return SPECTRE_V2_CMD_NONE;
+	}
+
+	if ((cmd == SPECTRE_V2_CMD_RETPOLINE ||
+	     cmd == SPECTRE_V2_CMD_RETPOLINE_AMD ||
+	     cmd == SPECTRE_V2_CMD_RETPOLINE_GENERIC) &&
+	    !IS_ENABLED(CONFIG_RETPOLINE)) {
+			pr_err("%s selected but not compiled in. Switching to AUTO select\n",
+			       mitigation_options[i].option);
+			return SPECTRE_V2_CMD_AUTO;
+	}
+
+	if (cmd == SPECTRE_V2_CMD_RETPOLINE_AMD &&
+	    boot_cpu_data.x86_vendor != X86_VENDOR_AMD) {
+			pr_err("retpoline,amd selected but CPU is not AMD. Switching to AUTO select\n");
+			return SPECTRE_V2_CMD_AUTO;
+	}
+
+	if (mitigation_options[i].secure)
+		spec2_print_if_secure(mitigation_options[i].option);
+	else
+		spec2_print_if_insecure(mitigation_options[i].option);
+
+	return cmd;
 }
 
 /* Check for Skylake-like CPUs (for RSB and IBRS handling) */
-- 
2.7.4
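
For readers skimming the diff above, the table-driven lookup can be sketched
in plain user-space C. This is only an illustration, not the kernel code: the
enum, the struct, and the names lookup_option() and parse_mitigation() are
hypothetical, and only a subset of the options is listed. The point it shows
is that when nothing matches there is no table entry to report, so a
diagnostic has to print the user-supplied string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel's spectre_v2_mitigation_cmd values. */
enum mitigation_cmd {
	CMD_NONE,
	CMD_AUTO,
	CMD_FORCE,
	CMD_RETPOLINE,
	CMD_IBRS,
};

struct mitigation_option {
	const char *option;
	enum mitigation_cmd cmd;
};

/* One row per accepted command-line value (subset, for illustration). */
static const struct mitigation_option options[] = {
	{ "off",       CMD_NONE },
	{ "on",        CMD_FORCE },
	{ "retpoline", CMD_RETPOLINE },
	{ "ibrs",      CMD_IBRS },
	{ "auto",      CMD_AUTO },
};

#define OPTION_COUNT (sizeof(options) / sizeof(options[0]))

/*
 * Return the table entry matching arg, or NULL if the string is unknown.
 * On NULL the caller has no entry to name, only arg itself.
 */
static const struct mitigation_option *lookup_option(const char *arg)
{
	size_t i;

	assert(arg != NULL);

	for (i = 0; i < OPTION_COUNT; i++) {
		if (strcmp(arg, options[i].option) == 0)
			return &options[i];
	}
	return NULL;
}

/* Map a command-line value to a command, falling back to CMD_AUTO. */
static enum mitigation_cmd parse_mitigation(const char *arg)
{
	const struct mitigation_option *opt = lookup_option(arg);

	return opt ? opt->cmd : CMD_AUTO;
}
```

Calling parse_mitigation("ibrs") yields CMD_IBRS, while an unrecognised
string such as "bogus" falls back to CMD_AUTO. Returning a pointer instead of
an index avoids any temptation to dereference the table after a failed search.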

^ permalink raw reply related	[flat|nested] 143+ messages in thread

* [RFC 08/10] x86/idle: Control Indirect Branch Speculation in idle
  2018-01-20 19:22 [RFC 00/10] Speculation Control feature support KarimAllah Ahmed
                   ` (6 preceding siblings ...)
  2018-01-20 19:22 ` [RFC 07/10] x86: Simplify spectre_v2 command line parsing KarimAllah Ahmed
@ 2018-01-20 19:22 ` KarimAllah Ahmed
  2018-01-20 19:23 ` [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation KarimAllah Ahmed
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 143+ messages in thread
From: KarimAllah Ahmed @ 2018-01-20 19:22 UTC (permalink / raw)
  To: linux-kernel
  Cc: KarimAllah Ahmed, Andi Kleen, Andrea Arcangeli, Andy Lutomirski,
	Arjan van de Ven, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, David Woodhouse, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Linus Torvalds,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm, x86

From: Thomas Gleixner <tglx@linutronix.de>

Indirect Branch Speculation (IBS) is controlled per physical core. If one
thread disables it then it's disabled for the core. If a thread enters idle
it makes sense to reenable IBS so the sibling thread can run with full
speculation enabled in user space.

This only makes sense in mwait_idle_with_hints(), because mwait_idle() can
serve an interrupt immediately, before speculation can be stopped again. SKL,
which requires IBRS, should use mwait_idle_with_hints(), so this is a
non-issue and in the worst case a missed optimization.
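
The per-core sharing described above can be modelled with a toy user-space
sketch. It assumes, as the changelog states, that a core is restricted as
soon as any one of its threads has restriction set; struct core and the
helper names are made up for illustration and do not exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

#define THREADS_PER_CORE 2

/* Toy model: each hardware thread records whether it asked for restriction. */
struct core {
	bool restricted[THREADS_PER_CORE];
};

/* Record the restriction state requested by one hardware thread. */
static void set_restricted(struct core *c, int thread, bool on)
{
	assert(thread >= 0 && thread < THREADS_PER_CORE);
	c->restricted[thread] = on;
}

/* The core speculates freely only if no sibling has restriction set. */
static bool core_restricted(const struct core *c)
{
	int i;

	for (i = 0; i < THREADS_PER_CORE; i++) {
		if (c->restricted[i])
			return true;
	}
	return false;
}
```

With thread 0 idling in the kernel and still restricted, core_restricted()
stays true even after thread 1 has returned to user space and cleared its own
flag. Only once the idle thread clears its flag as well does the sibling get
full speculation back, which is the optimization this patch is after.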

Originally-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/mwait.h | 14 ++++++++++++++
 arch/x86/kernel/process.c    | 14 ++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/arch/x86/include/asm/mwait.h b/arch/x86/include/asm/mwait.h
index 39a2fb2..f173072 100644
--- a/arch/x86/include/asm/mwait.h
+++ b/arch/x86/include/asm/mwait.h
@@ -6,6 +6,7 @@
 #include <linux/sched/idle.h>
 
 #include <asm/cpufeature.h>
+#include <asm/nospec-branch.h>
 
 #define MWAIT_SUBSTATE_MASK		0xf
 #define MWAIT_CSTATE_MASK		0xf
@@ -106,7 +107,20 @@ static inline void mwait_idle_with_hints(unsigned long eax, unsigned long ecx)
 			mb();
 		}
 
+		/*
+		 * Indirect Branch Speculation (IBS) is controlled per
+		 * physical core. If one thread disables it, then it's
+		 * disabled on all threads of the core. The kernel disables
+		 * it on entry from user space. Reenable it on the thread
+		 * which goes idle so the other thread has a chance to run
+		 * with full speculation enabled in userspace.
+		 */
+		unrestrict_branch_speculation();
 		__monitor((void *)&current_thread_info()->flags, 0, 0);
+		/*
+		 * Restrict IBS again to protect kernel execution.
+		 */
+		restrict_branch_speculation();
 		if (!need_resched())
 			__mwait(eax, ecx);
 	}
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 3cb2486..f941c5d 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -461,6 +461,20 @@ static __cpuidle void mwait_idle(void)
 			mb(); /* quirk */
 		}
 
+		/*
+		 * Indirect Branch Speculation (IBS) is controlled per
+		 * physical core. If one thread disables it, then it's
+		 * disabled on all threads of the core. The kernel disables
+		 * it on entry from user space. For __sti_mwait() it's
+		 * wrong to reenable it because an interrupt can be served
+		 * before speculation can be stopped again.
+		 *
+		 * To plug that hole the interrupt entry code would need to
+		 * save current state and restore. Not worth the trouble as
+		 * SKL should not use mwait_idle(). It should use
+		 * mwait_idle_with_hints() which can do speculation control
+		 * safely.
+		 */
 		__monitor((void *)&current_thread_info()->flags, 0, 0);
 		if (!need_resched())
 			__sti_mwait(0, 0);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 143+ messages in thread

* [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation
  2018-01-20 19:22 [RFC 00/10] Speculation Control feature support KarimAllah Ahmed
                   ` (7 preceding siblings ...)
  2018-01-20 19:22 ` [RFC 08/10] x86/idle: Control Indirect Branch Speculation in idle KarimAllah Ahmed
@ 2018-01-20 19:23 ` KarimAllah Ahmed
  2018-01-21 19:14   ` Andy Lutomirski
  2018-01-21 19:34   ` Linus Torvalds
  2018-01-20 19:23 ` [RFC 10/10] x86/enter: Use IBRS on syscall and interrupts KarimAllah Ahmed
  2018-01-21 14:02 ` [RFC 00/10] Speculation Control feature support Konrad Rzeszutek Wilk
  10 siblings, 2 replies; 143+ messages in thread
From: KarimAllah Ahmed @ 2018-01-20 19:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: KarimAllah Ahmed, Andi Kleen, Andrea Arcangeli, Andy Lutomirski,
	Arjan van de Ven, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, David Woodhouse, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Linus Torvalds,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm, x86,
	Arjan Van De Ven

From: Tim Chen <tim.c.chen@linux.intel.com>

Create macros to control Indirect Branch Speculation.

Name them so they reflect what they are actually doing.
The macros are used to restrict and unrestrict the indirect branch speculation.
They do not *disable* (or *enable*) indirect branch speculation. A trip back to
user-space after *restricting* speculation would still affect the BTB.

Quoting from a commit by Tim Chen:

"""
    If IBRS is set, near returns and near indirect jumps/calls will not allow
    their predicted target address to be controlled by code that executed in a
    less privileged prediction mode *BEFORE* the IBRS mode was last written with
    a value of 1 or on another logical processor so long as all Return Stack
    Buffer (RSB) entries from the previous less privileged prediction mode are
    overwritten.

    Thus a near indirect jump/call/return may be affected by code in a less
    privileged prediction mode that executed *AFTER* IBRS mode was last written
    with a value of 1.
"""
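
Each of the macros in the diff below comes down to a single WRMSR, which
takes the MSR index in %ecx and the 64-bit value split across %edx:%eax. A
small user-space sketch of that operand layout follows. The struct and
function names are hypothetical; the MSR index (0x48) and the IBRS bit
(bit 0) are the architectural values as I understand them, and are spelled
out here only because the defines come from an earlier patch in the series.

```c
#include <assert.h>
#include <stdint.h>

#define MSR_IA32_SPEC_CTRL 0x00000048u	/* architectural MSR index */
#define SPEC_CTRL_IBRS     (1u << 0)	/* IBRS is bit 0 of that MSR */

/* User-space stand-in for the register triple that WRMSR consumes. */
struct wrmsr_operands {
	uint32_t ecx;	/* MSR index */
	uint32_t edx;	/* value bits 63:32 */
	uint32_t eax;	/* value bits 31:0 */
};

/* Split a 64-bit MSR value the way the WRMSR_ASM macro loads it. */
static struct wrmsr_operands wrmsr_operands_for(uint32_t msr, uint64_t value)
{
	struct wrmsr_operands ops = {
		.ecx = msr,
		.edx = (uint32_t)(value >> 32),
		.eax = (uint32_t)value,
	};

	/* The two halves must recombine to the original value. */
	assert((((uint64_t)ops.edx << 32) | ops.eax) == value);
	return ops;
}

/* Operands corresponding to RESTRICT_IB_SPEC: set IBRS. */
static struct wrmsr_operands restrict_ib_spec_operands(void)
{
	return wrmsr_operands_for(MSR_IA32_SPEC_CTRL, SPEC_CTRL_IBRS);
}

/* Operands corresponding to UNRESTRICT_IB_SPEC: clear IBRS. */
static struct wrmsr_operands unrestrict_ib_spec_operands(void)
{
	return wrmsr_operands_for(MSR_IA32_SPEC_CTRL, 0);
}
```

This also makes it visible why the non-CLOBBER variants push and pop %rax,
%rcx and %rdx: all three registers are overwritten to issue the write.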

[ tglx: Changed macro names and rewrote changelog ]
[ karahmed: changed macro names *again* and rewrote changelog ]

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Ashok Raj <ashok.raj@intel.com>
Link: https://lkml.kernel.org/r/3aab341725ee6a9aafd3141387453b45d788d61a.1515542293.git.tim.c.chen@linux.intel.com
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/entry/calling.h | 73 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 73 insertions(+)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 3f48f69..5aafb51 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -6,6 +6,8 @@
 #include <asm/percpu.h>
 #include <asm/asm-offsets.h>
 #include <asm/processor-flags.h>
+#include <asm/msr-index.h>
+#include <asm/cpufeatures.h>
 
 /*
 
@@ -349,3 +351,74 @@ For 32-bit we have the following conventions - kernel is built with
 .Lafter_call_\@:
 #endif
 .endm
+
+/*
+ * IBRS related macros
+ */
+.macro PUSH_MSR_REGS
+	pushq	%rax
+	pushq	%rcx
+	pushq	%rdx
+.endm
+
+.macro POP_MSR_REGS
+	popq	%rdx
+	popq	%rcx
+	popq	%rax
+.endm
+
+.macro WRMSR_ASM msr_nr:req edx_val:req eax_val:req
+	movl	\msr_nr, %ecx
+	movl	\edx_val, %edx
+	movl	\eax_val, %eax
+	wrmsr
+.endm
+
+.macro RESTRICT_IB_SPEC
+	ALTERNATIVE "jmp .Lskip_\@", "", X86_FEATURE_IBRS
+	PUSH_MSR_REGS
+	WRMSR_ASM $MSR_IA32_SPEC_CTRL, $0, $SPEC_CTRL_IBRS
+	POP_MSR_REGS
+.Lskip_\@:
+.endm
+
+.macro UNRESTRICT_IB_SPEC
+	ALTERNATIVE "jmp .Lskip_\@", "", X86_FEATURE_IBRS
+	PUSH_MSR_REGS
+	WRMSR_ASM $MSR_IA32_SPEC_CTRL, $0, $0
+	POP_MSR_REGS
+.Lskip_\@:
+.endm
+
+.macro RESTRICT_IB_SPEC_CLOBBER
+	ALTERNATIVE "jmp .Lskip_\@", "", X86_FEATURE_IBRS
+	WRMSR_ASM $MSR_IA32_SPEC_CTRL, $0, $SPEC_CTRL_IBRS
+.Lskip_\@:
+.endm
+
+.macro UNRESTRICT_IB_SPEC_CLOBBER
+	ALTERNATIVE "jmp .Lskip_\@", "", X86_FEATURE_IBRS
+	WRMSR_ASM $MSR_IA32_SPEC_CTRL, $0, $0
+.Lskip_\@:
+.endm
+
+.macro RESTRICT_IB_SPEC_SAVE_AND_CLOBBER save_reg:req
+	ALTERNATIVE "jmp .Lskip_\@", "", X86_FEATURE_IBRS
+	movl	$MSR_IA32_SPEC_CTRL, %ecx
+	rdmsr
+	movl	%eax, \save_reg
+	movl	$0, %edx
+	movl	$SPEC_CTRL_IBRS, %eax
+	wrmsr
+.Lskip_\@:
+.endm
+
+.macro RESTORE_IB_SPEC_CLOBBER save_reg:req
+	ALTERNATIVE "jmp .Lskip_\@", "", X86_FEATURE_IBRS
+	/* Set IBRS to the value saved in the save_reg */
+	movl    $MSR_IA32_SPEC_CTRL, %ecx
+	movl    $0, %edx
+	movl    \save_reg, %eax
+	wrmsr
+.Lskip_\@:
+.endm
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 143+ messages in thread

* [RFC 10/10] x86/enter: Use IBRS on syscall and interrupts
  2018-01-20 19:22 [RFC 00/10] Speculation Control feature support KarimAllah Ahmed
                   ` (8 preceding siblings ...)
  2018-01-20 19:23 ` [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation KarimAllah Ahmed
@ 2018-01-20 19:23 ` KarimAllah Ahmed
  2018-01-21 13:50   ` Konrad Rzeszutek Wilk
  2018-01-21 14:02 ` [RFC 00/10] Speculation Control feature support Konrad Rzeszutek Wilk
  10 siblings, 1 reply; 143+ messages in thread
From: KarimAllah Ahmed @ 2018-01-20 19:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: KarimAllah Ahmed, Andi Kleen, Andrea Arcangeli, Andy Lutomirski,
	Arjan van de Ven, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, David Woodhouse, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Linus Torvalds,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm, x86,
	Arjan Van De Ven

From: Tim Chen <tim.c.chen@linux.intel.com>

Stop Indirect Branch Speculation on every user space to kernel space
transition and reenable it when returning to user space.

The NMI interrupt save/restore of IBRS state was based on Andrea
Arcangeli's implementation.  Here's an explanation by Dave Hansen on why we
save IBRS state for NMI.

The normal interrupt code uses the 'error_entry' path which uses the
Code Segment (CS) of the instruction that was interrupted to tell
whether it interrupted the kernel or userspace and thus has to switch
IBRS, or leave it alone.

The NMI code is different.  It uses 'paranoid_entry' because it can
interrupt the kernel while it is running with a userspace IBRS (and %GS
and CR3) value, but has a kernel CS.  If we used the same approach as
the normal interrupt code, we might do the following:

	SYSENTER_entry
<-------------- NMI HERE
	IBRS=1
		do_something()
	IBRS=0
	SYSRET

The NMI code might notice that we are running in the kernel and decide
that it is OK to skip the IBRS=1.  This would leave it running
unprotected with IBRS=0, which is bad.

However, if we unconditionally set IBRS=1, in the NMI, we might get the
following case:

	SYSENTER_entry
	IBRS=1
		do_something()
	IBRS=0
<-------------- NMI HERE (set IBRS=1)
	SYSRET

and we would return to userspace with IBRS=1.  Userspace would run
slowly until we entered and exited the kernel again.

Instead of those two approaches, we chose a third one where we simply
save the IBRS value in a scratch register (%r13) and then restore that
value, verbatim.
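
The three approaches compared above can be put side by side in a small
user-space model. It is a sketch under simplifying assumptions: IBRS is a
single boolean, the NMI is one function call, and the enum, struct and
run_nmi() names are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

enum nmi_policy {
	NMI_SKIP_IF_KERNEL_CS,	/* reuse the error_entry CS heuristic */
	NMI_ALWAYS_SET,		/* write IBRS=1 and never restore */
	NMI_SAVE_RESTORE,	/* save old value, set IBRS=1, restore verbatim */
};

struct nmi_result {
	bool ibrs_in_handler;	/* IBRS value while the handler runs */
	bool ibrs_after;	/* IBRS value once the NMI has returned */
};

/* Model one NMI that interrupts code running with the given state. */
static struct nmi_result run_nmi(enum nmi_policy policy, bool ibrs_before,
				 bool kernel_cs)
{
	struct nmi_result r;
	bool ibrs = ibrs_before;

	/* Entry: decide whether to write IBRS=1. */
	if (policy != NMI_SKIP_IF_KERNEL_CS || !kernel_cs)
		ibrs = true;
	r.ibrs_in_handler = ibrs;

	/* Exit: decide what value to leave behind. */
	switch (policy) {
	case NMI_SKIP_IF_KERNEL_CS:
		if (!kernel_cs)
			ibrs = false;	/* returning to user space */
		break;
	case NMI_ALWAYS_SET:
		break;			/* nothing is restored */
	case NMI_SAVE_RESTORE:
		assert(r.ibrs_in_handler);
		ibrs = ibrs_before;	/* restore the saved value verbatim */
		break;
	}
	r.ibrs_after = ibrs;
	return r;
}
```

Both problem windows in the diagrams correspond to kernel_cs = true with
ibrs_before = false. In the model, NMI_SKIP_IF_KERNEL_CS runs the handler
with IBRS clear, NMI_ALWAYS_SET leaves IBRS set on the way back to user
space, and NMI_SAVE_RESTORE protects the handler while leaving the
interrupted context's value untouched.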

[karahmed use the new SPEC_CTRL_IBRS defines]

Co-developed-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Ashok Raj <ashok.raj@intel.com>
Link: https://lkml.kernel.org/r/d5e4c03ec290c61dfbe5a769f7287817283fa6b7.1515542293.git.tim.c.chen@linux.intel.com
---
 arch/x86/entry/entry_64.S        | 35 ++++++++++++++++++++++++++++++++++-
 arch/x86/entry/entry_64_compat.S | 21 +++++++++++++++++++--
 2 files changed, 53 insertions(+), 3 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 63f4320..b3d90cf 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -171,6 +171,8 @@ ENTRY(entry_SYSCALL_64_trampoline)
 
 	/* Load the top of the task stack into RSP */
 	movq	CPU_ENTRY_AREA_tss + TSS_sp1 + CPU_ENTRY_AREA, %rsp
+	/* Restrict indirect branch speculation */
+	RESTRICT_IB_SPEC
 
 	/* Start building the simulated IRET frame. */
 	pushq	$__USER_DS			/* pt_regs->ss */
@@ -214,6 +216,8 @@ ENTRY(entry_SYSCALL_64)
 	 */
 	movq	%rsp, PER_CPU_VAR(rsp_scratch)
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
+	/* Restrict Indirect Branch Speculation */
+	RESTRICT_IB_SPEC
 
 	TRACE_IRQS_OFF
 
@@ -409,6 +413,8 @@ syscall_return_via_sysret:
 	pushq	RSP-RDI(%rdi)	/* RSP */
 	pushq	(%rdi)		/* RDI */
 
+	/* Unrestrict Indirect Branch Speculation */
+	UNRESTRICT_IB_SPEC
 	/*
 	 * We are on the trampoline stack.  All regs except RDI are live.
 	 * We can do future final exit work right here.
@@ -757,11 +763,12 @@ GLOBAL(swapgs_restore_regs_and_return_to_usermode)
 	/* Push user RDI on the trampoline stack. */
 	pushq	(%rdi)
 
+	/* Unrestrict Indirect Branch Speculation */
+	UNRESTRICT_IB_SPEC
 	/*
 	 * We are on the trampoline stack.  All regs except RDI are live.
 	 * We can do future final exit work right here.
 	 */
-
 	SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi
 
 	/* Restore RDI. */
@@ -849,6 +856,13 @@ native_irq_return_ldt:
 	SWAPGS					/* to kernel GS */
 	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi	/* to kernel CR3 */
 
+	/*
+	 * There is no point in disabling Indirect Branch Speculation
+	 * here as this is going to return to user space immediately
+	 * after fixing ESPFIX stack.  There is no vulnerable code
+	 * to protect so spare two MSR writes.
+	 */
+
 	movq	PER_CPU_VAR(espfix_waddr), %rdi
 	movq	%rax, (0*8)(%rdi)		/* user RAX */
 	movq	(1*8)(%rsp), %rax		/* user RIP */
@@ -982,6 +996,8 @@ ENTRY(switch_to_thread_stack)
 	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
 	movq	%rsp, %rdi
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
+	/* Restrict Indirect Branch Speculation */
+	RESTRICT_IB_SPEC
 	UNWIND_HINT sp_offset=16 sp_reg=ORC_REG_DI
 
 	pushq	7*8(%rdi)		/* regs->ss */
@@ -1282,6 +1298,8 @@ ENTRY(paranoid_entry)
 
 1:
 	SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg=%rax save_reg=%r14
+	/* Restrict Indirect Branch speculation */
+	RESTRICT_IB_SPEC_SAVE_AND_CLOBBER save_reg=%r13d
 
 	ret
 END(paranoid_entry)
@@ -1305,6 +1323,8 @@ ENTRY(paranoid_exit)
 	testl	%ebx, %ebx			/* swapgs needed? */
 	jnz	.Lparanoid_exit_no_swapgs
 	TRACE_IRQS_IRETQ
+	/* Restore Indirect Branch Speculation to the previous state */
+	RESTORE_IB_SPEC_CLOBBER save_reg=%r13d
 	RESTORE_CR3	scratch_reg=%rbx save_reg=%r14
 	SWAPGS_UNSAFE_STACK
 	jmp	.Lparanoid_exit_restore
@@ -1335,6 +1355,8 @@ ENTRY(error_entry)
 	SWAPGS
 	/* We have user CR3.  Change to kernel CR3. */
 	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
+	/* Restrict Indirect Branch Speculation */
+	RESTRICT_IB_SPEC_CLOBBER
 
 .Lerror_entry_from_usermode_after_swapgs:
 	/* Put us onto the real thread stack. */
@@ -1382,6 +1404,8 @@ ENTRY(error_entry)
 	 */
 	SWAPGS
 	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
+	/* Restrict Indirect Branch Speculation */
+	RESTRICT_IB_SPEC_CLOBBER
 	jmp .Lerror_entry_done
 
 .Lbstep_iret:
@@ -1396,6 +1420,8 @@ ENTRY(error_entry)
 	 */
 	SWAPGS
 	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
+	/* Restrict Indirect Branch Speculation */
+	RESTRICT_IB_SPEC
 
 	/*
 	 * Pretend that the exception came from user mode: set up pt_regs
@@ -1497,6 +1523,10 @@ ENTRY(nmi)
 	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdx
 	movq	%rsp, %rdx
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
+
+	/* Restrict Indirect Branch Speculation */
+	RESTRICT_IB_SPEC
+
 	UNWIND_HINT_IRET_REGS base=%rdx offset=8
 	pushq	5*8(%rdx)	/* pt_regs->ss */
 	pushq	4*8(%rdx)	/* pt_regs->rsp */
@@ -1747,6 +1777,9 @@ end_repeat_nmi:
 	movq	$-1, %rsi
 	call	do_nmi
 
+	/* Restore Indirect Branch speculation to the previous state */
+	RESTORE_IB_SPEC_CLOBBER save_reg=%r13d
+
 	RESTORE_CR3 scratch_reg=%r15 save_reg=%r14
 
 	testl	%ebx, %ebx			/* swapgs needed? */
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index 98d5358..5b45d93 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -54,6 +54,8 @@ ENTRY(entry_SYSENTER_compat)
 	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
 
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
+	/* Restrict Indirect Branch Speculation */
+	RESTRICT_IB_SPEC
 
 	/*
 	 * User tracing code (ptrace or signal handlers) might assume that
@@ -224,12 +226,18 @@ GLOBAL(entry_SYSCALL_compat_after_hwframe)
 	pushq   $0			/* pt_regs->r14 = 0 */
 	pushq   $0			/* pt_regs->r15 = 0 */
 
-	/*
-	 * User mode is traced as though IRQs are on, and SYSENTER
+	/* Restrict Indirect Branch Speculation. All registers are saved already */
+	RESTRICT_IB_SPEC_CLOBBER
+
+	/* User mode is traced as though IRQs are on, and SYSENTER
 	 * turned them off.
 	 */
 	TRACE_IRQS_OFF
 
+	/*
+	 * We just saved %rdi so it is safe to clobber.  It is not
+	 * preserved during the C calls inside TRACE_IRQS_OFF anyway.
+	 */
 	movq	%rsp, %rdi
 	call	do_fast_syscall_32
 	/* XEN PV guests always use IRET path */
@@ -239,6 +247,15 @@ GLOBAL(entry_SYSCALL_compat_after_hwframe)
 	/* Opportunistic SYSRET */
 sysret32_from_system_call:
 	TRACE_IRQS_ON			/* User mode traces as IRQs on. */
+
+	/*
+	 * Unrestrict Indirect Branch Speculation. This is safe to do here
+	 * because there are no indirect branches between here and the
+	 * return to userspace (sysretl).
+	 * Clobber of %rax, %rcx, %rdx is OK before register restoring.
+	 */
+	UNRESTRICT_IB_SPEC_CLOBBER
+
 	movq	RBX(%rsp), %rbx		/* pt_regs->rbx */
 	movq	RBP(%rsp), %rbp		/* pt_regs->rbp */
 	movq	EFLAGS(%rsp), %r11	/* pt_regs->flags (in r11) */
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 143+ messages in thread

* Re: [RFC 02/10] x86/kvm: Add IBPB support
  2018-01-20 19:22 ` [RFC 02/10] x86/kvm: Add IBPB support KarimAllah Ahmed
@ 2018-01-20 20:18   ` Woodhouse, David
  2018-01-22 18:56   ` Jim Mattson
  1 sibling, 0 replies; 143+ messages in thread
From: Woodhouse, David @ 2018-01-20 20:18 UTC (permalink / raw)
  To: linux-kernel, Raslan, KarimAllah
  Cc: kvm, tim.c.chen, peterz, arjan, ashok.raj, arjan.van.de.ven, bp,
	torvalds, tglx, Janakarajan.Natarajan, ak, joro, dan.j.williams,
	x86, hpa, aarcange, mingo, luto, pbonzini, gregkh, dave.hansen,
	mhiramat, thomas.lendacky, asit.k.mallick, jun.nakajima, labbott,
	rkrcmar


[-- Attachment #1.1: Type: text/plain, Size: 753 bytes --]

On Sat, 2018-01-20 at 20:22 +0100, KarimAllah Ahmed wrote:
> 
> @@ -6791,6 +6792,9 @@ static __init int hardware_setup(void)
>                 kvm_tsc_scaling_ratio_frac_bits = 48;
>         }
>  
> +       if (boot_cpu_has(X86_FEATURE_SPEC_CTRL))
> +               vmx_disable_intercept_for_msr(MSR_IA32_PRED_CMD, false);
> +

I've updated that to allow X86_FEATURE_AMD_PRED_CMD too, since some
hypervisors may expose *only* that MSR to guests even on Intel
hardware. PRED_CMD is a lot easier to expose as it doesn't need
storage, live migration support, and all that crap.

Our shared tree at
http://git.infradead.org/linux-retpoline.git/shortlog/refs/heads/ibpb
updated accordingly.


[-- Attachment #1.2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5210 bytes --]

[-- Attachment #2.1: Type: text/plain, Size: 197 bytes --]




Amazon Web Services UK Limited. Registered in England and Wales with registration number 08650665 and which has its registered office at 60 Holborn Viaduct, London EC1A 2FD, United Kingdom.

[-- Attachment #2.2: Type: text/html, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 04/10] x86/mm: Only flush indirect branches when switching into non dumpable process
  2018-01-20 19:22 ` [RFC 04/10] x86/mm: Only flush indirect branches when switching into non dumpable process KarimAllah Ahmed
@ 2018-01-20 21:06   ` Woodhouse, David
  2018-01-22 18:29     ` Tim Chen
  2018-01-21 11:22   ` Peter Zijlstra
  1 sibling, 1 reply; 143+ messages in thread
From: Woodhouse, David @ 2018-01-20 21:06 UTC (permalink / raw)
  To: linux-kernel, Raslan, KarimAllah
  Cc: kvm, tim.c.chen, peterz, arjan, ashok.raj, bp, torvalds, tglx,
	Janakarajan.Natarajan, ak, joro, dan.j.williams, x86, hpa,
	aarcange, mingo, luto, pbonzini, gregkh, dave.hansen, mhiramat,
	thomas.lendacky, asit.k.mallick, jun.nakajima, labbott, rkrcmar


[-- Attachment #1.1: Type: text/plain, Size: 2300 bytes --]

On Sat, 2018-01-20 at 20:22 +0100, KarimAllah Ahmed wrote:
> From: Tim Chen <tim.c.chen@linux.intel.com>

I think this is probably From: Andi now rather than From: Tim?

We do need the series this far in order to have a full retpoline-based
mitigation, and I'd like to see that go in sooner rather than later.
There's a little more discussion to be had about the IBRS parts which
come later in the series (and the final one or two which weren't posted
yet).

I think this is the one patch of the "we want this now" IBPB set that
we expect serious debate on, which is why it's still a separate
"optimisation" patch on top of the previous one which just does IBPB
unconditionally.
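
To make the condition under debate concrete, here is a user-space sketch of
the predicate from the quoted hunk below. struct task, struct mm and
needs_ibpb() are hypothetical stand-ins, not kernel types; the SUID_DUMP_*
values are meant to mirror the kernel's constants.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Intended to mirror the kernel's dumpable states. */
#define SUID_DUMP_DISABLE 0	/* no core dump for this process */
#define SUID_DUMP_USER    1	/* default: dumpable as the user */
#define SUID_DUMP_ROOT    2	/* dumpable by root only */

struct mm {			/* hypothetical stand-in for mm_struct */
	int dumpable;
};

struct task {			/* hypothetical stand-in for task_struct */
	struct mm *mm;		/* NULL models a kernel thread */
};

/*
 * Decide whether to issue the barrier when switching into tsk: only for
 * user processes that are not dumpable as the user.
 */
static bool needs_ibpb(const struct task *tsk)
{
	if (!tsk || !tsk->mm)
		return false;	/* kernel threads are never flushed for */

	assert(tsk->mm->dumpable >= SUID_DUMP_DISABLE &&
	       tsk->mm->dumpable <= SUID_DUMP_ROOT);
	return tsk->mm->dumpable != SUID_DUMP_USER;
}
```

An ordinary process (SUID_DUMP_USER) and a kernel thread both skip the
barrier in this model, while a process that has marked itself non-dumpable,
as gpg is said to do, gets one on every switch into it.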


> Flush indirect branches when switching into a process that marked
> itself non dumpable.  This protects high value processes like gpg
> better, without having too high performance overhead.
> 
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
> ---
>  arch/x86/mm/tlb.c | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 304de7d..f64e80c 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -225,8 +225,19 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>  		 * Avoid user/user BTB poisoning by flushing the branch predictor
>  		 * when switching between processes. This stops one process from
>  		 * doing Spectre-v2 attacks on another.
> +		 *
> +		 * As an optimization: Flush indirect branches only when
> +		 * switching into processes that disable dumping.
> +		 *
> +		 * This will not flush when switching into kernel threads.
> +		 * But it would flush when switching into idle and back
> +		 *
> +		 * It might be useful to have a one-off cache here
> +		 * to also not flush the idle case, but we would need some
> +		 * kind of stable sequence number to remember the previous mm.
>  		 */
> -		indirect_branch_prediction_barrier();
> +		if (tsk && tsk->mm && get_dumpable(tsk->mm) != SUID_DUMP_USER)
> +			indirect_branch_prediction_barrier();
>  
>  		if (IS_ENABLED(CONFIG_VMAP_STACK)) {
>  			/*

[-- Attachment #1.2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5210 bytes --]

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 04/10] x86/mm: Only flush indirect branches when switching into non dumpable process
  2018-01-20 19:22 ` [RFC 04/10] x86/mm: Only flush indirect branches when switching into non dumpable process KarimAllah Ahmed
  2018-01-20 21:06   ` Woodhouse, David
@ 2018-01-21 11:22   ` Peter Zijlstra
  2018-01-21 12:04     ` David Woodhouse
                       ` (2 more replies)
  1 sibling, 3 replies; 143+ messages in thread
From: Peter Zijlstra @ 2018-01-21 11:22 UTC (permalink / raw)
  To: KarimAllah Ahmed
  Cc: linux-kernel, Andi Kleen, Andrea Arcangeli, Andy Lutomirski,
	Arjan van de Ven, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, David Woodhouse, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Linus Torvalds,
	Masami Hiramatsu, Paolo Bonzini, Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm, x86

On Sat, Jan 20, 2018 at 08:22:55PM +0100, KarimAllah Ahmed wrote:
> From: Tim Chen <tim.c.chen@linux.intel.com>
> 
> Flush indirect branches when switching into a process that marked
> itself non dumpable.  This protects high value processes like gpg
> better, without having too high performance overhead.

So if I understand it right, this is only needed if the 'other'
executable itself is susceptible to spectre. If say someone audited gpg
for spectre-v1 and built it with retpoline, it would be safe to not
issue the IBPB, right?

So would it make sense to provide an ELF flag / personality thing such
that userspace can indicate it's spectre-safe?

I realize that this is all future work, because so far auditing for v1
is a lot of pain (we need better tools), but would it be something that
makes sense in the longer term?

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 04/10] x86/mm: Only flush indirect branches when switching into non dumpable process
  2018-01-21 11:22   ` Peter Zijlstra
@ 2018-01-21 12:04     ` David Woodhouse
  2018-01-21 14:07       ` H.J. Lu
  2018-01-22 10:19       ` Peter Zijlstra
  2018-01-21 16:21     ` Ingo Molnar
  2018-01-29  6:35     ` Jon Masters
  2 siblings, 2 replies; 143+ messages in thread
From: David Woodhouse @ 2018-01-21 12:04 UTC (permalink / raw)
  To: Peter Zijlstra, hjl.tools
  Cc: KarimAllah Ahmed, linux-kernel, Andi Kleen, Andrea Arcangeli,
	Andy Lutomirski, Arjan van de Ven, Ashok Raj, Asit Mallick,
	Borislav Petkov, Dan Williams, Dave Hansen, David Woodhouse,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini,
	"Radim Krčmář",
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm, x86


> On Sat, Jan 20, 2018 at 08:22:55PM +0100, KarimAllah Ahmed wrote:
>> From: Tim Chen <tim.c.chen@linux.intel.com>
>>
>> Flush indirect branches when switching into a process that marked
>> itself non dumpable.  This protects high value processes like gpg
>> better, without having too high performance overhead.
>
> So if I understand it right, this is only needed if the 'other'
> executable itself is susceptible to spectre. If say someone audited gpg
> for spectre-v1 and built it with retpoline, it would be safe to not
> issue the IBPB, right?


Spectre V2 not v1. V1 is separate.
For V2 retpoline is enough... as long as all the libraries have it too.

> So would it make sense to provide an ELF flag / personality thing such
> that userspace can indicate it's spectre-safe?

Yes, Arjan and I were pondering that yesterday; it probably does make
sense. Also for allowing a return to userspace after vmexit, if the army
process itself is so marked.

> I realize that this is all future work, because so far auditing for v1
> is a lot of pain (we need better tools), but would it be something that
> makes sense in the longer term?

It's *only* retpoline so it isn't actually that much. Although I'm wary of
Cc'ing HJ on such thoughts because he seems to never sleep and always
respond promptly with "OK I did that... " :)

If we did systematically do this in userspace we'd probably want to do
external thunks there too, and a flag in the auxvec to tell it not to
bother (for IBRS_ALL etc.).

-- 
dwmw2

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 10/10] x86/enter: Use IBRS on syscall and interrupts
  2018-01-20 19:23 ` [RFC 10/10] x86/enter: Use IBRS on syscall and interrupts KarimAllah Ahmed
@ 2018-01-21 13:50   ` Konrad Rzeszutek Wilk
  2018-01-21 14:40     ` KarimAllah Ahmed
  2018-01-21 17:22     ` Dave Hansen
  0 siblings, 2 replies; 143+ messages in thread
From: Konrad Rzeszutek Wilk @ 2018-01-21 13:50 UTC (permalink / raw)
  To: KarimAllah Ahmed
  Cc: linux-kernel, Andi Kleen, Andrea Arcangeli, Andy Lutomirski,
	Arjan van de Ven, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, David Woodhouse, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Linus Torvalds,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm, x86,
	Arjan Van De Ven

On Sat, Jan 20, 2018 at 08:23:01PM +0100, KarimAllah Ahmed wrote:
> From: Tim Chen <tim.c.chen@linux.intel.com>
> 
> Stop Indirect Branch Speculation on every user space to kernel space
> transition and reenable it when returning to user space.

How about interrupts?

That is, should .macro interrupt have the same treatment?

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 00/10] Speculation Control feature support
  2018-01-20 19:22 [RFC 00/10] Speculation Control feature support KarimAllah Ahmed
                   ` (9 preceding siblings ...)
  2018-01-20 19:23 ` [RFC 10/10] x86/enter: Use IBRS on syscall and interrupts KarimAllah Ahmed
@ 2018-01-21 14:02 ` Konrad Rzeszutek Wilk
  2018-01-22 21:27   ` David Woodhouse
  10 siblings, 1 reply; 143+ messages in thread
From: Konrad Rzeszutek Wilk @ 2018-01-21 14:02 UTC (permalink / raw)
  To: KarimAllah Ahmed, Mihai Carabas
  Cc: linux-kernel, Andi Kleen, Andrea Arcangeli, Andy Lutomirski,
	Arjan van de Ven, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, David Woodhouse, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Linus Torvalds,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm, x86

On Sat, Jan 20, 2018 at 08:22:51PM +0100, KarimAllah Ahmed wrote:
> Start using the newly-added microcode features for speculation control on both
> Intel and AMD CPUs to protect against Spectre v2.

Thank you for posting these.
> 
> This patch series covers interrupts, system calls, context switching between
> processes, and context switching between VMs. It also exposes Indirect Branch
> Prediction Barrier MSR, aka IBPB MSR, to KVM guests.
> 
> TODO:
> 
> - Introduce a microcode blacklist to disable the feature for broken microcodes.
> - Restrict/Unrestrict the speculation (by toggling IBRS) around VMExit and
>   VMEnter for KVM and expose IBRS to guests.
> 

That depends on what we expose to the guest. That is, if the guest is not supposed to have
this exposed (say, CPUID bit 27 is not exposed), then trap on the MSR (and give a #GP)?

Mihai (CC'ed) is working on this; when it is ready, can he post a patch against this tree?
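
A rough sketch of the check being asked about here: whether a guest access to
the speculation-control MSRs should fault, given the CPUID bits the guest was
shown. The MSR numbers (0x48 for IA32_SPEC_CTRL, 0x49 for IA32_PRED_CMD) are
the architectural ones; the function name and the feature-bit parameter are
hypothetical, and this is not KVM code. The bit is passed in rather than
hard-coded, since the exact CPUID bit is only given as an example above:

```c
#include <stdbool.h>
#include <stdint.h>

#define MSR_IA32_SPEC_CTRL	0x00000048
#define MSR_IA32_PRED_CMD	0x00000049

/*
 * Hypothetical helper, not KVM code: decide whether a guest access to
 * 'msr' should be answered with #GP.
 *
 * guest_cpuid7_edx: CPUID.(EAX=7,ECX=0):EDX as exposed to the guest
 * feature_bit:      bit in that register enumerating the feature (0..31)
 */
bool guest_spec_msr_faults(uint32_t guest_cpuid7_edx, unsigned int feature_bit,
			   uint32_t msr)
{
	bool exposed = (guest_cpuid7_edx >> feature_bit) & 1u;

	/* Other MSRs are outside the scope of this sketch: report no fault. */
	if (msr != MSR_IA32_SPEC_CTRL && msr != MSR_IA32_PRED_CMD)
		return false;

	/* Feature hidden from the guest: trap and inject #GP. */
	return !exposed;
}
```

With the feature bit cleared in the guest's CPUID, both MSRs fault; with it
set, the access would be passed through or emulated instead.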

> Ashok Raj (1):
>   x86/kvm: Add IBPB support
> 
> David Woodhouse (1):
>   x86/speculation: Add basic IBRS support infrastructure
> 
> KarimAllah Ahmed (1):
>   x86: Simplify spectre_v2 command line parsing
> 
> Thomas Gleixner (4):
>   x86/speculation: Add basic support for IBPB
>   x86/speculation: Use Indirect Branch Prediction Barrier in context
>     switch
>   x86/speculation: Add inlines to control Indirect Branch Speculation
>   x86/idle: Control Indirect Branch Speculation in idle
> 
> Tim Chen (3):
>   x86/mm: Only flush indirect branches when switching into non dumpable
>     process
>   x86/enter: Create macros to restrict/unrestrict Indirect Branch
>     Speculation
>   x86/enter: Use IBRS on syscall and interrupts
> 
>  Documentation/admin-guide/kernel-parameters.txt |   1 +
>  arch/x86/entry/calling.h                        |  73 ++++++++++
>  arch/x86/entry/entry_64.S                       |  35 ++++-
>  arch/x86/entry/entry_64_compat.S                |  21 ++-
>  arch/x86/include/asm/cpufeatures.h              |   2 +
>  arch/x86/include/asm/mwait.h                    |  14 ++
>  arch/x86/include/asm/nospec-branch.h            |  54 ++++++-
>  arch/x86/kernel/cpu/bugs.c                      | 183 +++++++++++++++---------
>  arch/x86/kernel/process.c                       |  14 ++
>  arch/x86/kvm/svm.c                              |  14 ++
>  arch/x86/kvm/vmx.c                              |   4 +
>  arch/x86/mm/tlb.c                               |  21 ++-
>  12 files changed, 359 insertions(+), 77 deletions(-)
> 
> 
> Cc: Andi Kleen <ak@linux.intel.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Arjan van de Ven <arjan@linux.intel.com>
> Cc: Ashok Raj <ashok.raj@intel.com>
> Cc: Asit Mallick <asit.k.mallick@intel.com>
> Cc: Borislav Petkov <bp@suse.de>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: David Woodhouse <dwmw@amazon.co.uk>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Cc: H. Peter Anvin <hpa@zytor.com>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
> Cc: Joerg Roedel <joro@8bytes.org>
> Cc: Jun Nakajima <jun.nakajima@intel.com>
> Cc: Laura Abbott <labbott@redhat.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Radim Krčmář <rkrcmar@redhat.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Tim Chen <tim.c.chen@linux.intel.com>
> Cc: Tom Lendacky <thomas.lendacky@amd.com>
> Cc: kvm@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: x86@kernel.org
> 
> -- 
> 2.7.4
> 

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 04/10] x86/mm: Only flush indirect branches when switching into non dumpable process
  2018-01-21 12:04     ` David Woodhouse
@ 2018-01-21 14:07       ` H.J. Lu
  2018-01-22 10:19       ` Peter Zijlstra
  1 sibling, 0 replies; 143+ messages in thread
From: H.J. Lu @ 2018-01-21 14:07 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Peter Zijlstra, KarimAllah Ahmed, LKML, Andi Kleen,
	Andrea Arcangeli, Andy Lutomirski, Arjan van de Ven, Ashok Raj,
	Asit Mallick, Borislav Petkov, Dan Williams, Dave Hansen,
	David Woodhouse, Greg Kroah-Hartman, H . Peter Anvin,
	Ingo Molnar, Janakarajan Natarajan, Joerg Roedel, Jun Nakajima,
	Laura Abbott, Linus Torvalds, Masami Hiramatsu, Paolo Bonzini,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm, x86

On Sun, Jan 21, 2018 at 4:04 AM, David Woodhouse <dwmw2@infradead.org> wrote:
>
>> On Sat, Jan 20, 2018 at 08:22:55PM +0100, KarimAllah Ahmed wrote:
>>> From: Tim Chen <tim.c.chen@linux.intel.com>
>>>
>>> Flush indirect branches when switching into a process that marked
>>> itself non dumpable.  This protects high value processes like gpg
>>> better, without having too high performance overhead.
>>
>> So if I understand it right, this is only needed if the 'other'
>> executable itself is susceptible to spectre. If say someone audited gpg
>> for spectre-v1 and built it with retpoline, it would be safe to not
>> issue the IBPB, right?
>
>
> Spectre V2 not v1. V1 is separate.
> For V2 retpoline is enough... as long as all the libraries have it too.
>
>> So would it make sense to provide an ELF flag / personality thing such
>> that userspace can indicate it's spectre-safe?
>
> Yes, Arjan and I were pondering that yesterday; it probably does make
> sense. Also for allowing a return to userspace after vmexit, if the army
> process itself is so marked.

Please take a look at how CET is handled in program property in
x86-64 psABI for CET:

https://github.com/hjl-tools/x86-psABI/wiki/x86-64-psABI-cet.pdf


-- 
H.J.

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-20 19:22 ` [RFC 05/10] x86/speculation: Add basic IBRS support infrastructure KarimAllah Ahmed
@ 2018-01-21 14:31   ` Thomas Gleixner
  2018-01-21 14:56     ` Borislav Petkov
                       ` (2 more replies)
  2018-01-29 20:14   ` [RFC,05/10] " Eduardo Habkost
  2018-01-31 10:03   ` [RFC 05/10] " Christophe de Dinechin
  2 siblings, 3 replies; 143+ messages in thread
From: Thomas Gleixner @ 2018-01-21 14:31 UTC (permalink / raw)
  To: KarimAllah Ahmed
  Cc: linux-kernel, Andi Kleen, Andrea Arcangeli, Andy Lutomirski,
	Arjan van de Ven, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, David Woodhouse, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Linus Torvalds,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Tim Chen, Tom Lendacky, kvm, x86

On Sat, 20 Jan 2018, KarimAllah Ahmed wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> Not functional yet; just add the handling for it in the Spectre v2
> mitigation selection, and the X86_FEATURE_IBRS flag which will control
> the code to be added in later patches.
> 
> Also take the #ifdef CONFIG_RETPOLINE from around the RSB-stuffing; IBRS
> mode will want that too.
> 
> For now we are auto-selecting IBRS on Skylake. We will probably end up
> changing that but for now let's default to the safest option.
> 
> XX: Do we want a microcode blacklist?

Oh yes, we want a microcode blacklist. Ideally we refuse to load the
affected microcode in the first place, and if it's already loaded then at
least avoid using the borked features.

PR texts promising that Intel is committed to transparency in this matter
are not sufficient. Intel, please provide the facts, i.e. a proper list of
micro codes and affected SKUs, ASAP.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 10/10] x86/enter: Use IBRS on syscall and interrupts
  2018-01-21 13:50   ` Konrad Rzeszutek Wilk
@ 2018-01-21 14:40     ` KarimAllah Ahmed
  2018-01-21 17:22     ` Dave Hansen
  1 sibling, 0 replies; 143+ messages in thread
From: KarimAllah Ahmed @ 2018-01-21 14:40 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, KarimAllah Ahmed
  Cc: linux-kernel, Andi Kleen, Andrea Arcangeli, Andy Lutomirski,
	Arjan van de Ven, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, David Woodhouse, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Linus Torvalds,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm, x86,
	Arjan Van De Ven

On 01/21/2018 02:50 PM, Konrad Rzeszutek Wilk wrote:

> On Sat, Jan 20, 2018 at 08:23:01PM +0100, KarimAllah Ahmed wrote:
>> From: Tim Chen <tim.c.chen@linux.intel.com>
>>
>> Stop Indirect Branch Speculation on every user space to kernel space
>> transition and reenable it when returning to user space.
> How about interrupts?
>
> That is, should .macro interrupt have the same treatment?

RESTRICT_IB_SPEC is called in switch_to_thread_stack which is almost the 
first thing called from ".macro interrupt".

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-21 14:31   ` Thomas Gleixner
@ 2018-01-21 14:56     ` Borislav Petkov
  2018-01-22  9:51       ` Peter Zijlstra
  2018-01-21 15:25     ` David Woodhouse
  2018-01-23 20:58     ` David Woodhouse
  2 siblings, 1 reply; 143+ messages in thread
From: Borislav Petkov @ 2018-01-21 14:56 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: KarimAllah Ahmed, linux-kernel, Andi Kleen, Andrea Arcangeli,
	Andy Lutomirski, Arjan van de Ven, Ashok Raj, Asit Mallick,
	Dan Williams, Dave Hansen, David Woodhouse, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Linus Torvalds,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Tim Chen, Tom Lendacky, kvm, x86

On Sun, Jan 21, 2018 at 03:31:28PM +0100, Thomas Gleixner wrote:
> Oh yes, we want a microcode blacklist. Ideally we refuse to load the
> affected microcode in the first place, and if it's already loaded then at
> least avoid using the borked features.
> 
> PR texts promising that Intel is committed to transparency in this matter
> are not sufficient. Intel, please provide the facts, i.e. a proper list of
> micro codes and affected SKUs, ASAP.

If we have to do blacklisting, then we need to blacklist microcode
revisions, and fixed ones should get an incremented revision. I.e., we
need a way to *detect* the faulty microcode revision at load time.

Also, blacklisting microcode for early loading will become an ugly dance
so I'd like to avoid it if possible.

Thus, it would be much, much easier if the dracut/initrd creation tooling
already filtered out those blacklisted blobs by looking at the revision in
the header.

Yeah, something like that.
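
To make the "look at the revision in the header" idea concrete, here is a
minimal sketch of such a filter. The struct follows the 48-byte Intel
microcode update header layout as I understand it; the function name, the
blacklist type, and any entries passed in are hypothetical placeholders, not
real revisions. It assumes a little-endian host, since the blob fields are
little-endian:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Intel microcode update header: twelve 32-bit fields, 48 bytes. */
struct ucode_header {
	uint32_t hdrver;	/* header version */
	uint32_t rev;		/* update revision */
	uint32_t date;
	uint32_t sig;		/* processor signature (CPUID.1:EAX) */
	uint32_t cksum;
	uint32_t ldrver;
	uint32_t pf;		/* processor flags */
	uint32_t datasize;
	uint32_t totalsize;
	uint32_t reserved[3];
};

_Static_assert(sizeof(struct ucode_header) == 48, "unexpected header size");

/* Hypothetical blacklist entry: one (signature, revision) pair. */
struct ucode_bad_rev {
	uint32_t sig;
	uint32_t rev;
};

/*
 * Returns 1 if the blob's (sig, rev) pair is on the list, 0 if it is not,
 * and -1 if the blob is too short to contain a header.
 */
int ucode_blob_is_blacklisted(const void *blob, size_t len,
			      const struct ucode_bad_rev *list, size_t n)
{
	struct ucode_header hdr;
	size_t i;

	if (len < sizeof(hdr))
		return -1;

	/* Copy out the header to avoid unaligned access into the blob. */
	memcpy(&hdr, blob, sizeof(hdr));

	for (i = 0; i < n; i++) {
		if (list[i].sig == hdr.sig && list[i].rev == hdr.rev)
			return 1;
	}
	return 0;
}
```

An initrd generator could run this over each candidate blob and skip the
ones that return 1, which keeps the blacklist out of the early loader.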

-- 
Regards/Gruss,
    Boris.

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-21 14:31   ` Thomas Gleixner
  2018-01-21 14:56     ` Borislav Petkov
@ 2018-01-21 15:25     ` David Woodhouse
  2018-01-23 20:58     ` David Woodhouse
  2 siblings, 0 replies; 143+ messages in thread
From: David Woodhouse @ 2018-01-21 15:25 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: KarimAllah Ahmed, linux-kernel, Andi Kleen, Andrea Arcangeli,
	Andy Lutomirski, Arjan van de Ven, Ashok Raj, Asit Mallick,
	Borislav Petkov, Dan Williams, Dave Hansen, David Woodhouse,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	"Radim Krčmář",
	Tim Chen, Tom Lendacky, kvm, x86


> On Sat, 20 Jan 2018, KarimAllah Ahmed wrote:
>> From: David Woodhouse <dwmw@amazon.co.uk>
>>
>> Not functional yet; just add the handling for it in the Spectre v2
>> mitigation selection, and the X86_FEATURE_IBRS flag which will control
>> the code to be added in later patches.
>>
>> Also take the #ifdef CONFIG_RETPOLINE from around the RSB-stuffing; IBRS
>> mode will want that too.
>>
>> For now we are auto-selecting IBRS on Skylake. We will probably end up
>> changing that but for now let's default to the safest option.
>>
>> XX: Do we want a microcode blacklist?
>
> Oh yes, we want a microcode blacklist. Ideally we refuse to load the
> affected microcode in the first place, and if it's already loaded then at
> least avoid using the borked features.
>
> PR texts promising that Intel is committed to transparency in this matter
> are not sufficient. Intel, please provide the facts, i.e. a proper list of
> micro codes and affected SKUs, ASAP.

Perhaps we could start with the list already published by VMware at
https://kb.vmware.com/s/article/52345


-- 
dwmw2

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 04/10] x86/mm: Only flush indirect branches when switching into non dumpable process
  2018-01-21 11:22   ` Peter Zijlstra
  2018-01-21 12:04     ` David Woodhouse
@ 2018-01-21 16:21     ` Ingo Molnar
  2018-01-21 16:25       ` Arjan van de Ven
  2018-01-21 22:20       ` Woodhouse, David
  2018-01-29  6:35     ` Jon Masters
  2 siblings, 2 replies; 143+ messages in thread
From: Ingo Molnar @ 2018-01-21 16:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: KarimAllah Ahmed, linux-kernel, Andi Kleen, Andrea Arcangeli,
	Andy Lutomirski, Arjan van de Ven, Ashok Raj, Asit Mallick,
	Borislav Petkov, Dan Williams, Dave Hansen, David Woodhouse,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm, x86, Dave Hansen


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Sat, Jan 20, 2018 at 08:22:55PM +0100, KarimAllah Ahmed wrote:
> > From: Tim Chen <tim.c.chen@linux.intel.com>
> > 
> > Flush indirect branches when switching into a process that marked
> > itself non dumpable.  This protects high value processes like gpg
> > better, without having too high performance overhead.
> 
> So if I understand it right, this is only needed if the 'other'
> executable itself is susceptible to spectre. If say someone audited gpg
> for spectre-v1 and built it with retpoline, it would be safe to not
> issue the IBPB, right?
> 
> So would it make sense to provide an ELF flag / personality thing such
> that userspace can indicate it's spectre-safe?
> 
> I realize that this is all future work, because so far auditing for v1
> is a lot of pain (we need better tools), but would it be something that
> makes sense in the longer term?

So if it's only about the scheduler barrier, what cycle cost are we talking about 
here?

Because putting something like this into an ELF flag raises the question of who is 
allowed to set the flag - does a user-compiled binary count? If yes then it would 
be a trivial thing for local exploits to set the flag and turn off the barrier.

Yes, we could make a distinction based on the owner of the file, we could use 
security labels, etc. - but it gets somewhat awkward and fragile.

So unless we are talking about measurably high scheduler costs here, I'd prefer to 
err on the side of caution (and simplicity) and issue the barrier unconditionally.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 04/10] x86/mm: Only flush indirect branches when switching into non dumpable process
  2018-01-21 16:21     ` Ingo Molnar
@ 2018-01-21 16:25       ` Arjan van de Ven
  2018-01-21 22:20       ` Woodhouse, David
  1 sibling, 0 replies; 143+ messages in thread
From: Arjan van de Ven @ 2018-01-21 16:25 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: KarimAllah Ahmed, linux-kernel, Andi Kleen, Andrea Arcangeli,
	Andy Lutomirski, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, David Woodhouse, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Linus Torvalds,
	Masami Hiramatsu, Paolo Bonzini, Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm, x86, Dave Hansen

On 1/21/2018 8:21 AM, Ingo Molnar wrote:
> 
> 
> So if it's only about the scheduler barrier, what cycle cost are we talking about
> here?
>

On the order of 5000 to 10000 cycles.
(It depends a bit on the CPU generation, but this range is a reasonable approximation.)
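
To put that range in wall-clock terms, time is simply cycles divided by clock
frequency. A small sketch of the conversion; the function name is made up and
the 3000 MHz figure used below is only an assumed example clock, not a number
from this thread:

```c
#include <stdint.h>

/*
 * Convert a cycle count to nanoseconds for a core clock given in MHz.
 * Integer division truncates toward zero. cpu_mhz must be non-zero.
 */
uint64_t cycles_to_ns(uint64_t cycles, uint64_t cpu_mhz)
{
	return cycles * 1000u / cpu_mhz;
}
```

At the assumed 3000 MHz, cycles_to_ns(5000, 3000) gives 1666 and
cycles_to_ns(10000, 3000) gives 3333, i.e. roughly 1.7 to 3.3 microseconds
per barrier.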



> Because putting something like this into an ELF flag raises the question of who is
> allowed to set the flag - does a user-compiled binary count? If yes then it would
> be a trivial thing for local exploits to set the flag and turn off the barrier.

The barrier is about who you go TO, i.e. the thing under attack.
As you say, depending on the thing that would be the evil one does not work.

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 10/10] x86/enter: Use IBRS on syscall and interrupts
  2018-01-21 13:50   ` Konrad Rzeszutek Wilk
  2018-01-21 14:40     ` KarimAllah Ahmed
@ 2018-01-21 17:22     ` Dave Hansen
  1 sibling, 0 replies; 143+ messages in thread
From: Dave Hansen @ 2018-01-21 17:22 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, KarimAllah Ahmed
  Cc: linux-kernel, Andi Kleen, Andrea Arcangeli, Andy Lutomirski,
	Arjan van de Ven, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, David Woodhouse, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Linus Torvalds,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm, x86,
	Arjan Van De Ven

On 01/21/2018 05:50 AM, Konrad Rzeszutek Wilk wrote:
> On Sat, Jan 20, 2018 at 08:23:01PM +0100, KarimAllah Ahmed wrote:
>> From: Tim Chen <tim.c.chen@linux.intel.com>
>>
>> Stop Indirect Branch Speculation on every user space to kernel space
>> transition and reenable it when returning to user space.
> 
> How about interrupts?

This code covers all kernel entry/exit paths, including interrupts.
Despite its name, "error_entry" is used by the interrupt path.

> That is, should .macro interrupt have the same treatment?

It already does.

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation
  2018-01-20 19:23 ` [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation KarimAllah Ahmed
@ 2018-01-21 19:14   ` Andy Lutomirski
  2018-01-23 16:12     ` Tom Lendacky
  2018-01-21 19:34   ` Linus Torvalds
  1 sibling, 1 reply; 143+ messages in thread
From: Andy Lutomirski @ 2018-01-21 19:14 UTC (permalink / raw)
  To: KarimAllah Ahmed
  Cc: linux-kernel, Andi Kleen, Andrea Arcangeli, Andy Lutomirski,
	Arjan van de Ven, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, David Woodhouse, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Linus Torvalds,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm, x86,
	Arjan Van De Ven



> On Jan 20, 2018, at 11:23 AM, KarimAllah Ahmed <karahmed@amazon.de> wrote:
> 
> From: Tim Chen <tim.c.chen@linux.intel.com>
> 
> Create macros to control Indirect Branch Speculation.
> 
> Name them so they reflect what they are actually doing.
> The macros are used to restrict and unrestrict the indirect branch speculation.
> They do not *disable* (or *enable*) indirect branch speculation. A trip back to
> user-space after *restricting* speculation would still affect the BTB.
> 
> Quoting from a commit by Tim Chen:
> 
> """
>    If IBRS is set, near returns and near indirect jumps/calls will not allow
>    their predicted target address to be controlled by code that executed in a
>    less privileged prediction mode *BEFORE* the IBRS mode was last written with
>    a value of 1 or on another logical processor so long as all Return Stack
>    Buffer (RSB) entries from the previous less privileged prediction mode are
>    overwritten.
> 
>    Thus a near indirect jump/call/return may be affected by code in a less
>    privileged prediction mode that executed *AFTER* IBRS mode was last written
>    with a value of 1.
> """
> 
> [ tglx: Changed macro names and rewrote changelog ]
> [ karahmed: changed macro names *again* and rewrote changelog ]
> 
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Andi Kleen <ak@linux.intel.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Greg KH <gregkh@linuxfoundation.org>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: David Woodhouse <dwmw@amazon.co.uk>
> Cc: Ashok Raj <ashok.raj@intel.com>
> Link: https://lkml.kernel.org/r/3aab341725ee6a9aafd3141387453b45d788d61a.1515542293.git.tim.c.chen@linux.intel.com
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
> arch/x86/entry/calling.h | 73 ++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 73 insertions(+)
> 
> diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
> index 3f48f69..5aafb51 100644
> --- a/arch/x86/entry/calling.h
> +++ b/arch/x86/entry/calling.h
> @@ -6,6 +6,8 @@
> #include <asm/percpu.h>
> #include <asm/asm-offsets.h>
> #include <asm/processor-flags.h>
> +#include <asm/msr-index.h>
> +#include <asm/cpufeatures.h>
> 
> /*
> 
> @@ -349,3 +351,74 @@ For 32-bit we have the following conventions - kernel is built with
> .Lafter_call_\@:
> #endif
> .endm
> +
> +/*
> + * IBRS related macros
> + */
> +.macro PUSH_MSR_REGS
> +    pushq    %rax
> +    pushq    %rcx
> +    pushq    %rdx
> +.endm
> +
> +.macro POP_MSR_REGS
> +    popq    %rdx
> +    popq    %rcx
> +    popq    %rax
> +.endm
> +
> +.macro WRMSR_ASM msr_nr:req edx_val:req eax_val:req
> +    movl    \msr_nr, %ecx
> +    movl    \edx_val, %edx
> +    movl    \eax_val, %eax
> +    wrmsr
> +.endm
> +
> +.macro RESTRICT_IB_SPEC
> +    ALTERNATIVE "jmp .Lskip_\@", "", X86_FEATURE_IBRS
> +    PUSH_MSR_REGS
> +    WRMSR_ASM $MSR_IA32_SPEC_CTRL, $0, $SPEC_CTRL_IBRS
> +    POP_MSR_REGS
> +.Lskip_\@:
> +.endm
> +
> +.macro UNRESTRICT_IB_SPEC
> +    ALTERNATIVE "jmp .Lskip_\@", "", X86_FEATURE_IBRS
> +    PUSH_MSR_REGS
> +    WRMSR_ASM $MSR_IA32_SPEC_CTRL, $0, $0

I think you should be writing 2, not 0, since I'm reasonably confident that we want STIBP on.  Can you explain why you're writing 0?

Also, holy cow, there are so many macros here.

And a meta question: why are there so many submitters of the same series?
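
For readers following the "2, not 0" remark above: the low bits of
IA32_SPEC_CTRL are individual enable flags, with IBRS in bit 0 and STIBP in
bit 1, so writing 2 keeps STIBP on while clearing IBRS. A small sketch of how
the low 32 bits (the value loaded into %eax before wrmsr) compose;
SPEC_CTRL_IBRS is the name used in the quoted patch, while SPEC_CTRL_STIBP
and the helper name are assumptions made here for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

#define SPEC_CTRL_IBRS	(1u << 0)	/* name taken from the quoted patch */
#define SPEC_CTRL_STIBP	(1u << 1)	/* name assumed for this sketch */

/* Build the low 32 bits of the IA32_SPEC_CTRL value to be written. */
uint32_t spec_ctrl_low(bool ibrs, bool stibp)
{
	uint32_t val = 0;

	if (ibrs)
		val |= SPEC_CTRL_IBRS;
	if (stibp)
		val |= SPEC_CTRL_STIBP;
	return val;
}
```

Here spec_ctrl_low(false, true) is the 2 being suggested for the unrestrict
path, and spec_ctrl_low(false, false) is the 0 the quoted macro writes.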

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation
  2018-01-20 19:23 ` [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation KarimAllah Ahmed
  2018-01-21 19:14   ` Andy Lutomirski
@ 2018-01-21 19:34   ` Linus Torvalds
  2018-01-21 20:28     ` David Woodhouse
  1 sibling, 1 reply; 143+ messages in thread
From: Linus Torvalds @ 2018-01-21 19:34 UTC (permalink / raw)
  To: KarimAllah Ahmed
  Cc: Linux Kernel Mailing List, Andi Kleen, Andrea Arcangeli,
	Andy Lutomirski, Arjan van de Ven, Ashok Raj, Asit Mallick,
	Borislav Petkov, Dan Williams, Dave Hansen, David Woodhouse,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm, x86,
	Arjan Van De Ven

All of this is pure garbage.

Is Intel really planning on making this shit architectural? Has anybody
talked to them and told them they are f*cking insane?

Please, any Intel engineers here - talk to your managers.

        Linus

On Jan 20, 2018 11:23, "KarimAllah Ahmed" <karahmed@amazon.de> wrote:

> From: Tim Chen <tim.c.chen@linux.intel.com>
>
> Create macros to control Indirect Branch Speculation.
>
> Name them so they reflect what they are actually doing.
> The macros are used to restrict and unrestrict the indirect branch speculation.
> They do not *disable* (or *enable*) indirect branch speculation. A trip back to
> user-space after *restricting* speculation would still affect the BTB.
>
> Quoting from a commit by Tim Chen:
>
> """
>     If IBRS is set, near returns and near indirect jumps/calls will not allow
>     their predicted target address to be controlled by code that executed in a
>     less privileged prediction mode *BEFORE* the IBRS mode was last written with
>     a value of 1 or on another logical processor so long as all Return Stack
>     Buffer (RSB) entries from the previous less privileged prediction mode are
>     overwritten.
>
>     Thus a near indirect jump/call/return may be affected by code in a less
>     privileged prediction mode that executed *AFTER* IBRS mode was last written
>     with a value of 1.
> """
>
> [ tglx: Changed macro names and rewrote changelog ]
> [ karahmed: changed macro names *again* and rewrote changelog ]
>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Andi Kleen <ak@linux.intel.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Greg KH <gregkh@linuxfoundation.org>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: David Woodhouse <dwmw@amazon.co.uk>
> Cc: Ashok Raj <ashok.raj@intel.com>
> Link: https://lkml.kernel.org/r/3aab341725ee6a9aafd3141387453b45d788d61a.1515542293.git.tim.c.chen@linux.intel.com
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>  arch/x86/entry/calling.h | 73 ++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 73 insertions(+)
>
> diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
> index 3f48f69..5aafb51 100644
> --- a/arch/x86/entry/calling.h
> +++ b/arch/x86/entry/calling.h
> @@ -6,6 +6,8 @@
>  #include <asm/percpu.h>
>  #include <asm/asm-offsets.h>
>  #include <asm/processor-flags.h>
> +#include <asm/msr-index.h>
> +#include <asm/cpufeatures.h>
>
>  /*
>
> @@ -349,3 +351,74 @@ For 32-bit we have the following conventions - kernel is built with
>  .Lafter_call_\@:
>  #endif
>  .endm
> +
> +/*
> + * IBRS related macros
> + */
> +.macro PUSH_MSR_REGS
> +       pushq   %rax
> +       pushq   %rcx
> +       pushq   %rdx
> +.endm
> +
> +.macro POP_MSR_REGS
> +       popq    %rdx
> +       popq    %rcx
> +       popq    %rax
> +.endm
> +
> +.macro WRMSR_ASM msr_nr:req edx_val:req eax_val:req
> +       movl    \msr_nr, %ecx
> +       movl    \edx_val, %edx
> +       movl    \eax_val, %eax
> +       wrmsr
> +.endm
> +
> +.macro RESTRICT_IB_SPEC
> +       ALTERNATIVE "jmp .Lskip_\@", "", X86_FEATURE_IBRS
> +       PUSH_MSR_REGS
> +       WRMSR_ASM $MSR_IA32_SPEC_CTRL, $0, $SPEC_CTRL_IBRS
> +       POP_MSR_REGS
> +.Lskip_\@:
> +.endm
> +
> +.macro UNRESTRICT_IB_SPEC
> +       ALTERNATIVE "jmp .Lskip_\@", "", X86_FEATURE_IBRS
> +       PUSH_MSR_REGS
> +       WRMSR_ASM $MSR_IA32_SPEC_CTRL, $0, $0
> +       POP_MSR_REGS
> +.Lskip_\@:
> +.endm
> +
> +.macro RESTRICT_IB_SPEC_CLOBBER
> +       ALTERNATIVE "jmp .Lskip_\@", "", X86_FEATURE_IBRS
> +       WRMSR_ASM $MSR_IA32_SPEC_CTRL, $0, $SPEC_CTRL_IBRS
> +.Lskip_\@:
> +.endm
> +
> +.macro UNRESTRICT_IB_SPEC_CLOBBER
> +       ALTERNATIVE "jmp .Lskip_\@", "", X86_FEATURE_IBRS
> +       WRMSR_ASM $MSR_IA32_SPEC_CTRL, $0, $0
> +.Lskip_\@:
> +.endm
> +
> +.macro RESTRICT_IB_SPEC_SAVE_AND_CLOBBER save_reg:req
> +       ALTERNATIVE "jmp .Lskip_\@", "", X86_FEATURE_IBRS
> +       movl    $MSR_IA32_SPEC_CTRL, %ecx
> +       rdmsr
> +       movl    %eax, \save_reg
> +       movl    $0, %edx
> +       movl    $SPEC_CTRL_IBRS, %eax
> +       wrmsr
> +.Lskip_\@:
> +.endm
> +
> +.macro RESTORE_IB_SPEC_CLOBBER save_reg:req
> +       ALTERNATIVE "jmp .Lskip_\@", "", X86_FEATURE_IBRS
> +       /* Set IBRS to the value saved in the save_reg */
> +       movl    $MSR_IA32_SPEC_CTRL, %ecx
> +       movl    $0, %edx
> +       movl    \save_reg, %eax
> +       wrmsr
> +.Lskip_\@:
> +.endm
> --
> 2.7.4
>
>
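
The save/restore pair at the end of the quoted patch can be modelled in
plain C to make its behaviour easier to follow. This is only a sketch: the
real macros execute rdmsr/wrmsr in ring 0, whereas fake_spec_ctrl,
model_rdmsr() and model_wrmsr() below are hypothetical stand-ins. The MSR
index 0x48 and the IBRS bit position are the architectural values.

```c
#include <stdint.h>

#define MSR_IA32_SPEC_CTRL 0x48u        /* architectural MSR index */
#define SPEC_CTRL_IBRS     (1u << 0)    /* IBRS is bit 0 of that MSR */

/* Hypothetical stand-in for the real MSR; user code cannot run wrmsr. */
static uint64_t fake_spec_ctrl;

static uint64_t model_rdmsr(uint32_t msr)
{
    return msr == MSR_IA32_SPEC_CTRL ? fake_spec_ctrl : 0;
}

/* wrmsr takes the value split across edx (high half) and eax (low half). */
static void model_wrmsr(uint32_t msr, uint32_t edx, uint32_t eax)
{
    if (msr == MSR_IA32_SPEC_CTRL)
        fake_spec_ctrl = ((uint64_t)edx << 32) | eax;
}

/* Mirrors RESTRICT_IB_SPEC_SAVE_AND_CLOBBER: save low half, then set IBRS. */
static uint32_t restrict_ib_spec_save(void)
{
    uint32_t saved = (uint32_t)model_rdmsr(MSR_IA32_SPEC_CTRL);

    model_wrmsr(MSR_IA32_SPEC_CTRL, 0, SPEC_CTRL_IBRS);
    return saved;
}

/* Mirrors RESTORE_IB_SPEC_CLOBBER: write the saved low half back. */
static void restore_ib_spec(uint32_t saved)
{
    model_wrmsr(MSR_IA32_SPEC_CTRL, 0, saved);
}
```

Note that, as in the macros, only the low 32 bits are saved and the high
half is always written as zero, and the restrict step overwrites every
other bit in the register rather than OR-ing IBRS in.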


^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation
  2018-01-21 19:34   ` Linus Torvalds
@ 2018-01-21 20:28     ` David Woodhouse
  2018-01-21 21:35       ` Linus Torvalds
  2018-01-23 20:16       ` Pavel Machek
  0 siblings, 2 replies; 143+ messages in thread
From: David Woodhouse @ 2018-01-21 20:28 UTC (permalink / raw)
  To: Linus Torvalds, KarimAllah Ahmed
  Cc: Linux Kernel Mailing List, Andi Kleen, Andrea Arcangeli,
	Andy Lutomirski, Arjan van de Ven, Ashok Raj, Asit Mallick,
	Borislav Petkov, Dan Williams, Dave Hansen, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Masami Hiramatsu,
	Paolo Bonzini, Peter Zijlstra, Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm, x86,
	Arjan Van De Ven

[-- Attachment #1: Type: text/plain, Size: 1774 bytes --]

On Sun, 2018-01-21 at 11:34 -0800, Linus Torvalds wrote:
> All of this is pure garbage.
> 
> Is Intel really planning on making this shit architectural? Has
> anybody talked to them and told them they are f*cking insane?
> 
> Please, any Intel engineers here - talk to your managers. 

If the alternative was a two-decade product recall and giving everyone
free CPUs, I'm not sure it was entirely insane.

Certainly it's a nasty hack, but hey — the world was on fire and in the
end we didn't have to just turn the datacentres off and go back to goat
farming, so it's not all bad.

As a hack for existing CPUs, it's just about tolerable — as long as it
can die entirely by the next generation.

So the part I think is odd is the IBRS_ALL feature, where a future
CPU will advertise "I am able to be not broken" and then you have to
set the IBRS bit once at boot time to *ask* it not to be broken. That
part is weird, because it ought to have been treated like the RDCL_NO
bit — just "you don't have to worry any more, it got better".
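
The asymmetry between the two bits can be sketched as follows. Both live in
IA32_ARCH_CAPABILITIES (RDCL_NO is bit 0, IBRS_ALL is bit 1); struct
boot_plan and plan_from_arch_caps() are hypothetical names for illustration,
and the sketch assumes a CPU that would otherwise be treated as vulnerable
to Meltdown.

```c
#include <stdbool.h>
#include <stdint.h>

#define ARCH_CAP_RDCL_NO   (1ull << 0)  /* CPU is not vulnerable to Meltdown */
#define ARCH_CAP_IBRS_ALL  (1ull << 1)  /* always-on IBRS mode is available */
#define SPEC_CTRL_IBRS     (1ull << 0)

/* Hypothetical summary of what the OS still has to do at boot. */
struct boot_plan {
    bool need_kpti;        /* keep page-table isolation enabled */
    bool need_ibrs_write;  /* must write IBRS once to get the protection */
    uint64_t spec_ctrl;    /* value to program into IA32_SPEC_CTRL */
};

static struct boot_plan plan_from_arch_caps(uint64_t arch_caps)
{
    struct boot_plan p = { true, false, 0 };

    /* RDCL_NO: nothing to do, the problem is simply gone. */
    if (arch_caps & ARCH_CAP_RDCL_NO)
        p.need_kpti = false;

    /* IBRS_ALL: the fix exists but software must still opt in. */
    if (arch_caps & ARCH_CAP_IBRS_ALL) {
        p.need_ibrs_write = true;
        p.spec_ctrl |= SPEC_CTRL_IBRS;
    }
    return p;
}
```

RDCL_NO removes work from the OS, while IBRS_ALL adds a mandatory write.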

https://software.intel.com/sites/default/files/managed/c5/63/336996-Speculative-Execution-Side-Channel-Mitigations.pdf

We do need the IBPB feature to complete the protection that retpoline
gives us — it's that or rebuild all of userspace with retpoline.

We'll also want to expose IBRS to VM guests, since Windows uses it.

I think we could probably live without the IBRS frobbing in our own
syscall/interrupt paths, as long as we're prepared to live with the
very hypothetical holes that still exist on Skylake. Because I like
IBRS more... no, let me rephrase... I hate IBRS less than I hate the
'deepstack' and other stuff that was being proposed to make Skylake
almost safe with retpoline.


^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation
  2018-01-21 20:28     ` David Woodhouse
@ 2018-01-21 21:35       ` Linus Torvalds
  2018-01-21 22:00         ` David Woodhouse
  2018-01-23 20:16       ` Pavel Machek
  1 sibling, 1 reply; 143+ messages in thread
From: Linus Torvalds @ 2018-01-21 21:35 UTC (permalink / raw)
  To: David Woodhouse
  Cc: KarimAllah Ahmed, Linux Kernel Mailing List, Andi Kleen,
	Andrea Arcangeli, Andy Lutomirski, Arjan van de Ven, Ashok Raj,
	Asit Mallick, Borislav Petkov, Dan Williams, Dave Hansen,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, KVM list,
	the arch/x86 maintainers, Arjan Van De Ven

On Sun, Jan 21, 2018 at 12:28 PM, David Woodhouse <dwmw2@infradead.org> wrote:
> On Sun, 2018-01-21 at 11:34 -0800, Linus Torvalds wrote:
>> All of this is pure garbage.
>>
>> Is Intel really planning on making this shit architectural? Has
>> anybody talked to them and told them they are f*cking insane?
>>
>> Please, any Intel engineers here - talk to your managers.
>
> If the alternative was a two-decade product recall and giving everyone
> free CPUs, I'm not sure it was entirely insane.

You seem to have bought into the Kool-Aid. Please add a healthy dose
of critical thinking. Because this isn't the kind of Kool-Aid that
makes for a fun trip with pretty pictures. This is the kind that melts
your brain.

> Certainly it's a nasty hack, but hey — the world was on fire and in the
> end we didn't have to just turn the datacentres off and go back to goat
> farming, so it's not all bad.

It's not that it's a nasty hack. It's much worse than that.

> As a hack for existing CPUs, it's just about tolerable — as long as it
> can die entirely by the next generation.

That's part of the big problem here. The speculation control cpuid
stuff shows that Intel actually seems to plan on doing the right thing
for meltdown (the main question being _when_). Which is not a huge
surprise, since it should be easy to fix, and it's a really honking
big hole to drive through. Not doing the right thing for meltdown
would be completely unacceptable.

So the IBRS garbage implies that Intel is _not_ planning on doing the
right thing for the indirect branch speculation.

Honestly, that's completely unacceptable too.

> So the part I think is odd is the IBRS_ALL feature, where a future
> CPU will advertise "I am able to be not broken" and then you have to
> set the IBRS bit once at boot time to *ask* it not to be broken. That
> part is weird, because it ought to have been treated like the RDCL_NO
> bit — just "you don't have to worry any more, it got better".

It's not "weird" at all. It's very much part of the whole "this is
complete garbage" issue.

The whole IBRS_ALL feature to me very clearly says "Intel is not
serious about this, we'll have an ugly hack that will be so expensive
that we don't want to enable it by default, because that would look
bad in benchmarks".

So instead they try to push the garbage down to us. And they are doing
it entirely wrong, even from a technical standpoint.

I'm sure there is some lawyer there who says "we'll have to go through
motions to protect against a lawsuit". But legal reasons do not make
for good technology, or good patches that I should apply.

> We do need the IBPB feature to complete the protection that retpoline
> gives us — it's that or rebuild all of userspace with retpoline.

BULLSHIT.

Have you _looked_ at the patches you are talking about?  You should
have - several of them bear your name.

The patches do things like add the garbage MSR writes to the kernel
entry/exit points. That's insane. That says "we're trying to protect
the kernel".  We already have retpoline there, with less overhead.

So somebody isn't telling the truth here. Somebody is pushing complete
garbage for unclear reasons. Sorry for having to point that out.

If this was about flushing the BTB at actual context switches between
different users, I'd believe you. But that's not at all what the
patches do.

As it is, the patches are COMPLETE AND UTTER GARBAGE.

They do literally insane things. They do things that do not make
sense. That makes all your arguments questionable and suspicious. The
patches do things that are not sane.

WHAT THE F*CK IS GOING ON?

And that's actually ignoring the much _worse_ issue, namely that the
whole hardware interface is literally mis-designed by morons.

It's mis-designed for two major reasons:

 - the "the interface implies Intel will never fix it" reason.

   See the difference between IBRS_ALL and RDCL_NO. One implies Intel
   will fix something. The other does not.

   Do you really think that is acceptable?

 - the "there is no performance indicator".

   The whole point of having cpuid and flags from the
   microarchitecture is that we can use those to make decisions.

   But since we already know that the IBRS overhead is _huge_ on
   existing hardware, all those hardware capability bits are just
   complete and utter garbage. Nobody sane will use them, since the cost
   is too damn high. So you end up having to look at "which CPU stepping
   is this" anyway.

I think we need something better than this garbage.

                Linus

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation
  2018-01-21 21:35       ` Linus Torvalds
@ 2018-01-21 22:00         ` David Woodhouse
  2018-01-21 22:27           ` Linus Torvalds
  0 siblings, 1 reply; 143+ messages in thread
From: David Woodhouse @ 2018-01-21 22:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: KarimAllah Ahmed, Linux Kernel Mailing List, Andi Kleen,
	Andrea Arcangeli, Andy Lutomirski, Arjan van de Ven, Ashok Raj,
	Asit Mallick, Borislav Petkov, Dan Williams, Dave Hansen,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, KVM list,
	the arch/x86 maintainers, Arjan Van De Ven

[-- Attachment #1: Type: text/plain, Size: 3369 bytes --]

On Sun, 2018-01-21 at 13:35 -0800, Linus Torvalds wrote:
> On Sun, Jan 21, 2018 at 12:28 PM, David Woodhouse <dwmw2@infradead.org> wrote:
> > As a hack for existing CPUs, it's just about tolerable — as long as it
> > can die entirely by the next generation.
>
> That's part of the big problem here. The speculation control cpuid
> stuff shows that Intel actually seems to plan on doing the right thing
> for meltdown (the main question being _when_). Which is not a huge
> surprise, since it should be easy to fix, and it's a really honking
> big hole to drive through. Not doing the right thing for meltdown
> would be completely unacceptable.
> 
> So the IBRS garbage implies that Intel is _not_ planning on doing the
> right thing for the indirect branch speculation.
> 
> Honestly, that's completely unacceptable too.

Agreed. I've been saying that since I first saw the IBRS_ALL proposal.
There's *no* good reason for it to be opt-in. Just fix it!

> > So the part I think is odd is the IBRS_ALL feature, where a future
> > CPU will advertise "I am able to be not broken" and then you have to
> > set the IBRS bit once at boot time to *ask* it not to be broken. That
> > part is weird, because it ought to have been treated like the RDCL_NO
> > bit — just "you don't have to worry any more, it got better".
>
> It's not "weird" at all. It's very much part of the whole "this is
> complete garbage" issue.
> 
> The whole IBRS_ALL feature to me very clearly says "Intel is not
> serious about this, we'll have an ugly hack that will be so expensive
> that we don't want to enable it by default, because that would look
> bad in benchmarks".
> 
> So instead they try to push the garbage down to us. And they are doing
> it entirely wrong, even from a technical standpoint.

Right. The whole IBRS/IBPB thing as a nasty hack in the short term I
could live with, but it's the long-term implications of IBRS_ALL that
I'm unhappy about.

My understanding was that the IBRS_ALL performance was supposed to not
suck — to the extent that we'd just turn it on and then ALTERNATIVE out
the retpolines, and that would be the best option.

But if that's the case, why are they making it an option, and not just
doing the same as RDCL_NO does for "we fixed Meltdown"?

> > We do need the IBPB feature to complete the protection that retpoline
> > gives us — it's that or rebuild all of userspace with retpoline.
>
> BULLSHIT.
> 
> Have you _looked_ at the patches you are talking about?  You should
> have - several of them bear your name.
> 
> The patches do things like add the garbage MSR writes to the kernel
> entry/exit points. That's insane. That says "we're trying to protect
> the kernel".  We already have retpoline there, with less overhead.

You're looking at IBRS usage, not IBPB. They are different things.

Yes, the one you're looking at really *is* trying to protect the
kernel, and you're right that it's largely redundant with retpoline.
(Assuming we can live with the implications on Skylake, as I said.)

> If this was about flushing the BTB at actual context switches between
> different users, I'd believe you. But that's not at all what the
> patches do.

That's what the *IBPB* patches do. Those were deliberately put first in
the series (and in fact that's where I stopped, when I posted).


^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 04/10] x86/mm: Only flush indirect branches when switching into non dumpable process
  2018-01-21 16:21     ` Ingo Molnar
  2018-01-21 16:25       ` Arjan van de Ven
@ 2018-01-21 22:20       ` Woodhouse, David
  1 sibling, 0 replies; 143+ messages in thread
From: Woodhouse, David @ 2018-01-21 22:20 UTC (permalink / raw)
  To: mingo, peterz
  Cc: kvm, linux-kernel, tim.c.chen, ashok.raj, Raslan, KarimAllah,
	arjan, bp, tglx, Janakarajan.Natarajan, dave, ak, joro,
	dan.j.williams, x86, hpa, aarcange, mingo, luto, torvalds,
	gregkh, dave.hansen, mhiramat, thomas.lendacky, asit.k.mallick,
	jun.nakajima, labbott, rkrcmar, pbonzini


[-- Attachment #1.1: Type: text/plain, Size: 432 bytes --]

On Sun, 2018-01-21 at 17:21 +0100, Ingo Molnar wrote:
> 
> Because putting something like this into an ELF flag raises the question of who is 
> allowed to set the flag - does a user-compiled binary count? If yes then it would 
> be a trivial thing for local exploits to set the flag and turn off the barrier.

You can only allow *yourself* to be exploited that way. The flag says,
"I'm OK, you don't need to protect me".

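The point about the flag only affecting its owner can be illustrated with a
small sketch. It assumes the barrier decision looks solely at the task being
switched *to*; struct task_model, its self_protected field (the hypothetical
ELF-derived "I'm OK" flag) and need_ibpb_on_switch_to() are illustrative
names, not kernel API.

```c
#include <stdbool.h>

/* Hypothetical per-task state used only for this illustration. */
struct task_model {
    bool dumpable;        /* stand-in for the task's dumpable state */
    bool self_protected;  /* hypothetical flag: "built safe, skip my IBPB" */
};

/*
 * Decide whether to issue IBPB before running 'next'. Only the incoming
 * task's own flags are consulted, so setting self_protected on a binary
 * can only remove protection from that binary itself.
 */
static bool need_ibpb_on_switch_to(const struct task_model *next)
{
    if (next->self_protected)
        return false;
    return !next->dumpable;
}
```

An attacker who sets the flag on their own binary changes nothing for a
non-dumpable victim: the barrier is still issued when the victim is
scheduled in.
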

[-- Attachment #2.1: Type: text/plain, Size: 197 bytes --]




Amazon Web Services UK Limited. Registered in England and Wales with registration number 08650665 and which has its registered office at 60 Holborn Viaduct, London EC1A 2FD, United Kingdom.

[-- Attachment #2.2: Type: text/html, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation
  2018-01-21 22:00         ` David Woodhouse
@ 2018-01-21 22:27           ` Linus Torvalds
  2018-01-22 16:27             ` David Woodhouse
  0 siblings, 1 reply; 143+ messages in thread
From: Linus Torvalds @ 2018-01-21 22:27 UTC (permalink / raw)
  To: David Woodhouse
  Cc: KarimAllah Ahmed, Linux Kernel Mailing List, Andi Kleen,
	Andrea Arcangeli, Andy Lutomirski, Arjan van de Ven, Ashok Raj,
	Asit Mallick, Borislav Petkov, Dan Williams, Dave Hansen,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, KVM list,
	the arch/x86 maintainers, Arjan Van De Ven

On Sun, Jan 21, 2018 at 2:00 PM, David Woodhouse <dwmw2@infradead.org> wrote:
>>
>> The patches do things like add the garbage MSR writes to the kernel
>> entry/exit points. That's insane. That says "we're trying to protect
>> the kernel".  We already have retpoline there, with less overhead.
>
> You're looking at IBRS usage, not IBPB. They are different things.

Ehh. Odd intel naming detail.

If you look at this series, it very much does that kernel entry/exit
stuff. It was patch 10/10, iirc. In fact, the patch I was replying to
was explicitly setting that garbage up.

And I really don't want to see these garbage patches just mindlessly
sent around.

                  Linus

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-21 14:56     ` Borislav Petkov
@ 2018-01-22  9:51       ` Peter Zijlstra
  2018-01-22 12:06         ` Borislav Petkov
  0 siblings, 1 reply; 143+ messages in thread
From: Peter Zijlstra @ 2018-01-22  9:51 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Thomas Gleixner, KarimAllah Ahmed, linux-kernel, Andi Kleen,
	Andrea Arcangeli, Andy Lutomirski, Arjan van de Ven, Ashok Raj,
	Asit Mallick, Dan Williams, Dave Hansen, David Woodhouse,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini,
	Radim Krčmář,
	Tim Chen, Tom Lendacky, kvm, x86

On Sun, Jan 21, 2018 at 03:56:55PM +0100, Borislav Petkov wrote:
> Also, blacklisting microcode for early loading will become an ugly dance
> so I'd like to avoid it if possible.
> 
> Thus, it would be much much easier if dracut/initrd creation thing
> already filters those blacklisted blobs by looking at the revision in
> the header. Which is much easier.

That wouldn't be enough; AFAIU there's people with this stuff already
flashed in their BIOS. So the kernel needs to deal with it one way or
another.

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 04/10] x86/mm: Only flush indirect branches when switching into non dumpable process
  2018-01-21 12:04     ` David Woodhouse
  2018-01-21 14:07       ` H.J. Lu
@ 2018-01-22 10:19       ` Peter Zijlstra
  2018-01-22 10:23         ` David Woodhouse
  1 sibling, 1 reply; 143+ messages in thread
From: Peter Zijlstra @ 2018-01-22 10:19 UTC (permalink / raw)
  To: David Woodhouse
  Cc: hjl.tools, KarimAllah Ahmed, linux-kernel, Andi Kleen,
	Andrea Arcangeli, Andy Lutomirski, Arjan van de Ven, Ashok Raj,
	Asit Mallick, Borislav Petkov, Dan Williams, Dave Hansen,
	David Woodhouse, Greg Kroah-Hartman, H . Peter Anvin,
	Ingo Molnar, Janakarajan Natarajan, Joerg Roedel, Jun Nakajima,
	Laura Abbott, Linus Torvalds, Masami Hiramatsu, Paolo Bonzini,
	"Radim Krčmář",
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm, x86

On Sun, Jan 21, 2018 at 12:04:03PM -0000, David Woodhouse wrote:
> > So if I understand it right, this is only needed if the 'other'
> > executable itself is susceptible to spectre. If say someone audited gpg
> > for spectre-v1 and build it with retpoline, it would be safe to not
> > issue the IBPB, right?
> 
> 
> Spectre V2 not v1. V1 is separate.
> For V2 retpoline is enough... as long as all the libraries have it too.

Ah, easy then. So we need this toolchain bit and then simply rebuild
works and everything is happy again, well except of course those people
running closed sores binaries, but meh.. :-)

> > I realize that this is all future work, because so far auditing for v1
> > is a lot of pain (we need better tools), but would it be something that
> > makes sense in the longer term?
> 
> It's *only* retpoline so it isn't actually that much. Although I'm wary of
> Cc'ing HJ on such thoughts because he seems to never sleep and always
> respond promptly with "OK I did that... " :)
> 
> If we did systematically do this in userspace we'd probably want to do
> external thunks there too, and a flag in the auxvec to tell it not to
> bother (for IBRS_ALL etc.).

Right, so if it's v2/retpoline only, we really should do this asap and
then rebuild world on distros (or arch/gentoo people could read a book
or something).

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 04/10] x86/mm: Only flush indirect branches when switching into non dumpable process
  2018-01-22 10:19       ` Peter Zijlstra
@ 2018-01-22 10:23         ` David Woodhouse
  0 siblings, 0 replies; 143+ messages in thread
From: David Woodhouse @ 2018-01-22 10:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: hjl.tools, KarimAllah Ahmed, linux-kernel, Andi Kleen,
	Andrea Arcangeli, Andy Lutomirski, Arjan van de Ven, Ashok Raj,
	Asit Mallick, Borislav Petkov, Dan Williams, Dave Hansen,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini,
	"Radim Krčmář",
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm, x86

[-- Attachment #1: Type: text/plain, Size: 545 bytes --]

On Mon, 2018-01-22 at 11:19 +0100, Peter Zijlstra wrote:
> Right, so if it's v2/retpoline only, we really should do this asap and
> then rebuild world on distros (or arch/gentoo people could read a book
> or something).

By the time we manage to rebuild all the distros, I *seriously* hope
that someone would be shipping a fixed CPU.

And not just the half-way-there IBRS_ALL bit that still requires the
IBPB flushing on context switches that's discussed in this patch, but
an *actual* fix so we can forget about it all and go drinking.


^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-22  9:51       ` Peter Zijlstra
@ 2018-01-22 12:06         ` Borislav Petkov
  2018-01-22 13:30           ` Greg Kroah-Hartman
  0 siblings, 1 reply; 143+ messages in thread
From: Borislav Petkov @ 2018-01-22 12:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, KarimAllah Ahmed, linux-kernel, Andi Kleen,
	Andrea Arcangeli, Andy Lutomirski, Arjan van de Ven, Ashok Raj,
	Asit Mallick, Dan Williams, Dave Hansen, David Woodhouse,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini,
	Radim Krčmář,
	Tim Chen, Tom Lendacky, kvm, x86

On Mon, Jan 22, 2018 at 10:51:53AM +0100, Peter Zijlstra wrote:
> That wouldn't be enough; AFAIU there's people with this stuff already
> flashed in their BIOS. So the kernel needs to deal with it one way or
> another.

Not a lot we can do there except maybe disable IBRS on those and users
can go and complain to their BIOS vendor to give them a downgrade or
they can downgrade themselves.
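
A kernel-side check of the kind discussed here might look roughly like the
sketch below: compare the running microcode revision against a list of
known-bad ones and refuse to use IBRS on a match. The struct, the function
name and above all the table entries are placeholders invented for
illustration; they are not real CPU models or microcode revisions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical identifier for one microcode build on one CPU stepping. */
struct ucode_id {
    uint8_t model;
    uint8_t stepping;
    uint32_t revision;
};

/* Placeholder entries only; these are NOT real bad-microcode revisions. */
static const struct ucode_id bad_ucode[] = {
    { 0xAA, 0x1, 0x00001234 },
    { 0xBB, 0x2, 0x00005678 },
};

/* Returns false when the running microcode matches a blacklisted entry. */
static bool ibrs_usable(struct ucode_id cur)
{
    for (size_t i = 0; i < sizeof(bad_ucode) / sizeof(bad_ucode[0]); i++) {
        if (bad_ucode[i].model == cur.model &&
            bad_ucode[i].stepping == cur.stepping &&
            bad_ucode[i].revision == cur.revision)
            return false;
    }
    return true;
}
```

Because the check runs on whatever revision is already loaded, it covers
microcode that came from the BIOS as well as from the initrd.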

If we had free BIOS, this would've been a whole different story...

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-22 12:06         ` Borislav Petkov
@ 2018-01-22 13:30           ` Greg Kroah-Hartman
  2018-01-22 13:37             ` Woodhouse, David
  0 siblings, 1 reply; 143+ messages in thread
From: Greg Kroah-Hartman @ 2018-01-22 13:30 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Peter Zijlstra, Thomas Gleixner, KarimAllah Ahmed, linux-kernel,
	Andi Kleen, Andrea Arcangeli, Andy Lutomirski, Arjan van de Ven,
	Ashok Raj, Asit Mallick, Dan Williams, Dave Hansen,
	David Woodhouse, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini,
	Radim Krčmář,
	Tim Chen, Tom Lendacky, kvm, x86

On Mon, Jan 22, 2018 at 01:06:18PM +0100, Borislav Petkov wrote:
> On Mon, Jan 22, 2018 at 10:51:53AM +0100, Peter Zijlstra wrote:
> > That wouldn't be enough; AFAIU there's people with this stuff already
> > flashed in their BIOS. So the kernel needs to deal with it one way or
> > another.
> 
> Not a lot we can do there except maybe disable IBRS on those and users
> can go and complain to their BIOS vendor to give them a downgrade or
> they can downgrade themselves.
> 
> If we had free BIOS, this would've been a whole different story...

We kind of do, you can submit patches to UEFI, but I doubt that the
processor-specific portions are actually present in the Tianocore code
to be able to be patched.

What about LinuxBoot <https://linuxboot.org>, does it also take over too
late in the boot process to control this?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-22 13:30           ` Greg Kroah-Hartman
@ 2018-01-22 13:37             ` Woodhouse, David
  0 siblings, 0 replies; 143+ messages in thread
From: Woodhouse, David @ 2018-01-22 13:37 UTC (permalink / raw)
  To: gregkh, bp
  Cc: kvm, linux-kernel, peterz, arjan, Raslan, KarimAllah, ashok.raj,
	tglx, Janakarajan.Natarajan, tim.c.chen, ak, joro,
	dan.j.williams, x86, hpa, aarcange, mingo, luto, torvalds,
	pbonzini, dave.hansen, mhiramat, thomas.lendacky, asit.k.mallick,
	jun.nakajima, labbott, rkrcmar


[-- Attachment #1.1: Type: text/plain, Size: 660 bytes --]

On Mon, 2018-01-22 at 14:30 +0100, Greg Kroah-Hartman wrote:
> We kind of do, you can submit patches to UEFI, but I doubt that the
> processor-specific portions are actually present in the Tianocore code
> to be able to be patched.

This is just about which microcode your BIOS loads into the CPU before
booting the OS. It's not "processor-specific portions in the Tianocore
code"; more a data blob — just like when Linux updates microcode.

> What about LinuxBoot <https://linuxboot.org>, does it also take over too
> late in the boot process to control this?

Yes, I believe microcode updates are done in PEI which is before
LinuxBoot takes over.


^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation
  2018-01-21 22:27           ` Linus Torvalds
@ 2018-01-22 16:27             ` David Woodhouse
  2018-01-23  7:29               ` Ingo Molnar
  0 siblings, 1 reply; 143+ messages in thread
From: David Woodhouse @ 2018-01-22 16:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: KarimAllah Ahmed, Linux Kernel Mailing List, Andi Kleen,
	Andrea Arcangeli, Andy Lutomirski, Arjan van de Ven, Ashok Raj,
	Asit Mallick, Borislav Petkov, Dan Williams, Dave Hansen,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, KVM list,
	the arch/x86 maintainers, Arjan Van De Ven

[-- Attachment #1: Type: text/plain, Size: 5893 bytes --]

On Sun, 2018-01-21 at 14:27 -0800, Linus Torvalds wrote:
> On Sun, Jan 21, 2018 at 2:00 PM, David Woodhouse <dwmw2@infradead.org> wrote:
> >>
> >> The patches do things like add the garbage MSR writes to the kernel
> >> entry/exit points. That's insane. That says "we're trying to protect
> >> the kernel".  We already have retpoline there, with less overhead.
> >
> > You're looking at IBRS usage, not IBPB. They are different things.
> 
> Ehh. Odd intel naming detail.
> 
> If you look at this series, it very much does that kernel entry/exit
> stuff. It was patch 10/10, iirc. In fact, the patch I was replying to
> was explicitly setting that garbage up.
> 
> And I really don't want to see these garbage patches just mindlessly
> sent around.

I think we've covered the technical part of this now, not that you like
it — not that any of us *like* it. But since the peanut gallery is
paying lots of attention it's probably worth explaining it a little
more for their benefit.

This is all about Spectre variant 2, where the CPU can be tricked into
mispredicting the target of an indirect branch. And I'm specifically
looking at what we can do on *current* hardware, where we're limited to
the hacks they can manage to add in the microcode.

The new microcode from Intel and AMD adds three new features.

One new feature (IBPB) is a complete barrier for branch prediction.
After frobbing this, no branch targets learned earlier are going to be
used. It's kind of expensive (on the order of 4000 cycles).

The second (STIBP) protects a hyperthread sibling from following branch
predictions which were learned on another sibling. You *might* want
this when running unrelated processes in userspace, for example. Or
different VM guests running on HT siblings.

The third feature (IBRS) is more complicated. It's designed to be
set when you enter a more privileged execution mode (i.e. the kernel).
It prevents branch targets learned in a less-privileged execution mode,
BEFORE IT WAS MOST RECENTLY SET, from taking effect. But it's not just
a 'set-and-forget' feature, it also has barrier-like semantics and
needs to be set on *each* entry into the kernel (from userspace or a VM
guest). It's *also* expensive. And a vile hack, but for a while it was
the only option we had.

Even with IBRS, the CPU cannot tell the difference between different
userspace processes, and between different VM guests. So in addition to
IBRS to protect the kernel, we need the full IBPB barrier on context
switch and vmexit. And maybe STIBP while they're running.
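
As a rough map of which control is touched when, the sketch below pairs each
event with the MSR write it implies. The MSR indices and bit positions are
the architectural ones (IA32_SPEC_CTRL is 0x48 with IBRS in bit 0 and STIBP
in bit 1; IA32_PRED_CMD is 0x49 with IBPB in bit 0). The enum, the struct
and write_for_event() are hypothetical, and treating STIBP as a simple
per-exit choice is an assumption made for brevity.

```c
#include <stdbool.h>
#include <stdint.h>

#define MSR_IA32_SPEC_CTRL 0x48u
#define MSR_IA32_PRED_CMD  0x49u
#define SPEC_CTRL_IBRS     (1u << 0)
#define SPEC_CTRL_STIBP    (1u << 1)
#define PRED_CMD_IBPB      (1u << 0)

enum cpu_event { EV_KERNEL_ENTRY, EV_KERNEL_EXIT, EV_CONTEXT_SWITCH };

/* One MSR write: which register and what low-half value. */
struct msr_write {
    uint32_t msr;
    uint32_t value;
};

static struct msr_write write_for_event(enum cpu_event ev, bool want_stibp)
{
    struct msr_write w = { MSR_IA32_SPEC_CTRL, 0 };

    switch (ev) {
    case EV_KERNEL_ENTRY:
        /* IBRS must be written again on every entry, not just once. */
        w.value = SPEC_CTRL_IBRS;
        break;
    case EV_KERNEL_EXIT:
        /* Drop IBRS; optionally keep sibling-thread protection on. */
        w.value = want_stibp ? SPEC_CTRL_STIBP : 0;
        break;
    case EV_CONTEXT_SWITCH:
        /* IBPB is a command, issued through a separate register. */
        w.msr = MSR_IA32_PRED_CMD;
        w.value = PRED_CMD_IBPB;
        break;
    }
    return w;
}
```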

Then along came Paul with the cunning plan of "oh, indirect branches
can be exploited? Screw it, let's not have any of *those* then", which
is retpoline. And it's a *lot* faster than frobbing IBRS on every entry
into the kernel. It's a massive performance win.

So now we *mostly* don't need IBRS. We build with retpoline, use IBPB
on context switches/vmexit (which is in the first part of this patch
series before IBRS is added), and we're safe. We even refactored the
patch series to put retpoline first.

But wait, why did I say "mostly"? Well, not everyone has a retpoline
compiler yet... but OK, screw them; they need to update.

Then there's Skylake, and that generation of CPU cores. For complicated
reasons they actually end up being vulnerable not just on indirect
branches, but also on a 'ret' in some circumstances (such as 16+ CALLs
in a deep chain).
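
The deep-call-chain case can be pictured with a toy model of the return
predictor. It assumes a return stack that holds at most 16 entries (taken
from the "16+ CALLs" figure above) and that a 'ret' with no entry left is
the case where the CPU falls back to the exploitable indirect predictor.
Everything here is a simplification for illustration, not a description of
the actual hardware.

```c
#define RSB_DEPTH 16  /* assumed depth, taken from the "16+ CALLs" remark */

struct rsb_model {
    int valid;  /* number of usable return predictions currently held */
};

/* A call pushes a prediction; beyond RSB_DEPTH the oldest one is lost. */
static void model_call(struct rsb_model *rsb)
{
    if (rsb->valid < RSB_DEPTH)
        rsb->valid++;
}

/* Returns 1 if the ret had a prediction to consume, 0 on underflow. */
static int model_ret(struct rsb_model *rsb)
{
    if (rsb->valid == 0)
        return 0;
    rsb->valid--;
    return 1;
}

/* Nest 'depth' calls, unwind them all, and count underflowing returns. */
static int count_underflows(int depth)
{
    struct rsb_model rsb = { 0 };
    int underflows = 0;

    for (int i = 0; i < depth; i++)
        model_call(&rsb);
    for (int i = 0; i < depth; i++)
        if (!model_ret(&rsb))
            underflows++;
    return underflows;
}
```

In this model a chain of 16 calls unwinds cleanly, while every call beyond
16 produces one return that has no entry left to use.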

The IBRS solution, ugly though it is, did address that. Retpoline
doesn't. There are patches being floated to detect and prevent deep
stacks, and deal with some of the other special cases that bite on SKL,
but those are icky too. And in fact IBRS performance isn't anywhere
near as bad on this generation of CPUs as it is on earlier CPUs
*anyway*, which makes it not quite so insane to *contemplate* using it
as Intel proposed.

That's why my initial idea, as implemented in this RFC patchset, was to
stick with IBRS on Skylake, and use retpoline everywhere else. I'll
give you "garbage patches", but they weren't being "just mindlessly
sent around". If we're going to drop IBRS support and accept the
caveats, then let's do it as a conscious decision having seen what it
would look like, not just drop it quietly because poor Davey is too
scared that Linus might shout at him again. :)

I have seen *hand-wavy* analyses of the Skylake thing that mean I'm not
actually lying awake at night fretting about it, but nothing concrete
that really says it's OK.

If you view retpoline as a performance optimisation, which is how it
first arrived, then it's rather unconventional to say "well, it only
opens a *little* bit of a security hole but it does go nice and fast so
let's do it".

But fine, I'm content with ditching the use of IBRS to protect the
kernel, and I'm not even surprised. There's a *reason* we put it last
in the series, as both the most contentious and most dispensable part.
I'd be *happier* with a coherent analysis showing Skylake is still OK,
but hey-ho, screw Skylake.

The early part of the series adds the new feature bits and detects when
it can turn KPTI off on non-Meltdown-vulnerable Intel CPUs, and also
supports the IBPB barrier that we need to make retpoline complete. That
much I think we definitely *do* want. There have been a bunch of us
working on this behind the scenes; one of us will probably post that
bit in the next day or so.

I think we also want to expose IBRS to VM guests, even if we don't use
it ourselves. Because Windows guests (and RHEL guests; yay!) do use it.

If we can be done with the shouty part, I'd actually quite like to have
a sensible discussion about when, if ever, we do IBPB on context switch
(ptraceability and dumpable have both been suggested) and when, if
ever, we set STIBP in userspace.

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 04/10] x86/mm: Only flush indirect branches when switching into non dumpable process
  2018-01-20 21:06   ` Woodhouse, David
@ 2018-01-22 18:29     ` Tim Chen
  0 siblings, 0 replies; 143+ messages in thread
From: Tim Chen @ 2018-01-22 18:29 UTC (permalink / raw)
  To: Woodhouse, David, KarimAllah Ahmed, linux-kernel
  Cc: Andi Kleen, Andrea Arcangeli, Andy Lutomirski, Arjan van de Ven,
	Ashok Raj, Asit Mallick, Borislav Petkov, Dan Williams,
	Dave Hansen, Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tom Lendacky, kvm, x86

On 01/20/2018 01:06 PM, Woodhouse, David wrote:
> On Sat, 2018-01-20 at 20:22 +0100, KarimAllah Ahmed wrote:
>> From: Tim Chen <tim.c.chen@linux.intel.com>
> 
> I think this is probably From: Andi now rather than From: Tim?

This change is from Andi.


>>  1 file changed, 12 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
>> index 304de7d..f64e80c 100644
>> --- a/arch/x86/mm/tlb.c
>> +++ b/arch/x86/mm/tlb.c
>> @@ -225,8 +225,19 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>>  		 * Avoid user/user BTB poisoning by flushing the branch predictor
>>  		 * when switching between processes. This stops one process from
>>  		 * doing Spectre-v2 attacks on another.
>> +		 *
>> +		 * As an optimization: Flush indirect branches only when
>> +		 * switching into processes that disable dumping.
>> +		 *
>> +		 * This will not flush when switching into kernel threads.
>> +		 * But it would flush when switching into idle and back
>> +		 *
>> +		 * It might be useful to have a one-off cache here
>> +		 * to also not flush the idle case, but we would need some
>> +		 * kind of stable sequence number to remember the previous mm.
>>  		 */
>> -		indirect_branch_prediction_barrier();
>> +		if (tsk && tsk->mm && get_dumpable(tsk->mm) != SUID_DUMP_USER)
>> +			indirect_branch_prediction_barrier();

We could move this close to the cr3 write. The cr3 write provides a
barrier against unwanted speculation past the if check above.
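
The condition in the hunk above can be modelled in user space as follows.
This is only a sketch: struct task_model, struct mm_model and needs_ibpb()
are hypothetical stand-ins for task_struct, mm_struct and the open-coded
check, and the SUID_DUMP_* values mirror the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Values mirror the kernel's SUID_DUMP_* constants. */
enum { SUID_DUMP_DISABLE = 0, SUID_DUMP_USER = 1, SUID_DUMP_ROOT = 2 };

/* Hypothetical stand-ins for mm_struct and task_struct. */
struct mm_model { int dumpable; };
struct task_model { struct mm_model *mm; };

/*
 * Flush only when switching into a task that has an mm (so not a kernel
 * thread) and whose dumpable setting is anything other than SUID_DUMP_USER.
 */
static bool needs_ibpb(const struct task_model *tsk)
{
	if (!tsk || !tsk->mm)
		return false;

	/* Sanity check on the model's own value range. */
	assert(tsk->mm->dumpable >= SUID_DUMP_DISABLE &&
	       tsk->mm->dumpable <= SUID_DUMP_ROOT);

	return tsk->mm->dumpable != SUID_DUMP_USER;
}
```

With this model an ordinary dumpable process does not trigger the barrier,
while a process that has disabled dumping does.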

Tim

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 02/10] x86/kvm: Add IBPB support
  2018-01-20 19:22 ` [RFC 02/10] x86/kvm: Add IBPB support KarimAllah Ahmed
  2018-01-20 20:18   ` Woodhouse, David
@ 2018-01-22 18:56   ` Jim Mattson
  2018-01-22 19:31     ` Jim Mattson
  1 sibling, 1 reply; 143+ messages in thread
From: Jim Mattson @ 2018-01-22 18:56 UTC (permalink / raw)
  To: KarimAllah Ahmed
  Cc: LKML, Andi Kleen, Andrea Arcangeli, Andy Lutomirski,
	Arjan van de Ven, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, David Woodhouse, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Linus Torvalds,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm list,
	the arch/x86 maintainers, Arjan Van De Ven

On Sat, Jan 20, 2018 at 11:22 AM, KarimAllah Ahmed <karahmed@amazon.de> wrote:
> From: Ashok Raj <ashok.raj@intel.com>
>
> Add MSR passthrough for MSR_IA32_PRED_CMD and place branch predictor
> barriers on switching between VMs to avoid inter-VM Spectre-v2 attacks.
>
> [peterz: rebase and changelog rewrite]
> [dwmw2: fixes]
> [karahmed: - vmx: expose PRED_CMD whenever it is available
>            - svm: only pass through IBPB if it is available]
>
> Cc: Asit Mallick <asit.k.mallick@intel.com>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com>
> Cc: Tim Chen <tim.c.chen@linux.intel.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Andi Kleen <ak@linux.intel.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Jun Nakajima <jun.nakajima@intel.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Greg KH <gregkh@linuxfoundation.org>
> Cc: David Woodhouse <dwmw@amazon.co.uk>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: http://lkml.kernel.org/r/1515720739-43819-6-git-send-email-ashok.raj@intel.com
>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
> ---
>  arch/x86/kvm/svm.c | 14 ++++++++++++++
>  arch/x86/kvm/vmx.c |  4 ++++
>  2 files changed, 18 insertions(+)
>
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 2744b973..cfdb9ab 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -529,6 +529,7 @@ struct svm_cpu_data {
>         struct kvm_ldttss_desc *tss_desc;
>
>         struct page *save_area;
> +       struct vmcb *current_vmcb;
>  };
>
>  static DEFINE_PER_CPU(struct svm_cpu_data *, svm_data);
> @@ -918,6 +919,9 @@ static void svm_vcpu_init_msrpm(u32 *msrpm)
>
>                 set_msr_interception(msrpm, direct_access_msrs[i].index, 1, 1);
>         }
> +
> +       if (boot_cpu_has(X86_FEATURE_AMD_PRED_CMD))
> +               set_msr_interception(msrpm, MSR_IA32_PRED_CMD, 1, 1);
>  }
>
>  static void add_msr_offset(u32 offset)
> @@ -1706,11 +1710,17 @@ static void svm_free_vcpu(struct kvm_vcpu *vcpu)
>         __free_pages(virt_to_page(svm->nested.msrpm), MSRPM_ALLOC_ORDER);
>         kvm_vcpu_uninit(vcpu);
>         kmem_cache_free(kvm_vcpu_cache, svm);
> +       /*
> +        * The vmcb page can be recycled, causing a false negative in
> +        * svm_vcpu_load(). So do a full IBPB now.
> +        */
> +       indirect_branch_prediction_barrier();
>  }
>
>  static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>  {
>         struct vcpu_svm *svm = to_svm(vcpu);
> +       struct svm_cpu_data *sd = per_cpu(svm_data, cpu);
>         int i;
>
>         if (unlikely(cpu != vcpu->cpu)) {
> @@ -1739,6 +1749,10 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>         if (static_cpu_has(X86_FEATURE_RDTSCP))
>                 wrmsrl(MSR_TSC_AUX, svm->tsc_aux);
>
> +       if (sd->current_vmcb != svm->vmcb) {
> +               sd->current_vmcb = svm->vmcb;
> +               indirect_branch_prediction_barrier();
> +       }
>         avic_vcpu_load(vcpu, cpu);
>  }
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index d1e25db..3b64de2 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -2279,6 +2279,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>         if (per_cpu(current_vmcs, cpu) != vmx->loaded_vmcs->vmcs) {
>                 per_cpu(current_vmcs, cpu) = vmx->loaded_vmcs->vmcs;
>                 vmcs_load(vmx->loaded_vmcs->vmcs);
> +               indirect_branch_prediction_barrier();
>         }
>
>         if (!already_loaded) {
> @@ -6791,6 +6792,9 @@ static __init int hardware_setup(void)
>                 kvm_tsc_scaling_ratio_frac_bits = 48;
>         }
>
> +       if (boot_cpu_has(X86_FEATURE_SPEC_CTRL))

I think the condition here should be:

if (guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL))

__do_cpuid_ent should pass through X86_FEATURE_SPEC_CTRL from the
host, but userspace should be allowed to clear it.
(Userspace should not be allowed to set it if the host doesn't support it.)
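
The policy described here can be sketched as a pair of bitmask checks. This
is a user-space model under stated assumptions, not KVM code:
FEATURE_SPEC_CTRL, guest_features_valid() and passthrough_pred_cmd() are
hypothetical names, and the bit position is illustrative only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model feature bit; the position is illustrative only. */
#define FEATURE_SPEC_CTRL (UINT32_C(1) << 26)

/*
 * Userspace may clear features the host offers, but must not set
 * a feature the host does not support.
 */
static bool guest_features_valid(uint32_t host, uint32_t requested)
{
	return (requested & ~host) == 0;
}

/*
 * Per-guest decision: pass the MSR through only if this guest's CPUID
 * advertises the feature, rather than keying off the host alone.
 */
static bool passthrough_pred_cmd(uint32_t host, uint32_t guest)
{
	assert(guest_features_valid(host, guest));
	return (guest & FEATURE_SPEC_CTRL) != 0;
}
```

The difference from the boot_cpu_has() check is that a guest whose userspace
cleared the bit keeps the intercept even on a capable host.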

> +               vmx_disable_intercept_for_msr(MSR_IA32_PRED_CMD, false);
> +
>         vmx_disable_intercept_for_msr(MSR_FS_BASE, false);
>         vmx_disable_intercept_for_msr(MSR_GS_BASE, false);
>         vmx_disable_intercept_for_msr(MSR_KERNEL_GS_BASE, true);
> --
> 2.7.4
>

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 02/10] x86/kvm: Add IBPB support
  2018-01-22 18:56   ` Jim Mattson
@ 2018-01-22 19:31     ` Jim Mattson
  0 siblings, 0 replies; 143+ messages in thread
From: Jim Mattson @ 2018-01-22 19:31 UTC (permalink / raw)
  To: KarimAllah Ahmed
  Cc: LKML, Andi Kleen, Andrea Arcangeli, Andy Lutomirski,
	Arjan van de Ven, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, David Woodhouse, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Linus Torvalds,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm list,
	the arch/x86 maintainers, Arjan Van De Ven

Oh, but to do that properly, you need one of the per-vCPU bitmap
implementations that Paolo and I have independently posted.

On Mon, Jan 22, 2018 at 10:56 AM, Jim Mattson <jmattson@google.com> wrote:
> On Sat, Jan 20, 2018 at 11:22 AM, KarimAllah Ahmed <karahmed@amazon.de> wrote:
>> From: Ashok Raj <ashok.raj@intel.com>
>>
>> Add MSR passthrough for MSR_IA32_PRED_CMD and place branch predictor
>> barriers on switching between VMs to avoid inter-VM Spectre-v2 attacks.
>>
>> [peterz: rebase and changelog rewrite]
>> [dwmw2: fixes]
>> [karahmed: - vmx: expose PRED_CMD whenever it is available
>>            - svm: only pass through IBPB if it is available]
>>
>> Cc: Asit Mallick <asit.k.mallick@intel.com>
>> Cc: Dave Hansen <dave.hansen@intel.com>
>> Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com>
>> Cc: Tim Chen <tim.c.chen@linux.intel.com>
>> Cc: Linus Torvalds <torvalds@linux-foundation.org>
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Cc: Andi Kleen <ak@linux.intel.com>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Cc: Jun Nakajima <jun.nakajima@intel.com>
>> Cc: Andy Lutomirski <luto@kernel.org>
>> Cc: Greg KH <gregkh@linuxfoundation.org>
>> Cc: David Woodhouse <dwmw@amazon.co.uk>
>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Link: http://lkml.kernel.org/r/1515720739-43819-6-git-send-email-ashok.raj@intel.com
>>
>> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
>> Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
>> ---
>>  arch/x86/kvm/svm.c | 14 ++++++++++++++
>>  arch/x86/kvm/vmx.c |  4 ++++
>>  2 files changed, 18 insertions(+)
>>
>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>> index 2744b973..cfdb9ab 100644
>> --- a/arch/x86/kvm/svm.c
>> +++ b/arch/x86/kvm/svm.c
>> @@ -529,6 +529,7 @@ struct svm_cpu_data {
>>         struct kvm_ldttss_desc *tss_desc;
>>
>>         struct page *save_area;
>> +       struct vmcb *current_vmcb;
>>  };
>>
>>  static DEFINE_PER_CPU(struct svm_cpu_data *, svm_data);
>> @@ -918,6 +919,9 @@ static void svm_vcpu_init_msrpm(u32 *msrpm)
>>
>>                 set_msr_interception(msrpm, direct_access_msrs[i].index, 1, 1);
>>         }
>> +
>> +       if (boot_cpu_has(X86_FEATURE_AMD_PRED_CMD))
>> +               set_msr_interception(msrpm, MSR_IA32_PRED_CMD, 1, 1);
>>  }
>>
>>  static void add_msr_offset(u32 offset)
>> @@ -1706,11 +1710,17 @@ static void svm_free_vcpu(struct kvm_vcpu *vcpu)
>>         __free_pages(virt_to_page(svm->nested.msrpm), MSRPM_ALLOC_ORDER);
>>         kvm_vcpu_uninit(vcpu);
>>         kmem_cache_free(kvm_vcpu_cache, svm);
>> +       /*
>> +        * The vmcb page can be recycled, causing a false negative in
>> +        * svm_vcpu_load(). So do a full IBPB now.
>> +        */
>> +       indirect_branch_prediction_barrier();
>>  }
>>
>>  static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>  {
>>         struct vcpu_svm *svm = to_svm(vcpu);
>> +       struct svm_cpu_data *sd = per_cpu(svm_data, cpu);
>>         int i;
>>
>>         if (unlikely(cpu != vcpu->cpu)) {
>> @@ -1739,6 +1749,10 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>         if (static_cpu_has(X86_FEATURE_RDTSCP))
>>                 wrmsrl(MSR_TSC_AUX, svm->tsc_aux);
>>
>> +       if (sd->current_vmcb != svm->vmcb) {
>> +               sd->current_vmcb = svm->vmcb;
>> +               indirect_branch_prediction_barrier();
>> +       }
>>         avic_vcpu_load(vcpu, cpu);
>>  }
>>
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index d1e25db..3b64de2 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -2279,6 +2279,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>         if (per_cpu(current_vmcs, cpu) != vmx->loaded_vmcs->vmcs) {
>>                 per_cpu(current_vmcs, cpu) = vmx->loaded_vmcs->vmcs;
>>                 vmcs_load(vmx->loaded_vmcs->vmcs);
>> +               indirect_branch_prediction_barrier();
>>         }
>>
>>         if (!already_loaded) {
>> @@ -6791,6 +6792,9 @@ static __init int hardware_setup(void)
>>                 kvm_tsc_scaling_ratio_frac_bits = 48;
>>         }
>>
>> +       if (boot_cpu_has(X86_FEATURE_SPEC_CTRL))
>
> I think the condition here should be:
>
> if (guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL))
>
> __do_cpuid_ent should pass through X86_FEATURE_SPEC_CTRL from the
> host, but userspace should be allowed to clear it.
> (Userspace should not be allowed to set it if the host doesn't support it.)
>
>> +               vmx_disable_intercept_for_msr(MSR_IA32_PRED_CMD, false);
>> +
>>         vmx_disable_intercept_for_msr(MSR_FS_BASE, false);
>>         vmx_disable_intercept_for_msr(MSR_GS_BASE, false);
>>         vmx_disable_intercept_for_msr(MSR_KERNEL_GS_BASE, true);
>> --
>> 2.7.4
>>

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 00/10] Speculation Control feature support
  2018-01-21 14:02 ` [RFC 00/10] Speculation Control feature support Konrad Rzeszutek Wilk
@ 2018-01-22 21:27   ` David Woodhouse
  0 siblings, 0 replies; 143+ messages in thread
From: David Woodhouse @ 2018-01-22 21:27 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, KarimAllah Ahmed, Mihai Carabas
  Cc: linux-kernel, Andi Kleen, Andrea Arcangeli, Andy Lutomirski,
	Arjan van de Ven, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, Greg Kroah-Hartman, H . Peter Anvin,
	Ingo Molnar, Janakarajan Natarajan, Joerg Roedel, Jun Nakajima,
	Laura Abbott, Linus Torvalds, Masami Hiramatsu, Paolo Bonzini,
	Peter Zijlstra, Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm, x86

On Sun, 2018-01-21 at 09:02 -0500, Konrad Rzeszutek Wilk wrote:
> 
> 
> Depends on what we expose to the guest. That is, if the guest is not supposed to have this exposed
> (say CPUID bit 27 is not exposed), then trap on the MSR (and give a #GP)?

I think for SPEC_CTRL we want to trap on the MSR anyway. Saving and
restoring it is *bizarrely* slow, apparently, even when it's zero.

I think we want to trap on the first access, and only then disable the
intercept and enable the save/restore. That way, sane guests that only
ever use retpoline and IBPB (which is write-only and doesn't need
saving) won't ever take the performance hit.
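
A minimal user-space sketch of that trap-on-first-access scheme, assuming a
hypothetical struct vcpu_model with one flag for the intercept and one for
save/restore. None of these names come from KVM; they only illustrate the
state transition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-vCPU state for the model. */
struct vcpu_model {
	bool intercept_spec_ctrl;	/* MSR accesses trap to the host */
	bool save_restore;		/* MSR is saved/restored on switches */
	uint64_t spec_ctrl;		/* guest's value of the MSR */
	unsigned int traps;		/* number of intercepted writes */
};

static void vcpu_model_init(struct vcpu_model *v)
{
	v->intercept_spec_ctrl = true;	/* start with the intercept enabled */
	v->save_restore = false;	/* and no save/restore cost */
	v->spec_ctrl = 0;
	v->traps = 0;
}

/* Guest write to the MSR: only the first one traps. */
static void guest_write_spec_ctrl(struct vcpu_model *v, uint64_t val)
{
	if (v->intercept_spec_ctrl) {
		v->traps++;
		v->intercept_spec_ctrl = false;	/* pass through from now on */
		v->save_restore = true;		/* and start paying for save/restore */
	}
	v->spec_ctrl = val;
}

/* Whether a guest/host switch has to save and restore the MSR. */
static bool switch_needs_spec_ctrl_save(const struct vcpu_model *v)
{
	/* Invariant: pass-through and save/restore are enabled together. */
	assert(v->save_restore == !v->intercept_spec_ctrl);
	return v->save_restore;
}
```

A guest that never touches the MSR stays in the initial state and never pays
for the save/restore; a guest that does touch it takes one trap in total.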

It's going to want this: https://patchwork.kernel.org/patch/10167667/

> Mihai (CC-ed) is working on this, when ready he can post a patch against this tree?

That'd be useful; thanks. The latest (including the bits on top that we
probably aren't going to submit, with saner bits near the beginning)
should always be at
http://git.infradead.org/linux-retpoline.git/shortlog/refs/heads/ibpb

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation
  2018-01-22 16:27             ` David Woodhouse
@ 2018-01-23  7:29               ` Ingo Molnar
  2018-01-23  7:53                 ` Ingo Molnar
  2018-01-24  0:05                 ` Andi Kleen
  0 siblings, 2 replies; 143+ messages in thread
From: Ingo Molnar @ 2018-01-23  7:29 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Linus Torvalds, KarimAllah Ahmed, Linux Kernel Mailing List,
	Andi Kleen, Andrea Arcangeli, Andy Lutomirski, Arjan van de Ven,
	Ashok Raj, Asit Mallick, Borislav Petkov, Dan Williams,
	Dave Hansen, Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, KVM list,
	the arch/x86 maintainers, Arjan Van De Ven

* David Woodhouse <dwmw2@infradead.org> wrote:

> But wait, why did I say "mostly"? Well, not everyone has a retpoline
> compiler yet... but OK, screw them; they need to update.
> 
> Then there's Skylake, and that generation of CPU cores. For complicated
> reasons they actually end up being vulnerable not just on indirect
> branches, but also on a 'ret' in some circumstances (such as 16+ CALLs
> in a deep chain).
> 
> The IBRS solution, ugly though it is, did address that. Retpoline
> doesn't. There are patches being floated to detect and prevent deep
> stacks, and deal with some of the other special cases that bite on SKL,
> but those are icky too. And in fact IBRS performance isn't anywhere
> near as bad on this generation of CPUs as it is on earlier CPUs
> *anyway*, which makes it not quite so insane to *contemplate* using it
> as Intel proposed.

There's another possible method to avoid deep stacks on Skylake, without compiler 
support:

  - Use the existing mcount based function tracing live patching machinery
    (CONFIG_FUNCTION_TRACER=y) to install a _very_ fast and simple stack depth 
    tracking tracer which would issue a retpoline when stack depth crosses 
    boundaries of ~16 entries.

The overhead of that would _still_ very likely be much lower than that of an MSR 
write costing hundreds (or thousands) of cycles at every kernel entry (syscall 
entry, IRQ entry, etc.).

Note the huge number of advantages:

 - All distro kernels already enable the mcount based patching options, so there's
   literally zero overhead to anything except SkyLake.

 - It is fully kernel patching based and can be activated on Skylake only

 - It doesn't require any microcode updates, so it will work on all existing CPUs
   with no firmware or microcode modifications

 - It doesn't require any compiler updates

 - SkyLake performance is very likely to be much less fragile than relying on a 
   hastily deployed microcode hack

 - The "SkyLake stack depth tracer" can be tested on other CPUs as well in debug 
   builds, broadening the testing base

 - The tracer is very obviously simple and reviewable, and we can forget about it
   in the far future.

 - It's much more backportable to older kernels: should there be a new class of
   exploits then this machinery could be updated to cover that too - while 
   upgrades to newer kernels would give the higher-performance solution.

Yes, there are some practical complications like always enabling 
CONFIG_FUNCTION_TRACER=y on x86, plus the ftrace interaction has to be sorted out, 
but in practice it's enabled on all major distros anyway, due to ftrace.

Is there any reason why this wouldn't work?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation
  2018-01-23  7:29               ` Ingo Molnar
@ 2018-01-23  7:53                 ` Ingo Molnar
  2018-01-23  9:27                   ` Ingo Molnar
  2018-01-23  9:30                   ` David Woodhouse
  2018-01-24  0:05                 ` Andi Kleen
  1 sibling, 2 replies; 143+ messages in thread
From: Ingo Molnar @ 2018-01-23  7:53 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Linus Torvalds, KarimAllah Ahmed, Linux Kernel Mailing List,
	Andi Kleen, Andrea Arcangeli, Andy Lutomirski, Arjan van de Ven,
	Ashok Raj, Asit Mallick, Borislav Petkov, Dan Williams,
	Dave Hansen, Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, KVM list,
	the arch/x86 maintainers, Arjan Van De Ven


* Ingo Molnar <mingo@kernel.org> wrote:

> * David Woodhouse <dwmw2@infradead.org> wrote:
> 
> > But wait, why did I say "mostly"? Well, not everyone has a retpoline
> > compiler yet... but OK, screw them; they need to update.
> > 
> > Then there's Skylake, and that generation of CPU cores. For complicated
> > reasons they actually end up being vulnerable not just on indirect
> > branches, but also on a 'ret' in some circumstances (such as 16+ CALLs
> > in a deep chain).
> > 
> > The IBRS solution, ugly though it is, did address that. Retpoline
> > doesn't. There are patches being floated to detect and prevent deep
> > stacks, and deal with some of the other special cases that bite on SKL,
> > but those are icky too. And in fact IBRS performance isn't anywhere
> > near as bad on this generation of CPUs as it is on earlier CPUs
> > *anyway*, which makes it not quite so insane to *contemplate* using it
> > as Intel proposed.
> 
> There's another possible method to avoid deep stacks on Skylake, without compiler 
> support:
> 
>   - Use the existing mcount based function tracing live patching machinery
>     (CONFIG_FUNCTION_TRACER=y) to install a _very_ fast and simple stack depth 
>     tracking tracer which would issue a retpoline when stack depth crosses 
>     boundaries of ~16 entries.

The patch below demonstrates the principle: it forcibly enables dynamic ftrace 
patching (CONFIG_DYNAMIC_FTRACE=y et al) and turns mcount/__fentry__ into a RET:

  ffffffff81a01a40 <__fentry__>:
  ffffffff81a01a40:       c3                      retq   

This would have to be extended with (very simple) call stack depth tracking (just 
3 more instructions would do in the fast path I believe) and a suitable SkyLake 
workaround (and also has to play nice with the ftrace callbacks).

On non-SkyLake the overhead would be 0 cycles.

On SkyLake this would add an overhead of maybe 2-3 cycles per function call and 
obviously all this code and data would be very cache hot. Given that the average 
number of function calls per system call is around a dozen, this would be _much_ 
faster than any microcode/MSR based approach.
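
The comparison can be written out as simple arithmetic. The sketch below
takes the dozen calls and 3 cycles per call from the estimate above; the two
MSR writes per system call (one on entry, one on exit) and the 100 cycles per
write are assumptions, the latter being the low end of "hundreds". The
function names are hypothetical.

```c
#include <assert.h>
#include <limits.h>

/* Per-syscall cost of depth tracking: one small cost per function call. */
static unsigned long tracking_cost(unsigned long calls,
				   unsigned long cycles_per_call)
{
	/* Guard against overflow in the multiplication. */
	assert(calls == 0 || cycles_per_call <= ULONG_MAX / calls);
	return calls * cycles_per_call;
}

/* Per-syscall cost of toggling an MSR a fixed number of times. */
static unsigned long msr_cost(unsigned long writes,
			      unsigned long cycles_per_write)
{
	assert(writes == 0 || cycles_per_write <= ULONG_MAX / writes);
	return writes * cycles_per_write;
}

/* Non-zero when depth tracking is the cheaper of the two. */
static int tracking_is_cheaper(unsigned long calls,
			       unsigned long cycles_per_call,
			       unsigned long writes,
			       unsigned long cycles_per_write)
{
	return tracking_cost(calls, cycles_per_call) <
	       msr_cost(writes, cycles_per_write);
}
```

With those inputs, tracking costs 12 * 3 = 36 cycles per system call against
2 * 100 = 200 cycles for the MSR writes, and the gap widens if an MSR write
costs more than the assumed 100 cycles.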

Is there a testcase for the SkyLake 16-deep-call-stack problem that I could run? 
Is there a description of the exact speculative execution vulnerability that has 
to be addressed to begin with?

If this approach is workable I'd much prefer it to any MSR writes in the syscall 
entry path not just because it's fast enough in practice to not be turned off by 
everyone, but also because everyone would agree that per function call overhead 
needs to go away on new CPUs. Both deployment and backporting are also _much_ more 
flexible, simpler, faster and more complete than microcode/firmware or compiler 
based solutions.

Assuming the vulnerability can be addressed via this route that is, which is a big 
assumption!

Thanks,

	Ingo

 arch/x86/Kconfig            | 3 +++
 arch/x86/kernel/ftrace_64.S | 1 +
 2 files changed, 4 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 423e4b64e683..df471538a79c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -133,6 +133,8 @@ config X86
 	select HAVE_DMA_CONTIGUOUS
 	select HAVE_DYNAMIC_FTRACE
 	select HAVE_DYNAMIC_FTRACE_WITH_REGS
+	select DYNAMIC_FTRACE
+	select DYNAMIC_FTRACE_WITH_REGS
 	select HAVE_EBPF_JIT			if X86_64
 	select HAVE_EFFICIENT_UNALIGNED_ACCESS
 	select HAVE_EXIT_THREAD
@@ -140,6 +142,7 @@ config X86
 	select HAVE_FTRACE_MCOUNT_RECORD
 	select HAVE_FUNCTION_GRAPH_TRACER
 	select HAVE_FUNCTION_TRACER
+	select FUNCTION_TRACER
 	select HAVE_GCC_PLUGINS
 	select HAVE_HW_BREAKPOINT
 	select HAVE_IDE
diff --git a/arch/x86/kernel/ftrace_64.S b/arch/x86/kernel/ftrace_64.S
index 7cb8ba08beb9..1e219e0f2887 100644
--- a/arch/x86/kernel/ftrace_64.S
+++ b/arch/x86/kernel/ftrace_64.S
@@ -19,6 +19,7 @@ EXPORT_SYMBOL(__fentry__)
 # define function_hook	mcount
 EXPORT_SYMBOL(mcount)
 #endif
+	ret
 
 /* All cases save the original rbp (8 bytes) */
 #ifdef CONFIG_FRAME_POINTER

^ permalink raw reply related	[flat|nested] 143+ messages in thread

* Re: [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation
  2018-01-23  7:53                 ` Ingo Molnar
@ 2018-01-23  9:27                   ` Ingo Molnar
  2018-01-23  9:37                     ` David Woodhouse
  2018-01-23 15:01                     ` Dave Hansen
  2018-01-23  9:30                   ` David Woodhouse
  1 sibling, 2 replies; 143+ messages in thread
From: Ingo Molnar @ 2018-01-23  9:27 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Linus Torvalds, KarimAllah Ahmed, Linux Kernel Mailing List,
	Andi Kleen, Andrea Arcangeli, Andy Lutomirski, Arjan van de Ven,
	Ashok Raj, Asit Mallick, Borislav Petkov, Dan Williams,
	Dave Hansen, Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, KVM list,
	the arch/x86 maintainers, Arjan Van De Ven


* Ingo Molnar <mingo@kernel.org> wrote:

> Is there a testcase for the SkyLake 16-deep-call-stack problem that I could run? 
> Is there a description of the exact speculative execution vulnerability that has 
> to be addressed to begin with?

Ok, so for now I'm assuming that this is the 16 entries return-stack-buffer 
underflow condition where SkyLake falls back to the branch predictor (while other 
CPUs wrap the buffer).
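
That assumption can be modelled in a few lines. The sketch below is a toy
user-space model, not a description of real hardware: struct rsb_model and
the helper names are hypothetical, and each RET that finds the buffer empty
is counted as one that would fall back to the branch predictor on SkyLake.

```c
#include <assert.h>

#define RSB_ENTRIES 16	/* size assumed in the discussion above */

/* Toy model of a return stack buffer. */
struct rsb_model {
	unsigned int valid;		/* entries currently usable */
	unsigned int underflows;	/* RETs that found the RSB empty */
};

/* CALL pushes an entry; past 16 the oldest one is overwritten. */
static void rsb_call(struct rsb_model *r)
{
	if (r->valid < RSB_ENTRIES)
		r->valid++;
}

/* RET pops an entry, or underflows if none is left. */
static void rsb_ret(struct rsb_model *r)
{
	if (r->valid > 0)
		r->valid--;
	else
		r->underflows++;
}

/* Run 'depth' nested CALLs followed by 'depth' RETs, count underflows. */
static unsigned int underflows_for_chain(unsigned int depth)
{
	struct rsb_model r = { 0, 0 };
	unsigned int i;

	for (i = 0; i < depth; i++)
		rsb_call(&r);
	for (i = 0; i < depth; i++)
		rsb_ret(&r);

	assert(r.valid == 0);
	return r.underflows;
}
```

In this model a chain of up to 16 calls never underflows, while a chain of 20
calls leaves the last 4 RETs without an RSB entry.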

> If this approach is workable I'd much prefer it to any MSR writes in the syscall 
> entry path not just because it's fast enough in practice to not be turned off by 
> everyone, but also because everyone would agree that per function call overhead 
> needs to go away on new CPUs. Both deployment and backporting are also _much_ more 
> flexible, simpler, faster and more complete than microcode/firmware or compiler 
> based solutions.
> 
> Assuming the vulnerability can be addressed via this route that is, which is a big 
> assumption!

So I talked this over with PeterZ, and I think it's all doable:

 - the CALL __fentry__ callbacks maintain the depth tracking (on the kernel 
   stack, fast to access), and issue an "RSB-stuffing sequence" when depth reaches
   16 entries.

 - "the RSB-stuffing sequence" is a return trampoline that pushes a CALL on the 
   stack which is executed on the RET.

 - All asynchronous contexts (IRQs, NMIs, etc.) stuff the RSB before IRET. (The 
   tracking could probably be made IRQ and maybe even NMI safe, but the worst-case 
   nesting scenarios make my head ache.)

I.e. IBRS can be mostly replaced with a kernel based solution that is better than 
IBRS and which does not negatively impact any other non-SkyLake CPUs or general 
code quality.

I.e. a full upstream Spectre solution.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation
  2018-01-23  7:53                 ` Ingo Molnar
  2018-01-23  9:27                   ` Ingo Molnar
@ 2018-01-23  9:30                   ` David Woodhouse
  2018-01-23 10:15                     ` Ingo Molnar
                                       ` (2 more replies)
  1 sibling, 3 replies; 143+ messages in thread
From: David Woodhouse @ 2018-01-23  9:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, KarimAllah Ahmed, Linux Kernel Mailing List,
	Andi Kleen, Andrea Arcangeli, Andy Lutomirski, Arjan van de Ven,
	Ashok Raj, Asit Mallick, Borislav Petkov, Dan Williams,
	Dave Hansen, Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, KVM list,
	the arch/x86 maintainers, Arjan Van De Ven

On Tue, 2018-01-23 at 08:53 +0100, Ingo Molnar wrote:
> 
> The patch below demonstrates the principle, it forcibly enables dynamic ftrace 
> patching (CONFIG_DYNAMIC_FTRACE=y et al) and turns mcount/__fentry__ into a RET:
> 
>   ffffffff81a01a40 <__fentry__>:
>   ffffffff81a01a40:       c3                      retq   
> 
> This would have to be extended with (very simple) call stack depth tracking (just 
> 3 more instructions would do in the fast path I believe) and a suitable SkyLake 
> workaround (and also has to play nice with the ftrace callbacks).
> 
> On non-SkyLake the overhead would be 0 cycles.

The overhead of forcing CONFIG_DYNAMIC_FTRACE=y is precisely zero
cycles? That seems a little optimistic. ;)

I'll grant you if it goes straight to a 'ret' it isn't *that* high
though.

> On SkyLake this would add an overhead of maybe 2-3 cycles per function call and 
> obviously all this code and data would be very cache hot. Given that the average 
> number of function calls per system call is around a dozen, this would be _much_ 
> faster than any microcode/MSR based approach.

That's kind of neat, except you don't want it at the top of the
function; you want it at the bottom.

If you could hijack the *return* site, then you could check for
underflow and stuff the RSB right there. But in __fentry__ there's not
a lot you can do other than complain that something bad is going to
happen in the future. You know that a string of 16+ rets is going to
happen, but you've got no gadget in *there* to deal with it when it
does.

HJ did have patches to turn 'ret' into a form of retpoline, which I
don't think ever even got performance-tested. They'd have forced a
mispredict on *every* ret. A cheaper option might be to turn ret into a
'jmp skylake_ret_hack'. Which on pre-SKL will be a bare ret, and SKL+
can do the counting (in conjunction with a 'per_cpu(call_depth)++' in
__fentry__) and stuff the RSB before actually returning, when
appropriate.
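
A toy user-space model of that counting scheme, under stated assumptions: the
entry hook stands in for the increment in __fentry__, the return hook stands
in for 'jmp skylake_ret_hack', and the software counter is a saturating
estimate of the RSB fill level rather than a plain call depth. All names are
hypothetical.

```c
#include <assert.h>

#define RSB_ENTRIES 16

struct cpu_model {
	unsigned int rsb_valid;		/* hardware: usable RSB entries */
	unsigned int tracked_fill;	/* software estimate of rsb_valid */
	unsigned int stuffs;		/* times the return hook refilled the RSB */
	unsigned int btb_fallbacks;	/* RETs that found the RSB empty */
};

/* Entry hook: the CALL pushes an RSB entry, the counter follows it. */
static void model_call(struct cpu_model *c)
{
	if (c->rsb_valid < RSB_ENTRIES)
		c->rsb_valid++;
	if (c->tracked_fill < RSB_ENTRIES)
		c->tracked_fill++;
}

/* Return hook: refill the RSB first if the estimate says it is empty. */
static void model_ret(struct cpu_model *c)
{
	/* In this model the estimate always matches the hardware. */
	assert(c->tracked_fill == c->rsb_valid);

	if (c->tracked_fill == 0) {
		c->rsb_valid = RSB_ENTRIES;
		c->tracked_fill = RSB_ENTRIES;
		c->stuffs++;
	}

	if (c->rsb_valid == 0)
		c->btb_fallbacks++;	/* would be predicted from the BTB */
	else
		c->rsb_valid--;
	c->tracked_fill--;
}

/* Run 'depth' nested calls followed by 'depth' returns from a clean state. */
static void model_chain(struct cpu_model *c, unsigned int depth)
{
	unsigned int i;

	c->rsb_valid = 0;
	c->tracked_fill = 0;
	c->stuffs = 0;
	c->btb_fallbacks = 0;

	for (i = 0; i < depth; i++)
		model_call(c);
	for (i = 0; i < depth; i++)
		model_ret(c);
}
```

In this model a 20-deep chain needs one refill and a 40-deep chain needs two,
and no RET ever finds the RSB empty. It ignores everything that makes the
real thing hard: interrupts, context switches and other events that empty
the RSB behind the counter's back.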

By the time you've made it work properly, I suspect we're approaching
the barf-factor of IBRS, for a less complete solution.

> Is there a testcase for the SkyLake 16-deep-call-stack problem that I could run? 

Andi's been experimenting at 
https://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc.git/log/?h=spec/deep-chain-3

> Is there a description of the exact speculative execution vulnerability that has 
> to be addressed to begin with?

"It takes predictions from the generic branch target buffer when the
RSB underflows".

IBRS filters what can come from the BTB, and resolves the problem that
way. Retpoline avoids the indirect branches that on *earlier* CPUs were
the only things that would use the offending predictions. But on SKL,
now 'ret' is one of the problematic instructions too. Fun! :)

> If this approach is workable I'd much prefer it to any MSR writes in the syscall 
> entry path not just because it's fast enough in practice to not be turned off by 
> everyone, but also because everyone would agree that per function call overhead 
> needs to go away on new CPUs. Both deployment and backporting are also _much_ more 
> flexible, simpler, faster and more complete than microcode/firmware or compiler 
> based solutions.
> 
> Assuming the vulnerability can be addressed via this route that is, which is a big 
> assumption!

I think it's close. There are some other cases which empty the RSB,
like sleeping and loading microcode, which can happily be special-
cased. Andi's rounded up many of the remaining details already at 
https://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc.git/log/?h=spec/skl-rsb-3

And there's SMI, which is a pain, but I think Linus is right that we can
possibly just stick our fingers in our ears and pretend we didn't hear
about that one, as it's likely to be hard to trigger (famous last
words).

On the whole though, I think you can see why we're keeping IBRS around
for now, sent out purely as an RFC and rebased on top of the stuff
we're *actually* sending to Linus for inclusion.

When we have a clear idea of what we're doing for Skylake, it'll be
useful to have a proper comparison of the security, the performance and
the "ick" factor of whatever we come up with, vs. IBRS.

Right now the plan is just "screw Skylake"; we'll just forget it's a
special snowflake and treat it like everything else, except for a bit
of extra RSB-stuffing on context switch (since we had to add that for
!SMEP anyway). And that's not *entirely* unreasonable but as I said I'd
*really* like to have a decent analysis of the implications of that,
not just some hand-wavy "nah, it'll be fine".


[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation
  2018-01-23  9:27                   ` Ingo Molnar
@ 2018-01-23  9:37                     ` David Woodhouse
  2018-01-23 15:01                     ` Dave Hansen
  1 sibling, 0 replies; 143+ messages in thread
From: David Woodhouse @ 2018-01-23  9:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, KarimAllah Ahmed, Linux Kernel Mailing List,
	Andi Kleen, Andrea Arcangeli, Andy Lutomirski, Arjan van de Ven,
	Ashok Raj, Asit Mallick, Borislav Petkov, Dan Williams,
	Dave Hansen, Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, KVM list,
	the arch/x86 maintainers, Arjan Van De Ven

[-- Attachment #1: Type: text/plain, Size: 2561 bytes --]

On Tue, 2018-01-23 at 10:27 +0100, Ingo Molnar wrote:
> * Ingo Molnar <mingo@kernel.org> wrote:
> 
> > 
> > Is there a testcase for the SkyLake 16-deep-call-stack problem that I could run? 
> > Is there a description of the exact speculative execution vulnerability that has 
> > to be addressed to begin with?
>
> Ok, so for now I'm assuming that this is the 16 entries return-stack-buffer 
> underflow condition where SkyLake falls back to the branch predictor (while other 
> CPUs wrap the buffer).

Yep.

> > 
> > If this approach is workable I'd much prefer it to any MSR writes in the syscall 
> > entry path not just because it's fast enough in practice to not be turned off by 
> > everyone, but also because everyone would agree that per function call overhead 
> > needs to go away on new CPUs. Both deployment and backporting are also _much_ more 
> > flexible, simpler, faster and more complete than microcode/firmware or compiler 
> > based solutions.
> > 
> > Assuming the vulnerability can be addressed via this route that is, which is a big 
> > assumption!
>
> So I talked this over with PeterZ, and I think it's all doable:
> 
>  - the CALL __fentry__ callbacks maintain the depth tracking (on the kernel 
>    stack, fast to access), and issue an "RSB-stuffing sequence" when depth reaches
>    16 entries.
> 
>  - "the RSB-stuffing sequence" is a return trampoline that pushes a CALL on the 
>    stack which is executed on the RET.

That's neat. We'll want to make sure the unwinder can cope but hey,
Peter *loves* hacking objtool, right? :)

>  - All asynchronous contexts (IRQs, NMIs, etc.) stuff the RSB before IRET. (The 
>    tracking could probably be made IRQ and maybe even NMI safe, but the worst-case 
>    nesting scenarios make my head ache.)
> 
> I.e. IBRS can be mostly replaced with a kernel based solution that is better than 
> IBRS and which does not negatively impact any other non-SkyLake CPUs or general 
> code quality.
> 
> I.e. a full upstream Spectre solution.

Sounds good. I look forward to seeing it.

In the meantime I'll resend the basic bits of the feature detection and
especially turning off KPTI when RDCL_NO is set.
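
As a sketch of what that check amounts to: RDCL_NO and IBRS_ALL are bits 0 and 1 of the IA32_ARCH_CAPABILITIES MSR (0x10a). The helper name `cpu_needs_kpti` and its two parameters are assumptions made for this example, and the logic is deliberately simplified (it ignores vendor checks and any other conditions a real kernel would apply).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MSR_IA32_ARCH_CAPABILITIES	0x0000010aU
#define ARCH_CAP_RDCL_NO	(1ULL << 0)	/* not vulnerable to Meltdown */
#define ARCH_CAP_IBRS_ALL	(1ULL << 1)	/* IBRS can be set once and left on */

/*
 * Hypothetical helper: decide whether KPTI is still needed, given whether
 * the CPU enumerates the MSR at all and, if so, the value read from it.
 */
static bool cpu_needs_kpti(bool has_arch_caps, uint64_t arch_caps)
{
	if (!has_arch_caps)
		return true;	/* nothing says "fixed", so assume vulnerable */
	return !(arch_caps & ARCH_CAP_RDCL_NO);
}

/* Usage: only RDCL_NO turns KPTI off; IBRS_ALL alone does not. */
static void cpu_needs_kpti_demo(void)
{
	assert(cpu_needs_kpti(false, 0));
	assert(!cpu_needs_kpti(true, ARCH_CAP_RDCL_NO));
	assert(cpu_needs_kpti(true, ARCH_CAP_IBRS_ALL));
}
```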

We do also want to do IBPB even with retpoline, so I'll send those
patches for KVM and context switch. There is some bikeshedding to be
done there about the precise conditions under which we do it.

Finally, KVM should be *exposing* IBRS to guests even if we don't use
it ourselves. We'll do that too.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation
  2018-01-23  9:30                   ` David Woodhouse
@ 2018-01-23 10:15                     ` Ingo Molnar
  2018-01-23 10:27                       ` David Woodhouse
  2018-01-23 10:23                     ` Ingo Molnar
  2018-01-25 16:19                     ` Mason
  2 siblings, 1 reply; 143+ messages in thread
From: Ingo Molnar @ 2018-01-23 10:15 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Linus Torvalds, KarimAllah Ahmed, Linux Kernel Mailing List,
	Andi Kleen, Andrea Arcangeli, Andy Lutomirski, Arjan van de Ven,
	Ashok Raj, Asit Mallick, Borislav Petkov, Dan Williams,
	Dave Hansen, Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, KVM list,
	the arch/x86 maintainers, Arjan Van De Ven


* David Woodhouse <dwmw2@infradead.org> wrote:

> On Tue, 2018-01-23 at 08:53 +0100, Ingo Molnar wrote:
> > 
> > The patch below demonstrates the principle, it forcibly enables dynamic ftrace 
> > patching (CONFIG_DYNAMIC_FTRACE=y et al) and turns mcount/__fentry__ into a RET:
> > 
> >   ffffffff81a01a40 <__fentry__>:
> >   ffffffff81a01a40:       c3                      retq   
> > 
> > This would have to be extended with (very simple) call stack depth tracking (just 
> > 3 more instructions would do in the fast path I believe) and a suitable SkyLake 
> > workaround (and also has to play nice with the ftrace callbacks).
> > 
> > On non-SkyLake the overhead would be 0 cycles.
> 
> The overhead of forcing CONFIG_DYNAMIC_FTRACE=y is precisely zero
> cycles? That seems a little optimistic. ;)

The overhead of the quick hack patch I sent to show what exact code I mean is 
obviously not zero.

The overhead of using my proposed solution, to utilize the function call callback 
that CONFIG_DYNAMIC_FTRACE=y provides, is exactly zero on non-SkyLake systems 
where the callback is patched out, on typical Linux distros.

The callback is widely enabled on distro kernels:

  Fedora:                    CONFIG_DYNAMIC_FTRACE=y
  Ubuntu:                    CONFIG_DYNAMIC_FTRACE=y
  OpenSuse (default flavor): CONFIG_DYNAMIC_FTRACE=y

BTW., the reason this is enabled on all distro kernels is because the overhead is 
a single patched-in NOP instruction in the function prologue, when tracing is 
disabled. So it's not even a CALL+RET - it's a patched in NOP.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation
  2018-01-23  9:30                   ` David Woodhouse
  2018-01-23 10:15                     ` Ingo Molnar
@ 2018-01-23 10:23                     ` Ingo Molnar
  2018-01-23 10:35                       ` David Woodhouse
  2018-02-04 18:43                       ` Thomas Gleixner
  2018-01-25 16:19                     ` Mason
  2 siblings, 2 replies; 143+ messages in thread
From: Ingo Molnar @ 2018-01-23 10:23 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Linus Torvalds, KarimAllah Ahmed, Linux Kernel Mailing List,
	Andi Kleen, Andrea Arcangeli, Andy Lutomirski, Arjan van de Ven,
	Ashok Raj, Asit Mallick, Borislav Petkov, Dan Williams,
	Dave Hansen, Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, KVM list,
	the arch/x86 maintainers, Arjan Van De Ven


* David Woodhouse <dwmw2@infradead.org> wrote:

> > On SkyLake this would add an overhead of maybe 2-3 cycles per function call and 
> > obviously all this code and data would be very cache hot. Given that the average 
> > number of function calls per system call is around a dozen, this would be _much_ 
> > faster than any microcode/MSR based approach.
> 
> That's kind of neat, except you don't want it at the top of the
> function; you want it at the bottom.
> 
> If you could hijack the *return* site, then you could check for
> underflow and stuff the RSB right there. But in __fentry__ there's not
> a lot you can do other than complain that something bad is going to
> happen in the future. You know that a string of 16+ rets is going to
> happen, but you've got no gadget in *there* to deal with it when it
> does.

No, it can be done with the existing CALL instrumentation callback that 
CONFIG_DYNAMIC_FTRACE=y provides, by pushing a RET trampoline on the stack from 
the CALL trampoline - see my previous email.

> HJ did have patches to turn 'ret' into a form of retpoline, which I
> don't think ever even got performance-tested.

Return instrumentation is possible as well, but there are two major drawbacks:

 - GCC support for it is not as widely available and return instrumentation is 
   less tested in Linux kernel contexts

 - a major point of my suggestion is that CONFIG_DYNAMIC_FTRACE=y is already 
   enabled in distros here and today, so the runtime overhead to non-SkyLake CPUs 
   would be literally zero, while still allowing to fix the RSB vulnerability on 
   SkyLake.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation
  2018-01-23 10:15                     ` Ingo Molnar
@ 2018-01-23 10:27                       ` David Woodhouse
  2018-01-23 10:44                         ` Ingo Molnar
  0 siblings, 1 reply; 143+ messages in thread
From: David Woodhouse @ 2018-01-23 10:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, KarimAllah Ahmed, Linux Kernel Mailing List,
	Andi Kleen, Andrea Arcangeli, Andy Lutomirski, Arjan van de Ven,
	Ashok Raj, Asit Mallick, Borislav Petkov, Dan Williams,
	Dave Hansen, Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, KVM list,
	the arch/x86 maintainers, Arjan Van De Ven

[-- Attachment #1: Type: text/plain, Size: 586 bytes --]

On Tue, 2018-01-23 at 11:15 +0100, Ingo Molnar wrote:
> 
> BTW., the reason this is enabled on all distro kernels is because the overhead is 
> a single patched-in NOP instruction in the function prologue, when tracing is 
> disabled. So it's not even a CALL+RET - it's a patched in NOP.

Hm? We still have GCC emitting 'call __fentry__' don't we? Would be
nice to get to the point where we can patch *that* out into a NOP... or
are you saying we already can?

But this is a digression. I was being pedantic about the "0 cycles" but
sure, this would be perfectly tolerable.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation
  2018-01-23 10:23                     ` Ingo Molnar
@ 2018-01-23 10:35                       ` David Woodhouse
  2018-02-04 18:43                       ` Thomas Gleixner
  1 sibling, 0 replies; 143+ messages in thread
From: David Woodhouse @ 2018-01-23 10:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, KarimAllah Ahmed, Linux Kernel Mailing List,
	Andi Kleen, Andrea Arcangeli, Andy Lutomirski, Arjan van de Ven

[-- Attachment #1: Type: text/plain, Size: 2289 bytes --]

On Tue, 2018-01-23 at 11:23 +0100, Ingo Molnar wrote:
> * David Woodhouse <dwmw2@infradead.org> wrote:
> 
> > 
> > > 
> > > On SkyLake this would add an overhead of maybe 2-3 cycles per function call and 
> > > obviously all this code and data would be very cache hot. Given that the average 
> > > number of function calls per system call is around a dozen, this would be _much_ 
> > > faster than any microcode/MSR based approach.
> > That's kind of neat, except you don't want it at the top of the
> > function; you want it at the bottom.
> > 
> > If you could hijack the *return* site, then you could check for
> > underflow and stuff the RSB right there. But in __fentry__ there's not
> > a lot you can do other than complain that something bad is going to
> > happen in the future. You know that a string of 16+ rets is going to
> > happen, but you've got no gadget in *there* to deal with it when it
> > does.
>
> No, it can be done with the existing CALL instrumentation callback that 
> CONFIG_DYNAMIC_FTRACE=y provides, by pushing a RET trampoline on the stack from 
> the CALL trampoline - see my previous email.

Yes, that's a neat solution.

> > 
> > HJ did have patches to turn 'ret' into a form of retpoline, which I
> > don't think ever even got performance-tested.
> Return instrumentation is possible as well, but there are two major drawbacks:
> 
>  - GCC support for it is not as widely available and return instrumentation is 
>    less tested in Linux kernel contexts

Hey, we're *already* making people upgrade their compiler, and HJ
apparently never sleeps. So don't actually be held back too much by
that consideration. If it could be better done with GCC help, we really
*can* explore that.

>  - a major point of my suggestion is that CONFIG_DYNAMIC_FTRACE=y is already 
>    enabled in distros here and today, so the runtime overhead to non-SkyLake CPUs 
>    would be literally zero, while still allowing to fix the RSB vulnerability on 
>    SkyLake.

Sure. You still have a few holes to fix (or declare acceptable) to
bring it to the full coverage of the IBRS solution, and it's still
possible that by the time it's complete it's approaching the ick factor
of IBRS, but I'd love to see it.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation
  2018-01-23 10:27                       ` David Woodhouse
@ 2018-01-23 10:44                         ` Ingo Molnar
  2018-01-23 10:57                           ` David Woodhouse
  0 siblings, 1 reply; 143+ messages in thread
From: Ingo Molnar @ 2018-01-23 10:44 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Linus Torvalds, KarimAllah Ahmed, Linux Kernel Mailing List,
	Andi Kleen, Andrea Arcangeli, Andy Lutomirski, Arjan van de Ven,
	Ashok Raj, Asit Mallick, Borislav Petkov, Dan Williams,
	Dave Hansen, Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, KVM list,
	the arch/x86 maintainers, Arjan Van De Ven


* David Woodhouse <dwmw2@infradead.org> wrote:

> On Tue, 2018-01-23 at 11:15 +0100, Ingo Molnar wrote:
> > 
> > BTW., the reason this is enabled on all distro kernels is because the overhead 
> > is a single patched-in NOP instruction in the function prologue, when tracing 
> > is disabled. So it's not even a CALL+RET - it's a patched in NOP.
> 
> Hm? We still have GCC emitting 'call __fentry__' don't we? Would be nice to get 
> to the point where we can patch *that* out into a NOP... or are you saying we 
> already can?

Yes, we already can and do patch the 'call __fentry__/ mcount' call site into a 
NOP today - all 50,000+ call sites on a typical distro kernel.

We did so for a long time - this is all a well established, working mechanism.
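
To make the mechanism concrete, here is a minimal user-space sketch of the byte-level substitution: a 5-byte 'call rel32' (opcode 0xe8) is overwritten with a 5-byte NOP of the same length. The function `nop_out_call` and the buffer layout are invented for this example. Real kernel text patching also has to cope with other CPUs executing the code while it changes, which this sketch ignores entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CALL_INSN_SIZE	5
#define CALL_OPCODE	0xe8	/* x86 'call rel32' */

/* A 5-byte NOP: nopl 0x0(%rax,%rax,1) */
static const uint8_t nop5[CALL_INSN_SIZE] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };

/*
 * Hypothetical helper: replace the 'call rel32' at text[off] with a NOP of
 * the same size. Returns false if it does not fit or no call is found there.
 */
static bool nop_out_call(uint8_t *text, size_t len, size_t off)
{
	if (off > len || len - off < CALL_INSN_SIZE)
		return false;
	if (text[off] != CALL_OPCODE)
		return false;
	memcpy(text + off, nop5, CALL_INSN_SIZE);
	return true;
}

/* Usage: a fake function body of 'call rel32' followed by 'ret'. */
static void nop_out_call_demo(void)
{
	uint8_t text[] = { 0xe8, 0x11, 0x22, 0x33, 0x44, 0xc3 };

	assert(nop_out_call(text, sizeof(text), 0));
	assert(memcmp(text, nop5, CALL_INSN_SIZE) == 0);
	assert(text[5] == 0xc3);	/* the following 'ret' is untouched */
}
```

Because the replacement has the same length as the call, nothing after the call site moves, and the same site can later be patched back into a call when tracing (or, in this proposal, depth tracking) is wanted.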

> But this is a digression. I was being pedantic about the "0 cycles" but sure, 
> this would be perfectly tolerable.

It's not a digression in two ways:

- I wanted to make it clear that for distro kernels it _is_ a zero cycles overhead
  mechanism for non-SkyLake CPUs, literally.

- I noticed that Meltdown and the CR3 writes for PTI appear to have established a
  kind of ... insensitivity and numbness to kernel micro-costs, which peaked with
  the per-syscall MSR write nonsense patch of the SkyLake workaround.
  That attitude is totally unacceptable to me as x86 maintainer and yes, still
  every cycle counts.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation
  2018-01-23 10:44                         ` Ingo Molnar
@ 2018-01-23 10:57                           ` David Woodhouse
  0 siblings, 0 replies; 143+ messages in thread
From: David Woodhouse @ 2018-01-23 10:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, KarimAllah Ahmed, Linux Kernel Mailing List,
	Andi Kleen, Andrea Arcangeli, Andy Lutomirski, Arjan van de Ven

[-- Attachment #1: Type: text/plain, Size: 2259 bytes --]

On Tue, 2018-01-23 at 11:44 +0100, Ingo Molnar wrote:
> * David Woodhouse <dwmw2@infradead.org> wrote:
> > Hm? We still have GCC emitting 'call __fentry__' don't we? Would be nice to get 
> > to the point where we can patch *that* out into a NOP... or are you saying we 
> > already can?
> Yes, we already can and do patch the 'call __fentry__/ mcount' call site into a 
> NOP today - all 50,000+ call sites on a typical distro kernel.
> 
> We did so for a long time - this is all a well established, working mechanism.

That's neat; I'd missed that.

> > But this is a digression. I was being pedantic about the "0 cycles" but sure, 
> > this would be perfectly tolerable.
> It's not a digression in two ways:
> 
> - I wanted to make it clear that for distro kernels it _is_ a zero cycles overhead
>   mechanism for non-SkyLake CPUs, literally.
> 
> - I noticed that Meltdown and the CR3 writes for PTI appear to have established a
>   kind of ... insensitivity and numbness to kernel micro-costs, which peaked with
>   the per-syscall MSR write nonsense patch of the SkyLake workaround.
>   That attitude is totally unacceptable to me as x86 maintainer and yes, still
>   every cycle counts.

Yeah, absolutely. But here we're talking about the overhead on non-SKL, 
and on non-SKL the IBRS overhead is zero too (well, again not precisely
zero because it turns into NOPs).

You're absolutely right that we shouldn't stop counting cycles.

I've already noted that on SKL IBRS is actually a lot faster than on
earlier generations, and we also get back some of the overhead by
turning the retpoline into a bare jmp again. We haven't *forgotten*
about performance.

I'd like to see your solution once the details are sorted out, and see
proper benchmarks — both microbenchmarks and real workloads — comparing
the two. And then make a reasoned decision based on that, and on how
happy we are with the theoretical holes that your solution leaves, in
the cold light of day.

We should also look at whether we want to set STIBP too, which is
somewhat orthogonal to using IBRS to protect the kernel, and could end
up with some of the same MSR writes (at least setting to zero) on some
of the same code paths.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation
  2018-01-23  9:27                   ` Ingo Molnar
  2018-01-23  9:37                     ` David Woodhouse
@ 2018-01-23 15:01                     ` Dave Hansen
  1 sibling, 0 replies; 143+ messages in thread
From: Dave Hansen @ 2018-01-23 15:01 UTC (permalink / raw)
  To: Ingo Molnar, David Woodhouse
  Cc: Linus Torvalds, KarimAllah Ahmed, Linux Kernel Mailing List,
	Andi Kleen, Andrea Arcangeli, Andy Lutomirski, Arjan van de Ven,
	Ashok Raj, Asit Mallick, Borislav Petkov, Dan Williams,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, KVM list,
	the arch/x86 maintainers, Arjan Van De Ven

On 01/23/2018 01:27 AM, Ingo Molnar wrote:
> 
>  - All asynchronous contexts (IRQs, NMIs, etc.) stuff the RSB before IRET. (The 
>    tracking could probably be made IRQ and maybe even NMI safe, but the worst-case 
>    nesting scenarios make my head ache.)

This all sounds totally workable to me.  We talked about using ftrace
itself to track call depth, but it would be unusable in production, of
course.  This seems workable, though.  You're also totally right about
the zero overhead on most kernels with it turned off when we don't need
RSB underflow protection (basically pre-Skylake).

I also agree that the safe thing to do is to just stuff before iret.  I
bet we can get a ftrace-driven RSB tracker working precisely enough even
with NMIs, but it's way simpler to just stuff and be done with it for now.

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation
  2018-01-21 19:14   ` Andy Lutomirski
@ 2018-01-23 16:12     ` Tom Lendacky
  2018-01-23 16:20       ` Woodhouse, David
  0 siblings, 1 reply; 143+ messages in thread
From: Tom Lendacky @ 2018-01-23 16:12 UTC (permalink / raw)
  To: Andy Lutomirski, KarimAllah Ahmed
  Cc: linux-kernel, Andi Kleen, Andrea Arcangeli, Andy Lutomirski,
	Arjan van de Ven, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, David Woodhouse, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Linus Torvalds,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, kvm, x86, Arjan Van De Ven

On 1/21/2018 1:14 PM, Andy Lutomirski wrote:
> 
> 
>> On Jan 20, 2018, at 11:23 AM, KarimAllah Ahmed <karahmed@amazon.de> wrote:
>>
>> From: Tim Chen <tim.c.chen@linux.intel.com>
>>
>> Create macros to control Indirect Branch Speculation.
>>
>> Name them so they reflect what they are actually doing.
>> The macros are used to restrict and unrestrict the indirect branch speculation.
>> They do not *disable* (or *enable*) indirect branch speculation. A trip back to
>> user-space after *restricting* speculation would still affect the BTB.
>>
>> Quoting from a commit by Tim Chen:
>>
>> """
>>    If IBRS is set, near returns and near indirect jumps/calls will not allow
>>    their predicted target address to be controlled by code that executed in a
>>    less privileged prediction mode *BEFORE* the IBRS mode was last written with
>>    a value of 1 or on another logical processor so long as all Return Stack
>>    Buffer (RSB) entries from the previous less privileged prediction mode are
>>    overwritten.
>>
>>    Thus a near indirect jump/call/return may be affected by code in a less
>>    privileged prediction mode that executed *AFTER* IBRS mode was last written
>>    with a value of 1.
>> """
>>
>> [ tglx: Changed macro names and rewrote changelog ]
>> [ karahmed: changed macro names *again* and rewrote changelog ]
>>
>> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
>> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>> Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Cc: Andi Kleen <ak@linux.intel.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Greg KH <gregkh@linuxfoundation.org>
>> Cc: Dave Hansen <dave.hansen@intel.com>
>> Cc: Andy Lutomirski <luto@kernel.org>
>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com>
>> Cc: Linus Torvalds <torvalds@linux-foundation.org>
>> Cc: David Woodhouse <dwmw@amazon.co.uk>
>> Cc: Ashok Raj <ashok.raj@intel.com>
>> Link: https://lkml.kernel.org/r/3aab341725ee6a9aafd3141387453b45d788d61a.1515542293.git.tim.c.chen@linux.intel.com
>> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
>> ---
>> arch/x86/entry/calling.h | 73 ++++++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 73 insertions(+)
>>
>> diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
>> index 3f48f69..5aafb51 100644
>> --- a/arch/x86/entry/calling.h
>> +++ b/arch/x86/entry/calling.h
>> @@ -6,6 +6,8 @@
>> #include <asm/percpu.h>
>> #include <asm/asm-offsets.h>
>> #include <asm/processor-flags.h>
>> +#include <asm/msr-index.h>
>> +#include <asm/cpufeatures.h>
>>
>> /*
>>
>> @@ -349,3 +351,74 @@ For 32-bit we have the following conventions - kernel is built with
>> .Lafter_call_\@:
>> #endif
>> .endm
>> +
>> +/*
>> + * IBRS related macros
>> + */
>> +.macro PUSH_MSR_REGS
>> +    pushq    %rax
>> +    pushq    %rcx
>> +    pushq    %rdx
>> +.endm
>> +
>> +.macro POP_MSR_REGS
>> +    popq    %rdx
>> +    popq    %rcx
>> +    popq    %rax
>> +.endm
>> +
>> +.macro WRMSR_ASM msr_nr:req edx_val:req eax_val:req
>> +    movl    \msr_nr, %ecx
>> +    movl    \edx_val, %edx
>> +    movl    \eax_val, %eax
>> +    wrmsr
>> +.endm
>> +
>> +.macro RESTRICT_IB_SPEC
>> +    ALTERNATIVE "jmp .Lskip_\@", "", X86_FEATURE_IBRS
>> +    PUSH_MSR_REGS
>> +    WRMSR_ASM $MSR_IA32_SPEC_CTRL, $0, $SPEC_CTRL_IBRS
>> +    POP_MSR_REGS
>> +.Lskip_\@:
>> +.endm
>> +
>> +.macro UNRESTRICT_IB_SPEC
>> +    ALTERNATIVE "jmp .Lskip_\@", "", X86_FEATURE_IBRS
>> +    PUSH_MSR_REGS
>> +    WRMSR_ASM $MSR_IA32_SPEC_CTRL, $0, $0
> 
> I think you should be writing 2, not 0, since I'm reasonably confident that we want STIBP on.  Can you explain why you're writing 0?

Do we want to talk about STIBP in general?  Should it be (yet another)
boot option to enable or disable?  If there is STIBP support without
IBRS support, it could be a set and forget at boot time.

Thanks,
Tom

> 
> Also, holy cow, there are so many macros here.
> 
> And a meta question: why are there so many submitters of the same series?
> 

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation
  2018-01-23 16:12     ` Tom Lendacky
@ 2018-01-23 16:20       ` Woodhouse, David
  2018-01-23 22:37         ` Tom Lendacky
  0 siblings, 1 reply; 143+ messages in thread
From: Woodhouse, David @ 2018-01-23 16:20 UTC (permalink / raw)
  To: thomas.lendacky, luto, Raslan, KarimAllah
  Cc: kvm, linux-kernel, peterz, ashok.raj, arjan, arjan.van.de.ven,
	bp, torvalds, tglx, Janakarajan.Natarajan, tim.c.chen, ak, joro,
	dan.j.williams, x86, hpa, aarcange, mingo, luto, pbonzini,
	gregkh, dave.hansen, mhiramat, asit.k.mallick, jun.nakajima,
	labbott, rkrcmar


[-- Attachment #1.1: Type: text/plain, Size: 1585 bytes --]

On Tue, 2018-01-23 at 10:12 -0600, Tom Lendacky wrote:
> 
> >> +.macro UNRESTRICT_IB_SPEC
> >> +    ALTERNATIVE "jmp .Lskip_\@", "", X86_FEATURE_IBRS
> >> +    PUSH_MSR_REGS
> >> +    WRMSR_ASM $MSR_IA32_SPEC_CTRL, $0, $0
> > 
> > I think you should be writing 2, not 0, since I'm reasonably
> > confident that we want STIBP on.  Can you explain why you're writing
> > 0?
> 
> Do we want to talk about STIBP in general?  Should it be (yet another)
> boot option to enable or disable?  If there is STIBP support without
> IBRS support, it could be a set and forget at boot time.

We haven't got patches which enable STIBP in general. The kernel itself
is safe either way with retpoline, or because IBRS implies STIBP too
(that is, there's no difference between writing 1 and 3).
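
For reference, the values being discussed map onto the MSR as follows: IA32_SPEC_CTRL is MSR 0x48, IBRS is bit 0 and STIBP is bit 1, so "2" is STIBP alone and "1 and 3" are IBRS without and with STIBP. The sketch below only computes the register operands that a WRMSR would take (ECX = MSR number, EDX:EAX = value). The struct and function names are invented for this example, and nothing here touches a real MSR.

```c
#include <assert.h>
#include <stdint.h>

#define MSR_IA32_SPEC_CTRL	0x00000048U
#define SPEC_CTRL_IBRS		(1U << 0)
#define SPEC_CTRL_STIBP		(1U << 1)

/* Hypothetical container for the three registers WRMSR consumes. */
struct wrmsr_operands {
	uint32_t ecx;	/* MSR number */
	uint32_t edx;	/* high 32 bits of the value */
	uint32_t eax;	/* low 32 bits of the value */
};

/* Split a 64-bit SPEC_CTRL value the way the WRMSR_ASM macro expects it. */
static struct wrmsr_operands spec_ctrl_operands(uint64_t value)
{
	struct wrmsr_operands w = {
		.ecx = MSR_IA32_SPEC_CTRL,
		.edx = (uint32_t)(value >> 32),
		.eax = (uint32_t)value,
	};
	return w;
}

/* Usage: IBRS plus STIBP is the "3" mentioned above. */
static void spec_ctrl_demo(void)
{
	struct wrmsr_operands w;

	w = spec_ctrl_operands(SPEC_CTRL_IBRS | SPEC_CTRL_STIBP);
	assert(w.ecx == MSR_IA32_SPEC_CTRL && w.edx == 0 && w.eax == 3);
}
```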

So STIBP is purely about protecting userspace processes from one
another, and VM guests from one another, when they run on HT siblings.

There's an argument that there are so many other information leaks
between HT siblings that we might not care. Especially as it's hard to
*tell*, when you're scheduling, whether you trust all the processes (or
guests) on your HT siblings right now... let alone later, when
scheduling another process, whether you *now* need to set STIBP on a
sibling which is no longer safe from the process now running.

I'm not sure we want to set STIBP *unconditionally* either because of
the performance implications.

For IBRS we had an answer and it was just ugly. For STIBP we don't
actually have an answer for "how do we use this?". Do we?



[-- Attachment #1.2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5210 bytes --]


^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation
  2018-01-21 20:28     ` David Woodhouse
  2018-01-21 21:35       ` Linus Torvalds
@ 2018-01-23 20:16       ` Pavel Machek
  1 sibling, 0 replies; 143+ messages in thread
From: Pavel Machek @ 2018-01-23 20:16 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Linus Torvalds, KarimAllah Ahmed, Linux Kernel Mailing List,
	Andi Kleen, Andrea Arcangeli, Andy Lutomirski, Arjan van de Ven,
	Ashok Raj, Asit Mallick, Borislav Petkov, Dan Williams,
	Dave Hansen, Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm, x86,
	Arjan Van De Ven

On Sun 2018-01-21 20:28:17, David Woodhouse wrote:
> On Sun, 2018-01-21 at 11:34 -0800, Linus Torvalds wrote:
> > All of this is pure garbage.
> > 
> > Is Intel really planning on making this shit architectural? Has
> > anybody talked to them and told them they are f*cking insane?
> > 
> > Please, any Intel engineers here - talk to your managers. 
> 
> If the alternative was a two-decade product recall and giving everyone
> free CPUs, I'm not sure it was entirely insane.
> 
> Certainly it's a nasty hack, but hey — the world was on fire and in the
> end we didn't have to just turn the datacentres off and go back to goat
> farming, so it's not all bad.

Well, someone at Intel set the world on fire. And then kept selling faulty
CPUs for half a year while the world was on fire; they knew they were
faulty, yet they sold them anyway.

Then Intel talks about how great they are and how important security is
to them.... Intentionally blurring Meltdown and Spectre together so they
can mask how badly they screwed up. And without apologies.

> As a hack for existing CPUs, it's just about tolerable — as long as it
> can die entirely by the next generation.
> 
> So the part is I think is odd is the IBRS_ALL feature, where a future
> CPU will advertise "I am able to be not broken" and then you have to
> set the IBRS bit once at boot time to *ask* it not to be broken. That
> part is weird, because it ought to have been treated like the RDCL_NO
> bit — just "you don't have to worry any more, it got better".

And now Intel wants to cheat at benchmarks, to put companies that do
the right thing at a disadvantage, and thinks that that's okay because
the world was on fire?

At this point, I believe that yes, a product recall would be
appropriate. If Intel is not willing to do it on their own, well,
perhaps courts can force them. Ouch, and I would not mind some jail time
for whoever is responsible for selling known-faulty CPUs to the public.

Oh, and still no word about the real fixes. The world is not only Linux,
you see? https://pavelmachek.livejournal.com/140949.html?nojs=1

Best regards,
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-21 14:31   ` Thomas Gleixner
  2018-01-21 14:56     ` Borislav Petkov
  2018-01-21 15:25     ` David Woodhouse
@ 2018-01-23 20:58     ` David Woodhouse
  2018-01-23 22:43       ` Johannes Erdfelt
  2018-01-24  8:47       ` Peter Zijlstra
  2 siblings, 2 replies; 143+ messages in thread
From: David Woodhouse @ 2018-01-23 20:58 UTC (permalink / raw)
  To: Thomas Gleixner, KarimAllah Ahmed
  Cc: linux-kernel, Andi Kleen, Andrea Arcangeli, Andy Lutomirski,
	Arjan van de Ven, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, Greg Kroah-Hartman, H . Peter Anvin,
	Ingo Molnar, Janakarajan Natarajan, Joerg Roedel, Jun Nakajima,
	Laura Abbott, Linus Torvalds, Masami Hiramatsu, Paolo Bonzini,
	Peter Zijlstra, Radim Krčmář,
	Tim Chen, Tom Lendacky, kvm, x86

[-- Attachment #1: Type: text/plain, Size: 4341 bytes --]

On Sun, 2018-01-21 at 15:31 +0100, Thomas Gleixner wrote:
> > 
> > XX: Do we want a microcode blacklist?
> 
> Oh yes, we want a microcode blacklist. Ideally we refuse to load the
> affected microcode in the first place and if its already loaded then at
> least avoid to use the borked features.
> 
> PR texts promising that Intel is committed to transparency in this matter
> are not sufficient. Intel, please provide the facts, i.e. a proper list of
> micro codes and affected SKUs, ASAP.

They've finally published one, at
https://newsroom.intel.com/wp-content/uploads/sites/11/2018/01/microcode-update-guidance.pdf

For shits and giggles, you can compare it with the one at
https://kb.vmware.com/s/article/52345

Intel's seems to be a bit rushed. For example for Broadwell-EX 406F1
they say "0x25, 0x23" are bad, but VMware's list says 0x0B000025 and I
have a CPU with 0x0B0000xx. So I've "corrected" their numbers
accordingly in my attempt at a blacklist patch, and likewise for some
Skylake SKUs. But there are others in Intel's list that I can't easily
proofread for them right now. Am I missing something?

diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index b720dacac051..52855d1a4f9a 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -102,6 +102,57 @@ static void probe_xeon_phi_r3mwait(struct cpuinfo_x86 *c)
 		ELF_HWCAP2 |= HWCAP2_RING3MWAIT;
 }
 
+/*
+ * Early microcode releases for the Spectre v2 mitigation were broken:
+ * https://newsroom.intel.com/wp-content/uploads/sites/11/2018/01/microcode-update-guidance.pdf
+ * VMware also has a list at https://kb.vmware.com/s/article/52345
+ */
+struct sku_microcode {
+	u8 model;
+	u8 stepping;
+	u32 microcode;
+};
+static const struct sku_microcode spectre_bad_microcodes[] = {
+	{ INTEL_FAM6_KABYLAKE_DESKTOP, 0x0B, 0x80 },
+	{ INTEL_FAM6_KABYLAKE_MOBILE, 0x0A, 0x80 },
+	{ INTEL_FAM6_KABYLAKE_MOBILE, 0x0A, 0x80 },
+	{ INTEL_FAM6_KABYLAKE_MOBILE, 0x09, 0x80 },
+	{ INTEL_FAM6_KABYLAKE_DESKTOP, 0x09, 0x80 },
+	{ INTEL_FAM6_SKYLAKE_X, 0x04, 0x0200003C },
+	{ INTEL_FAM6_SKYLAKE_MOBILE, 0x03, 0x000000C2 },
+	{ INTEL_FAM6_SKYLAKE_DESKTOP, 0x03, 0x000000C2 },
+	{ INTEL_FAM6_BROADWELL_CORE, 0x04, 0x28 },
+	{ INTEL_FAM6_BROADWELL_GT3E, 0x01, 0x0000001B },
+	{ INTEL_FAM6_HASWELL_ULT, 0x01, 0x21 },
+	{ INTEL_FAM6_HASWELL_GT3E, 0x01, 0x18 },
+	{ INTEL_FAM6_HASWELL_CORE, 0x03, 0x23 },
+	{ INTEL_FAM6_IVYBRIDGE_X, 0x04, 0x42a },
+	{ INTEL_FAM6_HASWELL_X, 0x02, 0x3b },
+	{ INTEL_FAM6_HASWELL_X, 0x04, 0x10 },
+	{ INTEL_FAM6_HASWELL_CORE, 0x03, 0x23 },
+	{ INTEL_FAM6_BROADWELL_XEON_D, 0x02, 0x14 },
+	{ INTEL_FAM6_BROADWELL_XEON_D, 0x03, 0x7000011 },
+	{ INTEL_FAM6_BROADWELL_GT3E, 0x01, 0x0000001B },
+	/* For 406F1 Intel says "0x25, 0x23" while VMware says 0x0B000025
+	 * and a real CPU has a firmware in the 0x0B0000xx range. So: */
+	{ INTEL_FAM6_BROADWELL_X, 0x01, 0x0b000025 },
+	{ INTEL_FAM6_KABYLAKE_DESKTOP, 0x09, 0x80 },
+	{ INTEL_FAM6_SKYLAKE_X, 0x03, 0x100013e },
+	{ INTEL_FAM6_SKYLAKE_X, 0x04, 0x200003c },
+};
+
+static int bad_spectre_microcode(struct cpuinfo_x86 *c)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(spectre_bad_microcodes); i++) {
+		if (c->x86_model == spectre_bad_microcodes[i].model &&
+		    c->x86_mask == spectre_bad_microcodes[i].stepping)
+			return (c->microcode <= spectre_bad_microcodes[i].microcode);
+	}
+	return 0;
+}
+
 static void early_init_intel(struct cpuinfo_x86 *c)
 {
 	u64 misc_enable;
@@ -122,6 +173,18 @@ static void early_init_intel(struct cpuinfo_x86 *c)
 	if (c->x86 >= 6 && !cpu_has(c, X86_FEATURE_IA64))
 		c->microcode = intel_get_microcode_revision();
 
+	if ((cpu_has(c, X86_FEATURE_SPEC_CTRL) ||
+	     cpu_has(c, X86_FEATURE_AMD_SPEC_CTRL) ||
+	     cpu_has(c, X86_FEATURE_AMD_PRED_CMD) ||
+	     cpu_has(c, X86_FEATURE_AMD_STIBP)) && bad_spectre_microcode(c)) {
+		pr_warn("Intel Spectre v2 broken microcode detected; disabling SPEC_CTRL\n");
+		clear_cpu_cap(c, X86_FEATURE_SPEC_CTRL);
+		clear_cpu_cap(c, X86_FEATURE_STIBP);
+		clear_cpu_cap(c, X86_FEATURE_AMD_SPEC_CTRL);
+		clear_cpu_cap(c, X86_FEATURE_AMD_PRED_CMD);
+		clear_cpu_cap(c, X86_FEATURE_AMD_STIBP);
+	}
+
 	/*
 	 * Atom erratum AAE44/AAF40/AAG38/AAH41:
 	 *

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

^ permalink raw reply related	[flat|nested] 143+ messages in thread

* Re: [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation
  2018-01-23 16:20       ` Woodhouse, David
@ 2018-01-23 22:37         ` Tom Lendacky
  2018-01-23 22:49           ` Andi Kleen
  0 siblings, 1 reply; 143+ messages in thread
From: Tom Lendacky @ 2018-01-23 22:37 UTC (permalink / raw)
  To: Woodhouse, David, Andy Lutomirski, KarimAllah Ahmed
  Cc: linux-kernel, Andi Kleen, Andrea Arcangeli, Andy Lutomirski,
	Arjan van de Ven, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, Greg Kroah-Hartman, H . Peter Anvin,
	Ingo Molnar, Janakarajan Natarajan, Joerg Roedel, Jun Nakajima,
	Laura Abbott, Linus Torvalds, Masami Hiramatsu, Paolo Bonzini,
	Peter Zijlstra, Radim Krčmář,
	Thomas Gleixner, Tim Chen, kvm, x86, Arjan Van De Ven

On 1/23/2018 10:20 AM, Woodhouse, David wrote:
> On Tue, 2018-01-23 at 10:12 -0600, Tom Lendacky wrote:
>>
>>>> +.macro UNRESTRICT_IB_SPEC
>>>> +    ALTERNATIVE "jmp .Lskip_\@", "", X86_FEATURE_IBRS
>>>> +    PUSH_MSR_REGS
>>>> +    WRMSR_ASM $MSR_IA32_SPEC_CTRL, $0, $0
>>>  
>> I think you should be writing 2, not 0, since I'm reasonably
>> confident that we want STIBP on.  Can you explain why you're writing
>> 0?
>>
>> Do we want to talk about STIBP in general?  Should it be (yet another)
>> boot option to enable or disable?  If there is STIBP support without
>> IBRS support, it could be a set and forget at boot time.
> 
> We haven't got patches which enable STIBP in general. The kernel itself
> is safe either way with retpoline, or because IBRS implies STIBP too
> (that is, there's no difference between writing 1 and 3).
> 
> So STIBP is purely about protecting userspace processes from one
> another, and VM guests from one another, when they run on HT siblings.
> 
> There's an argument that there are so many other information leaks
> between HT siblings that we might not care. Especially as it's hard to
> *tell* when you're scheduling, whether you trust all the processes (or
> guests) on your HT siblings right now... let alone later when
> scheduling another process if you need to *now* set STIBP on a sibling
> which is no longer safe from this process now running.
> 
> I'm not sure we want to set STIBP *unconditionally* either because of
> the performance implications.
> 
> For IBRS we had an answer and it was just ugly. For STIBP we don't
> actually have an answer for "how do we use this?". Do we?

Not sure.  Maybe to start, the answer might be to allow it to be set for
the ultra-paranoid, but in general don't enable it by default.  Having it
enabled would be an alternative to someone deciding to disable SMT, since
that would have even more of a performance impact.
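
For reference, the values discussed above ("writing 0", "writing 2",
"1 and 3") are bit combinations in the IA32_SPEC_CTRL MSR. Below is a
minimal userspace sketch, assuming the commonly documented layout (IBRS
in bit 0, STIBP in bit 1). The helper names are made up for illustration
and nothing here touches a real MSR.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout of IA32_SPEC_CTRL (MSR 0x48). */
#define SPEC_CTRL_IBRS   (UINT64_C(1) << 0)
#define SPEC_CTRL_STIBP  (UINT64_C(1) << 1)

/*
 * Hypothetical helper: is sibling-thread branch prediction restricted?
 * Per the thread, IBRS implies STIBP, so either bit is enough.
 */
static bool stibp_in_effect(uint64_t spec_ctrl)
{
	return (spec_ctrl & (SPEC_CTRL_IBRS | SPEC_CTRL_STIBP)) != 0;
}

/*
 * Hypothetical helper: the value an "unrestrict" path would write when
 * returning to userspace. 0 drops everything, 2 keeps STIBP on.
 */
static uint64_t unrestrict_value(bool keep_stibp)
{
	return keep_stibp ? SPEC_CTRL_STIBP : 0;
}
```

With these definitions, values 1 and 3 behave the same for the sibling
thread, which is the "no difference between writing 1 and 3" point, and
the open question is only whether the exit path writes 0 or 2.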

Thanks,
Tom

> 
> 

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-23 20:58     ` David Woodhouse
@ 2018-01-23 22:43       ` Johannes Erdfelt
  2018-01-24  8:47       ` Peter Zijlstra
  1 sibling, 0 replies; 143+ messages in thread
From: Johannes Erdfelt @ 2018-01-23 22:43 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Thomas Gleixner, KarimAllah Ahmed, linux-kernel, Andi Kleen,
	Andrea Arcangeli, Andy Lutomirski, Arjan van de Ven, Ashok Raj,
	Asit Mallick, Borislav Petkov, Dan Williams, Dave Hansen,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Tim Chen, Tom Lendacky, kvm, x86

On Tue, Jan 23, 2018, David Woodhouse <dwmw2@infradead.org> wrote:
> +	{ INTEL_FAM6_KABYLAKE_MOBILE, 0x0A, 0x80 },
> +	{ INTEL_FAM6_KABYLAKE_MOBILE, 0x0A, 0x80 },

> +	{ INTEL_FAM6_KABYLAKE_DESKTOP, 0x09, 0x80 },
> +	{ INTEL_FAM6_KABYLAKE_DESKTOP, 0x09, 0x80 },

> +	{ INTEL_FAM6_SKYLAKE_X, 0x04, 0x0200003C },
> +	{ INTEL_FAM6_SKYLAKE_X, 0x04, 0x200003c },

> +	{ INTEL_FAM6_BROADWELL_GT3E, 0x01, 0x0000001B },
> +	{ INTEL_FAM6_BROADWELL_GT3E, 0x01, 0x0000001B },

> +	{ INTEL_FAM6_HASWELL_CORE, 0x03, 0x23 },
> +	{ INTEL_FAM6_HASWELL_CORE, 0x03, 0x23 },

There appear to be a handful of duplicates in this list.
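
Duplicates like these can be caught mechanically. Below is a small
userspace sketch; the struct mirrors the one in the patch, but the
numeric model values are assumptions standing in for the INTEL_FAM6_*
constants, and count_duplicate_skus() is a made-up helper, not kernel
code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sku_microcode {
	uint8_t model;
	uint8_t stepping;
	uint32_t microcode;
};

/* The ten quoted lines; model numbers are assumed values. */
static const struct sku_microcode quoted_pairs[] = {
	{ 0x8E, 0x0A, 0x80 },       { 0x8E, 0x0A, 0x80 },       /* KABYLAKE_MOBILE */
	{ 0x9E, 0x09, 0x80 },       { 0x9E, 0x09, 0x80 },       /* KABYLAKE_DESKTOP */
	{ 0x55, 0x04, 0x0200003C }, { 0x55, 0x04, 0x0200003C }, /* SKYLAKE_X */
	{ 0x47, 0x01, 0x1B },       { 0x47, 0x01, 0x1B },       /* BROADWELL_GT3E */
	{ 0x3C, 0x03, 0x23 },       { 0x3C, 0x03, 0x23 },       /* HASWELL_CORE */
};

/* Count entries whose (model, stepping) already appeared earlier. */
static size_t count_duplicate_skus(const struct sku_microcode *t, size_t n)
{
	size_t dups = 0;

	assert(t != NULL || n == 0);
	for (size_t i = 1; i < n; i++) {
		for (size_t j = 0; j < i; j++) {
			if (t[i].model == t[j].model &&
			    t[i].stepping == t[j].stepping) {
				dups++;
				break;
			}
		}
	}
	return dups;
}
```

Since the lookup in the patch returns on the first (model, stepping)
match, duplicates with identical revisions are harmless but redundant.
A duplicate with a different revision would be silently ignored, which
is the case worth checking for.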

JE

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation
  2018-01-23 22:37         ` Tom Lendacky
@ 2018-01-23 22:49           ` Andi Kleen
  2018-01-23 23:14             ` Woodhouse, David
  0 siblings, 1 reply; 143+ messages in thread
From: Andi Kleen @ 2018-01-23 22:49 UTC (permalink / raw)
  To: Tom Lendacky
  Cc: Woodhouse, David, Andy Lutomirski, KarimAllah Ahmed,
	linux-kernel, Andrea Arcangeli, Andy Lutomirski,
	Arjan van de Ven, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, Greg Kroah-Hartman, H . Peter Anvin,
	Ingo Molnar, Janakarajan Natarajan, Joerg Roedel, Jun Nakajima,
	Laura Abbott, Linus Torvalds, Masami Hiramatsu, Paolo Bonzini,
	Peter Zijlstra, Radim Krčmář,
	Thomas Gleixner, Tim Chen, kvm, x86, Arjan Van De Ven

> Not sure.  Maybe to start, the answer might be to allow it to be set for
> the ultra-paranoid, but in general don't enable it by default.  Having it
> enabled would be an alternative to someone deciding to disable SMT, since
> that would have even more of a performance impact.

I agree. A reasonable strategy would be to only enable it for
processes that have dumpable disabled. This should be already set for
high value processes like GPG, and allows others to opt-in if
they need to.
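
From userspace, the opt-in would look roughly like this. It is a sketch
using the existing prctl(2) dumpable flag; the wrapper names are made up
for illustration, and whether the kernel keys STIBP or IBPB off this
flag is exactly what is being proposed here, not something the code can
show.

```c
#include <sys/prctl.h>

/* Clear the dumpable flag of the calling process. 0 on success, -1 on error. */
static int opt_in_non_dumpable(void)
{
	return prctl(PR_SET_DUMPABLE, 0, 0, 0, 0);
}

/* Current dumpable state: 0 (not dumpable), 1 (dumpable), or -1 on error. */
static int current_dumpable(void)
{
	return prctl(PR_GET_DUMPABLE, 0, 0, 0, 0);
}
```

A side effect to keep in mind: a non-dumpable process also produces no
core dumps and cannot be ptrace-attached by unprivileged users, so this
opt-in changes more than speculation behaviour.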

-Andi

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation
  2018-01-23 22:49           ` Andi Kleen
@ 2018-01-23 23:14             ` Woodhouse, David
  2018-01-23 23:22               ` Andi Kleen
  2018-01-24  0:47               ` Tim Chen
  0 siblings, 2 replies; 143+ messages in thread
From: Woodhouse, David @ 2018-01-23 23:14 UTC (permalink / raw)
  To: thomas.lendacky, ak
  Cc: kvm, linux-kernel, peterz, ashok.raj, Raslan, KarimAllah,
	arjan.van.de.ven, arjan, bp, tglx, Janakarajan.Natarajan,
	tim.c.chen, torvalds, joro, dan.j.williams, x86, hpa, aarcange,
	mingo, luto, pbonzini, gregkh, dave.hansen, luto, mhiramat,
	asit.k.mallick, jun.nakajima, labbott, rkrcmar


[-- Attachment #1.1: Type: text/plain, Size: 763 bytes --]

On Tue, 2018-01-23 at 14:49 -0800, Andi Kleen wrote:
> > Not sure.  Maybe to start, the answer might be to allow it to be set for
> > the ultra-paranoid, but in general don't enable it by default.  Having it
> > enabled would be an alternative to someone deciding to disable SMT, since
> > that would have even more of a performance impact.
> 
> I agree. A reasonable strategy would be to only enable it for
> processes that have dumpable disabled. This should be already set for
> high value processes like GPG, and allows others to opt-in if
> they need to.

That seems to make sense, and I think was the solution we were
approaching for IBPB on context switch too, right?

Are we generally agreed on dumpable as the criterion for both of those?

[-- Attachment #1.2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5210 bytes --]


^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation
  2018-01-23 23:14             ` Woodhouse, David
@ 2018-01-23 23:22               ` Andi Kleen
  2018-01-24  0:47               ` Tim Chen
  1 sibling, 0 replies; 143+ messages in thread
From: Andi Kleen @ 2018-01-23 23:22 UTC (permalink / raw)
  To: Woodhouse, David
  Cc: thomas.lendacky, kvm, linux-kernel, peterz, ashok.raj, Raslan,
	KarimAllah, arjan.van.de.ven, arjan, bp, tglx,
	Janakarajan.Natarajan, tim.c.chen, torvalds, joro,
	dan.j.williams, x86, hpa, aarcange, mingo, luto, pbonzini,
	gregkh, dave.hansen, luto, mhiramat, asit.k.mallick,
	jun.nakajima, labbott, rkrcmar

On Tue, Jan 23, 2018 at 11:14:36PM +0000, Woodhouse, David wrote:
> On Tue, 2018-01-23 at 14:49 -0800, Andi Kleen wrote:
> > > Not sure.  Maybe to start, the answer might be to allow it to be set for
> > > the ultra-paranoid, but in general don't enable it by default.  Having it
> > > enabled would be an alternative to someone deciding to disable SMT, since
> > > that would have even more of a performance impact.
> > 
> > I agree. A reasonable strategy would be to only enable it for
> > processes that have dumpable disabled. This should be already set for
> > high value processes like GPG, and allows others to opt-in if
> > they need to.
> 
> That seems to make sense, and I think was the solution we were
> approaching for IBPB on context switch too, right?

Right.

-Andi

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation
  2018-01-23  7:29               ` Ingo Molnar
  2018-01-23  7:53                 ` Ingo Molnar
@ 2018-01-24  0:05                 ` Andi Kleen
  1 sibling, 0 replies; 143+ messages in thread
From: Andi Kleen @ 2018-01-24  0:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Woodhouse, Linus Torvalds, KarimAllah Ahmed,
	Linux Kernel Mailing List, Andrea Arcangeli, Andy Lutomirski,
	Arjan van de Ven, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, Greg Kroah-Hartman, H . Peter Anvin,
	Ingo Molnar, Janakarajan Natarajan, Joerg Roedel, Jun Nakajima,
	Laura Abbott

Ingo Molnar <mingo@kernel.org> writes:
>
> Is there any reason why this wouldn't work?

To actually maintain the true call depth you would need to intercept the
return of the function too, because the counter has to be decremented
at the end of the function.

Plain ftrace cannot do that because it only intercepts the function
entry.

The function graph tracer can do this, but only at the cost of
overwriting the return address (and saving return in a special stack)

This always causes a mispredict on every return, and other
overhead, and is one of the reasons why function graph
is so much slower than the plain function tracer.
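
The point about needing both hooks can be shown with a toy model. This
is only an illustration with made-up names, not ftrace code: a counter
that sees function entries but not returns drifts upwards instead of
tracking the real depth.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tracker; not an ftrace or kernel API. */
struct depth_tracker {
	int depth;
};

/*
 * Simulate `calls` function calls that each return before the next one
 * starts, and report the depth the tracker believes it is at afterwards.
 */
static int depth_after(int calls, bool hook_returns)
{
	struct depth_tracker t = { 0 };

	assert(calls >= 0);
	for (int i = 0; i < calls; i++) {
		t.depth++;		/* entry hook: always available */
		if (hook_returns)
			t.depth--;	/* exit hook: needs return interception */
	}
	return t.depth;
}
```

With return interception the tracker ends at 0, the true depth. With
entry-only hooks it ends at the number of calls made, which is useless
as a depth estimate.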

I suspect the overhead would be significant.

To make your scheme work efficiently we would likely
need custom gcc instrumentation for the returns.

FWIW our plan was to add enough manual stuffing at strategic
points, until we're sure enough of good enough coverage.

-Andi

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation
  2018-01-23 23:14             ` Woodhouse, David
  2018-01-23 23:22               ` Andi Kleen
@ 2018-01-24  0:47               ` Tim Chen
  2018-01-24  1:00                 ` Andy Lutomirski
  1 sibling, 1 reply; 143+ messages in thread
From: Tim Chen @ 2018-01-24  0:47 UTC (permalink / raw)
  To: Woodhouse, David, Andi Kleen, Tom Lendacky
  Cc: Andy Lutomirski, KarimAllah Ahmed, linux-kernel,
	Andrea Arcangeli, Andy Lutomirski, Arjan van de Ven, Ashok Raj,
	Asit Mallick, Borislav Petkov, Dan Williams, Dave Hansen,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, kvm, x86, Arjan Van De Ven

On 01/23/2018 03:14 PM, Woodhouse, David wrote:
> On Tue, 2018-01-23 at 14:49 -0800, Andi Kleen wrote:
>>> Not sure.  Maybe to start, the answer might be to allow it to be set for
>>> the ultra-paranoid, but in general don't enable it by default.  Having it
>>> enabled would be an alternative to someone deciding to disable SMT, since
>>> that would have even more of a performance impact.
>>
>> I agree. A reasonable strategy would be to only enable it for
>> processes that have dumpable disabled. This should be already set for
>> high value processes like GPG, and allows others to opt-in if
>> they need to.
> 
> That seems to make sense, and I think was the solution we were
> approaching for IBPB on context switch too, right?
> 
> Are we generally agreed on dumpable as the criterion for both of those?
> 

It is a reasonable approach.  Let a process that needs max security
opt in by disabling dumpable. It can get an IBPB flush before it
starts to run, and have STIBP set while running.
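
As a rough model of that policy (a userspace sketch only: the types and
on_switch_to() are made up, and the enum values merely mirror what the
kernel's SUID_DUMP_* constants are assumed to be):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed to mirror the kernel's SUID_DUMP_DISABLE/USER/ROOT values. */
enum dumpable {
	DUMP_DISABLE = 0,
	DUMP_USER    = 1,
	DUMP_ROOT    = 2,
};

/* Hypothetical stand-in for the task being switched to. */
struct task_model {
	enum dumpable dumpable;
};

struct switch_actions {
	bool ibpb;	/* issue a prediction barrier before the task runs */
	bool stibp;	/* keep STIBP set while the task runs */
};

/* Decide what to do when switching to `next`. */
static struct switch_actions on_switch_to(const struct task_model *next)
{
	struct switch_actions a;

	assert(next != NULL);
	/* Anything not plainly user-dumpable is treated as opted in. */
	a.ibpb = a.stibp = (next->dumpable != DUMP_USER);
	return a;
}
```

Ordinary processes pay nothing under this model; only the opted-in ones
take the barrier on the way in and the STIBP cost while running.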

Tim

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation
  2018-01-24  0:47               ` Tim Chen
@ 2018-01-24  1:00                 ` Andy Lutomirski
  2018-01-24  1:22                   ` David Woodhouse
  2018-01-24  1:59                   ` Van De Ven, Arjan
  0 siblings, 2 replies; 143+ messages in thread
From: Andy Lutomirski @ 2018-01-24  1:00 UTC (permalink / raw)
  To: Tim Chen
  Cc: Woodhouse, David, Andi Kleen, Tom Lendacky, KarimAllah Ahmed,
	LKML, Andrea Arcangeli, Andy Lutomirski, Arjan van de Ven,
	Ashok Raj, Asit Mallick, Borislav Petkov, Dan Williams,
	Dave Hansen, Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, kvm list, X86 ML, Arjan Van De Ven

On Tue, Jan 23, 2018 at 4:47 PM, Tim Chen <tim.c.chen@linux.intel.com> wrote:
> On 01/23/2018 03:14 PM, Woodhouse, David wrote:
>> On Tue, 2018-01-23 at 14:49 -0800, Andi Kleen wrote:
>>>> Not sure.  Maybe to start, the answer might be to allow it to be set for
>>>> the ultra-paranoid, but in general don't enable it by default.  Having it
>>>> enabled would be an alternative to someone deciding to disable SMT, since
>>>> that would have even more of a performance impact.
>>>
>>> I agree. A reasonable strategy would be to only enable it for
>>> processes that have dumpable disabled. This should be already set for
>>> high value processes like GPG, and allows others to opt-in if
>>> they need to.
>>
>> That seems to make sense, and I think was the solution we were
>> approaching for IBPB on context switch too, right?
>>
>> Are we generally agreed on dumpable as the criterion for both of those?
>>
>
> It is a reasonable approach.  Let a process who needs max security
> opt in with disabled dumpable. It can have a flush with IBPB clear before
> starting to run, and have STIBP set while running.
>

Do we maybe want a separate opt in?  I can easily imagine things like
web browsers that *don't* want to be non-dumpable but do want this
opt-in.

Also, what's the performance hit of STIBP?

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation
  2018-01-24  1:00                 ` Andy Lutomirski
@ 2018-01-24  1:22                   ` David Woodhouse
  2018-01-24  1:59                   ` Van De Ven, Arjan
  1 sibling, 0 replies; 143+ messages in thread
From: David Woodhouse @ 2018-01-24  1:22 UTC (permalink / raw)
  To: Andy Lutomirski, Tim Chen
  Cc: Andi Kleen, Tom Lendacky, KarimAllah Ahmed, LKML,
	Andrea Arcangeli, Arjan van de Ven, Ashok Raj, Asit Mallick,
	Borislav Petkov, Dan Williams, Dave Hansen, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Linus Torvalds,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, kvm list, X86 ML, Arjan Van De Ven

[-- Attachment #1: Type: text/plain, Size: 1823 bytes --]

On Tue, 2018-01-23 at 17:00 -0800, Andy Lutomirski wrote:
> On Tue, Jan 23, 2018 at 4:47 PM, Tim Chen <tim.c.chen@linux.intel.com> wrote:
> > 
> > On 01/23/2018 03:14 PM, Woodhouse, David wrote:
> > > 
> > > On Tue, 2018-01-23 at 14:49 -0800, Andi Kleen wrote:
> > > > 
> > > > > 
> > > > > Not sure.  Maybe to start, the answer might be to allow it to be set for
> > > > > the ultra-paranoid, but in general don't enable it by default.  Having it
> > > > > enabled would be an alternative to someone deciding to disable SMT, since
> > > > > that would have even more of a performance impact.
> > > > I agree. A reasonable strategy would be to only enable it for
> > > > processes that have dumpable disabled. This should be already set for
> > > > high value processes like GPG, and allows others to opt-in if
> > > > they need to.
> > > That seems to make sense, and I think was the solution we were
> > > approaching for IBPB on context switch too, right?
> > > 
> > > Are we generally agreed on dumpable as the criterion for both of those?
> > > 
> > It is a reasonable approach.  Let a process who needs max security
> > opt in with disabled dumpable. It can have a flush with IBPB clear before
> > starting to run, and have STIBP set while running.
> > 
> Do we maybe want a separate opt in?  I can easily imagine things like
> web browsers that *don't* want to be non-dumpable but do want this
> opt-in.
 
This is to protect you from another local process running on a HT
sibling. Not the kind of thing that web browsers are normally worrying
about.

> Also, what's the performance hit of STIBP?

Varies per CPU generation, but generally approaching that of full IBRS
I think? I don't recall looking at this specifically (since we haven't
actually used it for this yet).

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

^ permalink raw reply	[flat|nested] 143+ messages in thread

* RE: [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation
  2018-01-24  1:00                 ` Andy Lutomirski
  2018-01-24  1:22                   ` David Woodhouse
@ 2018-01-24  1:59                   ` Van De Ven, Arjan
  2018-01-24  3:25                     ` Andy Lutomirski
  1 sibling, 1 reply; 143+ messages in thread
From: Van De Ven, Arjan @ 2018-01-24  1:59 UTC (permalink / raw)
  To: Andy Lutomirski, Tim Chen
  Cc: Woodhouse, David, Andi Kleen, Tom Lendacky, KarimAllah Ahmed,
	LKML, Andrea Arcangeli, Arjan van de Ven, Raj, Ashok, Mallick,
	Asit K, Borislav Petkov, Williams, Dan J, Hansen, Dave,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Nakajima, Jun, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krcmár, Thomas Gleixner, kvm list, X86 ML


> > It is a reasonable approach.  Let a process who needs max security
> > opt in with disabled dumpable. It can have a flush with IBPB clear before
> > starting to run, and have STIBP set while running.
> >
> 
> Do we maybe want a separate opt in?  I can easily imagine things like
> web browsers that *don't* want to be non-dumpable but do want this
> opt-in.

eventually we need something better. Probably in addition.
dumpable is used today for things that want this.

> 
> Also, what's the performance hit of STIBP?

pretty steep, but it depends on the CPU generation, for some it's cheaper than others. (yes I realize this is a vague answer, but the range is really from just about zero to oh my god)

I'm not a fan of doing this right now, to be honest. We really need to stop doing this piecemeal and come up with a better concept of protection at a higher level.
For example, you mention web browsers, but the threat model for browsers is generally internet content. For V2 to work you need to get some "evil pointer" into the app from the observer and browsers usually aren't doing that.
The most likely user would be some software-TPM-like service that has magic keys.

And for keys we want something else... we want an madvise() sort of thing that does a few things: the equivalent of mlock (so the key does not end up in swap), keeping the page (though potentially not the rest) out of core dumps, and having the kernel make sure that if the program exits (say, on a segv) the key page gets zeroed before going into the free pool. Once you have that as a feature, making the key speculation-safe is not too hard (Intel and ARM have CPU options to mark pages for that).
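
Parts of that can be approximated today with existing calls. The sketch
below is not the proposed interface: struct key_page and both helpers
are made up, and it cannot provide the zero-on-abnormal-exit guarantee
or any speculation marking, which would need kernel support.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical handle for one page holding key material. */
struct key_page {
	void *addr;
	size_t len;
	int locked;	/* mlock() succeeded: page will not go to swap */
	int nodump;	/* MADV_DONTDUMP succeeded: page skipped in core dumps */
};

/* Map one anonymous page and apply the protections that exist today. */
static int key_page_alloc(struct key_page *kp)
{
	long psz = sysconf(_SC_PAGESIZE);

	assert(kp != NULL);
	if (psz <= 0)
		return -1;
	kp->len = (size_t)psz;
	kp->addr = mmap(NULL, kp->len, PROT_READ | PROT_WRITE,
			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (kp->addr == MAP_FAILED) {
		kp->addr = NULL;
		return -1;
	}
	kp->locked = (mlock(kp->addr, kp->len) == 0);
	kp->nodump = (madvise(kp->addr, kp->len, MADV_DONTDUMP) == 0);
	return 0;
}

/* Zero the page and unmap it. Only covers the orderly exit path. */
static int key_page_free(struct key_page *kp)
{
	volatile unsigned char *p;
	int ret;

	assert(kp != NULL && kp->addr != NULL);
	p = kp->addr;
	for (size_t i = 0; i < kp->len; i++)
		p[i] = 0;	/* volatile store so the wipe is not elided */
	ret = munmap(kp->addr, kp->len);
	kp->addr = NULL;
	return ret;
}
```

The locked and nodump fields are reported rather than enforced, so a
caller can decide whether to refuse to load a key when, for example,
RLIMIT_MEMLOCK prevents the mlock().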



^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation
  2018-01-24  1:59                   ` Van De Ven, Arjan
@ 2018-01-24  3:25                     ` Andy Lutomirski
  0 siblings, 0 replies; 143+ messages in thread
From: Andy Lutomirski @ 2018-01-24  3:25 UTC (permalink / raw)
  To: Van De Ven, Arjan
  Cc: Andy Lutomirski, Tim Chen, Woodhouse, David, Andi Kleen,
	Tom Lendacky, KarimAllah Ahmed, LKML, Andrea Arcangeli,
	Arjan van de Ven, Raj, Ashok, Mallick, Asit K, Borislav Petkov,
	Williams, Dan J, Hansen, Dave, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Nakajima, Jun, Laura Abbott, Linus Torvalds,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krcmár, Thomas Gleixner, kvm list, X86 ML



> On Jan 23, 2018, at 5:59 PM, Van De Ven, Arjan <arjan.van.de.ven@intel.com> wrote:
> 
> 
>>> It is a reasonable approach.  Let a process who needs max security
>>> opt in with disabled dumpable. It can have a flush with IBPB clear before
>>> starting to run, and have STIBP set while running.
>>> 
>> 
>> Do we maybe want a separate opt in?  I can easily imagine things like
>> web browsers that *don't* want to be non-dumpable but do want this
>> opt-in.
> 
> eventually we need something better. Probably in addition.
> dumpable is used today for things that want this.
> 
>> 
>> Also, what's the performance hit of STIBP?
> 
> pretty steep, but it depends on the CPU generation, for some it's cheaper than others. (yes I realize this is a vague answer, but the range is really from just about zero to oh my god)
> 
> I'm not a fan of doing this right now to be honest. We really need to not piece meal some of this, and come up with a better concept of protection on a higher level.
> For example, you mention web browsers, but the threat model for browsers is generally internet content. For V2 to work you need to get some "evil pointer" into the app from the observer and browsers usually aren't doing that.
> The most likely user would be some software-TPM-like service that has magic keys.
> 
> And for keys we want something else... we want an madvice() sort of thing that does a few things, like equivalent of mlock (so the key does not end up in swap),

I'd love to see a slight variant: encrypt that page against some ephemeral key if it gets swapped.

> not having the page (but potentially the rest) end up in core dumps, and the kernel making sure that if the program exits (say for segv) that the key page gets zeroed before going into the free pool. Once you do that as feature, making the key speculation safe is not too hard (intel and arm have cpu options to mark pages for that)
> 
> 

How do we do that on Intel?  Make it UC?

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-23 20:58     ` David Woodhouse
  2018-01-23 22:43       ` Johannes Erdfelt
@ 2018-01-24  8:47       ` Peter Zijlstra
  2018-01-24  9:02         ` David Woodhouse
  2018-01-24 12:14         ` David Woodhouse
  1 sibling, 2 replies; 143+ messages in thread
From: Peter Zijlstra @ 2018-01-24  8:47 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Thomas Gleixner, KarimAllah Ahmed, linux-kernel, Andi Kleen,
	Andrea Arcangeli, Andy Lutomirski, Arjan van de Ven, Ashok Raj,
	Asit Mallick, Borislav Petkov, Dan Williams, Dave Hansen,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini,
	Radim Krčmář,
	Tim Chen, Tom Lendacky, kvm, x86

On Tue, Jan 23, 2018 at 08:58:36PM +0000, David Woodhouse wrote:

> +static const struct sku_microcode spectre_bad_microcodes[] = {
> +	{ INTEL_FAM6_KABYLAKE_DESKTOP, 0x0B, 0x80 },
> +	{ INTEL_FAM6_KABYLAKE_MOBILE, 0x0A, 0x80 },
> +	{ INTEL_FAM6_KABYLAKE_MOBILE, 0x0A, 0x80 },
> +	{ INTEL_FAM6_KABYLAKE_MOBILE, 0x09, 0x80 },
> +	{ INTEL_FAM6_KABYLAKE_DESKTOP, 0x09, 0x80 },
> +	{ INTEL_FAM6_SKYLAKE_X, 0x04, 0x0200003C },
> +	{ INTEL_FAM6_SKYLAKE_MOBILE, 0x03, 0x000000C2 },
> +	{ INTEL_FAM6_SKYLAKE_DESKTOP, 0x03, 0x000000C2 },
> +	{ INTEL_FAM6_BROADWELL_CORE, 0x04, 0x28 },
> +	{ INTEL_FAM6_BROADWELL_GT3E, 0x01, 0x0000001B },
> +	{ INTEL_FAM6_HASWELL_ULT, 0x01, 0x21 },
> +	{ INTEL_FAM6_HASWELL_GT3E, 0x01, 0x18 },
> +	{ INTEL_FAM6_HASWELL_CORE, 0x03, 0x23 },
> +	{ INTEL_FAM6_IVYBRIDGE_X, 0x04, 0x42a },
> +	{ INTEL_FAM6_HASWELL_X, 0x02, 0x3b },
> +	{ INTEL_FAM6_HASWELL_X, 0x04, 0x10 },
> +	{ INTEL_FAM6_HASWELL_CORE, 0x03, 0x23 },
> +	{ INTEL_FAM6_BROADWELL_XEON_D, 0x02, 0x14 },
> +	{ INTEL_FAM6_BROADWELL_XEON_D, 0x03, 0x7000011 },
> +	{ INTEL_FAM6_BROADWELL_GT3E, 0x01, 0x0000001B },
> +	/* For 406F1 Intel says "0x25, 0x23" while VMware says 0x0B000025
> +	 * and a real CPU has a firmware in the 0x0B0000xx range. So: */
> +	{ INTEL_FAM6_BROADWELL_X, 0x01, 0x0b000025 },
> +	{ INTEL_FAM6_KABYLAKE_DESKTOP, 0x09, 0x80 },
> +	{ INTEL_FAM6_SKYLAKE_X, 0x03, 0x100013e },
> +	{ INTEL_FAM6_SKYLAKE_X, 0x04, 0x200003c },
> +};

Typically tglx likes to use x86_match_cpu() for these things; see also
commit: bd9240a18edfb ("x86/apic: Add TSC_DEADLINE quirk due to
errata").

> +
> +static int bad_spectre_microcode(struct cpuinfo_x86 *c)
> +{
> +	int i;
> +
> +	for (i = 0; i < ARRAY_SIZE(spectre_bad_microcodes); i++) {
> +		if (c->x86_model == spectre_bad_microcodes[i].model &&
> +		    c->x86_mask == spectre_bad_microcodes[i].stepping)
> +			return (c->microcode <= spectre_bad_microcodes[i].microcode);
> +	}
> +	return 0;
> +}

The above is Intel only, you should check vendor too I think.

>  static void early_init_intel(struct cpuinfo_x86 *c)
>  {
>  	u64 misc_enable;
> @@ -122,6 +173,18 @@ static void early_init_intel(struct cpuinfo_x86 *c)
>  	if (c->x86 >= 6 && !cpu_has(c, X86_FEATURE_IA64))
>  		c->microcode = intel_get_microcode_revision();
>  
> +	if ((cpu_has(c, X86_FEATURE_SPEC_CTRL) ||
> +	     cpu_has(c, X86_FEATURE_AMD_SPEC_CTRL) ||
> +	     cpu_has(c, X86_FEATURE_AMD_PRED_CMD) ||
> +	     cpu_has(c, X86_FEATURE_AMD_STIBP)) && bad_spectre_microcode(c)) {
> +		pr_warn("Intel Spectre v2 broken microcode detected; disabling SPEC_CTRL\n");
> +		clear_cpu_cap(c, X86_FEATURE_SPEC_CTRL);
> +		clear_cpu_cap(c, X86_FEATURE_STIBP);
> +		clear_cpu_cap(c, X86_FEATURE_AMD_SPEC_CTRL);
> +		clear_cpu_cap(c, X86_FEATURE_AMD_PRED_CMD);
> +		clear_cpu_cap(c, X86_FEATURE_AMD_STIBP);
> +	}

And since it's Intel only, what are those AMD features doing there?
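
The lookup in the quoted hunk can be tried outside the kernel. What follows
is a minimal user-space sketch, not the patch itself. The field names of
struct sku_microcode follow the hunk, but the field types are assumed, the
MODEL_* values are stand-ins for the kernel's INTEL_FAM6_* constants, and
only three table rows are reproduced. It shows two properties that matter
for the rest of the discussion: an unlisted model/stepping is treated as
good, and a listed one is bad for every revision at or below the entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the kernel's INTEL_FAM6_* constants (assumed values). */
#define MODEL_HASWELL_X  0x3F
#define MODEL_SKYLAKE_X  0x55

/* Field names follow the quoted hunk; the types are an assumption. */
struct sku_microcode {
	uint8_t  model;
	uint8_t  stepping;
	uint32_t microcode;	/* newest revision listed as bad */
};

/* Three rows copied from the quoted table. */
static const struct sku_microcode bad_microcodes[] = {
	{ MODEL_SKYLAKE_X, 0x04, 0x0200003C },
	{ MODEL_HASWELL_X, 0x02, 0x3b },
	{ MODEL_HASWELL_X, 0x04, 0x10 },
};

/*
 * Returns 1 if (model, stepping) is listed and rev is at or below the
 * listed revision, 0 otherwise. The first matching row decides.
 */
int bad_microcode(uint8_t model, uint8_t stepping, uint32_t rev)
{
	size_t i;

	for (i = 0; i < sizeof(bad_microcodes) / sizeof(bad_microcodes[0]); i++) {
		if (model == bad_microcodes[i].model &&
		    stepping == bad_microcodes[i].stepping)
			return rev <= bad_microcodes[i].microcode;
	}
	return 0;
}

/* Usage examples; each assertion holds for the table above. */
void bad_microcode_examples(void)
{
	assert(bad_microcode(MODEL_SKYLAKE_X, 0x04, 0x0200003C) == 1);
	assert(bad_microcode(MODEL_SKYLAKE_X, 0x04, 0x0200003D) == 0);
	/* Same model, same revision, different stepping: different answer. */
	assert(bad_microcode(MODEL_HASWELL_X, 0x02, 0x3b) == 1);
	assert(bad_microcode(MODEL_HASWELL_X, 0x04, 0x3b) == 0);
}
```

Because the stepping takes part in the match, a helper that matches on the
model alone cannot express this table directly.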

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-24  8:47       ` Peter Zijlstra
@ 2018-01-24  9:02         ` David Woodhouse
  2018-01-24  9:10           ` Greg Kroah-Hartman
                             ` (2 more replies)
  2018-01-24 12:14         ` David Woodhouse
  1 sibling, 3 replies; 143+ messages in thread
From: David Woodhouse @ 2018-01-24  9:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, KarimAllah Ahmed, linux-kernel, Andi Kleen,
	Andrea Arcangeli, Andy Lutomirski, Arjan van de Ven, Ashok Raj,
	Asit Mallick, Borislav Petkov, Dan Williams, Dave Hansen,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini,
	Radim Krčmář,
	Tim Chen, Tom Lendacky, kvm, x86

On Wed, 2018-01-24 at 09:47 +0100, Peter Zijlstra wrote:
> Typically tglx likes to use x86_match_cpu() for these things; see also
> commit: bd9240a18edfb ("x86/apic: Add TSC_DEADLINE quirk due to
> errata").

Thanks, will fix. I think we might also end up in whitelist mode,
adding "known good" microcodes to the list as they get released or
retroactively blessed.

I would really have liked a new bit in IA32_ARCH_CAPABILITIES to say
that it's safe, but that's not possible for *existing* microcode which
actually turns out to be OK in the end.

That means the whitelist ends up basically empty right now. Should I
add a command line parameter to override it? Otherwise we end up having
to rebuild the kernel every time there's a microcode release which
covers a new CPU SKU (which is why I kind of hate the whitelist, but
Arjan is very insistent...)

I'm kind of tempted to turn it into a whitelist just by adding 1 to the
microcode revision in each table entry. Sure, that N+1 might be another
microcode build that also has issues but never saw the light of day...
but that's OK as long as it never *does*. And yes we'd have to tweak it if
revisions that are blacklisted in the Intel doc are subsequently
cleared. But at least it'd require *less* tweaking.

> > 
> > +
> > +static int bad_spectre_microcode(struct cpuinfo_x86 *c)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < ARRAY_SIZE(spectre_bad_microcodes); i++) {
> > +		if (c->x86_model == spectre_bad_microcodes[i].model &&
> > +		    c->x86_mask == spectre_bad_microcodes[i].stepping)
> > +			return (c->microcode <= spectre_bad_microcodes[i].microcode);
> > +	}
> > +	return 0;
> > +}
> The above is Intel only, you should check vendor too I think.

It's in intel.c, called from early_init_intel(). Isn't that sufficient?

> > 
> >  static void early_init_intel(struct cpuinfo_x86 *c)
> >  {
> >  	u64 misc_enable;
> > @@ -122,6 +173,18 @@ static void early_init_intel(struct cpuinfo_x86 *c)
> >  	if (c->x86 >= 6 && !cpu_has(c, X86_FEATURE_IA64))
> >  		c->microcode = intel_get_microcode_revision();
> >  
> > +	if ((cpu_has(c, X86_FEATURE_SPEC_CTRL) ||
> > +	     cpu_has(c, X86_FEATURE_AMD_SPEC_CTRL) ||
> > +	     cpu_has(c, X86_FEATURE_AMD_PRED_CMD) ||
> > +	     cpu_has(c, X86_FEATURE_AMD_STIBP)) && bad_spectre_microcode(c)) {
> > +		pr_warn("Intel Spectre v2 broken microcode detected; disabling SPEC_CTRL\n");
> > +		clear_cpu_cap(c, X86_FEATURE_SPEC_CTRL);
> > +		clear_cpu_cap(c, X86_FEATURE_STIBP);
> > +		clear_cpu_cap(c, X86_FEATURE_AMD_SPEC_CTRL);
> > +		clear_cpu_cap(c, X86_FEATURE_AMD_PRED_CMD);
> > +		clear_cpu_cap(c, X86_FEATURE_AMD_STIBP);
> > +	}
> And since it's Intel only, what are those AMD features doing there?

Hypervisors which only want to expose PRED_CMD may do so using the AMD
feature bit. SPEC_CTRL requires save/restore and live migration
support, and isn't needed with retpoline anyway (since guests won't be
calling directly into firmware).


^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-24  9:02         ` David Woodhouse
@ 2018-01-24  9:10           ` Greg Kroah-Hartman
  2018-01-24 15:09             ` Arjan van de Ven
  2018-01-24  9:34           ` Peter Zijlstra
  2018-01-24 10:49           ` Henrique de Moraes Holschuh
  2 siblings, 1 reply; 143+ messages in thread
From: Greg Kroah-Hartman @ 2018-01-24  9:10 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Peter Zijlstra, Thomas Gleixner, KarimAllah Ahmed, linux-kernel,
	Andi Kleen, Andrea Arcangeli, Andy Lutomirski, Arjan van de Ven,
	Ashok Raj, Asit Mallick, Borislav Petkov, Dan Williams,
	Dave Hansen, H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Linus Torvalds,
	Masami Hiramatsu, Paolo Bonzini, Radim Krčmář,
	Tim Chen, Tom Lendacky, kvm, x86

On Wed, Jan 24, 2018 at 09:02:21AM +0000, David Woodhouse wrote:
> On Wed, 2018-01-24 at 09:47 +0100, Peter Zijlstra wrote:
> > Typically tglx likes to use x86_match_cpu() for these things; see also
> > commit: bd9240a18edfb ("x86/apic: Add TSC_DEADLINE quirk due to
> > errata").
> 
> Thanks, will fix. I think we might also end up in whitelist mode,
> adding "known good" microcodes to the list as they get released or
> retroactively blessed.
> 
> I would really have liked a new bit in IA32_ARCH_CAPABILITIES to say
> that it's safe, but that's not possible for *existing* microcode which
> actually turns out to be OK in the end.
> 
> That means the whitelist ends up basically empty right now. Should I
> add a command line parameter to override it? Otherwise we end up having
> to rebuild the kernel every time there's a microcode release which
> covers a new CPU SKU (which is why I kind of hate the whitelist, but
> Arjan is very insistent...)

Ick, no, whitelists are a pain for everyone involved.  Don't do that
unless it is absolutely the only way it will ever work.

Arjan, why do you think this can only be done as a whitelist?

It's much easier to just mark the "bad" microcode versions as those
_should_ be a much smaller list that Intel knows about today.  And of
course, any future microcode updates will not be "bad" because they know
how to properly test for this now before they are released :)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-24  9:02         ` David Woodhouse
  2018-01-24  9:10           ` Greg Kroah-Hartman
@ 2018-01-24  9:34           ` Peter Zijlstra
  2018-01-24 10:49           ` Henrique de Moraes Holschuh
  2 siblings, 0 replies; 143+ messages in thread
From: Peter Zijlstra @ 2018-01-24  9:34 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Thomas Gleixner, KarimAllah Ahmed, linux-kernel, Andi Kleen,
	Andrea Arcangeli, Andy Lutomirski, Arjan van de Ven, Ashok Raj,
	Asit Mallick, Borislav Petkov, Dan Williams, Dave Hansen,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini,
	Radim Krčmář,
	Tim Chen, Tom Lendacky, kvm, x86

> > > +	for (i = 0; i < ARRAY_SIZE(spectre_bad_microcodes); i++) {
> > > +		if (c->x86_model == spectre_bad_microcodes[i].model &&
> > > +		    c->x86_mask == spectre_bad_microcodes[i].stepping)
> > > +			return (c->microcode <= spectre_bad_microcodes[i].microcode);
> > > +	}
> > > +	return 0;
> > > +}
> > The above is Intel only, you should check vendor too I think.
> 
> It's in intel.c, called from early_init_intel(). Isn't that sufficient?

Duh, so much for reading skillz on my end ;-)

> > > +		pr_warn("Intel Spectre v2 broken microcode detected; disabling SPEC_CTRL\n");
> > > +		clear_cpu_cap(c, X86_FEATURE_SPEC_CTRL);
> > > +		clear_cpu_cap(c, X86_FEATURE_STIBP);
> > > +		clear_cpu_cap(c, X86_FEATURE_AMD_SPEC_CTRL);
> > > +		clear_cpu_cap(c, X86_FEATURE_AMD_PRED_CMD);
> > > +		clear_cpu_cap(c, X86_FEATURE_AMD_STIBP);
> > > +	}
> > And since it's Intel only, what are those AMD features doing there?
> 
> Hypervisors which only want to expose PRED_CMD may do so using the AMD
> feature bit. SPEC_CTRL requires save/restore and live migration
> support, and isn't needed with retpoline anyway (since guests won't be
> calling directly into firmware).

Egads, I suppose that makes some sense, but it does make a horrible
muddle of things.

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-24  9:02         ` David Woodhouse
  2018-01-24  9:10           ` Greg Kroah-Hartman
  2018-01-24  9:34           ` Peter Zijlstra
@ 2018-01-24 10:49           ` Henrique de Moraes Holschuh
  2018-01-24 12:30             ` David Woodhouse
  2 siblings, 1 reply; 143+ messages in thread
From: Henrique de Moraes Holschuh @ 2018-01-24 10:49 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Peter Zijlstra, Thomas Gleixner, KarimAllah Ahmed, linux-kernel,
	Andi Kleen, Andrea Arcangeli, Andy Lutomirski, Arjan van de Ven,
	Ashok Raj, Asit Mallick, Borislav Petkov, Dan Williams,
	Dave Hansen, Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini,
	Radim Krčmář,
	Tim Chen, Tom Lendacky, kvm, x86

On Wed, 24 Jan 2018, David Woodhouse wrote:
> I'm kind of tempted to turn it into a whitelist just by adding 1 to the
> microcode revision in each table entry. Sure, that N+1 might be another
> microcode build that also has issues but never saw the light of day...

Watch out for the (AFAIK) still not properly documented where it should
be (i.e. the microcode chapter of the Intel SDM) weirdness in Skylake+
microcode revision.  Actually, this is related to SGX, so anything that
has SGX.

When it has SGX inside, Intel will release microcode only with even
revision numbers, but the processor may report it as odd (and will do so
by subtracting 1, so microcode 0xb0 is the same as microcode 0xaf) when
the update is loaded by the processor itself from FIT (as opposed to
being loaded by WRMSR from BIOS/UEFI/OS).
 
So, you could see N-1 from within Linux if we did not update the
microcode, and fail to trigger a whitelist (or mistrigger a blacklist).

-- 
  Henrique Holschuh
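
The effect described in this message can be put into numbers. This is a
minimal sketch that only models the behaviour as reported in the message
above (it is not taken from Intel documentation): a part that loads an
update from FIT reports the released revision minus one. The helper names
are made up for illustration, and last_bad = 0xae is a hypothetical
previous release, chosen so that the 0xb0/0xaf pair from the message can
be used.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Revision as seen by software, under the assumption stated above:
 * released - 1 when loaded from FIT, released otherwise.
 */
uint32_t reported_revision(uint32_t released, int loaded_from_fit)
{
	return loaded_from_fit ? released - 1 : released;
}

/* Minimum-revision check: accept anything at or above min_good. */
int passes_min_revision(uint32_t reported, uint32_t min_good)
{
	return reported >= min_good;
}

void fit_examples(void)
{
	const uint32_t released = 0xb0;	/* revision from the message */
	const uint32_t last_bad = 0xae;	/* hypothetical previous release */
	uint32_t seen = reported_revision(released, 1);

	assert(seen == 0xaf);
	/* Threshold keyed on the release number: the FIT-loaded update is missed. */
	assert(!passes_min_revision(seen, released));
	/* Threshold keyed on last_bad + 1: the same update is accepted. */
	assert(passes_min_revision(seen, last_bad + 1));
}
```

A threshold derived from the last bad revision tolerates the off-by-one,
while a threshold copied from the release notes of the fixed revision does
not.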

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-24  8:47       ` Peter Zijlstra
  2018-01-24  9:02         ` David Woodhouse
@ 2018-01-24 12:14         ` David Woodhouse
  2018-01-24 12:29           ` Peter Zijlstra
  1 sibling, 1 reply; 143+ messages in thread
From: David Woodhouse @ 2018-01-24 12:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, KarimAllah Ahmed, linux-kernel, Andi Kleen,
	Andrea Arcangeli, Andy Lutomirski, Arjan van de Ven, Ashok Raj,
	Asit Mallick, Borislav Petkov, Dan Williams, Dave Hansen,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini,
	Radim Krčmář,
	Tim Chen, Tom Lendacky, kvm, x86

On Wed, 2018-01-24 at 09:47 +0100, Peter Zijlstra wrote:
> 
> Typically tglx likes to use x86_match_cpu() for these things; see also
> commit: bd9240a18edfb ("x86/apic: Add TSC_DEADLINE quirk due to
> errata").

Ewww.

static u32 hsx_deadline_rev(void)
{
       switch (boot_cpu_data.x86_mask) {
       case 0x02: return 0x3a; /* EP */
       case 0x04: return 0x0f; /* EX */
       }

       return ~0U;
}
...
static const struct x86_cpu_id deadline_match[] = {
       DEADLINE_MODEL_MATCH_FUNC( INTEL_FAM6_HASWELL_X,        hsx_deadline_rev),
       DEADLINE_MODEL_MATCH_REV ( INTEL_FAM6_BROADWELL_X,      0x0b000020),
       DEADLINE_MODEL_MATCH_FUNC( INTEL_FAM6_BROADWELL_XEON_D, bdx_deadline_rev),
       DEADLINE_MODEL_MATCH_REV ( INTEL_FAM6_SKYLAKE_X,        0x02000014),
...

       /*
        * Function pointers will have the MSB set due to address layout,
        * immediate revisions will not.
        */
       if ((long)m->driver_data < 0)
               rev = ((u32 (*)(void))(m->driver_data))();
       else
               rev = (u32)m->driver_data;

EWWWW!

Shan't.

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-24 12:14         ` David Woodhouse
@ 2018-01-24 12:29           ` Peter Zijlstra
  2018-01-24 12:58             ` David Woodhouse
  0 siblings, 1 reply; 143+ messages in thread
From: Peter Zijlstra @ 2018-01-24 12:29 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Thomas Gleixner, KarimAllah Ahmed, linux-kernel, Andi Kleen,
	Andrea Arcangeli, Andy Lutomirski, Arjan van de Ven, Ashok Raj,
	Asit Mallick, Borislav Petkov, Dan Williams, Dave Hansen,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini,
	Radim Krčmář,
	Tim Chen, Tom Lendacky, kvm, x86

On Wed, Jan 24, 2018 at 12:14:51PM +0000, David Woodhouse wrote:
> On Wed, 2018-01-24 at 09:47 +0100, Peter Zijlstra wrote:
> > 
> > Typically tglx likes to use x86_match_cpu() for these things; see also
> > commit: bd9240a18edfb ("x86/apic: Add TSC_DEADLINE quirk due to
> > errata").
> 
> Ewww.
> 
> static u32 hsx_deadline_rev(void)
> {
>        switch (boot_cpu_data.x86_mask) {
>        case 0x02: return 0x3a; /* EP */
>        case 0x04: return 0x0f; /* EX */
>        }
> 
>        return ~0U;
> }
> ...
> static const struct x86_cpu_id deadline_match[] = {
>        DEADLINE_MODEL_MATCH_FUNC( INTEL_FAM6_HASWELL_X,        hsx_deadline_rev),
>        DEADLINE_MODEL_MATCH_REV ( INTEL_FAM6_BROADWELL_X,      0x0b000020),
>        DEADLINE_MODEL_MATCH_FUNC( INTEL_FAM6_BROADWELL_XEON_D, bdx_deadline_rev),
>        DEADLINE_MODEL_MATCH_REV ( INTEL_FAM6_SKYLAKE_X,        0x02000014),
> ...
> 
>        /*
>         * Function pointers will have the MSB set due to address layout,
>         * immediate revisions will not.
>         */
>        if ((long)m->driver_data < 0)
>                rev = ((u32 (*)(void))(m->driver_data))();
>        else
>                rev = (u32)m->driver_data;
> 
> EWWWW!
> 

Yes :/

We could look at extending x86_cpu_id and x86_match_cpu with a stepping
option I suppose, but that might be lots of churn.

Thomas?

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-24 10:49           ` Henrique de Moraes Holschuh
@ 2018-01-24 12:30             ` David Woodhouse
  0 siblings, 0 replies; 143+ messages in thread
From: David Woodhouse @ 2018-01-24 12:30 UTC (permalink / raw)
  To: Henrique de Moraes Holschuh
  Cc: Peter Zijlstra, Thomas Gleixner, KarimAllah Ahmed, linux-kernel,
	Andi Kleen, Andrea Arcangeli, Andy Lutomirski, Arjan van de Ven,
	Ashok Raj, Asit Mallick, Borislav Petkov, Dan Williams,
	Dave Hansen, Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini,
	Radim Krčmář,
	Tim Chen, Tom Lendacky, kvm, x86

On Wed, 2018-01-24 at 08:49 -0200, Henrique de Moraes Holschuh wrote:
> On Wed, 24 Jan 2018, David Woodhouse wrote:
> > 
> > I'm kind of tempted to turn it into a whitelist just by adding 1 to the
> > microcode revision in each table entry. Sure, that N+1 might be another
> > microcode build that also has issues but never saw the light of day...
> Watch out for the (AFAIK) still not properly documented where it should
> be (i.e. the microcode chapter of the Intel SDM) weirdness in Skylake+
> microcode revision.  Actually, this is related to SGX, so anything that
> has SGX.
> 
> When it has SGX inside, Intel will release microcode only with even
> revision numbers, but the processor may report it as odd (and will do so
> by subtracting 1, so microcode 0xb0 is the same as microcode 0xaf) when
> the update is loaded by the processor itself from FIT (as opposed to
> being loaded by WRMSR from BIOS/UEFI/OS).
>  
> So, you could see N-1 from within Linux if we did not update the
> microcode, and fail to trigger a whitelist (or mistrigger a blacklist).

That's OK. If they ship a fixed 0x0200003E firmware for SKX, for
example, which appears as 0x0200003D when it's loaded from FIT, that's
still >= 0x0200003C *and* !(<0x0200003D) if we were to do that.

In fact, the code for the "whitelist X+1" vs. "blacklist X" approach is
*entirely* equivalent; it's purely a cosmetic change. Because

   !(≤ X)   ≡   ≥ (X+1)

The *real* change here is that for ∀ SKU, we are being asked to
blacklist all microcode revisions <= 0xFFFFFFFF¹ for now, and change
that only once new microcode is actually released. Every time, and then
get people to rebuild their kernels because they can *use* the features
from the new microcode.


¹(OK, *there's* a functional difference between whitelist and blacklist
approach. But we'll never actually see 0xffffffff so that's not
important right now :)
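
The equivalence, and the corner case in the footnote, can be checked with
a few lines of C. This is a sketch with made-up helper names, assuming
32-bit unsigned revisions and a blacklist that rejects everything up to
and including the listed revision, as the <= in the quoted patch does. One
way to read the footnote is that for X = 0xFFFFFFFF the value X+1 cannot
be represented: it wraps to 0, and the two forms stop agreeing.

```c
#include <assert.h>
#include <stdint.h>

/* Blacklist form: revisions up to and including last_bad are rejected. */
int blacklisted(uint32_t rev, uint32_t last_bad)
{
	return rev <= last_bad;
}

/* Whitelist form: revisions from first_good upward are accepted. */
int whitelisted(uint32_t rev, uint32_t first_good)
{
	return rev >= first_good;
}

/* Returns 1 when both forms give the same verdict for rev. */
int forms_agree(uint32_t rev, uint32_t last_bad)
{
	uint32_t first_good = last_bad + 1;	/* wraps to 0 at 0xFFFFFFFF */
	int accepted_by_blacklist = !blacklisted(rev, last_bad);

	return accepted_by_blacklist == whitelisted(rev, first_good);
}

void equivalence_examples(void)
{
	/* Around a real threshold the two forms are interchangeable. */
	assert(forms_agree(0x0200003C, 0x0200003C));
	assert(forms_agree(0x0200003D, 0x0200003C));
	/* "Blacklist everything" has no X+1 counterpart in 32 bits. */
	assert(!forms_agree(0, 0xFFFFFFFF));
}
```

With a threshold of 0xFFFFFFFF the blacklist rejects revision 0xFFFFFFFF
and a whitelist with the same threshold accepts it, which only matters if
that revision is ever seen.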

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-24 12:29           ` Peter Zijlstra
@ 2018-01-24 12:58             ` David Woodhouse
  0 siblings, 0 replies; 143+ messages in thread
From: David Woodhouse @ 2018-01-24 12:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, KarimAllah Ahmed, linux-kernel, Andi Kleen,
	Andrea Arcangeli, Andy Lutomirski, Arjan van de Ven, Ashok Raj,
	Asit Mallick, Borislav Petkov, Dan Williams, Dave Hansen,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini,
	Radim Krčmář,
	Tim Chen, Tom Lendacky, kvm, x86

On Wed, 2018-01-24 at 13:29 +0100, Peter Zijlstra wrote:
> 
> Yes :/
> 
> We could look at extending x86_cpu_id and x86_match_cpu with a stepping
> option I suppose, but that might be lots of churn.

That goes all the way to mod_deviceinfo, and would be horrid.

We could add an x86_match_cpu_stepping() function, I suppose? But I'm
mostly trying to avoid depending on other stuff like that, for patches
which are going to need to be backported to all the stable kernels.

I'd much rather do it this way and then if we see another use case for
it (that commit you mentioned could be nicer, I suppose), consolidate
into a single stepping-capable lookup function in a later "cleanup".
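
For illustration, a stepping-capable lookup of the kind mentioned above
could look roughly like this. It is a user-space sketch under assumptions,
not a proposal for the kernel interface: struct cpu_stepping_id, the
STEPPING() macro and match_cpu_stepping() are hypothetical names, the
stepping bitmask is one possible encoding among several, and 0x3F stands
in for INTEL_FAM6_HASWELL_X. The two table rows reuse the revisions from
the hsx_deadline_rev() snippet quoted earlier in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry: one model, a bitmask of steppings, one payload. */
struct cpu_stepping_id {
	uint8_t  model;
	uint16_t steppings;	/* bit n set means stepping n matches */
	uint32_t data;		/* e.g. a microcode revision */
};

#define STEPPING(n)	((uint16_t)(1u << (n)))

/* Revisions from the quoted hsx_deadline_rev(); 0x3F is an assumed model ID. */
static const struct cpu_stepping_id deadline_revs[] = {
	{ 0x3F, STEPPING(2), 0x3a },	/* EP */
	{ 0x3F, STEPPING(4), 0x0f },	/* EX */
};

/* Returns the first entry matching model and stepping, or NULL if none does. */
const struct cpu_stepping_id *
match_cpu_stepping(const struct cpu_stepping_id *tbl, size_t n,
		   uint8_t model, uint8_t stepping)
{
	size_t i;

	assert(tbl != NULL);
	for (i = 0; i < n; i++) {
		/* The range check keeps the shift inside the 16-bit mask. */
		if (tbl[i].model == model && stepping < 16 &&
		    (tbl[i].steppings & STEPPING(stepping)))
			return &tbl[i];
	}
	return NULL;
}
```

With the stepping in the table, the per-model callback and the cast of
driver_data to a function pointer are no longer needed to pick a revision.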

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-24  9:10           ` Greg Kroah-Hartman
@ 2018-01-24 15:09             ` Arjan van de Ven
  2018-01-24 15:18               ` David Woodhouse
  0 siblings, 1 reply; 143+ messages in thread
From: Arjan van de Ven @ 2018-01-24 15:09 UTC (permalink / raw)
  To: Greg Kroah-Hartman, David Woodhouse
  Cc: Peter Zijlstra, Thomas Gleixner, KarimAllah Ahmed, linux-kernel,
	Andi Kleen, Andrea Arcangeli, Andy Lutomirski, Ashok Raj,
	Asit Mallick, Borislav Petkov, Dan Williams, Dave Hansen,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Linus Torvalds,
	Masami Hiramatsu, Paolo Bonzini, Radim Krčmář,
	Tim Chen, Tom Lendacky, kvm, x86

On 1/24/2018 1:10 AM, Greg Kroah-Hartman wrote:
> 
>> That means the whitelist ends up basically empty right now. Should I
>> add a command line parameter to override it? Otherwise we end up having
>> to rebuild the kernel every time there's a microcode release which
>> covers a new CPU SKU (which is why I kind of hate the whitelist, but
>> Arjan is very insistent...)
> 
> Ick, no, whitelists are a pain for everyone involved.  Don't do that
> unless it is absolutely the only way it will ever work.
> 
> Arjan, why do you think this can only be done as a whitelist?

I suggested a minimum version list for those cpus that need it.

microcode versions are tricky (and we've released betas etc etc with their own numbers)
and as a result there might be several numbers that have those issues with their IBRS for the same F/M/S

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-24 15:09             ` Arjan van de Ven
@ 2018-01-24 15:18               ` David Woodhouse
  0 siblings, 0 replies; 143+ messages in thread
From: David Woodhouse @ 2018-01-24 15:18 UTC (permalink / raw)
  To: Arjan van de Ven, Greg Kroah-Hartman
  Cc: Peter Zijlstra, Thomas Gleixner, KarimAllah Ahmed, linux-kernel,
	Andi Kleen, Andrea Arcangeli, Andy Lutomirski, Ashok Raj,
	Asit Mallick, Borislav Petkov, Dan Williams, Dave Hansen,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Linus Torvalds,
	Masami Hiramatsu, Paolo Bonzini, Radim Krčmář,
	Tim Chen, Tom Lendacky, kvm, x86

On Wed, 2018-01-24 at 07:09 -0800, Arjan van de Ven wrote:
> On 1/24/2018 1:10 AM, Greg Kroah-Hartman wrote:
> > Arjan, why do you think this can only be done as a whitelist?
>
> I suggested a minimum version list for those cpus that need it.
>
> microcode versions are tricky (and we've released betas etc etc with their own numbers)
> and as a result there might be several numbers that have those issues with their IBRS for the same F/M/S

I really think that's fine. Anyone who uses beta microcodes should be
perfectly prepared to deal with the results. And probably *wanted* to
be able to actually test them, instead of having the kernel refuse to
do so.

So if there are beta microcodes floating around with numbers higher
than in Intel's currently-published list, which are not yet known to be
safe (or even if they're known not to be), that's absolutely OK.

If you're telling me that there will be *publicly* released microcodes
with version numbers higher than those in the list, which still have
the same issues... well, then I think Mr Shouty is going to come for
another visit.

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation
  2018-01-23  9:30                   ` David Woodhouse
  2018-01-23 10:15                     ` Ingo Molnar
  2018-01-23 10:23                     ` Ingo Molnar
@ 2018-01-25 16:19                     ` Mason
  2018-01-25 17:16                       ` Greg Kroah-Hartman
  2 siblings, 1 reply; 143+ messages in thread
From: Mason @ 2018-01-25 16:19 UTC (permalink / raw)
  To: Linux ARM
  Cc: David Woodhouse, Ingo Molnar, Linus Torvalds, KarimAllah Ahmed,
	Andi Kleen, Andrea Arcangeli, Andy Lutomirski, Arjan van de Ven,
	Ashok Raj, Asit Mallick, Borislav Petkov, Dan Williams,
	Dave Hansen, Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	LKML

On 23/01/2018 10:30, David Woodhouse wrote:

> Skylake takes predictions from the generic branch target buffer when
> the RSB underflows.

Adding LAKML.

AFAIU, some ARM Cortex cores have the same optimization.
(A9 maybe, A17 probably, some recent 64-bit cores)

Are there software work-arounds for Spectre planned for arm32 and arm64?

Regards.

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation
  2018-01-25 16:19                     ` Mason
@ 2018-01-25 17:16                       ` Greg Kroah-Hartman
  2018-01-29 11:59                         ` Mason
  0 siblings, 1 reply; 143+ messages in thread
From: Greg Kroah-Hartman @ 2018-01-25 17:16 UTC (permalink / raw)
  To: Mason
  Cc: Linux ARM, David Woodhouse, Ingo Molnar, Linus Torvalds,
	KarimAllah Ahmed, Andi Kleen, Andrea Arcangeli, Andy Lutomirski,
	Arjan van de Ven, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	LKML

On Thu, Jan 25, 2018 at 05:19:04PM +0100, Mason wrote:
> On 23/01/2018 10:30, David Woodhouse wrote:
> 
> > Skylake takes predictions from the generic branch target buffer when
> > the RSB underflows.
> 
> Adding LAKML.
> 
> AFAIU, some ARM Cortex cores have the same optimization.
> (A9 maybe, A17 probably, some recent 64-bit cores)
> 
> Are there software work-arounds for Spectre planned for arm32 and arm64?

Yes, I think they are currently buried in one of the arm64 trees, and
they have been posted to the mailing list a few times in the past.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 04/10] x86/mm: Only flush indirect branches when switching into non dumpable process
  2018-01-21 11:22   ` Peter Zijlstra
  2018-01-21 12:04     ` David Woodhouse
  2018-01-21 16:21     ` Ingo Molnar
@ 2018-01-29  6:35     ` Jon Masters
  2018-01-29 14:07       ` Peter Zijlstra
  2 siblings, 1 reply; 143+ messages in thread
From: Jon Masters @ 2018-01-29  6:35 UTC (permalink / raw)
  To: Peter Zijlstra, KarimAllah Ahmed
  Cc: linux-kernel, Andi Kleen, Andrea Arcangeli, Andy Lutomirski,
	Arjan van de Ven, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, David Woodhouse, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Linus Torvalds,
	Masami Hiramatsu, Paolo Bonzini, Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm, x86

Hi Peter, David, all,

First a quick note on David's earlier comment, about this optimization
being still up for debate. The problem with this optimization as-is is
that it doesn't protect userspace-to-userspace unless applications are
rebuilt and we get the infrastructure to handle that (ELF, whatever).

But...

On 01/21/2018 06:22 AM, Peter Zijlstra wrote:
> On Sat, Jan 20, 2018 at 08:22:55PM +0100, KarimAllah Ahmed wrote:
>> From: Tim Chen <tim.c.chen@linux.intel.com>
>>
>> Flush indirect branches when switching into a process that marked
>> itself non dumpable.  This protects high value processes like gpg
>> better, without having too high performance overhead.
> 
> So if I understand it right, this is only needed if the 'other'
> executable itself is susceptible to spectre. If say someone audited gpg
> for spectre-v1 and built it with retpoline, it would be safe to not
> issue the IBPB, right?

More importantly, rebuilding the world introduces a lot of challenges
that need to be discussed heavily before they happen (I would like to
see someone run a session at one of the various upcoming events on
userspace, I've already prodded a few people to nudge that forward). In
particular, we don't have the infrastructure in gcc/glibc to dynamically
patch userspace call sites to enable/disable retpolines.

We discussed nasty hacks last year (I even suggested an ugly kernel
exported page similar to VDSO that could be implementation patched for
different uarches), but the bottom line is there isn't anything in place
to provide a similar userspace experience to what the kernel can do, and
that would need to be solved in addition to the ELF/ABI bits.

> So would it make sense to provide an ELF flag / personality thing such
> that userspace can indicate it's spectre-safe?
> 
> I realize that this is all future work, because so far auditing for v1
> is a lot of pain (we need better tools), but would it be something that
> makes sense in the longer term?

So I would just caution that doing this isn't necessarily bad, but it's
far more than just ELF bits and rebuilding. Once userspace is rebuilt
with un-nopable retpolines, they're there whether you need them on
$future_hardware or not, and that fancy branch predictor is useless. So
we really need a way to allow for userspace patchable calls, or at least
some kind of plan before everyone runs away with rebuilding.

(unless they're embedded/Gentoo/whatever...have fun in that case)

Jon.

P.S. This is why for certain downstream distros you'll see IBPB use like
prior to this patch - it'll prevent certain attacks that can't be
otherwise mitigated without going and properly solving the tools issue.

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation
  2018-01-25 17:16                       ` Greg Kroah-Hartman
@ 2018-01-29 11:59                         ` Mason
  0 siblings, 0 replies; 143+ messages in thread
From: Mason @ 2018-01-29 11:59 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: LKML, Linux ARM, Marc Zyngier, Will Deacon, Arnd Bergmann

[ Dropping large CC list ]

On 25/01/2018 18:16, Greg Kroah-Hartman wrote:

> On Thu, Jan 25, 2018 at 05:19:04PM +0100, Mason wrote:
> 
>> On 23/01/2018 10:30, David Woodhouse wrote:
>>
>>> Skylake takes predictions from the generic branch target buffer when
>>> the RSB underflows.
>>
>> Adding LAKML.
>>
>> AFAIU, some ARM Cortex cores have the same optimization.
>> (A9 maybe, A17 probably, some recent 64-bit cores)
>>
>> Are there software work-arounds for Spectre planned for arm32 and arm64?
> 
> Yes, I think they are currently buried in one of the arm64 trees, and
> they have been posted to the mailing list a few times in the past.

Found the burial ground, thanks Greg.

  https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=kpti

Via https://developer.arm.com/support/security-update

"For Cortex-R8, Cortex-A8, Cortex-A9, and Cortex-A17, invalidate
the branch predictor using a BPIALL instruction."

The latest arm32 patch series was submitted recently:

  https://www.spinics.net/lists/arm-kernel/msg630892.html

Regards.

* Re: [RFC 04/10] x86/mm: Only flush indirect branches when switching into non dumpable process
  2018-01-29  6:35     ` Jon Masters
@ 2018-01-29 14:07       ` Peter Zijlstra
  0 siblings, 0 replies; 143+ messages in thread
From: Peter Zijlstra @ 2018-01-29 14:07 UTC (permalink / raw)
  To: Jon Masters
  Cc: KarimAllah Ahmed, linux-kernel, Andi Kleen, Andrea Arcangeli,
	Andy Lutomirski, Arjan van de Ven, Ashok Raj, Asit Mallick,
	Borislav Petkov, Dan Williams, Dave Hansen, David Woodhouse,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm, x86

On Mon, Jan 29, 2018 at 01:35:30AM -0500, Jon Masters wrote:
> > So if I understand it right, this is only needed if the 'other'
> > executable itself is susceptible to spectre. If say someone audited gpg
> > for spectre-v1 and built it with retpoline, it would be safe to not
> > issue the IBPB, right?
> 
> More importantly, rebuilding the world introduces a lot of challenges
> that need to be discussed heavily before they happen (I would like to
> see someone run a session at one of the various upcoming events on
> userspace, I've already prodded a few people to nudge that forward). In
> particular, we don't have the infrastructure in gcc/glibc to dynamically
> patch userspace call sites to enable/disable retpolines.

GCC/GLIBC do in fact have some infrastructure for this; see
target_clones/ifunc function attributes. We can at (dynamic) link time
select between alternative functions.

With this we could select different retpoline thunks for different
systems, much like what we end up doing for the kernel.
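
A minimal sketch of the ifunc mechanism mentioned above, assuming GCC or
Clang on an ELF/glibc system. The names call_indirect, call_plain,
call_hardened and resolve_call_indirect are hypothetical, and the
"hardened" variant is only a placeholder where a real retpoline thunk
(written in assembly) would go. The CPU check in the resolver is an
arbitrary stand-in for whatever policy would actually be wanted.

```c
#include <assert.h>

/* Signature shared by both hypothetical implementations. */
typedef int (*call_fn)(int (*)(int), int);

/* Plain indirect call: relies on the hardware branch predictor. */
static int call_plain(int (*fn)(int), int arg)
{
	return fn(arg);
}

/*
 * Placeholder for a hardened variant. A real one would route the call
 * through a retpoline thunk; this stub only marks where it would live.
 */
static int call_hardened(int (*fn)(int), int arg)
{
	return fn(arg);
}

/*
 * ifunc resolver: the dynamic loader runs it once while processing
 * relocations and binds call_indirect() to the returned function.
 * The Skylake test is an illustrative assumption, not a recommendation.
 */
static call_fn resolve_call_indirect(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_cpu_init();	/* must run before __builtin_cpu_is() in a resolver */
	if (__builtin_cpu_is("skylake"))
		return call_hardened;
#endif
	return call_plain;
}

int call_indirect(int (*fn)(int), int arg)
	__attribute__((ifunc("resolve_call_indirect")));

/* Example callee. */
static int add_one(int x)
{
	return x + 1;
}

/* Callers never see which variant was bound. */
int demo(void)
{
	int r = call_indirect(add_one, 41);

	assert(r == 42);
	return r;
}
```

The selection happens once per process at load time, so unlike the
kernel's alternatives patching there is no per-call-site rewriting; every
call still goes through the PLT/GOT slot that the resolver filled in.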

> We discussed nasty hacks last year (I even suggested an ugly kernel
> exported page similar to VDSO that could be implementation patched for
> different uarches), but the bottom line is there isn't anything in place
> to provide a similar userspace experience to what the kernel can do, and
> that would need to be solved in addition to the ELF/ABI bits.

Not sure where you discussed what, but I spoke with a bunch of the
facebook people at plumbers about kernel support for (runtime) userspace
patching a-la asm-goto/jump-labels.

And while that would be entirely fun, I don't see how we'd need this
here.

> > So would it make sense to provide an ELF flag / personality thing such
> > that userspace can indicate it's spectre-safe?
> > 
> > I realize that this is all future work, because so far auditing for v1
> > is a lot of pain (we need better tools), but would it be something that
> > makes sense in the longer term?
> 
> So I would just caution that doing this isn't necessarily bad, but it's
> far more than just ELF bits and rebuilding. Once userspace is rebuilt
> with un-nopable retpolines, they're there whether you need them on
> $future_hardware or not, and that fancy branch predictor is useless. So
> we really need a way to allow for userspace patchable calls, or at least
> some kind of plan before everyone runs away with rebuilding.

Just rebuild world again; there are plenty of distros where this is not in
fact a difficult thing to do :-) You just don't happen to work for one
... :-)

* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-20 19:22 ` [RFC 05/10] x86/speculation: Add basic IBRS support infrastructure KarimAllah Ahmed
  2018-01-21 14:31   ` Thomas Gleixner
@ 2018-01-29 20:14   ` Eduardo Habkost
  2018-01-29 20:17     ` David Woodhouse
  2018-01-31 10:03   ` [RFC 05/10] " Christophe de Dinechin
  2 siblings, 1 reply; 143+ messages in thread
From: Eduardo Habkost @ 2018-01-29 20:14 UTC (permalink / raw)
  To: KarimAllah Ahmed
  Cc: linux-kernel, Andi Kleen, Andrea Arcangeli, Andy Lutomirski,
	Arjan van de Ven, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, David Woodhouse, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Linus Torvalds,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm, x86

On Sat, Jan 20, 2018 at 08:22:56PM +0100, KarimAllah Ahmed wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> Not functional yet; just add the handling for it in the Spectre v2
> mitigation selection, and the X86_FEATURE_IBRS flag which will control
> the code to be added in later patches.
> 
> Also take the #ifdef CONFIG_RETPOLINE from around the RSB-stuffing; IBRS
> mode will want that too.
> 
> For now we are auto-selecting IBRS on Skylake. We will probably end up
> changing that but for now let's default to the safest option.
> 
> XX: Do we want a microcode blacklist?
> 
> [karahmed: simplify the switch block and get rid of all the magic]
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
[...]
> +	case SPECTRE_V2_CMD_FORCE:
> +		/*
> +		 * If we have IBRS support, and either Skylake or !RETPOLINE,
> +		 * then that's what we do.
> +		 */
> +		if (boot_cpu_has(X86_FEATURE_SPEC_CTRL) &&
> +		    (is_skylake_era() || !retp_compiler())) {


Sorry for being confused here, as probably the answer is buried
on a LKML thread somewhere.  The comment explains what the code
does, but not why.  Why exactly is IBRS preferred on Skylake?

I'm asking this because I would like to understand the risks
involved when running under a hypervisor exposing CPUID data that
don't match the host CPU.  e.g.: what happens if a VM is migrated
from a Broadwell host to a Skylake host?



> +			mode = SPECTRE_V2_IBRS;
> +			setup_force_cpu_cap(X86_FEATURE_IBRS);
> +			break;
> +		}
> +		/* Fall through */
>  	case SPECTRE_V2_CMD_RETPOLINE:
[...]

-- 
Eduardo

* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-29 20:14   ` [RFC,05/10] " Eduardo Habkost
@ 2018-01-29 20:17     ` David Woodhouse
  2018-01-29 20:42       ` Eduardo Habkost
  0 siblings, 1 reply; 143+ messages in thread
From: David Woodhouse @ 2018-01-29 20:17 UTC (permalink / raw)
  To: Eduardo Habkost, KarimAllah Ahmed
  Cc: linux-kernel, Andi Kleen, Andrea Arcangeli, Andy Lutomirski,
	Arjan van de Ven, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, Greg Kroah-Hartman, H . Peter Anvin,
	Ingo Molnar, Janakarajan Natarajan, Joerg Roedel, Jun Nakajima,
	Laura Abbott, Linus Torvalds, Masami Hiramatsu, Paolo Bonzini,
	Peter Zijlstra, Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm, x86

On Mon, 2018-01-29 at 18:14 -0200, Eduardo Habkost wrote:
> 
> Sorry for being confused here, as probably the answer is buried
> on a LKML thread somewhere.  The comment explains what the code
> does, but not why.  Why exactly is IBRS preferred on Skylake?
> 
> I'm asking this because I would like to understand the risks
> involved when running under a hypervisor exposing CPUID data that
> don't match the host CPU.  e.g.: what happens if a VM is migrated
> from a Broadwell host to a Skylake host?

https://lkml.org/lkml/2018/1/22/598 should cover most of that, I think.

* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-29 20:17     ` David Woodhouse
@ 2018-01-29 20:42       ` Eduardo Habkost
  2018-01-29 20:44         ` Arjan van de Ven
  0 siblings, 1 reply; 143+ messages in thread
From: Eduardo Habkost @ 2018-01-29 20:42 UTC (permalink / raw)
  To: David Woodhouse
  Cc: KarimAllah Ahmed, linux-kernel, Andi Kleen, Andrea Arcangeli,
	Andy Lutomirski, Arjan van de Ven, Ashok Raj, Asit Mallick,
	Borislav Petkov, Dan Williams, Dave Hansen, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Linus Torvalds,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm, x86,
	Dr. David Alan Gilbert

On Mon, Jan 29, 2018 at 08:17:02PM +0000, David Woodhouse wrote:
> On Mon, 2018-01-29 at 18:14 -0200, Eduardo Habkost wrote:
> > 
> > Sorry for being confused here, as probably the answer is buried
> > on a LKML thread somewhere.  The comment explains what the code
> > does, but not why.  Why exactly is IBRS preferred on Skylake?
> > 
> > I'm asking this because I would like to understand the risks
> > involved when running under a hypervisor exposing CPUID data that
> > don't match the host CPU.  e.g.: what happens if a VM is migrated
> > from a Broadwell host to a Skylake host?
> 
> https://lkml.org/lkml/2018/1/22/598 should cover most of that, I think.

Thanks, it does answer some of my questions.

So, it sounds like live-migration of a VM from a non-Skylake to a
Skylake host will make the guest unsafe, unless the guest was
explicitly configured to use IBRS.

In a perfect world, Linux would never look at CPU
family/model/stepping/microcode if running under a hypervisor, to
take any decision.  If Linux knows it's running under a
hypervisor, it would be safer to assume retpolines aren't enough,
unless the hypervisor is telling us otherwise.

The question is how the hypervisor could tell that to the guest.
If Intel doesn't give us a CPUID bit that can be used to tell
that retpolines are enough, maybe we should use a hypervisor
CPUID bit for that?

-- 
Eduardo

* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-29 20:42       ` Eduardo Habkost
@ 2018-01-29 20:44         ` Arjan van de Ven
  2018-01-29 21:02           ` David Woodhouse
  0 siblings, 1 reply; 143+ messages in thread
From: Arjan van de Ven @ 2018-01-29 20:44 UTC (permalink / raw)
  To: Eduardo Habkost, David Woodhouse
  Cc: KarimAllah Ahmed, linux-kernel, Andi Kleen, Andrea Arcangeli,
	Andy Lutomirski, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, Greg Kroah-Hartman, H . Peter Anvin,
	Ingo Molnar, Janakarajan Natarajan, Joerg Roedel, Jun Nakajima,
	Laura Abbott, Linus Torvalds, Masami Hiramatsu, Paolo Bonzini,
	Peter Zijlstra, Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm, x86,
	Dr. David Alan Gilbert

On 1/29/2018 12:42 PM, Eduardo Habkost wrote:
> The question is how the hypervisor could tell that to the guest.
> If Intel doesn't give us a CPUID bit that can be used to tell
> that retpolines are enough, maybe we should use a hypervisor
> CPUID bit for that?

the objective is to have retpoline be safe everywhere and never use IBRS
(Linus was also pretty clear about that) so I'm confused by your question

* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-29 20:44         ` Arjan van de Ven
@ 2018-01-29 21:02           ` David Woodhouse
  2018-01-29 21:37             ` Jim Mattson
                               ` (3 more replies)
  0 siblings, 4 replies; 143+ messages in thread
From: David Woodhouse @ 2018-01-29 21:02 UTC (permalink / raw)
  To: Arjan van de Ven, Eduardo Habkost
  Cc: KarimAllah Ahmed, linux-kernel, Andi Kleen, Andrea Arcangeli,
	Andy Lutomirski, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, Greg Kroah-Hartman, H . Peter Anvin,
	Ingo Molnar, Janakarajan Natarajan, Joerg Roedel, Jun Nakajima,
	Laura Abbott, Linus Torvalds, Masami Hiramatsu, Paolo Bonzini,
	Peter Zijlstra, Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm, x86,
	Dr. David Alan Gilbert

On Mon, 2018-01-29 at 12:44 -0800, Arjan van de Ven wrote:
> On 1/29/2018 12:42 PM, Eduardo Habkost wrote:
> > 
> > The question is how the hypervisor could tell that to the guest.
> > If Intel doesn't give us a CPUID bit that can be used to tell
> > that retpolines are enough, maybe we should use a hypervisor
> > CPUID bit for that?
>
> the objective is to have retpoline be safe everywhere and never use IBRS
> (Linus was also pretty clear about that) so I'm confused by your question

The question is about all the additional RSB-frobbing and call depth
counting and other bits that don't really even exist for Skylake yet in
a coherent form.

If a guest doesn't have those, because it's running some future kernel
where they *are* implemented but not enabled because at *boot* time it
discovered it wasn't on Skylake, the question is what happens if that
guest is subsequently migrated to a Skylake-class machine.

To which the answer is obviously "oops, sucks to be you". So yes,
*maybe* we want a way to advertise "you might be migrated to Skylake"
if you're booted on a pre-SKL box in a migration pool where such is
possible. 

That question is a reasonable one, and the answer possibly the same,
regardless of whether the plan for Skylake is to use IBRS, or all the
hypothetical other extra stuff.

* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-29 21:02           ` David Woodhouse
@ 2018-01-29 21:37             ` Jim Mattson
  2018-01-29 21:50               ` Eduardo Habkost
  2018-01-29 21:37             ` Andi Kleen
                               ` (2 subsequent siblings)
  3 siblings, 1 reply; 143+ messages in thread
From: Jim Mattson @ 2018-01-29 21:37 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Arjan van de Ven, Eduardo Habkost, KarimAllah Ahmed, LKML,
	Andi Kleen, Andrea Arcangeli, Andy Lutomirski, Ashok Raj,
	Asit Mallick, Borislav Petkov, Dan Williams, Dave Hansen,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm list,
	the arch/x86 maintainers, Dr. David Alan Gilbert

For GCE, "you might be migrated to Skylake" is pretty much a
certainty. Even if you're in a zone that doesn't currently have
Skylake machines, chances are pretty good that it will have Skylake
machines some day in the not-too-distant future.

In general, making these kinds of decisions based on F/M/S is probably
unwise when running in a VM.

On Mon, Jan 29, 2018 at 1:02 PM, David Woodhouse <dwmw2@infradead.org> wrote:
>
>
> On Mon, 2018-01-29 at 12:44 -0800, Arjan van de Ven wrote:
>> On 1/29/2018 12:42 PM, Eduardo Habkost wrote:
>> >
>> > The question is how the hypervisor could tell that to the guest.
>> > If Intel doesn't give us a CPUID bit that can be used to tell
>> > that retpolines are enough, maybe we should use a hypervisor
>> > CPUID bit for that?
>>
>> the objective is to have retpoline be safe everywhere and never use IBRS
>> (Linus was also pretty clear about that) so I'm confused by your question
>
> The question is about all the additional RSB-frobbing and call depth
> counting and other bits that don't really even exist for Skylake yet in
> a coherent form.
>
> If a guest doesn't have those, because it's running some future kernel
> where they *are* implemented but not enabled because at *boot* time it
> discovered it wasn't on Skylake, the question is what happens if that
> guest is subsequently migrated to a Skylake-class machine.
>
> To which the answer is obviously "oops, sucks to be you". So yes,
> *maybe* we want a way to advertise "you might be migrated to Skylake"
> if you're booted on a pre-SKL box in a migration pool where such is
> possible.
>
> That question is a reasonable one, and the answer possibly the same,
> regardless of whether the plan for Skylake is to use IBRS, or all the
> hypothetical other extra stuff.

* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-29 21:02           ` David Woodhouse
  2018-01-29 21:37             ` Jim Mattson
@ 2018-01-29 21:37             ` Andi Kleen
  2018-01-29 21:44             ` Eduardo Habkost
  2018-01-30  0:23             ` Linus Torvalds
  3 siblings, 0 replies; 143+ messages in thread
From: Andi Kleen @ 2018-01-29 21:37 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Arjan van de Ven, Eduardo Habkost, KarimAllah Ahmed,
	linux-kernel, Andrea Arcangeli, Andy Lutomirski, Ashok Raj,
	Asit Mallick, Borislav Petkov, Dan Williams, Dave Hansen,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm, x86,
	Dr. David Alan Gilbert

> The question is about all the additional RSB-frobbing and call depth
> counting and other bits that don't really even exist for Skylake yet in
> a coherent form.

We have had several patch kits posted that all are in a "coherent form".

That was the original one:

http://lkml.iu.edu/hypermail/linux/kernel/1801.1/05556.html

and that's the newer one with only interrupt stuffing:

https://marc.info/?l=linux-kernel&m=151674718914504

We don't have generic deep chain handling yet, but everything else
is there.

-Andi

* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-29 21:02           ` David Woodhouse
  2018-01-29 21:37             ` Jim Mattson
  2018-01-29 21:37             ` Andi Kleen
@ 2018-01-29 21:44             ` Eduardo Habkost
  2018-01-29 22:10               ` Konrad Rzeszutek Wilk
  2018-01-30  0:23             ` Linus Torvalds
  3 siblings, 1 reply; 143+ messages in thread
From: Eduardo Habkost @ 2018-01-29 21:44 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Arjan van de Ven, KarimAllah Ahmed, linux-kernel, Andi Kleen,
	Andrea Arcangeli, Andy Lutomirski, Ashok Raj, Asit Mallick,
	Borislav Petkov, Dan Williams, Dave Hansen, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Linus Torvalds,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm, x86,
	Dr. David Alan Gilbert

On Mon, Jan 29, 2018 at 09:02:39PM +0000, David Woodhouse wrote:
> 
> 
> On Mon, 2018-01-29 at 12:44 -0800, Arjan van de Ven wrote:
> > On 1/29/2018 12:42 PM, Eduardo Habkost wrote:
> > > 
> > > The question is how the hypervisor could tell that to the guest.
> > > If Intel doesn't give us a CPUID bit that can be used to tell
> > > that retpolines are enough, maybe we should use a hypervisor
> > > CPUID bit for that?
> >
> > the objective is to have retpoline be safe everywhere and never use IBRS
> > (Linus was also pretty clear about that) so I'm confused by your question
> 
> The question is about all the additional RSB-frobbing and call depth
> counting and other bits that don't really even exist for Skylake yet in
> a coherent form.
> 
> If a guest doesn't have those, because it's running some future kernel
> where they *are* implemented but not enabled because at *boot* time it
> discovered it wasn't on Skylake, the question is what happens if that
> guest is subsequently migrated to a Skylake-class machine.
> 
> To which the answer is obviously "oops, sucks to be you". So yes,
> *maybe* we want a way to advertise "you might be migrated to Skylake"
> if you're booted on a pre-SKL box in a migration pool where such is
> possible. 
> 
> That question is a reasonable one, and the answer possibly the same,
> regardless of whether the plan for Skylake is to use IBRS, or all the
> hypothetical other extra stuff.

Maybe a generic "family/model/stepping/microcode really matches
the CPU you are running on" bit would be useful.  The bit could
be enabled only on host-passthrough (aka "-cpu host") mode.

If we really want to be able to migrate to hosts with different
CPU models (except Skylake), we could add a more specific "we
promise the host CPU is never going to be Skylake" bit.

Now, if the hypervisor is not providing any of those bits, I
would advise against trusting family/model/stepping/microcode
under a hypervisor.  Using a pre-defined CPU model (that doesn't
necessarily match the host) is very common when using KVM VM
management stacks.
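
The guest-side policy described above could be sketched roughly as
follows. This is only an illustration: the fms_accurate and
never_skylake fields stand for the two proposed hypervisor bits, which
do not exist today, and the enum and function names are hypothetical.
Only the "hypervisor present" flag (CPUID.1:ECX[31]) is an existing
bit.

```c
#include <assert.h>
#include <stdbool.h>

enum v2_mitigation {
	V2_RETPOLINE,		/* plain retpoline is considered enough */
	V2_SKYLAKE_SAFE,	/* IBRS, or retpoline plus the extra RSB handling */
};

struct guest_cpu_view {
	bool hypervisor;	/* CPUID.1:ECX[31] is set */
	bool fms_accurate;	/* hypothetical bit: F/M/S matches the physical CPU */
	bool never_skylake;	/* hypothetical bit: host promises no Skylake in the pool */
	bool model_is_skylake;	/* what the (possibly faked) F/M/S claims */
};

static enum v2_mitigation pick_v2_mitigation(const struct guest_cpu_view *c)
{
	/* A reported Skylake model always gets the Skylake treatment. */
	if (c->model_is_skylake)
		return V2_SKYLAKE_SAFE;

	/* On bare metal, or when the hypervisor vouches for F/M/S, trust it. */
	if (!c->hypervisor || c->fms_accurate)
		return V2_RETPOLINE;

	/* Different host models allowed, but never Skylake. */
	if (c->never_skylake)
		return V2_RETPOLINE;

	/* No promise from the hypervisor: assume the worst. */
	return V2_SKYLAKE_SAFE;
}
```

With this ordering, a guest on an older hypervisor that sets neither
bit ends up on the conservative path, which is the "assume the worst"
default argued for in this message.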

-- 
Eduardo

* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-29 21:37             ` Jim Mattson
@ 2018-01-29 21:50               ` Eduardo Habkost
  2018-01-29 22:12                 ` Jim Mattson
  2018-01-29 22:25                 ` Andi Kleen
  0 siblings, 2 replies; 143+ messages in thread
From: Eduardo Habkost @ 2018-01-29 21:50 UTC (permalink / raw)
  To: Jim Mattson
  Cc: David Woodhouse, Arjan van de Ven, KarimAllah Ahmed, LKML,
	Andi Kleen, Andrea Arcangeli, Andy Lutomirski, Ashok Raj,
	Asit Mallick, Borislav Petkov, Dan Williams, Dave Hansen,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm list,
	the arch/x86 maintainers, Dr. David Alan Gilbert

On Mon, Jan 29, 2018 at 01:37:05PM -0800, Jim Mattson wrote:
> For GCE, "you might be migrated to Skylake" is pretty much a
> certainty. Even if you're in a zone that doesn't currently have
> Skylake machines, chances are pretty good that it will have Skylake
> machines some day in the not-too-distant future.

This kind of scenario is why I suggest a "we promise you're not
going to be migrated to Skylake" bit instead of a "you may be
migrated to Skylake" bit.  The hypervisor could prevent migration
to Skylake hosts if management software chose to enable this bit,
and guests would choose the safest option (i.e. assume the worst)
if running on older hypervisors that don't set the bit.

> 
> In general, making these kinds of decisions based on F/M/S is probably
> unwise when running in a VM.

Certainly.  That's why I suggest not trusting f/m/s unless the
hypervisor is explicitly saying it's accurate.

-- 
Eduardo

* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-29 21:44             ` Eduardo Habkost
@ 2018-01-29 22:10               ` Konrad Rzeszutek Wilk
  2018-01-30  1:12                 ` Eduardo Habkost
  0 siblings, 1 reply; 143+ messages in thread
From: Konrad Rzeszutek Wilk @ 2018-01-29 22:10 UTC (permalink / raw)
  To: Eduardo Habkost
  Cc: David Woodhouse, Arjan van de Ven, KarimAllah Ahmed,
	linux-kernel, Andi Kleen, Andrea Arcangeli, Andy Lutomirski,
	Ashok Raj, Asit Mallick, Borislav Petkov, Dan Williams,
	Dave Hansen, Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm, x86,
	Dr. David Alan Gilbert

On Mon, Jan 29, 2018 at 07:44:21PM -0200, Eduardo Habkost wrote:
> On Mon, Jan 29, 2018 at 09:02:39PM +0000, David Woodhouse wrote:
> > 
> > 
> > On Mon, 2018-01-29 at 12:44 -0800, Arjan van de Ven wrote:
> > > On 1/29/2018 12:42 PM, Eduardo Habkost wrote:
> > > > 
> > > > The question is how the hypervisor could tell that to the guest.
> > > > If Intel doesn't give us a CPUID bit that can be used to tell
> > > > that retpolines are enough, maybe we should use a hypervisor
> > > > CPUID bit for that?
> > >
> > > the objective is to have retpoline be safe everywhere and never use IBRS
> > > (Linus was also pretty clear about that) so I'm confused by your question
> > 
> > The question is about all the additional RSB-frobbing and call depth
> > counting and other bits that don't really even exist for Skylake yet in
> > a coherent form.
> > 
> > If a guest doesn't have those, because it's running some future kernel
> > where they *are* implemented but not enabled because at *boot* time it
> > discovered it wasn't on Skylake, the question is what happens if that
> > guest is subsequently migrated to a Skylake-class machine.
> > 
> > To which the answer is obviously "oops, sucks to be you". So yes,
> > *maybe* we want a way to advertise "you might be migrated to Skylake"
> > if you're booted on a pre-SKL box in a migration pool where such is
> > possible. 
> > 
> > That question is a reasonable one, and the answer possibly the same,
> > regardless of whether the plan for Skylake is to use IBRS, or all the
> > hypothetical other extra stuff.
> 
> Maybe a generic "family/model/stepping/microcode really matches
> the CPU you are running on" bit would be useful.  The bit could
> be enabled only on host-passthrough (aka "-cpu host") mode.
> 
> If we really want to be able to migrate to hosts with different
> CPU models (except Skylake), we could add a more specific "we
> promise the host CPU is never going to be Skylake" bit.
> 
> Now, if the hypervisor is not providing any of those bits, I
> would advise against trusting family/model/stepping/microcode
> under a hypervisor.  Using a pre-defined CPU model (that doesn't

The migration code could be 'tickled' (when arrived at the destination)
to recheck the CPUID and do the alternative logic to turn the
proper bits on.

And this tickling could be as simple as an ACPI DSDT/AML code
specific to KVM PnP devices (say the CPUs?) to tell the guest to
resample its environment?

> necessarily match the host) is very common when using KVM VM
> management stacks.
> 
> -- 
> Eduardo

* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-29 21:50               ` Eduardo Habkost
@ 2018-01-29 22:12                 ` Jim Mattson
  2018-01-30  1:22                   ` Eduardo Habkost
  2018-01-29 22:25                 ` Andi Kleen
  1 sibling, 1 reply; 143+ messages in thread
From: Jim Mattson @ 2018-01-29 22:12 UTC (permalink / raw)
  To: Eduardo Habkost
  Cc: David Woodhouse, Arjan van de Ven, KarimAllah Ahmed, LKML,
	Andi Kleen, Andrea Arcangeli, Andy Lutomirski, Ashok Raj,
	Asit Mallick, Borislav Petkov, Dan Williams, Dave Hansen,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm list,
	the arch/x86 maintainers, Dr. David Alan Gilbert

On Mon, Jan 29, 2018 at 1:50 PM, Eduardo Habkost <ehabkost@redhat.com> wrote:
> On Mon, Jan 29, 2018 at 01:37:05PM -0800, Jim Mattson wrote:
>> For GCE, "you might be migrated to Skylake" is pretty much a
>> certainty. Even if you're in a zone that doesn't currently have
>> Skylake machines, chances are pretty good that it will have Skylake
>> machines some day in the not-too-distant future.
>
> This kind of scenario is why I suggest a "we promise you're not
> going to be migrated to Skylake" bit instead of a "you may be
> migrated to Skylake" bit.  The hypervisor could prevent migration
> to Skylake hosts if management software chose to enable this bit,
> and guests would choose the safest option (i.e. assume the worst)
> if running on older hypervisors that don't set the bit.

Giving customers this option promises the logistical nightmare of
provisioning sufficient pre-Skylake-era machines in all pools until
sufficient post-Skylake-era machines can be deployed to replace them.

>> In general, making these kinds of decisions based on F/M/S is probably
>> unwise when running in a VM.
>
> Certainly.  That's why I suggest not trusting f/m/s unless the
> hypervisor is explicitly saying it's accurate.
>
> --
> Eduardo

* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-29 21:50               ` Eduardo Habkost
  2018-01-29 22:12                 ` Jim Mattson
@ 2018-01-29 22:25                 ` Andi Kleen
  2018-01-30  1:37                   ` Eduardo Habkost
  1 sibling, 1 reply; 143+ messages in thread
From: Andi Kleen @ 2018-01-29 22:25 UTC (permalink / raw)
  To: Eduardo Habkost
  Cc: Jim Mattson, David Woodhouse, Arjan van de Ven, KarimAllah Ahmed,
	LKML, Andrea Arcangeli, Andy Lutomirski, Ashok Raj, Asit Mallick,
	Borislav Petkov, Dan Williams, Dave Hansen, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Linus Torvalds,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm list,
	the arch/x86 maintainers, Dr. David Alan Gilbert


I agree with your point that the common hypervisor practice of faking
old model numbers will break some of the workarounds. Hypervisors
may need to revisit their practice.

> > In general, making these kinds of decisions based on F/M/S is probably
> > unwise when running in a VM.
> 
> Certainly.  That's why I suggest not trusting f/m/s unless the
> hypervisor is explicitly saying it's accurate.

This would only be useful if there's a useful result of this
non-trust.

But there isn't. Except for panic there's nothing you could do.
And I don't think panic would be reasonable.

The "Skylake bit" or "not Skylake bit" doesn't make any sense
to me. If a hypervisor wants to enable Skylake workarounds
they need to provide the Skylake model number. If they don't
think they need them because the VM can never be migrated
to Skylake, then they don't need to set that model
number. 

So there isn't any need for inventing any new bits, it's
all already possible.

-Andi

* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-29 21:02           ` David Woodhouse
                               ` (2 preceding siblings ...)
  2018-01-29 21:44             ` Eduardo Habkost
@ 2018-01-30  0:23             ` Linus Torvalds
  2018-01-30  1:03               ` Jim Mattson
                                 ` (6 more replies)
  3 siblings, 7 replies; 143+ messages in thread
From: Linus Torvalds @ 2018-01-30  0:23 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Arjan van de Ven, Eduardo Habkost, KarimAllah Ahmed,
	Linux Kernel Mailing List, Andi Kleen, Andrea Arcangeli,
	Andy Lutomirski, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, Greg Kroah-Hartman, H . Peter Anvin,
	Ingo Molnar, Janakarajan Natarajan, Joerg Roedel, Jun Nakajima,
	Laura Abbott, Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, KVM list,
	the arch/x86 maintainers, Dr. David Alan Gilbert

On Mon, Jan 29, 2018 at 1:02 PM, David Woodhouse <dwmw2@infradead.org> wrote:
>
> On Mon, 2018-01-29 at 12:44 -0800, Arjan van de Ven wrote:
>>
>> the objective is to have retpoline be safe everywhere and never use IBRS
>> (Linus was also pretty clear about that) so I'm confused by your question

Note on the unhappiness with some of the patches involved: what I do
*not* want to see is the "on every kernel entry" kind of garbage.

So my unhappiness with the intel microcode patches is two-fold:

 (a) the interface is nasty and wrong, and I absolutely detest how Intel did it.

 (b) the write to random MSR's on every kernel entry/exit is wrong

but that doesn't mean that I will necessarily end up NAK'ing every
single IBRS/IBPB patch.

My concern with (a) is that unlike meltdown, the intel work-around
isn't forward-looking, and doesn't have a "we fixed it" bit. Instead,
it has a "we have a nasty workaround that may or may not be horribly
expensive" bit, and isn't all that well-defined.

My dislike of (b) comes from "we have retpoline and various wondrous
RSB filling crud already, we're smarter than that". So it's not that I
refuse any IBRS/IBPB use, I refuse the stupid and _mindless_ kind of
use.

> The question is about all the additional RSB-frobbing and call depth
> counting and other bits that don't really even exist for Skylake yet in
> a coherent form.
>
> If a guest doesn't have those, because it's running some future kernel
> where they *are* implemented but not enabled because at *boot* time it
> discovered it wasn't on Skylake, the question is what happens if that
> guest is subsequently migrated to a Skylake-class machine.

So I actually have a _different_ question to the virtualization
people. This includes the vmware people, but it also obviously
includes the Amazon AWS kind of usage.

When you're a hypervisor (whether vmware or Amazon), why do you even
end up caring about these things so much? You're protected from
meltdown thanks to the virtual environment already having separate
page tables.  And the "big hammer" approach to spectre would seem to
be to just make sure the BTB and RSB are flushed at vmexit time - and
even then you might decide that you really want to just move it to
vmenter time, and only do it if the VM has changed since last time
(per CPU).

Why do you even _care_ about the guest, and how it acts wrt Skylake?
What you should care about is not so much the guests (which do their
own thing) but protect guests from each other, no?

So I'm a bit mystified by some of this discussion within the context
of virtual machines. I think that is separate from any measures that
the guest machine may then decide to partake in.

If you are ever going to migrate to Skylake, I think you should just
always tell the guests that you're running on Skylake. That way the
guests will always assume the worst case situation wrt Spectre.

Maybe that mystification comes from me missing something.

               Linus

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-30  0:23             ` Linus Torvalds
@ 2018-01-30  1:03               ` Jim Mattson
  2018-01-30  3:13                 ` Andi Kleen
  2018-01-30  1:32               ` Arjan van de Ven
                                 ` (5 subsequent siblings)
  6 siblings, 1 reply; 143+ messages in thread
From: Jim Mattson @ 2018-01-30  1:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Woodhouse, Arjan van de Ven, Eduardo Habkost,
	KarimAllah Ahmed, Linux Kernel Mailing List, Andi Kleen,
	Andrea Arcangeli, Andy Lutomirski, Ashok Raj, Asit Mallick,
	Borislav Petkov, Dan Williams, Dave Hansen, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Masami Hiramatsu,
	Paolo Bonzini, Peter Zijlstra, Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, KVM list,
	the arch/x86 maintainers, Dr. David Alan Gilbert

The guest OS is responsible for protecting itself from intra-guest
attacks. The hypervisor can't do that. We want to give the guest OS
the tools it needs to make reasonable decisions about the intra-guest
protections it wants to enable, in an environment where the virtual
processor and the physical processor may not actually have the same
F/M/S (and in fact, where the physical processor may change at any
time).

Right now, we are dealing with one workaround, which is tied to
Skylake-era model numbers. Yes, we could report a Skylake model
number, and Linux guests would use IBRS instead of retpoline. But this
approach doesn't scale. What happens when someone introduces a
workaround tied to some other model numbers?

On Mon, Jan 29, 2018 at 4:23 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, Jan 29, 2018 at 1:02 PM, David Woodhouse <dwmw2@infradead.org> wrote:
>>
>> On Mon, 2018-01-29 at 12:44 -0800, Arjan van de Ven wrote:
>>>
>>> the objective is to have retpoline be safe everywhere and never use IBRS
>>> (Linus was also pretty clear about that) so I'm confused by your question
>
> Note on the unhappiness with some of the patches involved: what I do
> *not* want to see is the "on every kernel entry" kind of garbage.
>
> So my unhappiness with the intel microcode patches is two-fold:
>
>  (a) the interface is nasty and wrong, and I absolutely detest how Intel did it.
>
>  (b) the write to random MSR's on every kernel entry/exit is wrong
>
> but that doesn't mean that I will necessarily end up NAK'ing every
> single IBRS/IBPB patch.
>
> My concern with (a) is that unlike meltdown, the intel work-around
> isn't forward-looking, and doesn't have a "we fixed it" bit. Instead,
> it has a "we have a nasty workaround that may or may not be horribly
> expensive" bit, and isn't all that well-defined.
>
> My dislike of (b) comes from "we have retpoline and various wondrous
> RSB filling crud already, we're smarter than that". So it's not that I
> refuse any IBRS/IBPB use, I refuse the stupid and _mindless_ kind of
> use.
>
>> The question is about all the additional RSB-frobbing and call depth
>> counting and other bits that don't really even exist for Skylake yet in
>> a coherent form.
>>
>> If a guest doesn't have those, because it's running some future kernel
>> where they *are* implemented but not enabled because at *boot* time it
>> discovered it wasn't on Skylake, the question is what happens if that
>> guest is subsequently migrated to a Skylake-class machine.
>
> So I actually have a _different_ question to the virtualization
> people. This includes the vmware people, but it also obviously
> includes the Amazon AWS kind of usage.
>
> When you're a hypervisor (whether vmware or Amazon), why do you even
> end up caring about these things so much? You're protected from
> meltdown thanks to the virtual environment already having separate
> page tables.  And the "big hammer" approach to spectre would seem to
> be to just make sure the BTB and RSB are flushed at vmexit time - and
> even then you might decide that you really want to just move it to
> vmenter time, and only do it if the VM has changed since last time
> (per CPU).
>
> Why do you even _care_ about the guest, and how it acts wrt Skylake?
> What you should care about is not so much the guests (which do their
> own thing) but protect guests from each other, no?
>
> So I'm a bit mystified by some of this discussion within the context
> of virtual machines. I think that is separate from any measures that
> the guest machine may then decide to partake in.
>
> If you are ever going to migrate to Skylake, I think you should just
> always tell the guests that you're running on Skylake. That way the
> guests will always assume the worst case situation wrt Spectre.
>
> Maybe that mystification comes from me missing something.
>
>                Linus

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-29 22:10               ` Konrad Rzeszutek Wilk
@ 2018-01-30  1:12                 ` Eduardo Habkost
  0 siblings, 0 replies; 143+ messages in thread
From: Eduardo Habkost @ 2018-01-30  1:12 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: David Woodhouse, Arjan van de Ven, KarimAllah Ahmed,
	linux-kernel, Andi Kleen, Andrea Arcangeli, Andy Lutomirski,
	Ashok Raj, Asit Mallick, Borislav Petkov, Dan Williams,
	Dave Hansen, Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm, x86,
	Dr. David Alan Gilbert

On Mon, Jan 29, 2018 at 05:10:11PM -0500, Konrad Rzeszutek Wilk wrote:
[...]
> The migration code could be 'tickled' (when arrived at the destination)
> to recheck the CPUID and do the alternative logic to turn the
> proper bits on.
> 
> And this tickling could be as simple as an ACPI DSDT/AML code
> specific to KVM PnP devices (say the CPUs?) to tell the guest to
> resample its environment?

This would be nice to have for other CPU features, but if I
understood a previous message from Andi on this thread correctly,
it wouldn't be useful for the Spectre mitigations.

-- 
Eduardo

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-29 22:12                 ` Jim Mattson
@ 2018-01-30  1:22                   ` Eduardo Habkost
  0 siblings, 0 replies; 143+ messages in thread
From: Eduardo Habkost @ 2018-01-30  1:22 UTC (permalink / raw)
  To: Jim Mattson
  Cc: David Woodhouse, Arjan van de Ven, KarimAllah Ahmed, LKML,
	Andi Kleen, Andrea Arcangeli, Andy Lutomirski, Ashok Raj,
	Asit Mallick, Borislav Petkov, Dan Williams, Dave Hansen,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm list,
	the arch/x86 maintainers, Dr. David Alan Gilbert

On Mon, Jan 29, 2018 at 02:12:02PM -0800, Jim Mattson wrote:
> On Mon, Jan 29, 2018 at 1:50 PM, Eduardo Habkost <ehabkost@redhat.com> wrote:
> > On Mon, Jan 29, 2018 at 01:37:05PM -0800, Jim Mattson wrote:
> >> For GCE, "you might be migrated to Skylake" is pretty much a
> >> certainty. Even if you're in a zone that doesn't currently have
> >> Skylake machines, chances are pretty good that it will have Skylake
> >> machines some day in the not-too-distant future.
> >
> > This kind of scenario is why I suggest a "we promise you're not
> > going to be migrated to Skylake" bit instead of a "you may be
> > migrated to Skylake" bit.  The hypervisor could prevent migration
> > to Skylake hosts if management software chose to enable this bit,
> > and guests would choose the safest option (i.e. assume the worst)
> > if running on older hypervisors that don't set the bit.
> 
> Giving customers this option promises the logistical nightmare of
> provisioning sufficient pre-Skylake-era machines in all pools until
> sufficient post-Skylake-era machines can be deployed to replace them.

If this is not practical, the hypervisor can simply choose to
never make any of those promises to the guest OS.

Never implementing any of those bits is also an option.  But then
guest OSes must be aware that the hypervisor can _not_ promise
that f/m/s matches the host CPU, and can _not_ promise that the
VM will never be migrated to Skylake CPUs.

-- 
Eduardo

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-30  0:23             ` Linus Torvalds
  2018-01-30  1:03               ` Jim Mattson
@ 2018-01-30  1:32               ` Arjan van de Ven
  2018-01-30  3:32                 ` Linus Torvalds
  2018-01-30  8:22               ` David Woodhouse
                                 ` (4 subsequent siblings)
  6 siblings, 1 reply; 143+ messages in thread
From: Arjan van de Ven @ 2018-01-30  1:32 UTC (permalink / raw)
  To: Linus Torvalds, David Woodhouse
  Cc: Eduardo Habkost, KarimAllah Ahmed, Linux Kernel Mailing List,
	Andi Kleen, Andrea Arcangeli, Andy Lutomirski, Ashok Raj,
	Asit Mallick, Borislav Petkov, Dan Williams, Dave Hansen,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, KVM list,
	the arch/x86 maintainers, Dr. David Alan Gilbert

On 1/29/2018 4:23 PM, Linus Torvalds wrote:
>
> Why do you even _care_ about the guest, and how it acts wrt Skylake?
> What you should care about is not so much the guests (which do their
> own thing) but protect guests from each other, no?

the most simple solution is that we set the internal feature bit in Linux
to turn on the "stuff the RSB" workaround if we're on a SKL *or* as a guest in a VM.

The stuffing is not free, but it's also not insane either... so if it's turned on in guests,
the impact is still limited, while bare metal doesn't need it at all
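
A minimal sketch of the rule proposed above, assuming the kernel already knows
its family/model and whether it runs under a hypervisor. The names `cpu_env`,
`is_skylake_model` and `want_rsb_stuffing` are hypothetical and not taken from
the patch series; the model list is an assumption covering the family 6
Skylake and Kaby Lake parts.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical summary of what the kernel learned at boot. */
struct cpu_env {
	uint8_t family;
	uint8_t model;
	bool    is_guest;	/* e.g. derived from the CPUID.1:ECX hypervisor bit */
};

/* Assumed list of Skylake-era family 6 model numbers. */
bool is_skylake_model(const struct cpu_env *env)
{
	if (env->family != 6)
		return false;

	switch (env->model) {
	case 0x4E:	/* Skylake mobile */
	case 0x5E:	/* Skylake desktop */
	case 0x55:	/* Skylake server */
	case 0x8E:	/* Kaby Lake mobile */
	case 0x9E:	/* Kaby Lake desktop */
		return true;
	default:
		return false;
	}
}

/* Stuff the RSB on Skylake, or in any guest, since a guest may be migrated. */
bool want_rsb_stuffing(const struct cpu_env *env)
{
	return env->is_guest || is_skylake_model(env);
}
```

With this rule a bare-metal pre-Skylake machine pays nothing, while every guest
pays the stuffing cost regardless of the model number it was told.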

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-29 22:25                 ` Andi Kleen
@ 2018-01-30  1:37                   ` Eduardo Habkost
  0 siblings, 0 replies; 143+ messages in thread
From: Eduardo Habkost @ 2018-01-30  1:37 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Jim Mattson, David Woodhouse, Arjan van de Ven, KarimAllah Ahmed,
	LKML, Andrea Arcangeli, Andy Lutomirski, Ashok Raj, Asit Mallick,
	Borislav Petkov, Dan Williams, Dave Hansen, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Linus Torvalds,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm list,
	the arch/x86 maintainers, Dr. David Alan Gilbert

On Mon, Jan 29, 2018 at 02:25:12PM -0800, Andi Kleen wrote:
> 
> I agree with your point that the common hypervisor practice to fake
> old model numbers will break some of the workarounds. Hypervisors
> may need to revisit their practice.
> 
> > > In general, making these kinds of decisions based on F/M/S is probably
> > > unwise when running in a VM.
> > 
> > Certainly.  That's why I suggest not trusting f/m/s unless the
> > hypervisor is explicitly saying it's accurate.
> 
> This would be only useful if there's a useful result of this
> non trust.
> 
> But there isn't. Except for panic there's nothing you could do.
> And I don't think panic would be reasonable.

Why isn't it a useful result to enable the Skylake workaround if
unsure about the host CPU?


> 
> The "Skylake bit" or "not Skylake bit" doesn't make any sense
> to me. If a hypervisor wants to enable Skylake workarounds
> they need to provide the Skylake model number. If they don't
> think they need them because the VM can never be migrated
> to Skylake, then they don't need to set that model
> number. 
> 
> So there isn't any need for inventing any new bits, it's
> all already possible.

It's already possible, until we find another bug in another CPU
model that also needs to be worked around.  We can't represent
"please work around bugs in both Skylake and Westmere" in f/m/s.

-- 
Eduardo

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-30  1:03               ` Jim Mattson
@ 2018-01-30  3:13                 ` Andi Kleen
  2018-01-31 15:03                   ` Paolo Bonzini
  0 siblings, 1 reply; 143+ messages in thread
From: Andi Kleen @ 2018-01-30  3:13 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Linus Torvalds, David Woodhouse, Arjan van de Ven,
	Eduardo Habkost, KarimAllah Ahmed, Linux Kernel Mailing List,
	Andrea Arcangeli, Andy Lutomirski, Ashok Raj, Asit Mallick,
	Borislav Petkov, Dan Williams, Dave Hansen, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Masami Hiramatsu,
	Paolo Bonzini, Peter Zijlstra, Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, KVM list,
	the arch/x86 maintainers, Dr. David Alan Gilbert

> Right now, we are dealing with one workaround, which is tied to
> Skylake-era model numbers. Yes, we could report a Skylake model
> number, and Linux guests would use IBRS instead of retpoline. But this

Nobody is planning to use IBRS and Linus has rejected it.

> approach doesn't scale. What happens when someone introduces a
> workaround tied to some other model numbers?

There are already many of those in the tree for other issues and features. 
So far you managed to survive without. Likely that will be true
in the future too.

-Andi

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-30  1:32               ` Arjan van de Ven
@ 2018-01-30  3:32                 ` Linus Torvalds
  2018-01-30 12:04                   ` Eduardo Habkost
  2018-01-30 13:54                   ` Arjan van de Ven
  0 siblings, 2 replies; 143+ messages in thread
From: Linus Torvalds @ 2018-01-30  3:32 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: David Woodhouse, Eduardo Habkost, KarimAllah Ahmed,
	Linux Kernel Mailing List, Andi Kleen, Andrea Arcangeli,
	Andy Lutomirski, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, Greg Kroah-Hartman, H . Peter Anvin,
	Ingo Molnar, Janakarajan Natarajan, Joerg Roedel, Jun Nakajima,
	Laura Abbott, Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, KVM list,
	the arch/x86 maintainers, Dr. David Alan Gilbert

On Mon, Jan 29, 2018 at 5:32 PM, Arjan van de Ven <arjan@linux.intel.com> wrote:
>
> the most simple solution is that we set the internal feature bit in Linux
> to turn on the "stuff the RSB" workaround if we're on a SKL *or* as a guest
> in a VM.

That sounds reasonable.

However, wouldn't it be even better to extend on the current cpuid
model, and actually have some real architectural bits in there?

Maybe it could be a bit in that IA32_ARCH_CAPABILITIES MSR. Say, add a
bit #2 that says "ret falls back on BTB".

Then that bit basically becomes the "Skylake bit". Hmm?

                  Linus
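
A rough sketch of how such a bit could be consumed, assuming the MSR value has
already been read. Bits 0 and 1 are the ones discussed in this thread (the
Meltdown "we fixed it" bit and IBRS_ALL); `ARCH_CAP_RET_USES_BTB` at bit 2 is
purely the hypothetical bit suggested above, and `ret_falls_back_on_btb` is an
illustrative name.

```c
#include <stdbool.h>
#include <stdint.h>

#define MSR_IA32_ARCH_CAPABILITIES	0x10A

#define ARCH_CAP_RDCL_NO	(1ULL << 0)	/* not affected by Meltdown */
#define ARCH_CAP_IBRS_ALL	(1ULL << 1)	/* IBRS can be set once and left on */
#define ARCH_CAP_RET_USES_BTB	(1ULL << 2)	/* HYPOTHETICAL: "ret falls back on BTB" */

/* True when the (hypothetical) bit says RSB underflow consults the BTB. */
bool ret_falls_back_on_btb(uint64_t arch_cap)
{
	return (arch_cap & ARCH_CAP_RET_USES_BTB) != 0;
}
```

A hypervisor that may migrate a guest to Skylake could then set this bit for
the guest without changing the model number it reports.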

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-30  0:23             ` Linus Torvalds
  2018-01-30  1:03               ` Jim Mattson
  2018-01-30  1:32               ` Arjan van de Ven
@ 2018-01-30  8:22               ` David Woodhouse
  2018-01-30 11:35               ` David Woodhouse
                                 ` (3 subsequent siblings)
  6 siblings, 0 replies; 143+ messages in thread
From: David Woodhouse @ 2018-01-30  8:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arjan van de Ven, Eduardo Habkost, KarimAllah Ahmed,
	Linux Kernel Mailing List, Andi Kleen, Andrea Arcangeli,
	Andy Lutomirski, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, Greg Kroah-Hartman, H . Peter Anvin,
	Ingo Molnar, Janakarajan Natarajan, Joerg Roedel, Jun Nakajima,
	Laura Abbott, Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, KVM list,
	the arch/x86 maintainers, Dr. David Alan Gilbert

On Mon, 2018-01-29 at 16:23 -0800, Linus Torvalds wrote:
>   And the "big hammer" approach to spectre would seem to
> be to just make sure the BTB and RSB are flushed at vmexit time - and
> even then you might decide that you really want to just move it to
> vmenter time, and only do it if the VM has changed since last time
> (per CPU).

The IBPB which flushes the BTB is *expensive*; we really want to reduce
the amount we do that. For VM guests it's not so bad — we do it only on
VMPTRLD which is sufficient to ensure it's done between running one
vCPU and the next. And if vCPUs are pinned to pCPUs that means we
basically never do it.
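
The "only when the vCPU changes" idea can be sketched as follows. This is a
userspace model under stated assumptions, not the KVM code: `vcpu_load`,
`pcpu_state` and `barriers_on` are hypothetical names, and the barrier is
replaced by a counter where a real kernel would write the IBPB command bit to
the IA32_PRED_CMD MSR.

```c
#include <stddef.h>

#define NR_PCPUS 4

/* Per physical CPU: which vCPU ran here last, and how many barriers we paid. */
struct pcpu_state {
	const void   *last_vcpu;
	unsigned long barriers;
};

static struct pcpu_state pcpus[NR_PCPUS];

/* Stand-in for the expensive IBPB write; here it only counts. */
static void issue_ibpb(struct pcpu_state *p)
{
	p->barriers++;
}

/* Called when a vCPU is about to run on physical CPU 'cpu'. */
void vcpu_load(unsigned int cpu, const void *vcpu)
{
	struct pcpu_state *p = &pcpus[cpu];

	if (p->last_vcpu != vcpu) {
		issue_ibpb(p);		/* a different vCPU ran here before */
		p->last_vcpu = vcpu;
	}
}

unsigned long barriers_on(unsigned int cpu)
{
	return pcpus[cpu].barriers;
}
```

When a vCPU is pinned, every load after the first finds `last_vcpu` unchanged,
so the barrier count stops growing.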

Even for userspace we've mostly settled on a heuristic where we only do
the IBPB flush for non-dumpable processes, precisely because it's so
expensive.

> Why do you even _care_ about the guest, and how it acts wrt Skylake?
> What you should care about is not so much the guests (which do their
> own thing) but protect guests from each other, no?

Well yes, that's the part we had to fix before anyone was allowed to
sleep. But customers kind of care about security *within* their part
too, and we care about customers. :)

Sure, the cloud *enables* a model where a given VM guest is just a
single-tenant standalone compute job, and the kernel is effectively
just a library to provide services to the application. In some sense
it's all about the app, and you might as well be using uCLinux from the
security point of view. So *some* (perhaps even *many*) guests don't
need to care.

But there are still plenty who *do* need to care, for various reasons.

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-30  0:23             ` Linus Torvalds
                                 ` (2 preceding siblings ...)
  2018-01-30  8:22               ` David Woodhouse
@ 2018-01-30 11:35               ` David Woodhouse
  2018-01-30 11:56               ` Dr. David Alan Gilbert
                                 ` (2 subsequent siblings)
  6 siblings, 0 replies; 143+ messages in thread
From: David Woodhouse @ 2018-01-30 11:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arjan van de Ven, Eduardo Habkost, KarimAllah Ahmed,
	Linux Kernel Mailing List, Andi Kleen, Andrea Arcangeli,
	Andy Lutomirski, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, Greg Kroah-Hartman, H . Peter Anvin,
	Ingo Molnar, Janakarajan Natarajan, Joerg Roedel, Jun Nakajima,
	Laura Abbott, Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, KVM list,
	the arch/x86 maintainers, Dr. David Alan Gilbert

On Mon, 2018-01-29 at 16:23 -0800, Linus Torvalds wrote:
> 
> Note on the unhappiness with some of the patches involved: what I do
> *not* want to see is the "on every kernel entry" kind of garbage.
> 
> So my unhappiness with the intel microcode patches is two-fold:
> 
>  (a) the interface is nasty and wrong, and I absolutely detest how Intel did it.
> 
>  (b) the write to random MSR's on every kernel entry/exit is wrong
> 
> but that doesn't mean that I will necessarily end up NAK'ing every
> single IBRS/IBPB patch.
> 
> My concern with (a) is that unlike meltdown, the intel work-around
> isn't forward-looking, and doesn't have a "we fixed it" bit. Instead,
> it has a "we have a nasty workaround that may or may not be horribly
> expensive" bit, and isn't all that well-defined.

The lack of a "we fixed it" bit is certainly problematic.

But as an interim hack for the upcoming hardware, IBRS_ALL isn't so
badly defined. Sure, the reassurances about performance all got ripped
out before the document saw the light of day — quelle surprise? — but
my understanding is that it *will* be fast. It is expected to be fast
enough that we can ALTERNATIVE away the retpolines, set it once and
leave it set.

The reason it isn't just a "we fixed it" bit is because we'll still
need the IBPB on context/vCPU switches.

I suspect they managed to tag BTB entries with VMX mode and ring, but
*not* the full VMID/PCID tagging (and associated automatic flushing)
that they'd need to truly say "we fixed it".

I seriously hope they're working on a complete fix for the subsequent
generation, and just neglected to mention it in their public
documentation that far in advance.

> My dislike of (b) comes from "we have retpoline and various wondrous
> RSB filling crud already, we're smarter than that". So it's not that I
> refuse any IBRS/IBPB use, I refuse the stupid and _mindless_ kind of
> use.

Well... for Skylake we probably need something like Ingo's cunning plan
to abuse function tracing to count call depth. I won't be utterly
shocked if, by the time we have all that pulled together, it ends up
being fairly much as fugly as the IBRS version — for less complete
protection. But we'll see. :)

It may also be that some of the last remaining holes can be declared
just too unlikely for us to jump through fugly hoops for. In fact that
*has* to be our answer for the SMI issue if we're not using IBRS on
Skylake, so now it's just a question of degree — how many of the
*other* theoretical holes are we happy to do the same thing for?

That's a genuine question, not a rhetorical device arguing for IBRS. I
just haven't seen a clear analysis, other than some hand-waving, of how
feasible some of those attack vectors really are. I'd like to.

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-30  0:23             ` Linus Torvalds
                                 ` (3 preceding siblings ...)
  2018-01-30 11:35               ` David Woodhouse
@ 2018-01-30 11:56               ` Dr. David Alan Gilbert
  2018-01-30 12:11               ` Christian Borntraeger
  2018-01-30 20:46               ` Alan Cox
  6 siblings, 0 replies; 143+ messages in thread
From: Dr. David Alan Gilbert @ 2018-01-30 11:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Woodhouse, Arjan van de Ven, Eduardo Habkost,
	KarimAllah Ahmed, Linux Kernel Mailing List, Andi Kleen,
	Andrea Arcangeli, Andy Lutomirski, Ashok Raj, Asit Mallick,
	Borislav Petkov, Dan Williams, Dave Hansen, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Masami Hiramatsu,
	Paolo Bonzini, Peter Zijlstra, Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, KVM list,
	the arch/x86 maintainers

* Linus Torvalds (torvalds@linux-foundation.org) wrote:

> Why do you even _care_ about the guest, and how it acts wrt Skylake?
> What you should care about is not so much the guests (which do their
> own thing) but protect guests from each other, no?
> 
> So I'm a bit mystified by some of this discussion within the context
> of virtual machines. I think that is separate from any measures that
> the guest machine may then decide to partake in.

Because you'd never want to be the cause of the guest making the wrong
decision and thus being less secure than it was on real hardware.

> If you are ever going to migrate to Skylake, I think you should just
> always tell the guests that you're running on Skylake. That way the
> guests will always assume the worst case situation wrt Spectre.

Say you've got a pile of Ivybridge, all running lots of VMs,
the guests see that they're running on Ivybridge.
Now you need some more hosts, so you buy the latest Skylake boxes,
and add them into your cluster.  Previously it was fine to live
migrate a VM to the Skylake box and the VM still sees it's running
Ivybridge; and you can migrate that VM back and forward.
The rule was that as long as the CPU type you told the guest was
old enough then it could migrate to any newer box.

You can't tell the VMs running on Ivybridge they're running on Skylake
otherwise they'll start trying to use Skylake features
(OK, they should be checking flags, but that's a separate story).

Dave


> Maybe that mystification comes from me missing something.
> 
>                Linus
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-30  3:32                 ` Linus Torvalds
@ 2018-01-30 12:04                   ` Eduardo Habkost
  2018-01-30 13:54                   ` Arjan van de Ven
  1 sibling, 0 replies; 143+ messages in thread
From: Eduardo Habkost @ 2018-01-30 12:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Arjan van de Ven, David Woodhouse, KarimAllah Ahmed,
	Linux Kernel Mailing List, Andi Kleen, Andrea Arcangeli,
	Andy Lutomirski, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, Greg Kroah-Hartman, H . Peter Anvin,
	Ingo Molnar, Janakarajan Natarajan, Joerg Roedel, Jun Nakajima,
	Laura Abbott, Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, KVM list,
	the arch/x86 maintainers, Dr. David Alan Gilbert

On Mon, Jan 29, 2018 at 07:32:06PM -0800, Linus Torvalds wrote:
> On Mon, Jan 29, 2018 at 5:32 PM, Arjan van de Ven <arjan@linux.intel.com> wrote:
> >
> > the most simple solution is that we set the internal feature bit in Linux
> > to turn on the "stuff the RSB" workaround if we're on a SKL *or* as a guest
> > in a VM.
> 
> That sounds reasonable.
> 
> However, wouldn't it be even better to extend on the current cpuid
> model, and actually have some real architectural bits in there.

If Intel could do that, it would be great.


> 
> Maybe it could be a bit in that IA32_ARCH_CAPABILITIES MSR. Say, add a
> bit #2 that says "ret falls back on BTB".
> 
> Then that bit basically becomes the "Skylake bit". Hmm?

Yes.  But note that the OS needs to be able to differentiate "old
Skylake that doesn't support the new bit" from "newer Skylake
that doesn't fall back on BTB".  That's why I suggest a
"non-Skylake bit" instead of a "Skylake bit".

-- 
Eduardo
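
The polarity argument above can be illustrated with a small sketch. Both bit
positions and both function names are hypothetical. The point is only what a
guest concludes when an old hypervisor or old microcode reports all zeroes.

```c
#include <stdbool.h>
#include <stdint.h>

#define CAP_RET_USES_BTB	(1ULL << 2)	/* hypothetical "Skylake bit" */
#define CAP_RET_NEVER_BTB	(1ULL << 3)	/* hypothetical "non-Skylake bit" */

/* "Skylake bit": zero means "not affected", or "nobody told me". */
bool stuff_rsb_with_skylake_bit(uint64_t caps)
{
	return (caps & CAP_RET_USES_BTB) != 0;
}

/* "Non-Skylake bit": zero means "assume the worst". */
bool stuff_rsb_with_non_skylake_bit(uint64_t caps)
{
	return (caps & CAP_RET_NEVER_BTB) == 0;
}
```

With a capability word of zero, the first variant skips the workaround and the
second enables it, which is the safer default for a guest on an older host.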

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-30  0:23             ` Linus Torvalds
                                 ` (4 preceding siblings ...)
  2018-01-30 11:56               ` Dr. David Alan Gilbert
@ 2018-01-30 12:11               ` Christian Borntraeger
  2018-01-30 14:46                 ` Christophe de Dinechin
  2018-01-30 20:46               ` Alan Cox
  6 siblings, 1 reply; 143+ messages in thread
From: Christian Borntraeger @ 2018-01-30 12:11 UTC (permalink / raw)
  To: Linus Torvalds, David Woodhouse
  Cc: Arjan van de Ven, Eduardo Habkost, KarimAllah Ahmed,
	Linux Kernel Mailing List, Andi Kleen, Andrea Arcangeli,
	Andy Lutomirski, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, Greg Kroah-Hartman, H . Peter Anvin,
	Ingo Molnar, Janakarajan Natarajan, Joerg Roedel, Jun Nakajima,
	Laura Abbott, Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, KVM list,
	the arch/x86 maintainers, Dr. David Alan Gilbert



On 01/30/2018 01:23 AM, Linus Torvalds wrote:
[...]
> 
> So I actually have a _different_ question to the virtualization
> people. This includes the vmware people, but it also obviously
> includes the Amazon AWS kind of usage.
> 
> When you're a hypervisor (whether vmware or Amazon), why do you even
> end up caring about these things so much? You're protected from
> meltdown thanks to the virtual environment already having separate
> page tables.  And the "big hammer" approach to spectre would seem to
> be to just make sure the BTB and RSB are flushed at vmexit time - and
> even then you might decide that you really want to just move it to
> vmenter time, and only do it if the VM has changed since last time
> (per CPU).
> 
> Why do you even _care_ about the guest, and how it acts wrt Skylake?
> What you should care about is not so much the guests (which do their
> own thing) but protect guests from each other, no?
> 
> So I'm a bit mystified by some of this discussion within the context
> of virtual machines. I think that is separate from any measures that
> the guest machine may then decide to partake in.
> 
> If you are ever going to migrate to Skylake, I think you should just
> always tell the guests that you're running on Skylake. That way the
> guests will always assume the worst case situation wrt Spectre.
> 
> Maybe that mystification comes from me missing something.

I can only speak for KVM, but I think the hypervisor issues come from
the fact that for migration purposes the hypervisor "lies" to the guest
in regard to what kind of CPU is running.  (it has to lie, see below).

This is to avoid random guest crashes by not announcing features. For
example, if you want to migrate back and forth between a system that
has AVX512 and another one that does not, you must tell the guest that
AVX512 is not available - even if it runs on the capable system.

To protect against new features the hypervisor only announces features
that it understands.
So you essentially start a VM in QEMU of a given CPU type that is
constructed of a base cpu type plus extra features. Before migration,
it is checked whether the target system can run a guest of the given type -
otherwise migration is rejected.

The management stack also knows things like baselining - basically
creating the best possible guest CPU given a set of hosts.
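
As a rough illustration of the baselining and migration check described
above (a sketch only, not QEMU or libvirt code; the feature bits and the
function names are hypothetical and chosen just for this example):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature bits, for illustration only. */
#define FEAT_SSE42   (UINT64_C(1) << 0)
#define FEAT_AVX2    (UINT64_C(1) << 1)
#define FEAT_AVX512  (UINT64_C(1) << 2)

/*
 * Baselining: the best guest CPU that every host can run is the
 * intersection of the host feature sets.
 */
static uint64_t baseline_features(const uint64_t *hosts, size_t n)
{
	uint64_t common;
	size_t i;

	assert(n > 0);
	common = hosts[0];
	for (i = 1; i < n; i++)
		common &= hosts[i];
	return common;
}

/*
 * Migration check: the target must offer every feature that was
 * announced to the guest, otherwise migration is rejected.
 */
static bool can_migrate(uint64_t guest, uint64_t target)
{
	return (guest & ~target) == 0;
}
```

With one host that has AVX512 and one that does not, the baseline drops
AVX512, and a guest started with AVX512 announced cannot move to the
host that lacks it.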

The problem now is: say you have Broadwells and Skylakes.
What kind of CPU type do you tell your guest? If you claim
Broadwell but run on Skylake, then you prevent the guest from
protecting itself, because the guest does not know that it should do
something special. If you say Skylake, the guest might start using
features that Broadwell does not understand.

So I think what we have here is that the current (guest) cpu model
for hypervisors was always designed for architectural features.
Presenting microarchitectural knowledge for workarounds does
not seem to be well integrated into hypervisors.


PS: For a list of potential cpus/features look at
https://libvirt.org/git/?p=libvirt.git;a=blob;f=src/cpu/cpu_map.xml

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-30  3:32                 ` Linus Torvalds
  2018-01-30 12:04                   ` Eduardo Habkost
@ 2018-01-30 13:54                   ` Arjan van de Ven
  1 sibling, 0 replies; 143+ messages in thread
From: Arjan van de Ven @ 2018-01-30 13:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Woodhouse, Eduardo Habkost, KarimAllah Ahmed,
	Linux Kernel Mailing List, Andi Kleen, Andrea Arcangeli,
	Andy Lutomirski, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, Greg Kroah-Hartman, H . Peter Anvin,
	Ingo Molnar, Janakarajan Natarajan, Joerg Roedel, Jun Nakajima,
	Laura Abbott, Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, KVM list,
	the arch/x86 maintainers, Dr. David Alan Gilbert

On 1/29/2018 7:32 PM, Linus Torvalds wrote:
> On Mon, Jan 29, 2018 at 5:32 PM, Arjan van de Ven <arjan@linux.intel.com> wrote:
>>
>> the most simple solution is that we set the internal feature bit in Linux
>> to turn on the "stuff the RSB" workaround is we're on a SKL *or* as a guest
>> in a VM.
> 
> That sounds reasonable.
> 
> However, wouldn't it be even better to extend on the current cpuid
> model, and actually have some real architectural bits in there.
> 
> Maybe it could be a bit in that IA32_ARCH_CAPABILITIES MSR. Say, add a
> bit #2 that says "ret falls back on BTB".
> 
> Then that bit basically becomes the "Skylake bit". Hmm?

we can try to do that, but existing systems don't have that, and then we
get in another long thread here about weird lists of stuff ;-)
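
To make the two options concrete, here is a minimal sketch of how the
decision could be combined. It is not kernel code: the struct, the
function name, and the meaning given to bit #2 are assumptions taken
only from the proposal quoted above, and the fallback covers systems
that do not have such a bit:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical: bit #2 of IA32_ARCH_CAPABILITIES meaning
 * "ret falls back on BTB", as proposed in this thread.
 */
#define ARCH_CAP_RET_USES_BTB	(UINT64_C(1) << 2)

/* Illustrative view of what the kernel knows at boot. */
struct cpu_view {
	bool is_skylake_era;	/* family/model match, as in is_skylake_era() */
	bool is_guest;		/* running under a hypervisor */
	bool has_arch_caps;	/* IA32_ARCH_CAPABILITIES is enumerated */
	uint64_t arch_caps;	/* MSR value, valid only if has_arch_caps */
};

static bool want_rsb_stuffing(const struct cpu_view *c)
{
	/* Prefer the architectural bit where it exists. */
	if (c->has_arch_caps && (c->arch_caps & ARCH_CAP_RET_USES_BTB))
		return true;

	/*
	 * Existing systems lack the bit: fall back to "SKL or guest",
	 * since a guest cannot trust the CPU model it was told.
	 */
	return c->is_skylake_era || c->is_guest;
}
```

The fallback errs on the side of stuffing the RSB in every guest, which
costs some performance on hosts that would not need it.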

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-30 12:11               ` Christian Borntraeger
@ 2018-01-30 14:46                 ` Christophe de Dinechin
  2018-01-30 14:52                   ` Christian Borntraeger
  0 siblings, 1 reply; 143+ messages in thread
From: Christophe de Dinechin @ 2018-01-30 14:46 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Linus Torvalds, David Woodhouse, Arjan van de Ven,
	Eduardo Habkost, KarimAllah Ahmed, Linux Kernel Mailing List,
	Andi Kleen, Andrea Arcangeli, Andy Lutomirski, Ashok Raj,
	Asit Mallick, Borislav Petkov, Dan Williams, Dave Hansen,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, KVM list,
	the arch/x86 maintainers, Dr. David Alan Gilbert



> On 30 Jan 2018, at 13:11, Christian Borntraeger <borntraeger@de.ibm.com> wrote:
> 
> 
> 
> On 01/30/2018 01:23 AM, Linus Torvalds wrote:
> [...]
>> 
>> So I actually have a _different_ question to the virtualization
>> people. This includes the vmware people, but it also obviously
>> incldues the Amazon AWS kind of usage.
>> 
>> When you're a hypervisor (whether vmware or Amazon), why do you even
>> end up caring about these things so much? You're protected from
>> meltdown thanks to the virtual environment already having separate
>> page tables.  And the "big hammer" approach to spectre would seem to
>> be to just make sure the BTB and RSB are flushed at vmexit time - and
>> even then you might decide that you really want to just move it to
>> vmenter time, and only do it if the VM has changed since last time
>> (per CPU).
>> 
>> Why do you even _care_ about the guest, and how it acts wrt Skylake?
>> What you should care about is not so much the guests (which do their
>> own thing) but protect guests from each other, no?
>> 
>> So I'm a bit mystified by some of this discussion within the context
>> of virtual machines. I think that is separate from any measures that
>> the guest machine may then decide to partake in.
>> 
>> If you are ever going to migrate to Skylake, I think you should just
>> always tell the guests that you're running on Skylake. That way the
>> guests will always assume the worst case situation wrt Specte.
>> 
>> Maybe that mystification comes from me missing something.
> 
> I can only speak for KVM, but I think the hypervisor issues come from
> the fact that for migration purposes the hypervisor "lies" to the guest
> in regard to what kind of CPU is running.  (it has to lie, see below).
> 
> This is to avoid random guest crashes by not announcing features. For
> example if you want to migrate forth and back between a system that
> has AVX512 and another one that has not you must tell the guest that
> AVX512 is not available - even if it runs on the capable system.
> 
> To protect against new features the hypervisor only announces features
> that it understands.
> So you essentially start a VM in QEMU of a given CPU type that is
> constructed of a base cpu type plus extra features. Before migration, 
> it is checked if  he target system can run a guest of given type - 
> otherwise migration is rejected. 
> 
> The management stack also knows things like baselining - basically
> creating the best possible guest CPU given a set of hosts.
> 
> The problem now is: If you have lets say Broadwell and Skylakes.
> What kind of CPU type are you telling your guest? If you claim
> broadwell but run on skylake then you prevent that the guest can 
> protect itself, because the guest does not know that it should do 
> something special. If you say skylake the guest might start using
> features that broadwell does not understand.

I believe that Linus’ question was whether it makes sense to defer
the entirety of the protection to the host kernel, although I was a bit
confused by his suggestion to always assume Skylake.

In other words, is it safe enough to rely on the host kernel countermeasure
to protect guest kernels and their applications? In which case having
the guest believe it runs on Broadwell would not be that problematic.

Aren’t there enough vmexits on the guest kernel context switch
to enforce protection on its behalf? Even if it’s

a) some old kernel without mitigation code

or

b) some new kernel that thinks it runs on an old CPU and has disabled mitigation


Christophe


> 
> So I think what we have here is that the current (guest) cpu model
> for hypervisors was always designed for architectural features.
> Presenting a microarchitectural knowledge for workarounds does
> not seem to be well integrated into hypervisors.
> 
> 
> PS: For a list of potential cpus/features look at
> https://libvirt.org/git/?p=libvirt.git;a=blob;f=src/cpu/cpu_map.xml
> 

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-30 14:46                 ` Christophe de Dinechin
@ 2018-01-30 14:52                   ` Christian Borntraeger
  2018-01-30 14:56                     ` Christophe de Dinechin
  0 siblings, 1 reply; 143+ messages in thread
From: Christian Borntraeger @ 2018-01-30 14:52 UTC (permalink / raw)
  To: Christophe de Dinechin
  Cc: Linus Torvalds, David Woodhouse, Arjan van de Ven,
	Eduardo Habkost, KarimAllah Ahmed, Linux Kernel Mailing List,
	Andi Kleen, Andrea Arcangeli, Andy Lutomirski, Ashok Raj,
	Asit Mallick, Borislav Petkov, Dan Williams, Dave Hansen,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, KVM list,
	the arch/x86 maintainers, Dr. David Alan Gilbert



On 01/30/2018 03:46 PM, Christophe de Dinechin wrote:
> 
> 
>> On 30 Jan 2018, at 13:11, Christian Borntraeger <borntraeger@de.ibm.com> wrote:
>>
>>
>>
>> On 01/30/2018 01:23 AM, Linus Torvalds wrote:
>> [...]
>>>
>>> So I actually have a _different_ question to the virtualization
>>> people. This includes the vmware people, but it also obviously
>>> incldues the Amazon AWS kind of usage.
>>>
>>> When you're a hypervisor (whether vmware or Amazon), why do you even
>>> end up caring about these things so much? You're protected from
>>> meltdown thanks to the virtual environment already having separate
>>> page tables.  And the "big hammer" approach to spectre would seem to
>>> be to just make sure the BTB and RSB are flushed at vmexit time - and
>>> even then you might decide that you really want to just move it to
>>> vmenter time, and only do it if the VM has changed since last time
>>> (per CPU).
>>>
>>> Why do you even _care_ about the guest, and how it acts wrt Skylake?
>>> What you should care about is not so much the guests (which do their
>>> own thing) but protect guests from each other, no?
>>>
>>> So I'm a bit mystified by some of this discussion within the context
>>> of virtual machines. I think that is separate from any measures that
>>> the guest machine may then decide to partake in.
>>>
>>> If you are ever going to migrate to Skylake, I think you should just
>>> always tell the guests that you're running on Skylake. That way the
>>> guests will always assume the worst case situation wrt Specte.
>>>
>>> Maybe that mystification comes from me missing something.
>>
>> I can only speak for KVM, but I think the hypervisor issues come from
>> the fact that for migration purposes the hypervisor "lies" to the guest
>> in regard to what kind of CPU is running.  (it has to lie, see below).
>>
>> This is to avoid random guest crashes by not announcing features. For
>> example if you want to migrate forth and back between a system that
>> has AVX512 and another one that has not you must tell the guest that
>> AVX512 is not available - even if it runs on the capable system.
>>
>> To protect against new features the hypervisor only announces features
>> that it understands.
>> So you essentially start a VM in QEMU of a given CPU type that is
>> constructed of a base cpu type plus extra features. Before migration, 
>> it is checked if  he target system can run a guest of given type - 
>> otherwise migration is rejected. 
>>
>> The management stack also knows things like baselining - basically
>> creating the best possible guest CPU given a set of hosts.
>>
>> The problem now is: If you have lets say Broadwell and Skylakes.
>> What kind of CPU type are you telling your guest? If you claim
>> broadwell but run on skylake then you prevent that the guest can 
>> protect itself, because the guest does not know that it should do 
>> something special. If you say skylake the guest might start using
>> features that broadwell does not understand.
> 
> I believe that Linus’ question was whether it makes sense to defer
> the entirety of the protection to the host kernel, although I was a bit
> confused by his suggestion to always assume Skylake.
> 
> In other words, is it safe enough to rely on the host kernel countermeasure
> to protect guest kernels and their applications? In which case having
> the guest believe it runs on Broadwell would not be that problematic.
> 
> Aren’t there enough vmexits on the guest kernel context switch
> to enforce protection on its behalf? Even if it’s
> 
> a) some old kernel that without mitigation code
> 
> or
> 
> b) some new kernel that thinks it runs on an old CPU and disabled mitigation
> 
I think it is not safe to just protect the host. A CPU-bound workload in the guest
will switch a lot between guest user and guest kernel without triggering an
exit.

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-30 14:52                   ` Christian Borntraeger
@ 2018-01-30 14:56                     ` Christophe de Dinechin
  2018-01-30 15:33                       ` Christian Borntraeger
  0 siblings, 1 reply; 143+ messages in thread
From: Christophe de Dinechin @ 2018-01-30 14:56 UTC (permalink / raw)
  To: Christian Borntraeger
  Cc: Christophe de Dinechin, Linus Torvalds, David Woodhouse,
	Arjan van de Ven, Eduardo Habkost, KarimAllah Ahmed,
	Linux Kernel Mailing List, Andi Kleen, Andrea Arcangeli,
	Andy Lutomirski, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, Greg Kroah-Hartman, H . Peter Anvin,
	Ingo Molnar, Janakarajan Natarajan, Joerg Roedel, Jun Nakajima,
	Laura Abbott, Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, KVM list,
	the arch/x86 maintainers, Dr. David Alan Gilbert



> On 30 Jan 2018, at 15:52, Christian Borntraeger <borntraeger@de.ibm.com> wrote:
> 
> 
> 
> On 01/30/2018 03:46 PM, Christophe de Dinechin wrote:
>> 
>> 
>>> On 30 Jan 2018, at 13:11, Christian Borntraeger <borntraeger@de.ibm.com> wrote:
>>> 
>>> 
>>> 
>>> On 01/30/2018 01:23 AM, Linus Torvalds wrote:
>>> [...]
>>>> 
>>>> So I actually have a _different_ question to the virtualization
>>>> people. This includes the vmware people, but it also obviously
>>>> incldues the Amazon AWS kind of usage.
>>>> 
>>>> When you're a hypervisor (whether vmware or Amazon), why do you even
>>>> end up caring about these things so much? You're protected from
>>>> meltdown thanks to the virtual environment already having separate
>>>> page tables.  And the "big hammer" approach to spectre would seem to
>>>> be to just make sure the BTB and RSB are flushed at vmexit time - and
>>>> even then you might decide that you really want to just move it to
>>>> vmenter time, and only do it if the VM has changed since last time
>>>> (per CPU).
>>>> 
>>>> Why do you even _care_ about the guest, and how it acts wrt Skylake?
>>>> What you should care about is not so much the guests (which do their
>>>> own thing) but protect guests from each other, no?
>>>> 
>>>> So I'm a bit mystified by some of this discussion within the context
>>>> of virtual machines. I think that is separate from any measures that
>>>> the guest machine may then decide to partake in.
>>>> 
>>>> If you are ever going to migrate to Skylake, I think you should just
>>>> always tell the guests that you're running on Skylake. That way the
>>>> guests will always assume the worst case situation wrt Specte.
>>>> 
>>>> Maybe that mystification comes from me missing something.
>>> 
>>> I can only speak for KVM, but I think the hypervisor issues come from
>>> the fact that for migration purposes the hypervisor "lies" to the guest
>>> in regard to what kind of CPU is running.  (it has to lie, see below).
>>> 
>>> This is to avoid random guest crashes by not announcing features. For
>>> example if you want to migrate forth and back between a system that
>>> has AVX512 and another one that has not you must tell the guest that
>>> AVX512 is not available - even if it runs on the capable system.
>>> 
>>> To protect against new features the hypervisor only announces features
>>> that it understands.
>>> So you essentially start a VM in QEMU of a given CPU type that is
>>> constructed of a base cpu type plus extra features. Before migration, 
>>> it is checked if  he target system can run a guest of given type - 
>>> otherwise migration is rejected. 
>>> 
>>> The management stack also knows things like baselining - basically
>>> creating the best possible guest CPU given a set of hosts.
>>> 
>>> The problem now is: If you have lets say Broadwell and Skylakes.
>>> What kind of CPU type are you telling your guest? If you claim
>>> broadwell but run on skylake then you prevent that the guest can 
>>> protect itself, because the guest does not know that it should do 
>>> something special. If you say skylake the guest might start using
>>> features that broadwell does not understand.
>> 
>> I believe that Linus’ question was whether it makes sense to defer
>> the entirety of the protection to the host kernel, although I was a bit
>> confused by his suggestion to always assume Skylake.
>> 
>> In other words, is it safe enough to rely on the host kernel countermeasure
>> to protect guest kernels and their applications? In which case having
>> the guest believe it runs on Broadwell would not be that problematic.
>> 
>> Aren’t there enough vmexits on the guest kernel context switch
>> to enforce protection on its behalf? Even if it’s
>> 
>> a) some old kernel that without mitigation code
>> 
>> or
>> 
>> b) some new kernel that thinks it runs on an old CPU and disabled mitigation
>> 
> I think it is not safe to just protect the host. CPU bound workload in the guest
> will switch a lot between guest user and guest kernel without triggering an
> exit.

But that’s only if the guest does not take any page faults. Is it possible to run any
of the known approaches to spectre and meltdown without ever faulting?
If the workload is not faulting, then it’s reading only stuff it’s allowed to, isn’t it?


Christophe

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-30 14:56                     ` Christophe de Dinechin
@ 2018-01-30 15:33                       ` Christian Borntraeger
  0 siblings, 0 replies; 143+ messages in thread
From: Christian Borntraeger @ 2018-01-30 15:33 UTC (permalink / raw)
  To: Christophe de Dinechin
  Cc: Linus Torvalds, David Woodhouse, Arjan van de Ven,
	Eduardo Habkost, KarimAllah Ahmed, Linux Kernel Mailing List,
	Andi Kleen, Andrea Arcangeli, Andy Lutomirski, Ashok Raj,
	Asit Mallick, Borislav Petkov, Dan Williams, Dave Hansen,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, KVM list,
	the arch/x86 maintainers, Dr. David Alan Gilbert



On 01/30/2018 03:56 PM, Christophe de Dinechin wrote:
> 
> 
>> On 30 Jan 2018, at 15:52, Christian Borntraeger <borntraeger@de.ibm.com> wrote:
>>
>>
>>
>> On 01/30/2018 03:46 PM, Christophe de Dinechin wrote:
>>>
>>>
>>>> On 30 Jan 2018, at 13:11, Christian Borntraeger <borntraeger@de.ibm.com> wrote:
>>>>
>>>>
>>>>
>>>> On 01/30/2018 01:23 AM, Linus Torvalds wrote:
>>>> [...]
>>>>>
>>>>> So I actually have a _different_ question to the virtualization
>>>>> people. This includes the vmware people, but it also obviously
>>>>> incldues the Amazon AWS kind of usage.
>>>>>
>>>>> When you're a hypervisor (whether vmware or Amazon), why do you even
>>>>> end up caring about these things so much? You're protected from
>>>>> meltdown thanks to the virtual environment already having separate
>>>>> page tables.  And the "big hammer" approach to spectre would seem to
>>>>> be to just make sure the BTB and RSB are flushed at vmexit time - and
>>>>> even then you might decide that you really want to just move it to
>>>>> vmenter time, and only do it if the VM has changed since last time
>>>>> (per CPU).
>>>>>
>>>>> Why do you even _care_ about the guest, and how it acts wrt Skylake?
>>>>> What you should care about is not so much the guests (which do their
>>>>> own thing) but protect guests from each other, no?
>>>>>
>>>>> So I'm a bit mystified by some of this discussion within the context
>>>>> of virtual machines. I think that is separate from any measures that
>>>>> the guest machine may then decide to partake in.
>>>>>
>>>>> If you are ever going to migrate to Skylake, I think you should just
>>>>> always tell the guests that you're running on Skylake. That way the
>>>>> guests will always assume the worst case situation wrt Specte.
>>>>>
>>>>> Maybe that mystification comes from me missing something.
>>>>
>>>> I can only speak for KVM, but I think the hypervisor issues come from
>>>> the fact that for migration purposes the hypervisor "lies" to the guest
>>>> in regard to what kind of CPU is running.  (it has to lie, see below).
>>>>
>>>> This is to avoid random guest crashes by not announcing features. For
>>>> example if you want to migrate forth and back between a system that
>>>> has AVX512 and another one that has not you must tell the guest that
>>>> AVX512 is not available - even if it runs on the capable system.
>>>>
>>>> To protect against new features the hypervisor only announces features
>>>> that it understands.
>>>> So you essentially start a VM in QEMU of a given CPU type that is
>>>> constructed of a base cpu type plus extra features. Before migration, 
>>>> it is checked if  he target system can run a guest of given type - 
>>>> otherwise migration is rejected. 
>>>>
>>>> The management stack also knows things like baselining - basically
>>>> creating the best possible guest CPU given a set of hosts.
>>>>
>>>> The problem now is: If you have lets say Broadwell and Skylakes.
>>>> What kind of CPU type are you telling your guest? If you claim
>>>> broadwell but run on skylake then you prevent that the guest can 
>>>> protect itself, because the guest does not know that it should do 
>>>> something special. If you say skylake the guest might start using
>>>> features that broadwell does not understand.
>>>
>>> I believe that Linus’ question was whether it makes sense to defer
>>> the entirety of the protection to the host kernel, although I was a bit
>>> confused by his suggestion to always assume Skylake.
>>>
>>> In other words, is it safe enough to rely on the host kernel countermeasure
>>> to protect guest kernels and their applications? In which case having
>>> the guest believe it runs on Broadwell would not be that problematic.
>>>
>>> Aren’t there enough vmexits on the guest kernel context switch
>>> to enforce protection on its behalf? Even if it’s
>>>
>>> a) some old kernel that without mitigation code
>>>
>>> or
>>>
>>> b) some new kernel that thinks it runs on an old CPU and disabled mitigation
>>>
>> I think it is not safe to just protect the host. CPU bound workload in the guest
>> will switch a lot between guest user and guest kernel without triggering an
>> exit.
> 
> But that’s only if the guest does not take any page faults. Is it possible to run any
> of the known approaches to spectre and meltdown without ever faulting?

Sure, after you have faulted in everything you can still flush the cache without refaulting.
And if you need a fault, it will be a GUEST fault - no hypervisor involvement.
Everything else would be too slow and is pre-NPT.


> If the workload is not faulting, then it’s reading only stuff it’s allowed to, isn’t it?


The point is: The hypervisor will not try to fix the guest userspace against guest kernel space
or other guest userspaces. This is clearly the task of the guest operating system (you are
also not asking the hypervisor to build a guest kpti if the guest is too old).
The hypervisor's task is to isolate guests against other guests and against the host.
At the same time the hypervisor will try to _enable_ the guest to also protect itself.

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-30  0:23             ` Linus Torvalds
                                 ` (5 preceding siblings ...)
  2018-01-30 12:11               ` Christian Borntraeger
@ 2018-01-30 20:46               ` Alan Cox
  2018-01-31 10:05                 ` Christophe de Dinechin
  6 siblings, 1 reply; 143+ messages in thread
From: Alan Cox @ 2018-01-30 20:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Woodhouse, Arjan van de Ven, Eduardo Habkost,
	KarimAllah Ahmed, Linux Kernel Mailing List, Andi Kleen,
	Andrea Arcangeli, Andy Lutomirski, Ashok Raj, Asit Mallick,
	Borislav Petkov, Dan Williams, Dave Hansen, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Masami Hiramatsu,
	Paolo Bonzini, Peter Zijlstra, Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, KVM list,
	the arch/x86 maintainers, Dr. David Alan Gilbert

> If you are ever going to migrate to Skylake, I think you should just
> always tell the guests that you're running on Skylake. That way the
> guests will always assume the worst case situation wrt Specte.

Unfortunately if you do that then the guest may also decide to use other
Skylake hardware features and pop its clogs when it finds out it's actually
running on Westmere or SandyBridge.

So you need to be able to both lie to the OS and user space via cpuid and
also have a second 'but do skylake protections' flag that only mitigation
aware software knows about.
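
A minimal sketch of that split, from the hypervisor side. All names
here are hypothetical and no existing interface is implied. The point
it tries to show is that the two channels combine differently across a
migration pool: features are intersected, the mitigation hint is OR-ed:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one host in the migration pool. */
struct host_cpu {
	uint64_t features;		/* what the host can really do */
	bool needs_skylake_protections;	/* microarchitectural property */
};

/* Hypothetical view presented to the guest. */
struct guest_cpu {
	uint64_t cpuid_features;	/* the "lie" told via cpuid */
	bool do_skylake_protections;	/* side channel for mitigation-aware software */
};

static struct guest_cpu build_guest_view(const struct host_cpu *pool, size_t n)
{
	struct guest_cpu g = {
		.cpuid_features = ~UINT64_C(0),
		.do_skylake_protections = false,
	};
	size_t i;

	for (i = 0; i < n; i++) {
		/* Announce only what every host supports. */
		g.cpuid_features &= pool[i].features;
		/* Assume the worst case if any host needs it. */
		if (pool[i].needs_skylake_protections)
			g.do_skylake_protections = true;
	}
	return g;
}
```

A guest built this way never sees a Skylake-only feature, yet still
learns that it may land on a host where the extra protections matter.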

Alan

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC 05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-20 19:22 ` [RFC 05/10] x86/speculation: Add basic IBRS support infrastructure KarimAllah Ahmed
  2018-01-21 14:31   ` Thomas Gleixner
  2018-01-29 20:14   ` [RFC,05/10] " Eduardo Habkost
@ 2018-01-31 10:03   ` Christophe de Dinechin
  2 siblings, 0 replies; 143+ messages in thread
From: Christophe de Dinechin @ 2018-01-31 10:03 UTC (permalink / raw)
  To: KarimAllah Ahmed
  Cc: linux-kernel, Andi Kleen, Andrea Arcangeli, Andy Lutomirski,
	Arjan van de Ven, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, David Woodhouse, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Linus Torvalds,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm, x86


KarimAllah Ahmed writes:

> From: David Woodhouse <dwmw@amazon.co.uk>
>
> Not functional yet; just add the handling for it in the Spectre v2
> mitigation selection, and the X86_FEATURE_IBRS flag which will control
> the code to be added in later patches.
>
> Also take the #ifdef CONFIG_RETPOLINE from around the RSB-stuffing; IBRS
> mode will want that too.
>
> For now we are auto-selecting IBRS on Skylake. We will probably end up
> changing that but for now let's default to the safest option.
>
> XX: Do we want a microcode blacklist?
>
> [karahmed: simplify the switch block and get rid of all the magic]
>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
> ---
>  Documentation/admin-guide/kernel-parameters.txt |   1 +
>  arch/x86/include/asm/cpufeatures.h              |   1 +
>  arch/x86/include/asm/nospec-branch.h            |   2 -
>  arch/x86/kernel/cpu/bugs.c                      | 108 +++++++++++++++---------
>  4 files changed, 68 insertions(+), 44 deletions(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 8122b5f..e597650 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -3932,6 +3932,7 @@
>  			retpoline	  - replace indirect branches
>  			retpoline,generic - google's original retpoline
>  			retpoline,amd     - AMD-specific minimal thunk
> +			ibrs		  - Intel: Indirect Branch Restricted Speculation
>
>  			Not specifying this option is equivalent to
>  			spectre_v2=auto.
> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> index 8ec9588..ae86ad9 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -211,6 +211,7 @@
>  #define X86_FEATURE_AMD_PRED_CMD	( 7*32+17) /* Prediction Command MSR (AMD) */
>  #define X86_FEATURE_MBA			( 7*32+18) /* Memory Bandwidth Allocation */
>  #define X86_FEATURE_RSB_CTXSW		( 7*32+19) /* Fill RSB on context switches */
> +#define X86_FEATURE_IBRS		( 7*32+21) /* Use IBRS for Spectre v2 safety */
>
>  /* Virtualization flags: Linux defined, word 8 */
>  #define X86_FEATURE_TPR_SHADOW		( 8*32+ 0) /* Intel TPR Shadow */
> diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
> index c333c95..8759449 100644
> --- a/arch/x86/include/asm/nospec-branch.h
> +++ b/arch/x86/include/asm/nospec-branch.h
> @@ -205,7 +205,6 @@ extern char __indirect_thunk_end[];
>   */
>  static inline void vmexit_fill_RSB(void)
>  {
> -#ifdef CONFIG_RETPOLINE
>  	unsigned long loops;
>
>  	asm volatile (ANNOTATE_NOSPEC_ALTERNATIVE
> @@ -215,7 +214,6 @@ static inline void vmexit_fill_RSB(void)
>  		      "910:"
>  		      : "=r" (loops), ASM_CALL_CONSTRAINT
>  		      : : "memory" );
> -#endif
>  }
>
>  static inline void indirect_branch_prediction_barrier(void)
> diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
> index 96548ff..1d5e12f 100644
> --- a/arch/x86/kernel/cpu/bugs.c
> +++ b/arch/x86/kernel/cpu/bugs.c
> @@ -79,6 +79,7 @@ enum spectre_v2_mitigation_cmd {
>  	SPECTRE_V2_CMD_RETPOLINE,
>  	SPECTRE_V2_CMD_RETPOLINE_GENERIC,
>  	SPECTRE_V2_CMD_RETPOLINE_AMD,
> +	SPECTRE_V2_CMD_IBRS,
>  };
>
>  static const char *spectre_v2_strings[] = {
> @@ -87,6 +88,7 @@ static const char *spectre_v2_strings[] = {
>  	[SPECTRE_V2_RETPOLINE_MINIMAL_AMD]	= "Vulnerable: Minimal AMD ASM retpoline",
>  	[SPECTRE_V2_RETPOLINE_GENERIC]		= "Mitigation: Full generic retpoline",
>  	[SPECTRE_V2_RETPOLINE_AMD]		= "Mitigation: Full AMD retpoline",
> +	[SPECTRE_V2_IBRS]			= "Mitigation: Indirect Branch Restricted Speculation",
>  };
>
>  #undef pr_fmt
> @@ -132,9 +134,17 @@ static enum spectre_v2_mitigation_cmd __init spectre_v2_parse_cmdline(void)
>  			spec2_print_if_secure("force enabled on command line.");
>  			return SPECTRE_V2_CMD_FORCE;
>  		} else if (match_option(arg, ret, "retpoline")) {
> +			if (!IS_ENABLED(CONFIG_RETPOLINE)) {
> +				pr_err("retpoline selected but not compiled in. Switching to AUTO select\n");
> +				return SPECTRE_V2_CMD_AUTO;
> +			}
>  			spec2_print_if_insecure("retpoline selected on command line.");
>  			return SPECTRE_V2_CMD_RETPOLINE;
>  		} else if (match_option(arg, ret, "retpoline,amd")) {
> +			if (!IS_ENABLED(CONFIG_RETPOLINE)) {
> +				pr_err("retpoline,amd selected but not compiled in. Switching to AUTO select\n");
> +				return SPECTRE_V2_CMD_AUTO;
> +			}
>  			if (boot_cpu_data.x86_vendor != X86_VENDOR_AMD) {
>  				pr_err("retpoline,amd selected but CPU is not AMD. Switching to AUTO select\n");
>  				return SPECTRE_V2_CMD_AUTO;
> @@ -142,8 +152,19 @@ static enum spectre_v2_mitigation_cmd __init spectre_v2_parse_cmdline(void)
>  			spec2_print_if_insecure("AMD retpoline selected on command line.");
>  			return SPECTRE_V2_CMD_RETPOLINE_AMD;
>  		} else if (match_option(arg, ret, "retpoline,generic")) {
> +			if (!IS_ENABLED(CONFIG_RETPOLINE)) {
> +				pr_err("retpoline,generic selected but not compiled in. Switching to AUTO select\n");
> +				return SPECTRE_V2_CMD_AUTO;
> +			}
>  			spec2_print_if_insecure("generic retpoline selected on command line.");
>  			return SPECTRE_V2_CMD_RETPOLINE_GENERIC;
> +		} else if (match_option(arg, ret, "ibrs")) {
> +			if (!boot_cpu_has(X86_FEATURE_SPEC_CTRL)) {
> +				pr_err("IBRS selected but no CPU support. Switching to AUTO select\n");
> +				return SPECTRE_V2_CMD_AUTO;
> +			}
> +			spec2_print_if_insecure("IBRS selected on command line.");
> +			return SPECTRE_V2_CMD_IBRS;
>  		} else if (match_option(arg, ret, "auto")) {
>  			return SPECTRE_V2_CMD_AUTO;
>  		}
> @@ -156,7 +177,7 @@ static enum spectre_v2_mitigation_cmd __init spectre_v2_parse_cmdline(void)
>  	return SPECTRE_V2_CMD_NONE;
>  }
>
> -/* Check for Skylake-like CPUs (for RSB handling) */
> +/* Check for Skylake-like CPUs (for RSB and IBRS handling) */
>  static bool __init is_skylake_era(void)
>  {
>  	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
> @@ -178,55 +199,58 @@ static void __init spectre_v2_select_mitigation(void)
>  	enum spectre_v2_mitigation_cmd cmd = spectre_v2_parse_cmdline();
>  	enum spectre_v2_mitigation mode = SPECTRE_V2_NONE;
>
> -	/*
> -	 * If the CPU is not affected and the command line mode is NONE or AUTO
> -	 * then nothing to do.
> -	 */
> -	if (!boot_cpu_has_bug(X86_BUG_SPECTRE_V2) &&
> -	    (cmd == SPECTRE_V2_CMD_NONE || cmd == SPECTRE_V2_CMD_AUTO))
> -		return;
> -
>  	switch (cmd) {
>  	case SPECTRE_V2_CMD_NONE:
> +		if (boot_cpu_has_bug(X86_BUG_SPECTRE_V2))
> +			pr_err("kernel not compiled with retpoline; no mitigation available!");
>  		return;
> -
> -	case SPECTRE_V2_CMD_FORCE:
> -		/* FALLTRHU */
> -	case SPECTRE_V2_CMD_AUTO:
> -		goto retpoline_auto;
> -
> -	case SPECTRE_V2_CMD_RETPOLINE_AMD:
> -		if (IS_ENABLED(CONFIG_RETPOLINE))
> -			goto retpoline_amd;
> -		break;
> -	case SPECTRE_V2_CMD_RETPOLINE_GENERIC:
> -		if (IS_ENABLED(CONFIG_RETPOLINE))
> -			goto retpoline_generic;
> +	case SPECTRE_V2_CMD_IBRS:
> +		mode = SPECTRE_V2_IBRS;
> +		setup_force_cpu_cap(X86_FEATURE_IBRS);
>  		break;
> +	case SPECTRE_V2_CMD_AUTO:
> +		if (!boot_cpu_has_bug(X86_BUG_SPECTRE_V2))
> +			return;
> +		/* Fall through */
> +	case SPECTRE_V2_CMD_FORCE:
> +		/*
> +		 * If we have IBRS support, and either Skylake or !RETPOLINE,
> +		 * then that's what we do.
> +		 */
> +		if (boot_cpu_has(X86_FEATURE_SPEC_CTRL) &&
> +		    (is_skylake_era() || !retp_compiler())) {

As per Eduardo's comments and follow-ups, it's unclear whether this will
play well under virtualization. I'd suggest putting this check in a
separate function whose name makes it clear that what we care about is
the host CPU, not the guest CPU.

Under virtualization, you may want to force is_skylake_era() to return
true (unless there is a way to get a more precise answer about the
host CPU at that stage?)
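
To make the suggestion concrete, here is a rough, self-contained sketch of
what I have in mind. It is not kernel code: struct cpu_info and both
function names are made up for illustration, and the "assume the worst
when the hypervisor bit is set" policy is only my assumption about what a
pessimistic default could look like. The model numbers are the Skylake and
Kaby Lake ones the existing check looks at.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the kernel's boot CPU data. */
struct cpu_info {
	bool vendor_intel;	/* vendor as reported by CPUID */
	unsigned int family;	/* family as reported by CPUID */
	unsigned int model;	/* model as reported by CPUID */
	bool hypervisor;	/* CPUID "running under a hypervisor" bit */
};

/* Model-based check, in the spirit of is_skylake_era() in the patch. */
static bool cpuid_says_skylake_era(const struct cpu_info *c)
{
	if (!c->vendor_intel || c->family != 6)
		return false;

	switch (c->model) {
	case 0x4e:	/* Skylake mobile */
	case 0x5e:	/* Skylake desktop */
	case 0x55:	/* Skylake-X */
	case 0x8e:	/* Kaby Lake mobile */
	case 0x9e:	/* Kaby Lake desktop */
		return true;
	}
	return false;
}

/*
 * The name says what is being asked: is the *host* possibly Skylake-era?
 * In a guest, family/model may not describe the host, so this sketch
 * assumes the worst case for Intel guests instead of trusting them.
 */
static bool host_may_be_skylake_era(const struct cpu_info *c)
{
	if (c->hypervisor)
		return c->vendor_intel;

	return cpuid_says_skylake_era(c);
}
```

With this shape, a guest that is shown a Broadwell model still takes the
Skylake path, while bare-metal Broadwell does not.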


> +			mode = SPECTRE_V2_IBRS;
> +			setup_force_cpu_cap(X86_FEATURE_IBRS);
> +			break;
> +		}
> +		/* Fall through */

Given the complexity of the decision and the number of fall-through
cases, it's probably a good idea to add some printouts for system
management or debugging.
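
Something along these lines is what I mean by printouts. This is a
standalone sketch of the AUTO/FORCE part of the decision only, with
made-up names (struct v2_inputs, select_auto_mode, report_mode) and
printf standing in for pr_info; the reason strings are placeholders, not
proposed wording.

```c
#include <stdbool.h>
#include <stdio.h>

enum v2_mode { V2_NONE, V2_RETPOLINE, V2_IBRS };

/* Hypothetical bundle of the facts the switch statement consults. */
struct v2_inputs {
	bool has_spec_ctrl;		/* microcode exposes SPEC_CTRL */
	bool skylake_era;		/* result of the Skylake-era check */
	bool retpoline_compiler;	/* compiler can emit retpolines */
	bool retpoline_built;		/* CONFIG_RETPOLINE equivalent */
};

/* Pick a mode and record *why*, so the fall-through path is visible. */
static enum v2_mode select_auto_mode(const struct v2_inputs *in,
				     const char **why)
{
	if (in->has_spec_ctrl &&
	    (in->skylake_era || !in->retpoline_compiler)) {
		*why = in->skylake_era ?
			"IBRS: Skylake-era CPU" :
			"IBRS: compiler lacks retpoline support";
		return V2_IBRS;
	}

	if (in->retpoline_built) {
		*why = "retpoline: IBRS not preferred on this system";
		return V2_RETPOLINE;
	}

	*why = "none: IBRS not selected and retpoline not built in";
	return V2_NONE;
}

/* One line per boot is enough for management tools to pick up. */
static void report_mode(const char *why)
{
	printf("Spectre V2 auto-selection: %s\n", why);
}
```

The point is only that each exit from the chain of fall-throughs leaves a
trace of which condition decided it.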


>  	case SPECTRE_V2_CMD_RETPOLINE:
> -		if (IS_ENABLED(CONFIG_RETPOLINE))
> -			goto retpoline_auto;
> -		break;
> -	}
> -	pr_err("kernel not compiled with retpoline; no mitigation available!");
> -	return;
> +	case SPECTRE_V2_CMD_RETPOLINE_AMD:
> +		if (IS_ENABLED(CONFIG_RETPOLINE) &&
> +		    boot_cpu_data.x86_vendor == X86_VENDOR_AMD) {
> +			if (boot_cpu_has(X86_FEATURE_LFENCE_RDTSC)) {
> +				mode = retp_compiler() ? SPECTRE_V2_RETPOLINE_AMD :
> +							 SPECTRE_V2_RETPOLINE_MINIMAL_AMD;
> +				setup_force_cpu_cap(X86_FEATURE_RETPOLINE_AMD);
> +				setup_force_cpu_cap(X86_FEATURE_RETPOLINE);
> +				break;
> +			}
>
> -retpoline_auto:
> -	if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD) {
> -	retpoline_amd:
> -		if (!boot_cpu_has(X86_FEATURE_LFENCE_RDTSC)) {
>  			pr_err("LFENCE not serializing. Switching to generic retpoline\n");
> -			goto retpoline_generic;
>  		}
> -		mode = retp_compiler() ? SPECTRE_V2_RETPOLINE_AMD :
> -					 SPECTRE_V2_RETPOLINE_MINIMAL_AMD;
> -		setup_force_cpu_cap(X86_FEATURE_RETPOLINE_AMD);
> -		setup_force_cpu_cap(X86_FEATURE_RETPOLINE);
> -	} else {
> -	retpoline_generic:
> -		mode = retp_compiler() ? SPECTRE_V2_RETPOLINE_GENERIC :
> -					 SPECTRE_V2_RETPOLINE_MINIMAL;
> -		setup_force_cpu_cap(X86_FEATURE_RETPOLINE);
> +		/* Fall through */
> +	case SPECTRE_V2_CMD_RETPOLINE_GENERIC:
> +		if (IS_ENABLED(CONFIG_RETPOLINE)) {
> +			mode = retp_compiler() ? SPECTRE_V2_RETPOLINE_GENERIC :
> +						 SPECTRE_V2_RETPOLINE_MINIMAL;
> +			setup_force_cpu_cap(X86_FEATURE_RETPOLINE);
> +			break;
> +		}
> +		/* Fall through */
> +	default:
> +		if (boot_cpu_has_bug(X86_BUG_SPECTRE_V2))
> +			pr_err("kernel not compiled with retpoline; no mitigation available!");
> +		return;
>  	}
>
>  	spectre_v2_enabled = mode;


--
Cheers,
Christophe de Dinechin (IRC c3d)

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-30 20:46               ` Alan Cox
@ 2018-01-31 10:05                 ` Christophe de Dinechin
  2018-01-31 10:15                   ` Thomas Gleixner
  0 siblings, 1 reply; 143+ messages in thread
From: Christophe de Dinechin @ 2018-01-31 10:05 UTC (permalink / raw)
  To: Alan Cox
  Cc: Linus Torvalds, David Woodhouse, Arjan van de Ven,
	Eduardo Habkost, KarimAllah Ahmed, Linux Kernel Mailing List,
	Andi Kleen, Andrea Arcangeli, Andy Lutomirski, Ashok Raj,
	Asit Mallick, Borislav Petkov, Dan Williams, Dave Hansen,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, KVM list,
	the arch/x86 maintainers, Dr. David Alan Gilbert



> On 30 Jan 2018, at 21:46, Alan Cox <gnomes@lxorguk.ukuu.org.uk> wrote:
> 
>> If you are ever going to migrate to Skylake, I think you should just
>> always tell the guests that you're running on Skylake. That way the
>> guests will always assume the worst case situation wrt Spectre.
> 
> Unfortunately if you do that then guest may also decide to use other
> Skylake hardware features and pop its clogs when it finds out it's actually
> running on Westmere or SandyBridge.
> 
> So you need to be able to both lie to the OS and user space via cpuid and
> also have a second 'but do skylake protections' that only mitigation
> aware software knows about.

Yes. The most desirable lie is different depending on whether you want to
allow virtualization features such as migration (where you’d gravitate
towards a CPU with fewer features) or whether you want to allow mitigation
(where you’d rather present the most fragile CPUID, probably Skylake).

Looking at some recent patches, I’m concerned that the code being added
often assumes that the CPUID is the correct way to get that info.
I do not think this is correct. You really want specific information about
the host CPUID, not whatever KVM CPUID emulation makes up.

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-31 10:05                 ` Christophe de Dinechin
@ 2018-01-31 10:15                   ` Thomas Gleixner
  2018-01-31 11:04                     ` Dr. David Alan Gilbert
                                       ` (3 more replies)
  0 siblings, 4 replies; 143+ messages in thread
From: Thomas Gleixner @ 2018-01-31 10:15 UTC (permalink / raw)
  To: Christophe de Dinechin
  Cc: Alan Cox, Linus Torvalds, David Woodhouse, Arjan van de Ven,
	Eduardo Habkost, KarimAllah Ahmed, Linux Kernel Mailing List,
	Andi Kleen, Andrea Arcangeli, Andy Lutomirski, Ashok Raj,
	Asit Mallick, Borislav Petkov, Dan Williams, Dave Hansen,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Tim Chen, Tom Lendacky, KVM list, the arch/x86 maintainers,
	Dr. David Alan Gilbert

On Wed, 31 Jan 2018, Christophe de Dinechin wrote:
> > On 30 Jan 2018, at 21:46, Alan Cox <gnomes@lxorguk.ukuu.org.uk> wrote:
> > 
> >> If you are ever going to migrate to Skylake, I think you should just
> >> always tell the guests that you're running on Skylake. That way the
> >> guests will always assume the worst case situation wrt Spectre.
> > 
> > Unfortunately if you do that then guest may also decide to use other
> > Skylake hardware features and pop its clogs when it finds out it's actually
> > running on Westmere or SandyBridge.
> > 
> > So you need to be able to both lie to the OS and user space via cpuid and
> > also have a second 'but do skylake protections' that only mitigation
> > aware software knows about.
> 
> Yes. The most desirable lie is different depending on whether you want to
> allow virtualization features such as migration (where you’d gravitate
> towards a CPU with fewer features) or whether you want to allow mitigation
> (where you’d rather present the most fragile CPUID, probably Skylake).
> 
> Looking at some recent patches, I’m concerned that the code being added
> often assumes that the CPUID is the correct way to get that info.
> I do not think this is correct. You really want specific information about
> the host CPUID, not whatever KVM CPUID emulation makes up.

That won't cut it. If you have a heterogeneous farm of systems, then you need:

  - All CPUs have to support IBRS/IBPB or at least the hypervisor has to
    pretend they do by providing fake MSRs for that

  - Have a 'force IBRS/IBPB' mechanism so the guests don't discard it due
    to missing CPU feature bits.

Though this gets worse. You have to make sure that the guest keeps _ALL_
sorts of mitigation mechanisms enabled and does not decide to disable
retpolines because IBRS/IBPB are "available".

Good luck with making all that work.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-31 10:15                   ` Thomas Gleixner
@ 2018-01-31 11:04                     ` Dr. David Alan Gilbert
  2018-01-31 11:52                       ` Borislav Petkov
  2018-01-31 11:07                     ` Christophe de Dinechin
                                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 143+ messages in thread
From: Dr. David Alan Gilbert @ 2018-01-31 11:04 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Christophe de Dinechin, Alan Cox, Linus Torvalds,
	David Woodhouse, Arjan van de Ven, Eduardo Habkost,
	KarimAllah Ahmed, Linux Kernel Mailing List, Andi Kleen,
	Andrea Arcangeli, Andy Lutomirski, Ashok Raj, Asit Mallick,
	Borislav Petkov, Dan Williams, Dave Hansen, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Masami Hiramatsu,
	Paolo Bonzini, Peter Zijlstra, Radim Krčmář,
	Tim Chen, Tom Lendacky, KVM list, the arch/x86 maintainers

* Thomas Gleixner (tglx@linutronix.de) wrote:
> On Wed, 31 Jan 2018, Christophe de Dinechin wrote:
> > > On 30 Jan 2018, at 21:46, Alan Cox <gnomes@lxorguk.ukuu.org.uk> wrote:
> > > 
> > >> If you are ever going to migrate to Skylake, I think you should just
> > >> always tell the guests that you're running on Skylake. That way the
> > >> guests will always assume the worst case situation wrt Spectre.
> > > 
> > > Unfortunately if you do that then guest may also decide to use other
> > > Skylake hardware features and pop its clogs when it finds out it's actually
> > > running on Westmere or SandyBridge.
> > > 
> > > So you need to be able to both lie to the OS and user space via cpuid and
> > > also have a second 'but do skylake protections' that only mitigation
> > > aware software knows about.
> > 
> > Yes. The most desirable lie is different depending on whether you want to
> > allow virtualization features such as migration (where you’d gravitate
> > towards a CPU with fewer features) or whether you want to allow mitigation
> > (where you’d rather present the most fragile CPUID, probably Skylake).
> > 
> > Looking at some recent patches, I’m concerned that the code being added
> > often assumes that the CPUID is the correct way to get that info.
> > I do not think this is correct. You really want specific information about
> > the host CPUID, not whatever KVM CPUID emulation makes up.
> 
> That won't cut it. If you have a heterogeneous farm of systems, then you need:
> 
>   - All CPUs have to support IBRS/IBPB or at least the hypervisor has to
>     pretend they do by providing fake MSRs for that
> 
>   - Have a 'force IBRS/IBPB' mechanism so the guests don't discard it due
>     to missing CPU feature bits.

That half is the easy bit, we've already got that (thanks to Eduardo),
QEMU has -IBRS variants of CPU types, so if you start a VM with
-cpu Broadwell-IBRS  it'll get advertised to the guest as having IBRS;
and (with appropriate flags) the management layers will only allow that
to be started on hosts that support IBRS and won't allow migration
between hosts with and without it.
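
The rule the management layer applies is essentially a subset test: every
feature advertised to the guest must exist on any host it starts on or
migrates to. A minimal sketch of that test, with invented bit assignments
and function names (this is not how QEMU or libvirt represent features
internally):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature bits, for illustration only. */
#define FEAT_SPEC_CTRL	(1u << 0)	/* IBRS/IBPB advertised */
#define FEAT_HLE	(1u << 1)
#define FEAT_RTM	(1u << 2)

/* A host can run a guest only if it has every feature the guest sees. */
static bool host_can_run_guest(uint32_t guest_features,
			       uint32_t host_features)
{
	return (guest_features & ~host_features) == 0;
}

/* Migration needs the destination to pass the same test as the source. */
static bool migration_allowed(uint32_t guest_features,
			      uint32_t src_host, uint32_t dst_host)
{
	return host_can_run_guest(guest_features, src_host) &&
	       host_can_run_guest(guest_features, dst_host);
}
```

So a guest started with the IBRS variant can move between two IBRS-capable
hosts, but not to one without the microcode update.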

> Though this gets worse. You have to make sure that the guest keeps _ALL_
> sorts of mitigation mechanisms enabled and does not decide to disable
> retpolines because IBRS/IBPB are "available".

This is what's different with this set; it's all coming down to sets
of heuristics which include CPU model etc, rather than just a 'we've got
a feature, use it'.

Dave

> Good luck with making all that work.
> 
> Thanks,
> 
> 	tglx

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-31 10:15                   ` Thomas Gleixner
  2018-01-31 11:04                     ` Dr. David Alan Gilbert
@ 2018-01-31 11:07                     ` Christophe de Dinechin
  2018-01-31 15:00                     ` Eduardo Habkost
  2018-01-31 15:11                     ` Arjan van de Ven
  3 siblings, 0 replies; 143+ messages in thread
From: Christophe de Dinechin @ 2018-01-31 11:07 UTC (permalink / raw)
  To: Thomas Gleixner, Eduardo Habkost, KarimAllah Ahmed
  Cc: Christophe de Dinechin, Alan Cox, Linus Torvalds,
	David Woodhouse, Arjan van de Ven, Linux Kernel Mailing List,
	Andi Kleen, Andrea Arcangeli, Andy Lutomirski, Ashok Raj,
	Asit Mallick, Borislav Petkov, Dan Williams, Dave Hansen,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Tim Chen, Tom Lendacky, KVM list, the arch/x86 maintainers,
	Dr. David Alan Gilbert



> On 31 Jan 2018, at 11:15, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Wed, 31 Jan 2018, Christophe de Dinechin wrote:
>>> On 30 Jan 2018, at 21:46, Alan Cox <gnomes@lxorguk.ukuu.org.uk> wrote:
>>> 
>>>> If you are ever going to migrate to Skylake, I think you should just
>>>> always tell the guests that you're running on Skylake. That way the
>>>> guests will always assume the worst case situation wrt Spectre.
>>> 
>>> Unfortunately if you do that then guest may also decide to use other
>>> Skylake hardware features and pop its clogs when it finds out it's actually
>>> running on Westmere or SandyBridge.
>>> 
>>> So you need to be able to both lie to the OS and user space via cpuid and
>>> also have a second 'but do skylake protections' that only mitigation
>>> aware software knows about.
>> 
>> Yes. The most desirable lie is different depending on whether you want to
>> allow virtualization features such as migration (where you’d gravitate
>> towards a CPU with fewer features) or whether you want to allow mitigation
>> (where you’d rather present the most fragile CPUID, probably Skylake).
>> 
>> Looking at some recent patches, I’m concerned that the code being added
>> often assumes that the CPUID is the correct way to get that info.
>> I do not think this is correct. You really want specific information about
>> the host CPUID, not whatever KVM CPUID emulation makes up.
> 
> That won't cut it. If you have a heterogeneous farm of systems, then you need:
> 
>  - All CPUs have to support IBRS/IBPB or at least the hypervisor has to
>    pretend they do by providing fake MSRs for that
> 
>  - Have a 'force IBRS/IBPB' mechanism so the guests don't discard it due
>    to missing CPU feature bits.
> 
> Though this gets worse. You have to make sure that the guest keeps _ALL_
> sorts of mitigation mechanisms enabled and does not decide to disable
> retpolines because IBRS/IBPB are "available".

What you are saying is that it’s one thing to test at boot time, but
(at least) migration events should also cause a re-check. Agreed.
The alternative is to pessimistically enable mitigation in VMs.
I believe this is the current “state of the art”, i.e. enable
IBRS statically via a CPU type variant.

What is the best place to re-check anyway?

(Just out of curiosity: there are no non-symmetric systems
that mix CPUs of different generations, right?)


> 
> Good luck with making all that work.

:-)

> 
> Thanks,
> 
> 	tglx

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-31 11:04                     ` Dr. David Alan Gilbert
@ 2018-01-31 11:52                       ` Borislav Petkov
  2018-01-31 12:30                         ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 143+ messages in thread
From: Borislav Petkov @ 2018-01-31 11:52 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Thomas Gleixner, Christophe de Dinechin, Alan Cox,
	Linus Torvalds, David Woodhouse, Arjan van de Ven,
	Eduardo Habkost, KarimAllah Ahmed, Linux Kernel Mailing List,
	Andi Kleen, Andrea Arcangeli, Andy Lutomirski, Ashok Raj,
	Asit Mallick, Dan Williams, Dave Hansen, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Masami Hiramatsu,
	Paolo Bonzini, Peter Zijlstra, Radim Krčmář,
	Tim Chen, Tom Lendacky, KVM list, the arch/x86 maintainers

On Wed, Jan 31, 2018 at 11:04:07AM +0000, Dr. David Alan Gilbert wrote:
> That half is the easy bit, we've already got that (thanks to Eduardo),
> QEMU has -IBRS variants of CPU types, so if you start a VM with
> -cpu Broadwell-IBRS

Eww, a CPU model with a specific feature bit. I hope you guys don't add
a model like that for every CPU feature.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-31 11:52                       ` Borislav Petkov
@ 2018-01-31 12:30                         ` Dr. David Alan Gilbert
  2018-01-31 13:18                           ` Borislav Petkov
  0 siblings, 1 reply; 143+ messages in thread
From: Dr. David Alan Gilbert @ 2018-01-31 12:30 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Thomas Gleixner, Christophe de Dinechin, Alan Cox,
	Linus Torvalds, David Woodhouse, Arjan van de Ven,
	Eduardo Habkost, KarimAllah Ahmed, Linux Kernel Mailing List,
	Andi Kleen, Andrea Arcangeli, Andy Lutomirski, Ashok Raj,
	Asit Mallick, Dan Williams, Dave Hansen, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Masami Hiramatsu,
	Paolo Bonzini, Peter Zijlstra, Radim Krčmář,
	Tim Chen, Tom Lendacky, KVM list, the arch/x86 maintainers

* Borislav Petkov (bp@suse.de) wrote:
> On Wed, Jan 31, 2018 at 11:04:07AM +0000, Dr. David Alan Gilbert wrote:
> > That half is the easy bit, we've already got that (thanks to Eduardo),
> > QEMU has -IBRS variants of CPU types, so if you start a VM with
> > -cpu Broadwell-IBRS
> 
> Eww, a CPU model with a specific feature bit. I hope you guys don't add
> a model like that for every CPU feature.

Indeed, it's only for this weird case where you suddenly need to change
it.

Dave

> -- 
> Regards/Gruss,
>     Boris.
> 
> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
> -- 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-31 12:30                         ` Dr. David Alan Gilbert
@ 2018-01-31 13:18                           ` Borislav Petkov
  2018-01-31 14:04                             ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 143+ messages in thread
From: Borislav Petkov @ 2018-01-31 13:18 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Thomas Gleixner, Christophe de Dinechin, Alan Cox,
	Linus Torvalds, David Woodhouse, Arjan van de Ven,
	Eduardo Habkost, KarimAllah Ahmed, Linux Kernel Mailing List,
	Andi Kleen, Andrea Arcangeli, Andy Lutomirski, Ashok Raj,
	Asit Mallick, Dan Williams, Dave Hansen, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Masami Hiramatsu,
	Paolo Bonzini, Peter Zijlstra, Radim Krčmář,
	Tim Chen, Tom Lendacky, KVM list, the arch/x86 maintainers

On Wed, Jan 31, 2018 at 12:30:36PM +0000, Dr. David Alan Gilbert wrote:
> Indeed, it's only for this weird case where you suddenly need to change
> it.

No, there's more:

	.name = "Broadwell-noTSX",
	.name = "Haswell-noTSX",

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-31 13:18                           ` Borislav Petkov
@ 2018-01-31 14:04                             ` Dr. David Alan Gilbert
  2018-01-31 14:44                               ` Eduardo Habkost
  0 siblings, 1 reply; 143+ messages in thread
From: Dr. David Alan Gilbert @ 2018-01-31 14:04 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Thomas Gleixner, Christophe de Dinechin, Alan Cox,
	Linus Torvalds, David Woodhouse, Arjan van de Ven,
	Eduardo Habkost, KarimAllah Ahmed, Linux Kernel Mailing List,
	Andi Kleen, Andrea Arcangeli, Andy Lutomirski, Ashok Raj,
	Asit Mallick, Dan Williams, Dave Hansen, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Masami Hiramatsu,
	Paolo Bonzini, Peter Zijlstra, Radim Krčmář,
	Tim Chen, Tom Lendacky, KVM list, the arch/x86 maintainers

* Borislav Petkov (bp@suse.de) wrote:
> On Wed, Jan 31, 2018 at 12:30:36PM +0000, Dr. David Alan Gilbert wrote:
> > Indeed, it's only for this weird case where you suddenly need to change
> > it.
> 
> No, there's more:
> 
> 	.name = "Broadwell-noTSX",
> 	.name = "Haswell-noTSX",

Haswell came out and we made the CPU definition, and then got a
microcode update that removed the feature.

So the common feature of noTSX and IBRS is that they're the only two
cases where a CPU has been released and then the flags have changed later.

Dave

> -- 
> Regards/Gruss,
>     Boris.
> 
> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
> -- 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-31 14:04                             ` Dr. David Alan Gilbert
@ 2018-01-31 14:44                               ` Eduardo Habkost
  2018-01-31 16:28                                 ` Borislav Petkov
  0 siblings, 1 reply; 143+ messages in thread
From: Eduardo Habkost @ 2018-01-31 14:44 UTC (permalink / raw)
  To: Dr. David Alan Gilbert
  Cc: Borislav Petkov, Thomas Gleixner, Christophe de Dinechin,
	Alan Cox, Linus Torvalds, David Woodhouse, Arjan van de Ven,
	KarimAllah Ahmed, Linux Kernel Mailing List, Andi Kleen,
	Andrea Arcangeli, Andy Lutomirski, Ashok Raj, Asit Mallick,
	Dan Williams, Dave Hansen, Greg Kroah-Hartman, H . Peter Anvin,
	Ingo Molnar, Janakarajan Natarajan, Joerg Roedel, Jun Nakajima,
	Laura Abbott, Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Tim Chen, Tom Lendacky, KVM list, the arch/x86 maintainers

On Wed, Jan 31, 2018 at 02:04:49PM +0000, Dr. David Alan Gilbert wrote:
> * Borislav Petkov (bp@suse.de) wrote:
> > On Wed, Jan 31, 2018 at 12:30:36PM +0000, Dr. David Alan Gilbert wrote:
> > > Indeed, it's only for this weird case where you suddenly need to change
> > > it.
> > 
> > No, there's more:
> > 
> > 	.name = "Broadwell-noTSX",
> > 	.name = "Haswell-noTSX",
> 
> Haswell came out and we made the CPU definition, and then got a
> microcode update that removed the feature.
> 
> So the common feature of noTSX and IBRS is that they're the only two
> cases where a CPU has been released and then the flags have changed later.

Also, if anybody doesn't like it, users can already specify, e.g.,
"Broadwell,-hle,-rtm" or "Skylake,+spec_ctrl".

QEMU only has the -noTSX and -IBRS CPU models for the convenience of
management systems that don't know how to check/configure
individual CPU features.  We're working with libvirt and
OpenStack folks to make this kind of trick unnecessary.
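
For anyone unfamiliar with the "model,+feature,-feature" form, this is
roughly how such a string resolves to a feature set. It is a toy sketch,
not QEMU's parser: the tables, the bit assignments, and the function
names are all invented here, and the base models carry only the three
features needed for the two examples above.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical feature bits, for illustration only. */
#define F_HLE		(1u << 0)
#define F_RTM		(1u << 1)
#define F_SPEC_CTRL	(1u << 2)

struct named_bits {
	const char *name;
	uint32_t bits;
};

static const struct named_bits features[] = {
	{ "hle", F_HLE }, { "rtm", F_RTM }, { "spec_ctrl", F_SPEC_CTRL },
};

/* Toy base models: TSX present, spec_ctrl absent. */
static const struct named_bits models[] = {
	{ "Broadwell", F_HLE | F_RTM },
	{ "Skylake",   F_HLE | F_RTM },
};

/* Find a name of length len in a table; store its bits on success. */
static bool lookup(const struct named_bits *tab, size_t n,
		   const char *name, size_t len, uint32_t *out)
{
	for (size_t i = 0; i < n; i++) {
		if (strlen(tab[i].name) == len &&
		    memcmp(tab[i].name, name, len) == 0) {
			*out = tab[i].bits;
			return true;
		}
	}
	return false;
}

/* Returns false on an unknown model, unknown feature, or missing sign. */
static bool resolve_cpu_spec(const char *spec, uint32_t *mask)
{
	const char *p = spec;
	size_t len = strcspn(p, ",");
	uint32_t bits;

	if (!lookup(models, sizeof(models) / sizeof(models[0]), p, len, mask))
		return false;
	p += len;

	while (*p == ',') {
		char sign = *++p;

		if (sign != '+' && sign != '-')
			return false;
		p++;
		len = strcspn(p, ",");
		if (!lookup(features, sizeof(features) / sizeof(features[0]),
			    p, len, &bits))
			return false;
		if (sign == '+')
			*mask |= bits;
		else
			*mask &= ~bits;
		p += len;
	}
	return true;
}
```

In this toy model "Broadwell,-hle,-rtm" resolves to the same set as a
noTSX variant would, and "Skylake,+spec_ctrl" to the same set as an IBRS
variant, which is why the named variants are only a convenience.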

-- 
Eduardo

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-31 10:15                   ` Thomas Gleixner
  2018-01-31 11:04                     ` Dr. David Alan Gilbert
  2018-01-31 11:07                     ` Christophe de Dinechin
@ 2018-01-31 15:00                     ` Eduardo Habkost
  2018-01-31 15:11                     ` Arjan van de Ven
  3 siblings, 0 replies; 143+ messages in thread
From: Eduardo Habkost @ 2018-01-31 15:00 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Christophe de Dinechin, Alan Cox, Linus Torvalds,
	David Woodhouse, Arjan van de Ven, KarimAllah Ahmed,
	Linux Kernel Mailing List, Andi Kleen, Andrea Arcangeli,
	Andy Lutomirski, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, Greg Kroah-Hartman, H . Peter Anvin,
	Ingo Molnar, Janakarajan Natarajan, Joerg Roedel, Jun Nakajima,
	Laura Abbott, Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Tim Chen, Tom Lendacky, KVM list, the arch/x86 maintainers,
	Dr. David Alan Gilbert

On Wed, Jan 31, 2018 at 11:15:50AM +0100, Thomas Gleixner wrote:
> On Wed, 31 Jan 2018, Christophe de Dinechin wrote:
> > > On 30 Jan 2018, at 21:46, Alan Cox <gnomes@lxorguk.ukuu.org.uk> wrote:
> > > 
> > >> If you are ever going to migrate to Skylake, I think you should just
> > >> always tell the guests that you're running on Skylake. That way the
> > >> guests will always assume the worst case situation wrt Spectre.
> > > 
> > > Unfortunately if you do that then guest may also decide to use other
> > > Skylake hardware features and pop its clogs when it finds out it's actually
> > > running on Westmere or SandyBridge.
> > > 
> > > So you need to be able to both lie to the OS and user space via cpuid and
> > > also have a second 'but do skylake protections' that only mitigation
> > > aware software knows about.
> > 
> > Yes. The most desirable lie is different depending on whether you want to
> > allow virtualization features such as migration (where you’d gravitate
> > towards a CPU with fewer features) or whether you want to allow mitigation
> > (where you’d rather present the most fragile CPUID, probably Skylake).
> > 
> > Looking at some recent patches, I’m concerned that the code being added
> > often assumes that the CPUID is the correct way to get that info.
> > I do not think this is correct. You really want specific information about
> > the host CPUID, not whatever KVM CPUID emulation makes up.
> 
> That won't cut it. If you have a heterogeneous farm of systems, then you need:
> 
>   - All CPUs have to support IBRS/IBPB or at least the hypervisor has to
>     pretend they do by providing fake MSRs for that
> 
>   - Have a 'force IBRS/IBPB' mechanism so the guests don't discard it due
>     to missing CPU feature bits.

If all your hosts have IBRS/IBPB, you enable it.  If some of your
hosts don't have IBRS/IBPB, you don't expose it to the guest (and
deal with the consequences of not applying updates to your
hardware).  Where's the problem?

> 
> Though this gets worse. You have to make sure that the guest keeps _ALL_
> sorts of mitigation mechanisms enabled and does not decide to disable
> retpolines because IBRS/IBPB are "available".

If IBRS/IBPB are reported as available to the guest, the VM
management system will ensure the VM won't be migrated to a host
that doesn't have it.  That's a pretty basic feature of VM
management stacks.

Exactly the same could happen to a "(non-)skylake bit".  The host
reports a feature (or a bug fix) as available to a guest, and
then the system ensures you won't migrate to a host that doesn't
provide that feature.

The problem I see here is that Linux guests currently have no way
to tell whether they need to enable Skylake-specific mitigations or
not.  Unless you make Linux always enable Skylake mitigations when
it sees the hypervisor bit, you will need the hypervisor to
provide more useful information than f/m/s.

-- 
Eduardo

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-30  3:13                 ` Andi Kleen
@ 2018-01-31 15:03                   ` Paolo Bonzini
  2018-01-31 15:07                     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 143+ messages in thread
From: Paolo Bonzini @ 2018-01-31 15:03 UTC (permalink / raw)
  To: Andi Kleen, Jim Mattson
  Cc: Linus Torvalds, David Woodhouse, Arjan van de Ven,
	Eduardo Habkost, KarimAllah Ahmed, Linux Kernel Mailing List,
	Andrea Arcangeli, Andy Lutomirski, Ashok Raj, Asit Mallick,
	Borislav Petkov, Dan Williams, Dave Hansen, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Masami Hiramatsu,
	Peter Zijlstra, Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, KVM list,
	the arch/x86 maintainers, Dr. David Alan Gilbert

On 29/01/2018 22:13, Andi Kleen wrote:
>> What happens when someone introduces a
>> workaround tied to some other model numbers?
> There are already many of those in the tree for other issues and features. 
> So far you managed to survive without. Likely that will be true
> in the future too.

"Guests have to live with processor fuckups" is actually a much better
answer than "Hypervisors may need to revisit their practice", since at
least it's clear where the blame lies.

Because really it's just plain luck.  It just happens that most errata
are for functionality that is not available to a virtual machine (e.g.
perfmon and monitor workarounds or buggy TSC deadline timer that
hypervisors emulate anyway), that only needs a chicken bit to be set in
the host, or the bugs are there only for old hardware that doesn't have
virtualization (X86_BUG_F00F, X86_BUG_SWAPGS_FENCE).

CPUID flags are guaranteed to never change---never come, never go away.
For anything that doesn't map nicely to a CPUID flag, you cannot really
express it.  Also if something is not architectural, you can pretty much
assume that you cannot know it under virtualization.  f/m/s is not
architectural; family, model and stepping mean absolutely nothing when
running in virtualization, because the host CPU model can change under
your feet at any time.  We force guest vendor == host vendor just
because otherwise too much stuff breaks, but that's it.

Paolo

^ permalink raw reply	[flat|nested] 143+ messages in thread

* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-31 15:03                   ` Paolo Bonzini
@ 2018-01-31 15:07                     ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 143+ messages in thread
From: Dr. David Alan Gilbert @ 2018-01-31 15:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Andi Kleen, Jim Mattson, Linus Torvalds, David Woodhouse,
	Arjan van de Ven, Eduardo Habkost, KarimAllah Ahmed,
	Linux Kernel Mailing List, Andrea Arcangeli, Andy Lutomirski,
	Ashok Raj, Asit Mallick, Borislav Petkov, Dan Williams,
	Dave Hansen, Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Masami Hiramatsu, Peter Zijlstra, Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, KVM list,
	the arch/x86 maintainers

* Paolo Bonzini (pbonzini@redhat.com) wrote:
> On 29/01/2018 22:13, Andi Kleen wrote:
> >> What happens when someone introduces a
> >> workaround tied to some other model numbers?
> > There are already many of those in the tree for other issues and features. 
> > So far you managed to survive without. Likely that will be true
> > in the future too.
> 
> "Guests have to live with processor fuckups" is actually a much better
> answer than "Hypervisors may need to revisit their practice", since at
> least it's clear where the blame lies.
> 
> Because really it's just plain luck.  It just happens that most errata
> are for functionality that is not available to a virtual machine (e.g.
> perfmon and monitor workarounds or buggy TSC deadline timer that
> hypervisors emulate anyway), that only needs a chicken bit to be set in
> the host, or the bugs are there only for old hardware that doesn't have
> virtualization (X86_BUG_F00F, X86_BUG_SWAPGS_FENCE).
> 
> CPUID flags are guaranteed to never change---never come, never go away.
> For anything that doesn't map nicely to a CPUID flag, you cannot really
> express it.  Also if something is not architectural, you can pretty much
> assume that you cannot know it under virtualization.  f/m/s is not
> architectural; family, model and stepping mean absolutely nothing when
> running in virtualization, because the host CPU model can change under
> your feet at any time.  We force guest vendor == host vendor just
> because otherwise too much stuff breaks, but that's it.

In some ways we've been luckiest on x86; my understanding is ARM have a
similar set of architecture-specific errata and aren't really sure
how to expose this to guests either.

Dave

> Paolo
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-31 10:15                   ` Thomas Gleixner
                                       ` (2 preceding siblings ...)
  2018-01-31 15:00                     ` Eduardo Habkost
@ 2018-01-31 15:11                     ` Arjan van de Ven
  3 siblings, 0 replies; 143+ messages in thread
From: Arjan van de Ven @ 2018-01-31 15:11 UTC (permalink / raw)
  To: Thomas Gleixner, Christophe de Dinechin
  Cc: Alan Cox, Linus Torvalds, David Woodhouse, Eduardo Habkost,
	KarimAllah Ahmed, Linux Kernel Mailing List, Andi Kleen,
	Andrea Arcangeli, Andy Lutomirski, Ashok Raj, Asit Mallick,
	Borislav Petkov, Dan Williams, Dave Hansen, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Masami Hiramatsu,
	Paolo Bonzini, Peter Zijlstra, Radim Krčmář,
	Tim Chen, Tom Lendacky, KVM list, the arch/x86 maintainers,
	Dr. David Alan Gilbert

On 1/31/2018 2:15 AM, Thomas Gleixner wrote:

> Good luck with making all that work.

on the Intel side we're checking what we can do that works and doesn't break
things right now; hopefully we just end up with a bit in the arch capabilities
MSR for "you should do RSB stuffing" and then the HV's can emulate that.

(people sometimes think that should be a 5 minute thing, but we need to check
many cpu models/etc to make sure a bit we pick is really free etc which makes
it take longer than some folks have patience for)
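
A minimal sketch of how system software might consume such a bit, assuming it lands in IA32_ARCH_CAPABILITIES (MSR index 0x10a). The bit position and the names `ARCH_CAP_NEEDS_RSB_STUFFING` and `needs_rsb_stuffing()` are hypothetical; no bit had been assigned at the time of this message. The MSR value is passed in as a plain integer so that a hypervisor could hand the guest an emulated value:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical bit in IA32_ARCH_CAPABILITIES (MSR 0x10a) meaning
 * "you should do RSB stuffing". The position is an assumption made
 * purely for illustration.
 */
#define ARCH_CAP_NEEDS_RSB_STUFFING (UINT64_C(1) << 2)

/* Decide on RSB stuffing from the (possibly emulated) MSR value. */
static bool needs_rsb_stuffing(uint64_t arch_caps)
{
	return (arch_caps & ARCH_CAP_NEEDS_RSB_STUFFING) != 0;
}
```

Because the decision depends only on the value the guest reads, a hypervisor can set the bit for a guest that might later migrate to an affected host, regardless of the host it currently runs on.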


* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-31 14:44                               ` Eduardo Habkost
@ 2018-01-31 16:28                                 ` Borislav Petkov
  0 siblings, 0 replies; 143+ messages in thread
From: Borislav Petkov @ 2018-01-31 16:28 UTC (permalink / raw)
  To: Eduardo Habkost
  Cc: Dr. David Alan Gilbert, Thomas Gleixner, Christophe de Dinechin,
	Alan Cox, Linus Torvalds, David Woodhouse, Arjan van de Ven,
	KarimAllah Ahmed, Linux Kernel Mailing List, Andi Kleen,
	Andrea Arcangeli, Andy Lutomirski, Ashok Raj, Asit Mallick,
	Dan Williams, Dave Hansen, Greg Kroah-Hartman, H . Peter Anvin,
	Ingo Molnar, Janakarajan Natarajan, Joerg Roedel, Jun Nakajima,
	Laura Abbott, Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Tim Chen, Tom Lendacky, KVM list, the arch/x86 maintainers

On Wed, Jan 31, 2018 at 12:44:41PM -0200, Eduardo Habkost wrote:
> Also, if anybody doesn't like it, users can already specify, e.g.,
> "Broadwell,-hle,-rtm" or "Skylake,+spec_ctrl".
> 
> QEMU only has the -noTSX and -IBRS CPU models for convenience of
> management systems that don't know how to check/configure
> individual CPU features.  We're working with libvirt and
> OpenStack folks to make this kind of trick unnecessary.

Yeah, defining separate CPU models just for that seems hacky. The
+/-<feature> specification looks like the Right Thing(tm) to do.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 


* Re: [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation
  2018-01-23 10:23                     ` Ingo Molnar
  2018-01-23 10:35                       ` David Woodhouse
@ 2018-02-04 18:43                       ` Thomas Gleixner
  2018-02-04 20:22                         ` David Woodhouse
  2018-02-06  9:14                         ` David Woodhouse
  1 sibling, 2 replies; 143+ messages in thread
From: Thomas Gleixner @ 2018-02-04 18:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Woodhouse, Linus Torvalds, KarimAllah Ahmed,
	Linux Kernel Mailing List, Andi Kleen, Andrea Arcangeli,
	Andy Lutomirski, Arjan van de Ven, Ashok Raj, Asit Mallick,
	Borislav Petkov, Dan Williams, Dave Hansen, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Masami Hiramatsu,
	Paolo Bonzini, Peter Zijlstra, Radim Krčmář,
	Tim Chen, Tom Lendacky, KVM list, the arch/x86 maintainers,
	Arjan Van De Ven


On Tue, 23 Jan 2018, Ingo Molnar wrote:
> * David Woodhouse <dwmw2@infradead.org> wrote:
> 
> > > On SkyLake this would add an overhead of maybe 2-3 cycles per function call and 
> > > obviously all this code and data would be very cache hot. Given that the average 
> > > number of function calls per system call is around a dozen, this would be _much_ 
> > > faster than any microcode/MSR based approach.
> > 
> > That's kind of neat, except you don't want it at the top of the
> > function; you want it at the bottom.
> > 
> > If you could hijack the *return* site, then you could check for
> > underflow and stuff the RSB right there. But in __fentry__ there's not
> > a lot you can do other than complain that something bad is going to
> > happen in the future. You know that a string of 16+ rets is going to
> > happen, but you've got no gadget in *there* to deal with it when it
> > does.
> 
> No, it can be done with the existing CALL instrumentation callback that 
> CONFIG_DYNAMIC_FTRACE=y provides, by pushing a RET trampoline on the stack from 
> the CALL trampoline - see my previous email.
> 
> > HJ did have patches to turn 'ret' into a form of retpoline, which I
> > don't think ever even got performance-tested.
> 
> Return instrumentation is possible as well, but there are two major drawbacks:
> 
>  - GCC support for it is not as widely available and return instrumentation is 
>    less tested in Linux kernel contexts
> 
>  - a major point of my suggestion is that CONFIG_DYNAMIC_FTRACE=y is already 
>    enabled in distros here and today, so the runtime overhead to non-SkyLake CPUs 
>    would be literally zero, while still allowing to fix the RSB vulnerability on 
>    SkyLake.

I played around with that a bit during the week and it turns out to be less
simple than you thought.

1) Injecting a trampoline return only works for functions which have all
   arguments in registers. For functions with arguments on stack, like all
   varargs functions, this breaks because the function won't find its
   arguments anymore.

   I have not yet found a way to figure out reliably which functions have
   arguments on stack. If that were possible, simply ignoring them might
   be an option.

   The workaround is to replace the original return on stack with the
   trampoline and store the original return in a per thread stack, which I
   implemented. But this sucks performance wise badly.

2) Doing the whole dance on function entry has a real downside because you
   refill RSB on every 15th return no matter whether it's required or
   not. That really gives a very prominent performance hit.

An alternative idea is to do the following (not yet implemented):

__fentry__:
	incl	PER_CPU_VAR(call_depth)
	retq

and use -mfunction-return=thunk-extern which is available on retpoline
enabled compilers. That's a reasonable requirement because w/o retpoline
the whole SKL magic is pointless anyway.

-mfunction-return=thunk-extern issues

	jmp	__x86_return_thunk

instead of ret. In the thunk we can do the whole shebang of mitigation.
That jump can be identified at build time and it can be patched into a ret
for unaffected CPUs. Ideally we do the patching at build time and only
patch the jump in when SKL is detected or paranoia requests it.

We could actually look into that for tracing as well. The only reason why
we don't do that is to select the ideal nop for the CPU the kernel runs on,
which obviously cannot be known at build time.

__x86_return_thunk would look like this:

__x86_return_thunk:
	testl	$0xf, PER_CPU_VAR(call_depth)
	jnz	1f	
	stuff_rsb
   1:
	decl	PER_CPU_VAR(call_depth)
   	ret

The call_depth variable would be reset on context switch.
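
A minimal C model of the counter logic in the two assembly stubs above, assuming a single CPU and ignoring speculation entirely. The names `cpu_model`, `model_fentry()`, `model_return_thunk()` and `model_context_switch()` are hypothetical, and `stuff_count` merely counts how often the `stuff_rsb` step would run:

```c
#include <stdint.h>

/* Hypothetical per-CPU state: the call depth plus a stuffing counter. */
struct cpu_model {
	uint32_t call_depth;   /* mirrors PER_CPU_VAR(call_depth) */
	unsigned stuff_count;  /* how often stuff_rsb would have run */
};

/* __fentry__: incl PER_CPU_VAR(call_depth) */
static void model_fentry(struct cpu_model *c)
{
	c->call_depth++;
}

/* __x86_return_thunk: stuff when the low four bits are zero, then decrement. */
static void model_return_thunk(struct cpu_model *c)
{
	if ((c->call_depth & 0xf) == 0)
		c->stuff_count++;  /* stands in for stuff_rsb */
	c->call_depth--;
}

/* The counter is reset on context switch. */
static void model_context_switch(struct cpu_model *c)
{
	c->call_depth = 0;
}
```

In this model, 20 nested calls followed by 20 returns trigger the stuffing step once, at depth 16, and leave the counter at zero.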

Though that has another problem: tail calls. Tail calls will invoke the
__fentry__ call of the tail called function, which makes the call_depth
counter unbalanced. Tail calls can be prevented by using
-fno-optimize-sibling-calls, but that probably sucks as well.
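
The imbalance can be sketched with a small C model. It assumes each tail-called function still runs its `__fentry__` increment while only one return passes through the thunk per original call; `call_depth_drift()` is a hypothetical helper, not kernel code:

```c
/*
 * Net change of the call depth counter after `calls` call/return pairs,
 * where every called function performs `tail_calls_per_call` tail calls
 * before the single return.
 */
static long call_depth_drift(unsigned calls, unsigned tail_calls_per_call)
{
	long depth = 0;

	for (unsigned i = 0; i < calls; i++) {
		depth++;                /* __fentry__ of the called function */
		for (unsigned t = 0; t < tail_calls_per_call; t++)
			depth++;        /* __fentry__ of each tail-called function */
		depth--;                /* one return through the thunk */
	}
	return depth;
}
```

Without tail calls the drift is zero; with one tail call per call the counter gains one on every call, so the low-four-bits test in the thunk no longer tracks the real return depth.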

Yet another possibility is to avoid the function entry and accounting magic
and use the generic gcc return thunk:

__x86_return_thunk:
	call L2
L1:
	pause
	lfence
	jmp L1
L2:
	lea 8(%rsp), %rsp|lea 4(%esp), %esp
	ret

which basically refills the RSB on every return. That can be inline or
extern, but in both cases we should be able to patch it out.

I have no idea how that affects performance, but it might be worthwhile to
experiment with that.

If nobody beats me to it, I'll play around with that some more after
vacation.

Thanks,

	tglx


* Re: [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation
  2018-02-04 18:43                       ` Thomas Gleixner
@ 2018-02-04 20:22                         ` David Woodhouse
  2018-02-06  9:14                         ` David Woodhouse
  1 sibling, 0 replies; 143+ messages in thread
From: David Woodhouse @ 2018-02-04 20:22 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar
  Cc: Linus Torvalds, KarimAllah Ahmed, Linux Kernel Mailing List,
	Andi Kleen, Andrea Arcangeli, Andy Lutomirski, Arjan van de Ven,
	Ashok Raj, Asit Mallick, Borislav Petkov, Dan Williams,
	Dave Hansen, Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Tim Chen, Tom Lendacky, KVM list, the arch/x86 maintainers,
	Arjan Van De Ven


On Sun, 2018-02-04 at 19:43 +0100, Thomas Gleixner wrote:
> 
> __x86_return_thunk would look like this:
> 
> __x86_return_thunk:
>         testl   $0xf, PER_CPU_VAR(call_depth)
>         jnz     1f      
>         stuff_rsb
>    1:
>         decl    PER_CPU_VAR(call_depth)
>         ret
> 
> The call_depth variable would be reset on context switch.

Note that the 'jnz' can be predicted taken there, allowing the CPU to
speculate all the way to the 'ret'... and beyond.




* Re: [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation
  2018-02-04 18:43                       ` Thomas Gleixner
  2018-02-04 20:22                         ` David Woodhouse
@ 2018-02-06  9:14                         ` David Woodhouse
  1 sibling, 0 replies; 143+ messages in thread
From: David Woodhouse @ 2018-02-06  9:14 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar
  Cc: Linus Torvalds, KarimAllah Ahmed, Linux Kernel Mailing List,
	Andi Kleen, Andrea Arcangeli, Andy Lutomirski, Arjan van de Ven,
	Ashok Raj, Asit Mallick, Borislav Petkov, Dan Williams,
	Dave Hansen, Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Tim Chen, Tom Lendacky, KVM list, the arch/x86 maintainers,
	Arjan Van De Ven




On Sun, 2018-02-04 at 19:43 +0100, Thomas Gleixner wrote:
> Yet another possibility is to avoid the function entry and accounting magic
> and use the generic gcc return thunk:
> 
> __x86_return_thunk:
>         call L2
> L1:
>         pause
>         lfence
>         jmp L1
> L2:
>         lea 8(%rsp), %rsp|lea 4(%esp), %esp
>         ret
> 
> which basically refills the RSB on every return. That can be inline or
> extern, but in both cases we should be able to patch it out.
> 
> I have no idea how that affects performance, but it might be worthwhile to
> experiment with that.

That was what I had in mind when I asked HJ to add -mfunction-return.

I suspect the performance hit would be significant because it would
cause a prediction miss on *every* return.

But as I said, let's implement what we can without IBRS for Skylake,
then we can compare the two options for performance, security coverage
and general fugliness.



* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-30  1:20       ` David Dunn
@ 2018-01-30  1:30         ` Eduardo Habkost
  0 siblings, 0 replies; 143+ messages in thread
From: Eduardo Habkost @ 2018-01-30  1:30 UTC (permalink / raw)
  To: David Dunn
  Cc: Jim Mattson, Andi Kleen, Arjan van de Ven, KarimAllah Ahmed,
	Wilson, Matt, linux-kernel, Andrea Arcangeli, Andy Lutomirski,
	Ashok Raj, Asit Mallick, Borislav Petkov, Dan Williams,
	Dave Hansen, Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, Jorgensen, Bryan, kvm,
	x86, Dr. David Alan Gilbert, Fred Jacobs, David Woodhouse

On Tue, Jan 30, 2018 at 01:20:52AM +0000, David Dunn wrote:
> Eduardo,
> 
> This is why it would be good to have a CPUID bit that says:
> "apply SkyLake RSB stuffing."  That's preferable to "trust FMS"
> for VMware.

Agreed it would be more useful than "trust FMS".  However, I
believe a "no need to apply Skylake RSB stuffing" bit (which I
called "we promise we won't migrate to Skylake" previously) would
allow guests to enable safer behavior by default under older
hypervisors that don't support this bit.
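
The difference between the two polarities can be sketched as follows, assuming a guest that sees the bit as 0 whenever the hypervisor predates it. Both function names are hypothetical and neither bit exists:

```c
#include <stdbool.h>

/* Positive polarity: the hypervisor sets "apply SkyLake RSB stuffing". */
static bool stuff_with_positive_bit(bool apply_stuffing_bit)
{
	return apply_stuffing_bit;
}

/* Negative polarity: the hypervisor sets "no need to apply RSB stuffing". */
static bool stuff_with_negative_bit(bool no_need_bit)
{
	return !no_need_bit;
}
```

Under an older hypervisor both bits read as 0: the positive polarity then leaves the mitigation off, while the negative polarity turns it on, which is the safer default described above.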

> 
> If Intel defines such a feature flag, sets it on SkyLake, and
> Linux uses it... that would be very helpful for VMware.
> 
> I won't speak for GCE and AWS.  But hopefully they can indicate
> whether it would help them as well.

I agree that having a standard flag on the CPUID space to specify
that would be very helpful.

> 
> If Intel cannot define/implement such a flag on SkyLake, then
> maybe the engineers on this email could define a flag in the
> hypervisor specific CPUID space.  Linux would need to query
> that flag if it sees CPUID[1].ECX[31] set.  That's not as nice
> since it makes detection on bare metal and virtualization
> platforms different, but it is better than keying off FMS.

Agreed.

-- 
Eduardo


* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-30  1:10     ` Eduardo Habkost
@ 2018-01-30  1:20       ` David Dunn
  2018-01-30  1:30         ` Eduardo Habkost
  0 siblings, 1 reply; 143+ messages in thread
From: David Dunn @ 2018-01-30  1:20 UTC (permalink / raw)
  To: Eduardo Habkost, Jim Mattson
  Cc: Andi Kleen, Arjan van de Ven, KarimAllah Ahmed, Wilson, Matt,
	linux-kernel, Andrea Arcangeli, Andy Lutomirski, Ashok Raj,
	Asit Mallick, Borislav Petkov, Dan Williams, Dave Hansen,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, Jorgensen, Bryan, kvm,
	x86, Dr. David Alan Gilbert, Fred Jacobs, David Woodhouse

Eduardo,

This is why it would be good to have a CPUID bit that says: "apply SkyLake RSB stuffing."  That's preferable to "trust FMS" for VMware.

If Intel defines such a feature flag, sets it on SkyLake, and Linux uses it... that would be very helpful for VMware.

I won't speak for GCE and AWS.  But hopefully they can indicate whether it would help them as well.

If Intel cannot define/implement such a flag on SkyLake, then maybe the engineers on this email could define a flag in the hypervisor specific CPUID space.  Linux would need to query that flag if it sees CPUID[1].ECX[31] set.  That's not as nice since it makes detection on bare metal and virtualization platforms different, but it is better than keying off FMS.
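
A minimal sketch of the split detection described in the last paragraph. CPUID[1].ECX[31] is the hypervisor-present bit; the hypervisor flag itself (`hv_rsb_flag`, imagined somewhere in the hypervisor CPUID leaves starting at 0x40000000) and all other names here are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID.01H:ECX bit 31: set when running under a hypervisor. */
#define CPUID_1_ECX_HYPERVISOR (UINT32_C(1) << 31)

/* Values the kernel would have gathered at boot. */
struct cpuid_view {
	uint32_t leaf1_ecx;       /* CPUID.01H:ECX */
	bool fms_is_skylake_era;  /* result of an FMS check, bare metal only */
	bool hv_rsb_flag;         /* hypothetical hypervisor-defined flag */
};

/* Under a hypervisor ask the hypervisor flag; on bare metal key off FMS. */
static bool want_rsb_stuffing(const struct cpuid_view *v)
{
	if (v->leaf1_ecx & CPUID_1_ECX_HYPERVISOR)
		return v->hv_rsb_flag;
	return v->fms_is_skylake_era;
}
```

The two branches are the downside mentioned above: bare metal and virtualized detection differ, but the guest never has to trust an FMS value that may not describe its current or future host.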

David Dunn

On 1/29/18, 5:11 PM, "Eduardo Habkost" <ehabkost@redhat.com> wrote:

    On Mon, Jan 29, 2018 at 02:49:51PM -0800, Jim Mattson wrote:
    > And if we expect to introduce Cascade Lake into the pool in the
    > future, we use a Cascade Lake model number?
    > 
    > It sounds like you are suggesting that we set the model number to the
    > highest model number that will ever be introduced into the pool, at
    > any time in the future. That approach would also fail the
    > 'is_skylake_era()' test. (Not to mention that we have no idea what
    > Intel's highest compatible model number will be.)
    
    Exactly, that's why virtualization and live-migration break the
    model of just checking f/m/s/microcode: the guest doesn't need to
    work around bugs that are present in the current host, but the
    set of bugs that could appear on any future host it can run on.
    
    > 
    > On Mon, Jan 29, 2018 at 2:41 PM, Andi Kleen <ak@linux.intel.com> wrote:
    > >> Even if we expose a bit to indicate that FMS matches the underlying host, when does the guest know to query that?  The VM can be moved at any point in time, including after the guest asks if FMS matches host.
    > >
    > > There's no way to enable these mitigations later, so you always
    > > have to enable the superset of all the mitigations for all the hosts you
    > > might be migrating to.
    > >
    > > Currently that means if you ever want to migrate to Skylake you should
    > > set the Skylake model number and you're good.
    > >
    > > -Andi
    
    -- 
    Eduardo
    



* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-29 22:49   ` Jim Mattson
@ 2018-01-30  1:10     ` Eduardo Habkost
  2018-01-30  1:20       ` David Dunn
  0 siblings, 1 reply; 143+ messages in thread
From: Eduardo Habkost @ 2018-01-30  1:10 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Andi Kleen, David Dunn, Arjan van de Ven, KarimAllah Ahmed,
	Wilson, Matt, linux-kernel, Andrea Arcangeli, Andy Lutomirski,
	Ashok Raj, Asit Mallick, Borislav Petkov, Dan Williams,
	Dave Hansen, Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm, x86,
	Dr. David Alan Gilbert, Fred Jacobs, David Woodhouse

On Mon, Jan 29, 2018 at 02:49:51PM -0800, Jim Mattson wrote:
> And if we expect to introduce Cascade Lake into the pool in the
> future, we use a Cascade Lake model number?
> 
> It sounds like you are suggesting that we set the model number to the
> highest model number that will ever be introduced into the pool, at
> any time in the future. That approach would also fail the
> 'is_skylake_era()' test. (Not to mention that we have no idea what
> Intel's highest compatible model number will be.)

Exactly, that's why virtualization and live-migration break the
model of just checking f/m/s/microcode: the guest doesn't need to
work around bugs that are present in the current host, but the
set of bugs that could appear on any future host it can run on.
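
Put differently, the bug set a guest has to plan for is the union over every host it may ever land on. A minimal sketch, with hypothetical bug bits and a hypothetical `guest_visible_bugs()` helper that a management stack might compute:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-host bug bits, for illustration only. */
#define BUG_SPECTRE_V2    (UINT32_C(1) << 0)
#define BUG_RSB_UNDERFLOW (UINT32_C(1) << 1)  /* the Skylake RSB behaviour */

/* OR together the bug masks of every host in the migration pool. */
static uint32_t guest_visible_bugs(const uint32_t *host_bugs, size_t n_hosts)
{
	uint32_t bugs = 0;

	for (size_t i = 0; i < n_hosts; i++)
		bugs |= host_bugs[i];
	return bugs;
}
```

The catch raised in this thread is that the pool is not fixed: a host added later can grow the union after the guest has already chosen its mitigations.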

> 
> On Mon, Jan 29, 2018 at 2:41 PM, Andi Kleen <ak@linux.intel.com> wrote:
> >> Even if we expose a bit to indicate that FMS matches the underlying host, when does the guest know to query that?  The VM can be moved at any point in time, including after the guest asks if FMS matches host.
> >
> > There's no way to enable these mitigations later, so you always
> > have to enable the superset of all the mitigations for all the hosts you
> > might be migrating to.
> >
> > Currently that means if you ever want to migrate to Skylake you should
> > set the Skylake model number and you're good.
> >
> > -Andi

-- 
Eduardo


* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-29 22:29 [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure David Dunn
  2018-01-29 22:41 ` Andi Kleen
  2018-01-29 23:51 ` Fred Jacobs
@ 2018-01-30  1:08 ` Eduardo Habkost
  2 siblings, 0 replies; 143+ messages in thread
From: Eduardo Habkost @ 2018-01-30  1:08 UTC (permalink / raw)
  To: David Dunn
  Cc: Arjan van de Ven, KarimAllah Ahmed, Wilson, Matt, linux-kernel,
	Andi Kleen, Andrea Arcangeli, Andy Lutomirski, Ashok Raj,
	Asit Mallick, Borislav Petkov, Dan Williams, Dave Hansen,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm, x86,
	Dr. David Alan Gilbert, Fred Jacobs, Jim Mattson,
	David Woodhouse

On Mon, Jan 29, 2018 at 10:29:28PM +0000, David Dunn wrote:
> On Mon, 2018-01-29 at 13:45:07 -0800, Eduardo Habkost wrote:
> 
> > Maybe a generic "family/model/stepping/microcode really matches
> > the CPU you are running on" bit would be useful.  The bit could
> > be enabled only on host-passthrough (aka "-cpu host") mode.
> > 
> > If we really want to be able to migrate to host with different
> > CPU models (except Skylake), we could add a more specific "we
> > promise the host CPU is never going to be Skylake" bit.
> > 
> > Now, if the hypervisor is not providing any of those bits, I
> > would advise against trusting family/model/stepping/microcode
> > under a hypervisor.  Using a pre-defined CPU model (that doesn't
> > necessarily match the host) is very common when using KVM VM
> > management stacks.
> > 
> 
> Eduardo,
> 
> I don't see how this is possible in a modern virtualization
> environment.
>  
> Under VMware, a VM will be migrated to SkyLake if one is in the
> cluster and supports the features exposed to the VM.  This can
> occur for suspend/resume as well.
> 
> The migration pool isn't a constant.  Hosts can be added to a
> cluster and VMs can be instructed to move across clusters.  So
> there doesn't need to be a SkyLake around when the VM powers on
> in order for it to eventually end up on a SkyLake.

If this is the case for your deployment, this means the guest
must never assume it won't run on a Skylake host (even if f/m/s
is not Skylake), doesn't it?  Then the hypervisor won't set the
"we promise the host CPU is never going to be Skylake" bit.

> 
> Even if we expose a bit to indicate that FMS matches the
> underlying host, when does the guest know to query that?  The
> VM can be moved at any point in time, including after the guest
> asks if FMS matches host.

If the VM can be moved at any point of time to a different model
of host CPU, this means you won't tell the guest it can trust
f/m/s because it doesn't represent the underlying host.  You
won't set the "f/m/s/m really matches the host CPU" bit.

In both scenarios you describe above, it sounds like Linux must
assume it could be migrated to a Skylake host at any moment.  This
is exactly why I'm proposing those extra bits.

-- 
Eduardo


* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-29 22:29 [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure David Dunn
  2018-01-29 22:41 ` Andi Kleen
@ 2018-01-29 23:51 ` Fred Jacobs
  2018-01-30  1:08 ` Eduardo Habkost
  2 siblings, 0 replies; 143+ messages in thread
From: Fred Jacobs @ 2018-01-29 23:51 UTC (permalink / raw)
  To: David Dunn
  Cc: Eduardo Habkost, Arjan van de Ven, KarimAllah Ahmed, Wilson,
	Matt, linux-kernel, Andi Kleen, Andrea Arcangeli,
	Andy Lutomirski, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, Greg Kroah-Hartman, H . Peter Anvin,
	Ingo Molnar, Janakarajan Natarajan, Joerg Roedel, Jun Nakajima,
	Laura Abbott, Linus Torvalds, Masami Hiramatsu, Paolo Bonzini,
	Peter Zijlstra, Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm, x86,
	Dr. David Alan Gilbert, Jim Mattson, David Woodhouse

(Apologies as I was brought into this thread late, but I believe I have
context).

Could a new "feature" be enumerated on Skylake and beyond which specifies that
a particular problem exists which requires different mitigation than on
previous processors?  Perhaps a CPUID bit enumerating this feature (alongside
IBRS, IBPB and STIBP) could be exposed on only the newer CPUs.  System software
could then query this to know what form of mitigation is necessary.

This could be over-reported in virtualized environments (e.g. a Nehalem CPU
could be represented as needing the Skylake mitigation), such that sometimes
the heavier Skylake+ mitigation would be applied on older CPUs.  This is
correct, just slower.

I'm just suggesting this rather than keying on Family/Model/Stepping to avoid
breaking virtual machine migration, et cetera.

Thanks,

Fred, sticking his neck out.

On Jan 29  2:29PM, David Dunn wrote:
> On Mon, 2018-01-29 at 13:45:07 -0800, Eduardo Habkost wrote:
> 
> > Maybe a generic "family/model/stepping/microcode really matches
> > the CPU you are running on" bit would be useful.  The bit could
> > be enabled only on host-passthrough (aka "-cpu host") mode.
> > 
> > If we really want to be able to migrate to host with different
> > CPU models (except Skylake), we could add a more specific "we
> > promise the host CPU is never going to be Skylake" bit.
> > 
> > Now, if the hypervisor is not providing any of those bits, I
> > would advise against trusting family/model/stepping/microcode
> > under a hypervisor.  Using a pre-defined CPU model (that doesn't
> > necessarily match the host) is very common when using KVM VM
> > management stacks.
> > 
> 
> Eduardo,
> 
> I don't see how this is possible in a modern virtualization environment.
>  
> Under VMware, a VM will be migrated to SkyLake if one is in the cluster and supports the features exposed to the VM.  This can occur for suspend/resume as well.
> 
> The migration pool isn't a constant.  Hosts can be added to a cluster and VMs can be instructed to move across clusters.  So there doesn't need to be a SkyLake around when the VM powers on in order for it to eventually end up on a SkyLake.
> 
> Even if we expose a bit to indicate that FMS matches the underlying host, when does the guest know to query that?  The VM can be moved at any point in time, including after the guest asks if FMS matches host.
> 
> My apologies for posting onto the mailing list out of the blue.  Someone asked my opinion on this suggestion.  I'm definitely interested in figuring out whether Linux can fully mitigate the SkyLake RSB problem in virtual environments, but it's not clear how best to achieve that.
> 
> Thanks,
> 
> David Dunn
> 


* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-29 22:41 ` Andi Kleen
@ 2018-01-29 22:49   ` Jim Mattson
  2018-01-30  1:10     ` Eduardo Habkost
  0 siblings, 1 reply; 143+ messages in thread
From: Jim Mattson @ 2018-01-29 22:49 UTC (permalink / raw)
  To: Andi Kleen
  Cc: David Dunn, Eduardo Habkost, Arjan van de Ven, KarimAllah Ahmed,
	Wilson, Matt, linux-kernel, Andrea Arcangeli, Andy Lutomirski,
	Ashok Raj, Asit Mallick, Borislav Petkov, Dan Williams,
	Dave Hansen, Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm, x86,
	Dr. David Alan Gilbert, Fred Jacobs, David Woodhouse

And if we expect to introduce Cascade Lake into the pool in the
future, we use a Cascade Lake model number?

It sounds like you are suggesting that we set the model number to the
highest model number that will ever be introduced into the pool, at
any time in the future. That approach would also fail the
'is_skylake_era()' test. (Not to mention that we have no idea what
Intel's highest compatible model number will be.)
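
A sketch of why a model-number check cannot anticipate future parts, loosely modelled on the 'is_skylake_era()' helper discussed in this thread. The list below uses the family 6 Skylake and Kaby Lake model numbers; treat the exact set, and the name `model_is_skylake_era()`, as assumptions rather than the kernel's code:

```c
#include <stdbool.h>
#include <stdint.h>

/* Family 6 model numbers for Skylake and Kaby Lake parts. */
#define MODEL_SKYLAKE_MOBILE   0x4E
#define MODEL_SKYLAKE_DESKTOP  0x5E
#define MODEL_SKYLAKE_X        0x55
#define MODEL_KABYLAKE_MOBILE  0x8E
#define MODEL_KABYLAKE_DESKTOP 0x9E

/* True only for the models listed above; anything else is "not Skylake era". */
static bool model_is_skylake_era(uint8_t family, uint8_t model)
{
	if (family != 6)
		return false;

	switch (model) {
	case MODEL_SKYLAKE_MOBILE:
	case MODEL_SKYLAKE_DESKTOP:
	case MODEL_SKYLAKE_X:
	case MODEL_KABYLAKE_MOBILE:
	case MODEL_KABYLAKE_DESKTOP:
		return true;
	default:
		return false;
	}
}
```

A guest that is given some higher, not yet listed model number falls into the default branch, so advertising "the newest model the pool will ever contain" would switch the mitigation off rather than on.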

On Mon, Jan 29, 2018 at 2:41 PM, Andi Kleen <ak@linux.intel.com> wrote:
>> Even if we expose a bit to indicate that FMS matches the underlying host, when does the guest know to query that?  The VM can be moved at any point in time, including after the guest asks if FMS matches host.
>
> There's no way to enable these mitigations later, so you always
> have to enable the superset of all the mitigations for all the hosts you
> might be migrating to.
>
> Currently that means if you ever want to migrate to Skylake you should
> set the Skylake model number and you're good.
>
> -Andi


* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
  2018-01-29 22:29 [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure David Dunn
@ 2018-01-29 22:41 ` Andi Kleen
  2018-01-29 22:49   ` Jim Mattson
  2018-01-29 23:51 ` Fred Jacobs
  2018-01-30  1:08 ` Eduardo Habkost
  2 siblings, 1 reply; 143+ messages in thread
From: Andi Kleen @ 2018-01-29 22:41 UTC (permalink / raw)
  To: David Dunn
  Cc: Eduardo Habkost, Arjan van de Ven, KarimAllah Ahmed, Wilson,
	Matt, linux-kernel, Andrea Arcangeli, Andy Lutomirski, Ashok Raj,
	Asit Mallick, Borislav Petkov, Dan Williams, Dave Hansen,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm, x86,
	Dr. David Alan Gilbert, Fred Jacobs, Jim Mattson,
	David Woodhouse

> Even if we expose a bit to indicate that FMS matches the underlying host, when does the guest know to query that?  The VM can be moved at any point in time, including after the guest asks if FMS matches host.

There's no way to enable these mitigations later, so you always
have to enable the superset of all the mitigations for all the hosts you
might be migrating to.

Currently that means if you ever want to migrate to Skylake, you should
set the Skylake model number and you're good.

-Andi
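
[Editorial sketch, not part of the original message.] The "superset"
rule above, and the objection raised in the reply to it, can be
illustrated roughly as follows. This is only a sketch under stated
assumptions: the model list is my understanding of the family-6
Skylake/Kaby Lake model numbers that an 'is_skylake_era()'-style test
would match, not a copy of the patch, and 'struct host_cpu',
'model_is_skylake_era()', 'pool_needs_skylake_mitigations()' and the
model value 0xFF are hypothetical names and values chosen for
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed family-6 model numbers for the "Skylake era" (illustrative). */
enum {
	MODEL_SKYLAKE_MOBILE   = 0x4E,
	MODEL_SKYLAKE_DESKTOP  = 0x5E,
	MODEL_SKYLAKE_X        = 0x55,
	MODEL_KABYLAKE_MOBILE  = 0x8E,
	MODEL_KABYLAKE_DESKTOP = 0x9E,
};

/* Hypothetical description of one host in the migration pool. */
struct host_cpu {
	unsigned int family;
	unsigned int model;
};

/* What a guest can decide from the family/model it is shown. */
bool model_is_skylake_era(unsigned int family, unsigned int model)
{
	if (family != 6)
		return false;

	switch (model) {
	case MODEL_SKYLAKE_MOBILE:
	case MODEL_SKYLAKE_DESKTOP:
	case MODEL_SKYLAKE_X:
	case MODEL_KABYLAKE_MOBILE:
	case MODEL_KABYLAKE_DESKTOP:
		return true;
	default:
		return false;
	}
}

/*
 * Superset rule: the guest needs the Skylake-era mitigations if any
 * host it might ever run on is Skylake era.
 */
bool pool_needs_skylake_mitigations(const struct host_cpu *pool, size_t n)
{
	assert(pool != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (model_is_skylake_era(pool[i].family, pool[i].model))
			return true;
	}
	return false;
}
```

With a pool holding one non-Skylake host and one Skylake-X host,
pool_needs_skylake_mitigations() returns true, so the hypervisor would
have to advertise a Skylake model number for the guest's own check to
fire. Advertising some later, unlisted model number instead (0xFF here,
purely as a placeholder) makes model_is_skylake_era() return false in
the guest even though the pool still needs the mitigations.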


* Re: [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure
@ 2018-01-29 22:29 David Dunn
  2018-01-29 22:41 ` Andi Kleen
                   ` (2 more replies)
  0 siblings, 3 replies; 143+ messages in thread
From: David Dunn @ 2018-01-29 22:29 UTC (permalink / raw)
  To: Eduardo Habkost
  Cc: Arjan van de Ven, KarimAllah Ahmed, Wilson, Matt, linux-kernel,
	Andi Kleen, Andrea Arcangeli, Andy Lutomirski, Ashok Raj,
	Asit Mallick, Borislav Petkov, Dan Williams, Dave Hansen,
	Greg Kroah-Hartman, H . Peter Anvin, Ingo Molnar,
	Janakarajan Natarajan, Joerg Roedel, Jun Nakajima, Laura Abbott,
	Linus Torvalds, Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky, kvm, x86,
	Dr. David Alan Gilbert, Fred Jacobs, Jim Mattson,
	David Woodhouse

On Mon, 2018-01-29 at 13:45:07 -0800, Eduardo Habkost wrote:

> Maybe a generic "family/model/stepping/microcode really matches
> the CPU you are running on" bit would be useful.  The bit could
> be enabled only on host-passthrough (aka "-cpu host") mode.
> 
> If we really want to be able to migrate to hosts with different
> CPU models (except Skylake), we could add a more specific "we
> promise the host CPU is never going to be Skylake" bit.
> 
> Now, if the hypervisor is not providing any of those bits, I
> would advise against trusting family/model/stepping/microcode
> under a hypervisor.  Using a pre-defined CPU model (that doesn't
> necessarily match the host) is very common when using KVM VM
> management stacks.
> 

Eduardo,

I don't see how this is possible in a modern virtualization environment.
 
Under VMware, a VM will be migrated to Skylake if one is in the cluster and supports the features exposed to the VM.  This can occur for suspend/resume as well.

The migration pool isn't a constant.  Hosts can be added to a cluster and VMs can be instructed to move across clusters.  So there doesn't need to be a Skylake around when the VM powers on in order for it to eventually end up on a Skylake.

Even if we expose a bit to indicate that FMS matches the underlying host, when does the guest know to query that?  The VM can be moved at any point in time, including after the guest asks if FMS matches host.

My apologies for posting to the mailing list out of the blue.  Someone asked my opinion on this suggestion.  I'm definitely interested in figuring out whether Linux can fully mitigate the Skylake RSB problem in virtual environments, but it's not clear how best to achieve that.

Thanks,

David Dunn



end of thread: ~2018-02-06  9:14 UTC

Thread overview: 143+ messages
2018-01-20 19:22 [RFC 00/10] Speculation Control feature support KarimAllah Ahmed
2018-01-20 19:22 ` [RFC 01/10] x86/speculation: Add basic support for IBPB KarimAllah Ahmed
2018-01-20 19:22 ` [RFC 02/10] x86/kvm: Add IBPB support KarimAllah Ahmed
2018-01-20 20:18   ` Woodhouse, David
2018-01-22 18:56   ` Jim Mattson
2018-01-22 19:31     ` Jim Mattson
2018-01-20 19:22 ` [RFC 03/10] x86/speculation: Use Indirect Branch Prediction Barrier in context switch KarimAllah Ahmed
2018-01-20 19:22 ` [RFC 04/10] x86/mm: Only flush indirect branches when switching into non dumpable process KarimAllah Ahmed
2018-01-20 21:06   ` Woodhouse, David
2018-01-22 18:29     ` Tim Chen
2018-01-21 11:22   ` Peter Zijlstra
2018-01-21 12:04     ` David Woodhouse
2018-01-21 14:07       ` H.J. Lu
2018-01-22 10:19       ` Peter Zijlstra
2018-01-22 10:23         ` David Woodhouse
2018-01-21 16:21     ` Ingo Molnar
2018-01-21 16:25       ` Arjan van de Ven
2018-01-21 22:20       ` Woodhouse, David
2018-01-29  6:35     ` Jon Masters
2018-01-29 14:07       ` Peter Zijlstra
2018-01-20 19:22 ` [RFC 05/10] x86/speculation: Add basic IBRS support infrastructure KarimAllah Ahmed
2018-01-21 14:31   ` Thomas Gleixner
2018-01-21 14:56     ` Borislav Petkov
2018-01-22  9:51       ` Peter Zijlstra
2018-01-22 12:06         ` Borislav Petkov
2018-01-22 13:30           ` Greg Kroah-Hartman
2018-01-22 13:37             ` Woodhouse, David
2018-01-21 15:25     ` David Woodhouse
2018-01-23 20:58     ` David Woodhouse
2018-01-23 22:43       ` Johannes Erdfelt
2018-01-24  8:47       ` Peter Zijlstra
2018-01-24  9:02         ` David Woodhouse
2018-01-24  9:10           ` Greg Kroah-Hartman
2018-01-24 15:09             ` Arjan van de Ven
2018-01-24 15:18               ` David Woodhouse
2018-01-24  9:34           ` Peter Zijlstra
2018-01-24 10:49           ` Henrique de Moraes Holschuh
2018-01-24 12:30             ` David Woodhouse
2018-01-24 12:14         ` David Woodhouse
2018-01-24 12:29           ` Peter Zijlstra
2018-01-24 12:58             ` David Woodhouse
2018-01-29 20:14   ` [RFC,05/10] " Eduardo Habkost
2018-01-29 20:17     ` David Woodhouse
2018-01-29 20:42       ` Eduardo Habkost
2018-01-29 20:44         ` Arjan van de Ven
2018-01-29 21:02           ` David Woodhouse
2018-01-29 21:37             ` Jim Mattson
2018-01-29 21:50               ` Eduardo Habkost
2018-01-29 22:12                 ` Jim Mattson
2018-01-30  1:22                   ` Eduardo Habkost
2018-01-29 22:25                 ` Andi Kleen
2018-01-30  1:37                   ` Eduardo Habkost
2018-01-29 21:37             ` Andi Kleen
2018-01-29 21:44             ` Eduardo Habkost
2018-01-29 22:10               ` Konrad Rzeszutek Wilk
2018-01-30  1:12                 ` Eduardo Habkost
2018-01-30  0:23             ` Linus Torvalds
2018-01-30  1:03               ` Jim Mattson
2018-01-30  3:13                 ` Andi Kleen
2018-01-31 15:03                   ` Paolo Bonzini
2018-01-31 15:07                     ` Dr. David Alan Gilbert
2018-01-30  1:32               ` Arjan van de Ven
2018-01-30  3:32                 ` Linus Torvalds
2018-01-30 12:04                   ` Eduardo Habkost
2018-01-30 13:54                   ` Arjan van de Ven
2018-01-30  8:22               ` David Woodhouse
2018-01-30 11:35               ` David Woodhouse
2018-01-30 11:56               ` Dr. David Alan Gilbert
2018-01-30 12:11               ` Christian Borntraeger
2018-01-30 14:46                 ` Christophe de Dinechin
2018-01-30 14:52                   ` Christian Borntraeger
2018-01-30 14:56                     ` Christophe de Dinechin
2018-01-30 15:33                       ` Christian Borntraeger
2018-01-30 20:46               ` Alan Cox
2018-01-31 10:05                 ` Christophe de Dinechin
2018-01-31 10:15                   ` Thomas Gleixner
2018-01-31 11:04                     ` Dr. David Alan Gilbert
2018-01-31 11:52                       ` Borislav Petkov
2018-01-31 12:30                         ` Dr. David Alan Gilbert
2018-01-31 13:18                           ` Borislav Petkov
2018-01-31 14:04                             ` Dr. David Alan Gilbert
2018-01-31 14:44                               ` Eduardo Habkost
2018-01-31 16:28                                 ` Borislav Petkov
2018-01-31 11:07                     ` Christophe de Dinechin
2018-01-31 15:00                     ` Eduardo Habkost
2018-01-31 15:11                     ` Arjan van de Ven
2018-01-31 10:03   ` [RFC 05/10] " Christophe de Dinechin
2018-01-20 19:22 ` [RFC 06/10] x86/speculation: Add inlines to control Indirect Branch Speculation KarimAllah Ahmed
2018-01-20 19:22 ` [RFC 07/10] x86: Simplify spectre_v2 command line parsing KarimAllah Ahmed
2018-01-20 19:22 ` [RFC 08/10] x86/idle: Control Indirect Branch Speculation in idle KarimAllah Ahmed
2018-01-20 19:23 ` [RFC 09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation KarimAllah Ahmed
2018-01-21 19:14   ` Andy Lutomirski
2018-01-23 16:12     ` Tom Lendacky
2018-01-23 16:20       ` Woodhouse, David
2018-01-23 22:37         ` Tom Lendacky
2018-01-23 22:49           ` Andi Kleen
2018-01-23 23:14             ` Woodhouse, David
2018-01-23 23:22               ` Andi Kleen
2018-01-24  0:47               ` Tim Chen
2018-01-24  1:00                 ` Andy Lutomirski
2018-01-24  1:22                   ` David Woodhouse
2018-01-24  1:59                   ` Van De Ven, Arjan
2018-01-24  3:25                     ` Andy Lutomirski
2018-01-21 19:34   ` Linus Torvalds
2018-01-21 20:28     ` David Woodhouse
2018-01-21 21:35       ` Linus Torvalds
2018-01-21 22:00         ` David Woodhouse
2018-01-21 22:27           ` Linus Torvalds
2018-01-22 16:27             ` David Woodhouse
2018-01-23  7:29               ` Ingo Molnar
2018-01-23  7:53                 ` Ingo Molnar
2018-01-23  9:27                   ` Ingo Molnar
2018-01-23  9:37                     ` David Woodhouse
2018-01-23 15:01                     ` Dave Hansen
2018-01-23  9:30                   ` David Woodhouse
2018-01-23 10:15                     ` Ingo Molnar
2018-01-23 10:27                       ` David Woodhouse
2018-01-23 10:44                         ` Ingo Molnar
2018-01-23 10:57                           ` David Woodhouse
2018-01-23 10:23                     ` Ingo Molnar
2018-01-23 10:35                       ` David Woodhouse
2018-02-04 18:43                       ` Thomas Gleixner
2018-02-04 20:22                         ` David Woodhouse
2018-02-06  9:14                         ` David Woodhouse
2018-01-25 16:19                     ` Mason
2018-01-25 17:16                       ` Greg Kroah-Hartman
2018-01-29 11:59                         ` Mason
2018-01-24  0:05                 ` Andi Kleen
2018-01-23 20:16       ` Pavel Machek
2018-01-20 19:23 ` [RFC 10/10] x86/enter: Use IBRS on syscall and interrupts KarimAllah Ahmed
2018-01-21 13:50   ` Konrad Rzeszutek Wilk
2018-01-21 14:40     ` KarimAllah Ahmed
2018-01-21 17:22     ` Dave Hansen
2018-01-21 14:02 ` [RFC 00/10] Speculation Control feature support Konrad Rzeszutek Wilk
2018-01-22 21:27   ` David Woodhouse
2018-01-29 22:29 [RFC,05/10] x86/speculation: Add basic IBRS support infrastructure David Dunn
2018-01-29 22:41 ` Andi Kleen
2018-01-29 22:49   ` Jim Mattson
2018-01-30  1:10     ` Eduardo Habkost
2018-01-30  1:20       ` David Dunn
2018-01-30  1:30         ` Eduardo Habkost
2018-01-29 23:51 ` Fred Jacobs
2018-01-30  1:08 ` Eduardo Habkost
