[PATCH v5 0/5] KVM: Expose speculation control feature to guests

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v5 0/5] KVM: Expose speculation control feature to guests
@ 2018-01-31 19:37 ` KarimAllah Ahmed
  0 siblings, 0 replies; 56+ messages in thread
From: KarimAllah Ahmed @ 2018-01-31 19:37 UTC (permalink / raw)
  To: kvm, linux-kernel, x86
  Cc: KarimAllah Ahmed, Andi Kleen, Andrea Arcangeli, Andy Lutomirski,
	Arjan van de Ven, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, David Woodhouse, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Linus Torvalds,
	Masami Hiramatsu, Paolo Bonzini, Peter Zijlstra,
	Radim Krčmář,
	Thomas Gleixner, Tim Chen, Tom Lendacky

Add direct access to speculation control MSRs for KVM guests. This allows the
guest to protect itself against Spectre V2 using IBRS+IBPB instead of a
retpoline+IBPB based approach.

It also exposes the ARCH_CAPABILITIES MSR which is going to be used by future
Intel processors to indicate RDCL_NO and IBRS_ALL.

v5:
- svm: add PRED_CMD and SPEC_CTRL to direct_access_msrs list.
- vmx: check also for X86_FEATURE_SPEC_CTRL for msr reads and writes.
- vmx: Use MSR_TYPE_W instead of MSR_TYPE_R for the nested IBPB MSR
- rewrite commit message for IBPB patch [2/5] (Ashok)

v4:
- Add IBRS passthrough for SVM (5/5).
- Handle nested guests properly.
- expose F(IBRS) in kvm_cpuid_8000_0008_ebx_x86_features

Ashok Raj (1):
  KVM: x86: Add IBPB support

KarimAllah Ahmed (4):
  KVM: x86: Update the reverse_cpuid list to include CPUID_7_EDX
  KVM: VMX: Emulate MSR_IA32_ARCH_CAPABILITIES
  KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
  KVM: SVM: Allow direct access to MSR_IA32_SPEC_CTRL

 arch/x86/kvm/cpuid.c |  22 +++++++---
 arch/x86/kvm/cpuid.h |   1 +
 arch/x86/kvm/svm.c   |  87 ++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx.c   | 117 +++++++++++++++++++++++++++++++++++++++++++++++++--
 arch/x86/kvm/x86.c   |   1 +
 5 files changed, 218 insertions(+), 10 deletions(-)

Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Asit Mallick <asit.k.mallick@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jun Nakajima <jun.nakajima@intel.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: x86@kernel.org

-- 
2.7.4

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH v5 0/5] KVM: Expose speculation control feature to guests
@ 2018-01-31 19:37 ` KarimAllah Ahmed
  0 siblings, 0 replies; 56+ messages in thread
From: KarimAllah Ahmed @ 2018-01-31 19:37 UTC (permalink / raw)
  To: kvm, linux-kernel, x86
  Cc: KarimAllah Ahmed, Andi Kleen, Andrea Arcangeli, Andy Lutomirski,
	Arjan van de Ven, Ashok Raj, Asit Mallick, Borislav Petkov,
	Dan Williams, Dave Hansen, David Woodhouse, Greg Kroah-Hartman,
	H . Peter Anvin, Ingo Molnar, Janakarajan Natarajan,
	Joerg Roedel, Jun Nakajima, Laura Abbott, Linus Torvalds,
	Masami Hiramatsu

Add direct access to speculation control MSRs for KVM guests. This allows the
guest to protect itself against Spectre V2 using IBRS+IBPB instead of a
retpoline+IBPB based approach.

It also exposes the ARCH_CAPABILITIES MSR which is going to be used by future
Intel processors to indicate RDCL_NO and IBRS_ALL.

v5:
- svm: add PRED_CMD and SPEC_CTRL to direct_access_msrs list.
- vmx: check also for X86_FEATURE_SPEC_CTRL for msr reads and writes.
- vmx: Use MSR_TYPE_W instead of MSR_TYPE_R for the nested IBPB MSR
- rewrite commit message for IBPB patch [2/5] (Ashok)

v4:
- Add IBRS passthrough for SVM (5/5).
- Handle nested guests properly.
- expose F(IBRS) in kvm_cpuid_8000_0008_ebx_x86_features

Ashok Raj (1):
  KVM: x86: Add IBPB support

KarimAllah Ahmed (4):
  KVM: x86: Update the reverse_cpuid list to include CPUID_7_EDX
  KVM: VMX: Emulate MSR_IA32_ARCH_CAPABILITIES
  KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
  KVM: SVM: Allow direct access to MSR_IA32_SPEC_CTRL

 arch/x86/kvm/cpuid.c |  22 +++++++---
 arch/x86/kvm/cpuid.h |   1 +
 arch/x86/kvm/svm.c   |  87 ++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx.c   | 117 +++++++++++++++++++++++++++++++++++++++++++++++++--
 arch/x86/kvm/x86.c   |   1 +
 5 files changed, 218 insertions(+), 10 deletions(-)

Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Asit Mallick <asit.k.mallick@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jun Nakajima <jun.nakajima@intel.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: x86@kernel.org

-- 
2.7.4

^ permalink raw reply	[flat|nested] 56+ messages in thread

* [PATCH v5 1/5] KVM: x86: Update the reverse_cpuid list to include CPUID_7_EDX
  2018-01-31 19:37 ` KarimAllah Ahmed
  (?)
@ 2018-01-31 19:37 ` KarimAllah Ahmed
  2018-01-31 20:22   ` Konrad Rzeszutek Wilk
  -1 siblings, 1 reply; 56+ messages in thread
From: KarimAllah Ahmed @ 2018-01-31 19:37 UTC (permalink / raw)
  To: kvm, linux-kernel, x86
  Cc: KarimAllah Ahmed, Paolo Bonzini, Radim Krčmář,
	Thomas Gleixner, Ingo Molnar, H . Peter Anvin, David Woodhouse

[dwmw2: Stop using KF() for bits in it, too]
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kvm/cpuid.c | 8 +++-----
 arch/x86/kvm/cpuid.h | 1 +
 2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 0099e10..c0eb337 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -67,9 +67,7 @@ u64 kvm_supported_xcr0(void)
 
 #define F(x) bit(X86_FEATURE_##x)
 
-/* These are scattered features in cpufeatures.h. */
-#define KVM_CPUID_BIT_AVX512_4VNNIW     2
-#define KVM_CPUID_BIT_AVX512_4FMAPS     3
+/* For scattered features from cpufeatures.h; we currently expose none */
 #define KF(x) bit(KVM_CPUID_BIT_##x)
 
 int kvm_update_cpuid(struct kvm_vcpu *vcpu)
@@ -392,7 +390,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 
 	/* cpuid 7.0.edx*/
 	const u32 kvm_cpuid_7_0_edx_x86_features =
-		KF(AVX512_4VNNIW) | KF(AVX512_4FMAPS);
+		F(AVX512_4VNNIW) | F(AVX512_4FMAPS);
 
 	/* all calls to cpuid_count() should be made on the same cpu */
 	get_cpu();
@@ -477,7 +475,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 			if (!tdp_enabled || !boot_cpu_has(X86_FEATURE_OSPKE))
 				entry->ecx &= ~F(PKU);
 			entry->edx &= kvm_cpuid_7_0_edx_x86_features;
-			entry->edx &= get_scattered_cpuid_leaf(7, 0, CPUID_EDX);
+			cpuid_mask(&entry->edx, CPUID_7_EDX);
 		} else {
 			entry->ebx = 0;
 			entry->ecx = 0;
diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index c2cea66..9a327d5 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -54,6 +54,7 @@ static const struct cpuid_reg reverse_cpuid[] = {
 	[CPUID_8000_000A_EDX] = {0x8000000a, 0, CPUID_EDX},
 	[CPUID_7_ECX]         = {         7, 0, CPUID_ECX},
 	[CPUID_8000_0007_EBX] = {0x80000007, 0, CPUID_EBX},
+	[CPUID_7_EDX]         = {         7, 0, CPUID_EDX},
 };
 
 static __always_inline struct cpuid_reg x86_feature_cpuid(unsigned x86_feature)
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v5 2/5] KVM: x86: Add IBPB support
  2018-01-31 19:37 ` KarimAllah Ahmed
  (?)
  (?)
@ 2018-01-31 19:37 ` KarimAllah Ahmed
  2018-01-31 19:45   ` Jim Mattson
                     ` (3 more replies)
  -1 siblings, 4 replies; 56+ messages in thread
From: KarimAllah Ahmed @ 2018-01-31 19:37 UTC (permalink / raw)
  To: kvm, linux-kernel, x86
  Cc: Ashok Raj, Asit Mallick, Dave Hansen, Arjan Van De Ven, Tim Chen,
	Linus Torvalds, Andrea Arcangeli, Andi Kleen, Thomas Gleixner,
	Dan Williams, Jun Nakajima, Andy Lutomirski, Greg KH,
	Paolo Bonzini, Peter Zijlstra, David Woodhouse, KarimAllah Ahmed

From: Ashok Raj <ashok.raj@intel.com>

The Indirect Branch Predictor Barrier (IBPB) is an indirect branch
control mechanism. It keeps earlier branches from influencing
later ones.

Unlike IBRS and STIBP, IBPB does not define a new mode of operation.
It's a command that ensures predicted branch targets aren't used after
the barrier. Although IBRS and IBPB are enumerated by the same CPUID
enumeration, IBPB is very different.

IBPB helps mitigate against three potential attacks:

* Mitigate guests from being attacked by other guests.
  - This is addressed by issing IBPB when we do a guest switch.

* Mitigate attacks from guest/ring3->host/ring3.
  These would require a IBPB during context switch in host, or after
  VMEXIT. The host process has two ways to mitigate
  - Either it can be compiled with retpoline
  - If its going through context switch, and has set !dumpable then
    there is a IBPB in that path.
    (Tim's patch: https://patchwork.kernel.org/patch/10192871)
  - The case where after a VMEXIT you return back to Qemu might make
    Qemu attackable from guest when Qemu isn't compiled with retpoline.
  There are issues reported when doing IBPB on every VMEXIT that resulted
  in some tsc calibration woes in guest.

* Mitigate guest/ring0->host/ring0 attacks.
  When host kernel is using retpoline it is safe against these attacks.
  If host kernel isn't using retpoline we might need to do a IBPB flush on
  every VMEXIT.

Even when using retpoline for indirect calls, in certain conditions 'ret'
can use the BTB on Skylake-era CPUs. There are other mitigations
available like RSB stuffing/clearing.

* IBPB is issued only for SVM during svm_free_vcpu().
  VMX has a vmclear and SVM doesn't.  Follow discussion here:
  https://lkml.org/lkml/2018/1/15/146

Please refer to the following spec for more details on the enumeration
and control.

Refer here to get documentation about mitigations.

https://software.intel.com/en-us/side-channel-security-support

[peterz: rebase and changelog rewrite]
[karahmed: - rebase
           - vmx: expose PRED_CMD if guest has it in CPUID
           - svm: only pass through IBPB if guest has it in CPUID
           - vmx: support !cpu_has_vmx_msr_bitmap()]
           - vmx: support nested]
[dwmw2: Expose CPUID bit too (AMD IBPB only for now as we lack IBRS)
        PRED_CMD is a write-only MSR]

Cc: Asit Mallick <asit.k.mallick@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jun Nakajima <jun.nakajima@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1515720739-43819-6-git-send-email-ashok.raj@intel.com
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>

v5:
- Use MSR_TYPE_W instead of MSR_TYPE_R for the MSR.
- Always merge the bitmaps unconditionally.
- Add PRED_CMD to direct_access_msrs.
- Also check for X86_FEATURE_SPEC_CTRL for the msr reads/writes
- rewrite the commit message (from ashok.raj@)
---
 arch/x86/kvm/cpuid.c | 11 ++++++++++-
 arch/x86/kvm/svm.c   | 28 ++++++++++++++++++++++++++++
 arch/x86/kvm/vmx.c   | 29 +++++++++++++++++++++++++----
 3 files changed, 63 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index c0eb337..033004d 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -365,6 +365,10 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 		F(3DNOWPREFETCH) | F(OSVW) | 0 /* IBS */ | F(XOP) |
 		0 /* SKINIT, WDT, LWP */ | F(FMA4) | F(TBM);
 
+	/* cpuid 0x80000008.ebx */
+	const u32 kvm_cpuid_8000_0008_ebx_x86_features =
+		F(IBPB);
+
 	/* cpuid 0xC0000001.edx */
 	const u32 kvm_cpuid_C000_0001_edx_x86_features =
 		F(XSTORE) | F(XSTORE_EN) | F(XCRYPT) | F(XCRYPT_EN) |
@@ -625,7 +629,12 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 		if (!g_phys_as)
 			g_phys_as = phys_as;
 		entry->eax = g_phys_as | (virt_as << 8);
-		entry->ebx = entry->edx = 0;
+		entry->edx = 0;
+		/* IBPB isn't necessarily present in hardware cpuid */
+		if (boot_cpu_has(X86_FEATURE_IBPB))
+			entry->ebx |= F(IBPB);
+		entry->ebx &= kvm_cpuid_8000_0008_ebx_x86_features;
+		cpuid_mask(&entry->ebx, CPUID_8000_0008_EBX);
 		break;
 	}
 	case 0x80000019:
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index f40d0da..bfbb7b9 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -250,6 +250,7 @@ static const struct svm_direct_access_msrs {
 	{ .index = MSR_SYSCALL_MASK,			.always = true  },
 #endif
 	{ .index = MSR_IA32_LASTBRANCHFROMIP,		.always = false },
+	{ .index = MSR_IA32_PRED_CMD,			.always = false },
 	{ .index = MSR_IA32_LASTBRANCHTOIP,		.always = false },
 	{ .index = MSR_IA32_LASTINTFROMIP,		.always = false },
 	{ .index = MSR_IA32_LASTINTTOIP,		.always = false },
@@ -529,6 +530,7 @@ struct svm_cpu_data {
 	struct kvm_ldttss_desc *tss_desc;
 
 	struct page *save_area;
+	struct vmcb *current_vmcb;
 };
 
 static DEFINE_PER_CPU(struct svm_cpu_data *, svm_data);
@@ -1703,11 +1705,17 @@ static void svm_free_vcpu(struct kvm_vcpu *vcpu)
 	__free_pages(virt_to_page(svm->nested.msrpm), MSRPM_ALLOC_ORDER);
 	kvm_vcpu_uninit(vcpu);
 	kmem_cache_free(kvm_vcpu_cache, svm);
+	/*
+	 * The vmcb page can be recycled, causing a false negative in
+	 * svm_vcpu_load(). So do a full IBPB now.
+	 */
+	indirect_branch_prediction_barrier();
 }
 
 static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
+	struct svm_cpu_data *sd = per_cpu(svm_data, cpu);
 	int i;
 
 	if (unlikely(cpu != vcpu->cpu)) {
@@ -1736,6 +1744,10 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	if (static_cpu_has(X86_FEATURE_RDTSCP))
 		wrmsrl(MSR_TSC_AUX, svm->tsc_aux);
 
+	if (sd->current_vmcb != svm->vmcb) {
+		sd->current_vmcb = svm->vmcb;
+		indirect_branch_prediction_barrier();
+	}
 	avic_vcpu_load(vcpu, cpu);
 }
 
@@ -3684,6 +3696,22 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
 	case MSR_IA32_TSC:
 		kvm_write_tsc(vcpu, msr);
 		break;
+	case MSR_IA32_PRED_CMD:
+		if (!msr->host_initiated &&
+		    !guest_cpuid_has(vcpu, X86_FEATURE_IBPB))
+			return 1;
+
+		if (data & ~PRED_CMD_IBPB)
+			return 1;
+
+		if (!data)
+			break;
+
+		wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
+		if (is_guest_mode(vcpu))
+			break;
+		set_msr_interception(svm->msrpm, MSR_IA32_PRED_CMD, 0, 1);
+		break;
 	case MSR_STAR:
 		svm->vmcb->save.star = data;
 		break;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index d46a61b..2e4e8af 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2285,6 +2285,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	if (per_cpu(current_vmcs, cpu) != vmx->loaded_vmcs->vmcs) {
 		per_cpu(current_vmcs, cpu) = vmx->loaded_vmcs->vmcs;
 		vmcs_load(vmx->loaded_vmcs->vmcs);
+		indirect_branch_prediction_barrier();
 	}
 
 	if (!already_loaded) {
@@ -3342,6 +3343,26 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_IA32_TSC:
 		kvm_write_tsc(vcpu, msr_info);
 		break;
+	case MSR_IA32_PRED_CMD:
+		if (!msr_info->host_initiated &&
+		    !guest_cpuid_has(vcpu, X86_FEATURE_IBPB) &&
+		    !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL))
+			return 1;
+
+		if (data & ~PRED_CMD_IBPB)
+			return 1;
+
+		if (!data)
+			break;
+
+		wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
+
+		if (is_guest_mode(vcpu))
+			break;
+
+		vmx_disable_intercept_for_msr(vmx->vmcs01.msr_bitmap, MSR_IA32_PRED_CMD,
+					      MSR_TYPE_W);
+		break;
 	case MSR_IA32_CR_PAT:
 		if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) {
 			if (!kvm_mtrr_valid(vcpu, MSR_IA32_CR_PAT, data))
@@ -10045,10 +10066,6 @@ static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu,
 	unsigned long *msr_bitmap_l1;
 	unsigned long *msr_bitmap_l0 = to_vmx(vcpu)->nested.vmcs02.msr_bitmap;
 
-	/* This shortcut is ok because we support only x2APIC MSRs so far. */
-	if (!nested_cpu_has_virt_x2apic_mode(vmcs12))
-		return false;
-
 	page = kvm_vcpu_gpa_to_page(vcpu, vmcs12->msr_bitmap);
 	if (is_error_page(page))
 		return false;
@@ -10056,6 +10073,10 @@ static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu,
 
 	memset(msr_bitmap_l0, 0xff, PAGE_SIZE);
 
+	nested_vmx_disable_intercept_for_msr(msr_bitmap_l1, msr_bitmap_l0,
+					     MSR_IA32_PRED_CMD,
+					     MSR_TYPE_W);
+
 	if (nested_cpu_has_virt_x2apic_mode(vmcs12)) {
 		if (nested_cpu_has_apic_reg_virt(vmcs12))
 			for (msr = 0x800; msr <= 0x8ff; msr++)
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v5 3/5] KVM: VMX: Emulate MSR_IA32_ARCH_CAPABILITIES
  2018-01-31 19:37 ` KarimAllah Ahmed
                   ` (2 preceding siblings ...)
  (?)
@ 2018-01-31 19:37 ` KarimAllah Ahmed
  -1 siblings, 0 replies; 56+ messages in thread
From: KarimAllah Ahmed @ 2018-01-31 19:37 UTC (permalink / raw)
  To: kvm, linux-kernel, x86
  Cc: KarimAllah Ahmed, Asit Mallick, Dave Hansen, Arjan Van De Ven,
	Tim Chen, Linus Torvalds, Andrea Arcangeli, Andi Kleen,
	Thomas Gleixner, Dan Williams, Jun Nakajima, Andy Lutomirski,
	Greg KH, Paolo Bonzini, Ashok Raj, David Woodhouse

Future intel processors will use MSR_IA32_ARCH_CAPABILITIES MSR to indicate
RDCL_NO (bit 0) and IBRS_ALL (bit 1). This is a read-only MSR. By default
the contents will come directly from the hardware, but user-space can still
override it.

[dwmw2: The bit in kvm_cpuid_7_0_edx_x86_features can be unconditional]

Cc: Asit Mallick <asit.k.mallick@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jun Nakajima <jun.nakajima@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kvm/cpuid.c |  2 +-
 arch/x86/kvm/vmx.c   | 15 +++++++++++++++
 arch/x86/kvm/x86.c   |  1 +
 3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 033004d..1909635 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -394,7 +394,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 
 	/* cpuid 7.0.edx*/
 	const u32 kvm_cpuid_7_0_edx_x86_features =
-		F(AVX512_4VNNIW) | F(AVX512_4FMAPS);
+		F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(ARCH_CAPABILITIES);
 
 	/* all calls to cpuid_count() should be made on the same cpu */
 	get_cpu();
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 2e4e8af..a0b2bd1 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -592,6 +592,8 @@ struct vcpu_vmx {
 	u64 		      msr_host_kernel_gs_base;
 	u64 		      msr_guest_kernel_gs_base;
 #endif
+	u64 		      arch_capabilities;
+
 	u32 vm_entry_controls_shadow;
 	u32 vm_exit_controls_shadow;
 	u32 secondary_exec_control;
@@ -3236,6 +3238,12 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_IA32_TSC:
 		msr_info->data = guest_read_tsc(vcpu);
 		break;
+	case MSR_IA32_ARCH_CAPABILITIES:
+		if (!msr_info->host_initiated &&
+		    !guest_cpuid_has(vcpu, X86_FEATURE_ARCH_CAPABILITIES))
+			return 1;
+		msr_info->data = to_vmx(vcpu)->arch_capabilities;
+		break;
 	case MSR_IA32_SYSENTER_CS:
 		msr_info->data = vmcs_read32(GUEST_SYSENTER_CS);
 		break;
@@ -3363,6 +3371,11 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		vmx_disable_intercept_for_msr(vmx->vmcs01.msr_bitmap, MSR_IA32_PRED_CMD,
 					      MSR_TYPE_W);
 		break;
+	case MSR_IA32_ARCH_CAPABILITIES:
+		if (!msr_info->host_initiated)
+			return 1;
+		vmx->arch_capabilities = data;
+		break;
 	case MSR_IA32_CR_PAT:
 		if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) {
 			if (!kvm_mtrr_valid(vcpu, MSR_IA32_CR_PAT, data))
@@ -5625,6 +5638,8 @@ static void vmx_vcpu_setup(struct vcpu_vmx *vmx)
 		++vmx->nmsrs;
 	}
 
+	if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES))
+		rdmsrl(MSR_IA32_ARCH_CAPABILITIES, vmx->arch_capabilities);
 
 	vm_exit_controls_init(vmx, vmcs_config.vmexit_ctrl);
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c53298d..4ec142e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1009,6 +1009,7 @@ static u32 msrs_to_save[] = {
 #endif
 	MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA,
 	MSR_IA32_FEATURE_CONTROL, MSR_IA32_BNDCFGS, MSR_TSC_AUX,
+	MSR_IA32_ARCH_CAPABILITIES
 };
 
 static unsigned num_msrs_to_save;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v5 4/5] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
  2018-01-31 19:37 ` KarimAllah Ahmed
                   ` (3 preceding siblings ...)
  (?)
@ 2018-01-31 19:37 ` KarimAllah Ahmed
  2018-01-31 19:53   ` Jim Mattson
  2018-01-31 22:56   ` Raj, Ashok
  -1 siblings, 2 replies; 56+ messages in thread
From: KarimAllah Ahmed @ 2018-01-31 19:37 UTC (permalink / raw)
  To: kvm, linux-kernel, x86
  Cc: KarimAllah Ahmed, Asit Mallick, Arjan Van De Ven, Dave Hansen,
	Andi Kleen, Andrea Arcangeli, Linus Torvalds, Tim Chen,
	Thomas Gleixner, Dan Williams, Jun Nakajima, Paolo Bonzini,
	David Woodhouse, Greg KH, Andy Lutomirski, Ashok Raj

[ Based on a patch from Ashok Raj <ashok.raj@intel.com> ]

Add direct access to MSR_IA32_SPEC_CTRL for guests. This is needed for
guests that will only mitigate Spectre V2 through IBRS+IBPB and will not
be using a retpoline+IBPB based approach.

To avoid the overhead of atomically saving and restoring the
MSR_IA32_SPEC_CTRL for guests that do not actually use the MSR, only
add_atomic_switch_msr when a non-zero is written to it.

No attempt is made to handle STIBP here, intentionally. Filtering STIBP
may be added in a future patch, which may require trapping all writes
if we don't want to pass it through directly to the guest.

[dwmw2: Clean up CPUID bits, save/restore manually, handle reset]

Cc: Asit Mallick <asit.k.mallick@intel.com>
Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jun Nakajima <jun.nakajima@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
v5:
- Also check for X86_FEATURE_SPEC_CTRL for the msr reads/writes
v4:
- Add IBRS to kvm_cpuid_8000_0008_ebx_x86_features
- Handling nested guests
v3:
- Save/restore manually
- Fix CPUID handling
- Fix a copy & paste error in the name of SPEC_CTRL MSR in
  disable_intercept.
- support !cpu_has_vmx_msr_bitmap()
v2:
- remove 'host_spec_ctrl' in favor of only a comment (dwmw@).
- special case writing '0' in SPEC_CTRL to avoid confusing live-migration
  when the instance never used the MSR (dwmw@).
- depend on X86_FEATURE_IBRS instead of X86_FEATURE_SPEC_CTRL (dwmw@).
- add MSR_IA32_SPEC_CTRL to the list of MSRs to save (dropped it by accident).
---
 arch/x86/kvm/cpuid.c |  9 ++++---
 arch/x86/kvm/vmx.c   | 73 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c   |  2 +-
 3 files changed, 80 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 1909635..13f5d42 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -367,7 +367,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 
 	/* cpuid 0x80000008.ebx */
 	const u32 kvm_cpuid_8000_0008_ebx_x86_features =
-		F(IBPB);
+		F(IBPB) | F(IBRS);
 
 	/* cpuid 0xC0000001.edx */
 	const u32 kvm_cpuid_C000_0001_edx_x86_features =
@@ -394,7 +394,8 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 
 	/* cpuid 7.0.edx*/
 	const u32 kvm_cpuid_7_0_edx_x86_features =
-		F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(ARCH_CAPABILITIES);
+		F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(SPEC_CTRL) |
+		F(ARCH_CAPABILITIES);
 
 	/* all calls to cpuid_count() should be made on the same cpu */
 	get_cpu();
@@ -630,9 +631,11 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 			g_phys_as = phys_as;
 		entry->eax = g_phys_as | (virt_as << 8);
 		entry->edx = 0;
-		/* IBPB isn't necessarily present in hardware cpuid */
+		/* IBRS and IBPB aren't necessarily present in hardware cpuid */
 		if (boot_cpu_has(X86_FEATURE_IBPB))
 			entry->ebx |= F(IBPB);
+		if (boot_cpu_has(X86_FEATURE_IBRS))
+			entry->ebx |= F(IBRS);
 		entry->ebx &= kvm_cpuid_8000_0008_ebx_x86_features;
 		cpuid_mask(&entry->ebx, CPUID_8000_0008_EBX);
 		break;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index a0b2bd1..4ee93cb 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -593,6 +593,8 @@ struct vcpu_vmx {
 	u64 		      msr_guest_kernel_gs_base;
 #endif
 	u64 		      arch_capabilities;
+	u64 		      spec_ctrl;
+	bool		      save_spec_ctrl_on_exit;
 
 	u32 vm_entry_controls_shadow;
 	u32 vm_exit_controls_shadow;
@@ -938,6 +940,8 @@ static void vmx_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked);
 static bool nested_vmx_is_page_fault_vmexit(struct vmcs12 *vmcs12,
 					    u16 error_code);
 static void vmx_update_msr_bitmap(struct kvm_vcpu *vcpu);
+static void __always_inline vmx_disable_intercept_for_msr(unsigned long *msr_bitmap,
+							  u32 msr, int type);
 
 static DEFINE_PER_CPU(struct vmcs *, vmxarea);
 static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
@@ -3238,6 +3242,14 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_IA32_TSC:
 		msr_info->data = guest_read_tsc(vcpu);
 		break;
+	case MSR_IA32_SPEC_CTRL:
+		if (!msr_info->host_initiated &&
+		    !guest_cpuid_has(vcpu, X86_FEATURE_IBRS) &&
+		    !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL))
+			return 1;
+
+		msr_info->data = to_vmx(vcpu)->spec_ctrl;
+		break;
 	case MSR_IA32_ARCH_CAPABILITIES:
 		if (!msr_info->host_initiated &&
 		    !guest_cpuid_has(vcpu, X86_FEATURE_ARCH_CAPABILITIES))
@@ -3351,6 +3363,36 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_IA32_TSC:
 		kvm_write_tsc(vcpu, msr_info);
 		break;
+	case MSR_IA32_SPEC_CTRL:
+		if (!msr_info->host_initiated &&
+		    !guest_cpuid_has(vcpu, X86_FEATURE_IBRS) &&
+		    !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL))
+			return 1;
+
+		/* The STIBP bit doesn't fault even if it's not advertised */
+		if (data & ~(SPEC_CTRL_IBRS | SPEC_CTRL_STIBP))
+			return 1;
+
+		vmx->spec_ctrl = data;
+
+		/*
+		 * When it's written (to non-zero) for the first time, pass
+		 * it through. This means we don't have to take the perf
+		 * hit of saving it on vmexit for the common case of guests
+		 * that don't use it.
+		 */
+		if (cpu_has_vmx_msr_bitmap() && data &&
+		    !vmx->save_spec_ctrl_on_exit) {
+			vmx->save_spec_ctrl_on_exit = true;
+
+			if (is_guest_mode(vcpu))
+				break;
+
+			vmx_disable_intercept_for_msr(vmx->vmcs01.msr_bitmap,
+						      MSR_IA32_SPEC_CTRL,
+						      MSR_TYPE_RW);
+		}
+		break;
 	case MSR_IA32_PRED_CMD:
 		if (!msr_info->host_initiated &&
 		    !guest_cpuid_has(vcpu, X86_FEATURE_IBPB) &&
@@ -5668,6 +5710,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	u64 cr0;
 
 	vmx->rmode.vm86_active = 0;
+	vmx->spec_ctrl = 0;
 
 	vmx->vcpu.arch.regs[VCPU_REGS_RDX] = get_rdx_init_val();
 	kvm_set_cr8(vcpu, 0);
@@ -9339,6 +9382,15 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 
 	vmx_arm_hv_timer(vcpu);
 
+	/*
+	 * If this vCPU has touched SPEC_CTRL, restore the guest's value if
+	 * it's non-zero. Since vmentry is serialising on affected CPUs, there
+	 * is no need to worry about the conditional branch over the wrmsr
+	 * being speculatively taken.
+	 */
+	if (vmx->spec_ctrl)
+		wrmsrl(MSR_IA32_SPEC_CTRL, vmx->spec_ctrl);
+
 	vmx->__launched = vmx->loaded_vmcs->launched;
 	asm(
 		/* Store host registers */
@@ -9457,6 +9509,19 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 #endif
 	      );
 
+	/*
+	 * We do not use IBRS in the kernel. If this vCPU has used the
+	 * SPEC_CTRL MSR it may have left it on; save the value and
+	 * turn it off. This is much more efficient than blindly adding
+	 * it to the atomic save/restore list. Especially as the former
+	 * (Saving guest MSRs on vmexit) doesn't even exist in KVM.
+	 */
+	if (vmx->save_spec_ctrl_on_exit)
+		rdmsrl(MSR_IA32_SPEC_CTRL, vmx->spec_ctrl);
+
+	if (vmx->spec_ctrl)
+		wrmsrl(MSR_IA32_SPEC_CTRL, 0);
+
 	/* Eliminate branch target predictions from guest mode */
 	vmexit_fill_RSB();
 
@@ -10115,6 +10180,14 @@ static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu,
 				MSR_TYPE_W);
 		}
 	}
+
+	if (to_vmx(vcpu)->save_spec_ctrl_on_exit) {
+		nested_vmx_disable_intercept_for_msr(
+				msr_bitmap_l1, msr_bitmap_l0,
+				MSR_IA32_SPEC_CTRL,
+				MSR_TYPE_R | MSR_TYPE_W);
+	}
+
 	kunmap(page);
 	kvm_release_page_clean(page);
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4ec142e..ac38143 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1009,7 +1009,7 @@ static u32 msrs_to_save[] = {
 #endif
 	MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA,
 	MSR_IA32_FEATURE_CONTROL, MSR_IA32_BNDCFGS, MSR_TSC_AUX,
-	MSR_IA32_ARCH_CAPABILITIES
+	MSR_IA32_SPEC_CTRL, MSR_IA32_ARCH_CAPABILITIES
 };
 
 static unsigned num_msrs_to_save;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* [PATCH v5 5/5] KVM: SVM: Allow direct access to MSR_IA32_SPEC_CTRL
  2018-01-31 19:37 ` KarimAllah Ahmed
                   ` (4 preceding siblings ...)
  (?)
@ 2018-01-31 19:37 ` KarimAllah Ahmed
  -1 siblings, 0 replies; 56+ messages in thread
From: KarimAllah Ahmed @ 2018-01-31 19:37 UTC (permalink / raw)
  To: kvm, linux-kernel, x86
  Cc: KarimAllah Ahmed, Asit Mallick, Arjan Van De Ven, Dave Hansen,
	Andi Kleen, Andrea Arcangeli, Linus Torvalds, Tim Chen,
	Thomas Gleixner, Dan Williams, Jun Nakajima, Paolo Bonzini,
	David Woodhouse, Greg KH, Andy Lutomirski, Ashok Raj

[ Based on a patch from Paolo Bonzini <pbonzini@redhat.com> ]

... basically doing exactly what we do for VMX:

- Passthrough SPEC_CTRL to guests (if enabled in guest CPUID)
- Save and restore SPEC_CTRL around VMExit and VMEntry only if the guest
  actually used it.

Cc: Asit Mallick <asit.k.mallick@intel.com>
Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jun Nakajima <jun.nakajima@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
v5:
- Add SPEC_CTRL to direct_access_msrs.
---
 arch/x86/kvm/svm.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 59 insertions(+)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index bfbb7b9..0016a8a 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -184,6 +184,9 @@ struct vcpu_svm {
 		u64 gs_base;
 	} host;
 
+	u64 spec_ctrl;
+	bool save_spec_ctrl_on_exit;
+
 	u32 *msrpm;
 
 	ulong nmi_iret_rip;
@@ -250,6 +253,7 @@ static const struct svm_direct_access_msrs {
 	{ .index = MSR_SYSCALL_MASK,			.always = true  },
 #endif
 	{ .index = MSR_IA32_LASTBRANCHFROMIP,		.always = false },
+	{ .index = MSR_IA32_SPEC_CTRL,			.always = false },
 	{ .index = MSR_IA32_PRED_CMD,			.always = false },
 	{ .index = MSR_IA32_LASTBRANCHTOIP,		.always = false },
 	{ .index = MSR_IA32_LASTINTFROMIP,		.always = false },
@@ -1584,6 +1588,8 @@ static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	u32 dummy;
 	u32 eax = 1;
 
+	svm->spec_ctrl = 0;
+
 	if (!init_event) {
 		svm->vcpu.arch.apic_base = APIC_DEFAULT_PHYS_BASE |
 					   MSR_IA32_APICBASE_ENABLE;
@@ -3605,6 +3611,13 @@ static int svm_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_VM_CR:
 		msr_info->data = svm->nested.vm_cr_msr;
 		break;
+	case MSR_IA32_SPEC_CTRL:
+		if (!msr_info->host_initiated &&
+		    !guest_cpuid_has(vcpu, X86_FEATURE_IBRS))
+			return 1;
+
+		msr_info->data = svm->spec_ctrl;
+		break;
 	case MSR_IA32_UCODE_REV:
 		msr_info->data = 0x01000065;
 		break;
@@ -3696,6 +3709,30 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
 	case MSR_IA32_TSC:
 		kvm_write_tsc(vcpu, msr);
 		break;
+	case MSR_IA32_SPEC_CTRL:
+		if (!msr->host_initiated &&
+		    !guest_cpuid_has(vcpu, X86_FEATURE_IBRS))
+			return 1;
+
+		/* The STIBP bit doesn't fault even if it's not advertised */
+		if (data & ~(SPEC_CTRL_IBRS | SPEC_CTRL_STIBP))
+			return 1;
+
+		svm->spec_ctrl = data;
+
+		/*
+		 * When it's written (to non-zero) for the first time, pass
+		 * it through. This means we don't have to take the perf
+		 * hit of saving it on vmexit for the common case of guests
+		 * that don't use it.
+		 */
+		if (data && !svm->save_spec_ctrl_on_exit) {
+			svm->save_spec_ctrl_on_exit = true;
+			if (is_guest_mode(vcpu))
+				break;
+			set_msr_interception(svm->msrpm, MSR_IA32_SPEC_CTRL, 1, 1);
+		}
+		break;
 	case MSR_IA32_PRED_CMD:
 		if (!msr->host_initiated &&
 		    !guest_cpuid_has(vcpu, X86_FEATURE_IBPB))
@@ -4964,6 +5001,15 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)
 
 	local_irq_enable();
 
+	/*
+	 * If this vCPU has touched SPEC_CTRL, restore the guest's value if
+	 * it's non-zero. Since vmentry is serialising on affected CPUs, there
+	 * is no need to worry about the conditional branch over the wrmsr
+	 * being speculatively taken.
+	 */
+	if (svm->spec_ctrl)
+		wrmsrl(MSR_IA32_SPEC_CTRL, svm->spec_ctrl);
+
 	asm volatile (
 		"push %%" _ASM_BP "; \n\t"
 		"mov %c[rbx](%[svm]), %%" _ASM_BX " \n\t"
@@ -5056,6 +5102,19 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)
 #endif
 		);
 
+	/*
+	 * We do not use IBRS in the kernel. If this vCPU has used the
+	 * SPEC_CTRL MSR it may have left it on; save the value and
+	 * turn it off. This is much more efficient than blindly adding
+	 * it to the atomic save/restore list. Especially as the former
+	 * (Saving guest MSRs on vmexit) doesn't even exist in KVM.
+	 */
+	if (svm->save_spec_ctrl_on_exit)
+		rdmsrl(MSR_IA32_SPEC_CTRL, svm->spec_ctrl);
+
+	if (svm->spec_ctrl)
+		wrmsrl(MSR_IA32_SPEC_CTRL, 0);
+
 	/* Eliminate branch target predictions from guest mode */
 	vmexit_fill_RSB();
 
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 2/5] KVM: x86: Add IBPB support
  2018-01-31 19:37 ` [PATCH v5 2/5] KVM: x86: Add IBPB support KarimAllah Ahmed
@ 2018-01-31 19:45   ` Jim Mattson
  2018-01-31 19:53       ` David Woodhouse
  2018-01-31 20:28   ` Konrad Rzeszutek Wilk
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 56+ messages in thread
From: Jim Mattson @ 2018-01-31 19:45 UTC (permalink / raw)
  To: KarimAllah Ahmed
  Cc: kvm list, LKML, the arch/x86 maintainers, Ashok Raj,
	Asit Mallick, Dave Hansen, Arjan Van De Ven, Tim Chen,
	Linus Torvalds, Andrea Arcangeli, Andi Kleen, Thomas Gleixner,
	Dan Williams, Jun Nakajima, Andy Lutomirski, Greg KH,
	Paolo Bonzini, Peter Zijlstra, David Woodhouse

On Wed, Jan 31, 2018 at 11:37 AM, KarimAllah Ahmed <karahmed@amazon.de> wrote:

> +       nested_vmx_disable_intercept_for_msr(msr_bitmap_l1, msr_bitmap_l0,
> +                                            MSR_IA32_PRED_CMD,
> +                                            MSR_TYPE_W);
> +

I still think this should be predicated on L1 having
guest_cpuid_has(vcpu, X86_FEATURE_IBPB) or guest_cpuid_has(vcpu,
X86_FEATURE_SPEC_CTRL), because of the potential impact to the
hypertwin. If L0 denies the feature to L1 by clearing those CPUID
bits, L1 shouldn't be able to bypass that restriction by launching L2.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 2/5] KVM: x86: Add IBPB support
  2018-01-31 19:45   ` Jim Mattson
@ 2018-01-31 19:53       ` David Woodhouse
  0 siblings, 0 replies; 56+ messages in thread
From: David Woodhouse @ 2018-01-31 19:53 UTC (permalink / raw)
  To: Jim Mattson, KarimAllah Ahmed
  Cc: kvm list, LKML, the arch/x86 maintainers, Ashok Raj,
	Asit Mallick, Dave Hansen, Arjan Van De Ven, Tim Chen,
	Linus Torvalds, Andrea Arcangeli, Andi Kleen, Thomas Gleixner,
	Dan Williams, Jun Nakajima, Andy Lutomirski, Greg KH,
	Paolo Bonzini, Peter Zijlstra

[-- Attachment #1: Type: text/plain, Size: 928 bytes --]



On Wed, 2018-01-31 at 11:45 -0800, Jim Mattson wrote:
> On Wed, Jan 31, 2018 at 11:37 AM, KarimAllah Ahmed  wrote:
> 
> > +       nested_vmx_disable_intercept_for_msr(msr_bitmap_l1, msr_bitmap_l0,
> > +                                            MSR_IA32_PRED_CMD,
> > +                                            MSR_TYPE_W);
> > +
> 
> I still think this should be predicated on L1 having
> guest_cpuid_has(vcpu, X86_FEATURE_IBPB) or guest_cpuid_has(vcpu,
> X86_FEATURE_SPEC_CTRL), because of the potential impact to the
> hypertwin. If L0 denies the feature to L1 by clearing those CPUID
> bits, L1 shouldn't be able to bypass that restriction by launching L2.

Rather than doing the expensive guest_cpu_has() every time (which is
worse now as we realised we need two of them) perhaps we should
introduce a local flag for that too?

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 2/5] KVM: x86: Add IBPB support
@ 2018-01-31 19:53       ` David Woodhouse
  0 siblings, 0 replies; 56+ messages in thread
From: David Woodhouse @ 2018-01-31 19:53 UTC (permalink / raw)
  To: Jim Mattson, KarimAllah Ahmed
  Cc: kvm list, LKML, the arch/x86 maintainers, Ashok Raj,
	Asit Mallick, Dave Hansen, Arjan Van De Ven, Tim Chen,
	Linus Torvalds, Andrea Arcangeli, Andi Kleen, Thomas Gleixner,
	Dan Williams, Jun Nakajima, Andy Lutomirski, Greg KH,
	Paolo Bonzini, Peter Zijlstra

[-- Attachment #1: Type: text/plain, Size: 928 bytes --]



On Wed, 2018-01-31 at 11:45 -0800, Jim Mattson wrote:
> On Wed, Jan 31, 2018 at 11:37 AM, KarimAllah Ahmed  wrote:
> 
> > +       nested_vmx_disable_intercept_for_msr(msr_bitmap_l1, msr_bitmap_l0,
> > +                                            MSR_IA32_PRED_CMD,
> > +                                            MSR_TYPE_W);
> > +
> 
> I still think this should be predicated on L1 having
> guest_cpuid_has(vcpu, X86_FEATURE_IBPB) or guest_cpuid_has(vcpu,
> X86_FEATURE_SPEC_CTRL), because of the potential impact to the
> hypertwin. If L0 denies the feature to L1 by clearing those CPUID
> bits, L1 shouldn't be able to bypass that restriction by launching L2.

Rather than doing the expensive guest_cpu_has() every time (which is
worse now as we realised we need two of them) perhaps we should
introduce a local flag for that too?

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 4/5] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
  2018-01-31 19:37 ` [PATCH v5 4/5] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL KarimAllah Ahmed
@ 2018-01-31 19:53   ` Jim Mattson
  2018-01-31 20:00       ` David Woodhouse
  2018-01-31 20:01     ` KarimAllah Ahmed
  2018-01-31 22:56   ` Raj, Ashok
  1 sibling, 2 replies; 56+ messages in thread
From: Jim Mattson @ 2018-01-31 19:53 UTC (permalink / raw)
  To: KarimAllah Ahmed
  Cc: kvm list, LKML, the arch/x86 maintainers, Asit Mallick,
	Arjan Van De Ven, Dave Hansen, Andi Kleen, Andrea Arcangeli,
	Linus Torvalds, Tim Chen, Thomas Gleixner, Dan Williams,
	Jun Nakajima, Paolo Bonzini, David Woodhouse, Greg KH,
	Andy Lutomirski, Ashok Raj

On Wed, Jan 31, 2018 at 11:37 AM, KarimAllah Ahmed <karahmed@amazon.de> wrote:

> +
> +       if (to_vmx(vcpu)->save_spec_ctrl_on_exit) {
> +               nested_vmx_disable_intercept_for_msr(
> +                               msr_bitmap_l1, msr_bitmap_l0,
> +                               MSR_IA32_SPEC_CTRL,
> +                               MSR_TYPE_R | MSR_TYPE_W);
> +       }
> +

As this is written, L2 will never get direct access to this MSR until
after L1 writes it.  What if L1 never writes it? The condition should
really be something that captures, "if L0 is willing to yield this MSR
to the guest..."

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 2/5] KVM: x86: Add IBPB support
  2018-01-31 19:53       ` David Woodhouse
  (?)
@ 2018-01-31 19:55       ` Jim Mattson
  2018-02-01  0:27         ` KarimAllah Ahmed
  -1 siblings, 1 reply; 56+ messages in thread
From: Jim Mattson @ 2018-01-31 19:55 UTC (permalink / raw)
  To: David Woodhouse
  Cc: KarimAllah Ahmed, kvm list, LKML, the arch/x86 maintainers,
	Ashok Raj, Asit Mallick, Dave Hansen, Arjan Van De Ven, Tim Chen,
	Linus Torvalds, Andrea Arcangeli, Andi Kleen, Thomas Gleixner,
	Dan Williams, Jun Nakajima, Andy Lutomirski, Greg KH,
	Paolo Bonzini, Peter Zijlstra

On Wed, Jan 31, 2018 at 11:53 AM, David Woodhouse <dwmw2@infradead.org> wrote:
> Rather than doing the expensive guest_cpu_has() every time (which is
> worse now as we realised we need two of them) perhaps we should
> introduce a local flag for that too?

That sounds good to me.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 4/5] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
  2018-01-31 19:53   ` Jim Mattson
@ 2018-01-31 20:00       ` David Woodhouse
  2018-01-31 20:01     ` KarimAllah Ahmed
  1 sibling, 0 replies; 56+ messages in thread
From: David Woodhouse @ 2018-01-31 20:00 UTC (permalink / raw)
  To: Jim Mattson, KarimAllah Ahmed
  Cc: kvm list, LKML, the arch/x86 maintainers, Asit Mallick,
	Arjan Van De Ven, Dave Hansen, Andi Kleen, Andrea Arcangeli,
	Linus Torvalds, Tim Chen, Thomas Gleixner, Dan Williams,
	Jun Nakajima, Paolo Bonzini, Greg KH, Andy Lutomirski, Ashok Raj

[-- Attachment #1: Type: text/plain, Size: 1199 bytes --]

On Wed, 2018-01-31 at 11:53 -0800, Jim Mattson wrote:
> On Wed, Jan 31, 2018 at 11:37 AM, KarimAllah Ahmed  wrote:
> 
> > +
> > +       if (to_vmx(vcpu)->save_spec_ctrl_on_exit) {
> > +               nested_vmx_disable_intercept_for_msr(
> > +                               msr_bitmap_l1, msr_bitmap_l0,
> > +                               MSR_IA32_SPEC_CTRL,
> > +                               MSR_TYPE_R | MSR_TYPE_W);
> > +       }
> > +
> 
> As this is written, L2 will never get direct access to this MSR until
> after L1 writes it.  What if L1 never writes it? The condition should
> really be something that captures, "if L0 is willing to yield this MSR
> to the guest..."

I'm still kind of lost here, but don't forget the requirement that the
MSR must *not* be passed through for direct access by L1 or L2 guests,
unless that ->save_spec_ctrl_on_exit flag is set.

Because that's what makes us set it back to zero on vmexit.

So the above condition doesn't look *so* wrong to me. Perhaps the issue
is that we're missing a way for L2 to actually cause that flag to get
set?

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 4/5] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
@ 2018-01-31 20:00       ` David Woodhouse
  0 siblings, 0 replies; 56+ messages in thread
From: David Woodhouse @ 2018-01-31 20:00 UTC (permalink / raw)
  To: Jim Mattson, KarimAllah Ahmed
  Cc: kvm list, LKML, the arch/x86 maintainers, Asit Mallick,
	Arjan Van De Ven, Dave Hansen, Andi Kleen, Andrea Arcangeli,
	Linus Torvalds, Tim Chen, Thomas Gleixner, Dan Williams,
	Jun Nakajima, Paolo Bonzini, Greg KH, Andy Lutomirski, Ashok Raj

[-- Attachment #1: Type: text/plain, Size: 1199 bytes --]

On Wed, 2018-01-31 at 11:53 -0800, Jim Mattson wrote:
> On Wed, Jan 31, 2018 at 11:37 AM, KarimAllah Ahmed  wrote:
> 
> > +
> > +       if (to_vmx(vcpu)->save_spec_ctrl_on_exit) {
> > +               nested_vmx_disable_intercept_for_msr(
> > +                               msr_bitmap_l1, msr_bitmap_l0,
> > +                               MSR_IA32_SPEC_CTRL,
> > +                               MSR_TYPE_R | MSR_TYPE_W);
> > +       }
> > +
> 
> As this is written, L2 will never get direct access to this MSR until
> after L1 writes it.  What if L1 never writes it? The condition should
> really be something that captures, "if L0 is willing to yield this MSR
> to the guest..."

I'm still kind of lost here, but don't forget the requirement that the
MSR must *not* be passed through for direct access by L1 or L2 guests,
unless that ->save_spec_ctrl_on_exit flag is set.

Because that's what makes us set it back to zero on vmexit.

So the above condition doesn't look *so* wrong to me. Perhaps the issue
is that we're missing a way for L2 to actually cause that flag to get
set?

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 4/5] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
  2018-01-31 19:53   ` Jim Mattson
  2018-01-31 20:00       ` David Woodhouse
@ 2018-01-31 20:01     ` KarimAllah Ahmed
  2018-01-31 20:18       ` Jim Mattson
  1 sibling, 1 reply; 56+ messages in thread
From: KarimAllah Ahmed @ 2018-01-31 20:01 UTC (permalink / raw)
  To: Jim Mattson, KarimAllah Ahmed
  Cc: kvm list, LKML, the arch/x86 maintainers, Asit Mallick,
	Arjan Van De Ven, Dave Hansen, Andi Kleen, Andrea Arcangeli,
	Linus Torvalds, Tim Chen, Thomas Gleixner, Dan Williams,
	Jun Nakajima, Paolo Bonzini, David Woodhouse, Greg KH,
	Andy Lutomirski, Ashok Raj



On 01/31/2018 08:53 PM, Jim Mattson wrote:
> On Wed, Jan 31, 2018 at 11:37 AM, KarimAllah Ahmed <karahmed@amazon.de> wrote:
> 
>> +
>> +       if (to_vmx(vcpu)->save_spec_ctrl_on_exit) {
>> +               nested_vmx_disable_intercept_for_msr(
>> +                               msr_bitmap_l1, msr_bitmap_l0,
>> +                               MSR_IA32_SPEC_CTRL,
>> +                               MSR_TYPE_R | MSR_TYPE_W);
>> +       }
>> +
> 
> As this is written, L2 will never get direct access to this MSR until
> after L1 writes it.  What if L1 never writes it? The condition should
> really be something that captures, "if L0 is willing to yield this MSR
> to the guest..."

but save_spec_ctrl_on_exit is also set for L2 write. So once L2 writes
to it, this condition will be true and then the bitmap will be updated.

> 
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 4/5] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
  2018-01-31 20:01     ` KarimAllah Ahmed
@ 2018-01-31 20:18       ` Jim Mattson
  2018-01-31 20:21           ` David Woodhouse
                           ` (2 more replies)
  0 siblings, 3 replies; 56+ messages in thread
From: Jim Mattson @ 2018-01-31 20:18 UTC (permalink / raw)
  To: KarimAllah Ahmed
  Cc: KarimAllah Ahmed, kvm list, LKML, the arch/x86 maintainers,
	Asit Mallick, Arjan Van De Ven, Dave Hansen, Andi Kleen,
	Andrea Arcangeli, Linus Torvalds, Tim Chen, Thomas Gleixner,
	Dan Williams, Jun Nakajima, Paolo Bonzini, David Woodhouse,
	Greg KH, Andy Lutomirski, Ashok Raj

On Wed, Jan 31, 2018 at 12:01 PM, KarimAllah Ahmed <karahmed@amazon.com> wrote:

> but save_spec_ctrl_on_exit is also set for L2 write. So once L2 writes
> to it, this condition will be true and then the bitmap will be updated.

So if L1 or any L2 writes to the MSR, then save_spec_ctrl_on_exit is
set to true, even if the MSR permission bitmap for a particular VMCS
*doesn't* allow the MSR to be written without an intercept. That's
functionally correct, but inefficient. It seems to me that
save_spec_ctrl_on_exit should indicate whether or not the *current*
MSR permission bitmap allows unintercepted writes to IA32_SPEC_CTRL.
To that end, perhaps save_spec_ctrl_on_exit rightfully belongs in the
loaded_vmcs structure, alongside the msr_bitmap pointer that it is
associated with. For vmcs02, nested_vmx_merge_msr_bitmap() should set
the vmcs02 save_spec_ctrl_on_exit based on (a) whether L0 is willing
to yield the MSR to L1, and (b) whether L1 is willing to yield the MSR
to L2.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 4/5] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
  2018-01-31 20:18       ` Jim Mattson
@ 2018-01-31 20:21           ` David Woodhouse
  2018-01-31 20:34         ` Paolo Bonzini
  2018-01-31 22:52         ` KarimAllah Ahmed
  2 siblings, 0 replies; 56+ messages in thread
From: David Woodhouse @ 2018-01-31 20:21 UTC (permalink / raw)
  To: Jim Mattson, KarimAllah Ahmed
  Cc: KarimAllah Ahmed, kvm list, LKML, the arch/x86 maintainers,
	Asit Mallick, Arjan Van De Ven, Dave Hansen, Andi Kleen,
	Andrea Arcangeli, Linus Torvalds, Tim Chen, Thomas Gleixner,
	Dan Williams, Jun Nakajima, Paolo Bonzini, Greg KH,
	Andy Lutomirski, Ashok Raj

[-- Attachment #1: Type: text/plain, Size: 1499 bytes --]

On Wed, 2018-01-31 at 12:18 -0800, Jim Mattson wrote:
> On Wed, Jan 31, 2018 at 12:01 PM, KarimAllah Ahmed  wrote:
> 
> > 
> > but save_spec_ctrl_on_exit is also set for L2 write. So once L2 writes
> > to it, this condition will be true and then the bitmap will be updated.
> So if L1 or any L2 writes to the MSR, then save_spec_ctrl_on_exit is
> set to true, even if the MSR permission bitmap for a particular VMCS
> *doesn't* allow the MSR to be written without an intercept. That's
> functionally correct, but inefficient. It seems to me that
> save_spec_ctrl_on_exit should indicate whether or not the *current*
> MSR permission bitmap allows unintercepted writes to IA32_SPEC_CTRL.
> To that end, perhaps save_spec_ctrl_on_exit rightfully belongs in the
> loaded_vmcs structure, alongside the msr_bitmap pointer that it is
> associated with. For vmcs02, nested_vmx_merge_msr_bitmap() should set
> the vmcs02 save_spec_ctrl_on_exit based on (a) whether L0 is willing
> to yield the MSR to L1, and (b) whether L1 is willing to yield the MSR
> to L2.

Reading and writing this MSR is expensive. And if it's yielded to the
guest in the MSR bitmap, that means we have to save its value on vmexit
and set it back to zero.

Some of the gymnastics here are explicitly done to avoid having to do
that save-and-zero step unless the guest has *actually* touched the
MSR. Not just if we are *willing* to let it do so.

That's the whole point in the yield-after-first-write dance.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 4/5] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
@ 2018-01-31 20:21           ` David Woodhouse
  0 siblings, 0 replies; 56+ messages in thread
From: David Woodhouse @ 2018-01-31 20:21 UTC (permalink / raw)
  To: Jim Mattson, KarimAllah Ahmed
  Cc: KarimAllah Ahmed, kvm list, LKML, the arch/x86 maintainers,
	Asit Mallick, Arjan Van De Ven, Dave Hansen, Andi Kleen,
	Andrea Arcangeli, Linus Torvalds, Tim Chen, Thomas Gleixner,
	Dan Williams, Jun Nakajima, Paolo Bonzini, Greg KH,
	Andy Lutomirski, Ashok Raj

[-- Attachment #1: Type: text/plain, Size: 1499 bytes --]

On Wed, 2018-01-31 at 12:18 -0800, Jim Mattson wrote:
> On Wed, Jan 31, 2018 at 12:01 PM, KarimAllah Ahmed  wrote:
> 
> > 
> > but save_spec_ctrl_on_exit is also set for L2 write. So once L2 writes
> > to it, this condition will be true and then the bitmap will be updated.
> So if L1 or any L2 writes to the MSR, then save_spec_ctrl_on_exit is
> set to true, even if the MSR permission bitmap for a particular VMCS
> *doesn't* allow the MSR to be written without an intercept. That's
> functionally correct, but inefficient. It seems to me that
> save_spec_ctrl_on_exit should indicate whether or not the *current*
> MSR permission bitmap allows unintercepted writes to IA32_SPEC_CTRL.
> To that end, perhaps save_spec_ctrl_on_exit rightfully belongs in the
> loaded_vmcs structure, alongside the msr_bitmap pointer that it is
> associated with. For vmcs02, nested_vmx_merge_msr_bitmap() should set
> the vmcs02 save_spec_ctrl_on_exit based on (a) whether L0 is willing
> to yield the MSR to L1, and (b) whether L1 is willing to yield the MSR
> to L2.

Reading and writing this MSR is expensive. And if it's yielded to the
guest in the MSR bitmap, that means we have to save its value on vmexit
and set it back to zero.

Some of the gymnastics here are explicitly done to avoid having to do
that save-and-zero step unless the guest has *actually* touched the
MSR. Not just if we are *willing* to let it do so.

That's the whole point in the yield-after-first-write dance.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 1/5] KVM: x86: Update the reverse_cpuid list to include CPUID_7_EDX
  2018-01-31 19:37 ` [PATCH v5 1/5] KVM: x86: Update the reverse_cpuid list to include CPUID_7_EDX KarimAllah Ahmed
@ 2018-01-31 20:22   ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 56+ messages in thread
From: Konrad Rzeszutek Wilk @ 2018-01-31 20:22 UTC (permalink / raw)
  To: KarimAllah Ahmed
  Cc: kvm, linux-kernel, x86, Paolo Bonzini,
	Radim Krčmář,
	Thomas Gleixner, Ingo Molnar, H . Peter Anvin, David Woodhouse

On Wed, Jan 31, 2018 at 08:37:43PM +0100, KarimAllah Ahmed wrote:
> [dwmw2: Stop using KF() for bits in it, too]
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Radim Krčmář <rkrcmar@redhat.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: H. Peter Anvin <hpa@zytor.com>
> Cc: x86@kernel.org
> Cc: kvm@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>  arch/x86/kvm/cpuid.c | 8 +++-----
>  arch/x86/kvm/cpuid.h | 1 +
>  2 files changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 0099e10..c0eb337 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -67,9 +67,7 @@ u64 kvm_supported_xcr0(void)
>  
>  #define F(x) bit(X86_FEATURE_##x)
>  
> -/* These are scattered features in cpufeatures.h. */
> -#define KVM_CPUID_BIT_AVX512_4VNNIW     2
> -#define KVM_CPUID_BIT_AVX512_4FMAPS     3
> +/* For scattered features from cpufeatures.h; we currently expose none */
>  #define KF(x) bit(KVM_CPUID_BIT_##x)
>  
>  int kvm_update_cpuid(struct kvm_vcpu *vcpu)
> @@ -392,7 +390,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
>  
>  	/* cpuid 7.0.edx*/
>  	const u32 kvm_cpuid_7_0_edx_x86_features =
> -		KF(AVX512_4VNNIW) | KF(AVX512_4FMAPS);
> +		F(AVX512_4VNNIW) | F(AVX512_4FMAPS);
>  
>  	/* all calls to cpuid_count() should be made on the same cpu */
>  	get_cpu();
> @@ -477,7 +475,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
>  			if (!tdp_enabled || !boot_cpu_has(X86_FEATURE_OSPKE))
>  				entry->ecx &= ~F(PKU);
>  			entry->edx &= kvm_cpuid_7_0_edx_x86_features;
> -			entry->edx &= get_scattered_cpuid_leaf(7, 0, CPUID_EDX);
> +			cpuid_mask(&entry->edx, CPUID_7_EDX);
>  		} else {
>  			entry->ebx = 0;
>  			entry->ecx = 0;
> diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
> index c2cea66..9a327d5 100644
> --- a/arch/x86/kvm/cpuid.h
> +++ b/arch/x86/kvm/cpuid.h
> @@ -54,6 +54,7 @@ static const struct cpuid_reg reverse_cpuid[] = {
>  	[CPUID_8000_000A_EDX] = {0x8000000a, 0, CPUID_EDX},
>  	[CPUID_7_ECX]         = {         7, 0, CPUID_ECX},
>  	[CPUID_8000_0007_EBX] = {0x80000007, 0, CPUID_EBX},
> +	[CPUID_7_EDX]         = {         7, 0, CPUID_EDX},
>  };
>  
>  static __always_inline struct cpuid_reg x86_feature_cpuid(unsigned x86_feature)
> -- 
> 2.7.4
> 

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 2/5] KVM: x86: Add IBPB support
  2018-01-31 19:37 ` [PATCH v5 2/5] KVM: x86: Add IBPB support KarimAllah Ahmed
  2018-01-31 19:45   ` Jim Mattson
@ 2018-01-31 20:28   ` Konrad Rzeszutek Wilk
  2018-01-31 20:36     ` KarimAllah Ahmed
  2018-02-01  4:54   ` Tom Lendacky
  2018-02-01 17:00   ` Raj, Ashok
  3 siblings, 1 reply; 56+ messages in thread
From: Konrad Rzeszutek Wilk @ 2018-01-31 20:28 UTC (permalink / raw)
  To: KarimAllah Ahmed
  Cc: kvm, linux-kernel, x86, Ashok Raj, Asit Mallick, Dave Hansen,
	Arjan Van De Ven, Tim Chen, Linus Torvalds, Andrea Arcangeli,
	Andi Kleen, Thomas Gleixner, Dan Williams, Jun Nakajima,
	Andy Lutomirski, Greg KH, Paolo Bonzini, Peter Zijlstra,
	David Woodhouse

> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index d46a61b..2e4e8af 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -2285,6 +2285,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>  	if (per_cpu(current_vmcs, cpu) != vmx->loaded_vmcs->vmcs) {
>  		per_cpu(current_vmcs, cpu) = vmx->loaded_vmcs->vmcs;
>  		vmcs_load(vmx->loaded_vmcs->vmcs);
> +		indirect_branch_prediction_barrier();
>  	}
>  
>  	if (!already_loaded) {
> @@ -3342,6 +3343,26 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  	case MSR_IA32_TSC:
>  		kvm_write_tsc(vcpu, msr_info);
>  		break;
> +	case MSR_IA32_PRED_CMD:
> +		if (!msr_info->host_initiated &&
> +		    !guest_cpuid_has(vcpu, X86_FEATURE_IBPB) &&
> +		    !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL))
> +			return 1;
> +
> +		if (data & ~PRED_CMD_IBPB)
> +			return 1;
> +
> +		if (!data)
> +			break;
> +
> +		wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
> +
> +		if (is_guest_mode(vcpu))
> +			break;

Don't you want this the other way around? That is first do the disable_intercept
and then add the 'if (is_guest_mode(vcpu))' ? Otherwise the very first
MSR write from the guest is going to hit condition above and never end
up executing the disabling of the intercept?

> +
> +		vmx_disable_intercept_for_msr(vmx->vmcs01.msr_bitmap, MSR_IA32_PRED_CMD,
> +					      MSR_TYPE_W);
> +		break;
>  	case MSR_IA32_CR_PAT:
>  		if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) {
>  			if (!kvm_mtrr_valid(vcpu, MSR_IA32_CR_PAT, data))

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 4/5] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
  2018-01-31 20:18       ` Jim Mattson
  2018-01-31 20:21           ` David Woodhouse
@ 2018-01-31 20:34         ` Paolo Bonzini
  2018-01-31 20:54           ` Jim Mattson
  2018-01-31 22:52         ` KarimAllah Ahmed
  2 siblings, 1 reply; 56+ messages in thread
From: Paolo Bonzini @ 2018-01-31 20:34 UTC (permalink / raw)
  To: Jim Mattson, KarimAllah Ahmed
  Cc: KarimAllah Ahmed, kvm list, LKML, the arch/x86 maintainers,
	Asit Mallick, Arjan Van De Ven, Dave Hansen, Andi Kleen,
	Andrea Arcangeli, Linus Torvalds, Tim Chen, Thomas Gleixner,
	Dan Williams, Jun Nakajima, David Woodhouse, Greg KH,
	Andy Lutomirski, Ashok Raj

On 31/01/2018 15:18, Jim Mattson wrote:
>> but save_spec_ctrl_on_exit is also set for L2 write. So once L2 writes
>> to it, this condition will be true and then the bitmap will be updated.
> So if L1 or any L2 writes to the MSR, then save_spec_ctrl_on_exit is
> set to true, even if the MSR permission bitmap for a particular VMCS
> *doesn't* allow the MSR to be written without an intercept. That's
> functionally correct, but inefficient. It seems to me that
> save_spec_ctrl_on_exit should indicate whether or not the *current*
> MSR permission bitmap allows unintercepted writes to IA32_SPEC_CTRL.
> To that end, perhaps save_spec_ctrl_on_exit rightfully belongs in the
> loaded_vmcs structure, alongside the msr_bitmap pointer that it is
> associated with. For vmcs02, nested_vmx_merge_msr_bitmap() should set
> the vmcs02 save_spec_ctrl_on_exit based on (a) whether L0 is willing
> to yield the MSR to L1, and (b) whether L1 is willing to yield the MSR
> to L2.

On the first nested write, (b) must be true for L0 to see the MSR write.
 If L1 doesn't yield the MSR to L2, the MSR write results in an L2->L1
vmexit and save_spec_ctrl_on_exit is not set to true.

So save_spec_ctrl_on_exit is set if all of the following are true:

(a) L0 is willing to yield the MSR to L1,
(b) and the write happens in L1,

or all of the following are true:

(a) L0 is willing to yield the MSR to L1,
(b) L1 is willing to yield the MSR to L2,
(c) and the write happens in L2,

It doesn't need to be placed in loaded_vmcs, because in the end if L1 is
willing to yield the MSR to L2, it will have to do reads and writes of
the MSR too, and both loaded_vmcs structs will have
save_spec_ctrl_on_exit=1.

Paolo

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 2/5] KVM: x86: Add IBPB support
  2018-01-31 20:28   ` Konrad Rzeszutek Wilk
@ 2018-01-31 20:36     ` KarimAllah Ahmed
  0 siblings, 0 replies; 56+ messages in thread
From: KarimAllah Ahmed @ 2018-01-31 20:36 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, KarimAllah Ahmed
  Cc: kvm, linux-kernel, x86, Ashok Raj, Asit Mallick, Dave Hansen,
	Arjan Van De Ven, Tim Chen, Linus Torvalds, Andrea Arcangeli,
	Andi Kleen, Thomas Gleixner, Dan Williams, Jun Nakajima,
	Andy Lutomirski, Greg KH, Paolo Bonzini, Peter Zijlstra,
	David Woodhouse

On 01/31/2018 09:28 PM, Konrad Rzeszutek Wilk wrote:
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index d46a61b..2e4e8af 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -2285,6 +2285,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>   	if (per_cpu(current_vmcs, cpu) != vmx->loaded_vmcs->vmcs) {
>>   		per_cpu(current_vmcs, cpu) = vmx->loaded_vmcs->vmcs;
>>   		vmcs_load(vmx->loaded_vmcs->vmcs);
>> +		indirect_branch_prediction_barrier();
>>   	}
>>   
>>   	if (!already_loaded) {
>> @@ -3342,6 +3343,26 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>>   	case MSR_IA32_TSC:
>>   		kvm_write_tsc(vcpu, msr_info);
>>   		break;
>> +	case MSR_IA32_PRED_CMD:
>> +		if (!msr_info->host_initiated &&
>> +		    !guest_cpuid_has(vcpu, X86_FEATURE_IBPB) &&
>> +		    !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL))
>> +			return 1;
>> +
>> +		if (data & ~PRED_CMD_IBPB)
>> +			return 1;
>> +
>> +		if (!data)
>> +			break;
>> +
>> +		wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
>> +
>> +		if (is_guest_mode(vcpu))
>> +			break;
> 
> Don't you want this the other way around? That is first do the disable_intercept
> and then add the 'if (is_guest_mode(vcpu))' ? Otherwise the very first
> MSR write from the guest is going to hit condition above and never end
> up executing the disabling of the intercept?

is_guest_mode is checking if this is an L2 guest. I *should not* do
disable_intercept on the L1 guest bitmap if it is an L2 guest that is
why this check happens before disable_intercept.

For the short-circuited L2 path, nested_vmx_merge_msr_bitmap will
properly update the L02 MSR bitmap and use it.

So the checks are fine AFAICT.

> 
>> +
>> +		vmx_disable_intercept_for_msr(vmx->vmcs01.msr_bitmap, MSR_IA32_PRED_CMD,
>> +					      MSR_TYPE_W);
>> +		break;
>>   	case MSR_IA32_CR_PAT:
>>   		if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) {
>>   			if (!kvm_mtrr_valid(vcpu, MSR_IA32_CR_PAT, data))
> 
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 4/5] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
  2018-01-31 20:34         ` Paolo Bonzini
@ 2018-01-31 20:54           ` Jim Mattson
  2018-01-31 21:00             ` Paolo Bonzini
  0 siblings, 1 reply; 56+ messages in thread
From: Jim Mattson @ 2018-01-31 20:54 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: KarimAllah Ahmed, KarimAllah Ahmed, kvm list, LKML,
	the arch/x86 maintainers, Asit Mallick, Arjan Van De Ven,
	Dave Hansen, Andi Kleen, Andrea Arcangeli, Linus Torvalds,
	Tim Chen, Thomas Gleixner, Dan Williams, Jun Nakajima,
	David Woodhouse, Greg KH, Andy Lutomirski, Ashok Raj

You seem to be making the assumption that there is one L2. What if
there are 100 L2s, and only one has write-access to IA32_SPEC_CTRL? Or
what if there once was such an L2, but it's been gone for months? The
current mechanism penalizes *all* L2s if any L2, ever, has
write-access to IA32_SPEC_CTRL.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 4/5] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
  2018-01-31 20:54           ` Jim Mattson
@ 2018-01-31 21:00             ` Paolo Bonzini
  2018-01-31 21:05               ` Jim Mattson
  0 siblings, 1 reply; 56+ messages in thread
From: Paolo Bonzini @ 2018-01-31 21:00 UTC (permalink / raw)
  To: Jim Mattson
  Cc: KarimAllah Ahmed, KarimAllah Ahmed, kvm list, LKML,
	the arch/x86 maintainers, Asit Mallick, Arjan Van De Ven,
	Dave Hansen, Andi Kleen, Andrea Arcangeli, Linus Torvalds,
	Tim Chen, Thomas Gleixner, Dan Williams, Jun Nakajima,
	David Woodhouse, Greg KH, Andy Lutomirski, Ashok Raj

On 31/01/2018 15:54, Jim Mattson wrote:
> You seem to be making the assumption that there is one L2. What if
> there are 100 L2s, and only one has write-access to IA32_SPEC_CTRL? Or
> what if there once was such an L2, but it's been gone for months? The
> current mechanism penalizes *all* L2s if any L2, ever, has
> write-access to IA32_SPEC_CTRL.

Yes, but how would moving the field into struct loaded_vmcs do anything?
 Only vmon/vmoff would change anything in vmx->nested.vmcs02.

Even then, L1 vmexits will also be penalized because L1 has probably
done an RDMSR/WRMSR on L2->L1 vmexit.  So I don't think it's an issue?

Paolo

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 4/5] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
  2018-01-31 21:00             ` Paolo Bonzini
@ 2018-01-31 21:05               ` Jim Mattson
  2018-01-31 21:17                   ` Woodhouse, David
  2018-01-31 21:42                 ` Paolo Bonzini
  0 siblings, 2 replies; 56+ messages in thread
From: Jim Mattson @ 2018-01-31 21:05 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: KarimAllah Ahmed, KarimAllah Ahmed, kvm list, LKML,
	the arch/x86 maintainers, Asit Mallick, Arjan Van De Ven,
	Dave Hansen, Andi Kleen, Andrea Arcangeli, Linus Torvalds,
	Tim Chen, Thomas Gleixner, Dan Williams, Jun Nakajima,
	David Woodhouse, Greg KH, Andy Lutomirski, Ashok Raj

On Wed, Jan 31, 2018 at 1:00 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:

> Yes, but how would moving the field into struct loaded_vmcs do anything?
>  Only vmon/vmoff would change anything in vmx->nested.vmcs02.

My suggestion was that nested_vmx_merge_msr_bitmap should set the
vmcs02 version of save_spec_ctrl_on_exit based on the calculated value
of the write permission bit for IA32_SPEC_CTRL in the vmcs02 MSR
permission bitmap.

> Even then, L1 vmexits will also be penalized because L1 has probably
> done an RDMSR/WRMSR on L2->L1 vmexit.  So I don't think it's an issue?

Yes, it sucks to be L1 in this situation.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 4/5] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
  2018-01-31 21:05               ` Jim Mattson
@ 2018-01-31 21:17                   ` Woodhouse, David
  2018-01-31 21:42                 ` Paolo Bonzini
  1 sibling, 0 replies; 56+ messages in thread
From: Woodhouse, David @ 2018-01-31 21:17 UTC (permalink / raw)
  To: Jim Mattson, Paolo Bonzini
  Cc: KarimAllah Ahmed, KarimAllah Ahmed, kvm list, LKML,
	the arch/x86 maintainers, Asit Mallick, Arjan Van De Ven,
	Dave Hansen, Andi Kleen, Andrea Arcangeli, Linus Torvalds,
	Tim Chen, Thomas Gleixner, Dan Williams, Jun Nakajima, Greg KH,
	Andy Lutomirski, Ashok Raj

[-- Attachment #1: Type: text/plain, Size: 973 bytes --]

On Wed, 2018-01-31 at 13:05 -0800, Jim Mattson wrote:
> On Wed, Jan 31, 2018 at 1:00 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> > Yes, but how would moving the field into struct loaded_vmcs do anything?
> >  Only vmon/vmoff would change anything in vmx->nested.vmcs02.
> 
> My suggestion was that nested_vmx_merge_msr_bitmap should set the
> vmcs02 version of save_spec_ctrl_on_exit based on the calculated value
> of the write permission bit for IA32_SPEC_CTRL in the vmcs02 MSR
> permission bitmap.
> 
> > Even then, L1 vmexits will also be penalized because L1 has probably
> > done an RDMSR/WRMSR on L2->L1 vmexit.  So I don't think it's an issue?
> 
> Yes, it sucks to be L1 in this situation.

Well... we *could* clear the save_spec_ctrl_on_exit flag and intercept
the MSR again, any time that the actual value of spec_ctrl is zero. 

I don't think we'd want to do that too aggressively, but there might be
something we could do there.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5210 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 4/5] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
@ 2018-01-31 21:17                   ` Woodhouse, David
  0 siblings, 0 replies; 56+ messages in thread
From: Woodhouse, David @ 2018-01-31 21:17 UTC (permalink / raw)
  To: jmattson, pbonzini
  Cc: kvm, linux-kernel, arjan.van.de.ven, ashok.raj, Raslan,
	KarimAllah, Raslan, KarimAllah, torvalds, tglx, tim.c.chen, ak,
	x86, dan.j.williams, aarcange, luto, gregkh, dave.hansen,
	asit.k.mallick, jun.nakajima


[-- Attachment #1.1: Type: text/plain, Size: 973 bytes --]

On Wed, 2018-01-31 at 13:05 -0800, Jim Mattson wrote:
> On Wed, Jan 31, 2018 at 1:00 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> > Yes, but how would moving the field into struct loaded_vmcs do anything?
> >  Only vmon/vmoff would change anything in vmx->nested.vmcs02.
> 
> My suggestion was that nested_vmx_merge_msr_bitmap should set the
> vmcs02 version of save_spec_ctrl_on_exit based on the calculated value
> of the write permission bit for IA32_SPEC_CTRL in the vmcs02 MSR
> permission bitmap.
> 
> > Even then, L1 vmexits will also be penalized because L1 has probably
> > done an RDMSR/WRMSR on L2->L1 vmexit.  So I don't think it's an issue?
> 
> Yes, it sucks to be L1 in this situation.

Well... we *could* clear the save_spec_ctrl_on_exit flag and intercept
the MSR again, any time that the actual value of spec_ctrl is zero. 

I don't think we'd want to do that too aggressively, but there might be
something we could do there.

[-- Attachment #1.2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5210 bytes --]

[-- Attachment #2.1: Type: text/plain, Size: 197 bytes --]




Amazon Web Services UK Limited. Registered in England and Wales with registration number 08650665 and which has its registered office at 60 Holborn Viaduct, London EC1A 2FD, United Kingdom.

[-- Attachment #2.2: Type: text/html, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 4/5] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
  2018-01-31 20:21           ` David Woodhouse
  (?)
@ 2018-01-31 21:18           ` Jim Mattson
  2018-01-31 22:05             ` David Woodhouse
  -1 siblings, 1 reply; 56+ messages in thread
From: Jim Mattson @ 2018-01-31 21:18 UTC (permalink / raw)
  To: David Woodhouse
  Cc: KarimAllah Ahmed, KarimAllah Ahmed, kvm list, LKML,
	the arch/x86 maintainers, Asit Mallick, Arjan Van De Ven,
	Dave Hansen, Andi Kleen, Andrea Arcangeli, Linus Torvalds,
	Tim Chen, Thomas Gleixner, Dan Williams, Jun Nakajima,
	Paolo Bonzini, Greg KH, Andy Lutomirski, Ashok Raj

On Wed, Jan 31, 2018 at 12:21 PM, David Woodhouse <dwmw2@infradead.org> wrote:

> Reading and writing this MSR is expensive. And if it's yielded to the
> guest in the MSR bitmap, that means we have to save its value on vmexit
> and set it back to zero.

Agreed. But my point is that if it's not yielded to the guest in the
MSR bitmap, then we don't have to save its value on VM-exit and set it
back to zero. The vmcs02 MSR bitmap is reconstructed on every L1->L2
transition. Sometimes, it will yield the MSR and sometimes it won't.

> Some of the gymnastics here are explicitly done to avoid having to do
> that save-and-zero step unless the guest has *actually* touched the
> MSR. Not just if we are *willing* to let it do so.
>
> That's the whole point in the yield-after-first-write dance.

Sorry; bad choice of words on my part. All that L0 knows of L1's
"willingness" to pass the MSR through to L2 comes from the vmcs12 MSR
permission bitmap. If L1 also adopts a "clear the WRMSR intercept on
first write" strategy, then as far as L0 can tell, L1 is "unwilling"
to pass the MSR through until L2 has written it.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 4/5] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
  2018-01-31 21:05               ` Jim Mattson
  2018-01-31 21:17                   ` Woodhouse, David
@ 2018-01-31 21:42                 ` Paolo Bonzini
  2018-01-31 21:53                   ` Jim Mattson
  1 sibling, 1 reply; 56+ messages in thread
From: Paolo Bonzini @ 2018-01-31 21:42 UTC (permalink / raw)
  To: Jim Mattson
  Cc: KarimAllah Ahmed, KarimAllah Ahmed, kvm list, LKML,
	the arch/x86 maintainers, Asit Mallick, Arjan Van De Ven,
	Dave Hansen, Andi Kleen, Andrea Arcangeli, Linus Torvalds,
	Tim Chen, Thomas Gleixner, Dan Williams, Jun Nakajima,
	David Woodhouse, Greg KH, Andy Lutomirski, Ashok Raj

On 31/01/2018 16:05, Jim Mattson wrote:
> On Wed, Jan 31, 2018 at 1:00 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
>> Yes, but how would moving the field into struct loaded_vmcs do anything?
>>  Only vmon/vmoff would change anything in vmx->nested.vmcs02.
> 
> My suggestion was that nested_vmx_merge_msr_bitmap should set the
> vmcs02 version of save_spec_ctrl_on_exit based on the calculated value
> of the write permission bit for IA32_SPEC_CTRL in the vmcs02 MSR
> permission bitmap.
> 
>> Even then, L1 vmexits will also be penalized because L1 has probably
>> done an RDMSR/WRMSR on L2->L1 vmexit.  So I don't think it's an issue?
> 
> Yes, it sucks to be L1 in this situation.

Can we just say it sucks to be L2 too? :)  Because in the end as long as
no one ever writes to spec_ctrl, everybody is happy.

Paolo

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 4/5] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
  2018-01-31 21:42                 ` Paolo Bonzini
@ 2018-01-31 21:53                   ` Jim Mattson
  2018-01-31 21:59                     ` Paolo Bonzini
  2018-01-31 21:59                     ` David Woodhouse
  0 siblings, 2 replies; 56+ messages in thread
From: Jim Mattson @ 2018-01-31 21:53 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: KarimAllah Ahmed, KarimAllah Ahmed, kvm list, LKML,
	the arch/x86 maintainers, Asit Mallick, Arjan Van De Ven,
	Dave Hansen, Andi Kleen, Andrea Arcangeli, Linus Torvalds,
	Tim Chen, Thomas Gleixner, Dan Williams, Jun Nakajima,
	David Woodhouse, Greg KH, Andy Lutomirski, Ashok Raj

On Wed, Jan 31, 2018 at 1:42 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:

> Can we just say it sucks to be L2 too? :)  Because in the end as long as
> no one ever writes to spec_ctrl, everybody is happy.

Unfortunately, quite a few OS vendors shipped IBRS-based mitigations
earlier this month. (Has Redhat stopped writing to IA32_SPEC_CTRL yet?
:-)

And in the long run, everyone is going to set IA32_SPEC_CTRL.IBRS=1 on
CPUs with IA32_ARCH_CAPABILITIES.IBRS_ALL.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 4/5] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
  2018-01-31 21:53                   ` Jim Mattson
@ 2018-01-31 21:59                     ` Paolo Bonzini
  2018-01-31 21:59                     ` David Woodhouse
  1 sibling, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2018-01-31 21:59 UTC (permalink / raw)
  To: Jim Mattson
  Cc: KarimAllah Ahmed, KarimAllah Ahmed, kvm list, LKML,
	the arch/x86 maintainers, Asit Mallick, Arjan Van De Ven,
	Dave Hansen, Andi Kleen, Andrea Arcangeli, Linus Torvalds,
	Tim Chen, Thomas Gleixner, Dan Williams, Jun Nakajima,
	David Woodhouse, Greg KH, Andy Lutomirski, Ashok Raj

On 31/01/2018 16:53, Jim Mattson wrote:
> On Wed, Jan 31, 2018 at 1:42 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
>> Can we just say it sucks to be L2 too? :)  Because in the end as long as
>> no one ever writes to spec_ctrl, everybody is happy.
> 
> Unfortunately, quite a few OS vendors shipped IBRS-based mitigations
> earlier this month. (Has Redhat stopped writing to IA32_SPEC_CTRL yet?
> :-)

Not yet, but getting there. :)

> And in the long run, everyone is going to set IA32_SPEC_CTRL.IBRS=1 on
> CPUs with IA32_ARCH_CAPABILITIES.IBRS_ALL.

And then it will suck for everyone---they will have to pay the price of
saving/restoring an MSR that is going to be written just once.  Perhaps
we will have to tweak the heuristic, only passing IBRS through when the
guest writes IBRS=0.

In the end I think it's premature to try and optimize for L2 guests of
long-lived L1 hypervisors.

Paolo

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 4/5] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
  2018-01-31 21:53                   ` Jim Mattson
  2018-01-31 21:59                     ` Paolo Bonzini
@ 2018-01-31 21:59                     ` David Woodhouse
  2018-01-31 22:06                       ` Jim Mattson
  2018-02-01 14:09                       ` Paolo Bonzini
  1 sibling, 2 replies; 56+ messages in thread
From: David Woodhouse @ 2018-01-31 21:59 UTC (permalink / raw)
  To: Jim Mattson, Paolo Bonzini
  Cc: KarimAllah Ahmed, KarimAllah Ahmed, kvm list, LKML,
	the arch/x86 maintainers, Asit Mallick, Arjan Van De Ven,
	Dave Hansen, Andi Kleen, Andrea Arcangeli, Linus Torvalds,
	Tim Chen, Thomas Gleixner, Dan Williams, Jun Nakajima, Greg KH,
	Andy Lutomirski, Ashok Raj

[-- Attachment #1: Type: text/plain, Size: 986 bytes --]



On Wed, 2018-01-31 at 13:53 -0800, Jim Mattson wrote:
> On Wed, Jan 31, 2018 at 1:42 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> > Can we just say it sucks to be L2 too? :)  Because in the end as long as
> > no one ever writes to spec_ctrl, everybody is happy.
> 
> Unfortunately, quite a few OS vendors shipped IBRS-based mitigations
> earlier this month. (Has Redhat stopped writing to IA32_SPEC_CTRL yet?
> :-)
> 
> And in the long run, everyone is going to set IA32_SPEC_CTRL.IBRS=1 on
> CPUs with IA32_ARCH_CAPABILITIES.IBRS_ALL.


I'm actually working on IBRS_ALL at the moment.

I was tempted to *not* let the guests turn it off. Expose SPEC_CTRL but
just make it a no-op.

Or if that really doesn't fly, perhaps with IBRS_ALL we should invert
the logic. Set IBRS to 1 on vCPU reset, and only if it's set to *zero*
do we pass through the MSR and set the save_spec_ctrl_on_exit flag.

But let's get the code for *current* hardware done first...

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 4/5] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
  2018-01-31 21:18           ` Jim Mattson
@ 2018-01-31 22:05             ` David Woodhouse
  0 siblings, 0 replies; 56+ messages in thread
From: David Woodhouse @ 2018-01-31 22:05 UTC (permalink / raw)
  To: Jim Mattson
  Cc: KarimAllah Ahmed, KarimAllah Ahmed, kvm list, LKML,
	the arch/x86 maintainers, Asit Mallick, Arjan Van De Ven,
	Dave Hansen, Andi Kleen, Andrea Arcangeli, Linus Torvalds,
	Tim Chen, Thomas Gleixner, Dan Williams, Jun Nakajima,
	Paolo Bonzini, Greg KH, Andy Lutomirski, Ashok Raj

[-- Attachment #1: Type: text/plain, Size: 902 bytes --]



On Wed, 2018-01-31 at 13:18 -0800, Jim Mattson wrote:
> On Wed, Jan 31, 2018 at 12:21 PM, David Woodhouse  wrote:
> 
> > 
> > Reading and writing this MSR is expensive. And if it's yielded to the
> > guest in the MSR bitmap, that means we have to save its value on vmexit
> > and set it back to zero.
>
> Agreed. But my point is that if it's not yielded to the guest in the
> MSR bitmap, then we don't have to save its value on VM-exit and set it
> back to zero. The vmcs02 MSR bitmap is reconstructed on every L1->L2
> transition. Sometimes, it will yield the MSR and sometimes it won't.

Strictly: if SPEC_CTRL is not already set to 1 *and* hasn't been
yielded to the guest in the MSR bitmap, then we don't have to set it
back to zero.

If L1 decides it's *always* going to trap and never pass through, but
the value is already set to non-zero, we need to get that case right.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 4/5] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
  2018-01-31 21:59                     ` David Woodhouse
@ 2018-01-31 22:06                       ` Jim Mattson
  2018-01-31 22:10                           ` David Woodhouse
  2018-01-31 22:53                           ` Andy Lutomirski
  2018-02-01 14:09                       ` Paolo Bonzini
  1 sibling, 2 replies; 56+ messages in thread
From: Jim Mattson @ 2018-01-31 22:06 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Paolo Bonzini, KarimAllah Ahmed, KarimAllah Ahmed, kvm list,
	LKML, the arch/x86 maintainers, Asit Mallick, Arjan Van De Ven,
	Dave Hansen, Andi Kleen, Andrea Arcangeli, Linus Torvalds,
	Tim Chen, Thomas Gleixner, Dan Williams, Jun Nakajima, Greg KH,
	Andy Lutomirski, Ashok Raj

On Wed, Jan 31, 2018 at 1:59 PM, David Woodhouse <dwmw2@infradead.org> wrote:
> I'm actually working on IBRS_ALL at the moment.
>
> I was tempted to *not* let the guests turn it off. Expose SPEC_CTRL but
> just make it a no-op.

Maybe we could convince Intel to add a LOCK bit to IA32_SPEC_CTRL like
the one in IA32_FEATURE_CONTROL.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 4/5] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
  2018-01-31 22:06                       ` Jim Mattson
@ 2018-01-31 22:10                           ` David Woodhouse
  2018-01-31 22:53                           ` Andy Lutomirski
  1 sibling, 0 replies; 56+ messages in thread
From: David Woodhouse @ 2018-01-31 22:10 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Paolo Bonzini, KarimAllah Ahmed, KarimAllah Ahmed, kvm list,
	LKML, the arch/x86 maintainers, Asit Mallick, Arjan Van De Ven,
	Dave Hansen, Andi Kleen, Andrea Arcangeli, Linus Torvalds,
	Tim Chen, Thomas Gleixner, Dan Williams, Jun Nakajima, Greg KH,
	Andy Lutomirski, Ashok Raj

[-- Attachment #1: Type: text/plain, Size: 834 bytes --]



On Wed, 2018-01-31 at 14:06 -0800, Jim Mattson wrote:
> On Wed, Jan 31, 2018 at 1:59 PM, David Woodhouse <dwmw2@infradead.org> wrote:
> > I'm actually working on IBRS_ALL at the moment.
> >
> > I was tempted to *not* let the guests turn it off. Expose SPEC_CTRL but
> > just make it a no-op.
> 
> Maybe we could convince Intel to add a LOCK bit to IA32_SPEC_CTRL like
> the one in IA32_FEATURE_CONTROL.

Given that IBRS_ALL is supposed to be a sanely-performing option, I'd
rather convince Intel to just make it unconditional. If they've added
the appropriate tagging to the BTB, why even *have* this deliberately
insecure mode when IBRS==0?

I understand that until/unless they get a *proper* fix, software is
still going to have to use IBPB as appropriate. But there's no need for
the IBRS bit to do *anything*.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 4/5] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
@ 2018-01-31 22:10                           ` David Woodhouse
  0 siblings, 0 replies; 56+ messages in thread
From: David Woodhouse @ 2018-01-31 22:10 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Paolo Bonzini, KarimAllah Ahmed, KarimAllah Ahmed, kvm list,
	LKML, the arch/x86 maintainers, Asit Mallick, Arjan Van De Ven,
	Dave Hansen, Andi Kleen, Andrea Arcangeli, Linus Torvalds,
	Tim Chen, Thomas Gleixner, Dan Williams, Jun Nakajima, Greg KH,
	Andy Lutomirski, Ashok Raj

[-- Attachment #1: Type: text/plain, Size: 834 bytes --]



On Wed, 2018-01-31 at 14:06 -0800, Jim Mattson wrote:
> On Wed, Jan 31, 2018 at 1:59 PM, David Woodhouse <dwmw2@infradead.org> wrote:
> > I'm actually working on IBRS_ALL at the moment.
> >
> > I was tempted to *not* let the guests turn it off. Expose SPEC_CTRL but
> > just make it a no-op.
> 
> Maybe we could convince Intel to add a LOCK bit to IA32_SPEC_CTRL like
> the one in IA32_FEATURE_CONTROL.

Given that IBRS_ALL is supposed to be a sanely-performing option, I'd
rather convince Intel to just make it unconditional. If they've added
the appropriate tagging to the BTB, why even *have* this deliberately
insecure mode when IBRS==0?

I understand that until/unless they get a *proper* fix, software is
still going to have to use IBPB as appropriate. But there's no need for
the IBRS bit to do *anything*.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 4/5] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
  2018-01-31 22:10                           ` David Woodhouse
  (?)
@ 2018-01-31 22:21                           ` Linus Torvalds
  -1 siblings, 0 replies; 56+ messages in thread
From: Linus Torvalds @ 2018-01-31 22:21 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Jim Mattson, Paolo Bonzini, KarimAllah Ahmed, KarimAllah Ahmed,
	kvm list, LKML, the arch/x86 maintainers, Asit Mallick,
	Arjan Van De Ven, Dave Hansen, Andi Kleen, Andrea Arcangeli,
	Tim Chen, Thomas Gleixner, Dan Williams, Jun Nakajima, Greg KH,
	Andy Lutomirski, Ashok Raj

On Wed, Jan 31, 2018 at 2:10 PM, David Woodhouse <dwmw2@infradead.org> wrote:
>
> Given that IBRS_ALL is supposed to be a sanely-performing option, I'd
> rather convince Intel to just make it unconditional. If they've added
> the appropriate tagging to the BTB, why even *have* this deliberately
> insecure mode when IBRS==0?
>
> I understand that until/unless they get a *proper* fix, software is
> still going to have to use IBPB as appropriate. But there's no need for
> the IBRS bit to do *anything*.

Amen, brother!

Please please please can Amazon and friends push this? The current
situation with IBRS_ALL is complete nasty horrible garbage. It's
pointless on current CPU's, and it's not well-defined enough on future
CPU's.

The whole "you can enable this, but performance may or may not be
acceptable, and we won't tell you" thing is some bad mumbo-jumbo.

Before IBRS is good, we'll do retpoline and BTB stuffing and have
those (hopefully very rare) IBPB's. So the whole "badly performing
IBRS_ALL" is completely pointless, and actively wrong.

           Linus

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 4/5] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
  2018-01-31 20:18       ` Jim Mattson
  2018-01-31 20:21           ` David Woodhouse
  2018-01-31 20:34         ` Paolo Bonzini
@ 2018-01-31 22:52         ` KarimAllah Ahmed
  2018-02-01  0:24           ` KarimAllah Ahmed
  2 siblings, 1 reply; 56+ messages in thread
From: KarimAllah Ahmed @ 2018-01-31 22:52 UTC (permalink / raw)
  To: Jim Mattson
  Cc: KarimAllah Ahmed, kvm list, LKML, the arch/x86 maintainers,
	Asit Mallick, Arjan Van De Ven, Dave Hansen, Andi Kleen,
	Andrea Arcangeli, Linus Torvalds, Tim Chen, Thomas Gleixner,
	Dan Williams, Jun Nakajima, Paolo Bonzini, David Woodhouse,
	Greg KH, Andy Lutomirski, Ashok Raj

On 01/31/2018 09:18 PM, Jim Mattson wrote:
> On Wed, Jan 31, 2018 at 12:01 PM, KarimAllah Ahmed <karahmed@amazon.com> wrote:
> 
>> but save_spec_ctrl_on_exit is also set for L2 write. So once L2 writes
>> to it, this condition will be true and then the bitmap will be updated.
> 
> So if L1 or any L2 writes to the MSR, then save_spec_ctrl_on_exit is
> set to true, even if the MSR permission bitmap for a particular VMCS
> *doesn't* allow the MSR to be written without an intercept. That's
> functionally correct, but inefficient. It seems to me that
> save_spec_ctrl_on_exit should indicate whether or not the *current*
> MSR permission bitmap allows unintercepted writes to IA32_SPEC_CTRL.
> To that end, perhaps save_spec_ctrl_on_exit rightfully belongs in the
> loaded_vmcs structure, alongside the msr_bitmap pointer that it is
> associated with. For vmcs02, nested_vmx_merge_msr_bitmap() should set
> the vmcs02 save_spec_ctrl_on_exit based on (a) whether L0 is willing
> to yield the MSR to L1, and (b) whether L1 is willing to yield the MSR
> to L2.

I actually got rid of this save_spec_ctrl_on_exit variable and replaced
it with another variable like the one suggested for IBPB. Just to avoid
doing an expensive guest_cpuid_has. Now I peak instead in the MSR bitmap
to figure out if this MSR was supposed to be intercepted or not. This
test should provide a similar semantics to save_spec_ctrl_on_exit.

Anyway, cleaning up/testing now and will post a new version.
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 4/5] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
  2018-01-31 22:06                       ` Jim Mattson
@ 2018-01-31 22:53                           ` Andy Lutomirski
  2018-01-31 22:53                           ` Andy Lutomirski
  1 sibling, 0 replies; 56+ messages in thread
From: Andy Lutomirski @ 2018-01-31 22:53 UTC (permalink / raw)
  To: Jim Mattson
  Cc: David Woodhouse, Paolo Bonzini, KarimAllah Ahmed,
	KarimAllah Ahmed, kvm list, LKML, the arch/x86 maintainers,
	Asit Mallick, Arjan Van De Ven, Dave Hansen, Andi Kleen,
	Andrea Arcangeli, Linus Torvalds, Tim Chen, Thomas Gleixner,
	Dan Williams, Jun Nakajima, Greg KH, Andy Lutomirski, Ashok Raj



> On Jan 31, 2018, at 2:06 PM, Jim Mattson <jmattson@google.com> wrote:
> 
>> On Wed, Jan 31, 2018 at 1:59 PM, David Woodhouse <dwmw2@infradead.org> wrote:
>> I'm actually working on IBRS_ALL at the moment.
>> 
>> I was tempted to *not* let the guests turn it off. Expose SPEC_CTRL but
>> just make it a no-op.
> 
> Maybe we could convince Intel to add a LOCK bit to IA32_SPEC_CTRL like
> the one in IA32_FEATURE_CONTROL.

Please no.  Some BIOS vendor is going to lock it to zero to win some silly benchmark.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 4/5] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
@ 2018-01-31 22:53                           ` Andy Lutomirski
  0 siblings, 0 replies; 56+ messages in thread
From: Andy Lutomirski @ 2018-01-31 22:53 UTC (permalink / raw)
  To: Jim Mattson
  Cc: David Woodhouse, Paolo Bonzini, KarimAllah Ahmed,
	KarimAllah Ahmed, kvm list, LKML, the arch/x86 maintainers,
	Asit Mallick, Arjan Van De Ven, Dave Hansen, Andi Kleen,
	Andrea Arcangeli, Linus Torvalds, Tim Chen, Thomas Gleixner,
	Dan Williams, Jun Nakajima, Greg KH, Andy Lutomirski, A



> On Jan 31, 2018, at 2:06 PM, Jim Mattson <jmattson@google.com> wrote:
> 
>> On Wed, Jan 31, 2018 at 1:59 PM, David Woodhouse <dwmw2@infradead.org> wrote:
>> I'm actually working on IBRS_ALL at the moment.
>> 
>> I was tempted to *not* let the guests turn it off. Expose SPEC_CTRL but
>> just make it a no-op.
> 
> Maybe we could convince Intel to add a LOCK bit to IA32_SPEC_CTRL like
> the one in IA32_FEATURE_CONTROL.

Please no.  Some BIOS vendor is going to lock it to zero to win some silly benchmark.

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 4/5] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
  2018-01-31 19:37 ` [PATCH v5 4/5] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL KarimAllah Ahmed
  2018-01-31 19:53   ` Jim Mattson
@ 2018-01-31 22:56   ` Raj, Ashok
  1 sibling, 0 replies; 56+ messages in thread
From: Raj, Ashok @ 2018-01-31 22:56 UTC (permalink / raw)
  To: KarimAllah Ahmed
  Cc: kvm, linux-kernel, x86, Asit Mallick, Arjan Van De Ven,
	Dave Hansen, Andi Kleen, Andrea Arcangeli, Linus Torvalds,
	Tim Chen, Thomas Gleixner, Dan Williams, Jun Nakajima,
	Paolo Bonzini, David Woodhouse, Greg KH, Andy Lutomirski,
	Ashok Raj

Hi Karim

On Wed, Jan 31, 2018 at 08:37:46PM +0100, KarimAllah Ahmed wrote:
> [ Based on a patch from Ashok Raj <ashok.raj@intel.com> ]
> 
> Add direct access to MSR_IA32_SPEC_CTRL for guests. This is needed for
> guests that will only mitigate Spectre V2 through IBRS+IBPB and will not
> be using a retpoline+IBPB based approach.

With these changes SPEC_CTRL is properly exposed to the guest when using
latest Qemu.

> 
> To avoid the overhead of atomically saving and restoring the
> MSR_IA32_SPEC_CTRL for guests that do not actually use the MSR, only
> add_atomic_switch_msr when a non-zero is written to it.
> 
> No attempt is made to handle STIBP here, intentionally. Filtering STIBP
> may be added in a future patch, which may require trapping all writes
> if we don't want to pass it through directly to the guest.
> 
> [dwmw2: Clean up CPUID bits, save/restore manually, handle reset]
> 

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 4/5] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
  2018-01-31 22:52         ` KarimAllah Ahmed
@ 2018-02-01  0:24           ` KarimAllah Ahmed
  2018-02-01  4:26               ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 56+ messages in thread
From: KarimAllah Ahmed @ 2018-02-01  0:24 UTC (permalink / raw)
  To: Jim Mattson
  Cc: KarimAllah Ahmed, kvm list, LKML, the arch/x86 maintainers,
	Asit Mallick, Arjan Van De Ven, Dave Hansen, Andi Kleen,
	Andrea Arcangeli, Linus Torvalds, Tim Chen, Thomas Gleixner,
	Dan Williams, Jun Nakajima, Paolo Bonzini, David Woodhouse,
	Greg KH, Andy Lutomirski, Ashok Raj

[-- Attachment #1: Type: text/plain, Size: 1855 bytes --]

On 01/31/2018 11:52 PM, KarimAllah Ahmed wrote:
> On 01/31/2018 09:18 PM, Jim Mattson wrote:
>> On Wed, Jan 31, 2018 at 12:01 PM, KarimAllah Ahmed 
>> <karahmed@amazon.com> wrote:
>>
>>> but save_spec_ctrl_on_exit is also set for L2 write. So once L2 writes
>>> to it, this condition will be true and then the bitmap will be updated.
>>
>> So if L1 or any L2 writes to the MSR, then save_spec_ctrl_on_exit is
>> set to true, even if the MSR permission bitmap for a particular VMCS
>> *doesn't* allow the MSR to be written without an intercept. That's
>> functionally correct, but inefficient. It seems to me that
>> save_spec_ctrl_on_exit should indicate whether or not the *current*
>> MSR permission bitmap allows unintercepted writes to IA32_SPEC_CTRL.
>> To that end, perhaps save_spec_ctrl_on_exit rightfully belongs in the
>> loaded_vmcs structure, alongside the msr_bitmap pointer that it is
>> associated with. For vmcs02, nested_vmx_merge_msr_bitmap() should set
>> the vmcs02 save_spec_ctrl_on_exit based on (a) whether L0 is willing
>> to yield the MSR to L1, and (b) whether L1 is willing to yield the MSR
>> to L2.
> 
> I actually got rid of this save_spec_ctrl_on_exit variable and replaced
> it with another variable like the one suggested for IBPB. Just to avoid
> doing an expensive guest_cpuid_has. Now I peak instead in the MSR bitmap
> to figure out if this MSR was supposed to be intercepted or not. This
> test should provide a similar semantics to save_spec_ctrl_on_exit.
> 
> Anyway, cleaning up/testing now and will post a new version.

I think this patch should address all your concerns.
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B

[-- Attachment #2: 0004-KVM-VMX-Allow-direct-access-to-MSR_IA32_SPEC_CTRL.patch --]
[-- Type: text/x-patch, Size: 9883 bytes --]

>From 9c19a8ac3f021efba6f70ad7e28f7ad06bb97e43 Mon Sep 17 00:00:00 2001
From: KarimAllah Ahmed <karahmed@amazon.de>
Date: Mon, 29 Jan 2018 19:58:10 +0000
Subject: [PATCH] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL

[ Based on a patch from Ashok Raj <ashok.raj@intel.com> ]

Add direct access to MSR_IA32_SPEC_CTRL for guests. This is needed for
guests that will only mitigate Spectre V2 through IBRS+IBPB and will not
be using a retpoline+IBPB based approach.

To avoid the overhead of atomically saving and restoring the
MSR_IA32_SPEC_CTRL for guests that do not actually use the MSR, only
add_atomic_switch_msr when a non-zero is written to it.

No attempt is made to handle STIBP here, intentionally. Filtering STIBP
may be added in a future patch, which may require trapping all writes
if we don't want to pass it through directly to the guest.

[dwmw2: Clean up CPUID bits, save/restore manually, handle reset]

Cc: Asit Mallick <asit.k.mallick@intel.com>
Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jun Nakajima <jun.nakajima@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
v6:
- got rid of save_spec_ctrl_on_exit
- introduce spec_ctrl_intercepted
- introduce spec_ctrl_used
v5:
- Also check for X86_FEATURE_SPEC_CTRL for the msr reads/writes
v4:
- Add IBRS to kvm_cpuid_8000_0008_ebx_x86_features
- Handling nested guests
v3:
- Save/restore manually
- Fix CPUID handling
- Fix a copy & paste error in the name of SPEC_CTRL MSR in
  disable_intercept.
- support !cpu_has_vmx_msr_bitmap()
v2:
- remove 'host_spec_ctrl' in favor of only a comment (dwmw@).
- special case writing '0' in SPEC_CTRL to avoid confusing live-migration
  when the instance never used the MSR (dwmw@).
- depend on X86_FEATURE_IBRS instead of X86_FEATURE_SPEC_CTRL (dwmw@).
- add MSR_IA32_SPEC_CTRL to the list of MSRs to save (dropped it by accident).
---
 arch/x86/kvm/cpuid.c |  9 +++--
 arch/x86/kvm/vmx.c   | 94 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/x86.c   |  2 +-
 3 files changed, 100 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 1909635..13f5d42 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -367,7 +367,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 
 	/* cpuid 0x80000008.ebx */
 	const u32 kvm_cpuid_8000_0008_ebx_x86_features =
-		F(IBPB);
+		F(IBPB) | F(IBRS);
 
 	/* cpuid 0xC0000001.edx */
 	const u32 kvm_cpuid_C000_0001_edx_x86_features =
@@ -394,7 +394,8 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 
 	/* cpuid 7.0.edx*/
 	const u32 kvm_cpuid_7_0_edx_x86_features =
-		F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(ARCH_CAPABILITIES);
+		F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(SPEC_CTRL) |
+		F(ARCH_CAPABILITIES);
 
 	/* all calls to cpuid_count() should be made on the same cpu */
 	get_cpu();
@@ -630,9 +631,11 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 			g_phys_as = phys_as;
 		entry->eax = g_phys_as | (virt_as << 8);
 		entry->edx = 0;
-		/* IBPB isn't necessarily present in hardware cpuid */
+		/* IBRS and IBPB aren't necessarily present in hardware cpuid */
 		if (boot_cpu_has(X86_FEATURE_IBPB))
 			entry->ebx |= F(IBPB);
+		if (boot_cpu_has(X86_FEATURE_IBRS))
+			entry->ebx |= F(IBRS);
 		entry->ebx &= kvm_cpuid_8000_0008_ebx_x86_features;
 		cpuid_mask(&entry->ebx, CPUID_8000_0008_EBX);
 		break;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 6a9f4ec..bfc80ff 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -594,6 +594,14 @@ struct vcpu_vmx {
 #endif
 
 	u64 		      arch_capabilities;
+	u64 		      spec_ctrl;
+
+	/*
+	 * This indicates that:
+	 * 1) guest_cpuid_has(X86_FEATURE_IBRS) == true &&
+	 * 2) The guest has actually initiated a write against the MSR.
+	 */
+	bool spec_ctrl_used;
 
 	/*
 	 * This indicates that:
@@ -946,6 +954,8 @@ static void vmx_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked);
 static bool nested_vmx_is_page_fault_vmexit(struct vmcs12 *vmcs12,
 					    u16 error_code);
 static void vmx_update_msr_bitmap(struct kvm_vcpu *vcpu);
+static void __always_inline vmx_disable_intercept_for_msr(unsigned long *msr_bitmap,
+							  u32 msr, int type);
 
 static DEFINE_PER_CPU(struct vmcs *, vmxarea);
 static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
@@ -1917,6 +1927,22 @@ static void update_exception_bitmap(struct kvm_vcpu *vcpu)
 	vmcs_write32(EXCEPTION_BITMAP, eb);
 }
 
+/* Is SPEC_CTRL intercepted for the currently running vCPU? */
+static bool spec_ctrl_intercepted(struct kvm_vcpu *vcpu)
+{
+	unsigned long *msr_bitmap;
+	int f = sizeof(unsigned long);
+
+	if (!cpu_has_vmx_msr_bitmap())
+		return true;
+
+	msr_bitmap = is_guest_mode(vcpu) ?
+			to_vmx(vcpu)->nested.vmcs02.msr_bitmap :
+			to_vmx(vcpu)->vmcs01.msr_bitmap;
+
+	return !!test_bit(MSR_IA32_SPEC_CTRL, msr_bitmap + 0x800 / f);
+}
+
 static void clear_atomic_switch_msr_special(struct vcpu_vmx *vmx,
 		unsigned long entry, unsigned long exit)
 {
@@ -3246,6 +3272,14 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_IA32_TSC:
 		msr_info->data = guest_read_tsc(vcpu);
 		break;
+	case MSR_IA32_SPEC_CTRL:
+		if (!msr_info->host_initiated &&
+		    !guest_cpuid_has(vcpu, X86_FEATURE_IBRS) &&
+		    !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL))
+			return 1;
+
+		msr_info->data = to_vmx(vcpu)->spec_ctrl;
+		break;
 	case MSR_IA32_ARCH_CAPABILITIES:
 		if (!msr_info->host_initiated &&
 		    !guest_cpuid_has(vcpu, X86_FEATURE_ARCH_CAPABILITIES))
@@ -3359,6 +3393,34 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_IA32_TSC:
 		kvm_write_tsc(vcpu, msr_info);
 		break;
+	case MSR_IA32_SPEC_CTRL:
+		if (!msr_info->host_initiated &&
+		    !guest_cpuid_has(vcpu, X86_FEATURE_IBRS) &&
+		    !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL))
+			return 1;
+
+		vmx->spec_ctrl_used = true;
+
+		/* The STIBP bit doesn't fault even if it's not advertised */
+		if (data & ~(SPEC_CTRL_IBRS | SPEC_CTRL_STIBP))
+			return 1;
+
+		vmx->spec_ctrl = data;
+
+		/*
+		 * When it's written (to non-zero) for the first time, pass
+		 * it through. This means we don't have to take the perf
+		 * hit of saving it on vmexit for the common case of guests
+		 * that don't use it.
+		 */
+		if (cpu_has_vmx_msr_bitmap() && data &&
+		    spec_ctrl_intercepted(vcpu) &&
+		    is_guest_mode(vcpu))
+			vmx_disable_intercept_for_msr(
+					vmx->vmcs01.msr_bitmap,
+					MSR_IA32_SPEC_CTRL,
+					MSR_TYPE_RW);
+		break;
 	case MSR_IA32_PRED_CMD:
 		if (!msr_info->host_initiated &&
 		    !guest_cpuid_has(vcpu, X86_FEATURE_IBPB) &&
@@ -5678,6 +5740,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	u64 cr0;
 
 	vmx->rmode.vm86_active = 0;
+	vmx->spec_ctrl = 0;
 
 	vmx->vcpu.arch.regs[VCPU_REGS_RDX] = get_rdx_init_val();
 	kvm_set_cr8(vcpu, 0);
@@ -9349,6 +9412,15 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 
 	vmx_arm_hv_timer(vcpu);
 
+	/*
+	 * If this vCPU has touched SPEC_CTRL, restore the guest's value if
+	 * it's non-zero. Since vmentry is serialising on affected CPUs, there
+	 * is no need to worry about the conditional branch over the wrmsr
+	 * being speculatively taken.
+	 */
+	if (vmx->spec_ctrl)
+		wrmsrl(MSR_IA32_SPEC_CTRL, vmx->spec_ctrl);
+
 	vmx->__launched = vmx->loaded_vmcs->launched;
 	asm(
 		/* Store host registers */
@@ -9467,6 +9539,19 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
 #endif
 	      );
 
+	/*
+	 * We do not use IBRS in the kernel. If this vCPU has used the
+	 * SPEC_CTRL MSR it may have left it on; save the value and
+	 * turn it off. This is much more efficient than blindly adding
+	 * it to the atomic save/restore list. Especially as the former
+	 * (Saving guest MSRs on vmexit) doesn't even exist in KVM.
+	 */
+	if (!spec_ctrl_intercepted(vcpu))
+		rdmsrl(MSR_IA32_SPEC_CTRL, vmx->spec_ctrl);
+
+	if (vmx->spec_ctrl)
+		wrmsrl(MSR_IA32_SPEC_CTRL, 0);
+
 	/* Eliminate branch target predictions from guest mode */
 	vmexit_fill_RSB();
 
@@ -10092,7 +10177,8 @@ static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu,
 	unsigned long *msr_bitmap_l0 = to_vmx(vcpu)->nested.vmcs02.msr_bitmap;
 
 	if (!nested_cpu_has_virt_x2apic_mode(vmcs12) &&
-	    !to_vmx(vcpu)->pred_cmd_used)
+	    !to_vmx(vcpu)->pred_cmd_used &&
+	    !to_vmx(vcpu)->spec_ctrl_used)
 		return false;
 
 	page = kvm_vcpu_gpa_to_page(vcpu, vmcs12->msr_bitmap);
@@ -10126,6 +10212,12 @@ static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu,
 		}
 	}
 
+	if (to_vmx(vcpu)->spec_ctrl_used)
+		nested_vmx_disable_intercept_for_msr(
+					msr_bitmap_l1, msr_bitmap_l0,
+					MSR_IA32_SPEC_CTRL,
+					MSR_TYPE_R | MSR_TYPE_W);
+
 	if (to_vmx(vcpu)->pred_cmd_used)
 		nested_vmx_disable_intercept_for_msr(
 					msr_bitmap_l1, msr_bitmap_l0,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4ec142e..ac38143 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1009,7 +1009,7 @@ static u32 msrs_to_save[] = {
 #endif
 	MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA,
 	MSR_IA32_FEATURE_CONTROL, MSR_IA32_BNDCFGS, MSR_TSC_AUX,
-	MSR_IA32_ARCH_CAPABILITIES
+	MSR_IA32_SPEC_CTRL, MSR_IA32_ARCH_CAPABILITIES
 };
 
 static unsigned num_msrs_to_save;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 2/5] KVM: x86: Add IBPB support
  2018-01-31 19:55       ` Jim Mattson
@ 2018-02-01  0:27         ` KarimAllah Ahmed
  0 siblings, 0 replies; 56+ messages in thread
From: KarimAllah Ahmed @ 2018-02-01  0:27 UTC (permalink / raw)
  To: Jim Mattson, David Woodhouse
  Cc: KarimAllah Ahmed, kvm list, LKML, the arch/x86 maintainers,
	Ashok Raj, Asit Mallick, Dave Hansen, Arjan Van De Ven, Tim Chen,
	Linus Torvalds, Andrea Arcangeli, Andi Kleen, Thomas Gleixner,
	Dan Williams, Jun Nakajima, Andy Lutomirski, Greg KH,
	Paolo Bonzini, Peter Zijlstra

[-- Attachment #1: Type: text/plain, Size: 582 bytes --]

On 01/31/2018 08:55 PM, Jim Mattson wrote:
> On Wed, Jan 31, 2018 at 11:53 AM, David Woodhouse <dwmw2@infradead.org> wrote:
>> Rather than doing the expensive guest_cpu_has() every time (which is
>> worse now as we realised we need two of them) perhaps we should
>> introduce a local flag for that too?
> 
> That sounds good to me.
> 

Done.
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B

[-- Attachment #2: 0002-KVM-x86-Add-IBPB-support.patch --]
[-- Type: text/x-patch, Size: 9574 bytes --]

>From d51391ae3667f85cd1d6160e83c1d6c28b47b7d8 Mon Sep 17 00:00:00 2001
From: Ashok Raj <ashok.raj@intel.com>
Date: Thu, 11 Jan 2018 17:32:19 -0800
Subject: [PATCH] KVM: x86: Add IBPB support

The Indirect Branch Predictor Barrier (IBPB) is an indirect branch
control mechanism. It keeps earlier branches from influencing
later ones.

Unlike IBRS and STIBP, IBPB does not define a new mode of operation.
It's a command that ensures predicted branch targets aren't used after
the barrier. Although IBRS and IBPB are enumerated by the same CPUID
enumeration, IBPB is very different.

IBPB helps mitigate against three potential attacks:

* Mitigate guests from being attacked by other guests.
  - This is addressed by issing IBPB when we do a guest switch.

* Mitigate attacks from guest/ring3->host/ring3.
  These would require a IBPB during context switch in host, or after
  VMEXIT. The host process has two ways to mitigate
  - Either it can be compiled with retpoline
  - If its going through context switch, and has set !dumpable then
    there is a IBPB in that path.
    (Tim's patch: https://patchwork.kernel.org/patch/10192871)
  - The case where after a VMEXIT you return back to Qemu might make
    Qemu attackable from guest when Qemu isn't compiled with retpoline.
  There are issues reported when doing IBPB on every VMEXIT that resulted
  in some tsc calibration woes in guest.

* Mitigate guest/ring0->host/ring0 attacks.
  When host kernel is using retpoline it is safe against these attacks.
  If host kernel isn't using retpoline we might need to do a IBPB flush on
  every VMEXIT.

Even when using retpoline for indirect calls, in certain conditions 'ret'
can use the BTB on Skylake-era CPUs. There are other mitigations
available like RSB stuffing/clearing.

* IBPB is issued only for SVM during svm_free_vcpu().
  VMX has a vmclear and SVM doesn't.  Follow discussion here:
  https://lkml.org/lkml/2018/1/15/146

Please refer to the following spec for more details on the enumeration
and control.

Refer here to get documentation about mitigations.

https://software.intel.com/en-us/side-channel-security-support

[peterz: rebase and changelog rewrite]
[karahmed: - rebase
           - vmx: expose PRED_CMD if guest has it in CPUID
           - svm: only pass through IBPB if guest has it in CPUID
           - vmx: support !cpu_has_vmx_msr_bitmap()]
           - vmx: support nested]
[dwmw2: Expose CPUID bit too (AMD IBPB only for now as we lack IBRS)
        PRED_CMD is a write-only MSR]

Cc: Asit Mallick <asit.k.mallick@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jun Nakajima <jun.nakajima@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1515720739-43819-6-git-send-email-ashok.raj@intel.com
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
---
v6:
- introduce pred_cmd_used

v5:
- Use MSR_TYPE_W instead of MSR_TYPE_R for the MSR.
- Always merge the bitmaps unconditionally.
- Add PRED_CMD to direct_access_msrs.
- Also check for X86_FEATURE_SPEC_CTRL for the msr reads/writes
- rewrite the commit message (from ashok.raj@)
---
 arch/x86/kvm/cpuid.c | 11 ++++++++++-
 arch/x86/kvm/svm.c   | 28 ++++++++++++++++++++++++++++
 arch/x86/kvm/vmx.c   | 42 ++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 78 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index c0eb337..033004d 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -365,6 +365,10 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 		F(3DNOWPREFETCH) | F(OSVW) | 0 /* IBS */ | F(XOP) |
 		0 /* SKINIT, WDT, LWP */ | F(FMA4) | F(TBM);
 
+	/* cpuid 0x80000008.ebx */
+	const u32 kvm_cpuid_8000_0008_ebx_x86_features =
+		F(IBPB);
+
 	/* cpuid 0xC0000001.edx */
 	const u32 kvm_cpuid_C000_0001_edx_x86_features =
 		F(XSTORE) | F(XSTORE_EN) | F(XCRYPT) | F(XCRYPT_EN) |
@@ -625,7 +629,12 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 		if (!g_phys_as)
 			g_phys_as = phys_as;
 		entry->eax = g_phys_as | (virt_as << 8);
-		entry->ebx = entry->edx = 0;
+		entry->edx = 0;
+		/* IBPB isn't necessarily present in hardware cpuid */
+		if (boot_cpu_has(X86_FEATURE_IBPB))
+			entry->ebx |= F(IBPB);
+		entry->ebx &= kvm_cpuid_8000_0008_ebx_x86_features;
+		cpuid_mask(&entry->ebx, CPUID_8000_0008_EBX);
 		break;
 	}
 	case 0x80000019:
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index f40d0da..bfbb7b9 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -250,6 +250,7 @@ static const struct svm_direct_access_msrs {
 	{ .index = MSR_SYSCALL_MASK,			.always = true  },
 #endif
 	{ .index = MSR_IA32_LASTBRANCHFROMIP,		.always = false },
+	{ .index = MSR_IA32_PRED_CMD,			.always = false },
 	{ .index = MSR_IA32_LASTBRANCHTOIP,		.always = false },
 	{ .index = MSR_IA32_LASTINTFROMIP,		.always = false },
 	{ .index = MSR_IA32_LASTINTTOIP,		.always = false },
@@ -529,6 +530,7 @@ struct svm_cpu_data {
 	struct kvm_ldttss_desc *tss_desc;
 
 	struct page *save_area;
+	struct vmcb *current_vmcb;
 };
 
 static DEFINE_PER_CPU(struct svm_cpu_data *, svm_data);
@@ -1703,11 +1705,17 @@ static void svm_free_vcpu(struct kvm_vcpu *vcpu)
 	__free_pages(virt_to_page(svm->nested.msrpm), MSRPM_ALLOC_ORDER);
 	kvm_vcpu_uninit(vcpu);
 	kmem_cache_free(kvm_vcpu_cache, svm);
+	/*
+	 * The vmcb page can be recycled, causing a false negative in
+	 * svm_vcpu_load(). So do a full IBPB now.
+	 */
+	indirect_branch_prediction_barrier();
 }
 
 static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
+	struct svm_cpu_data *sd = per_cpu(svm_data, cpu);
 	int i;
 
 	if (unlikely(cpu != vcpu->cpu)) {
@@ -1736,6 +1744,10 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	if (static_cpu_has(X86_FEATURE_RDTSCP))
 		wrmsrl(MSR_TSC_AUX, svm->tsc_aux);
 
+	if (sd->current_vmcb != svm->vmcb) {
+		sd->current_vmcb = svm->vmcb;
+		indirect_branch_prediction_barrier();
+	}
 	avic_vcpu_load(vcpu, cpu);
 }
 
@@ -3684,6 +3696,22 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
 	case MSR_IA32_TSC:
 		kvm_write_tsc(vcpu, msr);
 		break;
+	case MSR_IA32_PRED_CMD:
+		if (!msr->host_initiated &&
+		    !guest_cpuid_has(vcpu, X86_FEATURE_IBPB))
+			return 1;
+
+		if (data & ~PRED_CMD_IBPB)
+			return 1;
+
+		if (!data)
+			break;
+
+		wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
+		if (is_guest_mode(vcpu))
+			break;
+		set_msr_interception(svm->msrpm, MSR_IA32_PRED_CMD, 0, 1);
+		break;
 	case MSR_STAR:
 		svm->vmcb->save.star = data;
 		break;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index d46a61b..c057a0a 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -592,6 +592,14 @@ struct vcpu_vmx {
 	u64 		      msr_host_kernel_gs_base;
 	u64 		      msr_guest_kernel_gs_base;
 #endif
+
+	/*
+	 * This indicates that:
+	 * 1) guest_cpuid_has(X86_FEATURE_IBPB) = true &&
+	 * 2) The guest has initiated a write against the MSR.
+	 */
+	bool pred_cmd_used;
+
 	u32 vm_entry_controls_shadow;
 	u32 vm_exit_controls_shadow;
 	u32 secondary_exec_control;
@@ -2285,6 +2293,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	if (per_cpu(current_vmcs, cpu) != vmx->loaded_vmcs->vmcs) {
 		per_cpu(current_vmcs, cpu) = vmx->loaded_vmcs->vmcs;
 		vmcs_load(vmx->loaded_vmcs->vmcs);
+		indirect_branch_prediction_barrier();
 	}
 
 	if (!already_loaded) {
@@ -3342,6 +3351,28 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_IA32_TSC:
 		kvm_write_tsc(vcpu, msr_info);
 		break;
+	case MSR_IA32_PRED_CMD:
+		if (!msr_info->host_initiated &&
+		    !guest_cpuid_has(vcpu, X86_FEATURE_IBPB) &&
+		    !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL))
+			return 1;
+
+		vmx->pred_cmd_used = true;
+
+		if (data & ~PRED_CMD_IBPB)
+			return 1;
+
+		if (!data)
+			break;
+
+		wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
+
+		if (is_guest_mode(vcpu))
+			break;
+
+		vmx_disable_intercept_for_msr(vmx->vmcs01.msr_bitmap, MSR_IA32_PRED_CMD,
+					      MSR_TYPE_W);
+		break;
 	case MSR_IA32_CR_PAT:
 		if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) {
 			if (!kvm_mtrr_valid(vcpu, MSR_IA32_CR_PAT, data))
@@ -10045,8 +10076,8 @@ static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu,
 	unsigned long *msr_bitmap_l1;
 	unsigned long *msr_bitmap_l0 = to_vmx(vcpu)->nested.vmcs02.msr_bitmap;
 
-	/* This shortcut is ok because we support only x2APIC MSRs so far. */
-	if (!nested_cpu_has_virt_x2apic_mode(vmcs12))
+	if (!nested_cpu_has_virt_x2apic_mode(vmcs12) &&
+	    !to_vmx(vcpu)->pred_cmd_used)
 		return false;
 
 	page = kvm_vcpu_gpa_to_page(vcpu, vmcs12->msr_bitmap);
@@ -10079,6 +10110,13 @@ static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu,
 				MSR_TYPE_W);
 		}
 	}
+
+	if (to_vmx(vcpu)->pred_cmd_used)
+		nested_vmx_disable_intercept_for_msr(
+					msr_bitmap_l1, msr_bitmap_l0,
+					MSR_IA32_PRED_CMD,
+					MSR_TYPE_W);
+
 	kunmap(page);
 	kvm_release_page_clean(page);
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 4/5] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
  2018-02-01  0:24           ` KarimAllah Ahmed
@ 2018-02-01  4:26               ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 56+ messages in thread
From: Konrad Rzeszutek Wilk @ 2018-02-01  4:26 UTC (permalink / raw)
  To: KarimAllah Ahmed
  Cc: Jim Mattson, KarimAllah Ahmed, kvm list, LKML,
	the arch/x86 maintainers, Asit Mallick, Arjan Van De Ven,
	Dave Hansen, Andi Kleen, Andrea Arcangeli, Linus Torvalds,
	Tim Chen, Thomas Gleixner, Dan Williams, Jun Nakajima,
	Paolo Bonzini, David Woodhouse, Greg KH, Andy Lutomirski,
	Ashok Raj

> >From 9c19a8ac3f021efba6f70ad7e28f7ad06bb97e43 Mon Sep 17 00:00:00 2001
> From: KarimAllah Ahmed <karahmed@amazon.de>
> Date: Mon, 29 Jan 2018 19:58:10 +0000
> Subject: [PATCH] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
> 
> [ Based on a patch from Ashok Raj <ashok.raj@intel.com> ]
> 
> Add direct access to MSR_IA32_SPEC_CTRL for guests. This is needed for
> guests that will only mitigate Spectre V2 through IBRS+IBPB and will not
> be using a retpoline+IBPB based approach.
> 
> To avoid the overhead of atomically saving and restoring the
> MSR_IA32_SPEC_CTRL for guests that do not actually use the MSR, only
> add_atomic_switch_msr when a non-zero is written to it.
  ^^^^^^^^^^^^^^^^^^^^^

That part of the comment does not seem to be in sync with the code.

> 
> No attempt is made to handle STIBP here, intentionally. Filtering STIBP
> may be added in a future patch, which may require trapping all writes
> if we don't want to pass it through directly to the guest.
> 
> [dwmw2: Clean up CPUID bits, save/restore manually, handle reset]
> 
> Cc: Asit Mallick <asit.k.mallick@intel.com>
> Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Andi Kleen <ak@linux.intel.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Tim Chen <tim.c.chen@linux.intel.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Jun Nakajima <jun.nakajima@intel.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: David Woodhouse <dwmw@amazon.co.uk>
> Cc: Greg KH <gregkh@linuxfoundation.org>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Ashok Raj <ashok.raj@intel.com>
> Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
> v6:
> - got rid of save_spec_ctrl_on_exit
> - introduce spec_ctrl_intercepted
> - introduce spec_ctrl_used
> v5:
> - Also check for X86_FEATURE_SPEC_CTRL for the msr reads/writes
> v4:
> - Add IBRS to kvm_cpuid_8000_0008_ebx_x86_features
> - Handling nested guests
> v3:
> - Save/restore manually
> - Fix CPUID handling
> - Fix a copy & paste error in the name of SPEC_CTRL MSR in
>   disable_intercept.
> - support !cpu_has_vmx_msr_bitmap()
> v2:
> - remove 'host_spec_ctrl' in favor of only a comment (dwmw@).
> - special case writing '0' in SPEC_CTRL to avoid confusing live-migration
>   when the instance never used the MSR (dwmw@).
> - depend on X86_FEATURE_IBRS instead of X86_FEATURE_SPEC_CTRL (dwmw@).
> - add MSR_IA32_SPEC_CTRL to the list of MSRs to save (dropped it by accident).
> ---
>  arch/x86/kvm/cpuid.c |  9 +++--
>  arch/x86/kvm/vmx.c   | 94 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>  arch/x86/kvm/x86.c   |  2 +-
>  3 files changed, 100 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 1909635..13f5d42 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -367,7 +367,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
>  
>  	/* cpuid 0x80000008.ebx */
>  	const u32 kvm_cpuid_8000_0008_ebx_x86_features =
> -		F(IBPB);
> +		F(IBPB) | F(IBRS);
>  
>  	/* cpuid 0xC0000001.edx */
>  	const u32 kvm_cpuid_C000_0001_edx_x86_features =
> @@ -394,7 +394,8 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
>  
>  	/* cpuid 7.0.edx*/
>  	const u32 kvm_cpuid_7_0_edx_x86_features =
> -		F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(ARCH_CAPABILITIES);
> +		F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(SPEC_CTRL) |
> +		F(ARCH_CAPABILITIES);
>  
>  	/* all calls to cpuid_count() should be made on the same cpu */
>  	get_cpu();
> @@ -630,9 +631,11 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
>  			g_phys_as = phys_as;
>  		entry->eax = g_phys_as | (virt_as << 8);
>  		entry->edx = 0;
> -		/* IBPB isn't necessarily present in hardware cpuid */
> +		/* IBRS and IBPB aren't necessarily present in hardware cpuid */
>  		if (boot_cpu_has(X86_FEATURE_IBPB))
>  			entry->ebx |= F(IBPB);
> +		if (boot_cpu_has(X86_FEATURE_IBRS))
> +			entry->ebx |= F(IBRS);
>  		entry->ebx &= kvm_cpuid_8000_0008_ebx_x86_features;
>  		cpuid_mask(&entry->ebx, CPUID_8000_0008_EBX);
>  		break;
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 6a9f4ec..bfc80ff 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -594,6 +594,14 @@ struct vcpu_vmx {
>  #endif
>  
>  	u64 		      arch_capabilities;
> +	u64 		      spec_ctrl;
> +
> +	/*
> +	 * This indicates that:
> +	 * 1) guest_cpuid_has(X86_FEATURE_IBRS) == true &&
> +	 * 2) The guest has actually initiated a write against the MSR.
> +	 */
> +	bool spec_ctrl_used;
>  
>  	/*
>  	 * This indicates that:
> @@ -946,6 +954,8 @@ static void vmx_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked);
>  static bool nested_vmx_is_page_fault_vmexit(struct vmcs12 *vmcs12,
>  					    u16 error_code);
>  static void vmx_update_msr_bitmap(struct kvm_vcpu *vcpu);
> +static void __always_inline vmx_disable_intercept_for_msr(unsigned long *msr_bitmap,
> +							  u32 msr, int type);
>  
>  static DEFINE_PER_CPU(struct vmcs *, vmxarea);
>  static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
> @@ -1917,6 +1927,22 @@ static void update_exception_bitmap(struct kvm_vcpu *vcpu)
>  	vmcs_write32(EXCEPTION_BITMAP, eb);
>  }
>  
> +/* Is SPEC_CTRL intercepted for the currently running vCPU? */
> +static bool spec_ctrl_intercepted(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long *msr_bitmap;
> +	int f = sizeof(unsigned long);
> +
> +	if (!cpu_has_vmx_msr_bitmap())
> +		return true;
> +
> +	msr_bitmap = is_guest_mode(vcpu) ?
> +			to_vmx(vcpu)->nested.vmcs02.msr_bitmap :
> +			to_vmx(vcpu)->vmcs01.msr_bitmap;
> +
> +	return !!test_bit(MSR_IA32_SPEC_CTRL, msr_bitmap + 0x800 / f);
> +}
> +
>  static void clear_atomic_switch_msr_special(struct vcpu_vmx *vmx,
>  		unsigned long entry, unsigned long exit)
>  {
> @@ -3246,6 +3272,14 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  	case MSR_IA32_TSC:
>  		msr_info->data = guest_read_tsc(vcpu);
>  		break;
> +	case MSR_IA32_SPEC_CTRL:
> +		if (!msr_info->host_initiated &&
> +		    !guest_cpuid_has(vcpu, X86_FEATURE_IBRS) &&
> +		    !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL))
> +			return 1;
> +
> +		msr_info->data = to_vmx(vcpu)->spec_ctrl;
> +		break;
>  	case MSR_IA32_ARCH_CAPABILITIES:
>  		if (!msr_info->host_initiated &&
>  		    !guest_cpuid_has(vcpu, X86_FEATURE_ARCH_CAPABILITIES))
> @@ -3359,6 +3393,34 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  	case MSR_IA32_TSC:
>  		kvm_write_tsc(vcpu, msr_info);
>  		break;
> +	case MSR_IA32_SPEC_CTRL:
> +		if (!msr_info->host_initiated &&
> +		    !guest_cpuid_has(vcpu, X86_FEATURE_IBRS) &&
> +		    !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL))
> +			return 1;
> +
> +		vmx->spec_ctrl_used = true;
> +
> +		/* The STIBP bit doesn't fault even if it's not advertised */
> +		if (data & ~(SPEC_CTRL_IBRS | SPEC_CTRL_STIBP))
> +			return 1;
> +
> +		vmx->spec_ctrl = data;
> +
> +		/*
> +		 * When it's written (to non-zero) for the first time, pass
> +		 * it through. This means we don't have to take the perf

.. But only if it is a nested guest (as you have && is_guest_mode).

Do you want to update the comment a bit?

> +		 * hit of saving it on vmexit for the common case of guests
> +		 * that don't use it.
> +		 */
> +		if (cpu_has_vmx_msr_bitmap() && data &&
> +		    spec_ctrl_intercepted(vcpu) &&
> +		    is_guest_mode(vcpu))
                    ^^^^^^^^^^^^^^^^^^ <=== here
> +			vmx_disable_intercept_for_msr(
> +					vmx->vmcs01.msr_bitmap,
> +					MSR_IA32_SPEC_CTRL,
> +					MSR_TYPE_RW);
> +		break;
>  	case MSR_IA32_PRED_CMD:
>  		if (!msr_info->host_initiated &&
>  		    !guest_cpuid_has(vcpu, X86_FEATURE_IBPB) &&

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 4/5] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
@ 2018-02-01  4:26               ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 56+ messages in thread
From: Konrad Rzeszutek Wilk @ 2018-02-01  4:26 UTC (permalink / raw)
  To: KarimAllah Ahmed
  Cc: Jim Mattson, KarimAllah Ahmed, kvm list, LKML,
	the arch/x86 maintainers, Asit Mallick, Arjan Van De Ven,
	Dave Hansen, Andi Kleen, Andrea Arcangeli, Linus Torvalds,
	Tim Chen, Thomas Gleixner, Dan Williams, Jun Nakajima,
	Paolo Bonzini, David Woodhouse, Greg KH, Andy Lutomirski, Ashok

> >From 9c19a8ac3f021efba6f70ad7e28f7ad06bb97e43 Mon Sep 17 00:00:00 2001
> From: KarimAllah Ahmed <karahmed@amazon.de>
> Date: Mon, 29 Jan 2018 19:58:10 +0000
> Subject: [PATCH] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
> 
> [ Based on a patch from Ashok Raj <ashok.raj@intel.com> ]
> 
> Add direct access to MSR_IA32_SPEC_CTRL for guests. This is needed for
> guests that will only mitigate Spectre V2 through IBRS+IBPB and will not
> be using a retpoline+IBPB based approach.
> 
> To avoid the overhead of atomically saving and restoring the
> MSR_IA32_SPEC_CTRL for guests that do not actually use the MSR, only
> add_atomic_switch_msr when a non-zero is written to it.
  ^^^^^^^^^^^^^^^^^^^^^

That part of the comment does not seem to be in sync with the code.

> 
> No attempt is made to handle STIBP here, intentionally. Filtering STIBP
> may be added in a future patch, which may require trapping all writes
> if we don't want to pass it through directly to the guest.
> 
> [dwmw2: Clean up CPUID bits, save/restore manually, handle reset]
> 
> Cc: Asit Mallick <asit.k.mallick@intel.com>
> Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Andi Kleen <ak@linux.intel.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Tim Chen <tim.c.chen@linux.intel.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Jun Nakajima <jun.nakajima@intel.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: David Woodhouse <dwmw@amazon.co.uk>
> Cc: Greg KH <gregkh@linuxfoundation.org>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Ashok Raj <ashok.raj@intel.com>
> Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
> v6:
> - got rid of save_spec_ctrl_on_exit
> - introduce spec_ctrl_intercepted
> - introduce spec_ctrl_used
> v5:
> - Also check for X86_FEATURE_SPEC_CTRL for the msr reads/writes
> v4:
> - Add IBRS to kvm_cpuid_8000_0008_ebx_x86_features
> - Handling nested guests
> v3:
> - Save/restore manually
> - Fix CPUID handling
> - Fix a copy & paste error in the name of SPEC_CTRL MSR in
>   disable_intercept.
> - support !cpu_has_vmx_msr_bitmap()
> v2:
> - remove 'host_spec_ctrl' in favor of only a comment (dwmw@).
> - special case writing '0' in SPEC_CTRL to avoid confusing live-migration
>   when the instance never used the MSR (dwmw@).
> - depend on X86_FEATURE_IBRS instead of X86_FEATURE_SPEC_CTRL (dwmw@).
> - add MSR_IA32_SPEC_CTRL to the list of MSRs to save (dropped it by accident).
> ---
>  arch/x86/kvm/cpuid.c |  9 +++--
>  arch/x86/kvm/vmx.c   | 94 +++++++++++++++++++++++++++++++++++++++++++++++++++-
>  arch/x86/kvm/x86.c   |  2 +-
>  3 files changed, 100 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 1909635..13f5d42 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -367,7 +367,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
>  
>  	/* cpuid 0x80000008.ebx */
>  	const u32 kvm_cpuid_8000_0008_ebx_x86_features =
> -		F(IBPB);
> +		F(IBPB) | F(IBRS);
>  
>  	/* cpuid 0xC0000001.edx */
>  	const u32 kvm_cpuid_C000_0001_edx_x86_features =
> @@ -394,7 +394,8 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
>  
>  	/* cpuid 7.0.edx*/
>  	const u32 kvm_cpuid_7_0_edx_x86_features =
> -		F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(ARCH_CAPABILITIES);
> +		F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(SPEC_CTRL) |
> +		F(ARCH_CAPABILITIES);
>  
>  	/* all calls to cpuid_count() should be made on the same cpu */
>  	get_cpu();
> @@ -630,9 +631,11 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
>  			g_phys_as = phys_as;
>  		entry->eax = g_phys_as | (virt_as << 8);
>  		entry->edx = 0;
> -		/* IBPB isn't necessarily present in hardware cpuid */
> +		/* IBRS and IBPB aren't necessarily present in hardware cpuid */
>  		if (boot_cpu_has(X86_FEATURE_IBPB))
>  			entry->ebx |= F(IBPB);
> +		if (boot_cpu_has(X86_FEATURE_IBRS))
> +			entry->ebx |= F(IBRS);
>  		entry->ebx &= kvm_cpuid_8000_0008_ebx_x86_features;
>  		cpuid_mask(&entry->ebx, CPUID_8000_0008_EBX);
>  		break;
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 6a9f4ec..bfc80ff 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -594,6 +594,14 @@ struct vcpu_vmx {
>  #endif
>  
>  	u64 		      arch_capabilities;
> +	u64 		      spec_ctrl;
> +
> +	/*
> +	 * This indicates that:
> +	 * 1) guest_cpuid_has(X86_FEATURE_IBRS) == true &&
> +	 * 2) The guest has actually initiated a write against the MSR.
> +	 */
> +	bool spec_ctrl_used;
>  
>  	/*
>  	 * This indicates that:
> @@ -946,6 +954,8 @@ static void vmx_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked);
>  static bool nested_vmx_is_page_fault_vmexit(struct vmcs12 *vmcs12,
>  					    u16 error_code);
>  static void vmx_update_msr_bitmap(struct kvm_vcpu *vcpu);
> +static void __always_inline vmx_disable_intercept_for_msr(unsigned long *msr_bitmap,
> +							  u32 msr, int type);
>  
>  static DEFINE_PER_CPU(struct vmcs *, vmxarea);
>  static DEFINE_PER_CPU(struct vmcs *, current_vmcs);
> @@ -1917,6 +1927,22 @@ static void update_exception_bitmap(struct kvm_vcpu *vcpu)
>  	vmcs_write32(EXCEPTION_BITMAP, eb);
>  }
>  
> +/* Is SPEC_CTRL intercepted for the currently running vCPU? */
> +static bool spec_ctrl_intercepted(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long *msr_bitmap;
> +	int f = sizeof(unsigned long);
> +
> +	if (!cpu_has_vmx_msr_bitmap())
> +		return true;
> +
> +	msr_bitmap = is_guest_mode(vcpu) ?
> +			to_vmx(vcpu)->nested.vmcs02.msr_bitmap :
> +			to_vmx(vcpu)->vmcs01.msr_bitmap;
> +
> +	return !!test_bit(MSR_IA32_SPEC_CTRL, msr_bitmap + 0x800 / f);
> +}
> +
>  static void clear_atomic_switch_msr_special(struct vcpu_vmx *vmx,
>  		unsigned long entry, unsigned long exit)
>  {
> @@ -3246,6 +3272,14 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  	case MSR_IA32_TSC:
>  		msr_info->data = guest_read_tsc(vcpu);
>  		break;
> +	case MSR_IA32_SPEC_CTRL:
> +		if (!msr_info->host_initiated &&
> +		    !guest_cpuid_has(vcpu, X86_FEATURE_IBRS) &&
> +		    !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL))
> +			return 1;
> +
> +		msr_info->data = to_vmx(vcpu)->spec_ctrl;
> +		break;
>  	case MSR_IA32_ARCH_CAPABILITIES:
>  		if (!msr_info->host_initiated &&
>  		    !guest_cpuid_has(vcpu, X86_FEATURE_ARCH_CAPABILITIES))
> @@ -3359,6 +3393,34 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  	case MSR_IA32_TSC:
>  		kvm_write_tsc(vcpu, msr_info);
>  		break;
> +	case MSR_IA32_SPEC_CTRL:
> +		if (!msr_info->host_initiated &&
> +		    !guest_cpuid_has(vcpu, X86_FEATURE_IBRS) &&
> +		    !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL))
> +			return 1;
> +
> +		vmx->spec_ctrl_used = true;
> +
> +		/* The STIBP bit doesn't fault even if it's not advertised */
> +		if (data & ~(SPEC_CTRL_IBRS | SPEC_CTRL_STIBP))
> +			return 1;
> +
> +		vmx->spec_ctrl = data;
> +
> +		/*
> +		 * When it's written (to non-zero) for the first time, pass
> +		 * it through. This means we don't have to take the perf

.. But only if it is a nested guest (as you have && is_guest_mode).

Do you want to update the comment a bit?

> +		 * hit of saving it on vmexit for the common case of guests
> +		 * that don't use it.
> +		 */
> +		if (cpu_has_vmx_msr_bitmap() && data &&
> +		    spec_ctrl_intercepted(vcpu) &&
> +		    is_guest_mode(vcpu))
                    ^^^^^^^^^^^^^^^^^^ <=== here
> +			vmx_disable_intercept_for_msr(
> +					vmx->vmcs01.msr_bitmap,
> +					MSR_IA32_SPEC_CTRL,
> +					MSR_TYPE_RW);
> +		break;
>  	case MSR_IA32_PRED_CMD:
>  		if (!msr_info->host_initiated &&
>  		    !guest_cpuid_has(vcpu, X86_FEATURE_IBPB) &&

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 2/5] KVM: x86: Add IBPB support
  2018-01-31 19:37 ` [PATCH v5 2/5] KVM: x86: Add IBPB support KarimAllah Ahmed
  2018-01-31 19:45   ` Jim Mattson
  2018-01-31 20:28   ` Konrad Rzeszutek Wilk
@ 2018-02-01  4:54   ` Tom Lendacky
  2018-02-01 17:00   ` Raj, Ashok
  3 siblings, 0 replies; 56+ messages in thread
From: Tom Lendacky @ 2018-02-01  4:54 UTC (permalink / raw)
  To: KarimAllah Ahmed, kvm, linux-kernel, x86
  Cc: Ashok Raj, Asit Mallick, Dave Hansen, Arjan Van De Ven, Tim Chen,
	Linus Torvalds, Andrea Arcangeli, Andi Kleen, Thomas Gleixner,
	Dan Williams, Jun Nakajima, Andy Lutomirski, Greg KH,
	Paolo Bonzini, Peter Zijlstra, David Woodhouse

On 1/31/2018 1:37 PM, KarimAllah Ahmed wrote:
> From: Ashok Raj <ashok.raj@intel.com>
> 
> The Indirect Branch Predictor Barrier (IBPB) is an indirect branch
> control mechanism. It keeps earlier branches from influencing
> later ones.
> 
> Unlike IBRS and STIBP, IBPB does not define a new mode of operation.
> It's a command that ensures predicted branch targets aren't used after
> the barrier. Although IBRS and IBPB are enumerated by the same CPUID
> enumeration, IBPB is very different.
> 
> IBPB helps mitigate against three potential attacks:
> 
> * Mitigate guests from being attacked by other guests.
>   - This is addressed by issing IBPB when we do a guest switch.
> 
> * Mitigate attacks from guest/ring3->host/ring3.
>   These would require a IBPB during context switch in host, or after
>   VMEXIT. The host process has two ways to mitigate
>   - Either it can be compiled with retpoline
>   - If its going through context switch, and has set !dumpable then
>     there is a IBPB in that path.
>     (Tim's patch: https://patchwork.kernel.org/patch/10192871)
>   - The case where after a VMEXIT you return back to Qemu might make
>     Qemu attackable from guest when Qemu isn't compiled with retpoline.
>   There are issues reported when doing IBPB on every VMEXIT that resulted
>   in some tsc calibration woes in guest.
> 
> * Mitigate guest/ring0->host/ring0 attacks.
>   When host kernel is using retpoline it is safe against these attacks.
>   If host kernel isn't using retpoline we might need to do a IBPB flush on
>   every VMEXIT.
> 
> Even when using retpoline for indirect calls, in certain conditions 'ret'
> can use the BTB on Skylake-era CPUs. There are other mitigations
> available like RSB stuffing/clearing.
> 
> * IBPB is issued only for SVM during svm_free_vcpu().
>   VMX has a vmclear and SVM doesn't.  Follow discussion here:
>   https://lkml.org/lkml/2018/1/15/146
> 
> Please refer to the following spec for more details on the enumeration
> and control.
> 
> Refer here to get documentation about mitigations.
> 
> https://software.intel.com/en-us/side-channel-security-support
> 
> [peterz: rebase and changelog rewrite]
> [karahmed: - rebase
>            - vmx: expose PRED_CMD if guest has it in CPUID
>            - svm: only pass through IBPB if guest has it in CPUID
>            - vmx: support !cpu_has_vmx_msr_bitmap()]
>            - vmx: support nested]
> [dwmw2: Expose CPUID bit too (AMD IBPB only for now as we lack IBRS)
>         PRED_CMD is a write-only MSR]
> 
> Cc: Asit Mallick <asit.k.mallick@intel.com>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com>
> Cc: Tim Chen <tim.c.chen@linux.intel.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Andi Kleen <ak@linux.intel.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Jun Nakajima <jun.nakajima@intel.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Greg KH <gregkh@linuxfoundation.org>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Link: http://lkml.kernel.org/r/1515720739-43819-6-git-send-email-ashok.raj@intel.com
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
> 
> v5:
> - Use MSR_TYPE_W instead of MSR_TYPE_R for the MSR.
> - Always merge the bitmaps unconditionally.
> - Add PRED_CMD to direct_access_msrs.
> - Also check for X86_FEATURE_SPEC_CTRL for the msr reads/writes
> - rewrite the commit message (from ashok.raj@)
> ---
>  arch/x86/kvm/cpuid.c | 11 ++++++++++-
>  arch/x86/kvm/svm.c   | 28 ++++++++++++++++++++++++++++
>  arch/x86/kvm/vmx.c   | 29 +++++++++++++++++++++++++----
>  3 files changed, 63 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index c0eb337..033004d 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -365,6 +365,10 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
>  		F(3DNOWPREFETCH) | F(OSVW) | 0 /* IBS */ | F(XOP) |
>  		0 /* SKINIT, WDT, LWP */ | F(FMA4) | F(TBM);
>  
> +	/* cpuid 0x80000008.ebx */
> +	const u32 kvm_cpuid_8000_0008_ebx_x86_features =
> +		F(IBPB);
> +
>  	/* cpuid 0xC0000001.edx */
>  	const u32 kvm_cpuid_C000_0001_edx_x86_features =
>  		F(XSTORE) | F(XSTORE_EN) | F(XCRYPT) | F(XCRYPT_EN) |
> @@ -625,7 +629,12 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
>  		if (!g_phys_as)
>  			g_phys_as = phys_as;
>  		entry->eax = g_phys_as | (virt_as << 8);
> -		entry->ebx = entry->edx = 0;
> +		entry->edx = 0;
> +		/* IBPB isn't necessarily present in hardware cpuid */
> +		if (boot_cpu_has(X86_FEATURE_IBPB))
> +			entry->ebx |= F(IBPB);
> +		entry->ebx &= kvm_cpuid_8000_0008_ebx_x86_features;
> +		cpuid_mask(&entry->ebx, CPUID_8000_0008_EBX);
>  		break;
>  	}
>  	case 0x80000019:
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index f40d0da..bfbb7b9 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -250,6 +250,7 @@ static const struct svm_direct_access_msrs {
>  	{ .index = MSR_SYSCALL_MASK,			.always = true  },
>  #endif
>  	{ .index = MSR_IA32_LASTBRANCHFROMIP,		.always = false },
> +	{ .index = MSR_IA32_PRED_CMD,			.always = false },

Just a nit, but could you not split up the two LASTBRANCH related MSRs and
instead put this either before the LASTBRANCHFROMIP or at the end of the
array?  (Ditto for SPEC_CTRL in patch 5).

Thanks,
Tom

>  	{ .index = MSR_IA32_LASTBRANCHTOIP,		.always = false },
>  	{ .index = MSR_IA32_LASTINTFROMIP,		.always = false },
>  	{ .index = MSR_IA32_LASTINTTOIP,		.always = false },
> @@ -529,6 +530,7 @@ struct svm_cpu_data {
>  	struct kvm_ldttss_desc *tss_desc;
>  
>  	struct page *save_area;
> +	struct vmcb *current_vmcb;
>  };
>  
>  static DEFINE_PER_CPU(struct svm_cpu_data *, svm_data);
> @@ -1703,11 +1705,17 @@ static void svm_free_vcpu(struct kvm_vcpu *vcpu)
>  	__free_pages(virt_to_page(svm->nested.msrpm), MSRPM_ALLOC_ORDER);
>  	kvm_vcpu_uninit(vcpu);
>  	kmem_cache_free(kvm_vcpu_cache, svm);
> +	/*
> +	 * The vmcb page can be recycled, causing a false negative in
> +	 * svm_vcpu_load(). So do a full IBPB now.
> +	 */
> +	indirect_branch_prediction_barrier();
>  }
>  
>  static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>  {
>  	struct vcpu_svm *svm = to_svm(vcpu);
> +	struct svm_cpu_data *sd = per_cpu(svm_data, cpu);
>  	int i;
>  
>  	if (unlikely(cpu != vcpu->cpu)) {
> @@ -1736,6 +1744,10 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>  	if (static_cpu_has(X86_FEATURE_RDTSCP))
>  		wrmsrl(MSR_TSC_AUX, svm->tsc_aux);
>  
> +	if (sd->current_vmcb != svm->vmcb) {
> +		sd->current_vmcb = svm->vmcb;
> +		indirect_branch_prediction_barrier();
> +	}
>  	avic_vcpu_load(vcpu, cpu);
>  }
>  
> @@ -3684,6 +3696,22 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
>  	case MSR_IA32_TSC:
>  		kvm_write_tsc(vcpu, msr);
>  		break;
> +	case MSR_IA32_PRED_CMD:
> +		if (!msr->host_initiated &&
> +		    !guest_cpuid_has(vcpu, X86_FEATURE_IBPB))
> +			return 1;
> +
> +		if (data & ~PRED_CMD_IBPB)
> +			return 1;
> +
> +		if (!data)
> +			break;
> +
> +		wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
> +		if (is_guest_mode(vcpu))
> +			break;
> +		set_msr_interception(svm->msrpm, MSR_IA32_PRED_CMD, 0, 1);
> +		break;
>  	case MSR_STAR:
>  		svm->vmcb->save.star = data;
>  		break;
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index d46a61b..2e4e8af 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -2285,6 +2285,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>  	if (per_cpu(current_vmcs, cpu) != vmx->loaded_vmcs->vmcs) {
>  		per_cpu(current_vmcs, cpu) = vmx->loaded_vmcs->vmcs;
>  		vmcs_load(vmx->loaded_vmcs->vmcs);
> +		indirect_branch_prediction_barrier();
>  	}
>  
>  	if (!already_loaded) {
> @@ -3342,6 +3343,26 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  	case MSR_IA32_TSC:
>  		kvm_write_tsc(vcpu, msr_info);
>  		break;
> +	case MSR_IA32_PRED_CMD:
> +		if (!msr_info->host_initiated &&
> +		    !guest_cpuid_has(vcpu, X86_FEATURE_IBPB) &&
> +		    !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL))
> +			return 1;
> +
> +		if (data & ~PRED_CMD_IBPB)
> +			return 1;
> +
> +		if (!data)
> +			break;
> +
> +		wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
> +
> +		if (is_guest_mode(vcpu))
> +			break;
> +
> +		vmx_disable_intercept_for_msr(vmx->vmcs01.msr_bitmap, MSR_IA32_PRED_CMD,
> +					      MSR_TYPE_W);
> +		break;
>  	case MSR_IA32_CR_PAT:
>  		if (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PAT) {
>  			if (!kvm_mtrr_valid(vcpu, MSR_IA32_CR_PAT, data))
> @@ -10045,10 +10066,6 @@ static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu,
>  	unsigned long *msr_bitmap_l1;
>  	unsigned long *msr_bitmap_l0 = to_vmx(vcpu)->nested.vmcs02.msr_bitmap;
>  
> -	/* This shortcut is ok because we support only x2APIC MSRs so far. */
> -	if (!nested_cpu_has_virt_x2apic_mode(vmcs12))
> -		return false;
> -
>  	page = kvm_vcpu_gpa_to_page(vcpu, vmcs12->msr_bitmap);
>  	if (is_error_page(page))
>  		return false;
> @@ -10056,6 +10073,10 @@ static inline bool nested_vmx_merge_msr_bitmap(struct kvm_vcpu *vcpu,
>  
>  	memset(msr_bitmap_l0, 0xff, PAGE_SIZE);
>  
> +	nested_vmx_disable_intercept_for_msr(msr_bitmap_l1, msr_bitmap_l0,
> +					     MSR_IA32_PRED_CMD,
> +					     MSR_TYPE_W);
> +
>  	if (nested_cpu_has_virt_x2apic_mode(vmcs12)) {
>  		if (nested_cpu_has_apic_reg_virt(vmcs12))
>  			for (msr = 0x800; msr <= 0x8ff; msr++)
> 

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 4/5] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
  2018-02-01  4:26               ` Konrad Rzeszutek Wilk
@ 2018-02-01 13:25                 ` David Woodhouse
  -1 siblings, 0 replies; 56+ messages in thread
From: David Woodhouse @ 2018-02-01 13:25 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, KarimAllah Ahmed
  Cc: Jim Mattson, KarimAllah Ahmed, kvm list, LKML,
	the arch/x86 maintainers, Asit Mallick, Arjan Van De Ven,
	Dave Hansen, Andi Kleen, Andrea Arcangeli, Linus Torvalds,
	Tim Chen, Thomas Gleixner, Dan Williams, Jun Nakajima,
	Paolo Bonzini, Greg KH, Andy Lutomirski, Ashok Raj

[-- Attachment #1: Type: text/plain, Size: 1537 bytes --]

On Wed, 2018-01-31 at 23:26 -0500, Konrad Rzeszutek Wilk wrote:
> 
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index 6a9f4ec..bfc80ff 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -594,6 +594,14 @@ struct vcpu_vmx {
> >  #endif
> >  
> >       u64                   arch_capabilities;
> > +     u64                   spec_ctrl;
> > +
> > +     /*
> > +      * This indicates that:
> > +      * 1) guest_cpuid_has(X86_FEATURE_IBRS) == true &&
> > +      * 2) The guest has actually initiated a write against the MSR.
> > +      */
> > +     bool spec_ctrl_used;
> >  
> >       /*
> >        * This indicates that:

Thanks for persisting with the details here, Karim. In addition to
Konrad's heckling at the comments, I'll add my own request to his...

I'd like the comment for spec_ctrl_used to explain why it isn't
entirely redundant with the spec_ctrl_intercepted() function.

Without nesting, I believe it *would* be redundant, but the difference
comes when an L2 is running for which L1 has not permitted the MSR to
be passed through. That's when we have spec_ctrl_used = true but the
MSR *isn't* actually passed through in the active msr_bitmap.

Question: if spec_ctrl_used is always equivalent to the intercept bit
in the vmcs01.msr_bitmap, just not the guest bitmap... should we ditch
it and always use the bit from the vmcs01.msr_bitmap?

Sorry :)

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 4/5] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
@ 2018-02-01 13:25                 ` David Woodhouse
  0 siblings, 0 replies; 56+ messages in thread
From: David Woodhouse @ 2018-02-01 13:25 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, KarimAllah Ahmed
  Cc: Jim Mattson, KarimAllah Ahmed, kvm list, LKML,
	the arch/x86 maintainers, Asit Mallick, Arjan Van De Ven,
	Dave Hansen, Andi Kleen, Andrea Arcangeli, Linus Torvalds,
	Tim Chen, Thomas Gleixner, Dan Williams, Jun Nakajima,
	Paolo Bonzini, Greg KH, Andy Lutomirski, Ashok Raj

[-- Attachment #1: Type: text/plain, Size: 1537 bytes --]

On Wed, 2018-01-31 at 23:26 -0500, Konrad Rzeszutek Wilk wrote:
> 
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index 6a9f4ec..bfc80ff 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -594,6 +594,14 @@ struct vcpu_vmx {
> >  #endif
> >  
> >       u64                   arch_capabilities;
> > +     u64                   spec_ctrl;
> > +
> > +     /*
> > +      * This indicates that:
> > +      * 1) guest_cpuid_has(X86_FEATURE_IBRS) == true &&
> > +      * 2) The guest has actually initiated a write against the MSR.
> > +      */
> > +     bool spec_ctrl_used;
> >  
> >       /*
> >        * This indicates that:

Thanks for persisting with the details here, Karim. In addition to
Konrad's heckling at the comments, I'll add my own request to his...

I'd like the comment for spec_ctrl_used to explain why it isn't
entirely redundant with the spec_ctrl_intercepted() function.

Without nesting, I believe it *would* be redundant, but the difference
comes when an L2 is running for which L1 has not permitted the MSR to
be passed through. That's when we have spec_ctrl_used = true but the
MSR *isn't* actually passed through in the active msr_bitmap.

Question: if spec_ctrl_used is always equivalent to the intercept bit
in the vmcs01.msr_bitmap, just not the guest bitmap... should we ditch
it and always use the bit from the vmcs01.msr_bitmap?

Sorry :)

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5213 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 4/5] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
  2018-01-31 21:59                     ` David Woodhouse
  2018-01-31 22:06                       ` Jim Mattson
@ 2018-02-01 14:09                       ` Paolo Bonzini
  1 sibling, 0 replies; 56+ messages in thread
From: Paolo Bonzini @ 2018-02-01 14:09 UTC (permalink / raw)
  To: David Woodhouse, Jim Mattson
  Cc: KarimAllah Ahmed, KarimAllah Ahmed, kvm list, LKML,
	the arch/x86 maintainers, Asit Mallick, Arjan Van De Ven,
	Dave Hansen, Andi Kleen, Andrea Arcangeli, Linus Torvalds,
	Tim Chen, Thomas Gleixner, Dan Williams, Jun Nakajima, Greg KH,
	Andy Lutomirski, Ashok Raj

On 31/01/2018 16:59, David Woodhouse wrote:
> 
> 
> On Wed, 2018-01-31 at 13:53 -0800, Jim Mattson wrote:
>> On Wed, Jan 31, 2018 at 1:42 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>
>>> Can we just say it sucks to be L2 too? :)  Because in the end as long as
>>> no one ever writes to spec_ctrl, everybody is happy.
>>
>> Unfortunately, quite a few OS vendors shipped IBRS-based mitigations
>> earlier this month. (Has Redhat stopped writing to IA32_SPEC_CTRL yet?
>> :-)
>>
>> And in the long run, everyone is going to set IA32_SPEC_CTRL.IBRS=1 on
>> CPUs with IA32_ARCH_CAPABILITIES.IBRS_ALL.
> 
> 
> I'm actually working on IBRS_ALL at the moment.
> 
> I was tempted to *not* let the guests turn it off. Expose SPEC_CTRL but
> just make it a no-op.

That would be very slow.

> Or if that really doesn't fly, perhaps with IBRS_ALL we should invert
> the logic. Set IBRS to 1 on vCPU reset, and only if it's set to *zero*
> do we pass through the MSR and set the save_spec_ctrl_on_exit flag.

... but something like that would be a good idea.  Even if IBRS to 0 on
vCPU reset, only pass it through once it's set to zero.  The first
IBRS=1 write would not enable pass through, and would not set the
save_spec_ctrl_on_exit flag.  In fact it need not even be conditional on
IBRS_ALL.

Paolo

> But let's get the code for *current* hardware done first...
> 

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 4/5] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
  2018-02-01  4:26               ` Konrad Rzeszutek Wilk
@ 2018-02-01 14:19                 ` Konrad Rzeszutek Wilk
  -1 siblings, 0 replies; 56+ messages in thread
From: Konrad Rzeszutek Wilk @ 2018-02-01 14:19 UTC (permalink / raw)
  To: KarimAllah Ahmed
  Cc: Jim Mattson, KarimAllah Ahmed, kvm list, LKML,
	the arch/x86 maintainers, Asit Mallick, Arjan Van De Ven,
	Dave Hansen, Andi Kleen, Andrea Arcangeli, Linus Torvalds,
	Tim Chen, Thomas Gleixner, Dan Williams, Jun Nakajima,
	Paolo Bonzini, David Woodhouse, Greg KH, Andy Lutomirski,
	Ashok Raj

.snip..
> > +/* Is SPEC_CTRL intercepted for the currently running vCPU? */
> > +static bool spec_ctrl_intercepted(struct kvm_vcpu *vcpu)
> > +{
> > +	unsigned long *msr_bitmap;
> > +	int f = sizeof(unsigned long);
> > +
> > +	if (!cpu_has_vmx_msr_bitmap())
> > +		return true;
> > +
> > +	msr_bitmap = is_guest_mode(vcpu) ?
> > +			to_vmx(vcpu)->nested.vmcs02.msr_bitmap :
> > +			to_vmx(vcpu)->vmcs01.msr_bitmap;
> > +
> > +	return !!test_bit(MSR_IA32_SPEC_CTRL, msr_bitmap + 0x800 / f);
> > +}
> > +
..snip..
> > @@ -3359,6 +3393,34 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> >  	case MSR_IA32_TSC:
> >  		kvm_write_tsc(vcpu, msr_info);
> >  		break;
> > +	case MSR_IA32_SPEC_CTRL:
> > +		if (!msr_info->host_initiated &&
> > +		    !guest_cpuid_has(vcpu, X86_FEATURE_IBRS) &&
> > +		    !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL))
> > +			return 1;
> > +
> > +		vmx->spec_ctrl_used = true;
> > +
> > +		/* The STIBP bit doesn't fault even if it's not advertised */
> > +		if (data & ~(SPEC_CTRL_IBRS | SPEC_CTRL_STIBP))
> > +			return 1;
> > +
> > +		vmx->spec_ctrl = data;
> > +
> > +		/*
> > +		 * When it's written (to non-zero) for the first time, pass
> > +		 * it through. This means we don't have to take the perf
> 
> .. But only if it is a nested guest (as you have && is_guest_mode).
> 
> Do you want to update the comment a bit?
> 
> > +		 * hit of saving it on vmexit for the common case of guests
> > +		 * that don't use it.
> > +		 */
> > +		if (cpu_has_vmx_msr_bitmap() && data &&
> > +		    spec_ctrl_intercepted(vcpu) &&
> > +		    is_guest_mode(vcpu))
>                     ^^^^^^^^^^^^^^^^^^ <=== here

Would it be perhaps also good to mention the complexity of how
we ought to be handling L1 and L2 guests in the commit?

We are all stressed and I am sure some of us haven't gotten much
sleep - but it can help in say three months when some unluckly new
soul is trying to understand this and gets utterly confused.

> > +			vmx_disable_intercept_for_msr(
> > +					vmx->vmcs01.msr_bitmap,
> > +					MSR_IA32_SPEC_CTRL,
> > +					MSR_TYPE_RW);
> > +		break;
> >  	case MSR_IA32_PRED_CMD:
> >  		if (!msr_info->host_initiated &&
> >  		    !guest_cpuid_has(vcpu, X86_FEATURE_IBPB) &&

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 4/5] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
@ 2018-02-01 14:19                 ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 56+ messages in thread
From: Konrad Rzeszutek Wilk @ 2018-02-01 14:19 UTC (permalink / raw)
  To: KarimAllah Ahmed
  Cc: Jim Mattson, KarimAllah Ahmed, kvm list, LKML,
	the arch/x86 maintainers, Asit Mallick, Arjan Van De Ven,
	Dave Hansen, Andi Kleen, Andrea Arcangeli, Linus Torvalds,
	Tim Chen, Thomas Gleixner, Dan Williams, Jun Nakajima,
	Paolo Bonzini, David Woodhouse, Greg KH, Andy Lutomirski, Ashok

.snip..
> > +/* Is SPEC_CTRL intercepted for the currently running vCPU? */
> > +static bool spec_ctrl_intercepted(struct kvm_vcpu *vcpu)
> > +{
> > +	unsigned long *msr_bitmap;
> > +	int f = sizeof(unsigned long);
> > +
> > +	if (!cpu_has_vmx_msr_bitmap())
> > +		return true;
> > +
> > +	msr_bitmap = is_guest_mode(vcpu) ?
> > +			to_vmx(vcpu)->nested.vmcs02.msr_bitmap :
> > +			to_vmx(vcpu)->vmcs01.msr_bitmap;
> > +
> > +	return !!test_bit(MSR_IA32_SPEC_CTRL, msr_bitmap + 0x800 / f);
> > +}
> > +
..snip..
> > @@ -3359,6 +3393,34 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> >  	case MSR_IA32_TSC:
> >  		kvm_write_tsc(vcpu, msr_info);
> >  		break;
> > +	case MSR_IA32_SPEC_CTRL:
> > +		if (!msr_info->host_initiated &&
> > +		    !guest_cpuid_has(vcpu, X86_FEATURE_IBRS) &&
> > +		    !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL))
> > +			return 1;
> > +
> > +		vmx->spec_ctrl_used = true;
> > +
> > +		/* The STIBP bit doesn't fault even if it's not advertised */
> > +		if (data & ~(SPEC_CTRL_IBRS | SPEC_CTRL_STIBP))
> > +			return 1;
> > +
> > +		vmx->spec_ctrl = data;
> > +
> > +		/*
> > +		 * When it's written (to non-zero) for the first time, pass
> > +		 * it through. This means we don't have to take the perf
> 
> .. But only if it is a nested guest (as you have && is_guest_mode).
> 
> Do you want to update the comment a bit?
> 
> > +		 * hit of saving it on vmexit for the common case of guests
> > +		 * that don't use it.
> > +		 */
> > +		if (cpu_has_vmx_msr_bitmap() && data &&
> > +		    spec_ctrl_intercepted(vcpu) &&
> > +		    is_guest_mode(vcpu))
>                     ^^^^^^^^^^^^^^^^^^ <=== here

Would it be perhaps also good to mention the complexity of how
we ought to be handling L1 and L2 guests in the commit?

We are all stressed and I am sure some of us haven't gotten much
sleep - but it can help in say three months when some unluckly new
soul is trying to understand this and gets utterly confused.

> > +			vmx_disable_intercept_for_msr(
> > +					vmx->vmcs01.msr_bitmap,
> > +					MSR_IA32_SPEC_CTRL,
> > +					MSR_TYPE_RW);
> > +		break;
> >  	case MSR_IA32_PRED_CMD:
> >  		if (!msr_info->host_initiated &&
> >  		    !guest_cpuid_has(vcpu, X86_FEATURE_IBPB) &&

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 4/5] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
  2018-02-01 14:19                 ` Konrad Rzeszutek Wilk
@ 2018-02-01 14:28                   ` KarimAllah Ahmed
  -1 siblings, 0 replies; 56+ messages in thread
From: KarimAllah Ahmed @ 2018-02-01 14:28 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Jim Mattson, KarimAllah Ahmed, kvm list, LKML,
	the arch/x86 maintainers, Asit Mallick, Arjan Van De Ven,
	Dave Hansen, Andi Kleen, Andrea Arcangeli, Linus Torvalds,
	Tim Chen, Thomas Gleixner, Dan Williams, Jun Nakajima,
	Paolo Bonzini, David Woodhouse, Greg KH, Andy Lutomirski,
	Ashok Raj

On 02/01/2018 03:19 PM, Konrad Rzeszutek Wilk wrote:
> .snip..
>>> +/* Is SPEC_CTRL intercepted for the currently running vCPU? */
>>> +static bool spec_ctrl_intercepted(struct kvm_vcpu *vcpu)
>>> +{
>>> +	unsigned long *msr_bitmap;
>>> +	int f = sizeof(unsigned long);
>>> +
>>> +	if (!cpu_has_vmx_msr_bitmap())
>>> +		return true;
>>> +
>>> +	msr_bitmap = is_guest_mode(vcpu) ?
>>> +			to_vmx(vcpu)->nested.vmcs02.msr_bitmap :
>>> +			to_vmx(vcpu)->vmcs01.msr_bitmap;
>>> +
>>> +	return !!test_bit(MSR_IA32_SPEC_CTRL, msr_bitmap + 0x800 / f);
>>> +}
>>> +
> ..snip..
>>> @@ -3359,6 +3393,34 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>>>   	case MSR_IA32_TSC:
>>>   		kvm_write_tsc(vcpu, msr_info);
>>>   		break;
>>> +	case MSR_IA32_SPEC_CTRL:
>>> +		if (!msr_info->host_initiated &&
>>> +		    !guest_cpuid_has(vcpu, X86_FEATURE_IBRS) &&
>>> +		    !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL))
>>> +			return 1;
>>> +
>>> +		vmx->spec_ctrl_used = true;
>>> +
>>> +		/* The STIBP bit doesn't fault even if it's not advertised */
>>> +		if (data & ~(SPEC_CTRL_IBRS | SPEC_CTRL_STIBP))
>>> +			return 1;
>>> +
>>> +		vmx->spec_ctrl = data;
>>> +
>>> +		/*
>>> +		 * When it's written (to non-zero) for the first time, pass
>>> +		 * it through. This means we don't have to take the perf
>>
>> .. But only if it is a nested guest (as you have && is_guest_mode).
>>
>> Do you want to update the comment a bit?
>>
>>> +		 * hit of saving it on vmexit for the common case of guests
>>> +		 * that don't use it.
>>> +		 */
>>> +		if (cpu_has_vmx_msr_bitmap() && data &&
>>> +		    spec_ctrl_intercepted(vcpu) &&
>>> +		    is_guest_mode(vcpu))
>>                      ^^^^^^^^^^^^^^^^^^ <=== here
> 
> Would it be perhaps also good to mention the complexity of how
> we ought to be handling L1 and L2 guests in the commit?
> 
> We are all stressed and I am sure some of us haven't gotten much
> sleep - but it can help in say three months when some unluckly new
> soul is trying to understand this and gets utterly confused.

Yup, I will go through the patches and add as much details as possible.

And yes, the is_guest_mode(vcpu) here is inverted :D I blame the late
night :)

> 
>>> +			vmx_disable_intercept_for_msr(
>>> +					vmx->vmcs01.msr_bitmap,
>>> +					MSR_IA32_SPEC_CTRL,
>>> +					MSR_TYPE_RW);
>>> +		break;
>>>   	case MSR_IA32_PRED_CMD:
>>>   		if (!msr_info->host_initiated &&
>>>   		    !guest_cpuid_has(vcpu, X86_FEATURE_IBPB) &&
> 
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 4/5] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
@ 2018-02-01 14:28                   ` KarimAllah Ahmed
  0 siblings, 0 replies; 56+ messages in thread
From: KarimAllah Ahmed @ 2018-02-01 14:28 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Jim Mattson, KarimAllah Ahmed, kvm list, LKML,
	the arch/x86 maintainers, Asit Mallick, Arjan Van De Ven,
	Dave Hansen, Andi Kleen, Andrea Arcangeli, Linus Torvalds,
	Tim Chen, Thomas Gleixner, Dan Williams, Jun Nakajima,
	Paolo Bonzini, David Woodhouse, Greg KH, Andy Lutomirski, As

On 02/01/2018 03:19 PM, Konrad Rzeszutek Wilk wrote:
> .snip..
>>> +/* Is SPEC_CTRL intercepted for the currently running vCPU? */
>>> +static bool spec_ctrl_intercepted(struct kvm_vcpu *vcpu)
>>> +{
>>> +	unsigned long *msr_bitmap;
>>> +	int f = sizeof(unsigned long);
>>> +
>>> +	if (!cpu_has_vmx_msr_bitmap())
>>> +		return true;
>>> +
>>> +	msr_bitmap = is_guest_mode(vcpu) ?
>>> +			to_vmx(vcpu)->nested.vmcs02.msr_bitmap :
>>> +			to_vmx(vcpu)->vmcs01.msr_bitmap;
>>> +
>>> +	return !!test_bit(MSR_IA32_SPEC_CTRL, msr_bitmap + 0x800 / f);
>>> +}
>>> +
> ..snip..
>>> @@ -3359,6 +3393,34 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>>>   	case MSR_IA32_TSC:
>>>   		kvm_write_tsc(vcpu, msr_info);
>>>   		break;
>>> +	case MSR_IA32_SPEC_CTRL:
>>> +		if (!msr_info->host_initiated &&
>>> +		    !guest_cpuid_has(vcpu, X86_FEATURE_IBRS) &&
>>> +		    !guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL))
>>> +			return 1;
>>> +
>>> +		vmx->spec_ctrl_used = true;
>>> +
>>> +		/* The STIBP bit doesn't fault even if it's not advertised */
>>> +		if (data & ~(SPEC_CTRL_IBRS | SPEC_CTRL_STIBP))
>>> +			return 1;
>>> +
>>> +		vmx->spec_ctrl = data;
>>> +
>>> +		/*
>>> +		 * When it's written (to non-zero) for the first time, pass
>>> +		 * it through. This means we don't have to take the perf
>>
>> .. But only if it is a nested guest (as you have && is_guest_mode).
>>
>> Do you want to update the comment a bit?
>>
>>> +		 * hit of saving it on vmexit for the common case of guests
>>> +		 * that don't use it.
>>> +		 */
>>> +		if (cpu_has_vmx_msr_bitmap() && data &&
>>> +		    spec_ctrl_intercepted(vcpu) &&
>>> +		    is_guest_mode(vcpu))
>>                      ^^^^^^^^^^^^^^^^^^ <=== here
> 
> Would it be perhaps also good to mention the complexity of how
> we ought to be handling L1 and L2 guests in the commit?
> 
> We are all stressed and I am sure some of us haven't gotten much
> sleep - but it can help in say three months when some unluckly new
> soul is trying to understand this and gets utterly confused.

Yup, I will go through the patches and add as much details as possible.

And yes, the is_guest_mode(vcpu) here is inverted :D I blame the late
night :)

> 
>>> +			vmx_disable_intercept_for_msr(
>>> +					vmx->vmcs01.msr_bitmap,
>>> +					MSR_IA32_SPEC_CTRL,
>>> +					MSR_TYPE_RW);
>>> +		break;
>>>   	case MSR_IA32_PRED_CMD:
>>>   		if (!msr_info->host_initiated &&
>>>   		    !guest_cpuid_has(vcpu, X86_FEATURE_IBPB) &&
> 
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 2/5] KVM: x86: Add IBPB support
  2018-01-31 19:37 ` [PATCH v5 2/5] KVM: x86: Add IBPB support KarimAllah Ahmed
                     ` (2 preceding siblings ...)
  2018-02-01  4:54   ` Tom Lendacky
@ 2018-02-01 17:00   ` Raj, Ashok
  3 siblings, 0 replies; 56+ messages in thread
From: Raj, Ashok @ 2018-02-01 17:00 UTC (permalink / raw)
  To: KarimAllah Ahmed
  Cc: kvm, linux-kernel, x86, Asit Mallick, Dave Hansen,
	Arjan Van De Ven, Tim Chen, Linus Torvalds, Andrea Arcangeli,
	Andi Kleen, Thomas Gleixner, Dan Williams, Jun Nakajima,
	Andy Lutomirski, Greg KH, Paolo Bonzini, Peter Zijlstra,
	David Woodhouse, Ashok Raj

Hi Karim

Thanks for cracking at it through all the iterations.

On Wed, Jan 31, 2018 at 08:37:44PM +0100, KarimAllah Ahmed wrote:
> From: Ashok Raj <ashok.raj@intel.com>
> 
> The Indirect Branch Predictor Barrier (IBPB) is an indirect branch
> control mechanism. It keeps earlier branches from influencing
> later ones.

It might be better to split this patch into two parts.

- Just the place where we do IBPB barrier for VMX/SVM
- The msr get/set handling to comprehend guest, nested support etc.

The first part should be trivial so it can possibly make it into
tip/x86/pti branch sooner, while we rework the feedback to 
address this completely. 


Cheers,
Ashok

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 4/5] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
  2018-02-01 13:25                 ` David Woodhouse
  (?)
@ 2018-02-01 17:37                 ` KarimAllah Ahmed
  2018-02-01 17:46                   ` KarimAllah Ahmed
  -1 siblings, 1 reply; 56+ messages in thread
From: KarimAllah Ahmed @ 2018-02-01 17:37 UTC (permalink / raw)
  To: David Woodhouse, Konrad Rzeszutek Wilk
  Cc: Jim Mattson, KarimAllah Ahmed, kvm list, LKML,
	the arch/x86 maintainers, Asit Mallick, Arjan Van De Ven,
	Dave Hansen, Andi Kleen, Andrea Arcangeli, Linus Torvalds,
	Tim Chen, Thomas Gleixner, Dan Williams, Jun Nakajima,
	Paolo Bonzini, Greg KH, Andy Lutomirski, Ashok Raj

On 02/01/2018 02:25 PM, David Woodhouse wrote:
> 
> 
> On Wed, 2018-01-31 at 23:26 -0500, Konrad Rzeszutek Wilk wrote:
>>
>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>> index 6a9f4ec..bfc80ff 100644
>>> --- a/arch/x86/kvm/vmx.c
>>> +++ b/arch/x86/kvm/vmx.c
>>> @@ -594,6 +594,14 @@ struct vcpu_vmx {
>>>    #endif
>>>    
>>>         u64                   arch_capabilities;
>>> +     u64                   spec_ctrl;
>>> +
>>> +     /*
>>> +      * This indicates that:
>>> +      * 1) guest_cpuid_has(X86_FEATURE_IBRS) == true &&
>>> +      * 2) The guest has actually initiated a write against the MSR.
>>> +      */
>>> +     bool spec_ctrl_used;
>>>    
>>>         /*
>>>          * This indicates that:
> 
> Thanks for persisting with the details here, Karim. In addition to
> Konrad's heckling at the comments, I'll add my own request to his...
> 
> I'd like the comment for spec_ctrl_used to explain why it isn't
> entirely redundant with the spec_ctrl_intercepted() function.
> 
> Without nesting, I believe it *would* be redundant, but the difference
> comes when an L2 is running for which L1 has not permitted the MSR to
> be passed through. That's when we have spec_ctrl_used = true but the
> MSR *isn't* actually passed through in the active msr_bitmap.
> 
> Question: if spec_ctrl_used is always equivalent to the intercept bit
> in the vmcs01.msr_bitmap, just not the guest bitmap... should we ditch
> it and always use the bit from the vmcs01.msr_bitmap?

If I used the vmcs01.msr_bitmap, spec_ctrl_used will always be true if
L0 passed it to L1. Even if L1 did not actually pass it to L2 and even
if L2 has not written to it yet (!used).

This pretty much renders the short-circuit at
nested_vmx_merge_msr_bitmap useless:

         if (!nested_cpu_has_virt_x2apic_mode(vmcs12) &&
             !to_vmx(vcpu)->pred_cmd_used &&
             !to_vmx(vcpu)->spec_ctrl_used)
                 return false;

... and the default path will be kvm_vcpu_gpa_to_page + kmap.

That being said, I have to admit the logic for spec_ctrl_used is not
perfect either.

If L1 or any of the L2s touched the MSR, spec_ctrl_used will be set to
true. So if one L2 used the MSR, all other L2s will also skip the short-
circuit mentioned above and end up *always* going through
kvm_vcpu_gpa_to_page + kmap.

Maybe all of this is over-thinking and in reality the short-circuit
above is really useless and all L2 guests are happily using x2apic :)

> 
> Sorry :)
> 
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B

^ permalink raw reply	[flat|nested] 56+ messages in thread

* Re: [PATCH v5 4/5] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL
  2018-02-01 17:37                 ` KarimAllah Ahmed
@ 2018-02-01 17:46                   ` KarimAllah Ahmed
  0 siblings, 0 replies; 56+ messages in thread
From: KarimAllah Ahmed @ 2018-02-01 17:46 UTC (permalink / raw)
  To: David Woodhouse, Konrad Rzeszutek Wilk
  Cc: Jim Mattson, KarimAllah Ahmed, kvm list, LKML,
	the arch/x86 maintainers, Asit Mallick, Arjan Van De Ven,
	Dave Hansen, Andi Kleen, Andrea Arcangeli, Linus Torvalds,
	Tim Chen, Thomas Gleixner, Dan Williams, Jun Nakajima,
	Paolo Bonzini, Greg KH, Andy Lutomirski, Ashok Raj



On 02/01/2018 06:37 PM, KarimAllah Ahmed wrote:
> On 02/01/2018 02:25 PM, David Woodhouse wrote:
>>
>>
>> On Wed, 2018-01-31 at 23:26 -0500, Konrad Rzeszutek Wilk wrote:
>>>
>>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>>> index 6a9f4ec..bfc80ff 100644
>>>> --- a/arch/x86/kvm/vmx.c
>>>> +++ b/arch/x86/kvm/vmx.c
>>>> @@ -594,6 +594,14 @@ struct vcpu_vmx {
>>>>    #endif
>>>>         u64                   arch_capabilities;
>>>> +     u64                   spec_ctrl;
>>>> +
>>>> +     /*
>>>> +      * This indicates that:
>>>> +      * 1) guest_cpuid_has(X86_FEATURE_IBRS) == true &&
>>>> +      * 2) The guest has actually initiated a write against the MSR.
>>>> +      */
>>>> +     bool spec_ctrl_used;
>>>>         /*
>>>>          * This indicates that:
>>
>> Thanks for persisting with the details here, Karim. In addition to
>> Konrad's heckling at the comments, I'll add my own request to his...
>>
>> I'd like the comment for spec_ctrl_used to explain why it isn't
>> entirely redundant with the spec_ctrl_intercepted() function.
>>
>> Without nesting, I believe it *would* be redundant, but the difference
>> comes when an L2 is running for which L1 has not permitted the MSR to
>> be passed through. That's when we have spec_ctrl_used = true but the
>> MSR *isn't* actually passed through in the active msr_bitmap.
>>
>> Question: if spec_ctrl_used is always equivalent to the intercept bit
>> in the vmcs01.msr_bitmap, just not the guest bitmap... should we ditch
>> it and always use the bit from the vmcs01.msr_bitmap?
> 
> If I used the vmcs01.msr_bitmap, spec_ctrl_used will always be true if
> L0 passed it to L1. Even if L1 did not actually pass it to L2 and even
> if L2 has not written to it yet (!used).
> 
> This pretty much renders the short-circuit at
> nested_vmx_merge_msr_bitmap useless:
> 
>          if (!nested_cpu_has_virt_x2apic_mode(vmcs12) &&
>              !to_vmx(vcpu)->pred_cmd_used &&
>              !to_vmx(vcpu)->spec_ctrl_used)
>                  return false;
> 
> ... and the default path will be kvm_vcpu_gpa_to_page + kmap.
> 
> That being said, I have to admit the logic for spec_ctrl_used is not
> perfect either.
> 
> If L1 or any of the L2s touched the MSR, spec_ctrl_used will be set to
> true. So if one L2 used the MSR, all other L2s will also skip the short-
> circuit mentioned above and end up *always* going through
> kvm_vcpu_gpa_to_page + kmap.
> 
> Maybe all of this is over-thinking and in reality the short-circuit
> above is really useless and all L2 guests are happily using x2apic :)
> 

hehe ..

 >> if spec_ctrl_used is always equivalent to the intercept bit in the
vmcs01.msr_bitmap

actually yes, we can.

I just forgot that we update the msr bitmap lazily! :)

>>
>> Sorry :)
>>
Amazon Development Center Germany GmbH
Berlin - Dresden - Aachen
main office: Krausenstr. 38, 10117 Berlin
Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
Ust-ID: DE289237879
Eingetragen am Amtsgericht Charlottenburg HRB 149173 B

^ permalink raw reply	[flat|nested] 56+ messages in thread

end of thread, other threads:[~2018-02-01 17:46 UTC | newest]

Thread overview: 56+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-31 19:37 [PATCH v5 0/5] KVM: Expose speculation control feature to guests KarimAllah Ahmed
2018-01-31 19:37 ` KarimAllah Ahmed
2018-01-31 19:37 ` [PATCH v5 1/5] KVM: x86: Update the reverse_cpuid list to include CPUID_7_EDX KarimAllah Ahmed
2018-01-31 20:22   ` Konrad Rzeszutek Wilk
2018-01-31 19:37 ` [PATCH v5 2/5] KVM: x86: Add IBPB support KarimAllah Ahmed
2018-01-31 19:45   ` Jim Mattson
2018-01-31 19:53     ` David Woodhouse
2018-01-31 19:53       ` David Woodhouse
2018-01-31 19:55       ` Jim Mattson
2018-02-01  0:27         ` KarimAllah Ahmed
2018-01-31 20:28   ` Konrad Rzeszutek Wilk
2018-01-31 20:36     ` KarimAllah Ahmed
2018-02-01  4:54   ` Tom Lendacky
2018-02-01 17:00   ` Raj, Ashok
2018-01-31 19:37 ` [PATCH v5 3/5] KVM: VMX: Emulate MSR_IA32_ARCH_CAPABILITIES KarimAllah Ahmed
2018-01-31 19:37 ` [PATCH v5 4/5] KVM: VMX: Allow direct access to MSR_IA32_SPEC_CTRL KarimAllah Ahmed
2018-01-31 19:53   ` Jim Mattson
2018-01-31 20:00     ` David Woodhouse
2018-01-31 20:00       ` David Woodhouse
2018-01-31 20:01     ` KarimAllah Ahmed
2018-01-31 20:18       ` Jim Mattson
2018-01-31 20:21         ` David Woodhouse
2018-01-31 20:21           ` David Woodhouse
2018-01-31 21:18           ` Jim Mattson
2018-01-31 22:05             ` David Woodhouse
2018-01-31 20:34         ` Paolo Bonzini
2018-01-31 20:54           ` Jim Mattson
2018-01-31 21:00             ` Paolo Bonzini
2018-01-31 21:05               ` Jim Mattson
2018-01-31 21:17                 ` Woodhouse, David
2018-01-31 21:17                   ` Woodhouse, David
2018-01-31 21:42                 ` Paolo Bonzini
2018-01-31 21:53                   ` Jim Mattson
2018-01-31 21:59                     ` Paolo Bonzini
2018-01-31 21:59                     ` David Woodhouse
2018-01-31 22:06                       ` Jim Mattson
2018-01-31 22:10                         ` David Woodhouse
2018-01-31 22:10                           ` David Woodhouse
2018-01-31 22:21                           ` Linus Torvalds
2018-01-31 22:53                         ` Andy Lutomirski
2018-01-31 22:53                           ` Andy Lutomirski
2018-02-01 14:09                       ` Paolo Bonzini
2018-01-31 22:52         ` KarimAllah Ahmed
2018-02-01  0:24           ` KarimAllah Ahmed
2018-02-01  4:26             ` Konrad Rzeszutek Wilk
2018-02-01  4:26               ` Konrad Rzeszutek Wilk
2018-02-01 13:25               ` David Woodhouse
2018-02-01 13:25                 ` David Woodhouse
2018-02-01 17:37                 ` KarimAllah Ahmed
2018-02-01 17:46                   ` KarimAllah Ahmed
2018-02-01 14:19               ` Konrad Rzeszutek Wilk
2018-02-01 14:19                 ` Konrad Rzeszutek Wilk
2018-02-01 14:28                 ` KarimAllah Ahmed
2018-02-01 14:28                   ` KarimAllah Ahmed
2018-01-31 22:56   ` Raj, Ashok
2018-01-31 19:37 ` [PATCH v5 5/5] KVM: SVM: " KarimAllah Ahmed

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.