* [PATCH] KVM: VMX: do not disable interception for MSR_IA32_SPEC_CTRL on eIBRS
From: Jon Kohler @ 2022-05-12 17:44 UTC
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Andrea Arcangeli, Josh Poimboeuf, Kees Cook, Waiman Long, kvm,
	linux-kernel
  Cc: Jon Kohler

Avoid expensive rdmsr on every VM Exit for MSR_IA32_SPEC_CTRL on
eIBRS enabled systems iff the guest only sets IA32_SPEC_CTRL[0] (IBRS)
and not [1] (STIBP) or [2] (SSBD) by not disabling interception in
the MSR bitmap.
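
For reference, the bits in question are defined in arch/x86/include/asm/msr-index.h
along these lines (an abbreviated sketch, not the full header):

	/* IA32_SPEC_CTRL (MSR 0x48) architectural bits, per the Intel SDM */
	#define SPEC_CTRL_IBRS		BIT(0)	/* Indirect Branch Restricted Speculation */
	#define SPEC_CTRL_STIBP		BIT(1)	/* Single Thread Indirect Branch Predictors */
	#define SPEC_CTRL_SSBD		BIT(2)	/* Speculative Store Bypass Disable */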

eIBRS enabled guests using just IBRS will only write SPEC_CTRL MSR
once or twice per vCPU on boot, so it is far better to take those
VM exits on boot than having to read and save this msr on every
single VM exit forever. This outcome was suggested in Andrea's commit
2f46993d83ff ("x86: change default to spec_store_bypass_disable=prctl spectre_v2_user=prctl");
however, since interception is still unilaterally disabled, the rdmsr
tax is still there even after that commit.
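
To illustrate the once-or-twice-per-boot claim: an eIBRS-aware Linux guest does
roughly the following during mitigation selection (a minimal sketch modeled on
arch/x86/kernel/cpu/bugs.c; the exact flow varies by guest kernel version):

	/* Sketch: with eIBRS, IBRS is set once at boot and left on. */
	if (spectre_v2_enabled == SPECTRE_V2_EIBRS) {
		x86_spec_ctrl_base |= SPEC_CTRL_IBRS;
		wrmsrl(MSR_IA32_SPEC_CTRL, x86_spec_ctrl_base);	/* no further toggling */
	}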

This is a significant win for eIBRS enabled systems as this rdmsr
accounts for roughly 50% of the time in vmx_vcpu_run() as observed
by perf top disassembly, and is in the critical path for all
VM-Exits, including fastpath exits.

Update relevant comments in vmx_vcpu_run() with appropriate SDM
references for future onlookers.

Fixes: 2f46993d83ff ("x86: change default to spec_store_bypass_disable=prctl spectre_v2_user=prctl")
Signed-off-by: Jon Kohler <jon@nutanix.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Waiman Long <longman@redhat.com>
---
 arch/x86/kvm/vmx/vmx.c | 46 +++++++++++++++++++++++++++++++-----------
 1 file changed, 34 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d58b763df855..d9da6fcecd8c 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2056,6 +2056,25 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		if (kvm_spec_ctrl_test_value(data))
 			return 1;

+		/*
+		 * For Intel eIBRS, IBRS (SPEC_CTRL_IBRS aka 0x00000048 BIT(0))
+		 * only needs to be set once and can be left on forever without
+		 * needing to be constantly toggled. If the guest attempts to
+		 * write that value, let's not disable interception. Guests
+		 * with eIBRS awareness should only be writing SPEC_CTRL_IBRS
+		 * once per vCPU per boot.
+		 *
+		 * The guest can still use other SPEC_CTRL features on top of
+		 * eIBRS such as SSBD, and we should disable interception for
+		 * those situations to avoid a multitude of VM-exits;
+		 * however, we will need to check SPEC_CTRL on each exit to
+		 * make sure we restore the host value properly.
+		 */
+		if (boot_cpu_has(X86_FEATURE_IBRS_ENHANCED) && data == BIT(0)) {
+			vmx->spec_ctrl = data;
+			break;
+		}
+
 		vmx->spec_ctrl = data;
 		if (!data)
 			break;
@@ -6887,19 +6906,22 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
 	vmx_vcpu_enter_exit(vcpu, vmx);

 	/*
-	 * We do not use IBRS in the kernel. If this vCPU has used the
-	 * SPEC_CTRL MSR it may have left it on; save the value and
-	 * turn it off. This is much more efficient than blindly adding
-	 * it to the atomic save/restore list. Especially as the former
-	 * (Saving guest MSRs on vmexit) doesn't even exist in KVM.
-	 *
-	 * For non-nested case:
-	 * If the L01 MSR bitmap does not intercept the MSR, then we need to
-	 * save it.
+	 * SDM 25.1.3 - handle conditional exit for MSR_IA32_SPEC_CTRL.
+	 * To prevent constant VM exits for SPEC_CTRL, kernel may
+	 * disable interception in the MSR bitmap for SPEC_CTRL MSR,
+	 * such that the guest can read and write to that MSR without
+	 * trapping to KVM; however, the guest may set a different
+	 * value than the host. For exit handling, do rdmsr below if
+	 * interception is disabled, such that we can save the guest
+	 * value for restore on VM entry, as it does not get saved
+	 * automatically per SDM 27.3.1.
 	 *
-	 * For nested case:
-	 * If the L02 MSR bitmap does not intercept the MSR, then we need to
-	 * save it.
+	 * This behavior is optimized on eIBRS enabled systems, such
+	 * that the kernel only disables interception for MSR_IA32_SPEC_CTRL
+	 * when guests choose to use additional SPEC_CTRL features
+	 * above and beyond IBRS, such as STIBP or SSBD. This
+	 * optimization allows the kernel to avoid doing the expensive
+	 * rdmsr below.
 	 */
 	if (unlikely(!msr_write_intercepted(vmx, MSR_IA32_SPEC_CTRL)))
 		vmx->spec_ctrl = native_read_msr(MSR_IA32_SPEC_CTRL);
--
2.30.1 (Apple Git-130)
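
For readers unfamiliar with the check at the end of that last hunk:
msr_write_intercepted() consults the vCPU's MSR write bitmap, roughly as
follows (a simplified sketch of the helper in vmx.c of this era):

	static bool msr_write_intercepted(struct vcpu_vmx *vmx, u32 msr)
	{
		/* Without MSR bitmaps, every MSR access causes a VM-exit. */
		if (!(exec_controls_get(vmx) & CPU_BASED_USE_MSR_BITMAPS))
			return true;

		return vmx_test_msr_bitmap_write(vmx->loaded_vmcs->msr_bitmap, msr);
	}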



* Re: [PATCH] KVM: VMX: do not disable interception for MSR_IA32_SPEC_CTRL on eIBRS
From: Jon Kohler @ 2022-05-16 18:27 UTC
  To: Jon Kohler
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, X86 ML, H. Peter Anvin,
	Andrea Arcangeli, Josh Poimboeuf, Kees Cook, Waiman Long,
	kvm@vger.kernel.org, LKML



> On May 12, 2022, at 1:44 PM, Jon Kohler <jon@nutanix.com> wrote:
> 
> Avoid expensive rdmsr on every VM Exit for MSR_IA32_SPEC_CTRL on
> eIBRS enabled systems iff the guest only sets IA32_SPEC_CTRL[0] (IBRS)
> and not [1] (STIBP) or [2] (SSBD) by not disabling interception in
> the MSR bitmap.
> 
> eIBRS enabled guests using just IBRS will only write SPEC_CTRL MSR
> once or twice per vCPU on boot, so it is far better to take those
> VM exits on boot than having to read and save this msr on every
> single VM exit forever. This outcome was suggested in Andrea's commit
> 2f46993d83ff ("x86: change default to spec_store_bypass_disable=prctl spectre_v2_user=prctl");
> however, since interception is still unilaterally disabled, the rdmsr
> tax is still there even after that commit.
> 
> This is a significant win for eIBRS enabled systems as this rdmsr
> accounts for roughly 50% of the time in vmx_vcpu_run() as observed
> by perf top disassembly, and is in the critical path for all
> VM-Exits, including fastpath exits.
> 
> Update relevant comments in vmx_vcpu_run() with appropriate SDM
> references for future onlookers.
> 

Gentle ping on this one

> Fixes: 2f46993d83ff ("x86: change default to spec_store_bypass_disable=prctl spectre_v2_user=prctl")
> Signed-off-by: Jon Kohler <jon@nutanix.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
> Cc: Waiman Long <longman@redhat.com>
> ---
> arch/x86/kvm/vmx/vmx.c | 46 +++++++++++++++++++++++++++++++-----------
> 1 file changed, 34 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index d58b763df855..d9da6fcecd8c 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -2056,6 +2056,25 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> 		if (kvm_spec_ctrl_test_value(data))
> 			return 1;
> 
> +		/*
> +		 * For Intel eIBRS, IBRS (SPEC_CTRL_IBRS aka 0x00000048 BIT(0))
> +		 * only needs to be set once and can be left on forever without
> +		 * needing to be constantly toggled. If the guest attempts to
> +		 * write that value, let's not disable interception. Guests
> +		 * with eIBRS awareness should only be writing SPEC_CTRL_IBRS
> +		 * once per vCPU per boot.
> +		 *
> +		 * The guest can still use other SPEC_CTRL features on top of
> +		 * eIBRS such as SSBD, and we should disable interception for
> +		 * those situations to avoid a multitude of VM-exits;
> +		 * however, we will need to check SPEC_CTRL on each exit to
> +		 * make sure we restore the host value properly.
> +		 */
> +		if (boot_cpu_has(X86_FEATURE_IBRS_ENHANCED) && data == BIT(0)) {
> +			vmx->spec_ctrl = data;
> +			break;
> +		}
> +
> 		vmx->spec_ctrl = data;
> 		if (!data)
> 			break;
> @@ -6887,19 +6906,22 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
> 	vmx_vcpu_enter_exit(vcpu, vmx);
> 
> 	/*
> -	 * We do not use IBRS in the kernel. If this vCPU has used the
> -	 * SPEC_CTRL MSR it may have left it on; save the value and
> -	 * turn it off. This is much more efficient than blindly adding
> -	 * it to the atomic save/restore list. Especially as the former
> -	 * (Saving guest MSRs on vmexit) doesn't even exist in KVM.
> -	 *
> -	 * For non-nested case:
> -	 * If the L01 MSR bitmap does not intercept the MSR, then we need to
> -	 * save it.
> +	 * SDM 25.1.3 - handle conditional exit for MSR_IA32_SPEC_CTRL.
> +	 * To prevent constant VM exits for SPEC_CTRL, kernel may
> +	 * disable interception in the MSR bitmap for SPEC_CTRL MSR,
> +	 * such that the guest can read and write to that MSR without
> +	 * trapping to KVM; however, the guest may set a different
> +	 * value than the host. For exit handling, do rdmsr below if
> +	 * interception is disabled, such that we can save the guest
> +	 * value for restore on VM entry, as it does not get saved
> +	 * automatically per SDM 27.3.1.
> 	 *
> -	 * For nested case:
> -	 * If the L02 MSR bitmap does not intercept the MSR, then we need to
> -	 * save it.
> +	 * This behavior is optimized on eIBRS enabled systems, such
> +	 * that the kernel only disables interception for MSR_IA32_SPEC_CTRL
> +	 * when guests choose to use additional SPEC_CTRL features
> +	 * above and beyond IBRS, such as STIBP or SSBD. This
> +	 * optimization allows the kernel to avoid doing the expensive
> +	 * rdmsr below.
> 	 */
> 	if (unlikely(!msr_write_intercepted(vmx, MSR_IA32_SPEC_CTRL)))
> 		vmx->spec_ctrl = native_read_msr(MSR_IA32_SPEC_CTRL);
> --
> 2.30.1 (Apple Git-130)
> 



* Re: [PATCH] KVM: VMX: do not disable interception for MSR_IA32_SPEC_CTRL on eIBRS
From: Sean Christopherson @ 2022-05-18  1:42 UTC
  To: Jon Kohler
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Andrea Arcangeli,
	Josh Poimboeuf, Kees Cook, Waiman Long, kvm, linux-kernel

On Thu, May 12, 2022, Jon Kohler wrote:
> Avoid expensive rdmsr on every VM Exit for MSR_IA32_SPEC_CTRL on
> eIBRS enabled systems iff the guest only sets IA32_SPEC_CTRL[0] (IBRS)
> and not [1] (STIBP) or [2] (SSBD) by not disabling interception in
> the MSR bitmap.
> 
> eIBRS enabled guests using just IBRS will only write SPEC_CTRL MSR
> once or twice per vCPU on boot, so it is far better to take those
> VM exits on boot than having to read and save this msr on every
> single VM exit forever. This outcome was suggested in Andrea's commit
> 2f46993d83ff ("x86: change default to spec_store_bypass_disable=prctl spectre_v2_user=prctl");
> however, since interception is still unilaterally disabled, the rdmsr
> tax is still there even after that commit.
> 
> This is a significant win for eIBRS enabled systems as this rdmsr
> accounts for roughly 50% of the time in vmx_vcpu_run() as observed
> by perf top disassembly, and is in the critical path for all
> VM-Exits, including fastpath exits.
> 
> Update relevant comments in vmx_vcpu_run() with appropriate SDM
> references for future onlookers.
> 
> Fixes: 2f46993d83ff ("x86: change default to spec_store_bypass_disable=prctl spectre_v2_user=prctl")
> Signed-off-by: Jon Kohler <jon@nutanix.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
> Cc: Waiman Long <longman@redhat.com>
> ---
>  arch/x86/kvm/vmx/vmx.c | 46 +++++++++++++++++++++++++++++++-----------
>  1 file changed, 34 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index d58b763df855..d9da6fcecd8c 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -2056,6 +2056,25 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  		if (kvm_spec_ctrl_test_value(data))
>  			return 1;
> 
> +		/*
> +		 * For Intel eIBRS, IBRS (SPEC_CTRL_IBRS aka 0x00000048 BIT(0))
> +		 * only needs to be set once and can be left on forever without
> +		 * needing to be constantly toggled. If the guest attempts to
> +		 * write that value, let's not disable interception. Guests
> +		 * with eIBRS awareness should only be writing SPEC_CTRL_IBRS
> +		 * once per vCPU per boot.
> +		 *
> +		 * The guest can still use other SPEC_CTRL features on top of
> +		 * eIBRS such as SSBD, and we should disable interception for

Please don't add comments that say "should" or "need", just state what is being
done.  Just because a comment says XYZ should do something doesn't necessarily
mean that XYZ actually does that thing.

> +		 * those situations to avoid a multitude of VM-exits;
> +		 * however, we will need to check SPEC_CTRL on each exit to
> +		 * make sure we restore the host value properly.
> +		 */

This whole comment can be more succinct.  Better yet, merge it with the comment
below (out of sight in this diff) and opportunistically update that comment to
reflect what actually happens if L2 is the first to write a non-zero value (arguably
a bug that should be fixed, but meh).  The IBPB MSR has the same flaw.  :-/

> +		if (boot_cpu_has(X86_FEATURE_IBRS_ENHANCED) && data == BIT(0)) {

Use SPEC_CTRL_IBRS instead of open coding "BIT(0)", then a chunk of the comment
goes away.

> +			vmx->spec_ctrl = data;
> +			break;
> +		}

There's no need for a separate if statement.  And the boot_cpu_has() check can
be dropped, kvm_spec_ctrl_test_value() has already verified the bit is writable
(unless you're worried about bit 0 being used for something else?)
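
(For context, a rough sketch of kvm_spec_ctrl_test_value() as it reads in
arch/x86/kvm/x86.c of this era -- it verifies writability empirically by
probing the hardware MSR and restoring the old value:)

	int kvm_spec_ctrl_test_value(u64 value)
	{
		u64 saved_value;
		unsigned long flags;
		int ret = 0;

		local_irq_save(flags);

		if (rdmsrl_safe(MSR_IA32_SPEC_CTRL, &saved_value))
			ret = 1;
		else if (wrmsrl_safe(MSR_IA32_SPEC_CTRL, value))
			ret = 1;
		else
			wrmsrl(MSR_IA32_SPEC_CTRL, saved_value);

		local_irq_restore(flags);

		return ret;
	}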

> +
>  		vmx->spec_ctrl = data;
>  		if (!data)
>  			break;
> @@ -6887,19 +6906,22 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
>  	vmx_vcpu_enter_exit(vcpu, vmx);
> 
>  	/*
> -	 * We do not use IBRS in the kernel. If this vCPU has used the
> -	 * SPEC_CTRL MSR it may have left it on; save the value and
> -	 * turn it off. This is much more efficient than blindly adding
> -	 * it to the atomic save/restore list. Especially as the former
> -	 * (Saving guest MSRs on vmexit) doesn't even exist in KVM.
> -	 *
> -	 * For non-nested case:
> -	 * If the L01 MSR bitmap does not intercept the MSR, then we need to
> -	 * save it.
> +	 * SDM 25.1.3 - handle conditional exit for MSR_IA32_SPEC_CTRL.
> +	 * To prevent constant VM exits for SPEC_CTRL, kernel may

Please wrap at 80 chars (ignore the bad examples in KVM).

> +	 * disable interception in the MSR bitmap for SPEC_CTRL MSR,
> +	 * such that the guest can read and write to that MSR without
> +	 * trapping to KVM; however, the guest may set a different
> +	 * value than the host. For exit handling, do rdmsr below if
> +	 * interception is disabled, such that we can save the guest
> +	 * value for restore on VM entry, as it does not get saved
> +	 * automatically per SDM 27.3.1.
>  	 *
> -	 * For nested case:
> -	 * If the L02 MSR bitmap does not intercept the MSR, then we need to
> -	 * save it.
> +	 * This behavior is optimized on eIBRS enabled systems, such
> +	 * that the kernel only disables interception for MSR_IA32_SPEC_CTRL
> +	 * when guests choose to use additional SPEC_CTRL features
> +	 * above and beyond IBRS, such as STIBP or SSBD. This
> +	 * optimization allows the kernel to avoid doing the expensive
> +	 * rdmsr below.

I don't see any reason to restate why the MSR may or may not be intercepted, just
explain when the value needs to be read.

E.g. something like this for the whole patch?

---
 arch/x86/kvm/vmx/vmx.c | 60 +++++++++++++++++++-----------------------
 1 file changed, 27 insertions(+), 33 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 610355b9ccce..70d863a7882d 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2057,20 +2057,30 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			return 1;

 		vmx->spec_ctrl = data;
-		if (!data)
+
+		/*
+		 * Disable interception on the first non-zero write, unless the
+		 * guest is setting only SPEC_CTRL_IBRS, which is typically set
+		 * once at boot and never touched again.  All other bits are
+		 * often set on a per-task basis, i.e. may change frequently,
+		 * so the benefit of avoiding VM-exits during guest context
+		 * switches outweighs the cost of RDMSR on every VM-Exit to
+		 * save the guest's value.
+		 */
+		if (!data || data == SPEC_CTRL_IBRS)
 			break;

 		/*
-		 * For non-nested:
-		 * When it's written (to non-zero) for the first time, pass
-		 * it through.
-		 *
-		 * For nested:
-		 * The handling of the MSR bitmap for L2 guests is done in
-		 * nested_vmx_prepare_msr_bitmap. We should not touch the
-		 * vmcs02.msr_bitmap here since it gets completely overwritten
-		 * in the merging. We update the vmcs01 here for L1 as well
-		 * since it will end up touching the MSR anyway now.
+		 * Update vmcs01.msr_bitmap even if L2 is active, i.e. disable
+		 * interception for the vCPU on the first write regardless of
+		 * whether the WRMSR came from L1 or L2.  vmcs02's bitmap is a
+		 * combination of vmcs01 and vmcs12 bitmaps, and will be
+		 * recomputed by nested_vmx_prepare_msr_bitmap() on the next
+		 * nested VM-Enter.  Note, this does mean that future WRMSRs
+		 * from L2 will be intercepted until the next nested VM-Exit if
+		 * L2 was the first to write, but L1 exposing the MSR to L2
+		 * without first writing it is unlikely and not worth the
+		 * extra bit of complexity.
 		 */
 		vmx_disable_intercept_for_msr(vcpu,
 					      MSR_IA32_SPEC_CTRL,
@@ -2098,15 +2108,9 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);

 		/*
-		 * For non-nested:
-		 * When it's written (to non-zero) for the first time, pass
-		 * it through.
-		 *
-		 * For nested:
-		 * The handling of the MSR bitmap for L2 guests is done in
-		 * nested_vmx_prepare_msr_bitmap. We should not touch the
-		 * vmcs02.msr_bitmap here since it gets completely overwritten
-		 * in the merging.
+		 * Disable interception on the first IBPB, odds are good IBPB
+		 * will be a frequent guest action.  See the comment for
+		 * MSR_IA32_SPEC_CTRL for details on the nested interaction.
 		 */
 		vmx_disable_intercept_for_msr(vcpu, MSR_IA32_PRED_CMD, MSR_TYPE_W);
 		break;
@@ -6887,19 +6891,9 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
 	vmx_vcpu_enter_exit(vcpu, vmx);

 	/*
-	 * We do not use IBRS in the kernel. If this vCPU has used the
-	 * SPEC_CTRL MSR it may have left it on; save the value and
-	 * turn it off. This is much more efficient than blindly adding
-	 * it to the atomic save/restore list. Especially as the former
-	 * (Saving guest MSRs on vmexit) doesn't even exist in KVM.
-	 *
-	 * For non-nested case:
-	 * If the L01 MSR bitmap does not intercept the MSR, then we need to
-	 * save it.
-	 *
-	 * For nested case:
-	 * If the L02 MSR bitmap does not intercept the MSR, then we need to
-	 * save it.
+	 * Save SPEC_CTRL if it may have been written by the guest, the current
+	 * value in hardware is used by x86_spec_ctrl_restore_host() to avoid
+	 * WRMSR if the current value matches the host's desired value.
 	 */
 	if (unlikely(!msr_write_intercepted(vmx, MSR_IA32_SPEC_CTRL)))
 		vmx->spec_ctrl = native_read_msr(MSR_IA32_SPEC_CTRL);

base-commit: 69b59889b0147aa80098936e383b06fec30cdf5c
--
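
For background on the restore path named in that last comment,
x86_spec_ctrl_restore_host() conceptually reduces to the following (a
hypothetical simplification; the real helper lives in
arch/x86/include/asm/spec-ctrl.h and also handles SSBD virtualization):

	/* Only pay for the WRMSR if the guest left a different value behind. */
	if (guest_spec_ctrl != x86_spec_ctrl_base)
		wrmsrl(MSR_IA32_SPEC_CTRL, x86_spec_ctrl_base);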



* Re: [PATCH] KVM: VMX: do not disable interception for MSR_IA32_SPEC_CTRL on eIBRS
From: Jon Kohler @ 2022-05-18 14:23 UTC
  To: Sean Christopherson
  Cc: Jon Kohler, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Andrea Arcangeli, Josh Poimboeuf, Kees Cook, Waiman Long, kvm,
	linux-kernel



> On May 17, 2022, at 9:42 PM, Sean Christopherson <seanjc@google.com> wrote:
> 
> On Thu, May 12, 2022, Jon Kohler wrote:
>> Avoid expensive rdmsr on every VM Exit for MSR_IA32_SPEC_CTRL on
>> eIBRS enabled systems iff the guest only sets IA32_SPEC_CTRL[0] (IBRS)
>> and not [1] (STIBP) or [2] (SSBD) by not disabling interception in
>> the MSR bitmap.
>> 
>> eIBRS enabled guests using just IBRS will only write SPEC_CTRL MSR
>> once or twice per vCPU on boot, so it is far better to take those
>> VM exits on boot than having to read and save this msr on every
>> single VM exit forever. This outcome was suggested in Andrea's commit
>> 2f46993d83ff ("x86: change default to spec_store_bypass_disable=prctl spectre_v2_user=prctl");
>> however, since interception is still unilaterally disabled, the rdmsr
>> tax is still there even after that commit.
>> 
>> This is a significant win for eIBRS enabled systems as this rdmsr
>> accounts for roughly 50% of the time in vmx_vcpu_run() as observed
>> by perf top disassembly, and is in the critical path for all
>> VM-Exits, including fastpath exits.
>> 
>> Update relevant comments in vmx_vcpu_run() with appropriate SDM
>> references for future onlookers.
>> 
>> Fixes: 2f46993d83ff ("x86: change default to spec_store_bypass_disable=prctl spectre_v2_user=prctl")
>> Signed-off-by: Jon Kohler <jon@nutanix.com>
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Cc: Kees Cook <keescook@chromium.org>
>> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
>> Cc: Waiman Long <longman@redhat.com>
>> ---
>> arch/x86/kvm/vmx/vmx.c | 46 +++++++++++++++++++++++++++++++-----------
>> 1 file changed, 34 insertions(+), 12 deletions(-)
>> 
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index d58b763df855..d9da6fcecd8c 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -2056,6 +2056,25 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>> 		if (kvm_spec_ctrl_test_value(data))
>> 			return 1;
>> 
>> +		/*
>> +		 * For Intel eIBRS, IBRS (SPEC_CTRL_IBRS aka 0x00000048 BIT(0))
>> +		 * only needs to be set once and can be left on forever without
>> +		 * needing to be constantly toggled. If the guest attempts to
>> +		 * write that value, let's not disable interception. Guests
>> +		 * with eIBRS awareness should only be writing SPEC_CTRL_IBRS
>> +		 * once per vCPU per boot.
>> +		 *
>> +		 * The guest can still use other SPEC_CTRL features on top of
>> +		 * eIBRS such as SSBD, and we should disable interception for
> 
> Please don't add comments that say "should" or "need", just state what is being
> done.  Just because a comment says XYZ should do something doesn't necessarily
> mean that XYZ actually does that thing.
> 
>> +		 * those situations to avoid a multitude of VM-exits;
>> +		 * however, we will need to check SPEC_CTRL on each exit to
>> +		 * make sure we restore the host value properly.
>> +		 */
> 
> This whole comment can be more succinct.  Better yet, merge it with the comment
> below (out of sight in this diff) and opportunistically update that comment to
> reflect what actually happens if L2 is the first to write a non-zero value (arguably
> a bug that should be fixed, but meh).  The IBPB MSR has the same flaw.  :-/
> 
>> +		if (boot_cpu_has(X86_FEATURE_IBRS_ENHANCED) && data == BIT(0)) {
> 
> Use SPEC_CTRL_IBRS instead of open coding "BIT(0)", then a chunk of the comment
> goes away.
> 
>> +			vmx->spec_ctrl = data;
>> +			break;
>> +		}
> 
> There's no need for a separate if statement.  And the boot_cpu_has() check can
> be dropped, kvm_spec_ctrl_test_value() has already verified the bit is writable
> (unless you're worried about bit 0 being used for something else?)
> 
>> +
>> 		vmx->spec_ctrl = data;
>> 		if (!data)
>> 			break;
>> @@ -6887,19 +6906,22 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
>> 	vmx_vcpu_enter_exit(vcpu, vmx);
>> 
>> 	/*
>> -	 * We do not use IBRS in the kernel. If this vCPU has used the
>> -	 * SPEC_CTRL MSR it may have left it on; save the value and
>> -	 * turn it off. This is much more efficient than blindly adding
>> -	 * it to the atomic save/restore list. Especially as the former
>> -	 * (Saving guest MSRs on vmexit) doesn't even exist in KVM.
>> -	 *
>> -	 * For non-nested case:
>> -	 * If the L01 MSR bitmap does not intercept the MSR, then we need to
>> -	 * save it.
>> +	 * SDM 25.1.3 - handle conditional exit for MSR_IA32_SPEC_CTRL.
>> +	 * To prevent constant VM exits for SPEC_CTRL, kernel may
> 
> Please wrap at 80 chars (ignore the bad examples in KVM).
> 
>> +	 * disable interception in the MSR bitmap for SPEC_CTRL MSR,
>> +	 * such that the guest can read and write to that MSR without
>> +	 * trapping to KVM; however, the guest may set a different
>> +	 * value than the host. For exit handling, do rdmsr below if
>> +	 * interception is disabled, such that we can save the guest
>> +	 * value for restore on VM entry, as it does not get saved
>> +	 * automatically per SDM 27.3.1.
>> 	 *
>> -	 * For nested case:
>> -	 * If the L02 MSR bitmap does not intercept the MSR, then we need to
>> -	 * save it.
>> +	 * This behavior is optimized on eIBRS enabled systems, such
>> +	 * that the kernel only disables interception for MSR_IA32_SPEC_CTRL
>> +	 * when guests choose to use additional SPEC_CTRL features
>> +	 * above and beyond IBRS, such as STIBP or SSBD. This
>> +	 * optimization allows the kernel to avoid doing the expensive
>> +	 * rdmsr below.
> 
> I don't see any reason to restate why the MSR may or may not be intercepted, just
> explain when the value needs to be read.
> 
> E.g. something like this for the whole patch?

Sean - Thanks for the feedback as always, good stuff. I’ll pull this together,
double-check the testing side, and send out a v2 with all this incorporated.

> 
> ---
> arch/x86/kvm/vmx/vmx.c | 60 +++++++++++++++++++-----------------------
> 1 file changed, 27 insertions(+), 33 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 610355b9ccce..70d863a7882d 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -2057,20 +2057,30 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> 			return 1;
> 
> 		vmx->spec_ctrl = data;
> -		if (!data)
> +
> +		/*
> +		 * Disable interception on the first non-zero write, unless the
> +		 * guest is setting only SPEC_CTRL_IBRS, which is typically set
> +		 * once at boot and never touched again.  All other bits are
> +		 * often set on a per-task basis, i.e. may change frequently,
> +		 * so the benefit of avoiding VM-exits during guest context
> +		 * switches outweighs the cost of RDMSR on every VM-Exit to
> +		 * save the guest's value.
> +		 */
> +		if (!data || data == SPEC_CTRL_IBRS)
> 			break;
> 
> 		/*
> -		 * For non-nested:
> -		 * When it's written (to non-zero) for the first time, pass
> -		 * it through.
> -		 *
> -		 * For nested:
> -		 * The handling of the MSR bitmap for L2 guests is done in
> -		 * nested_vmx_prepare_msr_bitmap. We should not touch the
> -		 * vmcs02.msr_bitmap here since it gets completely overwritten
> -		 * in the merging. We update the vmcs01 here for L1 as well
> -		 * since it will end up touching the MSR anyway now.
> +		 * Update vmcs01.msr_bitmap even if L2 is active, i.e. disable
> +		 * interception for the vCPU on the first write regardless of
> +		 * whether the WRMSR came from L1 or L2.  vmcs02's bitmap is a
> +		 * combination of vmcs01 and vmcs12 bitmaps, and will be
> +		 * recomputed by nested_vmx_prepare_msr_bitmap() on the next
> +		 * nested VM-Enter.  Note, this does mean that future WRMSRs
> +		 * from L2 will be intercepted until the next nested VM-Exit if
> +		 * L2 was the first to write, but L1 exposing the MSR to L2
> +		 * without first writing it is unlikely and not worth the
> +		 * extra bit of complexity.
> 		 */
> 		vmx_disable_intercept_for_msr(vcpu,
> 					      MSR_IA32_SPEC_CTRL,
> @@ -2098,15 +2108,9 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> 		wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
> 
> 		/*
> -		 * For non-nested:
> -		 * When it's written (to non-zero) for the first time, pass
> -		 * it through.
> -		 *
> -		 * For nested:
> -		 * The handling of the MSR bitmap for L2 guests is done in
> -		 * nested_vmx_prepare_msr_bitmap. We should not touch the
> -		 * vmcs02.msr_bitmap here since it gets completely overwritten
> -		 * in the merging.
> +		 * Disable interception on the first IBPB, odds are good IBPB
> +		 * will be a frequent guest action.  See the comment for
> +		 * MSR_IA32_SPEC_CTRL for details on the nested interaction.
> 		 */
> 		vmx_disable_intercept_for_msr(vcpu, MSR_IA32_PRED_CMD, MSR_TYPE_W);
> 		break;
> @@ -6887,19 +6891,9 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
> 	vmx_vcpu_enter_exit(vcpu, vmx);
> 
> 	/*
> -	 * We do not use IBRS in the kernel. If this vCPU has used the
> -	 * SPEC_CTRL MSR it may have left it on; save the value and
> -	 * turn it off. This is much more efficient than blindly adding
> -	 * it to the atomic save/restore list. Especially as the former
> -	 * (Saving guest MSRs on vmexit) doesn't even exist in KVM.
> -	 *
> -	 * For non-nested case:
> -	 * If the L01 MSR bitmap does not intercept the MSR, then we need to
> -	 * save it.
> -	 *
> -	 * For nested case:
> -	 * If the L02 MSR bitmap does not intercept the MSR, then we need to
> -	 * save it.
> +	 * Save SPEC_CTRL if it may have been written by the guest, the current
> +	 * value in hardware is used by x86_spec_ctrl_restore_host() to avoid
> +	 * WRMSR if the current value matches the host's desired value.
> 	 */
> 	if (unlikely(!msr_write_intercepted(vmx, MSR_IA32_SPEC_CTRL)))
> 		vmx->spec_ctrl = native_read_msr(MSR_IA32_SPEC_CTRL);
> 
> base-commit: 69b59889b0147aa80098936e383b06fec30cdf5c
> --



* Re: [PATCH] KVM: VMX: do not disable interception for MSR_IA32_SPEC_CTRL on eIBRS
From: Jon Kohler @ 2022-05-20 19:55 UTC
  To: Jon Kohler
  Cc: Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Andrea Arcangeli, Josh Poimboeuf, Kees Cook, Waiman Long, kvm,
	linux-kernel



> On May 18, 2022, at 10:23 AM, Jon Kohler <jon@nutanix.com> wrote:
> 
> 
> 
>> On May 17, 2022, at 9:42 PM, Sean Christopherson <seanjc@google.com> wrote:
>> 
>> On Thu, May 12, 2022, Jon Kohler wrote:
>>> Avoid expensive rdmsr on every VM Exit for MSR_IA32_SPEC_CTRL on
>>> eIBRS enabled systems iff the guest only sets IA32_SPEC_CTRL[0] (IBRS)
>>> and not [1] (STIBP) or [2] (SSBD) by not disabling interception in
>>> the MSR bitmap.
>>> 
>>> eIBRS enabled guests using just IBRS will only write SPEC_CTRL MSR
>>> once or twice per vCPU on boot, so it is far better to take those
>>> VM exits on boot than having to read and save this msr on every
>>> single VM exit forever. This outcome was suggested in Andrea's commit
>>> 2f46993d83ff ("x86: change default to spec_store_bypass_disable=prctl spectre_v2_user=prctl");
>>> however, since interception is still unilaterally disabled, the rdmsr
>>> tax is still there even after that commit.
>>> 
>>> This is a significant win for eIBRS enabled systems as this rdmsr
>>> accounts for roughly 50% of the time in vmx_vcpu_run() as observed
>>> by perf top disassembly, and is in the critical path for all
>>> VM-Exits, including fastpath exits.
>>> 
>>> Update relevant comments in vmx_vcpu_run() with appropriate SDM
>>> references for future onlookers.
>>> 
>>> Fixes: 2f46993d83ff ("x86: change default to spec_store_bypass_disable=prctl spectre_v2_user=prctl")
>>> Signed-off-by: Jon Kohler <jon@nutanix.com>
>>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>>> Cc: Kees Cook <keescook@chromium.org>
>>> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
>>> Cc: Waiman Long <longman@redhat.com>
>>> ---
>>> arch/x86/kvm/vmx/vmx.c | 46 +++++++++++++++++++++++++++++++-----------
>>> 1 file changed, 34 insertions(+), 12 deletions(-)
>>> 
>>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>>> index d58b763df855..d9da6fcecd8c 100644
>>> --- a/arch/x86/kvm/vmx/vmx.c
>>> +++ b/arch/x86/kvm/vmx/vmx.c
>>> @@ -2056,6 +2056,25 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>>> 		if (kvm_spec_ctrl_test_value(data))
>>> 			return 1;
>>> 
>>> +		/*
>>> +		 * For Intel eIBRS, IBRS (SPEC_CTRL_IBRS aka 0x00000048 BIT(0))
>>> +		 * only needs to be set once and can be left on forever without
>>> +		 * needing to be constantly toggled. If the guest attempts to
>>> +		 * write that value, let's not disable interception. Guests
>>> +		 * with eIBRS awareness should only be writing SPEC_CTRL_IBRS
>>> +		 * once per vCPU per boot.
>>> +		 *
>>> +		 * The guest can still use other SPEC_CTRL features on top of
>>> +		 * eIBRS such as SSBD, and we should disable interception for
>> 
>> Please don't add comments that say "should" or "need", just state what is being
>> done.  Just because a comment says XYZ should do something doesn't necessarily
>> mean that XYZ actually does that thing.
>> 
>>> +		 * those situations to avoid a multitude of VM-exits;
>>> +		 * however, we will need to check SPEC_CTRL on each exit to
>>> +		 * make sure we restore the host value properly.
>>> +		 */
>> 
>> This whole comment can be more succinct.  Better yet, merge it with the comment
>> below (out of sight in this diff) and opportunistically update that comment to
>> reflect what actually happens if L2 is the first to write a non-zero value (arguably
>> a bug that should be fixed, but meh).  The IBPB MSR has the same flaw.  :-/
>> 
>>> +		if (boot_cpu_has(X86_FEATURE_IBRS_ENHANCED) && data == BIT(0)) {
>> 
>> Use SPEC_CTRL_IBRS instead of open coding "BIT(0)", then a chunk of the comment
>> goes away.
>> 
>>> +			vmx->spec_ctrl = data;
>>> +			break;
>>> +		}
>> 
>> There's no need for a separate if statement.  And the boot_cpu_has() check can
>> be dropped, kvm_spec_ctrl_test_value() has already verified the bit is writable
>> (unless you're worried about bit 0 being used for something else?)

I was (and am) worried about misbehaving guests on pre-eIBRS systems spamming IBRS
MSR, which we wouldn’t be able to see today. Intel’s guidance for eIBRS has long been to
set it once and be done with it, so any eIBRS-aware guest should behave nicely with that.
That limits the blast radius a bit here.

>> 
>>> +
>>> 		vmx->spec_ctrl = data;
>>> 		if (!data)
>>> 			break;
>>> @@ -6887,19 +6906,22 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
>>> 	vmx_vcpu_enter_exit(vcpu, vmx);
>>> 
>>> 	/*
>>> -	 * We do not use IBRS in the kernel. If this vCPU has used the
>>> -	 * SPEC_CTRL MSR it may have left it on; save the value and
>>> -	 * turn it off. This is much more efficient than blindly adding
>>> -	 * it to the atomic save/restore list. Especially as the former
>>> -	 * (Saving guest MSRs on vmexit) doesn't even exist in KVM.
>>> -	 *
>>> -	 * For non-nested case:
>>> -	 * If the L01 MSR bitmap does not intercept the MSR, then we need to
>>> -	 * save it.
>>> +	 * SDM 25.1.3 - handle conditional exit for MSR_IA32_SPEC_CTRL.
>>> +	 * To prevent constant VM exits for SPEC_CTRL, kernel may
>> 
>> Please wrap at 80 chars (ignore the bad examples in KVM).
>> 
>>> +	 * disable interception in the MSR bitmap for SPEC_CTRL MSR,
>>> +	 * such that the guest can read and write to that MSR without
>>> +	 * trapping to KVM; however, the guest may set a different
>>> +	 * value than the host. For exit handling, do rdmsr below if
>>> +	 * interception is disabled, such that we can save the guest
>>> +	 * value for restore on VM entry, as it does not get saved
>>> +	 * automatically per SDM 27.3.1.
>>> 	 *
>>> -	 * For nested case:
>>> -	 * If the L02 MSR bitmap does not intercept the MSR, then we need to
>>> -	 * save it.
>>> +	 * This behavior is optimized on eIBRS enabled systems, such
>>> +	 * that the kernel only disables interception for MSR_IA32_SPEC_CTRL
>>> +	 * when guests choose to use additional SPEC_CTRL features
>>> +	 * above and beyond IBRS, such as STIBP or SSBD. This
>>> +	 * optimization allows the kernel to avoid doing the expensive
>>> +	 * rdmsr below.
>> 
>> I don't see any reason to restate why the MSR may or may not be intercepted, just
>> explain when the value needs to be read.
>> 
>> E.g. something like this for the whole patch?
> 
> Sean - Thanks for the feedback as always, good stuff. I’ll pull this together,
> double-check the testing side, and send out a v2 with all this incorporated.

Sent out the v2 just now with a few minor tweaks; the only notable one was keeping
the boot cpu check, plus small tweaks to comments here and there to suit.

> 
>> 
>> ---
>> arch/x86/kvm/vmx/vmx.c | 60 +++++++++++++++++++-----------------------
>> 1 file changed, 27 insertions(+), 33 deletions(-)
>> 
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index 610355b9ccce..70d863a7882d 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -2057,20 +2057,30 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>> 			return 1;
>> 
>> 		vmx->spec_ctrl = data;
>> -		if (!data)
>> +
>> +		/*
>> +		 * Disable interception on the first non-zero write, unless the
>> +		 * guest is setting only SPEC_CTRL_IBRS, which is typically set
>> +		 * once at boot and never touched again.  All other bits are
>> +		 * often set on a per-task basis, i.e. may change frequently,
>> +		 * so the benefit of avoiding VM-exits during guest context
>> +		 * switches outweighs the cost of RDMSR on every VM-Exit to
>> +		 * save the guest's value.
>> +		 */
>> +		if (!data || data == SPEC_CTRL_IBRS)
>> 			break;
>> 
>> 		/*
>> -		 * For non-nested:
>> -		 * When it's written (to non-zero) for the first time, pass
>> -		 * it through.
>> -		 *
>> -		 * For nested:
>> -		 * The handling of the MSR bitmap for L2 guests is done in
>> -		 * nested_vmx_prepare_msr_bitmap. We should not touch the
>> -		 * vmcs02.msr_bitmap here since it gets completely overwritten
>> -		 * in the merging. We update the vmcs01 here for L1 as well
>> -		 * since it will end up touching the MSR anyway now.
>> +		 * Update vmcs01.msr_bitmap even if L2 is active, i.e. disable
>> +		 * interception for the vCPU on the first write regardless of
>> +		 * whether the WRMSR came from L1 or L2.  vmcs02's bitmap is a
>> +		 * combination of vmcs01 and vmcs12 bitmaps, and will be
>> +		 * recomputed by nested_vmx_prepare_msr_bitmap() on the next
>> +		 * nested VM-Enter.  Note, this does mean that future WRMSRs
>> +		 * from L2 will be intercepted until the next nested VM-Exit if
>> +		 * L2 was the first to write, but L1 exposing the MSR to L2
>> +		 * without first writing it is unlikely and not worth the
>> +		 * extra bit of complexity.
>> 		 */
>> 		vmx_disable_intercept_for_msr(vcpu,
>> 					      MSR_IA32_SPEC_CTRL,
>> @@ -2098,15 +2108,9 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>> 		wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
>> 
>> 		/*
>> -		 * For non-nested:
>> -		 * When it's written (to non-zero) for the first time, pass
>> -		 * it through.
>> -		 *
>> -		 * For nested:
>> -		 * The handling of the MSR bitmap for L2 guests is done in
>> -		 * nested_vmx_prepare_msr_bitmap. We should not touch the
>> -		 * vmcs02.msr_bitmap here since it gets completely overwritten
>> -		 * in the merging.
>> +		 * Disable interception on the first IBPB, odds are good IBPB
>> +		 * will be a frequent guest action.  See the comment for
>> +		 * MSR_IA32_SPEC_CTRL for details on the nested interaction.
>> 		 */
>> 		vmx_disable_intercept_for_msr(vcpu, MSR_IA32_PRED_CMD, MSR_TYPE_W);
>> 		break;
>> @@ -6887,19 +6891,9 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
>> 	vmx_vcpu_enter_exit(vcpu, vmx);
>> 
>> 	/*
>> -	 * We do not use IBRS in the kernel. If this vCPU has used the
>> -	 * SPEC_CTRL MSR it may have left it on; save the value and
>> -	 * turn it off. This is much more efficient than blindly adding
>> -	 * it to the atomic save/restore list. Especially as the former
>> -	 * (Saving guest MSRs on vmexit) doesn't even exist in KVM.
>> -	 *
>> -	 * For non-nested case:
>> -	 * If the L01 MSR bitmap does not intercept the MSR, then we need to
>> -	 * save it.
>> -	 *
>> -	 * For nested case:
>> -	 * If the L02 MSR bitmap does not intercept the MSR, then we need to
>> -	 * save it.
>> +	 * Save SPEC_CTRL if it may have been written by the guest, the current
>> +	 * value in hardware is used by x86_spec_ctrl_restore_host() to avoid
>> +	 * WRMSR if the current value matches the host's desired value.
>> 	 */
>> 	if (unlikely(!msr_write_intercepted(vmx, MSR_IA32_SPEC_CTRL)))
>> 		vmx->spec_ctrl = native_read_msr(MSR_IA32_SPEC_CTRL);
>> 
>> base-commit: 69b59889b0147aa80098936e383b06fec30cdf5c
>> --
> 



* Re: [PATCH] KVM: VMX: do not disable interception for MSR_IA32_SPEC_CTRL on eIBRS
From: Sean Christopherson @ 2022-05-20 20:06 UTC
  To: Jon Kohler
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Andrea Arcangeli,
	Josh Poimboeuf, Kees Cook, Waiman Long, kvm, linux-kernel

On Fri, May 20, 2022, Jon Kohler wrote:
> 
> > On May 18, 2022, at 10:23 AM, Jon Kohler <jon@nutanix.com> wrote:
> > 
> >> On May 17, 2022, at 9:42 PM, Sean Christopherson <seanjc@google.com> wrote:
> >>> +		if (boot_cpu_has(X86_FEATURE_IBRS_ENHANCED) && data == BIT(0)) {
> >> 
> >> Use SPEC_CTRL_IBRS instead of open coding "BIT(0)", then a chunk of the comment
> >> goes away.
> >> 
> >>> +			vmx->spec_ctrl = data;
> >>> +			break;
> >>> +		}
> >> 
> >> There's no need for a separate if statement.  And the boot_cpu_has() check can
> >> be dropped, kvm_spec_ctrl_test_value() has already verified the bit is writable
> >> (unless you're worried about bit 0 being used for something else?)
> 
> I was (and am) worried about misbehaving guests on pre-eIBRS systems spamming IBRS
> MSR, which we wouldn’t be able to see today. Intel’s guidance for eIBRS has long been to
> set it once and be done with it, so any eIBRS-aware guest should behave nicely with that.
> That limits the blast radius a bit here.

Then check the guest capabilities, not the host flag.

	if (data == SPEC_CTRL_IBRS &&
	    (vcpu->arch.arch_capabilities & ARCH_CAP_IBRS_ALL))
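
Putting the two suggestions together, the whole case would read roughly as
follows (a sketch, not the committed code):

	case MSR_IA32_SPEC_CTRL:
		if (kvm_spec_ctrl_test_value(data))
			return 1;

		vmx->spec_ctrl = data;

		/*
		 * Keep intercepting zero writes, and the one-time IBRS write
		 * of a guest that was actually told it has eIBRS.
		 */
		if (!data ||
		    (data == SPEC_CTRL_IBRS &&
		     (vcpu->arch.arch_capabilities & ARCH_CAP_IBRS_ALL)))
			break;

		vmx_disable_intercept_for_msr(vcpu, MSR_IA32_SPEC_CTRL,
					      MSR_TYPE_RW);
		break;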

> Sent out the v2 just now with a few minor tweaks; the only notable one was keeping
> the boot cpu check, plus small tweaks to comments here and there to suit.

In the future, give reviewers a bit of time to respond to a contested point before
sending out the next revision, e.g. you could have avoided v3 :-)


* Re: [PATCH] KVM: VMX: do not disable interception for MSR_IA32_SPEC_CTRL on eIBRS
From: Jon Kohler @ 2022-05-20 20:14 UTC
  To: Sean Christopherson
  Cc: Jon Kohler, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Andrea Arcangeli, Josh Poimboeuf, Kees Cook, Waiman Long, kvm,
	linux-kernel



> On May 20, 2022, at 4:06 PM, Sean Christopherson <seanjc@google.com> wrote:
> 
> On Fri, May 20, 2022, Jon Kohler wrote:
>> 
>>> On May 18, 2022, at 10:23 AM, Jon Kohler <jon@nutanix.com> wrote:
>>> 
>>>> On May 17, 2022, at 9:42 PM, Sean Christopherson <seanjc@google.com> wrote:
>>>>> +		if (boot_cpu_has(X86_FEATURE_IBRS_ENHANCED) && data == BIT(0)) {
>>>> 
>>>> Use SPEC_CTRL_IBRS instead of open coding "BIT(0)", then a chunk of the comment
>>>> goes away.
>>>> 
>>>>> +			vmx->spec_ctrl = data;
>>>>> +			break;
>>>>> +		}
>>>> 
>>>> There's no need for a separate if statement.  And the boot_cpu_has() check can
>>>> be dropped, kvm_spec_ctrl_test_value() has already verified the bit is writable
>>>> (unless you're worried about bit 0 being used for something else?)
>> 
>> I was (and am) worried about misbehaving guests on pre-eIBRS systems spamming IBRS
>> MSR, which we wouldn’t be able to see today. Intel’s guidance for eIBRS has long been
>> set it once and be done with it, so any eIBRS aware guest should behave nicely with that.
>> That limits the blast radius a bit here.
> 
> Then check the guest capabilities, not the host flag.
> 
> 	if (data == SPEC_CTRL_IBRS &&
> 	    (vcpu->arch.arch_capabilities & ARCH_CAP_IBRS_ALL))

So I originally did that in my first internal patch; however, the code you wrote is
effectively the code I wrote, because cpu_set_bug_bits() already does that exact
same thing when it sets up X86_FEATURE_IBRS_ENHANCED. 

Is the boot cpu check more expensive than checking the vCPU perhaps? Otherwise,
checking X86_FEATURE_IBRS_ENHANCED seemed like it might be easier to
understand for future onlookers, as that's what the rest of the kernel keys off of
when checking for eIBRS (e.g. in bugs.c etc). 

>> Sent out the v2 just now with a few minor tweaks, only notable one was keeping
>> the boot cpu check and small tweaks to comments here and there to suit.
> 
> In the future, give reviewers a bit of time to respond to a contested point before
> sending out the next revision, e.g. you could have avoided v3 :-)

Thanks for the feedback/coaching. Still getting my legs under me for LKML; I
appreciate the kindness, thank you!



* Re: [PATCH] KVM: VMX: do not disable interception for MSR_IA32_SPEC_CTRL on eIBRS
From: Sean Christopherson @ 2022-05-20 20:30 UTC
  To: Jon Kohler
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Andrea Arcangeli,
	Josh Poimboeuf, Kees Cook, Waiman Long, kvm, linux-kernel

On Fri, May 20, 2022, Jon Kohler wrote:
> 
> 
> > On May 20, 2022, at 4:06 PM, Sean Christopherson <seanjc@google.com> wrote:
> > 
> > On Fri, May 20, 2022, Jon Kohler wrote:
> >> 
> >>> On May 18, 2022, at 10:23 AM, Jon Kohler <jon@nutanix.com> wrote:
> >>> 
> >>>> On May 17, 2022, at 9:42 PM, Sean Christopherson <seanjc@google.com> wrote:
> >>>>> +		if (boot_cpu_has(X86_FEATURE_IBRS_ENHANCED) && data == BIT(0)) {
> >>>> 
> >>>> Use SPEC_CTRL_IBRS instead of open coding "BIT(0)", then a chunk of the comment
> >>>> goes away.
> >>>> 
> >>>>> +			vmx->spec_ctrl = data;
> >>>>> +			break;
> >>>>> +		}
> >>>> 
> >>>> There's no need for a separate if statement.  And the boot_cpu_has() check can
> >>>> be dropped, kvm_spec_ctrl_test_value() has already verified the bit is writable
> >>>> (unless you're worried about bit 0 being used for something else?)
> >> 
> >> I was (and am) worried about misbehaving guests on pre-eIBRS systems spamming IBRS
> >> MSR, which we wouldn’t be able to see today. Intel’s guidance for eIBRS has long been to
> >> set it once and be done with it, so any eIBRS-aware guest should behave nicely with that.
> >> That limits the blast radius a bit here.
> > 
> > Then check the guest capabilities, not the host flag.
> > 
> > 	if (data == SPEC_CTRL_IBRS &&
> > 	    (vcpu->arch.arch_capabilities & ARCH_CAP_IBRS_ALL))
> 
> So I originally did that in my first internal patch; however, the code you wrote is
> effectively the code I wrote, because cpu_set_bug_bits() already does that exact
> same thing when it sets up X86_FEATURE_IBRS_ENHANCED. 
> 
> Is the boot cpu check more expensive than checking the vCPU perhaps? Otherwise,
> checking X86_FEATURE_IBRS_ENHANCED seemed like it might be easier to
> understand for future onlookers, as that's what the rest of the kernel keys off of
> when checking for eIBRS (e.g. in bugs.c etc). 

Cost is irrelevant; checking X86_FEATURE_IBRS_ENHANCED is simply wrong.  Just
because eIBRS is supported in the host doesn't mean it's advertised to the guest,
e.g. an older VM could have been created without eIBRS and then migrated to a host
that does support eIBRS.  Now you have a guest that thinks it needs to constantly
toggle IBRS (I assume that's the pre-eIBRS behavior), but by looking at the _host_
value KVM would assume it's a one-time write and not disable interception.
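
Concretely, the case that breaks with the host-flag check (an illustrative
sketch using the same fields):

	/* Host supports eIBRS, but this VM was created without it: */
	if (boot_cpu_has(X86_FEATURE_IBRS_ENHANCED) &&		/* host view: true */
	    !(vcpu->arch.arch_capabilities & ARCH_CAP_IBRS_ALL)) {	/* guest view: false */
		/*
		 * This guest toggles plain IBRS around privilege transitions,
		 * so treating data == SPEC_CTRL_IBRS as a one-time write would
		 * keep interception and trap on every single toggle.
		 */
	}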


* Re: [PATCH] KVM: VMX: do not disable interception for MSR_IA32_SPEC_CTRL on eIBRS
From: Jon Kohler @ 2022-05-20 20:33 UTC
  To: Sean Christopherson
  Cc: Jon Kohler, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Andrea Arcangeli, Josh Poimboeuf, Kees Cook, Waiman Long, kvm,
	linux-kernel



> On May 20, 2022, at 4:30 PM, Sean Christopherson <seanjc@google.com> wrote:
> 
> On Fri, May 20, 2022, Jon Kohler wrote:
>> 
>> 
>>> On May 20, 2022, at 4:06 PM, Sean Christopherson <seanjc@google.com> wrote:
>>> 
>>> On Fri, May 20, 2022, Jon Kohler wrote:
>>>> 
>>>>> On May 18, 2022, at 10:23 AM, Jon Kohler <jon@nutanix.com> wrote:
>>>>> 
>>>>>> On May 17, 2022, at 9:42 PM, Sean Christopherson <seanjc@google.com> wrote:
>>>>>>> +		if (boot_cpu_has(X86_FEATURE_IBRS_ENHANCED) && data == BIT(0)) {
>>>>>> 
>>>>>> Use SPEC_CTRL_IBRS instead of open coding "BIT(0)", then a chunk of the comment
>>>>>> goes away.
>>>>>> 
>>>>>>> +			vmx->spec_ctrl = data;
>>>>>>> +			break;
>>>>>>> +		}
>>>>>> 
>>>>>> There's no need for a separate if statement.  And the boot_cpu_has() check can
>>>>>> be dropped, kvm_spec_ctrl_test_value() has already verified the bit is writable
>>>>>> (unless you're worried about bit 0 being used for something else?)
>>>> 
>>>> I was (and am) worried about misbehaving guests on pre-eIBRS systems spamming IBRS
>>>> MSR, which we wouldn’t be able to see today. Intel’s guidance for eIBRS has long been to
>>>> set it once and be done with it, so any eIBRS-aware guest should behave nicely with that.
>>>> That limits the blast radius a bit here.
>>> 
>>> Then check the guest capabilities, not the host flag.
>>> 
>>> 	if (data == SPEC_CTRL_IBRS &&
>>> 	    (vcpu->arch.arch_capabilities & ARCH_CAP_IBRS_ALL))
>> 
>> So I originally did that in my first internal patch; however, the code you wrote is
>> effectively the code I wrote, because cpu_set_bug_bits() already does that exact
>> same thing when it sets up X86_FEATURE_IBRS_ENHANCED. 
>> 
>> Is the boot cpu check more expensive than checking the vCPU perhaps? Otherwise,
>> checking X86_FEATURE_IBRS_ENHANCED seemed like it might be easier to
>> understand for future onlookers, as that's what the rest of the kernel keys off of
>> when checking for eIBRS (e.g. in bugs.c etc). 
> 
> Cost is irrelevant; checking X86_FEATURE_IBRS_ENHANCED is simply wrong.  Just
> because eIBRS is supported in the host doesn't mean it's advertised to the guest,
> e.g. an older VM could have been created without eIBRS and then migrated to a host
> that does support eIBRS.  Now you have a guest that thinks it needs to constantly
> toggle IBRS (I assume that's the pre-eIBRS behavior), but by looking at the _host_
> value KVM would assume it's a one-time write and not disable interception.

Ahhhhhh, gotcha, ok I understand the nuance here. Off to v3 I go. 

Thanks again!


