* [PATCH 0/3] Nested TSC handling
@ 2011-08-02 12:53 Nadav Har'El
  2011-08-02 12:54 ` [PATCH 1/3] L1 " Nadav Har'El
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: Nadav Har'El @ 2011-08-02 12:53 UTC (permalink / raw)
  To: kvm; +Cc: Roedel, Joerg, Zachary Amsden, Bandan Das, Marcelo Tosatti, avi

The following are patches I propose for fixing the bug discovered by Bandan
Das and discussed in the "Nested VMX - L1 hangs on running L2" thread.

The first patch should fix the originally-reported bug, as explained in the
aforementioned thread: a new x86_op, read_l1_tsc(), is called when L1's TSC
is needed, instead of assuming that kvm_get_msr() will return it (which it
cannot, because it has to return L2's TSC when a nested guest is running).

The second and third patches fix relatively unimportant corner cases in
nested VMX and nested SVM TSC handling.

I'd appreciate it if the people who noticed this bug can verify that these
patches indeed solve it for them.

Patch statistics:
-----------------

 arch/x86/include/asm/kvm_host.h |    2 +
 arch/x86/kvm/svm.c              |   13 ++++++--
 arch/x86/kvm/vmx.c              |   48 +++++++++++++++++++++++-------
 arch/x86/kvm/x86.c              |    8 ++---
 4 files changed, 54 insertions(+), 17 deletions(-)

--
Nadav Har'El
IBM Haifa Research Lab

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/3] L1 TSC handling
  2011-08-02 12:53 [PATCH 0/3] Nested TSC handling Nadav Har'El
@ 2011-08-02 12:54 ` Nadav Har'El
  2011-08-03  7:54   ` Zachary Amsden
  2011-08-02 12:54 ` [PATCH 2/3] Fix nested VMX TSC emulation Nadav Har'El
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: Nadav Har'El @ 2011-08-02 12:54 UTC (permalink / raw)
  To: kvm; +Cc: Roedel, Joerg, Zachary Amsden, Bandan Das, Marcelo Tosatti, avi

KVM assumed in several places that reading the TSC MSR returns the value for
L1. This is incorrect, because when L2 is running, the correct emulation of a
TSC read exit is to return L2's value.

We therefore add a new x86_ops function, read_l1_tsc, to use in places that
specifically need to read the L1 TSC, NOT the TSC of the current level of
guest.

Note that one change, of one line in kvm_arch_vcpu_load(), is made redundant
by a different patch sent by Zachary Amsden (and not yet applied):
kvm_arch_vcpu_load() should not read the guest TSC at all, and if it didn't,
we would of course not have needed to change its kvm_get_msr() call to
read_l1_tsc().

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/include/asm/kvm_host.h |    2 ++
 arch/x86/kvm/svm.c              |    9 +++++++++
 arch/x86/kvm/vmx.c              |   17 +++++++++++++++++
 arch/x86/kvm/x86.c              |    8 ++++----
 4 files changed, 32 insertions(+), 4 deletions(-)

--- .before/arch/x86/include/asm/kvm_host.h	2011-08-02 15:51:02.000000000 +0300
+++ .after/arch/x86/include/asm/kvm_host.h	2011-08-02 15:51:02.000000000 +0300
@@ -636,6 +636,8 @@ struct kvm_x86_ops {
 			       struct x86_instruction_info *info,
 			       enum x86_intercept_stage stage);
 
+	u64 (*read_l1_tsc)(struct kvm_vcpu *vcpu);
+
 	const struct trace_print_flags *exit_reasons_str;
 };
 
--- .before/arch/x86/kvm/vmx.c	2011-08-02 15:51:02.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-08-02 15:51:02.000000000 +0300
@@ -1748,6 +1748,21 @@ static u64 guest_read_tsc(void)
 }
 
 /*
+ * Like guest_read_tsc, but always returns L1's notion of the timestamp
+ * counter, even if a nested guest (L2) is currently running.
+ */
+u64 vmx_read_l1_tsc(struct kvm_vcpu *vcpu)
+{
+	u64 host_tsc, tsc_offset;
+
+	rdtscll(host_tsc);
+	tsc_offset = is_guest_mode(vcpu) ?
+		to_vmx(vcpu)->nested.vmcs01_tsc_offset :
+		vmcs_read64(TSC_OFFSET);
+	return host_tsc + tsc_offset;
+}
+
+/*
  * Empty call-back. Needs to be implemented when VMX enables the SET_TSC_KHZ
  * ioctl. In this case the call-back should update internal vmx state to make
  * the changes effective.
@@ -7059,6 +7074,8 @@ static struct kvm_x86_ops vmx_x86_ops = 
 	.set_tdp_cr3 = vmx_set_cr3,
 
 	.check_intercept = vmx_check_intercept,
+
+	.read_l1_tsc = vmx_read_l1_tsc,
 };
 
 static int __init vmx_init(void)
--- .before/arch/x86/kvm/svm.c	2011-08-02 15:51:02.000000000 +0300
+++ .after/arch/x86/kvm/svm.c	2011-08-02 15:51:02.000000000 +0300
@@ -2894,6 +2894,13 @@ static int cr8_write_interception(struct
 	return 0;
 }
 
+u64 svm_read_l1_tsc(struct kvm_vcpu *vcpu)
+{
+	struct vmcb *vmcb = get_host_vmcb(to_svm(vcpu));
+	return vmcb->control.tsc_offset +
+		svm_scale_tsc(vcpu, native_read_tsc());
+}
+
 static int svm_get_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 *data)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -4243,6 +4250,8 @@ static struct kvm_x86_ops svm_x86_ops = 
 	.set_tdp_cr3 = set_tdp_cr3,
 
 	.check_intercept = svm_check_intercept,
+
+	.read_l1_tsc = svm_read_l1_tsc,
 };
 
 static int __init svm_init(void)
--- .before/arch/x86/kvm/x86.c	2011-08-02 15:51:02.000000000 +0300
+++ .after/arch/x86/kvm/x86.c	2011-08-02 15:51:02.000000000 +0300
@@ -1098,7 +1098,7 @@ static int kvm_guest_time_update(struct 
 
 	/* Keep irq disabled to prevent changes to the clock */
 	local_irq_save(flags);
-	kvm_get_msr(v, MSR_IA32_TSC, &tsc_timestamp);
+	tsc_timestamp = kvm_x86_ops->read_l1_tsc(v);
 	kernel_ns = get_kernel_ns();
 	this_tsc_khz = vcpu_tsc_khz(v);
 	if (unlikely(this_tsc_khz == 0)) {
@@ -2215,7 +2215,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu 
 		s64 tsc_delta;
 		u64 tsc;
 
-		kvm_get_msr(vcpu, MSR_IA32_TSC, &tsc);
+		tsc = kvm_x86_ops->read_l1_tsc(vcpu);
 		tsc_delta = !vcpu->arch.last_guest_tsc ? 0 :
 			     tsc - vcpu->arch.last_guest_tsc;
 
@@ -2239,7 +2239,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *
 {
 	kvm_x86_ops->vcpu_put(vcpu);
 	kvm_put_guest_fpu(vcpu);
-	kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
+	vcpu->arch.last_guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu);
 }
 
 static int is_efer_nx(void)
@@ -5722,7 +5722,7 @@ static int vcpu_enter_guest(struct kvm_v
 	if (hw_breakpoint_active())
 		hw_breakpoint_restore();
 
-	kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
+	vcpu->arch.last_guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu);
 
 	vcpu->mode = OUTSIDE_GUEST_MODE;
 	smp_wmb();


* [PATCH 2/3] Fix nested VMX TSC emulation
  2011-08-02 12:53 [PATCH 0/3] Nested TSC handling Nadav Har'El
  2011-08-02 12:54 ` [PATCH 1/3] L1 " Nadav Har'El
@ 2011-08-02 12:54 ` Nadav Har'El
  2011-08-03  8:19   ` Zachary Amsden
  2011-08-02 12:55 ` [PATCH 3/3] Fix TSC MSR read in nested SVM Nadav Har'El
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: Nadav Har'El @ 2011-08-02 12:54 UTC (permalink / raw)
  To: kvm; +Cc: Roedel, Joerg, Zachary Amsden, Bandan Das, Marcelo Tosatti, avi

This patch fixes two corner cases in nested (L2) handling of TSC-related
issues:

1. Somewhat surprisingly, according to the Intel spec, if L1 allows WRMSR to
the TSC MSR without an exit, then that WRMSR should set L1's TSC value itself -
not the value offset by vmcs12.TSC_OFFSET (as the previous code wrongly did).

2. Allow L1 to disable the CPU_BASED_USE_TSC_OFFSETING control, and in that
case correctly ignore vmcs12.TSC_OFFSET.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/vmx.c |   31 +++++++++++++++++++++----------
 1 file changed, 21 insertions(+), 10 deletions(-)

--- .before/arch/x86/kvm/vmx.c	2011-08-02 15:51:02.000000000 +0300
+++ .after/arch/x86/kvm/vmx.c	2011-08-02 15:51:02.000000000 +0300
@@ -1777,15 +1777,23 @@ static void vmx_set_tsc_khz(struct kvm_v
  */
 static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
 {
-	vmcs_write64(TSC_OFFSET, offset);
-	if (is_guest_mode(vcpu))
+	if (is_guest_mode(vcpu)) {
 		/*
-		 * We're here if L1 chose not to trap the TSC MSR. Since
-		 * prepare_vmcs12() does not copy tsc_offset, we need to also
-		 * set the vmcs12 field here.
+		 * We're here if L1 chose not to trap WRMSR to TSC. According
+		 * to the spec, this should set L1's TSC; The offset that L1
+		 * set for L2 remains unchanged, and still needs to be added
+		 * to the newly set TSC to get L2's TSC.
 		 */
-		get_vmcs12(vcpu)->tsc_offset = offset -
-			to_vmx(vcpu)->nested.vmcs01_tsc_offset;
+		struct vmcs12 *vmcs12;
+		to_vmx(vcpu)->nested.vmcs01_tsc_offset = offset;
+		/* recalculate vmcs02.TSC_OFFSET: */
+		vmcs12 = get_vmcs12(vcpu);
+		vmcs_write64(TSC_OFFSET, offset +
+			(nested_cpu_has(vmcs12, CPU_BASED_USE_TSC_OFFSETING) ?
+			 vmcs12->tsc_offset : 0));
+	} else {
+		vmcs_write64(TSC_OFFSET, offset);
+	}
 }
 
 static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
@@ -6529,8 +6537,11 @@ static void prepare_vmcs02(struct kvm_vc
 
 	set_cr4_guest_host_mask(vmx);
 
-	vmcs_write64(TSC_OFFSET,
-		vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);
+	if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETING)
+		vmcs_write64(TSC_OFFSET,
+			vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);
+	else
+		vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset);
 
 	if (enable_vpid) {
 		/*
@@ -6937,7 +6948,7 @@ static void nested_vmx_vmexit(struct kvm
 
 	load_vmcs12_host_state(vcpu, vmcs12);
 
-	/* Update TSC_OFFSET if vmx_adjust_tsc_offset() was used while L2 ran */
+	/* Update TSC_OFFSET if TSC was changed while L2 ran */
 	vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset);
 
 	/* This is needed for same reason as it was needed in prepare_vmcs02 */


* [PATCH 3/3] Fix TSC MSR read in nested SVM
  2011-08-02 12:53 [PATCH 0/3] Nested TSC handling Nadav Har'El
  2011-08-02 12:54 ` [PATCH 1/3] L1 " Nadav Har'El
  2011-08-02 12:54 ` [PATCH 2/3] Fix nested VMX TSC emulation Nadav Har'El
@ 2011-08-02 12:55 ` Nadav Har'El
  2011-08-03  8:34   ` Zachary Amsden
  2011-08-02 19:24 ` [PATCH 0/3] Nested TSC handling Bandan Das
  2011-08-10 15:40 ` Avi Kivity
  4 siblings, 1 reply; 18+ messages in thread
From: Nadav Har'El @ 2011-08-02 12:55 UTC (permalink / raw)
  To: kvm; +Cc: Roedel, Joerg, Zachary Amsden, Bandan Das, Marcelo Tosatti, avi

When the TSC MSR is read by an L2 guest (when L1 allowed this MSR to be
read without exit), we need to return L2's notion of the TSC, not L1's.

The current code incorrectly returned L1's TSC, because svm_get_msr() was
also used in x86.c, where that behavior was assumed; now that those places
call the new svm_read_l1_tsc(), the MSR read can be fixed.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
---
 arch/x86/kvm/svm.c |    4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

--- .before/arch/x86/kvm/svm.c	2011-08-02 15:51:02.000000000 +0300
+++ .after/arch/x86/kvm/svm.c	2011-08-02 15:51:02.000000000 +0300
@@ -2907,9 +2907,7 @@ static int svm_get_msr(struct kvm_vcpu *
 
 	switch (ecx) {
 	case MSR_IA32_TSC: {
-		struct vmcb *vmcb = get_host_vmcb(svm);
-
-		*data = vmcb->control.tsc_offset +
+		*data = svm->vmcb->control.tsc_offset +
 			svm_scale_tsc(vcpu, native_read_tsc());
 
 		break;


* Re: [PATCH 0/3] Nested TSC handling
  2011-08-02 12:53 [PATCH 0/3] Nested TSC handling Nadav Har'El
                   ` (2 preceding siblings ...)
  2011-08-02 12:55 ` [PATCH 3/3] Fix TSC MSR read in nested SVM Nadav Har'El
@ 2011-08-02 19:24 ` Bandan Das
  2011-08-02 19:54   ` Matt McGill
  2011-08-10 15:40 ` Avi Kivity
  4 siblings, 1 reply; 18+ messages in thread
From: Bandan Das @ 2011-08-02 19:24 UTC (permalink / raw)
  To: Nadav Har'El
  Cc: kvm, Roedel, Joerg, Zachary Amsden, Bandan Das, Marcelo Tosatti,
	avi, jan.kiszka, matt.mcgill

On  0, Nadav Har'El <nyh@il.ibm.com> wrote:
> The following are patches I propose for fixing the bug discovered by Bandan
> Das and discussed in the "Nested VMX - L1 hangs on running L2" thread.
> 
> The first patch should fix the originally-reported bug, as explained in the
> aforementioned thread: a new x86_op, read_l1_tsc(), is called when L1's TSC
> is needed, instead of assuming that kvm_get_msr() will return it (which it
> cannot, because it has to return L2's TSC when a nested guest is running).
> 
> The second and third patches fix relatively-unimportant corner cases in
> nested VMX and nested SVM TSC handling.
> 
> I'd appreciate it if the people who noticed this bug can verify that these
> patches indeed solve it for them.
I can verify that these patches work for me. Thanks Nadav and others for looking 
into this!

I have cc'ed Jan and Matt in case they would be interested in trying out 
these patches.


> Patch statistics:
> -----------------
> 
>  arch/x86/include/asm/kvm_host.h |    2 +
>  arch/x86/kvm/svm.c              |   13 ++++++--
>  arch/x86/kvm/vmx.c              |   48 +++++++++++++++++++++++-------
>  arch/x86/kvm/x86.c              |    8 ++---
>  4 files changed, 54 insertions(+), 17 deletions(-)
> 
> --
> Nadav Har'El
> IBM Haifa Research Lab



* Re: [PATCH 0/3] Nested TSC handling
  2011-08-02 19:24 ` [PATCH 0/3] Nested TSC handling Bandan Das
@ 2011-08-02 19:54   ` Matt McGill
  0 siblings, 0 replies; 18+ messages in thread
From: Matt McGill @ 2011-08-02 19:54 UTC (permalink / raw)
  To: Bandan Das
  Cc: Nadav Har'El, kvm, Roedel, Joerg, Zachary Amsden,
	Marcelo Tosatti, avi, jan.kiszka

Thanks Bandan and Nadav,

I can confirm that these patches clear up the nVMX hang bug on my end. I don't have any AMD hardware on-hand to test the SVM change, unfortunately.

Thanks for all the work on this!

-Matt McGill

On Aug 2, 2011, at 3:24 PM, Bandan Das wrote:

> On  0, Nadav Har'El <nyh@il.ibm.com> wrote:
>> The following are patches I propose for fixing the bug discovered by Bandan
>> Das and discussed in the "Nested VMX - L1 hangs on running L2" thread.
>> 
>> The first patch should fix the originally-reported bug, as explained in the
>> aforementioned thread: a new x86_op, read_l1_tsc(), is called when L1's TSC
>> is needed, instead of assuming that kvm_get_msr() will return it (which it
>> cannot, because it has to return L2's TSC when a nested guest is running).
>> 
>> The second and third patches fix relatively-unimportant corner cases in
>> nested VMX and nested SVM TSC handling.
>> 
>> I'd appreciate it if the people who noticed this bug can verify that these
>> patches indeed solve it for them.
> I can verify that these patches work for me. Thanks Nadav and others for looking 
> into this!
> 
> I have cc'ed Jan and Matt in case they would be interested in trying out 
> these patches.
> 
> 
>> Patch statistics:
>> -----------------
>> 
>> arch/x86/include/asm/kvm_host.h |    2 +
>> arch/x86/kvm/svm.c              |   13 ++++++--
>> arch/x86/kvm/vmx.c              |   48 +++++++++++++++++++++++-------
>> arch/x86/kvm/x86.c              |    8 ++---
>> 4 files changed, 54 insertions(+), 17 deletions(-)
>> 
>> --
>> Nadav Har'El
>> IBM Haifa Research Lab
> 



* Re: [PATCH 1/3] L1 TSC handling
  2011-08-02 12:54 ` [PATCH 1/3] L1 " Nadav Har'El
@ 2011-08-03  7:54   ` Zachary Amsden
  0 siblings, 0 replies; 18+ messages in thread
From: Zachary Amsden @ 2011-08-03  7:54 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, Roedel, Joerg, Bandan Das, Marcelo Tosatti, avi

ACK from me, but please check with Joerg on the SVM change.  I believe
the scaling is done properly; however, it won't hurt to have a second
opinion.

Zach

On Tue, Aug 2, 2011 at 5:54 AM, Nadav Har'El <nyh@il.ibm.com> wrote:
> KVM assumed in several places that reading the TSC MSR returns the value for
> L1. This is incorrect, because when L2 is running, the correct TSC read exit
> emulation is to return L2's value.
>
> We therefore add a new x86_ops function, read_l1_tsc, to use in places that
> specifically need to read the L1 TSC, NOT the TSC of the current level of
> guest.
>
> Note that one change, of one line in kvm_arch_vcpu_load, is made redundant
> by a different patch sent by Zachary Amsden (and not yet applied):
> kvm_arch_vcpu_load() should not read the guest TSC, and if it didn't, of
> course we didn't have to change the call of kvm_get_msr() to read_l1_tsc().
>
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
>  arch/x86/include/asm/kvm_host.h |    2 ++
>  arch/x86/kvm/svm.c              |    9 +++++++++
>  arch/x86/kvm/vmx.c              |   17 +++++++++++++++++
>  arch/x86/kvm/x86.c              |    8 ++++----
>  4 files changed, 32 insertions(+), 4 deletions(-)
>
> --- .before/arch/x86/include/asm/kvm_host.h     2011-08-02 15:51:02.000000000 +0300
> +++ .after/arch/x86/include/asm/kvm_host.h      2011-08-02 15:51:02.000000000 +0300
> @@ -636,6 +636,8 @@ struct kvm_x86_ops {
>                               struct x86_instruction_info *info,
>                               enum x86_intercept_stage stage);
>
> +       u64 (*read_l1_tsc)(struct kvm_vcpu *vcpu);
> +
>        const struct trace_print_flags *exit_reasons_str;
>  };
>
> --- .before/arch/x86/kvm/vmx.c  2011-08-02 15:51:02.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c   2011-08-02 15:51:02.000000000 +0300
> @@ -1748,6 +1748,21 @@ static u64 guest_read_tsc(void)
>  }
>
>  /*
> + * Like guest_read_tsc, but always returns L1's notion of the timestamp
> + * counter, even if a nested guest (L2) is currently running.
> + */
> +u64 vmx_read_l1_tsc(struct kvm_vcpu *vcpu)
> +{
> +       u64 host_tsc, tsc_offset;
> +
> +       rdtscll(host_tsc);
> +       tsc_offset = is_guest_mode(vcpu) ?
> +               to_vmx(vcpu)->nested.vmcs01_tsc_offset :
> +               vmcs_read64(TSC_OFFSET);
> +       return host_tsc + tsc_offset;
> +}
> +
> +/*
>  * Empty call-back. Needs to be implemented when VMX enables the SET_TSC_KHZ
>  * ioctl. In this case the call-back should update internal vmx state to make
>  * the changes effective.
> @@ -7059,6 +7074,8 @@ static struct kvm_x86_ops vmx_x86_ops =
>        .set_tdp_cr3 = vmx_set_cr3,
>
>        .check_intercept = vmx_check_intercept,
> +
> +       .read_l1_tsc = vmx_read_l1_tsc,
>  };
>
>  static int __init vmx_init(void)
> --- .before/arch/x86/kvm/svm.c  2011-08-02 15:51:02.000000000 +0300
> +++ .after/arch/x86/kvm/svm.c   2011-08-02 15:51:02.000000000 +0300
> @@ -2894,6 +2894,13 @@ static int cr8_write_interception(struct
>        return 0;
>  }
>
> +u64 svm_read_l1_tsc(struct kvm_vcpu *vcpu)
> +{
> +       struct vmcb *vmcb = get_host_vmcb(to_svm(vcpu));
> +       return vmcb->control.tsc_offset +
> +               svm_scale_tsc(vcpu, native_read_tsc());
> +}
> +
>  static int svm_get_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 *data)
>  {
>        struct vcpu_svm *svm = to_svm(vcpu);
> @@ -4243,6 +4250,8 @@ static struct kvm_x86_ops svm_x86_ops =
>        .set_tdp_cr3 = set_tdp_cr3,
>
>        .check_intercept = svm_check_intercept,
> +
> +       .read_l1_tsc = svm_read_l1_tsc,
>  };
>
>  static int __init svm_init(void)
> --- .before/arch/x86/kvm/x86.c  2011-08-02 15:51:02.000000000 +0300
> +++ .after/arch/x86/kvm/x86.c   2011-08-02 15:51:02.000000000 +0300
> @@ -1098,7 +1098,7 @@ static int kvm_guest_time_update(struct
>
>        /* Keep irq disabled to prevent changes to the clock */
>        local_irq_save(flags);
> -       kvm_get_msr(v, MSR_IA32_TSC, &tsc_timestamp);
> +       tsc_timestamp = kvm_x86_ops->read_l1_tsc(v);
>        kernel_ns = get_kernel_ns();
>        this_tsc_khz = vcpu_tsc_khz(v);
>        if (unlikely(this_tsc_khz == 0)) {
> @@ -2215,7 +2215,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu
>                s64 tsc_delta;
>                u64 tsc;
>
> -               kvm_get_msr(vcpu, MSR_IA32_TSC, &tsc);
> +               tsc = kvm_x86_ops->read_l1_tsc(vcpu);
>                tsc_delta = !vcpu->arch.last_guest_tsc ? 0 :
>                             tsc - vcpu->arch.last_guest_tsc;
>
> @@ -2239,7 +2239,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *
>  {
>        kvm_x86_ops->vcpu_put(vcpu);
>        kvm_put_guest_fpu(vcpu);
> -       kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
> +       vcpu->arch.last_guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu);
>  }
>
>  static int is_efer_nx(void)
> @@ -5722,7 +5722,7 @@ static int vcpu_enter_guest(struct kvm_v
>        if (hw_breakpoint_active())
>                hw_breakpoint_restore();
>
> -       kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
> +       vcpu->arch.last_guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu);
>
>        vcpu->mode = OUTSIDE_GUEST_MODE;
>        smp_wmb();
>


* Re: [PATCH 2/3] Fix nested VMX TSC emulation
  2011-08-02 12:54 ` [PATCH 2/3] Fix nested VMX TSC emulation Nadav Har'El
@ 2011-08-03  8:19   ` Zachary Amsden
  2011-08-03  8:35     ` Nadav Har'El
  0 siblings, 1 reply; 18+ messages in thread
From: Zachary Amsden @ 2011-08-03  8:19 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, Roedel, Joerg, Bandan Das, Marcelo Tosatti, avi

Comments inline.  Sorry for top-posting.  Gmail is not my normal mode
of LKML processing, but hey.

On Tue, Aug 2, 2011 at 5:54 AM, Nadav Har'El <nyh@il.ibm.com> wrote:
> This patch fixes two corner cases in nested (L2) handling of TSC-related
> issues:
>
> 1. Somewhat surprisingly, according to the Intel spec, if L1 allows WRMSR to
> the TSC MSR without an exit, then this should set L1's TSC value itself - not
> offset by vmcs12.TSC_OFFSET (like was wrongly done in the previous code).
>
> 2. Allow L1 to disable the TSC_OFFSETING control, and then correctly ignore
> the vmcs12.TSC_OFFSET.
>
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
>  arch/x86/kvm/vmx.c |   31 +++++++++++++++++++++----------
>  1 file changed, 21 insertions(+), 10 deletions(-)
>
> --- .before/arch/x86/kvm/vmx.c  2011-08-02 15:51:02.000000000 +0300
> +++ .after/arch/x86/kvm/vmx.c   2011-08-02 15:51:02.000000000 +0300
> @@ -1777,15 +1777,23 @@ static void vmx_set_tsc_khz(struct kvm_v
>  */
>  static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
>  {
> -       vmcs_write64(TSC_OFFSET, offset);
> -       if (is_guest_mode(vcpu))
> +       if (is_guest_mode(vcpu)) {
>                /*
> -                * We're here if L1 chose not to trap the TSC MSR. Since
> -                * prepare_vmcs12() does not copy tsc_offset, we need to also
> -                * set the vmcs12 field here.
> +                * We're here if L1 chose not to trap WRMSR to TSC. According
> +                * to the spec, this should set L1's TSC; The offset that L1
> +                * set for L2 remains unchanged, and still needs to be added
> +                * to the newly set TSC to get L2's TSC.
>                 */
> -               get_vmcs12(vcpu)->tsc_offset = offset -
> -                       to_vmx(vcpu)->nested.vmcs01_tsc_offset;
> +               struct vmcs12 *vmcs12;
> +               to_vmx(vcpu)->nested.vmcs01_tsc_offset = offset;
> +               /* recalculate vmcs02.TSC_OFFSET: */
> +               vmcs12 = get_vmcs12(vcpu);
> +               vmcs_write64(TSC_OFFSET, offset +
> +                       (nested_cpu_has(vmcs12, CPU_BASED_USE_TSC_OFFSETING) ?
> +                        vmcs12->tsc_offset : 0));
> +       } else {
> +               vmcs_write64(TSC_OFFSET, offset);
> +       }
>  }

This part looks good.

>  static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
> @@ -6529,8 +6537,11 @@ static void prepare_vmcs02(struct kvm_vc
>
>        set_cr4_guest_host_mask(vmx);
>
> -       vmcs_write64(TSC_OFFSET,
> -               vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);
> +       if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETING)
> +               vmcs_write64(TSC_OFFSET,
> +                       vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);
> +       else
> +               vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset);

I need more context here... where do you apply the adjustment?

The offset should be added to the vmcs01_tsc_offset only (but also
written into the hardware VMCS, which should not be preserved when the
guest exits).

>
>        if (enable_vpid) {
>                /*
> @@ -6937,7 +6948,7 @@ static void nested_vmx_vmexit(struct kvm
>
>        load_vmcs12_host_state(vcpu, vmcs12);
>
> -       /* Update TSC_OFFSET if vmx_adjust_tsc_offset() was used while L2 ran */
> +       /* Update TSC_OFFSET if TSC was changed while L2 ran */
>        vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset);
>
>        /* This is needed for same reason as it was needed in prepare_vmcs02 */
>

This is correct.  You should always restore the L1 offset when exiting
if it might have changed.  This implies also that you must update
vmx->nested.vmcs01_tsc_offset if you receive a call to
vmx_adjust_tsc_offset while L2 is running, which is why I wanted to
see more context above.

Zach


* Re: [PATCH 3/3] Fix TSC MSR read in nested SVM
  2011-08-02 12:55 ` [PATCH 3/3] Fix TSC MSR read in nested SVM Nadav Har'El
@ 2011-08-03  8:34   ` Zachary Amsden
  2011-08-03 11:29     ` Nadav Har'El
  2011-08-03 21:00     ` Marcelo Tosatti
  0 siblings, 2 replies; 18+ messages in thread
From: Zachary Amsden @ 2011-08-03  8:34 UTC (permalink / raw)
  To: Nadav Har'El; +Cc: kvm, Roedel, Joerg, Bandan Das, Marcelo Tosatti, avi

Caution: this requires more care.

Pretty sure this breaks userspace suspend at the cost of supporting a
not-so-reasonable hardware feature.

On Tue, Aug 2, 2011 at 5:55 AM, Nadav Har'El <nyh@il.ibm.com> wrote:
> When the TSC MSR is read by an L2 guest (when L1 allowed this MSR to be
> read without exit), we need to return L2's notion of the TSC, not L1's.
>
> The current code incorrectly returned L1 TSC, because svm_get_msr() was also
> used in x86.c where this was assumed, but now that these places call the new
> svm_read_l1_tsc(), the MSR read can be fixed.
>
> Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> ---
>  arch/x86/kvm/svm.c |    4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
>
> --- .before/arch/x86/kvm/svm.c  2011-08-02 15:51:02.000000000 +0300
> +++ .after/arch/x86/kvm/svm.c   2011-08-02 15:51:02.000000000 +0300
> @@ -2907,9 +2907,7 @@ static int svm_get_msr(struct kvm_vcpu *
>
>        switch (ecx) {
>        case MSR_IA32_TSC: {
> -               struct vmcb *vmcb = get_host_vmcb(svm);
> -
> -               *data = vmcb->control.tsc_offset +
> +               *data = svm->vmcb->control.tsc_offset +
>                        svm_scale_tsc(vcpu, native_read_tsc());
>
>                break;
>


This is correct.  Now you properly return the correct MSR value for
the TSC to the guest in all cases.

However, there is a BIG PROBLEM (yes, it is that bad).  Sorry I did
not think of it before.

When the guest gets suspended and frozen by userspace, to be restarted
later, what qemu is going to do is come along and read all of the MSRs
as part of the saved state.  One of those happens to be the TSC MSR.

Since you can't guarantee suspend will take place when only L1 is
running, you may have mixed L1/L2 TSC MSRs now being returned to
userspace.  Now, when you resume this guest, you will have mixed L1/L2
TSC values written into the MSRs.

Those will almost certainly fail to be matched by the TSC offset
matching code, and thus, with multiple VCPUs, you will end up with
slightly unsynchronized TSCs, and with that, all the problems
associated with unstable TSC come back to haunt you again.  Basically,
all bets for stability are off.

Apart from recording the L1 and L2 TSC through separate MSRs, there
isn't a good way to solve this.  Walking through all the steps again,
I'm pretty sure this is why I thought it would be better for L0
kvm_get_msr() to return the L1 values even if L2 was running.

In the end, it may not be worth the hassle to support this mode of
operation, which, to our current knowledge, is in fact unused.  I would
much prefer to see a bad virtualization of a known insecure and unused
guest mode than to have a well-used feature such as L1 guest suspend /
resume continue to cause clock de-synchronization.

Zach


* Re: [PATCH 2/3] Fix nested VMX TSC emulation
  2011-08-03  8:19   ` Zachary Amsden
@ 2011-08-03  8:35     ` Nadav Har'El
  0 siblings, 0 replies; 18+ messages in thread
From: Nadav Har'El @ 2011-08-03  8:35 UTC (permalink / raw)
  To: Zachary Amsden; +Cc: kvm, Roedel, Joerg, Bandan Das, Marcelo Tosatti, avi


Hi,

On Wed, Aug 03, 2011, Zachary Amsden wrote about "Re: [PATCH 2/3] Fix nested VMX TSC emulation":
>...
> > @@ -6529,8 +6537,11 @@ static void prepare_vmcs02(struct kvm_vc
> > -       vmcs_write64(TSC_OFFSET,
> > -               vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);
> > +       if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETING)
> > +               vmcs_write64(TSC_OFFSET,
> > +                       vmx->nested.vmcs01_tsc_offset + vmcs12->tsc_offset);
> > +       else
> > +               vmcs_write64(TSC_OFFSET, vmx->nested.vmcs01_tsc_offset);
> 
> I need more context here... where do you apply the adjustment?
> 
> The offset should be added to the vmcs01_tsc_offset only (but also
> written into the hardware VMCS, which should not be preserved when the
> guest exits).

This code is part of prepare_vmcs02(), which prepares a VMCS to run L2 based
on vmcs12 (the VMCS that L1 prepared for L2) and vmcs01 (L0's wishes for L1).
We need to run L2 with the TSC offset being the sum of the offset that L0
asked for L1, and the offset that L1 then asked for L2. What I fixed in this
patch is that L1 actually has the option not to use TSC_OFFSETING, and if it
doesn't, vmcs12's offset should not be added. As I said, this is a corner case
that probably won't happen in practice (it certainly won't happen if L1 is
KVM).

> This is correct.  You should always restore the L1 offset when exiting
> if it might have changed.  This implies also that you must update
> vmx->nested.vmcs01_tsc_offset if you receive a call to
> vmx_adjust_tsc_offset while L2 is running, which is why I wanted to
> see more context above.

I think you mistook the patch hunk you mentioned above as being somehow
relevant to vmx_adjust_tsc_offset()? It isn't... vmx_adjust_tsc_offset()
is unmodified by these patches, and was already correct. It looked, and
still looks, like this:

static void vmx_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
{
	u64 offset = vmcs_read64(TSC_OFFSET);
	vmcs_write64(TSC_OFFSET, offset + adjustment);
	if (is_guest_mode(vcpu)) {
		/* Even when running L2, the adjustment needs to apply to L1 */
		to_vmx(vcpu)->nested.vmcs01_tsc_offset += adjustment;
	}
}


Thanks,
Nadav.

-- 
Nadav Har'El                        |        Wednesday, Aug  3 2011, 3 Av 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |"Mommy! The garbage man is here!" "Well,
http://nadav.harel.org.il           |tell him we don't want any!"- Groucho Marx


* Re: [PATCH 3/3] Fix TSC MSR read in nested SVM
  2011-08-03  8:34   ` Zachary Amsden
@ 2011-08-03 11:29     ` Nadav Har'El
  2011-08-03 21:00     ` Marcelo Tosatti
  1 sibling, 0 replies; 18+ messages in thread
From: Nadav Har'El @ 2011-08-03 11:29 UTC (permalink / raw)
  To: Zachary Amsden; +Cc: kvm, Roedel, Joerg, Bandan Das, Marcelo Tosatti, avi

Hi,

On Wed, Aug 03, 2011, Zachary Amsden wrote about "Re: [PATCH 3/3] Fix TSC MSR read in nested SVM":
> Pretty sure this breaks userspace suspend at the cost of supporting a
> not-so-reasonable hardware feature.
>...
> This is correct.  Now you properly return the correct MSR value for
> the TSC to the guest in all cases.
> 
> However, there is a BIG PROBLEM (yes, it is that bad).  Sorry I did
> not think of it before.
> 
> When the guest gets suspended and frozen by userspace, to be restarted
> later, what qemu is going to do is come along and read all of the MSRs
> as part of the saved state.  One of those happens to be the TSC MSR.

And just when I thought we were done with this bug :(

Does live migration (or suspend) actually work with nested SVM in the current
code? I certainly don't expect it to work correctly with nested VMX.

Also, I vaguely remember a discussion a while back on this mailing list about
the topic of live migration and nested virtualization, and one of the ideas
raised was that before we can migrate L1, we should force an exit from L2 to
L1, either really (with some real exit reason) or artificially (call the exit
emulation function directly). Then, during the migration we will be sure to
have all the L1 MSRs, in-memory structures, and so on, updated, and,
importantly, we will also be sure to have vmcs12 (the vmcs that L1 keeps for
L2) updated in L1's memory - so that we don't need to send even more internal
KVM state (like vmcs02) during the live migration.

> In the end, it may not be worth the hassle to support this mode of
> operation that, to our current knowledge, is in fact unused.  I would

I do agree that this doesn't sound like a useful mode of operation - but
I also don't like having deliberate mistakes in the code because they
have useful side-effects. I guess that if we can't figure out a way around
this new problem, what I can do is create a patch that:

  1. Always returns L1's TSC for the MSR (as in the original SVM code).

  2. Puts a big comment above this function, about it being architecturally
     *wrong*, but still useful (and explains why).

  3. Checks for the case where users might expect the architecturally-correct
     version, not the current "wrong" version. I.e., check if L1 allows L2
     exit-less reads of the TSC, using the MSR bitmap; if it does, kill the
     guest, or find a way to prevent this setting.

Thanks,
Nadav.


-- 
Nadav Har'El                        |        Wednesday, Aug  3 2011, 3 Av 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Classical music: music written by a
http://nadav.harel.org.il           |decomposing composer.


* Re: [PATCH 3/3] Fix TSC MSR read in nested SVM
  2011-08-03  8:34   ` Zachary Amsden
  2011-08-03 11:29     ` Nadav Har'El
@ 2011-08-03 21:00     ` Marcelo Tosatti
  2011-08-05 15:32       ` Marcelo Tosatti
  1 sibling, 1 reply; 18+ messages in thread
From: Marcelo Tosatti @ 2011-08-03 21:00 UTC (permalink / raw)
  To: Zachary Amsden; +Cc: Nadav Har'El, kvm, Roedel, Joerg, Bandan Das, avi

On Wed, Aug 03, 2011 at 01:34:58AM -0700, Zachary Amsden wrote:
> Caution: this requires more care.
> 
> Pretty sure this breaks userspace suspend at the cost of supporting a
> not-so-reasonable hardware feature.
> 
> On Tue, Aug 2, 2011 at 5:55 AM, Nadav Har'El <nyh@il.ibm.com> wrote:
> > When the TSC MSR is read by an L2 guest (when L1 allowed this MSR to be
> > read without exit), we need to return L2's notion of the TSC, not L1's.
> >
> > The current code incorrectly returned L1 TSC, because svm_get_msr() was also
> > used in x86.c where this was assumed, but now that these places call the new
> > svm_read_l1_tsc(), the MSR read can be fixed.
> >
> > Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> > ---
> >  arch/x86/kvm/svm.c |    4 +---
> >  1 file changed, 1 insertion(+), 3 deletions(-)
> >
> > --- .before/arch/x86/kvm/svm.c  2011-08-02 15:51:02.000000000 +0300
> > +++ .after/arch/x86/kvm/svm.c   2011-08-02 15:51:02.000000000 +0300
> > @@ -2907,9 +2907,7 @@ static int svm_get_msr(struct kvm_vcpu *
> >
> >        switch (ecx) {
> >        case MSR_IA32_TSC: {
> > -               struct vmcb *vmcb = get_host_vmcb(svm);
> > -
> > -               *data = vmcb->control.tsc_offset +
> > +               *data = svm->vmcb->control.tsc_offset +
> >                        svm_scale_tsc(vcpu, native_read_tsc());
> >
> >                break;
> >
> 
> 
> This is correct.  Now you properly return the correct MSR value for
> the TSC to the guest in all cases.
> 
> However, there is a BIG PROBLEM (yes, it is that bad).  Sorry I did
> not think of it before.
> 
> When the guest gets suspended and frozen by userspace, to be restarted
> later, what qemu is going to do is come along and read all of the MSRs
> as part of the saved state.  One of those happens to be the TSC MSR.
> 
> Since you can't guarantee suspend will take place when only L1 is
> running, you may have mixed L1/L2 TSC MSRs now being returned to
> userspace.  Now, when you resume this guest, you will have mixed L1/L2
> TSC values written into the MSRs.
> 
> Those will almost certainly fail to be matched by the TSC offset
> matching code, and thus, with multiple VCPUs, you will end up with
> slightly unsynchronized TSCs, and with that, all the problems
> associated with unstable TSC come back to haunt you again.  Basically,
> all bets for stability are off.

TSC synchronization is the least of your problems if you attempt to
save/restore the state of a guest while any vcpu is in L2 mode.

So to keep consistency between the remaining MSRs, I agree with Nadav's
patch. Apparently SVM's original patches to return the L1 TSC were aimed at
fixing the problem VMX is facing now, which is fixed by the introduction of
the read_l1_tsc helpers.

> Apart from recording the L1 and L2 TSC through separate MSRs, there
> isn't a good way to solve this.  Walking through all the steps again,
> I'm pretty sure this is why I thought it would be better for L0
> kvm_get_msr() to return the L1 values even if L2 was running.
> 
> In the end, it may not be worth the hassle to support this mode of
> operation that, to our current knowledge, is in fact unused.  I would
> much prefer to see a bad virtualization of a known insecure and unused
> guest mode than to have a well used feature such as L1 guest suspend /
> resume continue to cause clock de-synchronization.
> 
> Zach


* Re: [PATCH 3/3] Fix TSC MSR read in nested SVM
  2011-08-03 21:00     ` Marcelo Tosatti
@ 2011-08-05 15:32       ` Marcelo Tosatti
  2011-08-06 12:16         ` Joerg Roedel
  2011-08-07 12:04         ` Joerg Roedel
  0 siblings, 2 replies; 18+ messages in thread
From: Marcelo Tosatti @ 2011-08-05 15:32 UTC (permalink / raw)
  To: Zachary Amsden, Joerg Roedel; +Cc: Nadav Har'El, kvm, Bandan Das, avi

On Wed, Aug 03, 2011 at 06:00:52PM -0300, Marcelo Tosatti wrote:
> On Wed, Aug 03, 2011 at 01:34:58AM -0700, Zachary Amsden wrote:
> > Caution: this requires more care.
> > 
> > Pretty sure this breaks userspace suspend at the cost of supporting a
> > not-so-reasonable hardware feature.
> > 
> > On Tue, Aug 2, 2011 at 5:55 AM, Nadav Har'El <nyh@il.ibm.com> wrote:
> > > When the TSC MSR is read by an L2 guest (when L1 allowed this MSR to be
> > > read without exit), we need to return L2's notion of the TSC, not L1's.
> > >
> > > The current code incorrectly returned L1 TSC, because svm_get_msr() was also
> > > used in x86.c where this was assumed, but now that these places call the new
> > > svm_read_l1_tsc(), the MSR read can be fixed.
> > >
> > > Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> > > ---
> > >  arch/x86/kvm/svm.c |    4 +---
> > >  1 file changed, 1 insertion(+), 3 deletions(-)
> > >
> > > --- .before/arch/x86/kvm/svm.c  2011-08-02 15:51:02.000000000 +0300
> > > +++ .after/arch/x86/kvm/svm.c   2011-08-02 15:51:02.000000000 +0300
> > > @@ -2907,9 +2907,7 @@ static int svm_get_msr(struct kvm_vcpu *
> > >
> > >        switch (ecx) {
> > >        case MSR_IA32_TSC: {
> > > -               struct vmcb *vmcb = get_host_vmcb(svm);
> > > -
> > > -               *data = vmcb->control.tsc_offset +
> > > +               *data = svm->vmcb->control.tsc_offset +
> > >                        svm_scale_tsc(vcpu, native_read_tsc());
> > >
> > >                break;
> > >
> > 
> > 
> > This is correct.  Now you properly return the correct MSR value for
> > the TSC to the guest in all cases.
> > 
> > However, there is a BIG PROBLEM (yes, it is that bad).  Sorry I did
> > not think of it before.
> > 
> > When the guest gets suspended and frozen by userspace, to be restarted
> > later, what qemu is going to do is come along and read all of the MSRs
> > as part of the saved state.  One of those happens to be the TSC MSR.
> > 
> > Since you can't guarantee suspend will take place when only L1 is
> > running, you may have mixed L1/L2 TSC MSRs now being returned to
> > userspace.  Now, when you resume this guest, you will have mixed L1/L2
> > TSC values written into the MSRs.
> > 
> > Those will almost certainly fail to be matched by the TSC offset
> > matching code, and thus, with multiple VCPUs, you will end up with
> > slightly unsynchronized TSCs, and with that, all the problems
> > associated with unstable TSC come back to haunt you again.  Basically,
> > all bets for stability are off.
> 
> TSC synchronization is the least of your problems if you attempt to
> save/restore the state of a guest while any vcpu is in L2 mode.
> 
> So to keep consistency between the remaining MSRs, I agree with Nadav's
> patch. Apparently SVM's original patches to return the L1 TSC were aimed at
> fixing the problem VMX is facing now, which is fixed by the introduction of
> the read_l1_tsc helpers.

Joerg, can you review and ack Nadav's SVM patch? TIA



* Re: [PATCH 3/3] Fix TSC MSR read in nested SVM
  2011-08-05 15:32       ` Marcelo Tosatti
@ 2011-08-06 12:16         ` Joerg Roedel
  2011-08-07 12:04         ` Joerg Roedel
  1 sibling, 0 replies; 18+ messages in thread
From: Joerg Roedel @ 2011-08-06 12:16 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Zachary Amsden, Joerg Roedel, Nadav Har'El, kvm, Bandan Das, avi

On Fri, Aug 05, 2011 at 12:32:24PM -0300, Marcelo Tosatti wrote:
> On Wed, Aug 03, 2011 at 06:00:52PM -0300, Marcelo Tosatti wrote:
> > On Wed, Aug 03, 2011 at 01:34:58AM -0700, Zachary Amsden wrote:
> > > Caution: this requires more care.
> > > 
> > > Pretty sure this breaks userspace suspend at the cost of supporting a
> > > not-so-reasonable hardware feature.
> > > 
> > > On Tue, Aug 2, 2011 at 5:55 AM, Nadav Har'El <nyh@il.ibm.com> wrote:
> > > > When the TSC MSR is read by an L2 guest (when L1 allowed this MSR to be
> > > > read without exit), we need to return L2's notion of the TSC, not L1's.
> > > >
> > > > The current code incorrectly returned L1 TSC, because svm_get_msr() was also
> > > > used in x86.c where this was assumed, but now that these places call the new
> > > > svm_read_l1_tsc(), the MSR read can be fixed.
> > > >
> > > > Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
> > > > ---
> > > >  arch/x86/kvm/svm.c |    4 +---
> > > >  1 file changed, 1 insertion(+), 3 deletions(-)
> > > >
> > > > --- .before/arch/x86/kvm/svm.c  2011-08-02 15:51:02.000000000 +0300
> > > > +++ .after/arch/x86/kvm/svm.c   2011-08-02 15:51:02.000000000 +0300
> > > > @@ -2907,9 +2907,7 @@ static int svm_get_msr(struct kvm_vcpu *
> > > >
> > > >        switch (ecx) {
> > > >        case MSR_IA32_TSC: {
> > > > -               struct vmcb *vmcb = get_host_vmcb(svm);
> > > > -
> > > > -               *data = vmcb->control.tsc_offset +
> > > > +               *data = svm->vmcb->control.tsc_offset +
> > > >                        svm_scale_tsc(vcpu, native_read_tsc());
> > > >
> > > >                break;
> > > >
> > > 
> > > 
> > > This is correct.  Now you properly return the correct MSR value for
> > > the TSC to the guest in all cases.
> > > 
> > > However, there is a BIG PROBLEM (yes, it is that bad).  Sorry I did
> > > not think of it before.
> > > 
> > > When the guest gets suspended and frozen by userspace, to be restarted
> > > later, what qemu is going to do is come along and read all of the MSRs
> > > as part of the saved state.  One of those happens to be the TSC MSR.
> > > 
> > > Since you can't guarantee suspend will take place when only L1 is
> > > running, you may have mixed L1/L2 TSC MSRs now being returned to
> > > userspace.  Now, when you resume this guest, you will have mixed L1/L2
> > > TSC values written into the MSRs.
> > > 
> > > Those will almost certainly fail to be matched by the TSC offset
> > > matching code, and thus, with multiple VCPUs, you will end up with
> > > slightly unsynchronized TSCs, and with that, all the problems
> > > associated with unstable TSC come back to haunt you again.  Basically,
> > > all bets for stability are off.
> > 
> > TSC synchronization is the least of your problems if you attempt to
> > save/restore the state of a guest while any vcpu is in L2 mode.
> > 
> > So to keep consistency between the remaining MSRs, I agree with Nadav's
> > patch. Apparently SVM's original patches to return the L1 TSC were aimed at
> > fixing the problem VMX is facing now, which is fixed by the introduction of
> > the read_l1_tsc helpers.
> 
> Joerg, can you review and ack Nadav's SVM patch? TIA

Yes, sorry for the delay. I'll give it a review and test today.

	Joerg



* Re: [PATCH 3/3] Fix TSC MSR read in nested SVM
  2011-08-05 15:32       ` Marcelo Tosatti
  2011-08-06 12:16         ` Joerg Roedel
@ 2011-08-07 12:04         ` Joerg Roedel
  2011-08-10 12:35           ` Nadav Har'El
  1 sibling, 1 reply; 18+ messages in thread
From: Joerg Roedel @ 2011-08-07 12:04 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Zachary Amsden, Joerg Roedel, Nadav Har'El, kvm, Bandan Das, avi

On Fri, Aug 05, 2011 at 12:32:24PM -0300, Marcelo Tosatti wrote:
> > So to keep consistency between the remaining MSRs, I agree with Nadav's
> > patch. Apparently SVM's original patches to return the L1 TSC were aimed at
> > fixing the problem VMX is facing now, which is fixed by the introduction of
> > the read_l1_tsc helpers.
> 
> Joerg, can you review and ack Nadav's SVM patch? TIA

Tested-by: Joerg Roedel <joerg.roedel@amd.com>
Acked-by: Joerg Roedel <joerg.roedel@amd.com>

Reviewed and tested (didn't apply cleanly, but was easy to fix that up).
Looks all good so far.

Regards,

	Joerg



* Re: [PATCH 3/3] Fix TSC MSR read in nested SVM
  2011-08-07 12:04         ` Joerg Roedel
@ 2011-08-10 12:35           ` Nadav Har'El
  2011-08-10 13:02             ` Joerg Roedel
  0 siblings, 1 reply; 18+ messages in thread
From: Nadav Har'El @ 2011-08-10 12:35 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Marcelo Tosatti, Zachary Amsden, Joerg Roedel, kvm, Bandan Das, avi

On Sun, Aug 07, 2011, Joerg Roedel wrote about "Re: [PATCH 3/3] Fix TSC MSR read in nested SVM":
> > Joerg, can you review and ack Nadav's SVM patch? TIA
> 
> Tested-by: Joerg Roedel <joerg.roedel@amd.com>
> Acked-by: Joerg Roedel <joerg.roedel@amd.com>
> 
> Reviewed and tested (didn't apply cleanly, but was easy to fix that up).
> Looks all good so far.

Hi guys, I'm a bit confused about how we want to proceed from here.

The patches which I sent appear to fix the original bug (as confirmed by
the two people who reported it), but Zachary warned that it would break
migration of nested SVM while L2 is running. I asked whether migration
works at all while L2 is running (without exiting to L1 first) - and
Marcelo suggested that it doesn't.

If the problem Zachary pointed out is considered serious, I proposed a second
option: leave the code to always *wrongly* return the L1 TSC for the MSR, and
check (and warn) in situations where this wrong value would be returned to
the guest. This will let qemu always read L1's TSC MSR, even when L2 is
running. While I proposed this as a second option, I don't think I can
recommend it.

So what's the verdict? :-)

Thanks,
Nadav.

-- 
Nadav Har'El                        |       Wednesday, Aug 10 2011, 10 Av 5771
nyh@math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |It's fortunate I have bad luck - without
http://nadav.harel.org.il           |it I would have no luck at all!


* Re: [PATCH 3/3] Fix TSC MSR read in nested SVM
  2011-08-10 12:35           ` Nadav Har'El
@ 2011-08-10 13:02             ` Joerg Roedel
  0 siblings, 0 replies; 18+ messages in thread
From: Joerg Roedel @ 2011-08-10 13:02 UTC (permalink / raw)
  To: Nadav Har'El
  Cc: Marcelo Tosatti, Zachary Amsden, Joerg Roedel, kvm, Bandan Das, avi

On Wed, Aug 10, 2011 at 03:35:28PM +0300, Nadav Har'El wrote:
> On Sun, Aug 07, 2011, Joerg Roedel wrote about "Re: [PATCH 3/3] Fix TSC MSR read in nested SVM":
> > > Joerg, can you review and ack Nadav's SVM patch? TIA
> > 
> > Tested-by: Joerg Roedel <joerg.roedel@amd.com>
> > Acked-by: Joerg Roedel <joerg.roedel@amd.com>
> > 
> > Reviewed and tested (didn't apply cleanly, but was easy to fix that up).
> > Looks all good so far.
> 
> Hi guys, I'm a bit confused how we want to proceed from here.
> 
> The patches which I sent appear to fix the original bug (as confirmed by
> the two people who reported it), but Zachary warned that it would break
> migration of nested SVM while L2 is running. I asked whether migration
> works at all while L2 is running (without exiting to L1 first) - and
> Marcelo suggested that it doesn't.

Migration doesn't work today when L2 is running, so don't worry about it
for now.

Regards,

	Joerg



* Re: [PATCH 0/3] Nested TSC handling
  2011-08-02 12:53 [PATCH 0/3] Nested TSC handling Nadav Har'El
                   ` (3 preceding siblings ...)
  2011-08-02 19:24 ` [PATCH 0/3] Nested TSC handling Bandan Das
@ 2011-08-10 15:40 ` Avi Kivity
  4 siblings, 0 replies; 18+ messages in thread
From: Avi Kivity @ 2011-08-10 15:40 UTC (permalink / raw)
  To: Nadav Har'El
  Cc: kvm, Roedel, Joerg, Zachary Amsden, Bandan Das, Marcelo Tosatti

On 08/02/2011 03:53 PM, Nadav Har'El wrote:
> The following are patches I propose for fixing the bug discovered by Bandan
> Das and discussed in the "Nested VMX - L1 hangs on running L2" thread.
>
> The first patch should fix the originally-reported bug, as explained in the
> aforementioned thread: A new x86_op read_l1_tsc() is called L1's TSC is
> needed, instead of assuming that calling kvm_read_msr() will do that
> (because this has to return L2's TSC when a nested guest is running).
>
> The second and third patches fix relatively-unimportant corner cases in
> nested VMX and nested SVM TSC handling.
>
> I'd appreciate it if the people who noticed this bug can verify that these
> patches indeed solve it for them.
>

Thanks, applied.

-- 
error compiling committee.c: too many arguments to function



end of thread, other threads:[~2011-08-10 15:40 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-08-02 12:53 [PATCH 0/3] Nested TSC handling Nadav Har'El
2011-08-02 12:54 ` [PATCH 1/3] L1 " Nadav Har'El
2011-08-03  7:54   ` Zachary Amsden
2011-08-02 12:54 ` [PATCH 2/3] Fix nested VMX TSC emulation Nadav Har'El
2011-08-03  8:19   ` Zachary Amsden
2011-08-03  8:35     ` Nadav Har'El
2011-08-02 12:55 ` [PATCH 3/3] Fix TSC MSR read in nested SVM Nadav Har'El
2011-08-03  8:34   ` Zachary Amsden
2011-08-03 11:29     ` Nadav Har'El
2011-08-03 21:00     ` Marcelo Tosatti
2011-08-05 15:32       ` Marcelo Tosatti
2011-08-06 12:16         ` Joerg Roedel
2011-08-07 12:04         ` Joerg Roedel
2011-08-10 12:35           ` Nadav Har'El
2011-08-10 13:02             ` Joerg Roedel
2011-08-02 19:24 ` [PATCH 0/3] Nested TSC handling Bandan Das
2011-08-02 19:54   ` Matt McGill
2011-08-10 15:40 ` Avi Kivity
