kvm.vger.kernel.org archive mirror
* [PATCH 0/8] KVM: VMX: Implement nested TSC scaling
@ 2021-05-06 10:32 ilstam
  2021-05-06 10:32 ` [PATCH 1/8] KVM: VMX: Add a TSC multiplier field in VMCS12 ilstam
                   ` (8 more replies)
  0 siblings, 9 replies; 36+ messages in thread
From: ilstam @ 2021-05-06 10:32 UTC (permalink / raw)
  To: kvm, pbonzini
  Cc: ilstam, seanjc, vkuznets, wanpengli, jmattson, joro,
	haozhong.zhang, zamsden, mtosatti, dplotnikov, dwmw

From: Ilias Stamatis <ilstam@amazon.com>

KVM currently supports hardware-assisted TSC scaling for L1 only; the feature
is not exposed to nested guests. This patch series adds support for nested TSC
scaling, allowing L1 and L2 to be scaled with different scaling factors.

When scaling and offsetting are applied, the guest's TSC is calculated as:

((TSC * multiplier) >> 48) + offset

With nested scaling the values in VMCS01 and VMCS12 need to be merged
together and stored in VMCS02.

The VMCS02 values are calculated as follows:

offset_02 = ((offset_01 * mult_12) >> 48) + offset_12
mult_02 = (mult_01 * mult_12) >> 48

The last patch of the series adds a KVM selftest.

Ilias Stamatis (8):
  KVM: VMX: Add a TSC multiplier field in VMCS12
  KVM: X86: Store L1's TSC scaling ratio in 'struct kvm_vcpu_arch'
  KVM: X86: Pass an additional 'L1' argument to kvm_scale_tsc()
  KVM: VMX: Adjust the TSC-related VMCS fields on L2 entry and exit
  KVM: X86: Move tracing outside write_l1_tsc_offset()
  KVM: VMX: Make vmx_write_l1_tsc_offset() work with nested TSC scaling
  KVM: VMX: Expose TSC scaling to L2
  KVM: selftests: x86: Add vmx_nested_tsc_scaling_test

 arch/x86/include/asm/kvm_host.h               |   8 +-
 arch/x86/kvm/svm/svm.c                        |   4 -
 arch/x86/kvm/vmx/nested.c                     |  32 ++-
 arch/x86/kvm/vmx/vmcs12.c                     |   1 +
 arch/x86/kvm/vmx/vmcs12.h                     |   4 +-
 arch/x86/kvm/vmx/vmx.c                        |  31 ++-
 arch/x86/kvm/x86.c                            |  54 ++++-
 tools/testing/selftests/kvm/.gitignore        |   1 +
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../kvm/x86_64/vmx_nested_tsc_scaling_test.c  | 209 ++++++++++++++++++
 10 files changed, 312 insertions(+), 33 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/x86_64/vmx_nested_tsc_scaling_test.c

-- 
2.17.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 1/8] KVM: VMX: Add a TSC multiplier field in VMCS12
  2021-05-06 10:32 [PATCH 0/8] KVM: VMX: Implement nested TSC scaling ilstam
@ 2021-05-06 10:32 ` ilstam
  2021-05-06 14:50   ` kernel test robot
                     ` (2 more replies)
  2021-05-06 10:32 ` [PATCH 2/8] KVM: X86: Store L1's TSC scaling ratio in 'struct kvm_vcpu_arch' ilstam
                   ` (7 subsequent siblings)
  8 siblings, 3 replies; 36+ messages in thread
From: ilstam @ 2021-05-06 10:32 UTC (permalink / raw)
  To: kvm, pbonzini
  Cc: ilstam, seanjc, vkuznets, wanpengli, jmattson, joro,
	haozhong.zhang, zamsden, mtosatti, dplotnikov, dwmw

From: Ilias Stamatis <ilstam@amazon.com>

This is required for supporting nested TSC scaling.

Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
---
 arch/x86/kvm/vmx/vmcs12.c | 1 +
 arch/x86/kvm/vmx/vmcs12.h | 4 +++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/vmcs12.c b/arch/x86/kvm/vmx/vmcs12.c
index 034adb6404dc..d9f5d7c56ae3 100644
--- a/arch/x86/kvm/vmx/vmcs12.c
+++ b/arch/x86/kvm/vmx/vmcs12.c
@@ -37,6 +37,7 @@ const unsigned short vmcs_field_to_offset_table[] = {
 	FIELD64(VM_ENTRY_MSR_LOAD_ADDR, vm_entry_msr_load_addr),
 	FIELD64(PML_ADDRESS, pml_address),
 	FIELD64(TSC_OFFSET, tsc_offset),
+	FIELD64(TSC_MULTIPLIER, tsc_multiplier),
 	FIELD64(VIRTUAL_APIC_PAGE_ADDR, virtual_apic_page_addr),
 	FIELD64(APIC_ACCESS_ADDR, apic_access_addr),
 	FIELD64(POSTED_INTR_DESC_ADDR, posted_intr_desc_addr),
diff --git a/arch/x86/kvm/vmx/vmcs12.h b/arch/x86/kvm/vmx/vmcs12.h
index 13494956d0e9..bb81a23afe89 100644
--- a/arch/x86/kvm/vmx/vmcs12.h
+++ b/arch/x86/kvm/vmx/vmcs12.h
@@ -70,7 +70,8 @@ struct __packed vmcs12 {
 	u64 eptp_list_address;
 	u64 pml_address;
 	u64 encls_exiting_bitmap;
-	u64 padding64[2]; /* room for future expansion */
+	u64 tsc_multiplier;
+	u64 padding64[1]; /* room for future expansion */
 	/*
 	 * To allow migration of L1 (complete with its L2 guests) between
 	 * machines of different natural widths (32 or 64 bit), we cannot have
@@ -258,6 +259,7 @@ static inline void vmx_check_vmcs12_offsets(void)
 	CHECK_OFFSET(eptp_list_address, 304);
 	CHECK_OFFSET(pml_address, 312);
 	CHECK_OFFSET(encls_exiting_bitmap, 320);
+	CHECK_OFFSET(tsc_multiplier, 328);
 	CHECK_OFFSET(cr0_guest_host_mask, 344);
 	CHECK_OFFSET(cr4_guest_host_mask, 352);
 	CHECK_OFFSET(cr0_read_shadow, 360);
-- 
2.17.1



* [PATCH 2/8] KVM: X86: Store L1's TSC scaling ratio in 'struct kvm_vcpu_arch'
  2021-05-06 10:32 [PATCH 0/8] KVM: VMX: Implement nested TSC scaling ilstam
  2021-05-06 10:32 ` [PATCH 1/8] KVM: VMX: Add a TSC multiplier field in VMCS12 ilstam
@ 2021-05-06 10:32 ` ilstam
  2021-05-10 13:43   ` Maxim Levitsky
  2021-05-06 10:32 ` [PATCH 3/8] KVM: X86: Pass an additional 'L1' argument to kvm_scale_tsc() ilstam
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 36+ messages in thread
From: ilstam @ 2021-05-06 10:32 UTC (permalink / raw)
  To: kvm, pbonzini
  Cc: ilstam, seanjc, vkuznets, wanpengli, jmattson, joro,
	haozhong.zhang, zamsden, mtosatti, dplotnikov, dwmw

From: Ilias Stamatis <ilstam@amazon.com>

Store L1's scaling ratio in that struct like we already do for L1's TSC
offset. This allows for easy save/restore when we enter and then exit
the nested guest.

Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
---
 arch/x86/include/asm/kvm_host.h | 5 +++--
 arch/x86/kvm/x86.c              | 4 +++-
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index cbbcee0a84f9..132e820525fb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -705,7 +705,7 @@ struct kvm_vcpu_arch {
 	} st;
 
 	u64 l1_tsc_offset;
-	u64 tsc_offset;
+	u64 tsc_offset; /* current tsc offset */
 	u64 last_guest_tsc;
 	u64 last_host_tsc;
 	u64 tsc_offset_adjustment;
@@ -719,7 +719,8 @@ struct kvm_vcpu_arch {
 	u32 virtual_tsc_khz;
 	s64 ia32_tsc_adjust_msr;
 	u64 msr_ia32_power_ctl;
-	u64 tsc_scaling_ratio;
+	u64 l1_tsc_scaling_ratio;
+	u64 tsc_scaling_ratio; /* current scaling ratio */
 
 	atomic_t nmi_queued;  /* unprocessed asynchronous NMIs */
 	unsigned nmi_pending; /* NMI queued after currently running handler */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cebdaa1e3cf5..7bc5155ac6fd 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2119,6 +2119,7 @@ static int set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool scale)
 
 	/* Guest TSC same frequency as host TSC? */
 	if (!scale) {
+		vcpu->arch.l1_tsc_scaling_ratio = kvm_default_tsc_scaling_ratio;
 		vcpu->arch.tsc_scaling_ratio = kvm_default_tsc_scaling_ratio;
 		return 0;
 	}
@@ -2145,7 +2146,7 @@ static int set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool scale)
 		return -1;
 	}
 
-	vcpu->arch.tsc_scaling_ratio = ratio;
+	vcpu->arch.l1_tsc_scaling_ratio = vcpu->arch.tsc_scaling_ratio = ratio;
 	return 0;
 }
 
@@ -2157,6 +2158,7 @@ static int kvm_set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz)
 	/* tsc_khz can be zero if TSC calibration fails */
 	if (user_tsc_khz == 0) {
 		/* set tsc_scaling_ratio to a safe value */
+		vcpu->arch.l1_tsc_scaling_ratio = kvm_default_tsc_scaling_ratio;
 		vcpu->arch.tsc_scaling_ratio = kvm_default_tsc_scaling_ratio;
 		return -1;
 	}
-- 
2.17.1



* [PATCH 3/8] KVM: X86: Pass an additional 'L1' argument to kvm_scale_tsc()
  2021-05-06 10:32 [PATCH 0/8] KVM: VMX: Implement nested TSC scaling ilstam
  2021-05-06 10:32 ` [PATCH 1/8] KVM: VMX: Add a TSC multiplier field in VMCS12 ilstam
  2021-05-06 10:32 ` [PATCH 2/8] KVM: X86: Store L1's TSC scaling ratio in 'struct kvm_vcpu_arch' ilstam
@ 2021-05-06 10:32 ` ilstam
  2021-05-10 13:52   ` Maxim Levitsky
  2021-05-06 10:32 ` [PATCH 4/8] KVM: VMX: Adjust the TSC-related VMCS fields on L2 entry and exit ilstam
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 36+ messages in thread
From: ilstam @ 2021-05-06 10:32 UTC (permalink / raw)
  To: kvm, pbonzini
  Cc: ilstam, seanjc, vkuznets, wanpengli, jmattson, joro,
	haozhong.zhang, zamsden, mtosatti, dplotnikov, dwmw

From: Ilias Stamatis <ilstam@amazon.com>

Sometimes kvm_scale_tsc() needs to use the current scaling ratio and
other times (like when reading the TSC from user space) it needs to use
L1's scaling ratio. Have the caller specify this by passing an
additional boolean argument.

Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/x86.c              | 21 +++++++++++++--------
 2 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 132e820525fb..cdddbf0b1177 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1779,7 +1779,7 @@ int kvm_pv_send_ipi(struct kvm *kvm, unsigned long ipi_bitmap_low,
 void kvm_define_user_return_msr(unsigned index, u32 msr);
 int kvm_set_user_return_msr(unsigned index, u64 val, u64 mask);
 
-u64 kvm_scale_tsc(struct kvm_vcpu *vcpu, u64 tsc);
+u64 kvm_scale_tsc(struct kvm_vcpu *vcpu, u64 tsc, bool l1);
 u64 kvm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc);
 
 unsigned long kvm_get_linear_rip(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7bc5155ac6fd..26a4c0f46f15 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2241,10 +2241,14 @@ static inline u64 __scale_tsc(u64 ratio, u64 tsc)
 	return mul_u64_u64_shr(tsc, ratio, kvm_tsc_scaling_ratio_frac_bits);
 }
 
-u64 kvm_scale_tsc(struct kvm_vcpu *vcpu, u64 tsc)
+/*
+ * If l1 is true the TSC is scaled using L1's scaling ratio, otherwise
+ * the current scaling ratio is used.
+ */
+u64 kvm_scale_tsc(struct kvm_vcpu *vcpu, u64 tsc, bool l1)
 {
 	u64 _tsc = tsc;
-	u64 ratio = vcpu->arch.tsc_scaling_ratio;
+	u64 ratio = l1 ? vcpu->arch.l1_tsc_scaling_ratio : vcpu->arch.tsc_scaling_ratio;
 
 	if (ratio != kvm_default_tsc_scaling_ratio)
 		_tsc = __scale_tsc(ratio, tsc);
@@ -2257,14 +2261,14 @@ static u64 kvm_compute_tsc_offset(struct kvm_vcpu *vcpu, u64 target_tsc)
 {
 	u64 tsc;
 
-	tsc = kvm_scale_tsc(vcpu, rdtsc());
+	tsc = kvm_scale_tsc(vcpu, rdtsc(), true);
 
 	return target_tsc - tsc;
 }
 
 u64 kvm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc)
 {
-	return vcpu->arch.l1_tsc_offset + kvm_scale_tsc(vcpu, host_tsc);
+	return vcpu->arch.l1_tsc_offset + kvm_scale_tsc(vcpu, host_tsc, true);
 }
 EXPORT_SYMBOL_GPL(kvm_read_l1_tsc);
 
@@ -2395,9 +2399,9 @@ static inline void adjust_tsc_offset_guest(struct kvm_vcpu *vcpu,
 
 static inline void adjust_tsc_offset_host(struct kvm_vcpu *vcpu, s64 adjustment)
 {
-	if (vcpu->arch.tsc_scaling_ratio != kvm_default_tsc_scaling_ratio)
+	if (vcpu->arch.l1_tsc_scaling_ratio != kvm_default_tsc_scaling_ratio)
 		WARN_ON(adjustment < 0);
-	adjustment = kvm_scale_tsc(vcpu, (u64) adjustment);
+	adjustment = kvm_scale_tsc(vcpu, (u64) adjustment, true);
 	adjust_tsc_offset_guest(vcpu, adjustment);
 }
 
@@ -2780,7 +2784,7 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 	/* With all the info we got, fill in the values */
 
 	if (kvm_has_tsc_control)
-		tgt_tsc_khz = kvm_scale_tsc(v, tgt_tsc_khz);
+		tgt_tsc_khz = kvm_scale_tsc(v, tgt_tsc_khz, true);
 
 	if (unlikely(vcpu->hw_tsc_khz != tgt_tsc_khz)) {
 		kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL,
@@ -3474,7 +3478,8 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		u64 tsc_offset = msr_info->host_initiated ? vcpu->arch.l1_tsc_offset :
 							    vcpu->arch.tsc_offset;
 
-		msr_info->data = kvm_scale_tsc(vcpu, rdtsc()) + tsc_offset;
+		msr_info->data = kvm_scale_tsc(vcpu, rdtsc(),
+					       msr_info->host_initiated) + tsc_offset;
 		break;
 	}
 	case MSR_MTRRcap:
-- 
2.17.1



* [PATCH 4/8] KVM: VMX: Adjust the TSC-related VMCS fields on L2 entry and exit
  2021-05-06 10:32 [PATCH 0/8] KVM: VMX: Implement nested TSC scaling ilstam
                   ` (2 preceding siblings ...)
  2021-05-06 10:32 ` [PATCH 3/8] KVM: X86: Pass an additional 'L1' argument to kvm_scale_tsc() ilstam
@ 2021-05-06 10:32 ` ilstam
  2021-05-06 11:32   ` Paolo Bonzini
  2021-05-10 13:53   ` Maxim Levitsky
  2021-05-06 10:32 ` [PATCH 5/8] KVM: X86: Move tracing outside write_l1_tsc_offset() ilstam
                   ` (4 subsequent siblings)
  8 siblings, 2 replies; 36+ messages in thread
From: ilstam @ 2021-05-06 10:32 UTC (permalink / raw)
  To: kvm, pbonzini
  Cc: ilstam, seanjc, vkuznets, wanpengli, jmattson, joro,
	haozhong.zhang, zamsden, mtosatti, dplotnikov, dwmw

From: Ilias Stamatis <ilstam@amazon.com>

When L2 is entered we need to "merge" the TSC multiplier and TSC offset
values of VMCS01 and VMCS12 and store the result into the current
VMCS02.

The 02 values are calculated using the following equations:
  offset_02 = ((offset_01 * mult_12) >> 48) + offset_12
  mult_02 = (mult_01 * mult_12) >> 48

Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/vmx/nested.c       | 26 ++++++++++++++++++++++----
 arch/x86/kvm/x86.c              | 25 +++++++++++++++++++++++++
 3 files changed, 48 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index cdddbf0b1177..e7a1eb36f95a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1780,6 +1780,7 @@ void kvm_define_user_return_msr(unsigned index, u32 msr);
 int kvm_set_user_return_msr(unsigned index, u64 val, u64 mask);
 
 u64 kvm_scale_tsc(struct kvm_vcpu *vcpu, u64 tsc, bool l1);
+u64 kvm_compute_02_tsc_offset(u64 l1_offset, u64 l2_multiplier, u64 l2_offset);
 u64 kvm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc);
 
 unsigned long kvm_get_linear_rip(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index bced76637823..a1bf28f33837 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -3353,8 +3353,22 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
 	}
 
 	enter_guest_mode(vcpu);
-	if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETTING)
-		vcpu->arch.tsc_offset += vmcs12->tsc_offset;
+
+	if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETTING) {
+		if (vmcs12->secondary_vm_exec_control & SECONDARY_EXEC_TSC_SCALING) {
+			vcpu->arch.tsc_offset = kvm_compute_02_tsc_offset(
+					vcpu->arch.l1_tsc_offset,
+					vmcs12->tsc_multiplier,
+					vmcs12->tsc_offset);
+
+			vcpu->arch.tsc_scaling_ratio = mul_u64_u64_shr(
+					vcpu->arch.tsc_scaling_ratio,
+					vmcs12->tsc_multiplier,
+					kvm_tsc_scaling_ratio_frac_bits);
+		} else {
+			vcpu->arch.tsc_offset += vmcs12->tsc_offset;
+		}
+	}
 
 	if (prepare_vmcs02(vcpu, vmcs12, &entry_failure_code)) {
 		exit_reason.basic = EXIT_REASON_INVALID_STATE;
@@ -4454,8 +4468,12 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
 	if (nested_cpu_has_preemption_timer(vmcs12))
 		hrtimer_cancel(&to_vmx(vcpu)->nested.preemption_timer);
 
-	if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETTING)
-		vcpu->arch.tsc_offset -= vmcs12->tsc_offset;
+	if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETTING) {
+		vcpu->arch.tsc_offset = vcpu->arch.l1_tsc_offset;
+
+		if (vmcs12->secondary_vm_exec_control & SECONDARY_EXEC_TSC_SCALING)
+			vcpu->arch.tsc_scaling_ratio = vcpu->arch.l1_tsc_scaling_ratio;
+	}
 
 	if (likely(!vmx->fail)) {
 		sync_vmcs02_to_vmcs12(vcpu, vmcs12);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 26a4c0f46f15..87deb119c521 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2266,6 +2266,31 @@ static u64 kvm_compute_tsc_offset(struct kvm_vcpu *vcpu, u64 target_tsc)
 	return target_tsc - tsc;
 }
 
+/*
+ * This function computes the TSC offset that is stored in VMCS02 when entering
+ * L2 by combining the offset and multiplier values of VMCS01 and VMCS12.
+ */
+u64 kvm_compute_02_tsc_offset(u64 l1_offset, u64 l2_multiplier, u64 l2_offset)
+{
+	u64 offset;
+
+	/*
+	 * The L1 offset is interpreted as a signed number by the CPU and can
+	 * be negative. So we extract the sign before the multiplication and
+	 * put it back afterwards if needed.
+	 */
+	offset = mul_u64_u64_shr(abs((s64) l1_offset),
+				 l2_multiplier,
+				 kvm_tsc_scaling_ratio_frac_bits);
+
+	if ((s64) l1_offset < 0)
+		offset = -((s64) offset);
+
+	offset += l2_offset;
+	return offset;
+}
+EXPORT_SYMBOL_GPL(kvm_compute_02_tsc_offset);
+
 u64 kvm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc)
 {
 	return vcpu->arch.l1_tsc_offset + kvm_scale_tsc(vcpu, host_tsc, true);
-- 
2.17.1



* [PATCH 5/8] KVM: X86: Move tracing outside write_l1_tsc_offset()
  2021-05-06 10:32 [PATCH 0/8] KVM: VMX: Implement nested TSC scaling ilstam
                   ` (3 preceding siblings ...)
  2021-05-06 10:32 ` [PATCH 4/8] KVM: VMX: Adjust the TSC-related VMCS fields on L2 entry and exit ilstam
@ 2021-05-06 10:32 ` ilstam
  2021-05-10 13:54   ` Maxim Levitsky
  2021-05-06 10:32 ` [PATCH 6/8] KVM: VMX: Make vmx_write_l1_tsc_offset() work with nested TSC scaling ilstam
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 36+ messages in thread
From: ilstam @ 2021-05-06 10:32 UTC (permalink / raw)
  To: kvm, pbonzini
  Cc: ilstam, seanjc, vkuznets, wanpengli, jmattson, joro,
	haozhong.zhang, zamsden, mtosatti, dplotnikov, dwmw

From: Ilias Stamatis <ilstam@amazon.com>

A subsequent patch fixes write_l1_tsc_offset() to account for nested TSC
scaling, which makes calculating the L1 TSC for the trace call more complex
there.

Move the trace call to the caller instead. This also avoids duplicating the
call in both the VMX and SVM implementations.

Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
---
 arch/x86/kvm/svm/svm.c | 4 ----
 arch/x86/kvm/vmx/vmx.c | 3 ---
 arch/x86/kvm/x86.c     | 4 ++++
 3 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 9790c73f2a32..d2f9d6a9716f 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1090,10 +1090,6 @@ static u64 svm_write_l1_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
 		svm->vmcb01.ptr->control.tsc_offset = offset;
 	}
 
-	trace_kvm_write_tsc_offset(vcpu->vcpu_id,
-				   svm->vmcb->control.tsc_offset - g_tsc_offset,
-				   offset);
-
 	svm->vmcb->control.tsc_offset = offset + g_tsc_offset;
 
 	vmcb_mark_dirty(svm->vmcb, VMCB_INTERCEPTS);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index cbe0cdade38a..49241423b854 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1812,9 +1812,6 @@ static u64 vmx_write_l1_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
 	    (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETTING))
 		g_tsc_offset = vmcs12->tsc_offset;
 
-	trace_kvm_write_tsc_offset(vcpu->vcpu_id,
-				   vcpu->arch.tsc_offset - g_tsc_offset,
-				   offset);
 	vmcs_write64(TSC_OFFSET, offset + g_tsc_offset);
 	return offset + g_tsc_offset;
 }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 87deb119c521..c08295bcf50e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2299,6 +2299,10 @@ EXPORT_SYMBOL_GPL(kvm_read_l1_tsc);
 
 static void kvm_vcpu_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
 {
+	trace_kvm_write_tsc_offset(vcpu->vcpu_id,
+				   vcpu->arch.l1_tsc_offset,
+				   offset);
+
 	vcpu->arch.l1_tsc_offset = offset;
 	vcpu->arch.tsc_offset = static_call(kvm_x86_write_l1_tsc_offset)(vcpu, offset);
 }
-- 
2.17.1



* [PATCH 6/8] KVM: VMX: Make vmx_write_l1_tsc_offset() work with nested TSC scaling
  2021-05-06 10:32 [PATCH 0/8] KVM: VMX: Implement nested TSC scaling ilstam
                   ` (4 preceding siblings ...)
  2021-05-06 10:32 ` [PATCH 5/8] KVM: X86: Move tracing outside write_l1_tsc_offset() ilstam
@ 2021-05-06 10:32 ` ilstam
  2021-05-10 13:54   ` Maxim Levitsky
  2021-05-06 10:32 ` [PATCH 7/8] KVM: VMX: Expose TSC scaling to L2 ilstam
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 36+ messages in thread
From: ilstam @ 2021-05-06 10:32 UTC (permalink / raw)
  To: kvm, pbonzini
  Cc: ilstam, seanjc, vkuznets, wanpengli, jmattson, joro,
	haozhong.zhang, zamsden, mtosatti, dplotnikov, dwmw

From: Ilias Stamatis <ilstam@amazon.com>

When nested TSC scaling is in use, the current TSC offset can no longer be
computed by simply adding the VMCS12 offset to the L1 offset: the L1 offset
must first be scaled by the L2 multiplier.

Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
---
 arch/x86/kvm/vmx/vmx.c | 26 ++++++++++++++++++++------
 1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 49241423b854..df7dc0e4c903 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1797,10 +1797,16 @@ static void setup_msrs(struct vcpu_vmx *vmx)
 		vmx_update_msr_bitmap(&vmx->vcpu);
 }
 
-static u64 vmx_write_l1_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
+/*
+ * This function receives the requested offset for L1 as an argument but it
+ * actually writes the "current" tsc offset to the VMCS and returns it. The
+ * current offset might be different in case an L2 guest is currently running
+ * and its VMCS02 is loaded.
+ */
+static u64 vmx_write_l1_tsc_offset(struct kvm_vcpu *vcpu, u64 l1_offset)
 {
 	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
-	u64 g_tsc_offset = 0;
+	u64 cur_offset = l1_offset;
 
 	/*
 	 * We're here if L1 chose not to trap WRMSR to TSC. According
@@ -1809,11 +1815,19 @@ static u64 vmx_write_l1_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
 	 * to the newly set TSC to get L2's TSC.
 	 */
 	if (is_guest_mode(vcpu) &&
-	    (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETTING))
-		g_tsc_offset = vmcs12->tsc_offset;
+	    (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETTING)) {
+		if (vmcs12->secondary_vm_exec_control & SECONDARY_EXEC_TSC_SCALING) {
+			cur_offset = kvm_compute_02_tsc_offset(
+					l1_offset,
+					vmcs12->tsc_multiplier,
+					vmcs12->tsc_offset);
+		} else {
+			cur_offset = l1_offset + vmcs12->tsc_offset;
+		}
+	}
 
-	vmcs_write64(TSC_OFFSET, offset + g_tsc_offset);
-	return offset + g_tsc_offset;
+	vmcs_write64(TSC_OFFSET, cur_offset);
+	return cur_offset;
 }
 
 /*
-- 
2.17.1



* [PATCH 7/8] KVM: VMX: Expose TSC scaling to L2
  2021-05-06 10:32 [PATCH 0/8] KVM: VMX: Implement nested TSC scaling ilstam
                   ` (5 preceding siblings ...)
  2021-05-06 10:32 ` [PATCH 6/8] KVM: VMX: Make vmx_write_l1_tsc_offset() work with nested TSC scaling ilstam
@ 2021-05-06 10:32 ` ilstam
  2021-05-10 13:56   ` Maxim Levitsky
  2021-05-06 10:32 ` [PATCH 8/8] KVM: selftests: x86: Add vmx_nested_tsc_scaling_test ilstam
  2021-05-06 17:16 ` [PATCH 0/8] KVM: VMX: Implement nested TSC scaling Jim Mattson
  8 siblings, 1 reply; 36+ messages in thread
From: ilstam @ 2021-05-06 10:32 UTC (permalink / raw)
  To: kvm, pbonzini
  Cc: ilstam, seanjc, vkuznets, wanpengli, jmattson, joro,
	haozhong.zhang, zamsden, mtosatti, dplotnikov, dwmw

From: Ilias Stamatis <ilstam@amazon.com>

Expose the TSC scaling feature to nested guests.

Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
---
 arch/x86/kvm/vmx/nested.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index a1bf28f33837..639cb9462103 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -2277,7 +2277,8 @@ static void prepare_vmcs02_early(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
 				  SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE |
 				  SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
 				  SECONDARY_EXEC_APIC_REGISTER_VIRT |
-				  SECONDARY_EXEC_ENABLE_VMFUNC);
+				  SECONDARY_EXEC_ENABLE_VMFUNC |
+				  SECONDARY_EXEC_TSC_SCALING);
 		if (nested_cpu_has(vmcs12,
 				   CPU_BASED_ACTIVATE_SECONDARY_CONTROLS))
 			exec_control |= vmcs12->secondary_vm_exec_control;
@@ -6483,7 +6484,8 @@ void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps)
 		SECONDARY_EXEC_RDRAND_EXITING |
 		SECONDARY_EXEC_ENABLE_INVPCID |
 		SECONDARY_EXEC_RDSEED_EXITING |
-		SECONDARY_EXEC_XSAVES;
+		SECONDARY_EXEC_XSAVES |
+		SECONDARY_EXEC_TSC_SCALING;
 
 	/*
 	 * We can emulate "VMCS shadowing," even if the hardware
-- 
2.17.1



* [PATCH 8/8] KVM: selftests: x86: Add vmx_nested_tsc_scaling_test
  2021-05-06 10:32 [PATCH 0/8] KVM: VMX: Implement nested TSC scaling ilstam
                   ` (6 preceding siblings ...)
  2021-05-06 10:32 ` [PATCH 7/8] KVM: VMX: Expose TSC scaling to L2 ilstam
@ 2021-05-06 10:32 ` ilstam
  2021-05-10 13:59   ` Maxim Levitsky
  2021-05-06 17:16 ` [PATCH 0/8] KVM: VMX: Implement nested TSC scaling Jim Mattson
  8 siblings, 1 reply; 36+ messages in thread
From: ilstam @ 2021-05-06 10:32 UTC (permalink / raw)
  To: kvm, pbonzini
  Cc: ilstam, seanjc, vkuznets, wanpengli, jmattson, joro,
	haozhong.zhang, zamsden, mtosatti, dplotnikov, dwmw

From: Ilias Stamatis <ilstam@amazon.com>

Test that nested TSC scaling works as expected with both L1 and L2
scaled.

Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
---
 tools/testing/selftests/kvm/.gitignore        |   1 +
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../kvm/x86_64/vmx_nested_tsc_scaling_test.c  | 209 ++++++++++++++++++
 3 files changed, 211 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/x86_64/vmx_nested_tsc_scaling_test.c

diff --git a/tools/testing/selftests/kvm/.gitignore b/tools/testing/selftests/kvm/.gitignore
index bd83158e0e0b..cc02022f9951 100644
--- a/tools/testing/selftests/kvm/.gitignore
+++ b/tools/testing/selftests/kvm/.gitignore
@@ -29,6 +29,7 @@
 /x86_64/vmx_preemption_timer_test
 /x86_64/vmx_set_nested_state_test
 /x86_64/vmx_tsc_adjust_test
+/x86_64/vmx_nested_tsc_scaling_test
 /x86_64/xapic_ipi_test
 /x86_64/xen_shinfo_test
 /x86_64/xen_vmcall_test
diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index e439d027939d..1078240b1313 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -60,6 +60,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/vmx_close_while_nested_test
 TEST_GEN_PROGS_x86_64 += x86_64/vmx_dirty_log_test
 TEST_GEN_PROGS_x86_64 += x86_64/vmx_set_nested_state_test
 TEST_GEN_PROGS_x86_64 += x86_64/vmx_tsc_adjust_test
+TEST_GEN_PROGS_x86_64 += x86_64/vmx_nested_tsc_scaling_test
 TEST_GEN_PROGS_x86_64 += x86_64/xapic_ipi_test
 TEST_GEN_PROGS_x86_64 += x86_64/xss_msr_test
 TEST_GEN_PROGS_x86_64 += x86_64/debug_regs
diff --git a/tools/testing/selftests/kvm/x86_64/vmx_nested_tsc_scaling_test.c b/tools/testing/selftests/kvm/x86_64/vmx_nested_tsc_scaling_test.c
new file mode 100644
index 000000000000..b05f5151ecbe
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/vmx_nested_tsc_scaling_test.c
@@ -0,0 +1,209 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * vmx_nested_tsc_scaling_test
+ *
+ * Copyright (C) 2021 Amazon.com, Inc. or its affiliates.
+ *
+ * This test case verifies that nested TSC scaling behaves as expected when
+ * both L1 and L2 are scaled using different ratios. For this test we scale
+ * L1 down and scale L2 up.
+ */
+
+
+#include "kvm_util.h"
+#include "vmx.h"
+#include "kselftest.h"
+
+
+#define VCPU_ID 0
+
+/* L1 is scaled down by this factor */
+#define L1_SCALE_FACTOR 2ULL
+/* L2 is scaled up (from L1's perspective) by this factor */
+#define L2_SCALE_FACTOR 4ULL
+
+#define TSC_OFFSET_L2 (1UL << 32)
+#define TSC_MULTIPLIER_L2 (L2_SCALE_FACTOR << 48)
+
+#define L2_GUEST_STACK_SIZE 64
+
+enum { USLEEP, UCHECK_L1, UCHECK_L2 };
+#define GUEST_SLEEP(sec)         ucall(UCALL_SYNC, 2, USLEEP, sec)
+#define GUEST_CHECK(level, freq) ucall(UCALL_SYNC, 2, level, freq)
+
+
+/*
+ * This function checks whether the "actual" TSC frequency of a guest matches
+ * its expected frequency. In order to account for delays in taking the TSC
+ * measurements, a difference of 1% between the actual and the expected value
+ * is tolerated.
+ */
+static void compare_tsc_freq(uint64_t actual, uint64_t expected)
+{
+	uint64_t tolerance, thresh_low, thresh_high;
+
+	tolerance = expected / 100;
+	thresh_low = expected - tolerance;
+	thresh_high = expected + tolerance;
+
+	TEST_ASSERT(thresh_low < actual,
+		"TSC freq is expected to be between %"PRIu64" and %"PRIu64
+		" but it actually is %"PRIu64,
+		thresh_low, thresh_high, actual);
+	TEST_ASSERT(thresh_high > actual,
+		"TSC freq is expected to be between %"PRIu64" and %"PRIu64
+		" but it actually is %"PRIu64,
+		thresh_low, thresh_high, actual);
+}
+
+static void check_tsc_freq(int level)
+{
+	uint64_t tsc_start, tsc_end, tsc_freq;
+
+	/*
+	 * Reading the TSC twice with about a second's difference should give
+	 * us an approximation of the TSC frequency from the guest's
+	 * perspective. Now, this won't be completely accurate, but it should
+	 * be good enough for the purposes of this test.
+	 */
+	tsc_start = rdmsr(MSR_IA32_TSC);
+	GUEST_SLEEP(1);
+	tsc_end = rdmsr(MSR_IA32_TSC);
+
+	tsc_freq = tsc_end - tsc_start;
+
+	GUEST_CHECK(level, tsc_freq);
+}
+
+static void l2_guest_code(void)
+{
+	check_tsc_freq(UCHECK_L2);
+
+	/* exit to L1 */
+	__asm__ __volatile__("vmcall");
+}
+
+static void l1_guest_code(struct vmx_pages *vmx_pages)
+{
+	unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE];
+	uint32_t control;
+
+	/* check that L1's frequency looks alright before launching L2 */
+	check_tsc_freq(UCHECK_L1);
+
+	GUEST_ASSERT(prepare_for_vmx_operation(vmx_pages));
+	GUEST_ASSERT(load_vmcs(vmx_pages));
+
+	/* prepare the VMCS for L2 execution */
+	prepare_vmcs(vmx_pages, l2_guest_code, &l2_guest_stack[L2_GUEST_STACK_SIZE]);
+
+	/* enable TSC offsetting and TSC scaling for L2 */
+	control = vmreadz(CPU_BASED_VM_EXEC_CONTROL);
+	control |= CPU_BASED_USE_MSR_BITMAPS | CPU_BASED_USE_TSC_OFFSETTING;
+	vmwrite(CPU_BASED_VM_EXEC_CONTROL, control);
+
+	control = vmreadz(SECONDARY_VM_EXEC_CONTROL);
+	control |= SECONDARY_EXEC_TSC_SCALING;
+	vmwrite(SECONDARY_VM_EXEC_CONTROL, control);
+
+	vmwrite(TSC_OFFSET, TSC_OFFSET_L2);
+	vmwrite(TSC_MULTIPLIER, TSC_MULTIPLIER_L2);
+	vmwrite(TSC_MULTIPLIER_HIGH, TSC_MULTIPLIER_L2 >> 32);
+
+	/* launch L2 */
+	GUEST_ASSERT(!vmlaunch());
+	GUEST_ASSERT(vmreadz(VM_EXIT_REASON) == EXIT_REASON_VMCALL);
+
+	/* check that L1's frequency still looks good */
+	check_tsc_freq(UCHECK_L1);
+
+	GUEST_DONE();
+}
+
+static void tsc_scaling_check_supported(void)
+{
+	if (!kvm_check_cap(KVM_CAP_TSC_CONTROL)) {
+		print_skip("TSC scaling not supported by the HW");
+		exit(KSFT_SKIP);
+	}
+}
+
+int main(int argc, char *argv[])
+{
+	struct kvm_vm *vm;
+	vm_vaddr_t vmx_pages_gva;
+
+	uint64_t tsc_start, tsc_end;
+	uint64_t tsc_khz;
+	uint64_t l0_tsc_freq = 0;
+	uint64_t l1_tsc_freq = 0;
+	uint64_t l2_tsc_freq = 0;
+
+	nested_vmx_check_supported();
+	tsc_scaling_check_supported();
+
+	tsc_start = rdtsc();
+	sleep(1);
+	tsc_end = rdtsc();
+
+	l0_tsc_freq = tsc_end - tsc_start;
+	printf("real TSC frequency is around: %"PRIu64"\n", l0_tsc_freq);
+
+	vm = vm_create_default(VCPU_ID, 0, (void *) l1_guest_code);
+	vcpu_alloc_vmx(vm, &vmx_pages_gva);
+	vcpu_args_set(vm, VCPU_ID, 1, vmx_pages_gva);
+
+	tsc_khz = _vcpu_ioctl(vm, VCPU_ID, KVM_GET_TSC_KHZ, NULL);
+	TEST_ASSERT(tsc_khz != -1, "vcpu ioctl KVM_GET_TSC_KHZ failed");
+
+	/* scale down L1's TSC frequency */
+	vcpu_ioctl(vm, VCPU_ID, KVM_SET_TSC_KHZ,
+		  (void *) (tsc_khz / L1_SCALE_FACTOR));
+
+	for (;;) {
+		volatile struct kvm_run *run = vcpu_state(vm, VCPU_ID);
+		struct ucall uc;
+
+		vcpu_run(vm, VCPU_ID);
+		TEST_ASSERT(run->exit_reason == KVM_EXIT_IO,
+			    "Got exit_reason other than KVM_EXIT_IO: %u (%s)\n",
+			    run->exit_reason,
+			    exit_reason_str(run->exit_reason));
+
+		switch (get_ucall(vm, VCPU_ID, &uc)) {
+		case UCALL_ABORT:
+			TEST_FAIL("%s", (const char *) uc.args[0]);
+		case UCALL_SYNC:
+			switch (uc.args[0]) {
+			case USLEEP:
+				sleep(uc.args[1]);
+				break;
+			case UCHECK_L1:
+				l1_tsc_freq = uc.args[1];
+				printf("L1's TSC frequency is around: %"PRIu64
+				       "\n", l1_tsc_freq);
+
+				compare_tsc_freq(l1_tsc_freq,
+						 l0_tsc_freq / L1_SCALE_FACTOR);
+				break;
+			case UCHECK_L2:
+				l2_tsc_freq = uc.args[1];
+				printf("L2's TSC frequency is around: %"PRIu64
+				       "\n", l2_tsc_freq);
+
+				compare_tsc_freq(l2_tsc_freq,
+						 l1_tsc_freq * L2_SCALE_FACTOR);
+				break;
+			}
+			break;
+		case UCALL_DONE:
+			goto done;
+		default:
+			TEST_FAIL("Unknown ucall %lu", uc.cmd);
+		}
+	}
+
+done:
+	kvm_vm_free(vm);
+	return 0;
+}
-- 
2.17.1


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/8] KVM: VMX: Adjust the TSC-related VMCS fields on L2 entry and exit
  2021-05-06 10:32 ` [PATCH 4/8] KVM: VMX: Adjust the TSC-related VMCS fields on L2 entry and exit ilstam
@ 2021-05-06 11:32   ` Paolo Bonzini
  2021-05-06 17:35     ` Stamatis, Ilias
  2021-05-10 13:53   ` Maxim Levitsky
  1 sibling, 1 reply; 36+ messages in thread
From: Paolo Bonzini @ 2021-05-06 11:32 UTC (permalink / raw)
  To: ilstam, kvm
  Cc: ilstam, seanjc, vkuznets, wanpengli, jmattson, joro,
	haozhong.zhang, zamsden, mtosatti, dplotnikov, dwmw

On 06/05/21 12:32, ilstam@mailbox.org wrote:
> +	if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETTING) {
> +		if (vmcs12->secondary_vm_exec_control & SECONDARY_EXEC_TSC_SCALING) {
> +			vcpu->arch.tsc_offset = kvm_compute_02_tsc_offset(
> +					vcpu->arch.l1_tsc_offset,
> +					vmcs12->tsc_multiplier,
> +					vmcs12->tsc_offset);
> +
> +			vcpu->arch.tsc_scaling_ratio = mul_u64_u64_shr(
> +					vcpu->arch.tsc_scaling_ratio,
> +					vmcs12->tsc_multiplier,
> +					kvm_tsc_scaling_ratio_frac_bits);
> +		} else {
> +			vcpu->arch.tsc_offset += vmcs12->tsc_offset;
> +		}

The computation of vcpu->arch.tsc_offset is (not coincidentially) the
same that appears in patch 6

+	    (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETTING)) {
+		if (vmcs12->secondary_vm_exec_control & SECONDARY_EXEC_TSC_SCALING) {
+			cur_offset = kvm_compute_02_tsc_offset(
+					l1_offset,
+					vmcs12->tsc_multiplier,
+					vmcs12->tsc_offset);
+		} else {
+			cur_offset = l1_offset + vmcs12->tsc_offset;

So I think you should just pass vmcs12 and the L1 offset to
kvm_compute_02_tsc_offset, and let it handle both cases (and possibly
even set vcpu->arch.tsc_scaling_ratio in the same function).

Paolo



* Re: [PATCH 1/8] KVM: VMX: Add a TSC multiplier field in VMCS12
  2021-05-06 10:32 ` [PATCH 1/8] KVM: VMX: Add a TSC multiplier field in VMCS12 ilstam
@ 2021-05-06 14:50   ` kernel test robot
  2021-05-06 17:36   ` Jim Mattson
  2021-05-10 13:42   ` Maxim Levitsky
  2 siblings, 0 replies; 36+ messages in thread
From: kernel test robot @ 2021-05-06 14:50 UTC (permalink / raw)
  To: ilstam, kvm, pbonzini
  Cc: kbuild-all, ilstam, seanjc, vkuznets, wanpengli, jmattson, joro,
	haozhong.zhang, zamsden

[-- Attachment #1: Type: text/plain, Size: 14466 bytes --]

Hi,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on kvm/queue]
[also build test WARNING on next-20210506]
[cannot apply to vhost/linux-next v5.12]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/ilstam-mailbox-org/KVM-VMX-Implement-nested-TSC-scaling/20210506-183826
base:   https://git.kernel.org/pub/scm/virt/kvm/kvm.git queue
config: x86_64-randconfig-s021-20210506 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.3-341-g8af24329-dirty
        # https://github.com/0day-ci/linux/commit/e4ab50e01366b1f40fd84b375b0c93701367af26
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review ilstam-mailbox-org/KVM-VMX-Implement-nested-TSC-scaling/20210506-183826
        git checkout e4ab50e01366b1f40fd84b375b0c93701367af26
        # save the attached .config to linux build tree
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' W=1 ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>


sparse warnings: (new ones prefixed by >>)
   arch/x86/kvm/vmx/vmcs12.c:15:9: sparse: sparse: cast truncates bits from constant value (20002 becomes 2)
   arch/x86/kvm/vmx/vmcs12.c:16:9: sparse: sparse: cast truncates bits from constant value (20082 becomes 82)
   arch/x86/kvm/vmx/vmcs12.c:17:9: sparse: sparse: cast truncates bits from constant value (20102 becomes 102)
   arch/x86/kvm/vmx/vmcs12.c:18:9: sparse: sparse: cast truncates bits from constant value (20182 becomes 182)
   arch/x86/kvm/vmx/vmcs12.c:19:9: sparse: sparse: cast truncates bits from constant value (20202 becomes 202)
   arch/x86/kvm/vmx/vmcs12.c:20:9: sparse: sparse: cast truncates bits from constant value (20282 becomes 282)
   arch/x86/kvm/vmx/vmcs12.c:21:9: sparse: sparse: cast truncates bits from constant value (20302 becomes 302)
   arch/x86/kvm/vmx/vmcs12.c:22:9: sparse: sparse: cast truncates bits from constant value (20382 becomes 382)
   arch/x86/kvm/vmx/vmcs12.c:23:9: sparse: sparse: cast truncates bits from constant value (20402 becomes 402)
   arch/x86/kvm/vmx/vmcs12.c:24:9: sparse: sparse: cast truncates bits from constant value (20482 becomes 482)
   arch/x86/kvm/vmx/vmcs12.c:25:9: sparse: sparse: cast truncates bits from constant value (30003 becomes 3)
   arch/x86/kvm/vmx/vmcs12.c:26:9: sparse: sparse: cast truncates bits from constant value (30083 becomes 83)
   arch/x86/kvm/vmx/vmcs12.c:27:9: sparse: sparse: cast truncates bits from constant value (30103 becomes 103)
   arch/x86/kvm/vmx/vmcs12.c:28:9: sparse: sparse: cast truncates bits from constant value (30183 becomes 183)
   arch/x86/kvm/vmx/vmcs12.c:29:9: sparse: sparse: cast truncates bits from constant value (30203 becomes 203)
   arch/x86/kvm/vmx/vmcs12.c:30:9: sparse: sparse: cast truncates bits from constant value (30283 becomes 283)
   arch/x86/kvm/vmx/vmcs12.c:31:9: sparse: sparse: cast truncates bits from constant value (30303 becomes 303)
   arch/x86/kvm/vmx/vmcs12.c:32:9: sparse: sparse: cast truncates bits from constant value (80008 becomes 8)
   arch/x86/kvm/vmx/vmcs12.c:32:9: sparse: sparse: cast truncates bits from constant value (80048 becomes 48)
   arch/x86/kvm/vmx/vmcs12.c:33:9: sparse: sparse: cast truncates bits from constant value (80088 becomes 88)
   arch/x86/kvm/vmx/vmcs12.c:33:9: sparse: sparse: cast truncates bits from constant value (800c8 becomes c8)
   arch/x86/kvm/vmx/vmcs12.c:34:9: sparse: sparse: cast truncates bits from constant value (80108 becomes 108)
   arch/x86/kvm/vmx/vmcs12.c:34:9: sparse: sparse: cast truncates bits from constant value (80148 becomes 148)
   arch/x86/kvm/vmx/vmcs12.c:35:9: sparse: sparse: cast truncates bits from constant value (80188 becomes 188)
   arch/x86/kvm/vmx/vmcs12.c:35:9: sparse: sparse: cast truncates bits from constant value (801c8 becomes 1c8)
   arch/x86/kvm/vmx/vmcs12.c:36:9: sparse: sparse: cast truncates bits from constant value (80208 becomes 208)
   arch/x86/kvm/vmx/vmcs12.c:36:9: sparse: sparse: cast truncates bits from constant value (80248 becomes 248)
   arch/x86/kvm/vmx/vmcs12.c:37:9: sparse: sparse: cast truncates bits from constant value (80288 becomes 288)
   arch/x86/kvm/vmx/vmcs12.c:37:9: sparse: sparse: cast truncates bits from constant value (802c8 becomes 2c8)
   arch/x86/kvm/vmx/vmcs12.c:38:9: sparse: sparse: cast truncates bits from constant value (80388 becomes 388)
   arch/x86/kvm/vmx/vmcs12.c:38:9: sparse: sparse: cast truncates bits from constant value (803c8 becomes 3c8)
   arch/x86/kvm/vmx/vmcs12.c:39:9: sparse: sparse: cast truncates bits from constant value (80408 becomes 408)
   arch/x86/kvm/vmx/vmcs12.c:39:9: sparse: sparse: cast truncates bits from constant value (80448 becomes 448)
>> arch/x86/kvm/vmx/vmcs12.c:40:9: sparse: sparse: cast truncates bits from constant value (80c88 becomes c88)
>> arch/x86/kvm/vmx/vmcs12.c:40:9: sparse: sparse: cast truncates bits from constant value (80cc8 becomes cc8)
   arch/x86/kvm/vmx/vmcs12.c:41:9: sparse: sparse: cast truncates bits from constant value (80488 becomes 488)
   arch/x86/kvm/vmx/vmcs12.c:41:9: sparse: sparse: cast truncates bits from constant value (804c8 becomes 4c8)
   arch/x86/kvm/vmx/vmcs12.c:42:9: sparse: sparse: cast truncates bits from constant value (80508 becomes 508)
   arch/x86/kvm/vmx/vmcs12.c:42:9: sparse: sparse: cast truncates bits from constant value (80548 becomes 548)
   arch/x86/kvm/vmx/vmcs12.c:43:9: sparse: sparse: cast truncates bits from constant value (80588 becomes 588)
   arch/x86/kvm/vmx/vmcs12.c:43:9: sparse: sparse: cast truncates bits from constant value (805c8 becomes 5c8)
   arch/x86/kvm/vmx/vmcs12.c:44:9: sparse: sparse: cast truncates bits from constant value (80608 becomes 608)
   arch/x86/kvm/vmx/vmcs12.c:44:9: sparse: sparse: cast truncates bits from constant value (80648 becomes 648)
   arch/x86/kvm/vmx/vmcs12.c:45:9: sparse: sparse: cast truncates bits from constant value (80688 becomes 688)
   arch/x86/kvm/vmx/vmcs12.c:45:9: sparse: sparse: cast truncates bits from constant value (806c8 becomes 6c8)
   arch/x86/kvm/vmx/vmcs12.c:46:9: sparse: sparse: cast truncates bits from constant value (80708 becomes 708)
   arch/x86/kvm/vmx/vmcs12.c:46:9: sparse: sparse: cast truncates bits from constant value (80748 becomes 748)
   arch/x86/kvm/vmx/vmcs12.c:47:9: sparse: sparse: cast truncates bits from constant value (80788 becomes 788)
   arch/x86/kvm/vmx/vmcs12.c:47:9: sparse: sparse: cast truncates bits from constant value (807c8 becomes 7c8)
   arch/x86/kvm/vmx/vmcs12.c:48:9: sparse: sparse: cast truncates bits from constant value (80808 becomes 808)
   arch/x86/kvm/vmx/vmcs12.c:48:9: sparse: sparse: cast truncates bits from constant value (80848 becomes 848)
   arch/x86/kvm/vmx/vmcs12.c:49:9: sparse: sparse: cast truncates bits from constant value (80888 becomes 888)
   arch/x86/kvm/vmx/vmcs12.c:49:9: sparse: sparse: cast truncates bits from constant value (808c8 becomes 8c8)
   arch/x86/kvm/vmx/vmcs12.c:50:9: sparse: sparse: cast truncates bits from constant value (80908 becomes 908)
   arch/x86/kvm/vmx/vmcs12.c:50:9: sparse: sparse: cast truncates bits from constant value (80948 becomes 948)
   arch/x86/kvm/vmx/vmcs12.c:51:9: sparse: sparse: cast truncates bits from constant value (80988 becomes 988)
   arch/x86/kvm/vmx/vmcs12.c:51:9: sparse: sparse: cast truncates bits from constant value (809c8 becomes 9c8)
   arch/x86/kvm/vmx/vmcs12.c:52:9: sparse: sparse: cast truncates bits from constant value (80a08 becomes a08)
   arch/x86/kvm/vmx/vmcs12.c:52:9: sparse: sparse: cast truncates bits from constant value (80a48 becomes a48)
   arch/x86/kvm/vmx/vmcs12.c:53:9: sparse: sparse: cast truncates bits from constant value (80b08 becomes b08)
   arch/x86/kvm/vmx/vmcs12.c:53:9: sparse: sparse: cast truncates bits from constant value (80b48 becomes b48)
   arch/x86/kvm/vmx/vmcs12.c:54:9: sparse: sparse: cast truncates bits from constant value (80b88 becomes b88)
   arch/x86/kvm/vmx/vmcs12.c:54:9: sparse: sparse: cast truncates bits from constant value (80bc8 becomes bc8)
   arch/x86/kvm/vmx/vmcs12.c:55:9: sparse: sparse: cast truncates bits from constant value (90009 becomes 9)
   arch/x86/kvm/vmx/vmcs12.c:55:9: sparse: sparse: cast truncates bits from constant value (90049 becomes 49)
   arch/x86/kvm/vmx/vmcs12.c:56:9: sparse: sparse: cast truncates bits from constant value (a000a becomes a)
   arch/x86/kvm/vmx/vmcs12.c:56:9: sparse: sparse: cast truncates bits from constant value (a004a becomes 4a)
   arch/x86/kvm/vmx/vmcs12.c:57:9: sparse: sparse: cast truncates bits from constant value (a008a becomes 8a)
   arch/x86/kvm/vmx/vmcs12.c:57:9: sparse: sparse: cast truncates bits from constant value (a00ca becomes ca)
   arch/x86/kvm/vmx/vmcs12.c:58:9: sparse: sparse: cast truncates bits from constant value (a010a becomes 10a)
   arch/x86/kvm/vmx/vmcs12.c:58:9: sparse: sparse: cast truncates bits from constant value (a014a becomes 14a)
   arch/x86/kvm/vmx/vmcs12.c:59:9: sparse: sparse: cast truncates bits from constant value (a018a becomes 18a)
   arch/x86/kvm/vmx/vmcs12.c:59:9: sparse: sparse: cast truncates bits from constant value (a01ca becomes 1ca)
   arch/x86/kvm/vmx/vmcs12.c:60:9: sparse: sparse: cast truncates bits from constant value (a020a becomes 20a)
   arch/x86/kvm/vmx/vmcs12.c:60:9: sparse: sparse: cast truncates bits from constant value (a024a becomes 24a)
   arch/x86/kvm/vmx/vmcs12.c:61:9: sparse: sparse: cast truncates bits from constant value (a028a becomes 28a)
   arch/x86/kvm/vmx/vmcs12.c:61:9: sparse: sparse: cast truncates bits from constant value (a02ca becomes 2ca)
   arch/x86/kvm/vmx/vmcs12.c:62:9: sparse: sparse: cast truncates bits from constant value (a030a becomes 30a)
   arch/x86/kvm/vmx/vmcs12.c:62:9: sparse: sparse: cast truncates bits from constant value (a034a becomes 34a)
   arch/x86/kvm/vmx/vmcs12.c:63:9: sparse: sparse: cast truncates bits from constant value (a038a becomes 38a)
   arch/x86/kvm/vmx/vmcs12.c:63:9: sparse: sparse: cast truncates bits from constant value (a03ca becomes 3ca)
   arch/x86/kvm/vmx/vmcs12.c:64:9: sparse: sparse: cast truncates bits from constant value (a040a becomes 40a)
   arch/x86/kvm/vmx/vmcs12.c:64:9: sparse: sparse: cast truncates bits from constant value (a044a becomes 44a)
   arch/x86/kvm/vmx/vmcs12.c:65:9: sparse: sparse: cast truncates bits from constant value (a048a becomes 48a)
   arch/x86/kvm/vmx/vmcs12.c:65:9: sparse: sparse: cast truncates bits from constant value (a04ca becomes 4ca)
   arch/x86/kvm/vmx/vmcs12.c:66:9: sparse: sparse: cast truncates bits from constant value (b000b becomes b)
   arch/x86/kvm/vmx/vmcs12.c:66:9: sparse: sparse: cast truncates bits from constant value (b004b becomes 4b)
   arch/x86/kvm/vmx/vmcs12.c:67:9: sparse: sparse: cast truncates bits from constant value (b008b becomes 8b)
   arch/x86/kvm/vmx/vmcs12.c:67:9: sparse: sparse: cast truncates bits from constant value (b00cb becomes cb)
   arch/x86/kvm/vmx/vmcs12.c:68:9: sparse: sparse: cast truncates bits from constant value (b010b becomes 10b)
   arch/x86/kvm/vmx/vmcs12.c:68:9: sparse: sparse: cast truncates bits from constant value (b014b becomes 14b)
   arch/x86/kvm/vmx/vmcs12.c:69:9: sparse: sparse: cast truncates bits from constant value (100010 becomes 10)
   arch/x86/kvm/vmx/vmcs12.c:70:9: sparse: sparse: cast truncates bits from constant value (100090 becomes 90)
   arch/x86/kvm/vmx/vmcs12.c:71:9: sparse: sparse: cast truncates bits from constant value (100110 becomes 110)
   arch/x86/kvm/vmx/vmcs12.c:72:9: sparse: sparse: cast truncates bits from constant value (100190 becomes 190)
   arch/x86/kvm/vmx/vmcs12.c:73:9: sparse: sparse: cast truncates bits from constant value (100210 becomes 210)
   arch/x86/kvm/vmx/vmcs12.c:74:9: sparse: sparse: cast truncates bits from constant value (100290 becomes 290)
   arch/x86/kvm/vmx/vmcs12.c:75:9: sparse: sparse: cast truncates bits from constant value (100310 becomes 310)
   arch/x86/kvm/vmx/vmcs12.c:76:9: sparse: sparse: cast truncates bits from constant value (100390 becomes 390)
   arch/x86/kvm/vmx/vmcs12.c:77:9: sparse: sparse: too many warnings

vim +40 arch/x86/kvm/vmx/vmcs12.c

     4	
     5	#define ROL16(val, n) ((u16)(((u16)(val) << (n)) | ((u16)(val) >> (16 - (n)))))
     6	#define VMCS12_OFFSET(x) offsetof(struct vmcs12, x)
     7	#define FIELD(number, name)	[ROL16(number, 6)] = VMCS12_OFFSET(name)
     8	#define FIELD64(number, name)						\
     9		FIELD(number, name),						\
    10		[ROL16(number##_HIGH, 6)] = VMCS12_OFFSET(name) + sizeof(u32)
    11	
    12	const unsigned short vmcs_field_to_offset_table[] = {
    13		FIELD(VIRTUAL_PROCESSOR_ID, virtual_processor_id),
    14		FIELD(POSTED_INTR_NV, posted_intr_nv),
    15		FIELD(GUEST_ES_SELECTOR, guest_es_selector),
    16		FIELD(GUEST_CS_SELECTOR, guest_cs_selector),
    17		FIELD(GUEST_SS_SELECTOR, guest_ss_selector),
    18		FIELD(GUEST_DS_SELECTOR, guest_ds_selector),
    19		FIELD(GUEST_FS_SELECTOR, guest_fs_selector),
    20		FIELD(GUEST_GS_SELECTOR, guest_gs_selector),
    21		FIELD(GUEST_LDTR_SELECTOR, guest_ldtr_selector),
    22		FIELD(GUEST_TR_SELECTOR, guest_tr_selector),
    23		FIELD(GUEST_INTR_STATUS, guest_intr_status),
    24		FIELD(GUEST_PML_INDEX, guest_pml_index),
    25		FIELD(HOST_ES_SELECTOR, host_es_selector),
    26		FIELD(HOST_CS_SELECTOR, host_cs_selector),
    27		FIELD(HOST_SS_SELECTOR, host_ss_selector),
    28		FIELD(HOST_DS_SELECTOR, host_ds_selector),
    29		FIELD(HOST_FS_SELECTOR, host_fs_selector),
    30		FIELD(HOST_GS_SELECTOR, host_gs_selector),
    31		FIELD(HOST_TR_SELECTOR, host_tr_selector),
    32		FIELD64(IO_BITMAP_A, io_bitmap_a),
    33		FIELD64(IO_BITMAP_B, io_bitmap_b),
    34		FIELD64(MSR_BITMAP, msr_bitmap),
    35		FIELD64(VM_EXIT_MSR_STORE_ADDR, vm_exit_msr_store_addr),
    36		FIELD64(VM_EXIT_MSR_LOAD_ADDR, vm_exit_msr_load_addr),
    37		FIELD64(VM_ENTRY_MSR_LOAD_ADDR, vm_entry_msr_load_addr),
    38		FIELD64(PML_ADDRESS, pml_address),
    39		FIELD64(TSC_OFFSET, tsc_offset),
  > 40		FIELD64(TSC_MULTIPLIER, tsc_multiplier),

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 38050 bytes --]


* Re: [PATCH 0/8] KVM: VMX: Implement nested TSC scaling
  2021-05-06 10:32 [PATCH 0/8] KVM: VMX: Implement nested TSC scaling ilstam
                   ` (7 preceding siblings ...)
  2021-05-06 10:32 ` [PATCH 8/8] KVM: selftests: x86: Add vmx_nested_tsc_scaling_test ilstam
@ 2021-05-06 17:16 ` Jim Mattson
  2021-05-06 17:48   ` Stamatis, Ilias
  2021-05-10 14:29   ` Paolo Bonzini
  8 siblings, 2 replies; 36+ messages in thread
From: Jim Mattson @ 2021-05-06 17:16 UTC (permalink / raw)
  To: ilstam
  Cc: kvm list, Paolo Bonzini, ilstam, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Joerg Roedel, Haozhong Zhang,
	zamsden, mtosatti, Denis Plotnikov, David Woodhouse

On Thu, May 6, 2021 at 3:34 AM <ilstam@mailbox.org> wrote:
>
> From: Ilias Stamatis <ilstam@amazon.com>
>
> KVM currently supports hardware-assisted TSC scaling but only for L1 and it
> doesn't expose the feature to nested guests. This patch series adds support for
> nested TSC scaling and allows both L1 and L2 to be scaled with different
> scaling factors.
>
> When scaling and offsetting is applied, the TSC for the guest is calculated as:
>
> (TSC * multiplier >> 48) + offset
>
> With nested scaling the values in VMCS01 and VMCS12 need to be merged
> together and stored in VMCS02.
>
> The VMCS02 values are calculated as follows:
>
> offset_02 = ((offset_01 * mult_12) >> 48) + offset_12
> mult_02 = (mult_01 * mult_12) >> 48
>
> The last patch of the series adds a KVM selftest.

Will you be doing the same for SVM? The last time I tried to add a
nested virtualization feature for Intel only, Paolo rapped my knuckles
with a ruler.


* Re: [PATCH 4/8] KVM: VMX: Adjust the TSC-related VMCS fields on L2 entry and exit
  2021-05-06 11:32   ` Paolo Bonzini
@ 2021-05-06 17:35     ` Stamatis, Ilias
  2021-05-10 14:11       ` Paolo Bonzini
  0 siblings, 1 reply; 36+ messages in thread
From: Stamatis, Ilias @ 2021-05-06 17:35 UTC (permalink / raw)
  To: kvm, pbonzini, ilstam
  Cc: jmattson, Woodhouse, David, vkuznets, haozhong.zhang, joro,
	mtosatti, zamsden, seanjc, dplotnikov, wanpengli

On Thu, 2021-05-06 at 13:32 +0200, Paolo Bonzini wrote:
> On 06/05/21 12:32, ilstam@mailbox.org wrote:
> > +     if (vmcs12->cpu_based_vm_exec_control &
> > CPU_BASED_USE_TSC_OFFSETTING) {
> > +             if (vmcs12->secondary_vm_exec_control &
> > SECONDARY_EXEC_TSC_SCALING) {
> > +                     vcpu->arch.tsc_offset =
> > kvm_compute_02_tsc_offset(
> > +                                     vcpu->arch.l1_tsc_offset,
> > +                                     vmcs12->tsc_multiplier,
> > +                                     vmcs12->tsc_offset);
> > +
> > +                     vcpu->arch.tsc_scaling_ratio =
> > mul_u64_u64_shr(
> > +                                     vcpu->arch.tsc_scaling_ratio,
> > +                                     vmcs12->tsc_multiplier,
> > +                                     kvm_tsc_scaling_ratio_frac_bit
> > s);
> > +             } else {
> > +                     vcpu->arch.tsc_offset += vmcs12->tsc_offset;
> > +             }
> 
> The computation of vcpu->arch.tsc_offset is (not coincidentially) the
> same that appears in patch 6
> 
> +           (vmcs12->cpu_based_vm_exec_control &
> CPU_BASED_USE_TSC_OFFSETTING)) {
> +               if (vmcs12->secondary_vm_exec_control &
> SECONDARY_EXEC_TSC_SCALING) {
> +                       cur_offset = kvm_compute_02_tsc_offset(
> +                                       l1_offset,
> +                                       vmcs12->tsc_multiplier,
> +                                       vmcs12->tsc_offset);
> +               } else {
> +                       cur_offset = l1_offset + vmcs12->tsc_offset;
> 
> So I think you should just pass vmcs12 and the L1 offset to
> kvm_compute_02_tsc_offset, and let it handle both cases (and possibly
> even set vcpu->arch.tsc_scaling_ratio in the same function).
> 
> Paolo
> 

That was my thinking initially too. However, kvm_compute_02_tsc_offset
is defined in x86.c, which is VMX-agnostic, and hence 'struct vmcs12' is
not defined there.

The way it is now, that function can be reused by the SVM code in case
we add AMD support later.

I could try to define a wrapper in vmx.c instead that calls
kvm_compute_02_tsc_offset, and move all the code there. However, I think
this requires many more changes to the vmx and svm write_l1_tsc_offset
functions and to their caller. I am looking at it now.

Ilias


* Re: [PATCH 1/8] KVM: VMX: Add a TSC multiplier field in VMCS12
  2021-05-06 10:32 ` [PATCH 1/8] KVM: VMX: Add a TSC multiplier field in VMCS12 ilstam
  2021-05-06 14:50   ` kernel test robot
@ 2021-05-06 17:36   ` Jim Mattson
  2021-05-10 13:42   ` Maxim Levitsky
  2 siblings, 0 replies; 36+ messages in thread
From: Jim Mattson @ 2021-05-06 17:36 UTC (permalink / raw)
  To: ilstam
  Cc: kvm list, Paolo Bonzini, ilstam, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Joerg Roedel, Haozhong Zhang,
	zamsden, mtosatti, Denis Plotnikov, David Woodhouse

On Thu, May 6, 2021 at 3:34 AM <ilstam@mailbox.org> wrote:
>
> From: Ilias Stamatis <ilstam@amazon.com>
>
> This is required for supporting nested TSC scaling.
>
> Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Reviewed-by: Jim Mattson <jmattson@google.com>


* Re: [PATCH 0/8] KVM: VMX: Implement nested TSC scaling
  2021-05-06 17:16 ` [PATCH 0/8] KVM: VMX: Implement nested TSC scaling Jim Mattson
@ 2021-05-06 17:48   ` Stamatis, Ilias
  2021-05-10 13:43     ` Maxim Levitsky
  2021-05-10 14:29   ` Paolo Bonzini
  1 sibling, 1 reply; 36+ messages in thread
From: Stamatis, Ilias @ 2021-05-06 17:48 UTC (permalink / raw)
  To: jmattson, ilstam
  Cc: kvm, Woodhouse, David, vkuznets, mtosatti, joro, zamsden, seanjc,
	pbonzini, wanpengli

On Thu, 2021-05-06 at 10:16 -0700, Jim Mattson wrote:
> On Thu, May 6, 2021 at 3:34 AM <ilstam@mailbox.org> wrote:
> > 
> > From: Ilias Stamatis <ilstam@amazon.com>
> > 
> > KVM currently supports hardware-assisted TSC scaling but only for L1
> > and it
> > doesn't expose the feature to nested guests. This patch series adds
> > support for
> > nested TSC scaling and allows both L1 and L2 to be scaled with
> > different
> > scaling factors.
> > 
> > When scaling and offsetting is applied, the TSC for the guest is
> > calculated as:
> > 
> > (TSC * multiplier >> 48) + offset
> > 
> > With nested scaling the values in VMCS01 and VMCS12 need to be
> > merged
> > together and stored in VMCS02.
> > 
> > The VMCS02 values are calculated as follows:
> > 
> > offset_02 = ((offset_01 * mult_12) >> 48) + offset_12
> > mult_02 = (mult_01 * mult_12) >> 48
> > 
> > The last patch of the series adds a KVM selftest.
> 
> Will you be doing the same for SVM? The last time I tried to add a
> nested virtualization feature for Intel only, Paolo rapped my knuckles
> with a ruler.

Yes, I can try to do this, if it's not much more complicated, once I get
access to AMD hardware.

But I suppose this series is standalone and could be merged separately?
From a quick look, it seems that SVM exposes far fewer features to
nested guests than VMX does anyway.

Ilias


* Re: [PATCH 1/8] KVM: VMX: Add a TSC multiplier field in VMCS12
  2021-05-06 10:32 ` [PATCH 1/8] KVM: VMX: Add a TSC multiplier field in VMCS12 ilstam
  2021-05-06 14:50   ` kernel test robot
  2021-05-06 17:36   ` Jim Mattson
@ 2021-05-10 13:42   ` Maxim Levitsky
  2 siblings, 0 replies; 36+ messages in thread
From: Maxim Levitsky @ 2021-05-10 13:42 UTC (permalink / raw)
  To: ilstam, kvm, pbonzini
  Cc: ilstam, seanjc, vkuznets, wanpengli, jmattson, joro,
	haozhong.zhang, zamsden, mtosatti, dplotnikov, dwmw

On Thu, 2021-05-06 at 10:32 +0000, ilstam@mailbox.org wrote:
> From: Ilias Stamatis <ilstam@amazon.com>
> 
> This is required for supporting nested TSC scaling.
> 
> Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
> ---
>  arch/x86/kvm/vmx/vmcs12.c | 1 +
>  arch/x86/kvm/vmx/vmcs12.h | 4 +++-
>  2 files changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/vmx/vmcs12.c b/arch/x86/kvm/vmx/vmcs12.c
> index 034adb6404dc..d9f5d7c56ae3 100644
> --- a/arch/x86/kvm/vmx/vmcs12.c
> +++ b/arch/x86/kvm/vmx/vmcs12.c
> @@ -37,6 +37,7 @@ const unsigned short vmcs_field_to_offset_table[] = {
>  	FIELD64(VM_ENTRY_MSR_LOAD_ADDR, vm_entry_msr_load_addr),
>  	FIELD64(PML_ADDRESS, pml_address),
>  	FIELD64(TSC_OFFSET, tsc_offset),
> +	FIELD64(TSC_MULTIPLIER, tsc_multiplier),
>  	FIELD64(VIRTUAL_APIC_PAGE_ADDR, virtual_apic_page_addr),
>  	FIELD64(APIC_ACCESS_ADDR, apic_access_addr),
>  	FIELD64(POSTED_INTR_DESC_ADDR, posted_intr_desc_addr),
> diff --git a/arch/x86/kvm/vmx/vmcs12.h b/arch/x86/kvm/vmx/vmcs12.h
> index 13494956d0e9..bb81a23afe89 100644
> --- a/arch/x86/kvm/vmx/vmcs12.h
> +++ b/arch/x86/kvm/vmx/vmcs12.h
> @@ -70,7 +70,8 @@ struct __packed vmcs12 {
>  	u64 eptp_list_address;
>  	u64 pml_address;
>  	u64 encls_exiting_bitmap;
> -	u64 padding64[2]; /* room for future expansion */
> +	u64 tsc_multiplier;
> +	u64 padding64[1]; /* room for future expansion */

Getting low on the padding. Oh well...
>  	/*
>  	 * To allow migration of L1 (complete with its L2 guests) between
>  	 * machines of different natural widths (32 or 64 bit), we cannot have
> @@ -258,6 +259,7 @@ static inline void vmx_check_vmcs12_offsets(void)
>  	CHECK_OFFSET(eptp_list_address, 304);
>  	CHECK_OFFSET(pml_address, 312);
>  	CHECK_OFFSET(encls_exiting_bitmap, 320);
> +	CHECK_OFFSET(tsc_multiplier, 328);
>  	CHECK_OFFSET(cr0_guest_host_mask, 344);
>  	CHECK_OFFSET(cr4_guest_host_mask, 352);
>  	CHECK_OFFSET(cr0_read_shadow, 360);

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky




* Re: [PATCH 0/8] KVM: VMX: Implement nested TSC scaling
  2021-05-06 17:48   ` Stamatis, Ilias
@ 2021-05-10 13:43     ` Maxim Levitsky
  0 siblings, 0 replies; 36+ messages in thread
From: Maxim Levitsky @ 2021-05-10 13:43 UTC (permalink / raw)
  To: Stamatis, Ilias, jmattson, ilstam
  Cc: kvm, Woodhouse, David, vkuznets, mtosatti, joro, zamsden, seanjc,
	pbonzini, wanpengli

On Thu, 2021-05-06 at 17:48 +0000, Stamatis, Ilias wrote:
> On Thu, 2021-05-06 at 10:16 -0700, Jim Mattson wrote:
> > On Thu, May 6, 2021 at 3:34 AM <ilstam@mailbox.org> wrote:
> > > From: Ilias Stamatis <ilstam@amazon.com>
> > > 
> > > KVM currently supports hardware-assisted TSC scaling but only for L1
> > > and it
> > > doesn't expose the feature to nested guests. This patch series adds
> > > support for
> > > nested TSC scaling and allows both L1 and L2 to be scaled with
> > > different
> > > scaling factors.
> > > 
> > > When scaling and offsetting is applied, the TSC for the guest is
> > > calculated as:
> > > 
> > > (TSC * multiplier >> 48) + offset
> > > 
> > > With nested scaling the values in VMCS01 and VMCS12 need to be
> > > merged
> > > together and stored in VMCS02.
> > > 
> > > The VMCS02 values are calculated as follows:
> > > 
> > > offset_02 = ((offset_01 * mult_12) >> 48) + offset_12
> > > mult_02 = (mult_01 * mult_12) >> 48
> > > 
> > > The last patch of the series adds a KVM selftest.
> > 
> > Will you be doing the same for SVM? The last time I tried to add a
> > nested virtualization feature for Intel only, Paolo rapped my knuckles
> > with a ruler.
> 
> Yes, I can try to do this, if it's not much more complicated, once I get
> access to AMD hardware.

I have access to AMD hardware with regular TSC scaling, and nested TSC
scaling IMHO won't be hard for me to implement, so I volunteer for this
task!


Best regards,
	Maxim Levitsky

> 
> But I suppose this series is standalone and could be merged separately?
> By taking a quick look it seems that SVM exposes far less features to
> nested guests than VMX does anyway.
> 
> Ilias




* Re: [PATCH 2/8] KVM: X86: Store L1's TSC scaling ratio in 'struct kvm_vcpu_arch'
  2021-05-06 10:32 ` [PATCH 2/8] KVM: X86: Store L1's TSC scaling ratio in 'struct kvm_vcpu_arch' ilstam
@ 2021-05-10 13:43   ` Maxim Levitsky
  0 siblings, 0 replies; 36+ messages in thread
From: Maxim Levitsky @ 2021-05-10 13:43 UTC (permalink / raw)
  To: ilstam, kvm, pbonzini
  Cc: ilstam, seanjc, vkuznets, wanpengli, jmattson, joro,
	haozhong.zhang, zamsden, mtosatti, dplotnikov, dwmw

On Thu, 2021-05-06 at 10:32 +0000, ilstam@mailbox.org wrote:
> From: Ilias Stamatis <ilstam@amazon.com>
> 
> Store L1's scaling ratio in that struct like we already do for L1's TSC
> offset. This allows for easy save/restore when we enter and then exit
> the nested guest.
> 
> Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
> ---
>  arch/x86/include/asm/kvm_host.h | 5 +++--
>  arch/x86/kvm/x86.c              | 4 +++-
>  2 files changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index cbbcee0a84f9..132e820525fb 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -705,7 +705,7 @@ struct kvm_vcpu_arch {
>  	} st;
>  
>  	u64 l1_tsc_offset;
> -	u64 tsc_offset;
> +	u64 tsc_offset; /* current tsc offset */
>  	u64 last_guest_tsc;
>  	u64 last_host_tsc;
>  	u64 tsc_offset_adjustment;
> @@ -719,7 +719,8 @@ struct kvm_vcpu_arch {
>  	u32 virtual_tsc_khz;
>  	s64 ia32_tsc_adjust_msr;
>  	u64 msr_ia32_power_ctl;
> -	u64 tsc_scaling_ratio;
> +	u64 l1_tsc_scaling_ratio;
> +	u64 tsc_scaling_ratio; /* current scaling ratio */
>  
>  	atomic_t nmi_queued;  /* unprocessed asynchronous NMIs */
>  	unsigned nmi_pending; /* NMI queued after currently running handler */
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index cebdaa1e3cf5..7bc5155ac6fd 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2119,6 +2119,7 @@ static int set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool scale)
>  
>  	/* Guest TSC same frequency as host TSC? */
>  	if (!scale) {
> +		vcpu->arch.l1_tsc_scaling_ratio = kvm_default_tsc_scaling_ratio;
>  		vcpu->arch.tsc_scaling_ratio = kvm_default_tsc_scaling_ratio;
>  		return 0;
>  	}
> @@ -2145,7 +2146,7 @@ static int set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool scale)
>  		return -1;
>  	}
>  
> -	vcpu->arch.tsc_scaling_ratio = ratio;
> +	vcpu->arch.l1_tsc_scaling_ratio = vcpu->arch.tsc_scaling_ratio = ratio;
>  	return 0;
>  }
>  
> @@ -2157,6 +2158,7 @@ static int kvm_set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz)
>  	/* tsc_khz can be zero if TSC calibration fails */
>  	if (user_tsc_khz == 0) {
>  		/* set tsc_scaling_ratio to a safe value */
> +		vcpu->arch.l1_tsc_scaling_ratio = kvm_default_tsc_scaling_ratio;
>  		vcpu->arch.tsc_scaling_ratio = kvm_default_tsc_scaling_ratio;
>  		return -1;
>  	}
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>





* Re: [PATCH 3/8] KVM: X86: Pass an additional 'L1' argument to kvm_scale_tsc()
  2021-05-06 10:32 ` [PATCH 3/8] KVM: X86: Pass an additional 'L1' argument to kvm_scale_tsc() ilstam
@ 2021-05-10 13:52   ` Maxim Levitsky
  2021-05-10 15:44     ` Stamatis, Ilias
  0 siblings, 1 reply; 36+ messages in thread
From: Maxim Levitsky @ 2021-05-10 13:52 UTC (permalink / raw)
  To: ilstam, kvm, pbonzini
  Cc: ilstam, seanjc, vkuznets, wanpengli, jmattson, joro,
	haozhong.zhang, zamsden, mtosatti, dplotnikov, dwmw

On Thu, 2021-05-06 at 10:32 +0000, ilstam@mailbox.org wrote:
> From: Ilias Stamatis <ilstam@amazon.com>
> 
> Sometimes kvm_scale_tsc() needs to use the current scaling ratio and
> other times (like when reading the TSC from user space) it needs to use
> L1's scaling ratio. Have the caller specify this by passing an
> additional boolean argument.
> 
> Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  2 +-
>  arch/x86/kvm/x86.c              | 21 +++++++++++++--------
>  2 files changed, 14 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 132e820525fb..cdddbf0b1177 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1779,7 +1779,7 @@ int kvm_pv_send_ipi(struct kvm *kvm, unsigned long ipi_bitmap_low,
>  void kvm_define_user_return_msr(unsigned index, u32 msr);
>  int kvm_set_user_return_msr(unsigned index, u64 val, u64 mask);
>  
> -u64 kvm_scale_tsc(struct kvm_vcpu *vcpu, u64 tsc);
> +u64 kvm_scale_tsc(struct kvm_vcpu *vcpu, u64 tsc, bool l1);
>  u64 kvm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc);
>  
>  unsigned long kvm_get_linear_rip(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 7bc5155ac6fd..26a4c0f46f15 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2241,10 +2241,14 @@ static inline u64 __scale_tsc(u64 ratio, u64 tsc)
>  	return mul_u64_u64_shr(tsc, ratio, kvm_tsc_scaling_ratio_frac_bits);
>  }
>  
> -u64 kvm_scale_tsc(struct kvm_vcpu *vcpu, u64 tsc)
> +/*
> + * If l1 is true the TSC is scaled using L1's scaling ratio, otherwise
> + * the current scaling ratio is used.
> + */
> +u64 kvm_scale_tsc(struct kvm_vcpu *vcpu, u64 tsc, bool l1)
>  {
>  	u64 _tsc = tsc;
> -	u64 ratio = vcpu->arch.tsc_scaling_ratio;
> +	u64 ratio = l1 ? vcpu->arch.l1_tsc_scaling_ratio : vcpu->arch.tsc_scaling_ratio;
>  
>  	if (ratio != kvm_default_tsc_scaling_ratio)
>  		_tsc = __scale_tsc(ratio, tsc);
I do wonder if it would be better to have separate kvm_scale_tsc_l1() and
kvm_scale_tsc() functions instead, for better readability?



> @@ -2257,14 +2261,14 @@ static u64 kvm_compute_tsc_offset(struct kvm_vcpu *vcpu, u64 target_tsc)
>  {
>  	u64 tsc;
>  
> -	tsc = kvm_scale_tsc(vcpu, rdtsc());
> +	tsc = kvm_scale_tsc(vcpu, rdtsc(), true);
Here we have a somewhat dangerous assumption that this function
will always be used with L1 TSC values.

kvm_compute_tsc_offset() should at least be renamed to
kvm_compute_tsc_offset_l1 or something like that.

Currently the assumption holds though:

We call the kvm_compute_tsc_offset in:

-> kvm_synchronize_tsc which is nowadays thankfully only called
on TSC writes from qemu, which are assumed to be L1 values.

(this is pending a rework of the whole thing which I started
some time ago but haven't had a chance to finish it yet)

-> Guest write of MSR_IA32_TSC. The value written is in L1 units,
since TSC offset/scaling only covers RDTSC and RDMSR of the IA32_TSC,
while WRMSR should be intercepted by L1 and emulated.
If it is not emulated, then L2 would just write L1 value.

-> in kvm_arch_vcpu_load, when the TSC is unstable, we always try to resume
the guest from the same TSC value as it had seen last time,
and then catch up.

Host TSC values are also used here, and after reading this function
I recommend renaming vcpu->arch.last_guest_tsc
to vcpu->arch.last_guest_tsc_l1 to document this.


>  
>  	return target_tsc - tsc;
>  }
>  
>  u64 kvm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc)
>  {
> -	return vcpu->arch.l1_tsc_offset + kvm_scale_tsc(vcpu, host_tsc);
> +	return vcpu->arch.l1_tsc_offset + kvm_scale_tsc(vcpu, host_tsc, true);
OK
>  }
>  EXPORT_SYMBOL_GPL(kvm_read_l1_tsc);
>  
> @@ -2395,9 +2399,9 @@ static inline void adjust_tsc_offset_guest(struct kvm_vcpu *vcpu,
>  
>  static inline void adjust_tsc_offset_host(struct kvm_vcpu *vcpu, s64 adjustment)
>  {
> -	if (vcpu->arch.tsc_scaling_ratio != kvm_default_tsc_scaling_ratio)
> +	if (vcpu->arch.l1_tsc_scaling_ratio != kvm_default_tsc_scaling_ratio)
>  		WARN_ON(adjustment < 0);
This should belong to patch 2 IMHO.

> -	adjustment = kvm_scale_tsc(vcpu, (u64) adjustment);
> +	adjustment = kvm_scale_tsc(vcpu, (u64) adjustment, true);
OK
>  	adjust_tsc_offset_guest(vcpu, adjustment);
>  }
>  
> @@ -2780,7 +2784,7 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
>  	/* With all the info we got, fill in the values */
>  
>  	if (kvm_has_tsc_control)
> -		tgt_tsc_khz = kvm_scale_tsc(v, tgt_tsc_khz);
> +		tgt_tsc_khz = kvm_scale_tsc(v, tgt_tsc_khz, true);
OK (kvmclock is for L1 only, L1 hypervisor is free to expose its own kvmclock to L2)
>  
>  	if (unlikely(vcpu->hw_tsc_khz != tgt_tsc_khz)) {
>  		kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL,
> @@ -3474,7 +3478,8 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  		u64 tsc_offset = msr_info->host_initiated ? vcpu->arch.l1_tsc_offset :
>  							    vcpu->arch.tsc_offset;
>  
> -		msr_info->data = kvm_scale_tsc(vcpu, rdtsc()) + tsc_offset;
> +		msr_info->data = kvm_scale_tsc(vcpu, rdtsc(),
> +					       msr_info->host_initiated) + tsc_offset;

Since we now do two things that depend on msr_info->host_initiated, I
think I would prefer to convert this back to a regular 'if'.
I don't have a strong opinion on this though.


>  		break;
>  	}
>  	case MSR_MTRRcap:


Best regards,
	Maxim Levitsky




* Re: [PATCH 4/8] KVM: VMX: Adjust the TSC-related VMCS fields on L2 entry and exit
  2021-05-06 10:32 ` [PATCH 4/8] KVM: VMX: Adjust the TSC-related VMCS fields on L2 entry and exit ilstam
  2021-05-06 11:32   ` Paolo Bonzini
@ 2021-05-10 13:53   ` Maxim Levitsky
  2021-05-10 14:44     ` Stamatis, Ilias
  1 sibling, 1 reply; 36+ messages in thread
From: Maxim Levitsky @ 2021-05-10 13:53 UTC (permalink / raw)
  To: ilstam, kvm, pbonzini
  Cc: ilstam, seanjc, vkuznets, wanpengli, jmattson, joro,
	haozhong.zhang, zamsden, mtosatti, dplotnikov, dwmw

On Thu, 2021-05-06 at 10:32 +0000, ilstam@mailbox.org wrote:
> From: Ilias Stamatis <ilstam@amazon.com>
> 
> When L2 is entered we need to "merge" the TSC multiplier and TSC offset
> values of VMCS01 and VMCS12 and store the result into the current
> VMCS02.
> 
> The 02 values are calculated using the following equations:
>   offset_02 = ((offset_01 * mult_12) >> 48) + offset_12
>   mult_02 = (mult_01 * mult_12) >> 48

I would mention that the 48 here is kvm_tsc_scaling_ratio_frac_bits instead.
Also, maybe add the common code in a separate patch?

> 
> Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/vmx/nested.c       | 26 ++++++++++++++++++++++----
>  arch/x86/kvm/x86.c              | 25 +++++++++++++++++++++++++
>  3 files changed, 48 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index cdddbf0b1177..e7a1eb36f95a 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1780,6 +1780,7 @@ void kvm_define_user_return_msr(unsigned index, u32 msr);
>  int kvm_set_user_return_msr(unsigned index, u64 val, u64 mask);
>  
>  u64 kvm_scale_tsc(struct kvm_vcpu *vcpu, u64 tsc, bool l1);
> +u64 kvm_compute_02_tsc_offset(u64 l1_offset, u64 l2_multiplier, u64 l2_offset);
>  u64 kvm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc);
>  
>  unsigned long kvm_get_linear_rip(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index bced76637823..a1bf28f33837 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -3353,8 +3353,22 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
>  	}
>  
>  	enter_guest_mode(vcpu);
> -	if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETTING)
> -		vcpu->arch.tsc_offset += vmcs12->tsc_offset;
> +
> +	if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETTING) {
> +		if (vmcs12->secondary_vm_exec_control & SECONDARY_EXEC_TSC_SCALING) {
> +			vcpu->arch.tsc_offset = kvm_compute_02_tsc_offset(
> +					vcpu->arch.l1_tsc_offset,
> +					vmcs12->tsc_multiplier,
> +					vmcs12->tsc_offset);
> +
> +			vcpu->arch.tsc_scaling_ratio = mul_u64_u64_shr(
> +					vcpu->arch.tsc_scaling_ratio,
> +					vmcs12->tsc_multiplier,
> +					kvm_tsc_scaling_ratio_frac_bits);
> +		} else {
> +			vcpu->arch.tsc_offset += vmcs12->tsc_offset;
> +		}
> +	}
>  
>  	if (prepare_vmcs02(vcpu, vmcs12, &entry_failure_code)) {
>  		exit_reason.basic = EXIT_REASON_INVALID_STATE;
> @@ -4454,8 +4468,12 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
>  	if (nested_cpu_has_preemption_timer(vmcs12))
>  		hrtimer_cancel(&to_vmx(vcpu)->nested.preemption_timer);
>  
> -	if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETTING)
> -		vcpu->arch.tsc_offset -= vmcs12->tsc_offset;
> +	if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETTING) {
> +		vcpu->arch.tsc_offset = vcpu->arch.l1_tsc_offset;
> +
> +		if (vmcs12->secondary_vm_exec_control & SECONDARY_EXEC_TSC_SCALING)
> +			vcpu->arch.tsc_scaling_ratio = vcpu->arch.l1_tsc_scaling_ratio;
> +	}
>  
>  	if (likely(!vmx->fail)) {
>  		sync_vmcs02_to_vmcs12(vcpu, vmcs12);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 26a4c0f46f15..87deb119c521 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2266,6 +2266,31 @@ static u64 kvm_compute_tsc_offset(struct kvm_vcpu *vcpu, u64 target_tsc)
>  	return target_tsc - tsc;
>  }
>  
> +/*
> + * This function computes the TSC offset that is stored in VMCS02 when entering
> + * L2 by combining the offset and multiplier values of VMCS01 and VMCS12.
> + */
> +u64 kvm_compute_02_tsc_offset(u64 l1_offset, u64 l2_multiplier, u64 l2_offset)
> +{
> +	u64 offset;
> +
> +	/*
> +	 * The L1 offset is interpreted as a signed number by the CPU and can
> +	 * be negative. So we extract the sign before the multiplication and
> +	 * put it back afterwards if needed.
If I understand correctly the reason for sign extraction is that we don't
have mul_s64_u64_shr. Maybe we should add it?

The pattern of (A * B) >> shift appears many times in TSC scaling.

So instead of this function maybe just use something like this?

merged_offset = l2_offset + mul_s64_u64_shr ((s64) l1_offset, 
                                             l2_multiplier, 
                                             kvm_tsc_scaling_ratio_frac_bits)

Or another idea:

How about

u64 __kvm_scale_tsc_value(u64 value, u64 multiplier) {
	return mul_u64_u64_shr(value, 
			       multiplier, 
			       kvm_tsc_scaling_ratio_frac_bits);
}


and

s64 __kvm_scale_tsc_offset(u64 value, u64 multiplier)
{
	return mul_s64_u64_shr((s64)value,
			       multiplier, 
			       kvm_tsc_scaling_ratio_frac_bits);
}

And then use them in the code.

Overall though the code *looks* correct to me
but I might have missed something.

Best regards,
	Maxim Levitsky


> +	 */
> +	offset = mul_u64_u64_shr(abs((s64) l1_offset),
> +				 l2_multiplier,
> +				 kvm_tsc_scaling_ratio_frac_bits);
> +
> +	if ((s64) l1_offset < 0)
> +		offset = -((s64) offset);
> +
> +	offset += l2_offset;
> +	return offset;
> +}
> +EXPORT_SYMBOL_GPL(kvm_compute_02_tsc_offset);
> +
>  u64 kvm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc)
>  {
>  	return vcpu->arch.l1_tsc_offset + kvm_scale_tsc(vcpu, host_tsc, true);




* Re: [PATCH 5/8] KVM: X86: Move tracing outside write_l1_tsc_offset()
  2021-05-06 10:32 ` [PATCH 5/8] KVM: X86: Move tracing outside write_l1_tsc_offset() ilstam
@ 2021-05-10 13:54   ` Maxim Levitsky
  0 siblings, 0 replies; 36+ messages in thread
From: Maxim Levitsky @ 2021-05-10 13:54 UTC (permalink / raw)
  To: ilstam, kvm, pbonzini
  Cc: ilstam, seanjc, vkuznets, wanpengli, jmattson, joro,
	haozhong.zhang, zamsden, mtosatti, dplotnikov, dwmw

On Thu, 2021-05-06 at 10:32 +0000, ilstam@mailbox.org wrote:
> From: Ilias Stamatis <ilstam@amazon.com>
> 
> A subsequent patch fixes write_l1_tsc_offset() to account for nested TSC
> scaling. Calculating the L1 TSC for logging it with the trace call
> becomes more complex then.
> 
> This patch moves the trace call to the caller and avoids code
> duplication as a result too.
> 
> Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
> ---
>  arch/x86/kvm/svm/svm.c | 4 ----
>  arch/x86/kvm/vmx/vmx.c | 3 ---
>  arch/x86/kvm/x86.c     | 4 ++++
>  3 files changed, 4 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 9790c73f2a32..d2f9d6a9716f 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -1090,10 +1090,6 @@ static u64 svm_write_l1_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
>  		svm->vmcb01.ptr->control.tsc_offset = offset;
>  	}
>  
> -	trace_kvm_write_tsc_offset(vcpu->vcpu_id,
> -				   svm->vmcb->control.tsc_offset - g_tsc_offset,
> -				   offset);
> -
>  	svm->vmcb->control.tsc_offset = offset + g_tsc_offset;
>  
>  	vmcb_mark_dirty(svm->vmcb, VMCB_INTERCEPTS);
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index cbe0cdade38a..49241423b854 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -1812,9 +1812,6 @@ static u64 vmx_write_l1_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
>  	    (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETTING))
>  		g_tsc_offset = vmcs12->tsc_offset;
>  
> -	trace_kvm_write_tsc_offset(vcpu->vcpu_id,
> -				   vcpu->arch.tsc_offset - g_tsc_offset,
> -				   offset);
>  	vmcs_write64(TSC_OFFSET, offset + g_tsc_offset);
>  	return offset + g_tsc_offset;
>  }
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 87deb119c521..c08295bcf50e 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2299,6 +2299,10 @@ EXPORT_SYMBOL_GPL(kvm_read_l1_tsc);
>  
>  static void kvm_vcpu_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
>  {
> +	trace_kvm_write_tsc_offset(vcpu->vcpu_id,
> +				   vcpu->arch.l1_tsc_offset,
> +				   offset);
> +
>  	vcpu->arch.l1_tsc_offset = offset;
>  	vcpu->arch.tsc_offset = static_call(kvm_x86_write_l1_tsc_offset)(vcpu, offset);
>  }

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky



* Re: [PATCH 6/8] KVM: VMX: Make vmx_write_l1_tsc_offset() work with nested TSC scaling
  2021-05-06 10:32 ` [PATCH 6/8] KVM: VMX: Make vmx_write_l1_tsc_offset() work with nested TSC scaling ilstam
@ 2021-05-10 13:54   ` Maxim Levitsky
  2021-05-10 16:08     ` Stamatis, Ilias
  0 siblings, 1 reply; 36+ messages in thread
From: Maxim Levitsky @ 2021-05-10 13:54 UTC (permalink / raw)
  To: ilstam, kvm, pbonzini
  Cc: ilstam, seanjc, vkuznets, wanpengli, jmattson, joro,
	haozhong.zhang, zamsden, mtosatti, dplotnikov, dwmw

On Thu, 2021-05-06 at 10:32 +0000, ilstam@mailbox.org wrote:
> From: Ilias Stamatis <ilstam@amazon.com>
> 
> Calculating the current TSC offset is done differently when nested TSC
> scaling is used.
> 
> Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
> ---
>  arch/x86/kvm/vmx/vmx.c | 26 ++++++++++++++++++++------
>  1 file changed, 20 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 49241423b854..df7dc0e4c903 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -1797,10 +1797,16 @@ static void setup_msrs(struct vcpu_vmx *vmx)
>  		vmx_update_msr_bitmap(&vmx->vcpu);
>  }
>  
> -static u64 vmx_write_l1_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
> +/*
> + * This function receives the requested offset for L1 as an argument but it
> + * actually writes the "current" tsc offset to the VMCS and returns it. The
> + * current offset might be different in case an L2 guest is currently running
> + * and its VMCS02 is loaded.
> + */
(Not related to this patch) It might be a good idea to rename this callback
instead of adding this comment, but I am not sure about it.


> +static u64 vmx_write_l1_tsc_offset(struct kvm_vcpu *vcpu, u64 l1_offset)
>  {
>  	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> -	u64 g_tsc_offset = 0;
> +	u64 cur_offset = l1_offset;
>  
>  	/*
>  	 * We're here if L1 chose not to trap WRMSR to TSC. According
> @@ -1809,11 +1815,19 @@ static u64 vmx_write_l1_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
>  	 * to the newly set TSC to get L2's TSC.
>  	 */
>  	if (is_guest_mode(vcpu) &&
> -	    (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETTING))
> -		g_tsc_offset = vmcs12->tsc_offset;
> +	    (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETTING)) {
> +		if (vmcs12->secondary_vm_exec_control & SECONDARY_EXEC_TSC_SCALING) {
> +			cur_offset = kvm_compute_02_tsc_offset(
> +					l1_offset,
> +					vmcs12->tsc_multiplier,
> +					vmcs12->tsc_offset);
> +		} else {
> +			cur_offset = l1_offset + vmcs12->tsc_offset;
> +		}
> +	}
>  
> -	vmcs_write64(TSC_OFFSET, offset + g_tsc_offset);
> -	return offset + g_tsc_offset;
> +	vmcs_write64(TSC_OFFSET, cur_offset);
> +	return cur_offset;
>  }
>  
>  /*

This code would be an ideal candidate for moving to common code, as SVM
will do basically the same thing.
It doesn't have to be done now though.


Best regards,
	Maxim Levitsky



* Re: [PATCH 7/8] KVM: VMX: Expose TSC scaling to L2
  2021-05-06 10:32 ` [PATCH 7/8] KVM: VMX: Expose TSC scaling to L2 ilstam
@ 2021-05-10 13:56   ` Maxim Levitsky
  0 siblings, 0 replies; 36+ messages in thread
From: Maxim Levitsky @ 2021-05-10 13:56 UTC (permalink / raw)
  To: ilstam, kvm, pbonzini
  Cc: ilstam, seanjc, vkuznets, wanpengli, jmattson, joro,
	haozhong.zhang, zamsden, mtosatti, dplotnikov, dwmw

On Thu, 2021-05-06 at 10:32 +0000, ilstam@mailbox.org wrote:
> From: Ilias Stamatis <ilstam@amazon.com>
> 
> Expose the TSC scaling feature to nested guests.
> 
> Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
> ---
>  arch/x86/kvm/vmx/nested.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index a1bf28f33837..639cb9462103 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -2277,7 +2277,8 @@ static void prepare_vmcs02_early(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
>  				  SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE |
>  				  SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
>  				  SECONDARY_EXEC_APIC_REGISTER_VIRT |
> -				  SECONDARY_EXEC_ENABLE_VMFUNC);
> +				  SECONDARY_EXEC_ENABLE_VMFUNC |
> +				  SECONDARY_EXEC_TSC_SCALING);
>  		if (nested_cpu_has(vmcs12,
>  				   CPU_BASED_ACTIVATE_SECONDARY_CONTROLS))
>  			exec_control |= vmcs12->secondary_vm_exec_control;
> @@ -6483,7 +6484,8 @@ void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps)
>  		SECONDARY_EXEC_RDRAND_EXITING |
>  		SECONDARY_EXEC_ENABLE_INVPCID |
>  		SECONDARY_EXEC_RDSEED_EXITING |
> -		SECONDARY_EXEC_XSAVES;
> +		SECONDARY_EXEC_XSAVES |
> +		SECONDARY_EXEC_TSC_SCALING;
>  
>  	/*
>  	 * We can emulate "VMCS shadowing," even if the hardware


Seems to be correct. I don't yet have experience with how VMX handles the VMX capability
MSRs and the primary/secondary/entry/exit/pinbased control filtering for features that are
not supported on the host, but after digging through it this seems to be OK.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky



* Re: [PATCH 8/8] KVM: selftests: x86: Add vmx_nested_tsc_scaling_test
  2021-05-06 10:32 ` [PATCH 8/8] KVM: selftests: x86: Add vmx_nested_tsc_scaling_test ilstam
@ 2021-05-10 13:59   ` Maxim Levitsky
  2021-05-11 11:16     ` Stamatis, Ilias
  0 siblings, 1 reply; 36+ messages in thread
From: Maxim Levitsky @ 2021-05-10 13:59 UTC (permalink / raw)
  To: ilstam, kvm, pbonzini
  Cc: ilstam, seanjc, vkuznets, wanpengli, jmattson, joro,
	haozhong.zhang, zamsden, mtosatti, dplotnikov, dwmw

On Thu, 2021-05-06 at 10:32 +0000, ilstam@mailbox.org wrote:
> From: Ilias Stamatis <ilstam@amazon.com>
> 
> Test that nested TSC scaling works as expected with both L1 and L2
> scaled.
> 
> Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
> ---
>  tools/testing/selftests/kvm/.gitignore        |   1 +
>  tools/testing/selftests/kvm/Makefile          |   1 +
>  .../kvm/x86_64/vmx_nested_tsc_scaling_test.c  | 209 ++++++++++++++++++
>  3 files changed, 211 insertions(+)
>  create mode 100644 tools/testing/selftests/kvm/x86_64/vmx_nested_tsc_scaling_test.c
> 
> diff --git a/tools/testing/selftests/kvm/.gitignore b/tools/testing/selftests/kvm/.gitignore
> index bd83158e0e0b..cc02022f9951 100644
> --- a/tools/testing/selftests/kvm/.gitignore
> +++ b/tools/testing/selftests/kvm/.gitignore
> @@ -29,6 +29,7 @@
>  /x86_64/vmx_preemption_timer_test
>  /x86_64/vmx_set_nested_state_test
>  /x86_64/vmx_tsc_adjust_test
> +/x86_64/vmx_nested_tsc_scaling_test
>  /x86_64/xapic_ipi_test
>  /x86_64/xen_shinfo_test
>  /x86_64/xen_vmcall_test
> diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
> index e439d027939d..1078240b1313 100644
> --- a/tools/testing/selftests/kvm/Makefile
> +++ b/tools/testing/selftests/kvm/Makefile
> @@ -60,6 +60,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/vmx_close_while_nested_test
>  TEST_GEN_PROGS_x86_64 += x86_64/vmx_dirty_log_test
>  TEST_GEN_PROGS_x86_64 += x86_64/vmx_set_nested_state_test
>  TEST_GEN_PROGS_x86_64 += x86_64/vmx_tsc_adjust_test
> +TEST_GEN_PROGS_x86_64 += x86_64/vmx_nested_tsc_scaling_test
>  TEST_GEN_PROGS_x86_64 += x86_64/xapic_ipi_test
>  TEST_GEN_PROGS_x86_64 += x86_64/xss_msr_test
>  TEST_GEN_PROGS_x86_64 += x86_64/debug_regs
> diff --git a/tools/testing/selftests/kvm/x86_64/vmx_nested_tsc_scaling_test.c b/tools/testing/selftests/kvm/x86_64/vmx_nested_tsc_scaling_test.c
> new file mode 100644
> index 000000000000..b05f5151ecbe
> --- /dev/null
> +++ b/tools/testing/selftests/kvm/x86_64/vmx_nested_tsc_scaling_test.c
> @@ -0,0 +1,209 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * vmx_nested_tsc_scaling_test
> + *
> + * Copyright (C) 2021 Amazon.com, Inc. or its affiliates.
> + *
> + * This test case verifies that nested TSC scaling behaves as expected when
> + * both L1 and L2 are scaled using different ratios. For this test we scale
> + * L1 down and scale L2 up.
> + */
> +
> +
> +#include "kvm_util.h"
> +#include "vmx.h"
> +#include "kselftest.h"
> +
> +
> +#define VCPU_ID 0
> +
> +/* L1 is scaled down by this factor */
> +#define L1_SCALE_FACTOR 2ULL
> +/* L2 is scaled up (from L1's perspective) by this factor */
> +#define L2_SCALE_FACTOR 4ULL

For fun, I might have randomized these factors as well.

> +
> +#define TSC_OFFSET_L2 (1UL << 32)
> +#define TSC_MULTIPLIER_L2 (L2_SCALE_FACTOR << 48)
It would be fun to use a negative offset here (also randomly).

> +
> +#define L2_GUEST_STACK_SIZE 64
> +
> +enum { USLEEP, UCHECK_L1, UCHECK_L2 };
> +#define GUEST_SLEEP(sec)         ucall(UCALL_SYNC, 2, USLEEP, sec)
> +#define GUEST_CHECK(level, freq) ucall(UCALL_SYNC, 2, level, freq)
> +
> +
> +/*
> + * This function checks whether the "actual" TSC frequency of a guest matches
> + * its expected frequency. In order to account for delays in taking the TSC
> + * measurements, a difference of 1% between the actual and the expected value
> + * is tolerated.
> + */
> +static void compare_tsc_freq(uint64_t actual, uint64_t expected)
> +{
> +	uint64_t tolerance, thresh_low, thresh_high;
> +
> +	tolerance = expected / 100;
> +	thresh_low = expected - tolerance;
> +	thresh_high = expected + tolerance;
> +
> +	TEST_ASSERT(thresh_low < actual,
> +		"TSC freq is expected to be between %"PRIu64" and %"PRIu64
> +		" but it actually is %"PRIu64,
> +		thresh_low, thresh_high, actual);
> +	TEST_ASSERT(thresh_high > actual,
> +		"TSC freq is expected to be between %"PRIu64" and %"PRIu64
> +		" but it actually is %"PRIu64,
> +		thresh_low, thresh_high, actual);
> +}
> +
> +static void check_tsc_freq(int level)
> +{
> +	uint64_t tsc_start, tsc_end, tsc_freq;
> +
> +	/*
> +	 * Reading the TSC twice with about a second's difference should give
> +	 * us an approximation of the TSC frequency from the guest's
> +	 * perspective. Now, this won't be completely accurate, but it should
> +	 * be good enough for the purposes of this test.
> +	 */

It would be nice to know if the host has a stable TSC (you can obtain this via
KVM_GET_CLOCK, by checking the KVM_CLOCK_TSC_STABLE flag).

And if not stable skip the test, to avoid false positives.
(Yes, I just bought a laptop that has an unstable TSC...)


> +	tsc_start = rdmsr(MSR_IA32_TSC);
> +	GUEST_SLEEP(1);
> +	tsc_end = rdmsr(MSR_IA32_TSC);
> +
> +	tsc_freq = tsc_end - tsc_start;
> +
> +	GUEST_CHECK(level, tsc_freq);
> +}
> +
> +static void l2_guest_code(void)
> +{
> +	check_tsc_freq(UCHECK_L2);
> +
> +	/* exit to L1 */
> +	__asm__ __volatile__("vmcall");
> +}
> +
> +static void l1_guest_code(struct vmx_pages *vmx_pages)
> +{
> +	unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE];
> +	uint32_t control;
> +
> +	/* check that L1's frequency looks alright before launching L2 */
> +	check_tsc_freq(UCHECK_L1);
> +
> +	GUEST_ASSERT(prepare_for_vmx_operation(vmx_pages));
> +	GUEST_ASSERT(load_vmcs(vmx_pages));
> +
> +	/* prepare the VMCS for L2 execution */
> +	prepare_vmcs(vmx_pages, l2_guest_code, &l2_guest_stack[L2_GUEST_STACK_SIZE]);
> +
> +	/* enable TSC offsetting and TSC scaling for L2 */
> +	control = vmreadz(CPU_BASED_VM_EXEC_CONTROL);
> +	control |= CPU_BASED_USE_MSR_BITMAPS | CPU_BASED_USE_TSC_OFFSETTING;
> +	vmwrite(CPU_BASED_VM_EXEC_CONTROL, control);
> +
> +	control = vmreadz(SECONDARY_VM_EXEC_CONTROL);
> +	control |= SECONDARY_EXEC_TSC_SCALING;
> +	vmwrite(SECONDARY_VM_EXEC_CONTROL, control);
> +
> +	vmwrite(TSC_OFFSET, TSC_OFFSET_L2);
> +	vmwrite(TSC_MULTIPLIER, TSC_MULTIPLIER_L2);
> +	vmwrite(TSC_MULTIPLIER_HIGH, TSC_MULTIPLIER_L2 >> 32);
> +
> +	/* launch L2 */
> +	GUEST_ASSERT(!vmlaunch());
> +	GUEST_ASSERT(vmreadz(VM_EXIT_REASON) == EXIT_REASON_VMCALL);
> +
> +	/* check that L1's frequency still looks good */
> +	check_tsc_freq(UCHECK_L1);
> +
> +	GUEST_DONE();
> +}
> +
> +static void tsc_scaling_check_supported(void)
> +{
> +	if (!kvm_check_cap(KVM_CAP_TSC_CONTROL)) {
> +		print_skip("TSC scaling not supported by the HW");
> +		exit(KSFT_SKIP);
> +	}
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +	struct kvm_vm *vm;
> +	vm_vaddr_t vmx_pages_gva;
> +
> +	uint64_t tsc_start, tsc_end;
> +	uint64_t tsc_khz;
> +	uint64_t l0_tsc_freq = 0;
> +	uint64_t l1_tsc_freq = 0;
> +	uint64_t l2_tsc_freq = 0;
> +
> +	nested_vmx_check_supported();
> +	tsc_scaling_check_supported();
> +
> +	tsc_start = rdtsc();
> +	sleep(1);
> +	tsc_end = rdtsc();
> +
> +	l0_tsc_freq = tsc_end - tsc_start;
> +	printf("real TSC frequency is around: %"PRIu64"\n", l0_tsc_freq);
> +
> +	vm = vm_create_default(VCPU_ID, 0, (void *) l1_guest_code);
> +	vcpu_alloc_vmx(vm, &vmx_pages_gva);
> +	vcpu_args_set(vm, VCPU_ID, 1, vmx_pages_gva);
> +
> +	tsc_khz = _vcpu_ioctl(vm, VCPU_ID, KVM_GET_TSC_KHZ, NULL);
> +	TEST_ASSERT(tsc_khz != -1, "vcpu ioctl KVM_GET_TSC_KHZ failed");
> +
> +	/* scale down L1's TSC frequency */
> +	vcpu_ioctl(vm, VCPU_ID, KVM_SET_TSC_KHZ,
> +		  (void *) (tsc_khz / L1_SCALE_FACTOR));
> +
> +	for (;;) {
> +		volatile struct kvm_run *run = vcpu_state(vm, VCPU_ID);
> +		struct ucall uc;
> +
> +		vcpu_run(vm, VCPU_ID);
> +		TEST_ASSERT(run->exit_reason == KVM_EXIT_IO,
> +			    "Got exit_reason other than KVM_EXIT_IO: %u (%s)\n",
> +			    run->exit_reason,
> +			    exit_reason_str(run->exit_reason));
> +
> +		switch (get_ucall(vm, VCPU_ID, &uc)) {
> +		case UCALL_ABORT:
> +			TEST_FAIL("%s", (const char *) uc.args[0]);
> +		case UCALL_SYNC:
> +			switch (uc.args[0]) {
> +			case USLEEP:
> +				sleep(uc.args[1]);
> +				break;
> +			case UCHECK_L1:
> +				l1_tsc_freq = uc.args[1];
> +				printf("L1's TSC frequency is around: %"PRIu64
> +				       "\n", l1_tsc_freq);
> +
> +				compare_tsc_freq(l1_tsc_freq,
> +						 l0_tsc_freq / L1_SCALE_FACTOR);
> +				break;
> +			case UCHECK_L2:
> +				l2_tsc_freq = uc.args[1];
> +				printf("L2's TSC frequency is around: %"PRIu64
> +				       "\n", l2_tsc_freq);
> +
> +				compare_tsc_freq(l2_tsc_freq,
> +						 l1_tsc_freq * L2_SCALE_FACTOR);
> +				break;
> +			}
> +			break;
> +		case UCALL_DONE:
> +			goto done;
> +		default:
> +			TEST_FAIL("Unknown ucall %lu", uc.cmd);
> +		}
> +	}
> +
> +done:
> +	kvm_vm_free(vm);
> +	return 0;
> +}
Overall looks OK to me.

I can't test it, since the most recent Intel laptop I have (i7-7600U)
still lacks TSC scaling (or did Intel cripple this feature on client
parts, like they did with APICv?)

Best regards,
	Maxim Levitsky





* Re: [PATCH 4/8] KVM: VMX: Adjust the TSC-related VMCS fields on L2 entry and exit
  2021-05-06 17:35     ` Stamatis, Ilias
@ 2021-05-10 14:11       ` Paolo Bonzini
  0 siblings, 0 replies; 36+ messages in thread
From: Paolo Bonzini @ 2021-05-10 14:11 UTC (permalink / raw)
  To: Stamatis, Ilias, kvm, ilstam
  Cc: jmattson, Woodhouse, David, vkuznets, haozhong.zhang, joro,
	mtosatti, zamsden, seanjc, dplotnikov, wanpengli

On 06/05/21 19:35, Stamatis, Ilias wrote:
> On Thu, 2021-05-06 at 13:32 +0200, Paolo Bonzini wrote:
>> On 06/05/21 12:32, ilstam@mailbox.org wrote:
>>> +     if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETTING) {
>>> +             if (vmcs12->secondary_vm_exec_control & SECONDARY_EXEC_TSC_SCALING) {
>>> +                     vcpu->arch.tsc_offset = kvm_compute_02_tsc_offset(
>>> +                                     vcpu->arch.l1_tsc_offset,
>>> +                                     vmcs12->tsc_multiplier,
>>> +                                     vmcs12->tsc_offset);
>>> +
>>> +                     vcpu->arch.tsc_scaling_ratio = mul_u64_u64_shr(
>>> +                                     vcpu->arch.tsc_scaling_ratio,
>>> +                                     vmcs12->tsc_multiplier,
>>> +                                     kvm_tsc_scaling_ratio_frac_bits);
>>> +             } else {
>>> +                     vcpu->arch.tsc_offset += vmcs12->tsc_offset;
>>> +             }
>>
>> The computation of vcpu->arch.tsc_offset is (not coincidentially) the
>> same that appears in patch 6
>>
>> +           (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETTING)) {
>> +               if (vmcs12->secondary_vm_exec_control & SECONDARY_EXEC_TSC_SCALING) {
>> +                       cur_offset = kvm_compute_02_tsc_offset(
>> +                                       l1_offset,
>> +                                       vmcs12->tsc_multiplier,
>> +                                       vmcs12->tsc_offset);
>> +               } else {
>> +                       cur_offset = l1_offset + vmcs12->tsc_offset;
>>
>> So I think you should just pass vmcs12 and the L1 offset to
>> kvm_compute_02_tsc_offset, and let it handle both cases (and possibly
>> even set vcpu->arch.tsc_scaling_ratio in the same function).
> 
> That was my thinking initially too. However, kvm_compute_02_tsc_offset
> is defined in x86.c which is vmx-agnostic and hence 'struct vmcs12' is
> not defined there.

Good point.  Yeah, it's trading one code duplication for another.
Right now there's already code duplication in write_l1_tsc_offset, so it
makes some sense to limit duplication _within_ the file at the cost of
duplicated code across vmx/svm, but either is okay.

Paolo


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/8] KVM: VMX: Implement nested TSC scaling
  2021-05-06 17:16 ` [PATCH 0/8] KVM: VMX: Implement nested TSC scaling Jim Mattson
  2021-05-06 17:48   ` Stamatis, Ilias
@ 2021-05-10 14:29   ` Paolo Bonzini
  1 sibling, 0 replies; 36+ messages in thread
From: Paolo Bonzini @ 2021-05-10 14:29 UTC (permalink / raw)
  To: Jim Mattson, ilstam
  Cc: kvm list, ilstam, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Joerg Roedel, Haozhong Zhang, zamsden, mtosatti,
	Denis Plotnikov, David Woodhouse

On 06/05/21 19:16, Jim Mattson wrote:
> On Thu, May 6, 2021 at 3:34 AM <ilstam@mailbox.org> wrote:
>>
>> From: Ilias Stamatis <ilstam@amazon.com>
>>
>> KVM currently supports hardware-assisted TSC scaling but only for L1 and it
>> doesn't expose the feature to nested guests. This patch series adds support for
>> nested TSC scaling and allows both L1 and L2 to be scaled with different
>> scaling factors.
>>
>> When scaling and offsetting is applied, the TSC for the guest is calculated as:
>>
>> (TSC * multiplier >> 48) + offset
>>
>> With nested scaling the values in VMCS01 and VMCS12 need to be merged
>> together and stored in VMCS02.
>>
>> The VMCS02 values are calculated as follows:
>>
>> offset_02 = ((offset_01 * mult_12) >> 48) + offset_12
>> mult_02 = (mult_01 * mult_12) >> 48
>>
>> The last patch of the series adds a KVM selftest.
> 
> Will you be doing the same for SVM? The last time I tried to add a
> nested virtualization feature for Intel only, Paolo rapped my knuckles
> with a ruler.

For bugfixes, definitely; for features it is nice to have.

And these days we even have similar-enough code between nVMX and nSVM
that in many cases there's really no good excuse not to do it.

Paolo


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/8] KVM: VMX: Adjust the TSC-related VMCS fields on L2 entry and exit
  2021-05-10 13:53   ` Maxim Levitsky
@ 2021-05-10 14:44     ` Stamatis, Ilias
  2021-05-11 12:38       ` Maxim Levitsky
  0 siblings, 1 reply; 36+ messages in thread
From: Stamatis, Ilias @ 2021-05-10 14:44 UTC (permalink / raw)
  To: kvm, pbonzini, mlevitsk, ilstam
  Cc: jmattson, Woodhouse, David, vkuznets, haozhong.zhang, joro,
	mtosatti, zamsden, seanjc, dplotnikov, wanpengli

On Mon, 2021-05-10 at 16:53 +0300, Maxim Levitsky wrote:
> On Thu, 2021-05-06 at 10:32 +0000, ilstam@mailbox.org wrote:
> > From: Ilias Stamatis <ilstam@amazon.com>
> > 
> > When L2 is entered we need to "merge" the TSC multiplier and TSC offset
> > values of VMCS01 and VMCS12 and store the result into the current
> > VMCS02.
> > 
> > The 02 values are calculated using the following equations:
> >   offset_02 = ((offset_01 * mult_12) >> 48) + offset_12
> >   mult_02 = (mult_01 * mult_12) >> 48
> 
> I would mention that 48 is kvm_tsc_scaling_ratio_frac_bits instead.
> Also maybe add the common code in a separate patch?

Right, will do.

> 
> > 
> > Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h |  1 +
> >  arch/x86/kvm/vmx/nested.c       | 26 ++++++++++++++++++++++----
> >  arch/x86/kvm/x86.c              | 25 +++++++++++++++++++++++++
> >  3 files changed, 48 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index cdddbf0b1177..e7a1eb36f95a 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1780,6 +1780,7 @@ void kvm_define_user_return_msr(unsigned index, u32 msr);
> >  int kvm_set_user_return_msr(unsigned index, u64 val, u64 mask);
> > 
> >  u64 kvm_scale_tsc(struct kvm_vcpu *vcpu, u64 tsc, bool l1);
> > +u64 kvm_compute_02_tsc_offset(u64 l1_offset, u64 l2_multiplier, u64 l2_offset);
> >  u64 kvm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc);
> > 
> >  unsigned long kvm_get_linear_rip(struct kvm_vcpu *vcpu);
> > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > index bced76637823..a1bf28f33837 100644
> > --- a/arch/x86/kvm/vmx/nested.c
> > +++ b/arch/x86/kvm/vmx/nested.c
> > @@ -3353,8 +3353,22 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
> >       }
> > 
> >       enter_guest_mode(vcpu);
> > -     if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETTING)
> > -             vcpu->arch.tsc_offset += vmcs12->tsc_offset;
> > +
> > +     if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETTING) {
> > +             if (vmcs12->secondary_vm_exec_control & SECONDARY_EXEC_TSC_SCALING) {
> > +                     vcpu->arch.tsc_offset = kvm_compute_02_tsc_offset(
> > +                                     vcpu->arch.l1_tsc_offset,
> > +                                     vmcs12->tsc_multiplier,
> > +                                     vmcs12->tsc_offset);
> > +
> > +                     vcpu->arch.tsc_scaling_ratio = mul_u64_u64_shr(
> > +                                     vcpu->arch.tsc_scaling_ratio,
> > +                                     vmcs12->tsc_multiplier,
> > +                                     kvm_tsc_scaling_ratio_frac_bits);
> > +             } else {
> > +                     vcpu->arch.tsc_offset += vmcs12->tsc_offset;
> > +             }
> > +     }
> > 
> >       if (prepare_vmcs02(vcpu, vmcs12, &entry_failure_code)) {
> >               exit_reason.basic = EXIT_REASON_INVALID_STATE;
> > @@ -4454,8 +4468,12 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
> >       if (nested_cpu_has_preemption_timer(vmcs12))
> >               hrtimer_cancel(&to_vmx(vcpu)->nested.preemption_timer);
> > 
> > -     if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETTING)
> > -             vcpu->arch.tsc_offset -= vmcs12->tsc_offset;
> > +     if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETTING) {
> > +             vcpu->arch.tsc_offset = vcpu->arch.l1_tsc_offset;
> > +
> > +             if (vmcs12->secondary_vm_exec_control & SECONDARY_EXEC_TSC_SCALING)
> > +                     vcpu->arch.tsc_scaling_ratio = vcpu->arch.l1_tsc_scaling_ratio;
> > +     }
> > 
> >       if (likely(!vmx->fail)) {
> >               sync_vmcs02_to_vmcs12(vcpu, vmcs12);
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 26a4c0f46f15..87deb119c521 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -2266,6 +2266,31 @@ static u64 kvm_compute_tsc_offset(struct kvm_vcpu *vcpu, u64 target_tsc)
> >       return target_tsc - tsc;
> >  }
> > 
> > +/*
> > + * This function computes the TSC offset that is stored in VMCS02 when entering
> > + * L2 by combining the offset and multiplier values of VMCS01 and VMCS12.
> > + */
> > +u64 kvm_compute_02_tsc_offset(u64 l1_offset, u64 l2_multiplier, u64 l2_offset)
> > +{
> > +     u64 offset;
> > +
> > +     /*
> > +      * The L1 offset is interpreted as a signed number by the CPU
> > and can
> > +      * be negative. So we extract the sign before the
> > multiplication and
> > +      * put it back afterwards if needed.
> 
> If I understand correctly the reason for sign extraction is that we
> don't have mul_s64_u64_shr. Maybe we should add it?
> 

Initially I added a mul_s64_s64_shr and used that.

You can take a look
here:

https://github.com/ilstam/linux/commit/7bc0b75bb7b8d9bcecf41e2e1e1936880d3b93d1


I added mul_s64_s64_shr instead of mul_s64_u64_shr thinking that a)
perhaps it would be more useful as a general-purpose function and b)
even though the multiplier is unsigned, in reality it's capped so it
should be safe to cast it to signed.

But then I realized that mul_u64_u64_shr, which we already have, is
enough and I can simply reuse that.

How would a mul_s64_u64_shr be implemented?

I think in the end we need to decide if we want to do signed or unsigned
multiplication. And depending on this we need to do a signed or unsigned
right shift accordingly.

So if we did have a mul_s64_u64_shr wouldn't it need to cast
either of the arguments and then use mul_u64_u64 (or mul_s64_s64) being
just a wrapper essentially? I don't think that we can guarantee that
casting either of the arguments is always safe if we want to advertise
this as a general purpose function.
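For what it's worth, one way a mul_s64_u64_shr could look is a thin wrapper that extracts the sign and restores it afterwards, exactly as the patch does inline — so only an unsigned multiplication and shift are ever needed. A sketch in plain C follows; mul_u64_u64_shr is approximated here with __int128, the names merely mirror the kernel's, and negating INT64_MIN or results wider than 63 bits are out of scope.

```c
#include <stdint.h>

/* Stand-in for the kernel's mul_u64_u64_shr(): (a * b) >> shift */
static uint64_t mul_u64_u64_shr(uint64_t a, uint64_t b, unsigned int shift)
{
	return (uint64_t)(((unsigned __int128)a * b) >> shift);
}

/*
 * Signed-by-unsigned variant: multiply |a| as unsigned, then restore
 * the sign, sidestepping the signed-vs-unsigned shift question.
 */
static int64_t mul_s64_u64_shr(int64_t a, uint64_t b, unsigned int shift)
{
	uint64_t ret = mul_u64_u64_shr(a < 0 ? 0 - (uint64_t)a : (uint64_t)a,
				       b, shift);

	return a < 0 ? -(int64_t)ret : (int64_t)ret;
}
```

With such a helper, the body of kvm_compute_02_tsc_offset would reduce to a single expression of the shape l2_offset + mul_s64_u64_shr((s64)l1_offset, l2_multiplier, kvm_tsc_scaling_ratio_frac_bits), along the lines suggested further down in the quoted reply.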

Thanks for the reviews!

Ilias

> The pattern of (A * B) >> shift appears many times in TSC scaling.
> 
> So instead of this function maybe just use something like that?
> 
> merged_offset = l2_offset + mul_s64_u64_shr((s64) l1_offset,
>                                             l2_multiplier,
>                                             kvm_tsc_scaling_ratio_frac_bits)
> 
> Or another idea:
> 
> How about
> 
> u64 __kvm_scale_tsc_value(u64 value, u64 multiplier) {
>         return mul_u64_u64_shr(value,
>                                multiplier,
>                                kvm_tsc_scaling_ratio_frac_bits);
> }
> 
> 
> and
> 
> s64 __kvm_scale_tsc_offset(u64 value, u64 multiplier)
> {
>         return mul_s64_u64_shr((s64)value,
>                                multiplier,
>                                kvm_tsc_scaling_ratio_frac_bits);
> }
> 
> And then use them in the code.
> 
> Overall though the code *looks* correct to me
> but I might have missed something.
> 
> Best regards,
>         Maxim Levitsky
> 
> 
> > +      */
> > +     offset = mul_u64_u64_shr(abs((s64) l1_offset),
> > +                              l2_multiplier,
> > +                              kvm_tsc_scaling_ratio_frac_bits);
> > +
> > +     if ((s64) l1_offset < 0)
> > +             offset = -((s64) offset);
> > +
> > +     offset += l2_offset;
> > +     return offset;
> > +}
> > +EXPORT_SYMBOL_GPL(kvm_compute_02_tsc_offset);
> > +
> >  u64 kvm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc)
> >  {
> >       return vcpu->arch.l1_tsc_offset + kvm_scale_tsc(vcpu, host_tsc, true);
> 
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/8] KVM: X86: Pass an additional 'L1' argument to kvm_scale_tsc()
  2021-05-10 13:52   ` Maxim Levitsky
@ 2021-05-10 15:44     ` Stamatis, Ilias
  0 siblings, 0 replies; 36+ messages in thread
From: Stamatis, Ilias @ 2021-05-10 15:44 UTC (permalink / raw)
  To: kvm, pbonzini, mlevitsk, ilstam
  Cc: jmattson, Woodhouse, David, vkuznets, joro, mtosatti, zamsden,
	seanjc, wanpengli

On Mon, 2021-05-10 at 16:52 +0300, Maxim Levitsky wrote:
> On Thu, 2021-05-06 at 10:32 +0000, ilstam@mailbox.org wrote:
> > From: Ilias Stamatis <ilstam@amazon.com>
> > 
> > Sometimes kvm_scale_tsc() needs to use the current scaling ratio and
> > other times (like when reading the TSC from user space) it needs to use
> > L1's scaling ratio. Have the caller specify this by passing an
> > additional boolean argument.
> > 
> > Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h |  2 +-
> >  arch/x86/kvm/x86.c              | 21 +++++++++++++--------
> >  2 files changed, 14 insertions(+), 9 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 132e820525fb..cdddbf0b1177 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1779,7 +1779,7 @@ int kvm_pv_send_ipi(struct kvm *kvm, unsigned long ipi_bitmap_low,
> >  void kvm_define_user_return_msr(unsigned index, u32 msr);
> >  int kvm_set_user_return_msr(unsigned index, u64 val, u64 mask);
> > 
> > -u64 kvm_scale_tsc(struct kvm_vcpu *vcpu, u64 tsc);
> > +u64 kvm_scale_tsc(struct kvm_vcpu *vcpu, u64 tsc, bool l1);
> >  u64 kvm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc);
> > 
> >  unsigned long kvm_get_linear_rip(struct kvm_vcpu *vcpu);
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 7bc5155ac6fd..26a4c0f46f15 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -2241,10 +2241,14 @@ static inline u64 __scale_tsc(u64 ratio, u64 tsc)
> >       return mul_u64_u64_shr(tsc, ratio, kvm_tsc_scaling_ratio_frac_bits);
> >  }
> > 
> > -u64 kvm_scale_tsc(struct kvm_vcpu *vcpu, u64 tsc)
> > +/*
> > + * If l1 is true the TSC is scaled using L1's scaling ratio, otherwise
> > + * the current scaling ratio is used.
> > + */
> > +u64 kvm_scale_tsc(struct kvm_vcpu *vcpu, u64 tsc, bool l1)
> >  {
> >       u64 _tsc = tsc;
> > -     u64 ratio = vcpu->arch.tsc_scaling_ratio;
> > +     u64 ratio = l1 ? vcpu->arch.l1_tsc_scaling_ratio : vcpu->arch.tsc_scaling_ratio;
> > 
> >       if (ratio != kvm_default_tsc_scaling_ratio)
> >               _tsc = __scale_tsc(ratio, tsc);
> 
> I do wonder if it is better to have kvm_scale_tsc_l1 and kvm_scale_tsc instead for better
> readablility?
> 

That makes sense. Will do.

> 
> > @@ -2257,14 +2261,14 @@ static u64 kvm_compute_tsc_offset(struct kvm_vcpu *vcpu, u64 target_tsc)
> >  {
> >       u64 tsc;
> > 
> > -     tsc = kvm_scale_tsc(vcpu, rdtsc());
> > +     tsc = kvm_scale_tsc(vcpu, rdtsc(), true);
> 
> Here we have a somewhat dangerous assumption that this function
> will always be used with L1 tsc values.
> 
> The kvm_compute_tsc_offset should at least be renamed to
> kvm_compute_tsc_offset_l1 or something like that.
> 
> Currently the assumption holds though:
> 
> We call the kvm_compute_tsc_offset in:
> 
> -> kvm_synchronize_tsc which is nowadays thankfully only called
> on TSC writes from qemu, which are assumed to be L1 values.
> 
> (this is pending a rework of the whole thing which I started
> some time ago but haven't had a chance to finish it yet)
> 
> -> Guest write of MSR_IA32_TSC. The value written is in L1 units,
> since TSC offset/scaling only covers RDTSC and RDMSR of the IA32_TSC,
> while WRMSR should be intercepted by L1 and emulated.
> If it is not emulated, then L2 would just write L1 value.
> 
> -> in kvm_arch_vcpu_load, when TSC is unstable, we always try to resume
> the guest from the same TSC value as it had seen last time,
> and then catchup.

Yes. I wasn't sure about kvm_compute_tsc_offset but my understanding was
that all of its callers wanted the L1 value scaled.

Renaming it to kvm_compute_tsc_offset_l1 sounds like a great idea.

> Also host TSC values are used, and after reading this function,
> I recommend renaming vcpu->arch.last_guest_tsc
> to vcpu->arch.last_guest_tsc_l1 to document this.

OK

> > 
> >       return target_tsc - tsc;
> >  }
> > 
> >  u64 kvm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc)
> >  {
> > -     return vcpu->arch.l1_tsc_offset + kvm_scale_tsc(vcpu, host_tsc);
> > +     return vcpu->arch.l1_tsc_offset + kvm_scale_tsc(vcpu, host_tsc, true);
> 
> OK
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_read_l1_tsc);
> > 
> > @@ -2395,9 +2399,9 @@ static inline void adjust_tsc_offset_guest(struct kvm_vcpu *vcpu,
> > 
> >  static inline void adjust_tsc_offset_host(struct kvm_vcpu *vcpu, s64 adjustment)
> >  {
> > -     if (vcpu->arch.tsc_scaling_ratio != kvm_default_tsc_scaling_ratio)
> > +     if (vcpu->arch.l1_tsc_scaling_ratio != kvm_default_tsc_scaling_ratio)
> >               WARN_ON(adjustment < 0);
> 
> This should belong to patch 2 IMHO.
> 

Right, I will move it.

> > -     adjustment = kvm_scale_tsc(vcpu, (u64) adjustment);
> > +     adjustment = kvm_scale_tsc(vcpu, (u64) adjustment, true);
> 
> OK
> >       adjust_tsc_offset_guest(vcpu, adjustment);
> >  }
> > 
> > @@ -2780,7 +2784,7 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
> >       /* With all the info we got, fill in the values */
> > 
> >       if (kvm_has_tsc_control)
> > -             tgt_tsc_khz = kvm_scale_tsc(v, tgt_tsc_khz);
> > +             tgt_tsc_khz = kvm_scale_tsc(v, tgt_tsc_khz, true);
> 
> OK (kvmclock is for L1 only, L1 hypervisor is free to expose its own kvmclock to L2)
> > 
> >       if (unlikely(vcpu->hw_tsc_khz != tgt_tsc_khz)) {
> >               kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL,
> > @@ -3474,7 +3478,8 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> >               u64 tsc_offset = msr_info->host_initiated ? vcpu->arch.l1_tsc_offset :
> >                                                           vcpu->arch.tsc_offset;
> > 
> > -             msr_info->data = kvm_scale_tsc(vcpu, rdtsc()) + tsc_offset;
> > +             msr_info->data = kvm_scale_tsc(vcpu, rdtsc(),
> > +                                            msr_info->host_initiated) + tsc_offset;
> 
> Since we now do two things that depend on msr_info->host_initiated, I
> think I would prefer to convert this back to regular 'if'.
> I don't have a strong opinion on this though.
> 

Agreed.

Thanks!

Ilias

> 
> >               break;
> >       }
> >       case MSR_MTRRcap:
> 
> 
> Best regards,
>         Maxim Levitsky
> 
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 6/8] KVM: VMX: Make vmx_write_l1_tsc_offset() work with nested TSC scaling
  2021-05-10 13:54   ` Maxim Levitsky
@ 2021-05-10 16:08     ` Stamatis, Ilias
  2021-05-11 12:44       ` Maxim Levitsky
  0 siblings, 1 reply; 36+ messages in thread
From: Stamatis, Ilias @ 2021-05-10 16:08 UTC (permalink / raw)
  To: kvm, pbonzini, mlevitsk, ilstam
  Cc: jmattson, Woodhouse, David, vkuznets, joro, mtosatti, zamsden,
	seanjc, wanpengli

On Mon, 2021-05-10 at 16:54 +0300, Maxim Levitsky wrote:
> On Thu, 2021-05-06 at 10:32 +0000, ilstam@mailbox.org wrote:
> > From: Ilias Stamatis <ilstam@amazon.com>
> > 
> > Calculating the current TSC offset is done differently when nested TSC
> > scaling is used.
> > 
> > Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
> > ---
> >  arch/x86/kvm/vmx/vmx.c | 26 ++++++++++++++++++++------
> >  1 file changed, 20 insertions(+), 6 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 49241423b854..df7dc0e4c903 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -1797,10 +1797,16 @@ static void setup_msrs(struct vcpu_vmx *vmx)
> >               vmx_update_msr_bitmap(&vmx->vcpu);
> >  }
> > 
> > -static u64 vmx_write_l1_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
> > +/*
> > + * This function receives the requested offset for L1 as an argument but it
> > + * actually writes the "current" tsc offset to the VMCS and returns it. The
> > + * current offset might be different in case an L2 guest is currently running
> > + * and its VMCS02 is loaded.
> > + */
> 
> (Not related to this patch) It might be a good idea to rename this callback
> instead of this comment, but I am not sure about it.
> 

Yes! I was planning to do this on v2 anyway as the name of that function
is completely misleading/wrong IMHO.

I think that even the comment inside it that explains that when L1
doesn't trap WRMSR then L2 TSC writes overwrite L1's TSC is misplaced.
It should go one or more levels above I believe.

This function simply updates the TSC offset in the current VMCS
depending on whether L1 or L2 is executing.

> 
> > +static u64 vmx_write_l1_tsc_offset(struct kvm_vcpu *vcpu, u64 l1_offset)
> >  {
> >       struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> > -     u64 g_tsc_offset = 0;
> > +     u64 cur_offset = l1_offset;
> > 
> >       /*
> >        * We're here if L1 chose not to trap WRMSR to TSC. According
> > @@ -1809,11 +1815,19 @@ static u64 vmx_write_l1_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
> >        * to the newly set TSC to get L2's TSC.
> >        */
> >       if (is_guest_mode(vcpu) &&
> > -         (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETTING))
> > -             g_tsc_offset = vmcs12->tsc_offset;
> > +         (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETTING)) {
> > +             if (vmcs12->secondary_vm_exec_control & SECONDARY_EXEC_TSC_SCALING) {
> > +                     cur_offset = kvm_compute_02_tsc_offset(
> > +                                     l1_offset,
> > +                                     vmcs12->tsc_multiplier,
> > +                                     vmcs12->tsc_offset);
> > +             } else {
> > +                     cur_offset = l1_offset + vmcs12->tsc_offset;
> > +             }
> > +     }
> > 
> > -     vmcs_write64(TSC_OFFSET, offset + g_tsc_offset);
> > -     return offset + g_tsc_offset;
> > +     vmcs_write64(TSC_OFFSET, cur_offset);
> > +     return cur_offset;
> >  }
> > 
> >  /*
> 
> This code would be ideal to move to common code as SVM will do basically
> the same thing.
> Doesn't have to be done now though.
> 

Hmm, how can we do the feature availability checking in common code?

> 
> Best regards,
>         Maxim Levitsky
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 8/8] KVM: selftests: x86: Add vmx_nested_tsc_scaling_test
  2021-05-10 13:59   ` Maxim Levitsky
@ 2021-05-11 11:16     ` Stamatis, Ilias
  2021-05-11 12:47       ` Maxim Levitsky
  0 siblings, 1 reply; 36+ messages in thread
From: Stamatis, Ilias @ 2021-05-11 11:16 UTC (permalink / raw)
  To: kvm, pbonzini, mlevitsk, ilstam
  Cc: jmattson, Woodhouse, David, vkuznets, joro, mtosatti, zamsden,
	seanjc, wanpengli

On Mon, 2021-05-10 at 16:59 +0300, Maxim Levitsky wrote:
> On Thu, 2021-05-06 at 10:32 +0000, ilstam@mailbox.org wrote:
> > From: Ilias Stamatis <ilstam@amazon.com>
> > 
> > Test that nested TSC scaling works as expected with both L1 and L2
> > scaled.
> > 
> > Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
> > ---
> >  tools/testing/selftests/kvm/.gitignore        |   1 +
> >  tools/testing/selftests/kvm/Makefile          |   1 +
> >  .../kvm/x86_64/vmx_nested_tsc_scaling_test.c  | 209 ++++++++++++++++++
> >  3 files changed, 211 insertions(+)
> >  create mode 100644 tools/testing/selftests/kvm/x86_64/vmx_nested_tsc_scaling_test.c
> > 
> > diff --git a/tools/testing/selftests/kvm/.gitignore b/tools/testing/selftests/kvm/.gitignore
> > index bd83158e0e0b..cc02022f9951 100644
> > --- a/tools/testing/selftests/kvm/.gitignore
> > +++ b/tools/testing/selftests/kvm/.gitignore
> > @@ -29,6 +29,7 @@
> >  /x86_64/vmx_preemption_timer_test
> >  /x86_64/vmx_set_nested_state_test
> >  /x86_64/vmx_tsc_adjust_test
> > +/x86_64/vmx_nested_tsc_scaling_test
> >  /x86_64/xapic_ipi_test
> >  /x86_64/xen_shinfo_test
> >  /x86_64/xen_vmcall_test
> > diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
> > index e439d027939d..1078240b1313 100644
> > --- a/tools/testing/selftests/kvm/Makefile
> > +++ b/tools/testing/selftests/kvm/Makefile
> > @@ -60,6 +60,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/vmx_close_while_nested_test
> >  TEST_GEN_PROGS_x86_64 += x86_64/vmx_dirty_log_test
> >  TEST_GEN_PROGS_x86_64 += x86_64/vmx_set_nested_state_test
> >  TEST_GEN_PROGS_x86_64 += x86_64/vmx_tsc_adjust_test
> > +TEST_GEN_PROGS_x86_64 += x86_64/vmx_nested_tsc_scaling_test
> >  TEST_GEN_PROGS_x86_64 += x86_64/xapic_ipi_test
> >  TEST_GEN_PROGS_x86_64 += x86_64/xss_msr_test
> >  TEST_GEN_PROGS_x86_64 += x86_64/debug_regs
> > diff --git a/tools/testing/selftests/kvm/x86_64/vmx_nested_tsc_scaling_test.c b/tools/testing/selftests/kvm/x86_64/vmx_nested_tsc_scaling_test.c
> > new file mode 100644
> > index 000000000000..b05f5151ecbe
> > --- /dev/null
> > +++ b/tools/testing/selftests/kvm/x86_64/vmx_nested_tsc_scaling_test.c
> > @@ -0,0 +1,209 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * vmx_nested_tsc_scaling_test
> > + *
> > + * Copyright (C) 2021 Amazon.com, Inc. or its affiliates.
> > + *
> > + * This test case verifies that nested TSC scaling behaves as expected when
> > + * both L1 and L2 are scaled using different ratios. For this test we scale
> > + * L1 down and scale L2 up.
> > + */
> > +
> > +
> > +#include "kvm_util.h"
> > +#include "vmx.h"
> > +#include "kselftest.h"
> > +
> > +
> > +#define VCPU_ID 0
> > +
> > +/* L1 is scaled down by this factor */
> > +#define L1_SCALE_FACTOR 2ULL
> > +/* L2 is scaled up (from L1's perspective) by this factor */
> > +#define L2_SCALE_FACTOR 4ULL
> 
> For fun, I might have randomized these factors as well.

So L2_SCALE_FACTOR (or rather TSC_MULTIPLIER_L2 that depends on it) is
referenced from within l1_guest_code(). If we change this to a static variable
we won't be able to access it from there. How could this be done?

For the L1 factor it's easy as we only use it in main().

> 
> > +
> > +#define TSC_OFFSET_L2 (1UL << 32)
> > +#define TSC_MULTIPLIER_L2 (L2_SCALE_FACTOR << 48)
> 
> It would be fun to use a negative offset here (also randomally).

Do you mean a random offset that is always negative or a random offset that
sometimes is positive and sometimes is negative?

> 
> > +
> > +#define L2_GUEST_STACK_SIZE 64
> > +
> > +enum { USLEEP, UCHECK_L1, UCHECK_L2 };
> > +#define GUEST_SLEEP(sec)         ucall(UCALL_SYNC, 2, USLEEP, sec)
> > +#define GUEST_CHECK(level, freq) ucall(UCALL_SYNC, 2, level, freq)
> > +
> > +
> > +/*
> > + * This function checks whether the "actual" TSC frequency of a guest matches
> > + * its expected frequency. In order to account for delays in taking the TSC
> > + * measurements, a difference of 1% between the actual and the expected value
> > + * is tolerated.
> > + */
> > +static void compare_tsc_freq(uint64_t actual, uint64_t expected)
> > +{
> > +     uint64_t tolerance, thresh_low, thresh_high;
> > +
> > +     tolerance = expected / 100;
> > +     thresh_low = expected - tolerance;
> > +     thresh_high = expected + tolerance;
> > +
> > +     TEST_ASSERT(thresh_low < actual,
> > +             "TSC freq is expected to be between %"PRIu64" and %"PRIu64
> > +             " but it actually is %"PRIu64,
> > +             thresh_low, thresh_high, actual);
> > +     TEST_ASSERT(thresh_high > actual,
> > +             "TSC freq is expected to be between %"PRIu64" and %"PRIu64
> > +             " but it actually is %"PRIu64,
> > +             thresh_low, thresh_high, actual);
> > +}
> > +
> > +static void check_tsc_freq(int level)
> > +{
> > +     uint64_t tsc_start, tsc_end, tsc_freq;
> > +
> > +     /*
> > +      * Reading the TSC twice with about a second's difference should give
> > +      * us an approximation of the TSC frequency from the guest's
> > +      * perspective. Now, this won't be completely accurate, but it should
> > +      * be good enough for the purposes of this test.
> > +      */
> 
> It would be nice to know if the host has a stable TSC (you can obtain this via
> KVM_GET_CLOCK, the KVM_CLOCK_TSC_STABLE flag).
> 
> And if not stable skip the test, to avoid false positives.
> (Yes I have a laptop I just bought that has an unstable TSC....)
> 

Hmm, this is a vm ioctl but I noticed that one of its vcpus needs to have been
run at least once otherwise it won't return KVM_CLOCK_TSC_STABLE in the flags.

So...

> 
> > +     tsc_start = rdmsr(MSR_IA32_TSC);
> > +     GUEST_SLEEP(1);
> > +     tsc_end = rdmsr(MSR_IA32_TSC);
> > +
> > +     tsc_freq = tsc_end - tsc_start;
> > +
> > +     GUEST_CHECK(level, tsc_freq);
> > +}
> > +
> > +static void l2_guest_code(void)
> > +{
> > +     check_tsc_freq(UCHECK_L2);
> > +
> > +     /* exit to L1 */
> > +     __asm__ __volatile__("vmcall");
> > +}
> > +
> > +static void l1_guest_code(struct vmx_pages *vmx_pages)
> > +{
> > +     unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE];
> > +     uint32_t control;
> > +
> > +     /* check that L1's frequency looks alright before launching L2 */
> > +     check_tsc_freq(UCHECK_L1);
> > +
> > +     GUEST_ASSERT(prepare_for_vmx_operation(vmx_pages));
> > +     GUEST_ASSERT(load_vmcs(vmx_pages));
> > +
> > +     /* prepare the VMCS for L2 execution */
> > +     prepare_vmcs(vmx_pages, l2_guest_code, &l2_guest_stack[L2_GUEST_STACK_SIZE]);
> > +
> > +     /* enable TSC offsetting and TSC scaling for L2 */
> > +     control = vmreadz(CPU_BASED_VM_EXEC_CONTROL);
> > +     control |= CPU_BASED_USE_MSR_BITMAPS | CPU_BASED_USE_TSC_OFFSETTING;
> > +     vmwrite(CPU_BASED_VM_EXEC_CONTROL, control);
> > +
> > +     control = vmreadz(SECONDARY_VM_EXEC_CONTROL);
> > +     control |= SECONDARY_EXEC_TSC_SCALING;
> > +     vmwrite(SECONDARY_VM_EXEC_CONTROL, control);
> > +
> > +     vmwrite(TSC_OFFSET, TSC_OFFSET_L2);
> > +     vmwrite(TSC_MULTIPLIER, TSC_MULTIPLIER_L2);
> > +     vmwrite(TSC_MULTIPLIER_HIGH, TSC_MULTIPLIER_L2 >> 32);
> > +
> > +     /* launch L2 */
> > +     GUEST_ASSERT(!vmlaunch());
> > +     GUEST_ASSERT(vmreadz(VM_EXIT_REASON) == EXIT_REASON_VMCALL);
> > +
> > +     /* check that L1's frequency still looks good */
> > +     check_tsc_freq(UCHECK_L1);
> > +
> > +     GUEST_DONE();
> > +}
> > +
> > +static void tsc_scaling_check_supported(void)
> > +{
> > +     if (!kvm_check_cap(KVM_CAP_TSC_CONTROL)) {
> > +             print_skip("TSC scaling not supported by the HW");
> > +             exit(KSFT_SKIP);
> > +     }
> > +}
> > +
> > +int main(int argc, char *argv[])
> > +{
> > +     struct kvm_vm *vm;
> > +     vm_vaddr_t vmx_pages_gva;
> > +
> > +     uint64_t tsc_start, tsc_end;
> > +     uint64_t tsc_khz;
> > +     uint64_t l0_tsc_freq = 0;
> > +     uint64_t l1_tsc_freq = 0;
> > +     uint64_t l2_tsc_freq = 0;
> > +
> > +     nested_vmx_check_supported();
> > +     tsc_scaling_check_supported();

I can't add the check here

> > +
> > +     tsc_start = rdtsc();
> > +     sleep(1);
> > +     tsc_end = rdtsc();
> > +
> > +     l0_tsc_freq = tsc_end - tsc_start;
> > +     printf("real TSC frequency is around: %"PRIu64"\n", l0_tsc_freq);
> > +
> > +     vm = vm_create_default(VCPU_ID, 0, (void *) l1_guest_code);
> > +     vcpu_alloc_vmx(vm, &vmx_pages_gva);
> > +     vcpu_args_set(vm, VCPU_ID, 1, vmx_pages_gva);

nor here

> > +
> > +     tsc_khz = _vcpu_ioctl(vm, VCPU_ID, KVM_GET_TSC_KHZ, NULL);
> > +     TEST_ASSERT(tsc_khz != -1, "vcpu ioctl KVM_GET_TSC_KHZ failed");
> > +
> > +     /* scale down L1's TSC frequency */
> > +     vcpu_ioctl(vm, VCPU_ID, KVM_SET_TSC_KHZ,
> > +               (void *) (tsc_khz / L1_SCALE_FACTOR));
> > +
> > +     for (;;) {
> > +             volatile struct kvm_run *run = vcpu_state(vm, VCPU_ID);
> > +             struct ucall uc;
> > +
> > +             vcpu_run(vm, VCPU_ID);
> > +             TEST_ASSERT(run->exit_reason == KVM_EXIT_IO,
> > +                         "Got exit_reason other than KVM_EXIT_IO: %u (%s)\n",
> > +                         run->exit_reason,
> > +                         exit_reason_str(run->exit_reason));

should I add it here?

> > +
> > +             switch (get_ucall(vm, VCPU_ID, &uc)) {
> > +             case UCALL_ABORT:
> > +                     TEST_FAIL("%s", (const char *) uc.args[0]);
> > +             case UCALL_SYNC:
> > +                     switch (uc.args[0]) {
> > +                     case USLEEP:
> > +                             sleep(uc.args[1]);
> > +                             break;
> > +                     case UCHECK_L1:
> > +                             l1_tsc_freq = uc.args[1];
> > +                             printf("L1's TSC frequency is around: %"PRIu64
> > +                                    "\n", l1_tsc_freq);
> > +
> > +                             compare_tsc_freq(l1_tsc_freq,
> > +                                              l0_tsc_freq / L1_SCALE_FACTOR);
> > +                             break;
> > +                     case UCHECK_L2:
> > +                             l2_tsc_freq = uc.args[1];
> > +                             printf("L2's TSC frequency is around: %"PRIu64
> > +                                    "\n", l2_tsc_freq);
> > +
> > +                             compare_tsc_freq(l2_tsc_freq,
> > +                                              l1_tsc_freq * L2_SCALE_FACTOR);
> > +                             break;
> > +                     }
> > +                     break;
> > +             case UCALL_DONE:
> > +                     goto done;
> > +             default:
> > +                     TEST_FAIL("Unknown ucall %lu", uc.cmd);
> > +             }
> > +     }
> > +
> > +done:
> > +     kvm_vm_free(vm);
> > +     return 0;
> > +}
> 
> Overall looks OK to me.
> 
> I can't test it, since the most recent Intel laptop I have (i7-7600U)
> still lacks TSC scaling (or did Intel cripple this feature on clients like what
> they did with APICv ?)
> 
> Best regards,
>         Maxim Levitsky
> 
> 
> 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/8] KVM: VMX: Adjust the TSC-related VMCS fields on L2 entry and exit
  2021-05-10 14:44     ` Stamatis, Ilias
@ 2021-05-11 12:38       ` Maxim Levitsky
  2021-05-11 15:11         ` Stamatis, Ilias
  0 siblings, 1 reply; 36+ messages in thread
From: Maxim Levitsky @ 2021-05-11 12:38 UTC (permalink / raw)
  To: Stamatis, Ilias, kvm, pbonzini, ilstam
  Cc: jmattson, Woodhouse, David, vkuznets, haozhong.zhang, joro,
	mtosatti, zamsden, seanjc, dplotnikov, wanpengli

On Mon, 2021-05-10 at 14:44 +0000, Stamatis, Ilias wrote:
> On Mon, 2021-05-10 at 16:53 +0300, Maxim Levitsky wrote:
> > On Thu, 2021-05-06 at 10:32 +0000, ilstam@mailbox.org wrote:
> > > From: Ilias Stamatis <ilstam@amazon.com>
> > > 
> > > When L2 is entered we need to "merge" the TSC multiplier and TSC
> > > offset
> > > values of VMCS01 and VMCS12 and store the result into the current
> > > VMCS02.
> > > 
> > > The 02 values are calculated using the following equations:
> > >   offset_02 = ((offset_01 * mult_12) >> 48) + offset_12
> > >   mult_02 = (mult_01 * mult_12) >> 48
> > 
> > I would mention that 48 is kvm_tsc_scaling_ratio_frac_bits instead.
> > Also maybe add the common code in a separate patch?
> 
> Right, will do.
> 
> > > Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
> > > ---
> > >  arch/x86/include/asm/kvm_host.h |  1 +
> > >  arch/x86/kvm/vmx/nested.c       | 26 ++++++++++++++++++++++----
> > >  arch/x86/kvm/x86.c              | 25 +++++++++++++++++++++++++
> > >  3 files changed, 48 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/arch/x86/include/asm/kvm_host.h
> > > b/arch/x86/include/asm/kvm_host.h
> > > index cdddbf0b1177..e7a1eb36f95a 100644
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -1780,6 +1780,7 @@ void kvm_define_user_return_msr(unsigned index, u32 msr);
> > >  int kvm_set_user_return_msr(unsigned index, u64 val, u64 mask);
> > > 
> > >  u64 kvm_scale_tsc(struct kvm_vcpu *vcpu, u64 tsc, bool l1);
> > > +u64 kvm_compute_02_tsc_offset(u64 l1_offset, u64 l2_multiplier, u64
> > > l2_offset);
> > >  u64 kvm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc);
> > > 
> > >  unsigned long kvm_get_linear_rip(struct kvm_vcpu *vcpu);
> > > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > > index bced76637823..a1bf28f33837 100644
> > > --- a/arch/x86/kvm/vmx/nested.c
> > > +++ b/arch/x86/kvm/vmx/nested.c
> > > @@ -3353,8 +3353,22 @@ enum nvmx_vmentry_status
> > > nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
> > >       }
> > > 
> > >       enter_guest_mode(vcpu);
> > > -     if (vmcs12->cpu_based_vm_exec_control &
> > > CPU_BASED_USE_TSC_OFFSETTING)
> > > -             vcpu->arch.tsc_offset += vmcs12->tsc_offset;
> > > +
> > > +     if (vmcs12->cpu_based_vm_exec_control &
> > > CPU_BASED_USE_TSC_OFFSETTING) {
> > > +             if (vmcs12->secondary_vm_exec_control &
> > > SECONDARY_EXEC_TSC_SCALING) {
> > > +                     vcpu->arch.tsc_offset =
> > > kvm_compute_02_tsc_offset(
> > > +                                     vcpu->arch.l1_tsc_offset,
> > > +                                     vmcs12->tsc_multiplier,
> > > +                                     vmcs12->tsc_offset);
> > > +
> > > +                     vcpu->arch.tsc_scaling_ratio =
> > > mul_u64_u64_shr(
> > > +                                     vcpu->arch.tsc_scaling_ratio,
> > > +                                     vmcs12->tsc_multiplier,
> > > +                                     kvm_tsc_scaling_ratio_frac_bits);
> > > +             } else {
> > > +                     vcpu->arch.tsc_offset += vmcs12->tsc_offset;
> > > +             }
> > > +     }
> > > 
> > >       if (prepare_vmcs02(vcpu, vmcs12, &entry_failure_code)) {
> > >               exit_reason.basic = EXIT_REASON_INVALID_STATE;
> > > @@ -4454,8 +4468,12 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu,
> > > u32 vm_exit_reason,
> > >       if (nested_cpu_has_preemption_timer(vmcs12))
> > >               hrtimer_cancel(&to_vmx(vcpu)->nested.preemption_timer);
> > > 
> > > -     if (vmcs12->cpu_based_vm_exec_control &
> > > CPU_BASED_USE_TSC_OFFSETTING)
> > > -             vcpu->arch.tsc_offset -= vmcs12->tsc_offset;
> > > +     if (vmcs12->cpu_based_vm_exec_control &
> > > CPU_BASED_USE_TSC_OFFSETTING) {
> > > +             vcpu->arch.tsc_offset = vcpu->arch.l1_tsc_offset;
> > > +
> > > +             if (vmcs12->secondary_vm_exec_control &
> > > SECONDARY_EXEC_TSC_SCALING)
> > > +                     vcpu->arch.tsc_scaling_ratio = vcpu->arch.l1_tsc_scaling_ratio;
> > > +     }
> > > 
> > >       if (likely(!vmx->fail)) {
> > >               sync_vmcs02_to_vmcs12(vcpu, vmcs12);
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index 26a4c0f46f15..87deb119c521 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -2266,6 +2266,31 @@ static u64 kvm_compute_tsc_offset(struct
> > > kvm_vcpu *vcpu, u64 target_tsc)
> > >       return target_tsc - tsc;
> > >  }
> > > 
> > > +/*
> > > + * This function computes the TSC offset that is stored in VMCS02
> > > when entering
> > > + * L2 by combining the offset and multiplier values of VMCS01 and
> > > VMCS12.
> > > + */
> > > +u64 kvm_compute_02_tsc_offset(u64 l1_offset, u64 l2_multiplier, u64
> > > l2_offset)
> > > +{
> > > +     u64 offset;
> > > +
> > > +     /*
> > > +      * The L1 offset is interpreted as a signed number by the CPU
> > > and can
> > > +      * be negative. So we extract the sign before the
> > > multiplication and
> > > +      * put it back afterwards if needed.
> > 
> > If I understand correctly the reason for sign extraction is that we
> > don't
> > have mul_s64_u64_shr. Maybe we should add it?
> > 
> 
> Initially I added a mul_s64_s64_shr and used that.
> 
> You can take a look
> here:
> 
> https://github.com/ilstam/linux/commit/7bc0b75bb7b8d9bcecf41e2e1e1936880d3b93d1
> 
> 
> I added mul_s64_s64_shr instead of mul_s64_u64_shr thinking that a)
> perhaps it would be more useful as a general-purpose function and b)
> even though the multiplier is unsigned, in reality it's capped so it
> should be safe to cast it to signed.

Actually, is it? Intel's SDM, I think, defines the TSC multiplier to be a
full 64-bit value. It is shifted right after the multiplication, making the
multiplier effectively a fixed-point 16.48 number, but since the
multiplication is done first, it still has to use the full 64-bit value.
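To make the 16.48 interpretation concrete, here is a userspace sketch (not
kernel code; mul_u64_u64_shr is emulated with a 128-bit intermediate,
mirroring the helper that the kernel provides in include/linux/math64.h):

```c
#include <stdint.h>

/* Userspace stand-in for the kernel's mul_u64_u64_shr(): multiply two
 * 64-bit values into a 128-bit intermediate and shift the result right. */
static uint64_t mul_u64_u64_shr(uint64_t a, uint64_t b, unsigned int shift)
{
	return (uint64_t)(((unsigned __int128)a * b) >> shift);
}

/* On VMX the multiplier is effectively a 16.48 fixed-point number:
 * guest_tsc = (host_tsc * multiplier) >> 48.  A multiplier of 1 << 48
 * is identity, 2 << 48 doubles the observed frequency, and so on. */
#define VMX_TSC_FRAC_BITS 48

static uint64_t scale_tsc(uint64_t tsc, uint64_t multiplier)
{
	return mul_u64_u64_shr(tsc, multiplier, VMX_TSC_FRAC_BITS);
}
```

Note how the full 64-bit multiplier feeds the multiplication before any
truncation happens, which is exactly why the intermediate must be 128 bits.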
> 
> But then I realized that
> mul_u64_u64_shr that we already have is enough and I can just simply re-
> use that.
> 
> How would a mul_s64_u64_shr be implemented?

I'd say even if it is implemented by just extracting the sign as you did
and then calling mul_u64_u64_shr, it would still be a bit better
documentation-wise.
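A minimal sketch of what such a hypothetical mul_s64_u64_shr could look like,
implemented exactly as suggested above (userspace code, not the actual kernel
helper): extract the sign, do the unsigned multiply-shift, restore the sign.

```c
#include <stdint.h>

/* Userspace stand-in for the kernel's mul_u64_u64_shr(). */
static uint64_t mul_u64_u64_shr(uint64_t a, uint64_t b, unsigned int shift)
{
	return (uint64_t)(((unsigned __int128)a * b) >> shift);
}

/* Hypothetical mul_s64_u64_shr(): scale a signed value by an unsigned
 * fixed-point multiplier.  The sign is extracted before the unsigned
 * multiply-shift and put back afterwards, as discussed in the thread. */
static int64_t mul_s64_u64_shr(int64_t a, uint64_t mul, unsigned int shift)
{
	uint64_t mag = a < 0 ? 0ULL - (uint64_t)a : (uint64_t)a;
	uint64_t res = mul_u64_u64_shr(mag, mul, shift);

	return a < 0 ? -(int64_t)res : (int64_t)res;
}
```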


> 
> I think in the end
> we need to decide if we want to do signed or unsigned multiplication.
> And depending on this we need to do a signed or unsigned right shift
> accordingly.
I agree.
> 
> So if we did have a mul_s64_u64_shr wouldn't it need to cast
> either of the arguments and then use mul_u64_u64 (or mul_s64_s64) being
> just a wrapper essentially? I don't think that we can guarantee that
> casting either of the arguments is always safe if we want to advertise
> this as a general purpose function.

Note that I don't have a strong opinion on this whole thing, so if you feel
that it should be left as is, I don't mind.

> 
> Thanks for the reviews!

No problem!

Best regards,
	Maxim Levitsky

> 
> Ilias
> 
> > The pattern of (A * B) >> shift appears many times in TSC scaling.
> > 
> > So instead of this function maybe just use something like that?
> > 
> > merged_offset = l2_offset + mul_s64_u64_shr ((s64) l1_offset,
> >                                              l2_multiplier,
> >                                              kvm_tsc_scaling_ratio_frac_bits)
> > 
> > Or another idea:
> > 
> > How about
> > 
> > u64 __kvm_scale_tsc_value(u64 value, u64 multiplier) {
> >         return mul_u64_u64_shr(value,
> >                                multiplier,
> >                                kvm_tsc_scaling_ratio_frac_bits);
> > }
> > 
> > 
> > and
> > 
> > s64 __kvm_scale_tsc_offset(u64 value, u64 multiplier)
> > {
> >         return mul_s64_u64_shr((s64)value,
> >                                multiplier,
> >                                kvm_tsc_scaling_ratio_frac_bits);
> > }
> > 
> > And then use them in the code.
> > 
> > Overall though the code *looks* correct to me
> > but I might have missed something.
> > 
> > Best regards,
> >         Maxim Levitsky
> > 
> > 
> > > +      */
> > > +     offset = mul_u64_u64_shr(abs((s64) l1_offset),
> > > +                              l2_multiplier,
> > > +                              kvm_tsc_scaling_ratio_frac_bits);
> > > +
> > > +     if ((s64) l1_offset < 0)
> > > +             offset = -((s64) offset);
> > > +
> > > +     offset += l2_offset;
> > > +     return offset;
> > > +}
> > > +EXPORT_SYMBOL_GPL(kvm_compute_02_tsc_offset);
> > > +
> > >  u64 kvm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc)
> > >  {
> > >       return vcpu->arch.l1_tsc_offset + kvm_scale_tsc(vcpu,
> > > host_tsc, true);







* Re: [PATCH 6/8] KVM: VMX: Make vmx_write_l1_tsc_offset() work with nested TSC scaling
  2021-05-10 16:08     ` Stamatis, Ilias
@ 2021-05-11 12:44       ` Maxim Levitsky
  2021-05-11 17:44         ` Stamatis, Ilias
  0 siblings, 1 reply; 36+ messages in thread
From: Maxim Levitsky @ 2021-05-11 12:44 UTC (permalink / raw)
  To: Stamatis, Ilias, kvm, pbonzini, ilstam
  Cc: jmattson, Woodhouse, David, vkuznets, joro, mtosatti, zamsden,
	seanjc, wanpengli

On Mon, 2021-05-10 at 16:08 +0000, Stamatis, Ilias wrote:
> On Mon, 2021-05-10 at 16:54 +0300, Maxim Levitsky wrote:
> > On Thu, 2021-05-06 at 10:32 +0000, ilstam@mailbox.org wrote:
> > > From: Ilias Stamatis <ilstam@amazon.com>
> > > 
> > > Calculating the current TSC offset is done differently when nested TSC
> > > scaling is used.
> > > 
> > > Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
> > > ---
> > >  arch/x86/kvm/vmx/vmx.c | 26 ++++++++++++++++++++------
> > >  1 file changed, 20 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > index 49241423b854..df7dc0e4c903 100644
> > > --- a/arch/x86/kvm/vmx/vmx.c
> > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > @@ -1797,10 +1797,16 @@ static void setup_msrs(struct vcpu_vmx *vmx)
> > >               vmx_update_msr_bitmap(&vmx->vcpu);
> > >  }
> > > 
> > > -static u64 vmx_write_l1_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
> > > +/*
> > > + * This function receives the requested offset for L1 as an argument but it
> > > + * actually writes the "current" tsc offset to the VMCS and returns it. The
> > > + * current offset might be different in case an L2 guest is currently running
> > > + * and its VMCS02 is loaded.
> > > + */
> > 
> > (Not related to this patch) It might be a good idea to rename this callback
> > instead of this comment, but I am not sure about it.
> > 
> 
> Yes! I was planning to do this on v2 anyway as the name of that function
> is completely misleading/wrong IMHO.
> 
> I think that even the comment inside it that explains that when L1
> doesn't trap WRMSR then L2 TSC writes overwrite L1's TSC is misplaced.
> It should go one or more levels above I believe.
> 
> This function simply
> updates the TSC offset in the current VMCS depending on whether L1 or L2
> is executing. 
> 
> > > +static u64 vmx_write_l1_tsc_offset(struct kvm_vcpu *vcpu, u64 l1_offset)
> > >  {
> > >       struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> > > -     u64 g_tsc_offset = 0;
> > > +     u64 cur_offset = l1_offset;
> > > 
> > >       /*
> > >        * We're here if L1 chose not to trap WRMSR to TSC. According
> > > @@ -1809,11 +1815,19 @@ static u64 vmx_write_l1_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
> > >        * to the newly set TSC to get L2's TSC.
> > >        */
> > >       if (is_guest_mode(vcpu) &&
> > > -         (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETTING))
> > > -             g_tsc_offset = vmcs12->tsc_offset;
> > > +         (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETTING)) {
> > > +             if (vmcs12->secondary_vm_exec_control & SECONDARY_EXEC_TSC_SCALING) {
> > > +                     cur_offset = kvm_compute_02_tsc_offset(
> > > +                                     l1_offset,
> > > +                                     vmcs12->tsc_multiplier,
> > > +                                     vmcs12->tsc_offset);
> > > +             } else {
> > > +                     cur_offset = l1_offset + vmcs12->tsc_offset;
> > > +             }
> > > +     }
> > > 
> > > -     vmcs_write64(TSC_OFFSET, offset + g_tsc_offset);
> > > -     return offset + g_tsc_offset;
> > > +     vmcs_write64(TSC_OFFSET, cur_offset);
> > > +     return cur_offset;
> > >  }
> > > 
> > >  /*
> > 
> > This code would be ideal to move to common code as SVM will do basically
> > the same thing.
> > Doesn't have to be done now though.
> > 
> 
> Hmm, how can we do the feature availability checking in common code?

We can add a vendor callback for this.

Just a few thoughts about how I think we can implement
the nested TSC scaling in (mostly) common code:


Assuming that the common code knows that:
1. Nested guest is running (already the case)

2. The format of the scaling multiplier is known
(thankfully both SVM and VMX use a fixed-point binary number:
SVM uses an 8.32 format and VMX a 16.48 format).

The common code already knows this via
kvm_max_tsc_scaling_ratio/kvm_tsc_scaling_ratio_frac_bits.

3. The value of the nested TSC scaling multiplier
is known to the common code.

(a callback to VMX/SVM code to query the TSC scaling value, 
and have it return 1 when TSC scaling is disabled should work)


Then the common code can do the whole thing, and only
let the SVM/VMX code write the actual multiplier.

As far as I know on the SVM, the TSC scaling works like that:

1. SVM has a CPUID bit to indicate that tsc scaling is supported.
(X86_FEATURE_TSCRATEMSR)

When this bit is set, the TSC scale ratio is unconditionally enabled (but
can be just 1), and it is set via a special MSR (MSR_AMD64_TSC_RATIO)
rather than a field in the VMCB (someone at AMD did cut corners...).

However, since the TSC scaling is only effective when a guest is running,
that MSR can be treated almost as if it were just another VMCB field.

The TSC scale value is a 32-bit fraction plus an 8-bit integer part
(as opposed to a 48-bit fraction and a 16-bit integer part on VMX).

I don't think that there are any other differences.
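To illustrate the two layouts side by side, a scaling ratio for a desired
guest frequency can be built as (guest_khz << frac_bits) / host_khz. This is
only a conceptual sketch of what KVM does when it sets the ratio; the real
code additionally guards against overflow and out-of-range values.

```c
#include <stdint.h>

/* Illustrative only: build a hardware TSC scaling ratio for a desired
 * guest frequency, given the fixed-point layout of the platform
 * (frac_bits = 48 on VMX, 32 on SVM). */
static uint64_t tsc_scaling_ratio(uint64_t guest_khz, uint64_t host_khz,
				  unsigned int frac_bits)
{
	return (uint64_t)(((unsigned __int128)guest_khz << frac_bits) /
			  host_khz);
}
```

The same helper works for either vendor; only frac_bits changes, which is
why a common-code implementation keyed off
kvm_tsc_scaling_ratio_frac_bits seems plausible.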

I should also note that I can do all of the above myself if 
I end up implementing the nested TSC scaling on AMD 
so I don't object much to the way that this patch series is done.

Best regards,
	Maxim Levitsky

> 
> > Best regards,
> >         Maxim Levitsky
> > 




* Re: [PATCH 8/8] KVM: selftests: x86: Add vmx_nested_tsc_scaling_test
  2021-05-11 11:16     ` Stamatis, Ilias
@ 2021-05-11 12:47       ` Maxim Levitsky
  2021-05-11 14:02         ` Stamatis, Ilias
  0 siblings, 1 reply; 36+ messages in thread
From: Maxim Levitsky @ 2021-05-11 12:47 UTC (permalink / raw)
  To: Stamatis, Ilias, kvm, pbonzini, ilstam
  Cc: jmattson, Woodhouse, David, vkuznets, joro, mtosatti, zamsden,
	seanjc, wanpengli

On Tue, 2021-05-11 at 11:16 +0000, Stamatis, Ilias wrote:
> On Mon, 2021-05-10 at 16:59 +0300, Maxim Levitsky wrote:
> > On Thu, 2021-05-06 at 10:32 +0000, ilstam@mailbox.org wrote:
> > > From: Ilias Stamatis <ilstam@amazon.com>
> > > 
> > > Test that nested TSC scaling works as expected with both L1 and L2
> > > scaled.
> > > 
> > > Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
> > > ---
> > >  tools/testing/selftests/kvm/.gitignore        |   1 +
> > >  tools/testing/selftests/kvm/Makefile          |   1 +
> > >  .../kvm/x86_64/vmx_nested_tsc_scaling_test.c  | 209 ++++++++++++++++++
> > >  3 files changed, 211 insertions(+)
> > >  create mode 100644 tools/testing/selftests/kvm/x86_64/vmx_nested_tsc_scaling_test.c
> > > 
> > > diff --git a/tools/testing/selftests/kvm/.gitignore b/tools/testing/selftests/kvm/.gitignore
> > > index bd83158e0e0b..cc02022f9951 100644
> > > --- a/tools/testing/selftests/kvm/.gitignore
> > > +++ b/tools/testing/selftests/kvm/.gitignore
> > > @@ -29,6 +29,7 @@
> > >  /x86_64/vmx_preemption_timer_test
> > >  /x86_64/vmx_set_nested_state_test
> > >  /x86_64/vmx_tsc_adjust_test
> > > +/x86_64/vmx_nested_tsc_scaling_test
> > >  /x86_64/xapic_ipi_test
> > >  /x86_64/xen_shinfo_test
> > >  /x86_64/xen_vmcall_test
> > > diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
> > > index e439d027939d..1078240b1313 100644
> > > --- a/tools/testing/selftests/kvm/Makefile
> > > +++ b/tools/testing/selftests/kvm/Makefile
> > > @@ -60,6 +60,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/vmx_close_while_nested_test
> > >  TEST_GEN_PROGS_x86_64 += x86_64/vmx_dirty_log_test
> > >  TEST_GEN_PROGS_x86_64 += x86_64/vmx_set_nested_state_test
> > >  TEST_GEN_PROGS_x86_64 += x86_64/vmx_tsc_adjust_test
> > > +TEST_GEN_PROGS_x86_64 += x86_64/vmx_nested_tsc_scaling_test
> > >  TEST_GEN_PROGS_x86_64 += x86_64/xapic_ipi_test
> > >  TEST_GEN_PROGS_x86_64 += x86_64/xss_msr_test
> > >  TEST_GEN_PROGS_x86_64 += x86_64/debug_regs
> > > diff --git a/tools/testing/selftests/kvm/x86_64/vmx_nested_tsc_scaling_test.c b/tools/testing/selftests/kvm/x86_64/vmx_nested_tsc_scaling_test.c
> > > new file mode 100644
> > > index 000000000000..b05f5151ecbe
> > > --- /dev/null
> > > +++ b/tools/testing/selftests/kvm/x86_64/vmx_nested_tsc_scaling_test.c
> > > @@ -0,0 +1,209 @@
> > > +// SPDX-License-Identifier: GPL-2.0-only
> > > +/*
> > > + * vmx_nested_tsc_scaling_test
> > > + *
> > > + * Copyright (C) 2021 Amazon.com, Inc. or its affiliates.
> > > + *
> > > + * This test case verifies that nested TSC scaling behaves as expected when
> > > + * both L1 and L2 are scaled using different ratios. For this test we scale
> > > + * L1 down and scale L2 up.
> > > + */
> > > +
> > > +
> > > +#include "kvm_util.h"
> > > +#include "vmx.h"
> > > +#include "kselftest.h"
> > > +
> > > +
> > > +#define VCPU_ID 0
> > > +
> > > +/* L1 is scaled down by this factor */
> > > +#define L1_SCALE_FACTOR 2ULL
> > > +/* L2 is scaled up (from L1's perspective) by this factor */
> > > +#define L2_SCALE_FACTOR 4ULL
> > 
> > For fun, I might have randomized these factors as well.
> 
> So L2_SCALE_FACTOR (or rather TSC_MULTIPLIER_L2 that depends on it) is
> referenced from within l1_guest_code(). If we change this to a static variable
> we won't be able to access it from there. How could this be done?
I also had this thought after I wrote the reply. I don't have much experience
yet with KVM selftests so this might indeed not be possible.

> 
> For the L1 factor it's easy as we only use it in main().
> 
> > > +
> > > +#define TSC_OFFSET_L2 (1UL << 32)
> > > +#define TSC_MULTIPLIER_L2 (L2_SCALE_FACTOR << 48)
> > 
> > It would be fun to use a negative offset here (also randomly).
> 
> Do you mean a random offset that is always negative or a random offset that
> sometimes is positive and sometimes is negative?
Yep, to test the special case for negative numbers.
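For reference, the sign handling that such a negative-offset case would
exercise can be sanity-checked with a userspace re-statement of the patch's
kvm_compute_02_tsc_offset() (a sketch assuming the 16.48 VMX layout;
unsigned wraparound stands in for the signed addition):

```c
#include <stdint.h>

static uint64_t mul_u64_u64_shr(uint64_t a, uint64_t b, unsigned int shift)
{
	return (uint64_t)(((unsigned __int128)a * b) >> shift);
}

#define FRAC_BITS 48

/* Mirrors the logic of the patch's kvm_compute_02_tsc_offset(): the L1
 * offset is interpreted as signed, so its magnitude is scaled by the L2
 * multiplier and the sign restored before adding the L2 offset. */
static uint64_t compute_02_tsc_offset(uint64_t l1_offset, uint64_t l2_mult,
				      uint64_t l2_offset)
{
	uint64_t mag = (int64_t)l1_offset < 0 ? 0ULL - l1_offset : l1_offset;
	uint64_t scaled = mul_u64_u64_shr(mag, l2_mult, FRAC_BITS);

	if ((int64_t)l1_offset < 0)
		scaled = 0ULL - scaled;

	return scaled + l2_offset;
}
```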
> 
> > > +
> > > +#define L2_GUEST_STACK_SIZE 64
> > > +
> > > +enum { USLEEP, UCHECK_L1, UCHECK_L2 };
> > > +#define GUEST_SLEEP(sec)         ucall(UCALL_SYNC, 2, USLEEP, sec)
> > > +#define GUEST_CHECK(level, freq) ucall(UCALL_SYNC, 2, level, freq)
> > > +
> > > +
> > > +/*
> > > + * This function checks whether the "actual" TSC frequency of a guest matches
> > > + * its expected frequency. In order to account for delays in taking the TSC
> > > + * measurements, a difference of 1% between the actual and the expected value
> > > + * is tolerated.
> > > + */
> > > +static void compare_tsc_freq(uint64_t actual, uint64_t expected)
> > > +{
> > > +     uint64_t tolerance, thresh_low, thresh_high;
> > > +
> > > +     tolerance = expected / 100;
> > > +     thresh_low = expected - tolerance;
> > > +     thresh_high = expected + tolerance;
> > > +
> > > +     TEST_ASSERT(thresh_low < actual,
> > > +             "TSC freq is expected to be between %"PRIu64" and %"PRIu64
> > > +             " but it actually is %"PRIu64,
> > > +             thresh_low, thresh_high, actual);
> > > +     TEST_ASSERT(thresh_high > actual,
> > > +             "TSC freq is expected to be between %"PRIu64" and %"PRIu64
> > > +             " but it actually is %"PRIu64,
> > > +             thresh_low, thresh_high, actual);
> > > +}
> > > +
> > > +static void check_tsc_freq(int level)
> > > +{
> > > +     uint64_t tsc_start, tsc_end, tsc_freq;
> > > +
> > > +     /*
> > > +      * Reading the TSC twice with about a second's difference should give
> > > +      * us an approximation of the TSC frequency from the guest's
> > > +      * perspective. Now, this won't be completely accurate, but it should
> > > +      * be good enough for the purposes of this test.
> > > +      */
> > 
> > It would be nice to know if the host has stable TSC (you can obtain this via
> > KVM_GET_CLOCK, the KVM_CLOCK_TSC_STABLE flag).
> > 
> > And if not stable skip the test, to avoid false positives.
> > (Yes I have a laptop I just bought that has an unstable TSC....)
> > 
> 
> Hmm, this is a vm ioctl but I noticed that one of its vcpus needs to have been
> run at least once otherwise it won't return KVM_CLOCK_TSC_STABLE in the flags.
> 
> So...

Yes, now I remember that this thing relies on the TSC sync logic,
master clock thing, etc... Oh well...

To be honest we really need the kernel to export the information
it knows about the TSC because it is useful to many users and
not limited to virtualization.

Currently, other than KVM's KVM_GET_TSC_KHZ, there is no clean way
to know even the TSC frequency, let alone whether the kernel considers
the TSC to be stable, AFAIK.

Another more or less reliable (but hacky) way to know if the TSC is stable is
to check whether the kernel is using the TSC as its clocksource via
(/sys/devices/system/clocksource/clocksource0/current_clocksource = tsc)

Oh well...

Best regards,
	Maxim Levitsky

> 
> > > +     tsc_start = rdmsr(MSR_IA32_TSC);
> > > +     GUEST_SLEEP(1);
> > > +     tsc_end = rdmsr(MSR_IA32_TSC);
> > > +
> > > +     tsc_freq = tsc_end - tsc_start;
> > > +
> > > +     GUEST_CHECK(level, tsc_freq);
> > > +}
> > > +
> > > +static void l2_guest_code(void)
> > > +{
> > > +     check_tsc_freq(UCHECK_L2);
> > > +
> > > +     /* exit to L1 */
> > > +     __asm__ __volatile__("vmcall");
> > > +}
> > > +
> > > +static void l1_guest_code(struct vmx_pages *vmx_pages)
> > > +{
> > > +     unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE];
> > > +     uint32_t control;
> > > +
> > > +     /* check that L1's frequency looks alright before launching L2 */
> > > +     check_tsc_freq(UCHECK_L1);
> > > +
> > > +     GUEST_ASSERT(prepare_for_vmx_operation(vmx_pages));
> > > +     GUEST_ASSERT(load_vmcs(vmx_pages));
> > > +
> > > +     /* prepare the VMCS for L2 execution */
> > > +     prepare_vmcs(vmx_pages, l2_guest_code, &l2_guest_stack[L2_GUEST_STACK_SIZE]);
> > > +
> > > +     /* enable TSC offsetting and TSC scaling for L2 */
> > > +     control = vmreadz(CPU_BASED_VM_EXEC_CONTROL);
> > > +     control |= CPU_BASED_USE_MSR_BITMAPS | CPU_BASED_USE_TSC_OFFSETTING;
> > > +     vmwrite(CPU_BASED_VM_EXEC_CONTROL, control);
> > > +
> > > +     control = vmreadz(SECONDARY_VM_EXEC_CONTROL);
> > > +     control |= SECONDARY_EXEC_TSC_SCALING;
> > > +     vmwrite(SECONDARY_VM_EXEC_CONTROL, control);
> > > +
> > > +     vmwrite(TSC_OFFSET, TSC_OFFSET_L2);
> > > +     vmwrite(TSC_MULTIPLIER, TSC_MULTIPLIER_L2);
> > > +     vmwrite(TSC_MULTIPLIER_HIGH, TSC_MULTIPLIER_L2 >> 32);
> > > +
> > > +     /* launch L2 */
> > > +     GUEST_ASSERT(!vmlaunch());
> > > +     GUEST_ASSERT(vmreadz(VM_EXIT_REASON) == EXIT_REASON_VMCALL);
> > > +
> > > +     /* check that L1's frequency still looks good */
> > > +     check_tsc_freq(UCHECK_L1);
> > > +
> > > +     GUEST_DONE();
> > > +}
> > > +
> > > +static void tsc_scaling_check_supported(void)
> > > +{
> > > +     if (!kvm_check_cap(KVM_CAP_TSC_CONTROL)) {
> > > +             print_skip("TSC scaling not supported by the HW");
> > > +             exit(KSFT_SKIP);
> > > +     }
> > > +}
> > > +
> > > +int main(int argc, char *argv[])
> > > +{
> > > +     struct kvm_vm *vm;
> > > +     vm_vaddr_t vmx_pages_gva;
> > > +
> > > +     uint64_t tsc_start, tsc_end;
> > > +     uint64_t tsc_khz;
> > > +     uint64_t l0_tsc_freq = 0;
> > > +     uint64_t l1_tsc_freq = 0;
> > > +     uint64_t l2_tsc_freq = 0;
> > > +
> > > +     nested_vmx_check_supported();
> > > +     tsc_scaling_check_supported();
> 
> I can't add the check here
> 
> > > +
> > > +     tsc_start = rdtsc();
> > > +     sleep(1);
> > > +     tsc_end = rdtsc();
> > > +
> > > +     l0_tsc_freq = tsc_end - tsc_start;
> > > +     printf("real TSC frequency is around: %"PRIu64"\n", l0_tsc_freq);
> > > +
> > > +     vm = vm_create_default(VCPU_ID, 0, (void *) l1_guest_code);
> > > +     vcpu_alloc_vmx(vm, &vmx_pages_gva);
> > > +     vcpu_args_set(vm, VCPU_ID, 1, vmx_pages_gva);
> 
> nor here
> 
> > > +
> > > +     tsc_khz = _vcpu_ioctl(vm, VCPU_ID, KVM_GET_TSC_KHZ, NULL);
> > > +     TEST_ASSERT(tsc_khz != -1, "vcpu ioctl KVM_GET_TSC_KHZ failed");
> > > +
> > > +     /* scale down L1's TSC frequency */
> > > +     vcpu_ioctl(vm, VCPU_ID, KVM_SET_TSC_KHZ,
> > > +               (void *) (tsc_khz / L1_SCALE_FACTOR));
> > > +
> > > +     for (;;) {
> > > +             volatile struct kvm_run *run = vcpu_state(vm, VCPU_ID);
> > > +             struct ucall uc;
> > > +
> > > +             vcpu_run(vm, VCPU_ID);
> > > +             TEST_ASSERT(run->exit_reason == KVM_EXIT_IO,
> > > +                         "Got exit_reason other than KVM_EXIT_IO: %u (%s)\n",
> > > +                         run->exit_reason,
> > > +                         exit_reason_str(run->exit_reason));
> 
> should I add it here?
> 
> > > +
> > > +             switch (get_ucall(vm, VCPU_ID, &uc)) {
> > > +             case UCALL_ABORT:
> > > +                     TEST_FAIL("%s", (const char *) uc.args[0]);
> > > +             case UCALL_SYNC:
> > > +                     switch (uc.args[0]) {
> > > +                     case USLEEP:
> > > +                             sleep(uc.args[1]);
> > > +                             break;
> > > +                     case UCHECK_L1:
> > > +                             l1_tsc_freq = uc.args[1];
> > > +                             printf("L1's TSC frequency is around: %"PRIu64
> > > +                                    "\n", l1_tsc_freq);
> > > +
> > > +                             compare_tsc_freq(l1_tsc_freq,
> > > +                                              l0_tsc_freq / L1_SCALE_FACTOR);
> > > +                             break;
> > > +                     case UCHECK_L2:
> > > +                             l2_tsc_freq = uc.args[1];
> > > +                             printf("L2's TSC frequency is around: %"PRIu64
> > > +                                    "\n", l2_tsc_freq);
> > > +
> > > +                             compare_tsc_freq(l2_tsc_freq,
> > > +                                              l1_tsc_freq * L2_SCALE_FACTOR);
> > > +                             break;
> > > +                     }
> > > +                     break;
> > > +             case UCALL_DONE:
> > > +                     goto done;
> > > +             default:
> > > +                     TEST_FAIL("Unknown ucall %lu", uc.cmd);
> > > +             }
> > > +     }
> > > +
> > > +done:
> > > +     kvm_vm_free(vm);
> > > +     return 0;
> > > +}
> > 
> > Overall looks OK to me.
> > 
> > I can't test it, since the most recent Intel laptop I have (i7-7600U)
> > still lacks TSC scaling (or did Intel cripple this feature on clients like what
> > they did with APICv ?)
> > 
> > Best regards,
> >         Maxim Levitsky
> > 
> > 
> > 




* Re: [PATCH 8/8] KVM: selftests: x86: Add vmx_nested_tsc_scaling_test
  2021-05-11 12:47       ` Maxim Levitsky
@ 2021-05-11 14:02         ` Stamatis, Ilias
  0 siblings, 0 replies; 36+ messages in thread
From: Stamatis, Ilias @ 2021-05-11 14:02 UTC (permalink / raw)
  To: kvm, pbonzini, mlevitsk, ilstam
  Cc: jmattson, Woodhouse, David, vkuznets, joro, mtosatti, zamsden,
	seanjc, wanpengli

On Tue, 2021-05-11 at 15:47 +0300, Maxim Levitsky wrote:
> On Tue, 2021-05-11 at 11:16 +0000, Stamatis, Ilias wrote:
> > On Mon, 2021-05-10 at 16:59 +0300, Maxim Levitsky wrote:
> > > On Thu, 2021-05-06 at 10:32 +0000, ilstam@mailbox.org wrote:
> > > > From: Ilias Stamatis <ilstam@amazon.com>
> > > > 
> > > > Test that nested TSC scaling works as expected with both L1 and L2
> > > > scaled.
> > > > 
> > > > Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
> > > > ---
> > > >  tools/testing/selftests/kvm/.gitignore        |   1 +
> > > >  tools/testing/selftests/kvm/Makefile          |   1 +
> > > >  .../kvm/x86_64/vmx_nested_tsc_scaling_test.c  | 209 ++++++++++++++++++
> > > >  3 files changed, 211 insertions(+)
> > > >  create mode 100644 tools/testing/selftests/kvm/x86_64/vmx_nested_tsc_scaling_test.c
> > > > 
> > > > diff --git a/tools/testing/selftests/kvm/.gitignore b/tools/testing/selftests/kvm/.gitignore
> > > > index bd83158e0e0b..cc02022f9951 100644
> > > > --- a/tools/testing/selftests/kvm/.gitignore
> > > > +++ b/tools/testing/selftests/kvm/.gitignore
> > > > @@ -29,6 +29,7 @@
> > > >  /x86_64/vmx_preemption_timer_test
> > > >  /x86_64/vmx_set_nested_state_test
> > > >  /x86_64/vmx_tsc_adjust_test
> > > > +/x86_64/vmx_nested_tsc_scaling_test
> > > >  /x86_64/xapic_ipi_test
> > > >  /x86_64/xen_shinfo_test
> > > >  /x86_64/xen_vmcall_test
> > > > diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
> > > > index e439d027939d..1078240b1313 100644
> > > > --- a/tools/testing/selftests/kvm/Makefile
> > > > +++ b/tools/testing/selftests/kvm/Makefile
> > > > @@ -60,6 +60,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/vmx_close_while_nested_test
> > > >  TEST_GEN_PROGS_x86_64 += x86_64/vmx_dirty_log_test
> > > >  TEST_GEN_PROGS_x86_64 += x86_64/vmx_set_nested_state_test
> > > >  TEST_GEN_PROGS_x86_64 += x86_64/vmx_tsc_adjust_test
> > > > +TEST_GEN_PROGS_x86_64 += x86_64/vmx_nested_tsc_scaling_test
> > > >  TEST_GEN_PROGS_x86_64 += x86_64/xapic_ipi_test
> > > >  TEST_GEN_PROGS_x86_64 += x86_64/xss_msr_test
> > > >  TEST_GEN_PROGS_x86_64 += x86_64/debug_regs
> > > > diff --git a/tools/testing/selftests/kvm/x86_64/vmx_nested_tsc_scaling_test.c b/tools/testing/selftests/kvm/x86_64/vmx_nested_tsc_scaling_test.c
> > > > new file mode 100644
> > > > index 000000000000..b05f5151ecbe
> > > > --- /dev/null
> > > > +++ b/tools/testing/selftests/kvm/x86_64/vmx_nested_tsc_scaling_test.c
> > > > @@ -0,0 +1,209 @@
> > > > +// SPDX-License-Identifier: GPL-2.0-only
> > > > +/*
> > > > + * vmx_nested_tsc_scaling_test
> > > > + *
> > > > + * Copyright (C) 2021 Amazon.com, Inc. or its affiliates.
> > > > + *
> > > > + * This test case verifies that nested TSC scaling behaves as expected when
> > > > + * both L1 and L2 are scaled using different ratios. For this test we scale
> > > > + * L1 down and scale L2 up.
> > > > + */
> > > > +
> > > > +
> > > > +#include "kvm_util.h"
> > > > +#include "vmx.h"
> > > > +#include "kselftest.h"
> > > > +
> > > > +
> > > > +#define VCPU_ID 0
> > > > +
> > > > +/* L1 is scaled down by this factor */
> > > > +#define L1_SCALE_FACTOR 2ULL
> > > > +/* L2 is scaled up (from L1's perspective) by this factor */
> > > > +#define L2_SCALE_FACTOR 4ULL
> > > 
> > > For fun, I might have randomized these factors as well.
> > 
> > So L2_SCALE_FACTOR (or rather TSC_MULTIPLIER_L2 that depends on it) is
> > referenced from within l1_guest_code(). If we change this to a static variable
> > we won't be able to access it from there. How could this be done?
> 
> I also had this thought after I wrote the reply. I don't have much experience
> yet with KVM selftests, so this might indeed not be possible.
> 

OK, I can make the L1 scale factor random, but the L2 fixed for now.

Does a range of 2 to 10 sound reasonable for L1?
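Concretely, I was thinking of something along these lines for picking the
factor in main() — note that the [2, 10] range and the srandom() seeding are
just placeholder choices on my part, not something already in the posted test:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/*
 * Pick a random L1 scale factor in [2, 10]. Seeded once from the clock;
 * the range is an arbitrary choice pending feedback.
 */
static uint64_t pick_l1_scale_factor(void)
{
	static int seeded;

	if (!seeded) {
		srandom(time(NULL));
		seeded = 1;
	}
	return 2 + random() % 9;
}
```

The factor would then replace the L1_SCALE_FACTOR constant in the
KVM_SET_TSC_KHZ call and the frequency comparisons.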

> > 
> > For the L1 factor it's easy as we only use it in main().
> > 
> > > > +
> > > > +#define TSC_OFFSET_L2 (1UL << 32)
> > > > +#define TSC_MULTIPLIER_L2 (L2_SCALE_FACTOR << 48)
> > > 
> > > It would be fun to use a negative offset here (also randomly).
> > 
> > Do you mean a random offset that is always negative or a random offset that
> > sometimes is positive and sometimes is negative?
> 
> Yep, to test the special case for negative numbers.

OK, I will use a negative offset then but it won't be random for the same
reason as above.

> > 
> > > > +
> > > > +#define L2_GUEST_STACK_SIZE 64
> > > > +
> > > > +enum { USLEEP, UCHECK_L1, UCHECK_L2 };
> > > > +#define GUEST_SLEEP(sec)         ucall(UCALL_SYNC, 2, USLEEP, sec)
> > > > +#define GUEST_CHECK(level, freq) ucall(UCALL_SYNC, 2, level, freq)
> > > > +
> > > > +
> > > > +/*
> > > > + * This function checks whether the "actual" TSC frequency of a guest matches
> > > > + * its expected frequency. In order to account for delays in taking the TSC
> > > > + * measurements, a difference of 1% between the actual and the expected value
> > > > + * is tolerated.
> > > > + */
> > > > +static void compare_tsc_freq(uint64_t actual, uint64_t expected)
> > > > +{
> > > > +     uint64_t tolerance, thresh_low, thresh_high;
> > > > +
> > > > +     tolerance = expected / 100;
> > > > +     thresh_low = expected - tolerance;
> > > > +     thresh_high = expected + tolerance;
> > > > +
> > > > +     TEST_ASSERT(thresh_low < actual,
> > > > +             "TSC freq is expected to be between %"PRIu64" and %"PRIu64
> > > > +             " but it actually is %"PRIu64,
> > > > +             thresh_low, thresh_high, actual);
> > > > +     TEST_ASSERT(thresh_high > actual,
> > > > +             "TSC freq is expected to be between %"PRIu64" and %"PRIu64
> > > > +             " but it actually is %"PRIu64,
> > > > +             thresh_low, thresh_high, actual);
> > > > +}
> > > > +
> > > > +static void check_tsc_freq(int level)
> > > > +{
> > > > +     uint64_t tsc_start, tsc_end, tsc_freq;
> > > > +
> > > > +     /*
> > > > +      * Reading the TSC twice with about a second's difference should give
> > > > +      * us an approximation of the TSC frequency from the guest's
> > > > +      * perspective. Now, this won't be completely accurate, but it should
> > > > +      * be good enough for the purposes of this test.
> > > > +      */
> > > 
> > > It would be nice to know if the host has stable TSC (you can obtain this via
> > > KVM_GET_CLOCK, the KVM_CLOCK_TSC_STABLE flag).
> > > 
> > > And if not stable skip the test, to avoid false positives.
> > > (Yes I have a laptop I just bought that has an unstable TSC....)
> > > 
> > 
> > Hmm, this is a vm ioctl but I noticed that one of its vcpus needs to have been
> > run at least once otherwise it won't return KVM_CLOCK_TSC_STABLE in the flags.
> > 
> > So...
> 
> Yes, now I remember that this thing relies on the TSC sync logic,
> master clock thing, etc... Oh well...
> 
> To be honest we really need the kernel to export the information
> it knows about the TSC because it is useful to many users and
> not limited to virtualization.
> 
> Currently, other than KVM's KVM_GET_TSC_KHZ, there is no clean way
> to know even the TSC frequency, let alone whether the kernel considers
> the TSC to be stable, AFAIK.
> 
> Another more or less reliable (but hacky) way to know if the TSC is stable is to see
> if the kernel is using tsc via
> (/sys/devices/system/clocksource/clocksource0/current_clocksource = tsc)
> 
> Oh well...

So do you suggest checking the content of this file over doing a KVM_GET_CLOCK
ioctl after the vcpu has run once?
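For what it's worth, the sysfs-based check could look roughly like the sketch
below. The skip-on-any-error policy (treating a read failure as "not stable")
is my assumption; the path is the one you mentioned:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return 1 if the given clocksource name indicates a stable TSC. */
static int clocksource_is_tsc(const char *name)
{
	return strncmp(name, "tsc", 3) == 0;
}

/*
 * Read the current clocksource from sysfs and return 1 only if the
 * kernel itself trusts the TSC. Any error returns 0, so the caller
 * would skip the test rather than risk a false positive.
 */
static int host_tsc_looks_stable(void)
{
	char buf[64] = "";
	FILE *f = fopen("/sys/devices/system/clocksource/"
			"clocksource0/current_clocksource", "r");

	if (!f)
		return 0;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return 0;
	}
	fclose(f);
	return clocksource_is_tsc(buf);
}
```

This avoids the chicken-and-egg problem of KVM_GET_CLOCK needing a vcpu to
have run first.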

> 
> Best regards,
>         Maxim Levitsky
> 
> > 
> > > > +     tsc_start = rdmsr(MSR_IA32_TSC);
> > > > +     GUEST_SLEEP(1);
> > > > +     tsc_end = rdmsr(MSR_IA32_TSC);
> > > > +
> > > > +     tsc_freq = tsc_end - tsc_start;
> > > > +
> > > > +     GUEST_CHECK(level, tsc_freq);
> > > > +}
> > > > +
> > > > +static void l2_guest_code(void)
> > > > +{
> > > > +     check_tsc_freq(UCHECK_L2);
> > > > +
> > > > +     /* exit to L1 */
> > > > +     __asm__ __volatile__("vmcall");
> > > > +}
> > > > +
> > > > +static void l1_guest_code(struct vmx_pages *vmx_pages)
> > > > +{
> > > > +     unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE];
> > > > +     uint32_t control;
> > > > +
> > > > +     /* check that L1's frequency looks alright before launching L2 */
> > > > +     check_tsc_freq(UCHECK_L1);
> > > > +
> > > > +     GUEST_ASSERT(prepare_for_vmx_operation(vmx_pages));
> > > > +     GUEST_ASSERT(load_vmcs(vmx_pages));
> > > > +
> > > > +     /* prepare the VMCS for L2 execution */
> > > > +     prepare_vmcs(vmx_pages, l2_guest_code, &l2_guest_stack[L2_GUEST_STACK_SIZE]);
> > > > +
> > > > +     /* enable TSC offsetting and TSC scaling for L2 */
> > > > +     control = vmreadz(CPU_BASED_VM_EXEC_CONTROL);
> > > > +     control |= CPU_BASED_USE_MSR_BITMAPS | CPU_BASED_USE_TSC_OFFSETTING;
> > > > +     vmwrite(CPU_BASED_VM_EXEC_CONTROL, control);
> > > > +
> > > > +     control = vmreadz(SECONDARY_VM_EXEC_CONTROL);
> > > > +     control |= SECONDARY_EXEC_TSC_SCALING;
> > > > +     vmwrite(SECONDARY_VM_EXEC_CONTROL, control);
> > > > +
> > > > +     vmwrite(TSC_OFFSET, TSC_OFFSET_L2);
> > > > +     vmwrite(TSC_MULTIPLIER, TSC_MULTIPLIER_L2);
> > > > +     vmwrite(TSC_MULTIPLIER_HIGH, TSC_MULTIPLIER_L2 >> 32);
> > > > +
> > > > +     /* launch L2 */
> > > > +     GUEST_ASSERT(!vmlaunch());
> > > > +     GUEST_ASSERT(vmreadz(VM_EXIT_REASON) == EXIT_REASON_VMCALL);
> > > > +
> > > > +     /* check that L1's frequency still looks good */
> > > > +     check_tsc_freq(UCHECK_L1);
> > > > +
> > > > +     GUEST_DONE();
> > > > +}
> > > > +
> > > > +static void tsc_scaling_check_supported(void)
> > > > +{
> > > > +     if (!kvm_check_cap(KVM_CAP_TSC_CONTROL)) {
> > > > +             print_skip("TSC scaling not supported by the HW");
> > > > +             exit(KSFT_SKIP);
> > > > +     }
> > > > +}
> > > > +
> > > > +int main(int argc, char *argv[])
> > > > +{
> > > > +     struct kvm_vm *vm;
> > > > +     vm_vaddr_t vmx_pages_gva;
> > > > +
> > > > +     uint64_t tsc_start, tsc_end;
> > > > +     uint64_t tsc_khz;
> > > > +     uint64_t l0_tsc_freq = 0;
> > > > +     uint64_t l1_tsc_freq = 0;
> > > > +     uint64_t l2_tsc_freq = 0;
> > > > +
> > > > +     nested_vmx_check_supported();
> > > > +     tsc_scaling_check_supported();
> > 
> > I can't add the check here
> > 
> > > > +
> > > > +     tsc_start = rdtsc();
> > > > +     sleep(1);
> > > > +     tsc_end = rdtsc();
> > > > +
> > > > +     l0_tsc_freq = tsc_end - tsc_start;
> > > > +     printf("real TSC frequency is around: %"PRIu64"\n", l0_tsc_freq);
> > > > +
> > > > +     vm = vm_create_default(VCPU_ID, 0, (void *) l1_guest_code);
> > > > +     vcpu_alloc_vmx(vm, &vmx_pages_gva);
> > > > +     vcpu_args_set(vm, VCPU_ID, 1, vmx_pages_gva);
> > 
> > nor here
> > 
> > > > +
> > > > +     tsc_khz = _vcpu_ioctl(vm, VCPU_ID, KVM_GET_TSC_KHZ, NULL);
> > > > +     TEST_ASSERT(tsc_khz != -1, "vcpu ioctl KVM_GET_TSC_KHZ failed");
> > > > +
> > > > +     /* scale down L1's TSC frequency */
> > > > +     vcpu_ioctl(vm, VCPU_ID, KVM_SET_TSC_KHZ,
> > > > +               (void *) (tsc_khz / L1_SCALE_FACTOR));
> > > > +
> > > > +     for (;;) {
> > > > +             volatile struct kvm_run *run = vcpu_state(vm, VCPU_ID);
> > > > +             struct ucall uc;
> > > > +
> > > > +             vcpu_run(vm, VCPU_ID);
> > > > +             TEST_ASSERT(run->exit_reason == KVM_EXIT_IO,
> > > > +                         "Got exit_reason other than KVM_EXIT_IO: %u (%s)\n",
> > > > +                         run->exit_reason,
> > > > +                         exit_reason_str(run->exit_reason));
> > 
> > should I add it here?
> > 
> > > > +
> > > > +             switch (get_ucall(vm, VCPU_ID, &uc)) {
> > > > +             case UCALL_ABORT:
> > > > +                     TEST_FAIL("%s", (const char *) uc.args[0]);
> > > > +             case UCALL_SYNC:
> > > > +                     switch (uc.args[0]) {
> > > > +                     case USLEEP:
> > > > +                             sleep(uc.args[1]);
> > > > +                             break;
> > > > +                     case UCHECK_L1:
> > > > +                             l1_tsc_freq = uc.args[1];
> > > > +                             printf("L1's TSC frequency is around: %"PRIu64
> > > > +                                    "\n", l1_tsc_freq);
> > > > +
> > > > +                             compare_tsc_freq(l1_tsc_freq,
> > > > +                                              l0_tsc_freq / L1_SCALE_FACTOR);
> > > > +                             break;
> > > > +                     case UCHECK_L2:
> > > > +                             l2_tsc_freq = uc.args[1];
> > > > +                             printf("L2's TSC frequency is around: %"PRIu64
> > > > +                                    "\n", l2_tsc_freq);
> > > > +
> > > > +                             compare_tsc_freq(l2_tsc_freq,
> > > > +                                              l1_tsc_freq * L2_SCALE_FACTOR);
> > > > +                             break;
> > > > +                     }
> > > > +                     break;
> > > > +             case UCALL_DONE:
> > > > +                     goto done;
> > > > +             default:
> > > > +                     TEST_FAIL("Unknown ucall %lu", uc.cmd);
> > > > +             }
> > > > +     }
> > > > +
> > > > +done:
> > > > +     kvm_vm_free(vm);
> > > > +     return 0;
> > > > +}
> > > 
> > > Overall looks OK to me.
> > > 
> > > I can't test it, since the most recent Intel laptop I have (i7-7600U)
> > > still lacks TSC scaling (or did Intel cripple this feature on client
> > > parts like they did with APICv?)
> > > 
> > > Best regards,
> > >         Maxim Levitsky
> > > 
> > > 
> > > 
> 
> 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/8] KVM: VMX: Adjust the TSC-related VMCS fields on L2 entry and exit
  2021-05-11 12:38       ` Maxim Levitsky
@ 2021-05-11 15:11         ` Stamatis, Ilias
  0 siblings, 0 replies; 36+ messages in thread
From: Stamatis, Ilias @ 2021-05-11 15:11 UTC (permalink / raw)
  To: kvm, pbonzini, mlevitsk, ilstam
  Cc: jmattson, Woodhouse, David, vkuznets, joro, mtosatti, zamsden,
	seanjc, wanpengli

On Tue, 2021-05-11 at 15:38 +0300, Maxim Levitsky wrote:
> On Mon, 2021-05-10 at 14:44 +0000, Stamatis, Ilias wrote:
> > On Mon, 2021-05-10 at 16:53 +0300, Maxim Levitsky wrote:
> > > On Thu, 2021-05-06 at 10:32 +0000, ilstam@mailbox.org wrote:
> > > > From: Ilias Stamatis <ilstam@amazon.com>
> > > > 
> > > > When L2 is entered we need to "merge" the TSC multiplier and TSC offset
> > > > values of VMCS01 and VMCS12 and store the result into the current VMCS02.
> > > > 
> > > > The 02 values are calculated using the following equations:
> > > >   offset_02 = ((offset_01 * mult_12) >> 48) + offset_12
> > > >   mult_02 = (mult_01 * mult_12) >> 48
> > > 
> > > I would mention that 48 is kvm_tsc_scaling_ratio_frac_bits instead.
> > > Also maybe add the common code in a separate patch?
> > 
> > Right, will do.
> > 
> > > > Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
> > > > ---
> > > >  arch/x86/include/asm/kvm_host.h |  1 +
> > > >  arch/x86/kvm/vmx/nested.c       | 26 ++++++++++++++++++++++----
> > > >  arch/x86/kvm/x86.c              | 25 +++++++++++++++++++++++++
> > > >  3 files changed, 48 insertions(+), 4 deletions(-)
> > > > 
> > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > > index cdddbf0b1177..e7a1eb36f95a 100644
> > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > @@ -1780,6 +1780,7 @@ void kvm_define_user_return_msr(unsigned index, u32 msr);
> > > >  int kvm_set_user_return_msr(unsigned index, u64 val, u64 mask);
> > > > 
> > > >  u64 kvm_scale_tsc(struct kvm_vcpu *vcpu, u64 tsc, bool l1);
> > > > +u64 kvm_compute_02_tsc_offset(u64 l1_offset, u64 l2_multiplier, u64 l2_offset);
> > > >  u64 kvm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc);
> > > > 
> > > >  unsigned long kvm_get_linear_rip(struct kvm_vcpu *vcpu);
> > > > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > > > index bced76637823..a1bf28f33837 100644
> > > > --- a/arch/x86/kvm/vmx/nested.c
> > > > +++ b/arch/x86/kvm/vmx/nested.c
> > > > @@ -3353,8 +3353,22 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
> > > >       }
> > > > 
> > > >       enter_guest_mode(vcpu);
> > > > -     if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETTING)
> > > > -             vcpu->arch.tsc_offset += vmcs12->tsc_offset;
> > > > +
> > > > +     if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETTING) {
> > > > +             if (vmcs12->secondary_vm_exec_control & SECONDARY_EXEC_TSC_SCALING) {
> > > > +                     vcpu->arch.tsc_offset = kvm_compute_02_tsc_offset(
> > > > +                                     vcpu->arch.l1_tsc_offset,
> > > > +                                     vmcs12->tsc_multiplier,
> > > > +                                     vmcs12->tsc_offset);
> > > > +
> > > > +                     vcpu->arch.tsc_scaling_ratio = mul_u64_u64_shr(
> > > > +                                     vcpu->arch.tsc_scaling_ratio,
> > > > +                                     vmcs12->tsc_multiplier,
> > > > +                                     kvm_tsc_scaling_ratio_frac_bits);
> > > > +             } else {
> > > > +                     vcpu->arch.tsc_offset += vmcs12->tsc_offset;
> > > > +             }
> > > > +     }
> > > > 
> > > >       if (prepare_vmcs02(vcpu, vmcs12, &entry_failure_code)) {
> > > >               exit_reason.basic = EXIT_REASON_INVALID_STATE;
> > > > @@ -4454,8 +4468,12 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
> > > >       if (nested_cpu_has_preemption_timer(vmcs12))
> > > >               hrtimer_cancel(&to_vmx(vcpu)->nested.preemption_timer);
> > > > 
> > > > -     if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETTING)
> > > > -             vcpu->arch.tsc_offset -= vmcs12->tsc_offset;
> > > > +     if (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETTING) {
> > > > +             vcpu->arch.tsc_offset = vcpu->arch.l1_tsc_offset;
> > > > +
> > > > +             if (vmcs12->secondary_vm_exec_control & SECONDARY_EXEC_TSC_SCALING)
> > > > +                     vcpu->arch.tsc_scaling_ratio = vcpu->arch.l1_tsc_scaling_ratio;
> > > > +     }
> > > > 
> > > >       if (likely(!vmx->fail)) {
> > > >               sync_vmcs02_to_vmcs12(vcpu, vmcs12);
> > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > index 26a4c0f46f15..87deb119c521 100644
> > > > --- a/arch/x86/kvm/x86.c
> > > > +++ b/arch/x86/kvm/x86.c
> > > > @@ -2266,6 +2266,31 @@ static u64 kvm_compute_tsc_offset(struct kvm_vcpu *vcpu, u64 target_tsc)
> > > >       return target_tsc - tsc;
> > > >  }
> > > > 
> > > > +/*
> > > > + * This function computes the TSC offset that is stored in VMCS02 when entering
> > > > + * L2 by combining the offset and multiplier values of VMCS01 and VMCS12.
> > > > + */
> > > > +u64 kvm_compute_02_tsc_offset(u64 l1_offset, u64 l2_multiplier, u64 l2_offset)
> > > > +{
> > > > +     u64 offset;
> > > > +
> > > > +     /*
> > > > +      * The L1 offset is interpreted as a signed number by the CPU and can
> > > > +      * be negative. So we extract the sign before the multiplication and
> > > > +      * put it back afterwards if needed.
> > > 
> > > If I understand correctly the reason for sign extraction is that we don't
> > > have mul_s64_u64_shr. Maybe we should add it?
> > > 
> > 
> > Initially I added a mul_s64_s64_shr and used that.
> > 
> > You can take a look
> > here:
> > 
> > https://github.com/ilstam/linux/commit/7bc0b75bb7b8d9bcecf41e2e1e1936880d3b93d1
> > 
> > 
> > I added mul_s64_s64_shr instead of mul_s64_u64_shr thinking that a)
> > perhaps it would be more useful as a general-purpose function and b)
> > even though the multiplier is unsigned, in reality it's capped so it
> > should be safe to cast it to signed.
> 
> Actually, is it? Intel's PRM I think defines the TSC multiplier to be a full
> 64-bit value. It is shifted right after the multiplication, making the
> multiplier effectively a fixed-point 16.48 number, but since the
> multiplication is done first, it still has to use the full 64-bit value.

Oh, my bad. I remember seeing some max limit in the code, but that was for the
TSC frequency requested by userspace, not for the actual multiplier.

> > 
> > But then I realized that
> > mul_u64_u64_shr that we already have is enough and I can just simply re-
> > use that.
> > 
> > How would a mul_s64_u64_shr be implemented?
> 
> I'll say even if it is implemented by just extracting the sign as you did
> and then calling mul_u64_u64_shr,
> it could still be a bit better documentation wise.

That makes sense, I will do exactly this then.
And I'll remove kvm_compute_02_tsc_offset completely.
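Something like this is what I have in mind for the helper — extract the sign,
reuse the existing unsigned helper, then restore the sign. Here
mul_u64_u64_shr is emulated with __int128 just to keep the sketch
self-contained; the real one lives in include/linux/math64.h:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;
typedef int64_t s64;

/* Stand-in for the kernel's mul_u64_u64_shr(), using 128-bit arithmetic. */
static u64 mul_u64_u64_shr(u64 a, u64 mul, unsigned int shift)
{
	return (u64)(((unsigned __int128)a * mul) >> shift);
}

/*
 * Multiply a signed 64-bit value by an unsigned 64-bit multiplier and
 * shift the 128-bit product right: do the multiplication on |a| and
 * put the sign back afterwards.
 */
static s64 mul_s64_u64_shr(s64 a, u64 mul, unsigned int shift)
{
	u64 ret;

	ret = mul_u64_u64_shr(a < 0 ? -a : a, mul, shift);
	return a < 0 ? -(s64)ret : (s64)ret;
}
```

With this in place, the 02 offset computation becomes a one-liner:
mul_s64_u64_shr(l1_offset, l2_multiplier, frac_bits) + l2_offset.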

> 
> > 
> > I think in the end
> > we need to decide if we want to do signed or unsigned multiplication.
> > And depending on this we need to do a signed or unsigned right shift
> > accordingly.
> 
> I agree.
> > 
> > So if we did have a mul_s64_u64_shr wouldn't it need to cast
> > either of the arguments and then use mul_u64_u64 (or mul_s64_s64) being
> > just a wrapper essentially? I don't think that we can guarantee that
> > casting either of the arguments is always safe if we want to advertise
> > this as a general purpose function.
> 
> Note that I don't have a strong opinion on this whole thing, so if you feel
> that it should be left as is, I don't mind.
> 
> > 
> > Thanks for the reviews!
> 
> No problem!
> 
> Best regards,
>         Maxim Levitsky

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 6/8] KVM: VMX: Make vmx_write_l1_tsc_offset() work with nested TSC scaling
  2021-05-11 12:44       ` Maxim Levitsky
@ 2021-05-11 17:44         ` Stamatis, Ilias
  0 siblings, 0 replies; 36+ messages in thread
From: Stamatis, Ilias @ 2021-05-11 17:44 UTC (permalink / raw)
  To: kvm, pbonzini, mlevitsk, ilstam
  Cc: jmattson, Woodhouse, David, vkuznets, joro, mtosatti, zamsden,
	seanjc, wanpengli

On Tue, 2021-05-11 at 15:44 +0300, Maxim Levitsky wrote:
> On Mon, 2021-05-10 at 16:08 +0000, Stamatis, Ilias wrote:
> > On Mon, 2021-05-10 at 16:54 +0300, Maxim Levitsky wrote:
> > > On Thu, 2021-05-06 at 10:32 +0000, ilstam@mailbox.org wrote:
> > > > From: Ilias Stamatis <ilstam@amazon.com>
> > > > 
> > > > Calculating the current TSC offset is done differently when nested TSC
> > > > scaling is used.
> > > > 
> > > > Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
> > > > ---
> > > >  arch/x86/kvm/vmx/vmx.c | 26 ++++++++++++++++++++------
> > > >  1 file changed, 20 insertions(+), 6 deletions(-)
> > > > 
> > > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > > index 49241423b854..df7dc0e4c903 100644
> > > > --- a/arch/x86/kvm/vmx/vmx.c
> > > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > > @@ -1797,10 +1797,16 @@ static void setup_msrs(struct vcpu_vmx *vmx)
> > > >               vmx_update_msr_bitmap(&vmx->vcpu);
> > > >  }
> > > > 
> > > > -static u64 vmx_write_l1_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
> > > > +/*
> > > > + * This function receives the requested offset for L1 as an argument but it
> > > > + * actually writes the "current" tsc offset to the VMCS and returns it. The
> > > > + * current offset might be different in case an L2 guest is currently running
> > > > + * and its VMCS02 is loaded.
> > > > + */
> > > 
> > > (Not related to this patch) It might be a good idea to rename this callback
> > > instead of this comment, but I am not sure about it.
> > > 
> > 
> > Yes! I was planning to do this on v2 anyway as the name of that function
> > is completely misleading/wrong IMHO.
> > 
> > I think that even the comment inside it that explains that when L1
> > doesn't trap WRMSR then L2 TSC writes overwrite L1's TSC is misplaced.
> > It should go one or more levels above I believe.
> > 
> > This function simply updates the TSC offset in the current VMCS
> > depending on whether L1 or L2 is executing.
> > 
> > > > +static u64 vmx_write_l1_tsc_offset(struct kvm_vcpu *vcpu, u64 l1_offset)
> > > >  {
> > > >       struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> > > > -     u64 g_tsc_offset = 0;
> > > > +     u64 cur_offset = l1_offset;
> > > > 
> > > >       /*
> > > >        * We're here if L1 chose not to trap WRMSR to TSC. According
> > > > @@ -1809,11 +1815,19 @@ static u64 vmx_write_l1_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
> > > >        * to the newly set TSC to get L2's TSC.
> > > >        */
> > > >       if (is_guest_mode(vcpu) &&
> > > > -         (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETTING))
> > > > -             g_tsc_offset = vmcs12->tsc_offset;
> > > > +         (vmcs12->cpu_based_vm_exec_control & CPU_BASED_USE_TSC_OFFSETTING)) {
> > > > +             if (vmcs12->secondary_vm_exec_control & SECONDARY_EXEC_TSC_SCALING) {
> > > > +                     cur_offset = kvm_compute_02_tsc_offset(
> > > > +                                     l1_offset,
> > > > +                                     vmcs12->tsc_multiplier,
> > > > +                                     vmcs12->tsc_offset);
> > > > +             } else {
> > > > +                     cur_offset = l1_offset + vmcs12->tsc_offset;
> > > > +             }
> > > > +     }
> > > > 
> > > > -     vmcs_write64(TSC_OFFSET, offset + g_tsc_offset);
> > > > -     return offset + g_tsc_offset;
> > > > +     vmcs_write64(TSC_OFFSET, cur_offset);
> > > > +     return cur_offset;
> > > >  }
> > > > 
> > > >  /*
> > > 
> > > This code would be ideal to move to common code as SVM will do basically
> > > the same thing.
> > > Doesn't have to be done now though.
> > > 
> > 
> > Hmm, how can we do the feature availability checking in common code?
> 
> We can add a vendor callback for this.
> 
> Just a few thoughts about how I think we can implement
> the nested TSC scaling in (mostly) common code:
> 
> 
> Assuming that the common code knows that:
> 1. Nested guest is running (already the case)
> 
> 2. The format of the scaling multiplier is known
> (thankfully both SVM and VMX use a fixed-point binary number).
> 
> SVM uses an 8.32 format and VMX a 16.48 format.
> 
> The common code already knows this via
> kvm_max_tsc_scaling_ratio/kvm_tsc_scaling_ratio_frac_bits.
> 
> 3. the value of nested TSC scaling multiplier
> is known to the common code.
> 
> (a callback to VMX/SVM code to query the TSC scaling value,
> and have it return 1 when TSC scaling is disabled should work)
> 

I suppose you mean returning kvm_default_tsc_scaling_ratio.

> 
> Then the common code can do the whole thing, and only
> let the SVM/VMX code write the actual multiplier.
> 
> As far as I know on the SVM, the TSC scaling works like that:
> 
> 1. SVM has a CPUID bit to indicate that tsc scaling is supported.
> (X86_FEATURE_TSCRATEMSR)
> 
> When this bit is set, TSC scale ratio is unconditionally enabled (but
> can be just 1), and it is set via a special MSR (MSR_AMD64_TSC_RATIO)
> rather than a field in VMCB (someone at AMD did cut corners...).
> 
> However since the TSC scaling is only effective when a guest is running,
> that MSR can be treated almost as if it was just another VMCB field.
> 
> The TSC scale value is a 32-bit fraction with an 8-bit integer part
> (as opposed to a 48-bit fraction and a 16-bit integer part on VMX).
> 
> I don't think that there are any other differences.
> 
> I should also note that I can do all of the above myself if
> I end up implementing the nested TSC scaling on AMD
> so I don't object much to the way that this patch series is done.
> 

That's fine, I will add the callbacks and move everything to common code. And
then you can fill the svm-specific bits if you want.
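Just to illustrate how little of the merge logic is actually vendor-specific:
the only inputs that differ are the fraction width and where the multiplier
lives, so the common code could boil down to a helper like this (with
mul_u64_u64_shr again emulated via __int128 to keep the sketch
self-contained; the helper name is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;

/* Stand-in for the kernel's mul_u64_u64_shr(). */
static u64 mul_u64_u64_shr(u64 a, u64 mul, unsigned int shift)
{
	return (u64)(((unsigned __int128)a * mul) >> shift);
}

/*
 * Merge L1's and L2's scaling ratios into the ratio for the 02 control
 * block. frac_bits is 48 on VMX (16.48 format) and 32 on SVM (8.32
 * format); a ratio of 1ULL << frac_bits means "no scaling".
 */
static u64 merge_tsc_ratio(u64 ratio01, u64 ratio12, unsigned int frac_bits)
{
	return mul_u64_u64_shr(ratio01, ratio12, frac_bits);
}
```

The vendor callback would then only need to report frac_bits (already known
via kvm_tsc_scaling_ratio_frac_bits) and the nested multiplier, returning the
default ratio when nested TSC scaling is disabled.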

Thanks,
Ilias

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2021-05-11 17:45 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-06 10:32 [PATCH 0/8] KVM: VMX: Implement nested TSC scaling ilstam
2021-05-06 10:32 ` [PATCH 1/8] KVM: VMX: Add a TSC multiplier field in VMCS12 ilstam
2021-05-06 14:50   ` kernel test robot
2021-05-06 17:36   ` Jim Mattson
2021-05-10 13:42   ` Maxim Levitsky
2021-05-06 10:32 ` [PATCH 2/8] KVM: X86: Store L1's TSC scaling ratio in 'struct kvm_vcpu_arch' ilstam
2021-05-10 13:43   ` Maxim Levitsky
2021-05-06 10:32 ` [PATCH 3/8] KVM: X86: Pass an additional 'L1' argument to kvm_scale_tsc() ilstam
2021-05-10 13:52   ` Maxim Levitsky
2021-05-10 15:44     ` Stamatis, Ilias
2021-05-06 10:32 ` [PATCH 4/8] KVM: VMX: Adjust the TSC-related VMCS fields on L2 entry and exit ilstam
2021-05-06 11:32   ` Paolo Bonzini
2021-05-06 17:35     ` Stamatis, Ilias
2021-05-10 14:11       ` Paolo Bonzini
2021-05-10 13:53   ` Maxim Levitsky
2021-05-10 14:44     ` Stamatis, Ilias
2021-05-11 12:38       ` Maxim Levitsky
2021-05-11 15:11         ` Stamatis, Ilias
2021-05-06 10:32 ` [PATCH 5/8] KVM: X86: Move tracing outside write_l1_tsc_offset() ilstam
2021-05-10 13:54   ` Maxim Levitsky
2021-05-06 10:32 ` [PATCH 6/8] KVM: VMX: Make vmx_write_l1_tsc_offset() work with nested TSC scaling ilstam
2021-05-10 13:54   ` Maxim Levitsky
2021-05-10 16:08     ` Stamatis, Ilias
2021-05-11 12:44       ` Maxim Levitsky
2021-05-11 17:44         ` Stamatis, Ilias
2021-05-06 10:32 ` [PATCH 7/8] KVM: VMX: Expose TSC scaling to L2 ilstam
2021-05-10 13:56   ` Maxim Levitsky
2021-05-06 10:32 ` [PATCH 8/8] KVM: selftests: x86: Add vmx_nested_tsc_scaling_test ilstam
2021-05-10 13:59   ` Maxim Levitsky
2021-05-11 11:16     ` Stamatis, Ilias
2021-05-11 12:47       ` Maxim Levitsky
2021-05-11 14:02         ` Stamatis, Ilias
2021-05-06 17:16 ` [PATCH 0/8] KVM: VMX: Implement nested TSC scaling Jim Mattson
2021-05-06 17:48   ` Stamatis, Ilias
2021-05-10 13:43     ` Maxim Levitsky
2021-05-10 14:29   ` Paolo Bonzini
