linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] kvm,x86: Use the refined tsc rate for the guest tsc.
@ 2021-08-03  7:59 Suleiman Souhlal
  2021-08-06 16:20 ` Sean Christopherson
  0 siblings, 1 reply; 3+ messages in thread
From: Suleiman Souhlal @ 2021-08-03  7:59 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: ssouhlal, hikalium, senozhatsky, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86,
	H. Peter Anvin, kvm, linux-kernel, Suleiman Souhlal

Prior to this change, the initial tsc rate used by kvm would be
the unrefined rate, instead of the refined rate that is derived
later at boot and used for timekeeping. This can cause time to
advance at different rates between the host and the guest.

Signed-off-by: Suleiman Souhlal <suleiman@google.com>
---
 arch/x86/kvm/x86.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4116567f3d44..1e59bb326c10 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2199,6 +2199,7 @@ static atomic_t kvm_guest_has_master_clock = ATOMIC_INIT(0);
 #endif
 
 static DEFINE_PER_CPU(unsigned long, cpu_tsc_khz);
+static DEFINE_PER_CPU(bool, cpu_tsc_khz_changed);
 static unsigned long max_tsc_khz;
 
 static u32 adjust_tsc_khz(u32 khz, s32 ppm)
@@ -2906,6 +2907,14 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 		kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
 		return 1;
 	}
+	/*
+	 * Use the refined tsc_khz instead of the tsc_khz at boot (which was
+	 * not refined yet when we got it), if the tsc frequency hasn't changed.
+	 * If the frequency does change, it does not get refined any further,
+	 * so it is safe to use the one gotten from the notifiers.
+	 */
+	if (!__this_cpu_read(cpu_tsc_khz_changed))
+		tgt_tsc_khz = tsc_khz;
 	if (!use_master_clock) {
 		host_tsc = rdtsc();
 		kernel_ns = get_kvmclock_base_ns();
@@ -8086,6 +8095,8 @@ static void tsc_khz_changed(void *data)
 		khz = freq->new;
 	else if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
 		khz = cpufreq_quick_get(raw_smp_processor_id());
+	if (khz)
+		__this_cpu_write(cpu_tsc_khz_changed, true);
 	if (!khz)
 		khz = tsc_khz;
 	__this_cpu_write(cpu_tsc_khz, khz);
-- 
2.32.0.554.ge1b32706d8-goog


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH] kvm,x86: Use the refined tsc rate for the guest tsc.
  2021-08-03  7:59 [PATCH] kvm,x86: Use the refined tsc rate for the guest tsc Suleiman Souhlal
@ 2021-08-06 16:20 ` Sean Christopherson
  2022-02-15  2:18   ` Sean Christopherson
  0 siblings, 1 reply; 3+ messages in thread
From: Sean Christopherson @ 2021-08-06 16:20 UTC (permalink / raw)
  To: Suleiman Souhlal
  Cc: Paolo Bonzini, ssouhlal, hikalium, senozhatsky, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin, kvm,
	linux-kernel

"KVM: x86:" is the preferred scope.  And for whatever reason, punctuation at the
end of the shortlog is almost always omitted.

On Tue, Aug 03, 2021, Suleiman Souhlal wrote:
> Prior to this change,

In the changelog proper, please state what change is being made before launching
into the effects of the change.  Oftentimes that does mean restating the shortlog,
but it's very helpful to reviewers/readers to provide a single cohesive/coherent
explanation.

> the initial tsc rate used by kvm would be the unrefined rate, instead of the
> refined rate that is derived later at boot and used for timekeeping. This can
> cause time to advance at different rates between the host and the guest.

This needs a lot more explanation, with a clear statement of the direct effects
of the change and with explicit references to variables and probably functions.
Normally I prefer abstraction over explicit code references, but in this case
there are too many ambiguous terms to understand the intended change without a
lot of staring.  E.g. at first blush, it's not obvious that "boot" refers to the
host kernel boot, especially for those of us that load KVM as a module.

And the statement "the initial tsc rate used by kvm would be the unrefined rate"
will not always hold true, e.g. when KVM is loaded long after boot.  That also
confused me.

IIUC, this "fixes" a race where KVM is initialized before the second call to
tsc_refine_calibration_work() completes.  Fixes in quotes because it doesn't
actually fix the race, it just papers over the problem to get the desired behavior.
If the race can't be truly fixed, the changelog should explain why it can't be
fixed, otherwise fudging our way around the race is not justifiable.

Ideally, we would find a way to fix the race, e.g. by ensuring KVM can't load or
by stalling KVM initialization until refinement completes (or fails).  tsc_khz is
consumed by KVM in multiple paths, and initializing KVM before tsc_khz calibration
is fully refined means some part of KVM will use the wrong tsc_khz, e.g. the VMX
preemption timer.  Due to sanity checks in tsc_refine_calibration_work(), the delta
won't be more than 1%, but it's still far from ideal.

> Signed-off-by: Suleiman Souhlal <suleiman@google.com>
> ---
>  arch/x86/kvm/x86.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 4116567f3d44..1e59bb326c10 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2199,6 +2199,7 @@ static atomic_t kvm_guest_has_master_clock = ATOMIC_INIT(0);
>  #endif
>  
>  static DEFINE_PER_CPU(unsigned long, cpu_tsc_khz);
> +static DEFINE_PER_CPU(bool, cpu_tsc_khz_changed);
>  static unsigned long max_tsc_khz;
>  
>  static u32 adjust_tsc_khz(u32 khz, s32 ppm)
> @@ -2906,6 +2907,14 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
>  		kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
>  		return 1;
>  	}
> +	/*
> +	 * Use the refined tsc_khz instead of the tsc_khz at boot (which was
> +	 * not refined yet when we got it),

As above, this does not hold true in all cases.  And for anyone that isn't familiar
with tsc_refine_calibration_work(), the "refined tsc_khz instead of the tsc_khz"
blurb is borderline nonsensical.

> +	 * If the frequency does change, it does not get refined any further,
> +	 * so it is safe to use the one gotten from the notifiers.
> +	 */
> +	if (!__this_cpu_read(cpu_tsc_khz_changed))
> +		tgt_tsc_khz = tsc_khz;
>  	if (!use_master_clock) {
>  		host_tsc = rdtsc();
>  		kernel_ns = get_kvmclock_base_ns();
> @@ -8086,6 +8095,8 @@ static void tsc_khz_changed(void *data)
>  		khz = freq->new;
>  	else if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
>  		khz = cpufreq_quick_get(raw_smp_processor_id());
> +	if (khz)
> +		__this_cpu_write(cpu_tsc_khz_changed, true);

On a system with a constant TSC and KVM loaded after boot, cpu_tsc_khz_changed
will never be true.  Ditto for the case where refinement fails.  That may not be
a functional issue, but it's confusing.

This also fails to gate other readers of kvm_guest_time_update.  In particular,
KVM_GET_CLOCK -> get_kvmclock_ns() will use the "wrong" frequency when userspace
is retrieving guest TSC.

>  	if (!khz)
>  		khz = tsc_khz;
>  	__this_cpu_write(cpu_tsc_khz, khz);
> -- 
> 2.32.0.554.ge1b32706d8-goog
> 

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH] kvm,x86: Use the refined tsc rate for the guest tsc.
  2021-08-06 16:20 ` Sean Christopherson
@ 2022-02-15  2:18   ` Sean Christopherson
  0 siblings, 0 replies; 3+ messages in thread
From: Sean Christopherson @ 2022-02-15  2:18 UTC (permalink / raw)
  To: Suleiman Souhlal
  Cc: Paolo Bonzini, ssouhlal, hikalium, senozhatsky, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, x86, H. Peter Anvin, kvm,
	linux-kernel, Anton Romanov

+Anton

On Fri, Aug 06, 2021, Sean Christopherson wrote:
> IIUC, this "fixes" a race where KVM is initialized before the second call to
> tsc_refine_calibration_work() completes.  Fixes in quotes because it doesn't
> actually fix the race, it just papers over the problem to get the desired behavior.
> If the race can't be truly fixed, the changelog should explain why it can't be
> fixed, otherwise fudging our way around the race is not justifiable.
> 
> Ideally, we would find a way to fix the race, e.g. by ensuring KVM can't load or
> by stalling KVM initialization until refinement completes (or fails).  tsc_khz is
> consumed by KVM in multiple paths, and initializing KVM before tsc_khz calibration
> is fully refined means some part of KVM will use the wrong tsc_khz, e.g. the VMX
> preemption timer.  Due to sanity checks in tsc_refine_calibration_work(), the delta
> won't be more than 1%, but it's still far from ideal.

Hmm, for systems with a constant TSC, KVM can fudge around the issue by not taking
a snapshot.  It's still racy and potentially fragile, e.g. if userspace manages
to create a vCPU before tsc_khz is refined, but it's not a bad standalone patch
and if it fixes your use case...

The only other alternative I can come up with is add a one-off "notifier" for KVM,
but that's rather gross, especially since TSC refinement is (hopefully) headed the
way of the Dodo.

Does this remedy your issues?  Any idea if you need to support old CPUs that don't
provide a constant TSC?

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index eaa3b5b89c5e..6a75c2748bff 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8708,13 +8708,13 @@ static int kvmclock_cpu_online(unsigned int cpu)

 static void kvm_timer_init(void)
 {
-       max_tsc_khz = tsc_khz;
-
        if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
 #ifdef CONFIG_CPU_FREQ
                struct cpufreq_policy *policy;
                int cpu;

+               max_tsc_khz = tsc_khz;
+
                cpu = get_cpu();
                policy = cpufreq_cpu_get(cpu);
                if (policy) {
@@ -11144,7 +11144,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
        vcpu->arch.msr_platform_info = MSR_PLATFORM_INFO_CPUID_FAULT;
        kvm_vcpu_mtrr_init(vcpu);
        vcpu_load(vcpu);
-       kvm_set_tsc_khz(vcpu, max_tsc_khz);
+       kvm_set_tsc_khz(vcpu, max_tsc_khz ? : tsc_khz);
        kvm_vcpu_reset(vcpu, false);
        kvm_init_mmu(vcpu);
        vcpu_put(vcpu);

^ permalink raw reply related	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2022-02-15  2:18 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-03  7:59 [PATCH] kvm,x86: Use the refined tsc rate for the guest tsc Suleiman Souhlal
2021-08-06 16:20 ` Sean Christopherson
2022-02-15  2:18   ` Sean Christopherson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).