Re: [PATCH RFC 1/1] KVM: x86: add param to update master clock periodically

From: Sean Christopherson <seanjc@google.com>
To: David Woodhouse <dwmw2@infradead.org>
Cc: Dongli Zhang <dongli.zhang@oracle.com>,
	Joe Jin <joe.jin@oracle.com>,
	x86@kernel.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, pbonzini@redhat.com,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com
Subject: Re: [PATCH RFC 1/1] KVM: x86: add param to update master clock periodically
Date: Wed, 4 Oct 2023 11:06:57 -0700	[thread overview]
Message-ID: <ZR2pwdZtO3WLCwjj@google.com> (raw)
In-Reply-To: <d6dc1242ff731cf0f2826760816081674ade9ff9.camel@infradead.org>

On Wed, Oct 04, 2023, David Woodhouse wrote:
> On Tue, 2023-10-03 at 17:04 -0700, Sean Christopherson wrote:
> > > Can't we ensure that the kvmclock uses the *same* algorithm,
> > > precisely, as CLOCK_MONOTONIC_RAW?
> > 
> > Yes?  At least for sane hardware, after much staring, I think it's possible.
> > 
> > It's tricky because the two algorithms are wierdly different, the PV clock algorithm
> > is ABI and thus immutable, and Thomas and the timekeeping folks would rightly laugh
> > at us for suggesting that we try to shove the pvclock algorithm into the kernel.
> > 
> > The hardcoded shift right 32 in PV clock is annoying, but not the end of the world.
> > 
> > Compile tested only, but I believe this math is correct.  And I'm guessing we'd
> > want some safeguards against overflow, e.g. due to a multiplier that is too big.
> > 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 6573c89c35a9..ae9275c3d580 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -3212,9 +3212,19 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
> >                                             v->arch.l1_tsc_scaling_ratio);
> >  
> >         if (unlikely(vcpu->hw_tsc_khz != tgt_tsc_khz)) {
> > -               kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL,
> > -                                  &vcpu->hv_clock.tsc_shift,
> > -                                  &vcpu->hv_clock.tsc_to_system_mul);
> > +               u32 shift, mult;
> > +
> > +               clocks_calc_mult_shift(&mult, &shift, tgt_tsc_khz, NSEC_PER_MSEC, 600);
> > +
> > +               if (shift <= 32) {
> > +                       vcpu->hv_clock.tsc_shift = 0;
> > +                       vcpu->hv_clock.tsc_to_system_mul = mult * BIT(32 - shift);
> > +               } else {
> > +                       kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL,
> > +                                          &vcpu->hv_clock.tsc_shift,
> > +                                          &vcpu->hv_clock.tsc_to_system_mul);
> > +               }
> > +
> >                 vcpu->hw_tsc_khz = tgt_tsc_khz;
> >                 kvm_xen_update_tsc_info(v);
> >         }
> > 
> 
> I gave that a go on my test box, and for a TSC frequency of 2593992 kHz
> it got mult=1655736523, shift=32 and took the 'happy' path instead of
> falling back.
> 
> It still drifts about the same though, using the same test as before:
> https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/kvmclock
> 
> 
> I was going to facetiously suggest that perhaps the kvmclock should
> have leap nanoseconds... but then realised that that's basically what
> Dongli's patch is *doing*. Maybe we just need to *recognise* that,

Yeah, I suspect trying to get kvmclock to always precisely align with the kernel's
monotonic raw clock is a fool's errand.

> so rather than having a user-configured period for the update, KVM could
> calculate the frequency for the updates based on the rate at which the clocks
> would otherwise drift, and a maximum delta? Not my favourite option, but
> perhaps better than nothing? 

Holy moly, the existing code for the periodic syncs/updates is a mess.  If I'm
reading the code correctly, commits

  0061d53daf26 ("KVM: x86: limit difference between kvmclock updates")
  7e44e4495a39 ("x86: kvm: rate-limit global clock updates")
  332967a3eac0 ("x86: kvm: introduce periodic global clock updates")

splattered together an immpressively inefficient update mechanism.

On the first vCPU creation, KVM schedules kvmclock_sync_fn() at a hardcoded rate
of 300hz.

	if (kvmclock_periodic_sync && vcpu->vcpu_idx == 0)
		schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
						KVMCLOCK_SYNC_PERIOD);

That handler does two things: schedule "delayed" work kvmclock_update_fn() to
be executed immediately, and reschedule kvmclock_sync_fn() at 300hz.
kvmclock_sync_fn() then kicks *every* vCPU in the VM, i.e. KVM kicks every vCPU
to sync kvmlock at a 300hz frequency.  

If we're going to kick every vCPU, then we might as well do a masterclock update,
because the extra cost of synchronizing the masterclock is likely in the noise
compared to kicking every vCPU.  There's also zero reason to do the work in vCPU
context.

And because that's not enough, on pCPU migration or if the TSC is unstable,
kvm_arch_vcpu_load() requests KVM_REQ_GLOBAL_CLOCK_UPDATE, which schedules
kvmclock_update_fn() with a delay of 100ms.  The large delay is to play nice with
unstable TSCs.  But if KVM is periodically doing clock updates on all vCPU,
scheduling another update with a *longer* delay is silly.

The really, really stupid part of all is that the periodic syncs happen even if
kvmclock isn't exposed to the guest.  *sigh*

So rather than add yet another periodic work function, I think we should clean up
the mess we have, fix the whole "leapseconds" mess with the masterclock, and then
tune the frequency (if necessary).

Something like the below is what I'm thinking.  Once the dust settles, I'd like
to do dynamically enable/disable kvmclock_sync_work based on whether or not the
VM actually has vCPU's with a pvclock, but that's definitely an enhancement that
can go on top.

Does this look sane, or am I missing something?

---
 arch/x86/include/asm/kvm_host.h |  3 +-
 arch/x86/kvm/x86.c              | 53 +++++++++++----------------------
 2 files changed, 19 insertions(+), 37 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 34a64527654c..d108452fc301 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -98,7 +98,7 @@
 	KVM_ARCH_REQ_FLAGS(14, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
 #define KVM_REQ_SCAN_IOAPIC \
 	KVM_ARCH_REQ_FLAGS(15, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
-#define KVM_REQ_GLOBAL_CLOCK_UPDATE	KVM_ARCH_REQ(16)
+/* AVAILABLE BIT!!!!			KVM_ARCH_REQ(16) */
 #define KVM_REQ_APIC_PAGE_RELOAD \
 	KVM_ARCH_REQ_FLAGS(17, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
 #define KVM_REQ_HV_CRASH		KVM_ARCH_REQ(18)
@@ -1336,7 +1336,6 @@ struct kvm_arch {
 	bool use_master_clock;
 	u64 master_kernel_ns;
 	u64 master_cycle_now;
-	struct delayed_work kvmclock_update_work;
 	struct delayed_work kvmclock_sync_work;
 
 	struct kvm_xen_hvm_config xen_hvm_config;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6573c89c35a9..5d35724f1963 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2367,7 +2367,7 @@ static void kvm_write_system_time(struct kvm_vcpu *vcpu, gpa_t system_time,
 	}
 
 	vcpu->arch.time = system_time;
-	kvm_make_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu);
+	kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
 
 	/* we verify if the enable bit is set... */
 	if (system_time & 1)
@@ -3257,30 +3257,6 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 
 #define KVMCLOCK_UPDATE_DELAY msecs_to_jiffies(100)
 
-static void kvmclock_update_fn(struct work_struct *work)
-{
-	unsigned long i;
-	struct delayed_work *dwork = to_delayed_work(work);
-	struct kvm_arch *ka = container_of(dwork, struct kvm_arch,
-					   kvmclock_update_work);
-	struct kvm *kvm = container_of(ka, struct kvm, arch);
-	struct kvm_vcpu *vcpu;
-
-	kvm_for_each_vcpu(i, vcpu, kvm) {
-		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
-		kvm_vcpu_kick(vcpu);
-	}
-}
-
-static void kvm_gen_kvmclock_update(struct kvm_vcpu *v)
-{
-	struct kvm *kvm = v->kvm;
-
-	kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
-	schedule_delayed_work(&kvm->arch.kvmclock_update_work,
-					KVMCLOCK_UPDATE_DELAY);
-}
-
 #define KVMCLOCK_SYNC_PERIOD (300 * HZ)
 
 static void kvmclock_sync_fn(struct work_struct *work)
@@ -3290,12 +3266,14 @@ static void kvmclock_sync_fn(struct work_struct *work)
 					   kvmclock_sync_work);
 	struct kvm *kvm = container_of(ka, struct kvm, arch);
 
-	if (!kvmclock_periodic_sync)
-		return;
+	if (ka->use_master_clock)
+		kvm_update_masterclock(kvm);
+	else
+		kvm_make_all_cpus_request(kvm, KVM_REQ_CLOCK_UPDATE);
 
-	schedule_delayed_work(&kvm->arch.kvmclock_update_work, 0);
-	schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
-					KVMCLOCK_SYNC_PERIOD);
+	if (kvmclock_periodic_sync)
+		schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
+				      KVMCLOCK_SYNC_PERIOD);
 }
 
 /* These helpers are safe iff @msr is known to be an MCx bank MSR. */
@@ -4845,7 +4823,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 		 * kvmclock on vcpu->cpu migration
 		 */
 		if (!vcpu->kvm->arch.use_master_clock || vcpu->cpu == -1)
-			kvm_make_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu);
+			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
 		if (vcpu->cpu != cpu)
 			kvm_make_request(KVM_REQ_MIGRATE_TIMER, vcpu);
 		vcpu->cpu = cpu;
@@ -10520,12 +10498,19 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 			__kvm_migrate_timers(vcpu);
 		if (kvm_check_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu))
 			kvm_update_masterclock(vcpu->kvm);
-		if (kvm_check_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu))
-			kvm_gen_kvmclock_update(vcpu);
 		if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, vcpu)) {
 			r = kvm_guest_time_update(vcpu);
 			if (unlikely(r))
 				goto out;
+
+			/*
+			 * Ensure all other vCPUs synchronize "soon", e.g. so
+			 * that all vCPUs recognize NTP corrections and drift
+			 * corrections (relative to the kernel's raw clock).
+			 */
+			if (!kvmclock_periodic_sync)
+				schedule_delayed_work(&vcpu->kvm->arch.kvmclock_sync_work,
+						      KVMCLOCK_UPDATE_DELAY);
 		}
 		if (kvm_check_request(KVM_REQ_MMU_SYNC, vcpu))
 			kvm_mmu_sync_roots(vcpu);
@@ -12345,7 +12330,6 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 	kvm->arch.hv_root_tdp = INVALID_PAGE;
 #endif
 
-	INIT_DELAYED_WORK(&kvm->arch.kvmclock_update_work, kvmclock_update_fn);
 	INIT_DELAYED_WORK(&kvm->arch.kvmclock_sync_work, kvmclock_sync_fn);
 
 	kvm_apicv_init(kvm);
@@ -12387,7 +12371,6 @@ static void kvm_unload_vcpu_mmus(struct kvm *kvm)
 void kvm_arch_sync_events(struct kvm *kvm)
 {
 	cancel_delayed_work_sync(&kvm->arch.kvmclock_sync_work);
-	cancel_delayed_work_sync(&kvm->arch.kvmclock_update_work);
 	kvm_free_pit(kvm);
 }
 

base-commit: e2c8c2928d93f64b976b9242ddb08684b8cdea8d
--