Re: [PATCH v2 1/2] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration

From: David Woodhouse <dwmw2@infradead.org>
To: paul@xen.org, Paul Durrant <xadimgnik@gmail.com>,
	Jack Allister <jalliste@amazon.com>
Cc: bp@alien8.de, corbet@lwn.net, dave.hansen@linux.intel.com,
	hpa@zytor.com,  kvm@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org,  mingo@redhat.com,
	pbonzini@redhat.com, seanjc@google.com, tglx@linutronix.de,
	 x86@kernel.org, Dongli Zhang <dongli.zhang@oracle.com>
Subject: Re: [PATCH v2 1/2] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration
Date: Wed, 17 Apr 2024 20:50:43 +0100	[thread overview]
Message-ID: <c582ce58f3b13528b8d9c1f28cdac6fbd41cb775.camel@infradead.org> (raw)
In-Reply-To: <26bfe5ec-e583-458d-8e43-e5ecdc5883cc@xen.org>

[-- Attachment #1: Type: text/plain, Size: 5433 bytes --]

On Wed, 2024-04-10 at 13:43 +0100, Paul Durrant wrote:
> On 10/04/2024 13:09, David Woodhouse wrote:
> > On 10 April 2024 11:29:13 BST, Paul Durrant <xadimgnik@gmail.com>
> > wrote:
> > > On 10/04/2024 10:52, Jack Allister wrote:
> > > > +        * It's possible that this vCPU doesn't have a HVCLOCK
> > > > configured
> > > > +        * but the other vCPUs may. If this is the case
> > > > calculate based
> > > > +        * upon the time gathered in the seqcount but do not
> > > > update the
> > > > +        * vCPU specific PVTI. If we have one, then use that.
> > > 
> > > Given this is a per-vCPU ioctl, why not fail in the case the vCPU
> > > doesn't have HVCLOCK configured? Or is your intention that a
> > > GET/SET should always work if TSC is stable?
> > 
> > It definitely needs to work for SET even when the vCPU hasn't been
> > run yet (and doesn't have a hvclock in vcpu->arch.hv_clock).
> 
> So would it make sense to set up hvclock earlier?

Yeah, and I think we can do so just by calling kvm_guest_time_update().

The GET function can look like this:

static int kvm_vcpu_ioctl_get_clock_guest(struct kvm_vcpu *v, void __user *argp)
{
	struct pvclock_vcpu_time_info *hv_clock = &v->arch.hv_clock;

	/*
	 * If KVM_REQ_CLOCK_UPDATE is already pending, or if the hv_clock has
	 * never been generated at all, call kvm_guest_time_update() to do so.
	 * Might as well use the PVCLOCK_TSC_STABLE_BIT as the check for ever
	 * having been written.
	 */
	if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, v) ||
	    !(hv_clock->flags & PVCLOCK_TSC_STABLE_BIT)) {
		if (kvm_guest_time_update(v))
			return -EINVAL;
	}

	/*
	 * PVCLOCK_TSC_STABLE_BIT is set in use_master_clock mode where the
	 * KVM clock is defined in terms of the guest TSC. Otherwise, it is
	 * is defined by the host CLOCK_MONOTONIC_RAW, and userspace should
	 * use the legacy KVM_[GS]ET_CLOCK to migrate it.
	 */
	if (!(hv_clock->flags & PVCLOCK_TSC_STABLE_BIT))
		return -EINVAL;

	if (copy_to_user(argp, hv_clock, sizeof(*hv_clock)))
		return -EFAULT;

	return 0;
}

And the SET function doesn't even *need* the existing vCPU's hv_clock,
because we know damn well that the number of TSC cycles elapsed between
the reference time point and... erm... the reference time point... is
zero.

And everything *else* the hv_clock was being used for, either in Jack's
version or my own (where I used it for checking PVCLOCK_TSC_STABLE_BIT
and even used my new hvclock_to_hz() on it), can be done differently
too.

https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/clocks
(Try not to look at the 'Improve accuracy of KVM clock' one. It'll just
make you sad. Let Jack and me get to the end of the TODO list and you
can have all the sadness in one go like pulling a band-aid off.)

static int kvm_vcpu_ioctl_set_clock_guest(struct kvm_vcpu *v, void __user *argp)
{
	struct pvclock_vcpu_time_info user_hv_clock;
	struct kvm *kvm = v->kvm;
	struct kvm_arch *ka = &kvm->arch;
	uint64_t curr_tsc_hz, user_tsc_hz;
	uint64_t user_clk_ns;
	uint64_t guest_tsc;
	int rc = 0;

	if (copy_from_user(&user_hv_clock, argp, sizeof(user_hv_clock)))
		return -EFAULT;

	if (!user_hv_clock.tsc_to_system_mul)
		return -EINVAL;

	user_tsc_hz = hvclock_to_hz(user_hv_clock.tsc_to_system_mul,
				    user_hv_clock.tsc_shift);

	kvm_hv_request_tsc_page_update(kvm);
	kvm_start_pvclock_update(kvm);
	pvclock_update_vm_gtod_copy(kvm);

	/*
	 * If not in use_master_clock mode, do not allow userspace to set
	 * the clock in terms of the guest TSC. Userspace should either
	 * fail the migration (to a host with suboptimal TSCs), or should
	 * knowingly restore the KVM clock using KVM_SET_CLOCK instead.
	 */
	if (!ka->use_master_clock) {
		rc = -EINVAL;
		goto out;
	}

	curr_tsc_hz = get_cpu_tsc_khz() * 1000LL;
	if (unlikely(curr_tsc_hz == 0)) {
		rc = -EINVAL;
		goto out;
	}

	if (kvm_caps.has_tsc_control)
		curr_tsc_hz = kvm_scale_tsc(curr_tsc_hz,
					    v->arch.l1_tsc_scaling_ratio);

	/*
	 * The scaling factors in the hv_clock do not depend solely on the
	 * TSC frequency *requested* by userspace. They actually use the
	 * host TSC frequency that was measured/detected by the host kernel,
	 * scaled by kvm_scale_tsc() with the vCPU's l1_tsc_scaling_ratio.
	 *
	 * So a sanity check that they *precisely* match would have false
	 * negatives. Allow for a discrepancy of 1 kHz either way.
	 */
	if (user_tsc_hz < curr_tsc_hz - 1000 ||
	    user_tsc_hz > curr_tsc_hz + 1000) {
		rc = -ERANGE;
		goto out;
	}

	/*
	 * The call to pvclock_update_vm_gtod_copy() has created a new time
	 * reference point in ka->master_cycle_now and ka->master_kernel_ns.
	 *
	 * Calculate the guest TSC at that moment, and the corresponding KVM
	 * clock value according to user_hv_clock. The value according to the
	 * current hv_clock will of course be ka->master_kernel_ns since no
	 * TSC cycles have elapsed.
	 *
	 * Adjust ka->kvmclock_offset to the delta, so that both definitions
	 * of the clock give precisely the same reading at the reference time.
	 */
	guest_tsc = kvm_read_l1_tsc(v, ka->master_cycle_now);
	user_clk_ns = __pvclock_read_cycles(&user_hv_clock, guest_tsc);
	ka->kvmclock_offset = user_clk_ns - ka->master_kernel_ns;

out:
	kvm_end_pvclock_update(kvm);
	return rc;
}

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]