Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
@ 2022-03-22 19:18 Franke, Daniel
  2022-03-22 21:53 ` Oliver Upton
  0 siblings, 1 reply; 44+ messages in thread
From: Franke, Daniel @ 2022-03-22 19:18 UTC (permalink / raw)
  To: Oliver Upton, David Woodhouse
  Cc: Paolo Bonzini, kvm, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel

On 3/21/22, 5:24 PM, "Oliver Upton" <oupton@google.com> wrote:
 >  Right, but I'd argue that interface has some problems too. It
 >   depends on the guest polling instead of an interrupt from the
 >   hypervisor. It also has no way of informing the kernel exactly how much
 >   time has elapsed.

 >   The whole point of all these hacks that we've done internally is that we,
 >   the hypervisor, know full well how much real time hasv advanced during the
 >   VM blackout. If we can at least let the guest know how much to fudge real
 >   time, it can then poke NTP for better refinement. I worry about using NTP
 >   as the sole source of truth for such a mechanism, since you'll need to go
 >   out to the network and any reads until the response comes back are hosed.

(I'm a kernel newbie, so please excuse any ignorance with respect to kernel
Internals or kernel/hypervisor interfaces.)

We can have it both ways, I think. Let the hypervisor manipulate the guest TSC
so as to keep the guest kernel's idea of real time as accurate as possible 
without any awareness required on the guest's side. *Also* give the guest kernel
a notification in the form of a KVM_PVCLOCK_STOPPED event or whatever else,
and let the kernel propagate this notification to userspace so that the NTP
daemon can recombobulate itself as quickly as possible, treating whatever TSC
adjustment was received as best-effort only.

The KVM_PVCLOCK_STOPPED event should trigger a change in some of the
globals kept by kernel/time/ntp.c (which are visible to userspace through
adjtimex(2)). In particular, `time_esterror` and `time_maxerror` should get reset
to `NTP_PHASE_LIMIT` and time_status should get reset to `STA_UNSYNC`.

The aforementioned fields get overwritten by the NTP daemon every polling
interval, and whatever notification gets sent to userspace is going to be
asynchronous, so we need to avoid race conditions wherein userspace clobbers
them with its outdated idea of their correct value before the notification arrives.
I propose allocating one of the unused fields at the end of `struct timex` as a
`clockver` field. This field will be 0 at boot time, and get incremented after
clock discontinuity such as a KVM blackout, an ACPI suspend-to-RAM event,
or a manual setting of the clock through clock_settime(3) or similar. Augment
`modes` with an `ADJ_CLOCKVER` bit, with the semantics "make this call fail
If the `clockver` I passed in is not current".

As for the form of the notification to userspace... dunno, I think a netlink
interface makes the most sense? I'm not too opinionated here. Whatever we
decide, I'll be happy to contribute the patches to chrony and ntpd. The NTP
daemon should handle this notification by essentially resetting its state to
how it was at first startup, considering itself unsynchronized and immediately
kicking off new queries to all its peers.

One other thing the NTP daemon might want to do with this notification is
clobber its drift file, but only if it's running on different hardware than before.
But this requires getting that bit of information from the hypervisor to the
kernel and I haven't thought about how this should work.

There's a minor problem remaining here, which impacts NTP servers but not
clients. NTP servers don't use adjtimex to get the timestamps the use to
populate NTP packets; they just use clock_gettime, because the latter is a
VDSO call and the former is not. That means that they may not learn of the
`clockver` step in time to answer any queries that come in immediately
post-blackout. So, the server may briefly become a falseticker if the
hypervisor did too poor a job with the TSC step. I don't think this is a big
deal or something that needs to be addressed in the first iteration. The
long-term solution is to make adjtimex also be a VDSO when `modes` is
0, so that NTP daemons can use it for everything.

^ permalink raw reply	[flat|nested] 44+ messages in thread