* [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
@ 2022-03-16  4:53 Oliver Upton
  2022-03-16  7:47 ` Paolo Bonzini
  0 siblings, 1 reply; 33+ messages in thread
From: Oliver Upton @ 2022-03-16  4:53 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, David Woodhouse, Oliver Upton

The VMM has control of both the guest's TSC scale and offset. Extend the
described migration algorithm in the KVM_VCPU_TSC_OFFSET documentation
to cover TSC scaling.

Reported-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Oliver Upton <oupton@google.com>
---

Applies to kvm/queue (references KVM_{GET,SET}_TSC_KHZ on a VM fd).

Parent commit: 2ca1ba339ed8 ("KVM: x86: Test case for TSC scaling and offset sync")

 Documentation/virt/kvm/devices/vcpu.rst | 26 ++++++++++++++++---------
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
index 60a29972d3f1..85199e9b7f8d 100644
--- a/Documentation/virt/kvm/devices/vcpu.rst
+++ b/Documentation/virt/kvm/devices/vcpu.rst
@@ -180,13 +180,18 @@ Returns:
 	 ======= ======================================
 
 Specifies the guest's TSC offset relative to the host's TSC. The guest's
-TSC is then derived by the following equation:
+TSC is then derived by the following equation where tsc_scale is the
+ratio between the VM and host TSC frequencies:
 
-  guest_tsc = host_tsc + KVM_VCPU_TSC_OFFSET
+  guest_tsc = (host_tsc * tsc_scale) + KVM_VCPU_TSC_OFFSET
 
-This attribute is useful to adjust the guest's TSC on live migration,
-so that the TSC counts the time during which the VM was paused. The
-following describes a possible algorithm to use for this purpose.
+The VM's TSC frequency is configured with the KVM_{GET,SET}_TSC_KHZ
+vCPU or VM ioctls.
+
+The KVM_VCPU_TSC_OFFSET attribute is useful to adjust the guest's TSC
+on live migration, so that the TSC counts the time during which the VM
+was paused. The following describes a possible algorithm to use for this
+purpose.
 
 From the source VMM process:
 
@@ -202,7 +207,10 @@ From the source VMM process:
 
 From the destination VMM process:
 
-4. Invoke the KVM_SET_CLOCK ioctl, providing the source nanoseconds from
+4. Invoke the KVM_SET_TSC_KHZ ioctl to set the guest TSC frequency to
+   the value recorded in step 3 (freq).
+
+5. Invoke the KVM_SET_CLOCK ioctl, providing the source nanoseconds from
    kvmclock (guest_src) and CLOCK_REALTIME (host_src) in their respective
    fields.  Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
    structure.
@@ -214,10 +222,10 @@ From the destination VMM process:
    between the source pausing the VMs and the destination executing
    steps 4-7.
 
-5. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (tsc_dest) and
+6. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (tsc_dest) and
    kvmclock nanoseconds (guest_dest).
 
-6. Adjust the guest TSC offsets for every vCPU to account for (1) time
+7. Adjust the guest TSC offsets for every vCPU to account for (1) time
    elapsed since recording state and (2) difference in TSCs between the
    source and destination machine:
 
@@ -229,5 +237,5 @@ From the destination VMM process:
    a time of 0 in kvmclock.  The above formula ensures that it is the
    same on the destination as it was on the source).
 
-7. Write the KVM_VCPU_TSC_OFFSET attribute for every vCPU with the
+8. Write the KVM_VCPU_TSC_OFFSET attribute for every vCPU with the
    respective value derived in the previous step.
-- 
2.35.1.723.g4982287a31-goog
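The scale/offset relationship documented in this patch can be sketched with throwaway numbers. This is a plain-Python back-of-the-envelope model (all values made up for illustration), not KVM code:

```python
# Back-of-the-envelope model of the documented relationship:
#   guest_tsc = (host_tsc * tsc_scale) + KVM_VCPU_TSC_OFFSET
# All numbers below are hypothetical.

def guest_tsc(host_tsc, tsc_scale, offset):
    return int(host_tsc * tsc_scale) + offset

def offset_for(host_tsc, tsc_scale, wanted_guest_tsc):
    # Rearranged: offset = guest_tsc - (host_tsc * tsc_scale)
    return wanted_guest_tsc - int(host_tsc * tsc_scale)

# Source: guest runs at half the host TSC frequency with a small offset.
src_val = guest_tsc(host_tsc=6_000_000, tsc_scale=0.5, offset=1_000)

# Destination: the host TSC reads something unrelated; choose the offset
# that makes the guest TSC pick up where it left off.
new_ofs = offset_for(host_tsc=9_000_000, tsc_scale=0.5,
                     wanted_guest_tsc=src_val)
assert guest_tsc(9_000_000, 0.5, new_ofs) == src_val
```

A real VMM would additionally fold in the time that elapsed while the VM was paused, as step 7 of the algorithm describes; the sketch shows only the rearrangement of the equation.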


* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
@ 2022-03-22 19:18 Franke, Daniel
  2022-03-22 21:53 ` Oliver Upton
  0 siblings, 1 reply; 33+ messages in thread
From: Franke, Daniel @ 2022-03-22 19:18 UTC (permalink / raw)
  To: Oliver Upton, David Woodhouse
  Cc: Paolo Bonzini, kvm, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel

On 3/21/22, 5:24 PM, "Oliver Upton" <oupton@google.com> wrote:
 >  Right, but I'd argue that interface has some problems too. It
 >   depends on the guest polling instead of an interrupt from the
 >   hypervisor. It also has no way of informing the kernel exactly how much
 >   time has elapsed.

 >   The whole point of all these hacks that we've done internally is that we,
 >   the hypervisor, know full well how much real time has advanced during the
 >   VM blackout. If we can at least let the guest know how much to fudge real
 >   time, it can then poke NTP for better refinement. I worry about using NTP
 >   as the sole source of truth for such a mechanism, since you'll need to go
 >   out to the network and any reads until the response comes back are hosed.

(I'm a kernel newbie, so please excuse any ignorance with respect to kernel
internals or kernel/hypervisor interfaces.)

We can have it both ways, I think. Let the hypervisor manipulate the guest TSC
so as to keep the guest kernel's idea of real time as accurate as possible 
without any awareness required on the guest's side. *Also* give the guest kernel
a notification in the form of a KVM_PVCLOCK_STOPPED event or whatever else,
and let the kernel propagate this notification to userspace so that the NTP
daemon can recombobulate itself as quickly as possible, treating whatever TSC
adjustment was received as best-effort only.

The KVM_PVCLOCK_STOPPED event should trigger a change in some of the
globals kept by kernel/time/ntp.c (which are visible to userspace through
adjtimex(2)). In particular, `time_esterror` and `time_maxerror` should get reset
to `NTP_PHASE_LIMIT`, and `time_status` should get the `STA_UNSYNC` bit set.
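A rough model of the reset being proposed here, in plain Python rather than kernel code. The names mirror the kernel/time/ntp.c globals and <sys/timex.h>; the kernel's NTP_PHASE_LIMIT works out to 500000 usec, but treat the constants as illustrative:

```python
# Model of the proposed reset on a KVM_PVCLOCK_STOPPED-style event.
# Names mirror kernel/time/ntp.c globals; this is not kernel code.
NTP_PHASE_LIMIT = 500_000   # usec (MAXPHASE / NSEC_PER_USEC in the kernel)
STA_UNSYNC = 0x0040         # as in <sys/timex.h>

ntp_state = {"time_esterror": 16_000, "time_maxerror": 16_000,
             "time_status": 0}

def on_clock_discontinuity(state):
    # Max out the error bounds and mark the clock unsynchronized, so a
    # subsequent adjtimex(2) read reflects the discontinuity.
    state["time_esterror"] = NTP_PHASE_LIMIT
    state["time_maxerror"] = NTP_PHASE_LIMIT
    state["time_status"] |= STA_UNSYNC
```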

The aforementioned fields get overwritten by the NTP daemon every polling
interval, and whatever notification gets sent to userspace is going to be
asynchronous, so we need to avoid race conditions wherein userspace clobbers
them with its outdated idea of their correct value before the notification arrives.
I propose allocating one of the unused fields at the end of `struct timex` as a
`clockver` field. This field will be 0 at boot time, and get incremented after
clock discontinuity such as a KVM blackout, an ACPI suspend-to-RAM event,
or a manual setting of the clock through clock_settime(3) or similar. Augment
`modes` with an `ADJ_CLOCKVER` bit, with the semantics "make this call fail
if the `clockver` I passed in is not current".
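The proposed semantics can be modeled in a few lines. Everything here is hypothetical: neither a `clockver` field nor an `ADJ_CLOCKVER` bit exists in the real adjtimex(2) interface today, and the bit value below is invented for the sketch:

```python
# Hypothetical model of the proposed clockver / ADJ_CLOCKVER semantics.
# Nothing here exists in the real adjtimex(2) interface.
ADJ_CLOCKVER = 1 << 16  # invented mode bit for the sketch

class TimexModel:
    def __init__(self):
        self.clockver = 0  # 0 at boot time

    def discontinuity(self):
        # KVM blackout, ACPI suspend-to-RAM, clock_settime(), ...
        self.clockver += 1

    def adjtimex(self, modes, clockver=0):
        # "Make this call fail if the clockver I passed in is not current."
        if modes & ADJ_CLOCKVER and clockver != self.clockver:
            raise OSError("stale clockver; clock stepped since last read")
        return self.clockver
```

An NTP daemon using this would retry with the current `clockver` and treat the failure as its cue to resynchronize from scratch.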

As for the form of the notification to userspace... dunno, I think a netlink
interface makes the most sense? I'm not too opinionated here. Whatever we
decide, I'll be happy to contribute the patches to chrony and ntpd. The NTP
daemon should handle this notification by essentially resetting its state to
how it was at first startup, considering itself unsynchronized and immediately
kicking off new queries to all its peers.

One other thing the NTP daemon might want to do with this notification is
clobber its drift file, but only if it's running on different hardware than before.
But this requires getting that bit of information from the hypervisor to the
kernel and I haven't thought about how this should work.

There's a minor problem remaining here, which impacts NTP servers but not
clients. NTP servers don't use adjtimex to get the timestamps they use to
populate NTP packets; they just use clock_gettime, because the latter is a
vDSO call and the former is not. That means that they may not learn of the
`clockver` step in time to answer any queries that come in immediately
post-blackout. So, the server may briefly become a falseticker if the
hypervisor did too poor a job with the TSC step. I don't think this is a big
deal or something that needs to be addressed in the first iteration. The
long-term solution is to make adjtimex also a vDSO call when `modes` is
0, so that NTP daemons can use it for everything.




Thread overview: 33+ messages
2022-03-16  4:53 [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm Oliver Upton
2022-03-16  7:47 ` Paolo Bonzini
2022-03-18 18:39   ` Oliver Upton
2022-03-19  7:52     ` Paolo Bonzini
2022-03-19  7:59       ` Oliver Upton
2022-03-19  8:08         ` David Woodhouse
2022-03-19 11:54           ` Paolo Bonzini
2022-03-19 13:00             ` Paolo Bonzini
2022-03-19 13:13               ` David Woodhouse
2022-03-20  8:10                 ` Paolo Bonzini
2022-03-20  8:52                   ` Oliver Upton
2022-03-20  9:46                     ` David Woodhouse
2022-03-21  0:38                       ` Oliver Upton
2022-03-21 19:43                         ` David Woodhouse
2022-03-21 21:23                           ` Oliver Upton
2022-03-20 13:39                     ` Paolo Bonzini
2022-03-21  0:51                       ` Oliver Upton
2022-03-21 12:36                         ` Paolo Bonzini
2022-03-21 12:56                           ` David Woodhouse
2022-03-21 12:16                       ` David Woodhouse
2022-03-21 13:10                         ` Paolo Bonzini
2022-03-21 14:59                           ` David Woodhouse
2022-03-22 19:18 Franke, Daniel
2022-03-22 21:53 ` Oliver Upton
2022-03-23 12:35   ` David Woodhouse
2022-03-23 16:21     ` Oliver Upton
2022-03-25  9:03       ` David Woodhouse
2022-03-25 17:47         ` Oliver Upton
2022-03-29 14:19   ` Thomas Gleixner
2022-03-29 16:02     ` Oliver Upton
2022-03-29 19:34       ` Thomas Gleixner
2022-06-30 11:58       ` David Woodhouse
2022-07-05 14:43         ` David Woodhouse
