[PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
@ 2022-03-16  4:53 Oliver Upton
  2022-03-16  7:47 ` Paolo Bonzini
  0 siblings, 1 reply; 33+ messages in thread
From: Oliver Upton @ 2022-03-16  4:53 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, David Woodhouse, Oliver Upton,
	David Woodhouse

The VMM has control of both the guest's TSC scale and offset. Extend the
described migration algorithm in the KVM_VCPU_TSC_OFFSET documentation
to cover TSC scaling.

Reported-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Oliver Upton <oupton@google.com>
---

Applies to kvm/queue (references KVM_{GET,SET}_TSC_KHZ on a VM fd).

Parent commit: 2ca1ba339ed8 ("KVM: x86: Test case for TSC scaling and offset sync")

 Documentation/virt/kvm/devices/vcpu.rst | 26 ++++++++++++++++---------
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
index 60a29972d3f1..85199e9b7f8d 100644
--- a/Documentation/virt/kvm/devices/vcpu.rst
+++ b/Documentation/virt/kvm/devices/vcpu.rst
@@ -180,13 +180,18 @@ Returns:
 	 ======= ======================================
 
 Specifies the guest's TSC offset relative to the host's TSC. The guest's
-TSC is then derived by the following equation:
+TSC is then derived by the following equation where tsc_scale is the
+ratio between the VM and host TSC frequencies:
 
-  guest_tsc = host_tsc + KVM_VCPU_TSC_OFFSET
+  guest_tsc = (host_tsc * tsc_scale) + KVM_VCPU_TSC_OFFSET
 
-This attribute is useful to adjust the guest's TSC on live migration,
-so that the TSC counts the time during which the VM was paused. The
-following describes a possible algorithm to use for this purpose.
+The VM's TSC frequency is configured with the KVM_{GET,SET}_TSC_KHZ
+vCPU or VM ioctls.
+
+The KVM_VCPU_TSC_OFFSET attribute is useful to adjust the guest's TSC
+on live migration, so that the TSC counts the time during which the VM
+was paused. The following describes a possible algorithm to use for this
+purpose.
 
 From the source VMM process:
 
@@ -202,7 +207,10 @@ From the source VMM process:
 
 From the destination VMM process:
 
-4. Invoke the KVM_SET_CLOCK ioctl, providing the source nanoseconds from
+4. Invoke the KVM_SET_TSC_KHZ ioctl to set the guest TSC frequency to
+   the value recorded in step 3 (freq).
+
+5. Invoke the KVM_SET_CLOCK ioctl, providing the source nanoseconds from
    kvmclock (guest_src) and CLOCK_REALTIME (host_src) in their respective
    fields.  Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
    structure.
@@ -214,10 +222,10 @@ From the destination VMM process:
    between the source pausing the VMs and the destination executing
    steps 4-7.
 
-5. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (tsc_dest) and
+6. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (tsc_dest) and
    kvmclock nanoseconds (guest_dest).
 
-6. Adjust the guest TSC offsets for every vCPU to account for (1) time
+7. Adjust the guest TSC offsets for every vCPU to account for (1) time
    elapsed since recording state and (2) difference in TSCs between the
    source and destination machine:
 
@@ -229,5 +237,5 @@ From the destination VMM process:
    a time of 0 in kvmclock.  The above formula ensures that it is the
    same on the destination as it was on the source).
 
-7. Write the KVM_VCPU_TSC_OFFSET attribute for every vCPU with the
+8. Write the KVM_VCPU_TSC_OFFSET attribute for every vCPU with the
    respective value derived in the previous step.
-- 
2.35.1.723.g4982287a31-goog


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-16  4:53 [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm Oliver Upton
@ 2022-03-16  7:47 ` Paolo Bonzini
  2022-03-18 18:39   ` Oliver Upton
  0 siblings, 1 reply; 33+ messages in thread
From: Paolo Bonzini @ 2022-03-16  7:47 UTC (permalink / raw)
  To: Oliver Upton, kvm
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Woodhouse, David Woodhouse

On 3/16/22 05:53, Oliver Upton wrote:
> The VMM has control of both the guest's TSC scale and offset. Extend the
> described migration algorithm in the KVM_VCPU_TSC_OFFSET documentation
> to cover TSC scaling.
> 
> Reported-by: David Woodhouse<dwmw@amazon.co.uk>
> Signed-off-by: Oliver Upton<oupton@google.com>
> ---
> 
> Applies to kvm/queue (references KVM_{GET,SET}_TSC_KHZ on a VM fd).

A few more things that have to be changed:

> 1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (tsc_src),
>    kvmclock nanoseconds (guest_src), and host CLOCK_REALTIME nanoseconds
>    (host_src).
> 

One of two changes:

a) Add "Multiply tsc_src by guest_freq / src_freq to obtain 
scaled_tsc_src", add a new device attribute for the host TSC frequency.

b) Add "Multiply tsc_src by src_ratio to obtain scaled_tsc_src", add a 
new device attribute for the guest_frequency/host_frequency ratio.

A third would be scaling the host TSC frequency in KVM_GETCLOCK, but 
that's confusing IMO.

> 3. Invoke the KVM_GET_TSC_KHZ ioctl to record the frequency of the
>    guest's TSC (freq).

Replace freq with guest_freq.

> 6. Adjust the guest TSC offsets for every vCPU to account for (1) time
>    elapsed since recording state and (2) difference in TSCs between the
>    source and destination machine:
> 
>    ofs_dst[i] = ofs_src[i] -
>      (guest_src - guest_dest) * freq +
>      (tsc_src - tsc_dest)
> 

Replace freq with guest_freq.

Replace tsc_src with scaled_tsc_src; replace tsc_dest with tsc_dest * 
guest_freq / dest_freq.

>    ("ofs[i] + tsc - guest * freq" is the guest TSC value corresponding to

Replace with "ofs[i] + tsc * guest_freq / host_freq - guest * 
guest_freq", or "ofs[i] + tsc * ratio - guest * guest_freq".

Paolo


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-16  7:47 ` Paolo Bonzini
@ 2022-03-18 18:39   ` Oliver Upton
  2022-03-19  7:52     ` Paolo Bonzini
  0 siblings, 1 reply; 33+ messages in thread
From: Oliver Upton @ 2022-03-18 18:39 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, David Woodhouse, David Woodhouse

Hi Paolo,

On Wed, Mar 16, 2022 at 08:47:59AM +0100, Paolo Bonzini wrote:
> On 3/16/22 05:53, Oliver Upton wrote:
> > The VMM has control of both the guest's TSC scale and offset. Extend the
> > described migration algorithm in the KVM_VCPU_TSC_OFFSET documentation
> > to cover TSC scaling.
> > 
> > Reported-by: David Woodhouse<dwmw@amazon.co.uk>
> > Signed-off-by: Oliver Upton<oupton@google.com>
> > ---
> > 
> > Applies to kvm/queue (references KVM_{GET,SET}_TSC_KHZ on a VM fd).
> 
> A few more things that have to be changed:
> 
> > 1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (tsc_src),
> >    kvmclock nanoseconds (guest_src), and host CLOCK_REALTIME nanoseconds
> >    (host_src).
> > 
> 
> One of two changes:
> 
> a) Add "Multiply tsc_src by guest_freq / src_freq to obtain scaled_tsc_src",
> add a new device attribute for the host TSC frequency.
> 
> b) Add "Multiply tsc_src by src_ratio to obtain scaled_tsc_src", add a new
> device attribute for the guest_frequency/host_frequency ratio.
> 
> A third would be scaling the host TSC frequency in KVM_GETCLOCK, but that's
> confusing IMO.

Agreed -- I think kvmclock should remain as is.

A fourth would be to expose the host's TSC frequency outside of KVM
since we're really in the business of guest features, not host ones :-)
We already have a patch that does this internally, and its visible in
some of our open source userspace libraries [1].

That said, I do not have any immediate reservations about adding an
attribute as the easy way out.

Ack on the rest of the feedback, I'll touch up the documentation further
once we figure out TSC frequency exposure.

[1]: https://github.com/abseil/abseil-cpp/blob/c33f21f86a14216336b87d48e9b552a13985b2bc/absl/base/internal/sysinfo.cc#L311

--
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-18 18:39   ` Oliver Upton
@ 2022-03-19  7:52     ` Paolo Bonzini
  2022-03-19  7:59       ` Oliver Upton
  0 siblings, 1 reply; 33+ messages in thread
From: Paolo Bonzini @ 2022-03-19  7:52 UTC (permalink / raw)
  To: Oliver Upton
  Cc: kvm, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, David Woodhouse, David Woodhouse

On 3/18/22 19:39, Oliver Upton wrote:
> Hi Paolo,
> 
> On Wed, Mar 16, 2022 at 08:47:59AM +0100, Paolo Bonzini wrote:
>> On 3/16/22 05:53, Oliver Upton wrote:
>>> The VMM has control of both the guest's TSC scale and offset. Extend the
>>> described migration algorithm in the KVM_VCPU_TSC_OFFSET documentation
>>> to cover TSC scaling.
>>>
>>> Reported-by: David Woodhouse<dwmw@amazon.co.uk>
>>> Signed-off-by: Oliver Upton<oupton@google.com>
>>> ---
>>>
>>> Applies to kvm/queue (references KVM_{GET,SET}_TSC_KHZ on a VM fd).
>>
>> A few more things that have to be changed:
>>
>>> 1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (tsc_src),
>>>     kvmclock nanoseconds (guest_src), and host CLOCK_REALTIME nanoseconds
>>>     (host_src).
>>>
>>
>> One of two changes:
>>
>> a) Add "Multiply tsc_src by guest_freq / src_freq to obtain scaled_tsc_src",
>> add a new device attribute for the host TSC frequency.
>>
>> b) Add "Multiply tsc_src by src_ratio to obtain scaled_tsc_src", add a new
>> device attribute for the guest_frequency/host_frequency ratio.
>>
>> A third would be scaling the host TSC frequency in KVM_GETCLOCK, but that's
>> confusing IMO.
> 
> Agreed -- I think kvmclock should remain as is.
> 
> A fourth would be to expose the host's TSC frequency outside of KVM
> since we're really in the business of guest features, not host ones :-)
> We already have a patch that does this internally, and its visible in
> some of our open source userspace libraries [1].

Yeah, it was a bit of a cop out on my part but adding it to sysfs would 
be nice.  Please tell me if any of you going to send a patch, or even 
just share it so that I can handle the upstream submission.

Paolo

> 
> That said, I do not have any immediate reservations about adding an
> attribute as the easy way out.
> 
> Ack on the rest of the feedback, I'll touch up the documentation further
> once we figure out TSC frequency exposure.
> 
> [1]: https://github.com/abseil/abseil-cpp/blob/c33f21f86a14216336b87d48e9b552a13985b2bc/absl/base/internal/sysinfo.cc#L311
> 
> --
> Thanks,
> Oliver
> 


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-19  7:52     ` Paolo Bonzini
@ 2022-03-19  7:59       ` Oliver Upton
  2022-03-19  8:08         ` David Woodhouse
  0 siblings, 1 reply; 33+ messages in thread
From: Oliver Upton @ 2022-03-19  7:59 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, David Woodhouse, David Woodhouse

On Sat, Mar 19, 2022 at 08:52:56AM +0100, Paolo Bonzini wrote:
> On 3/18/22 19:39, Oliver Upton wrote:
> > Hi Paolo,
> > 
> > On Wed, Mar 16, 2022 at 08:47:59AM +0100, Paolo Bonzini wrote:
> > > On 3/16/22 05:53, Oliver Upton wrote:
> > > > The VMM has control of both the guest's TSC scale and offset. Extend the
> > > > described migration algorithm in the KVM_VCPU_TSC_OFFSET documentation
> > > > to cover TSC scaling.
> > > > 
> > > > Reported-by: David Woodhouse<dwmw@amazon.co.uk>
> > > > Signed-off-by: Oliver Upton<oupton@google.com>
> > > > ---
> > > > 
> > > > Applies to kvm/queue (references KVM_{GET,SET}_TSC_KHZ on a VM fd).
> > > 
> > > A few more things that have to be changed:
> > > 
> > > > 1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (tsc_src),
> > > >     kvmclock nanoseconds (guest_src), and host CLOCK_REALTIME nanoseconds
> > > >     (host_src).
> > > > 
> > > 
> > > One of two changes:
> > > 
> > > a) Add "Multiply tsc_src by guest_freq / src_freq to obtain scaled_tsc_src",
> > > add a new device attribute for the host TSC frequency.
> > > 
> > > b) Add "Multiply tsc_src by src_ratio to obtain scaled_tsc_src", add a new
> > > device attribute for the guest_frequency/host_frequency ratio.
> > > 
> > > A third would be scaling the host TSC frequency in KVM_GETCLOCK, but that's
> > > confusing IMO.
> > 
> > Agreed -- I think kvmclock should remain as is.
> > 
> > A fourth would be to expose the host's TSC frequency outside of KVM
> > since we're really in the business of guest features, not host ones :-)
> > We already have a patch that does this internally, and its visible in
> > some of our open source userspace libraries [1].
> 
> Yeah, it was a bit of a cop out on my part but adding it to sysfs would be
> nice.  Please tell me if any of you going to send a patch, or even just
> share it so that I can handle the upstream submission.

I'll have some time to look at it next week and send upstream. I was
somewhat panning if there were any other ideas here, but sysfs does seem
sensible :)

--
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-19  7:59       ` Oliver Upton
@ 2022-03-19  8:08         ` David Woodhouse
  2022-03-19 11:54           ` Paolo Bonzini
  0 siblings, 1 reply; 33+ messages in thread
From: David Woodhouse @ 2022-03-19  8:08 UTC (permalink / raw)
  To: Oliver Upton, Paolo Bonzini
  Cc: kvm, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel

[-- Attachment #1: Type: text/plain, Size: 945 bytes --]

On Sat, 2022-03-19 at 07:59 +0000, Oliver Upton wrote:
> I'll have some time to look at it next week and send upstream. I was
> somewhat panning if there were any other ideas here, but sysfs does seem
> sensible :)

I can't say this fills me with joy. If a basic API requires this much
documentation, my instinct is to *fix* it with fire first, then
document what's left.

A userspace-friendly API for migration would be more like KVM on the
source host giving me { TIME_OF_DAY, TSC } and then all I have to do on
the destination host (after providing the TSC frequency) is give it
precisely the same data.

We depend on source and destination clocks being kept in sync with NTP
so that the TIME_OF_DAY is meaningful, but that's fairly fundamental;
that was *always* going to be our reference¹, right?

For migration between hosts, the offset-based API doesn't seem to be
much of an improvement over what we had before.

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-19  8:08         ` David Woodhouse
@ 2022-03-19 11:54           ` Paolo Bonzini
  2022-03-19 13:00             ` Paolo Bonzini
  0 siblings, 1 reply; 33+ messages in thread
From: Paolo Bonzini @ 2022-03-19 11:54 UTC (permalink / raw)
  To: David Woodhouse, Oliver Upton
  Cc: kvm, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel

On 3/19/22 09:08, David Woodhouse wrote:
> If a basic API requires this much documentation, my instinct is to
> *fix* it with fire first, then document what's left.
I agree, but you're missing all the improvements that went in together 
with the offset API in order to enable the ugly algorithm.

> A userspace-friendly API for migration would be more like KVM on the
> source host giving me { TIME_OF_DAY, TSC } and then all I have to do on
> the destination host (after providing the TSC frequency) is give it
> precisely the same data.

Again I agree but it would have to be {hostTimeOfDay, hostTSC, 
hostTSCFrequency, guestTimeOfDay}.

> For migration between hosts, the offset-based API doesn't seem to be
> much of an improvement over what we had before.

The improvement is that KVM now provides a consistent {hostTimeOfDay, 
hostTSC, guestTimeOfDay} triple, so the "get" part is already okay.  I 
agree that the "set" leaves a substantial amount of work to the guest, 
because *using* that information is left to userspace rather than being 
done in KVM.  But at least it is correct.

So we only lack a way to retrieve hostTSCFrequency to have an ugly but 
fully correct source-side API.  Then we can add a new destination-side 
ioctl (possibly an extension of KVM_SET_CLOCK?) that takes a 
{sourceTimeOfDay, sourceTSC, sourceTSCFrequency, guestTimeOfDay} tuple 
and relieves userspace from having to do the math.

Paolo

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-19 11:54           ` Paolo Bonzini
@ 2022-03-19 13:00             ` Paolo Bonzini
  2022-03-19 13:13               ` David Woodhouse
  0 siblings, 1 reply; 33+ messages in thread
From: Paolo Bonzini @ 2022-03-19 13:00 UTC (permalink / raw)
  To: David Woodhouse, Oliver Upton
  Cc: kvm, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel

On 3/19/22 12:54, Paolo Bonzini wrote:
> On 3/19/22 09:08, David Woodhouse wrote:
>> If a basic API requires this much documentation, my instinct is to
>> *fix* it with fire first, then document what's left.
> I agree, but you're missing all the improvements that went in together 
> with the offset API in order to enable the ugly algorithm.
> 
>> A userspace-friendly API for migration would be more like KVM on the
>> source host giving me { TIME_OF_DAY, TSC } and then all I have to do on
>> the destination host (after providing the TSC frequency) is give it
>> precisely the same data.
> 
> Again I agree but it would have to be {hostTimeOfDay, hostTSC, 
> hostTSCFrequency, guestTimeOfDay}.

Ah, I guess you meant {hostTimeOfDay, hostTSC} _plus_ the constant 
{guestTSCScale, guestTSCOffset, guestTimeOfDayOffset}.  That would work, 
and in that case it wouldn't even be KVM returning that host information.

In fact {hostTimeOfDay, hostTSC, hostTSCFrequency, guestTimeOfDay} is 
not enough, you also need the guestTSCFrequency and guestTSC (or 
equivalently the scale/offset pair).

Paolo


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-19 13:00             ` Paolo Bonzini
@ 2022-03-19 13:13               ` David Woodhouse
  2022-03-20  8:10                 ` Paolo Bonzini
  0 siblings, 1 reply; 33+ messages in thread
From: David Woodhouse @ 2022-03-19 13:13 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: David Woodhouse, Oliver Upton, kvm, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel



> On 3/19/22 12:54, Paolo Bonzini wrote:
>> On 3/19/22 09:08, David Woodhouse wrote:
>>> If a basic API requires this much documentation, my instinct is to
>>> *fix* it with fire first, then document what's left.
>> I agree, but you're missing all the improvements that went in together
>> with the offset API in order to enable the ugly algorithm.
>>
>>> A userspace-friendly API for migration would be more like KVM on the
>>> source host giving me { TIME_OF_DAY, TSC } and then all I have to do on
>>> the destination host (after providing the TSC frequency) is give it
>>> precisely the same data.
>>
>> Again I agree but it would have to be {hostTimeOfDay, hostTSC,
>> hostTSCFrequency, guestTimeOfDay}.
>
> Ah, I guess you meant {hostTimeOfDay, hostTSC} _plus_ the constant
> {guestTSCScale, guestTSCOffset, guestTimeOfDayOffset}.  That would work,
> and in that case it wouldn't even be KVM returning that host information.
>
> In fact {hostTimeOfDay, hostTSC, hostTSCFrequency, guestTimeOfDay} is
> not enough, you also need the guestTSCFrequency and guestTSC (or
> equivalently the scale/offset pair).

I would have said nobody cares about the host TSC value and frequency.
That is for KVM to know and deal with internally.

For guest migration it should be as simple as "guest TSC frequency is <F>,
and the TSC value was <X> at (wallclock time <T>|KVM_CLOCK time <T>).

Not sure I have an opinion on whether the objective time reference is the
timeofday clock or the KVM clock.


-- 
dwmw2


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-19 13:13               ` David Woodhouse
@ 2022-03-20  8:10                 ` Paolo Bonzini
  2022-03-20  8:52                   ` Oliver Upton
  0 siblings, 1 reply; 33+ messages in thread
From: Paolo Bonzini @ 2022-03-20  8:10 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Oliver Upton, kvm, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel

On 3/19/22 14:13, David Woodhouse wrote:
> 
> 
>> On 3/19/22 12:54, Paolo Bonzini wrote:
>>> On 3/19/22 09:08, David Woodhouse wrote:
>>>> If a basic API requires this much documentation, my instinct is to
>>>> *fix* it with fire first, then document what's left.
>>> I agree, but you're missing all the improvements that went in together
>>> with the offset API in order to enable the ugly algorithm.
>>>
>>>> A userspace-friendly API for migration would be more like KVM on the
>>>> source host giving me { TIME_OF_DAY, TSC } and then all I have to do on
>>>> the destination host (after providing the TSC frequency) is give it
>>>> precisely the same data.
>>
>> I guess you meant {hostTimeOfDay, hostTSC} _plus_ the constant
>> {guestTSCScale, guestTSCOffset, guestTimeOfDayOffset}.  That would work,
>> and in that case it wouldn't even be KVM returning that host information.
> 
> I would have said nobody cares about the host TSC value and frequency.
> That is for KVM to know and deal with internally.

There are two schools as to how to do migration.  The QEMU school is to 
just load back the guest TOD and TSC and let NTP resync.  They had 
better be synced, but a difference of a few microseconds might not matter.

This has the advantage of not showing the guest that there was a pause. 
  QEMU is doing it this way due to not having postcopy live migration 
for a long time; precopy is subject to longer brownout between source 
and destination, which might result in soft lockups.  Apart from this it 
really has only disadvantage.

The Google school has the destination come up with the guest TOD and TSC 
that takes into account the length of the brownout phase.  This is where 
the algorithm in Documentation/ comes into play, and why you need the 
host pair as well.  Actually Google does not use it because they already 
have precise time available to userspace as part of Spanner.  Maybe so 
does Amazon (?), but for the rest of the world the host {TOD, TSC} pair 
is required to compute what the guest TSC "should look like" on the 
destination.

Paolo

> For guest migration it should be as simple as "guest TSC frequency is <F>,
> and the TSC value was <X> at (wallclock time <T>|KVM_CLOCK time <T>).
> 
> Not sure I have an opinion on whether the objective time reference is the
> timeofday clock or the KVM clock.
> 
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-20  8:10                 ` Paolo Bonzini
@ 2022-03-20  8:52                   ` Oliver Upton
  2022-03-20  9:46                     ` David Woodhouse
  2022-03-20 13:39                     ` Paolo Bonzini
  0 siblings, 2 replies; 33+ messages in thread
From: Oliver Upton @ 2022-03-20  8:52 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: David Woodhouse, kvm, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel

On Sun, Mar 20, 2022 at 09:10:15AM +0100, Paolo Bonzini wrote:
> On 3/19/22 14:13, David Woodhouse wrote:
> > 
> > 
> > > On 3/19/22 12:54, Paolo Bonzini wrote:
> > > > On 3/19/22 09:08, David Woodhouse wrote:
> > > > > If a basic API requires this much documentation, my instinct is to
> > > > > *fix* it with fire first, then document what's left.
> > > > I agree, but you're missing all the improvements that went in together
> > > > with the offset API in order to enable the ugly algorithm.
> > > > 
> > > > > A userspace-friendly API for migration would be more like KVM on the
> > > > > source host giving me { TIME_OF_DAY, TSC } and then all I have to do on
> > > > > the destination host (after providing the TSC frequency) is give it
> > > > > precisely the same data.
> > > 
> > > I guess you meant {hostTimeOfDay, hostTSC} _plus_ the constant
> > > {guestTSCScale, guestTSCOffset, guestTimeOfDayOffset}.  That would work,
> > > and in that case it wouldn't even be KVM returning that host information.
> > 
> > I would have said nobody cares about the host TSC value and frequency.
> > That is for KVM to know and deal with internally.
> 
> There are two schools as to how to do migration.  The QEMU school is to just
> load back the guest TOD and TSC and let NTP resync.  They had better be
> synced, but a difference of a few microseconds might not matter.
> 
> This has the advantage of not showing the guest that there was a pause.
> QEMU is doing it this way due to not having postcopy live migration for a
> long time; precopy is subject to longer brownout between source and
> destination, which might result in soft lockups.  Apart from this it really
> has only disadvantage.
> 
> The Google school has the destination come up with the guest TOD and TSC
> that takes into account the length of the brownout phase.  This is where the
> algorithm in Documentation/ comes into play, and why you need the host pair
> as well.  Actually Google does not use it because they already have precise
> time available to userspace as part of Spanner.  Maybe so does Amazon (?),
> but for the rest of the world the host {TOD, TSC} pair is required to
> compute what the guest TSC "should look like" on the destination.

Hey, beat me to the punch :) Paolo is pretty much spot on, but there are
a few additional details here that I believe are relevant.

I really don't think we want to effectively step the guest's monotonic
clock if at all possible. It hurts when you do this for large windows,
and leads to soft lockups as you've noted above. Nonetheless, its a
kludgy way to advance the guest's realtime clock without informing it
that it is about to experience time travel.

Given all of this, there is a limit to how much we advance the TSC in
the Google school. If this limit is exceeded we refuse to step the TSC
further and inform the guest it has experienced time travel [1]. It is
an attempt to bridge the gap and avoid completely laying waste to guest
clocks while hiding the migration if we're confident it was smooth
enough. Beyond that, guest userspace wants to be appraised of time
travel as well (TFD_TIMER_CANCEL_ON_SET). Having the guest clean up a
messy migration ensures that this all 'just works'.

The offset interface completely punts the decision around guest clocks
to userspace. We (KVM) have absolutely no idea what userspace is about
to do with the guest. The guest could be paused for 5 seconds or 5
years. Encouraging host userspace to just read/write a { TOD, TSC } pair
and let KVM do the heavy lifting could completely wreck the guest's
monotonic clock.

Additionally, it is impossible for userspace to enforce policy/limits on
how much to time travel a guest with a value-based interface. Any event
could sneak in between the time userspace checks the value and KVM sets
the L1 offset. Offsets are idempotent and will still uphold userspace's
intentions even if an inordinate amount of time elapses until KVM
processes it.

Apologies for grandstanding, but clocks has been a real source of pain
during migration. I do agree that the documented algorithm is a mess at
the moment, given that there's no good way for userspace to transform
host_tsc -> guest_tsc. Poking the host TSC frequency out in sysfs is
nice to have, but probably not ABI to hang this whole thing off of.

What do you folks think about having a new R/O vCPU attribute that
returns a { TOD, guest_tsc } pair? I believe that would immediately
satisfy the needs of upstream to implement clock-advancing live
migration.

[1]: https://github.com/GoogleCloudPlatform/guest-agent/blob/main/google_guest_agent/clock.go
--
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-20  8:52                   ` Oliver Upton
@ 2022-03-20  9:46                     ` David Woodhouse
  2022-03-21  0:38                       ` Oliver Upton
  2022-03-20 13:39                     ` Paolo Bonzini
  1 sibling, 1 reply; 33+ messages in thread
From: David Woodhouse @ 2022-03-20  9:46 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Paolo Bonzini, David Woodhouse, kvm, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel

> The offset interface completely punts the decision around guest clocks
> to userspace. We (KVM) have absolutely no idea what userspace is about
> to do with the guest. The guest could be paused for 5 seconds or 5
> years. Encouraging host userspace to just read/write a { TOD, TSC } pair
> and let KVM do the heavy lifting could completely wreck the guest's
> monotonic clock.
>
> Additionally, it is impossible for userspace to enforce policy/limits on
> how much to time travel a guest with a value-based interface. Any event
> could sneak in between the time userspace checks the value and KVM sets
> the L1 offset. Offsets are idempotent and will still uphold userspace's
> intentions even if an inordinate amount of time elapses until KVM
> processes it.

Thanks for the detailed explanation. One part which confuses me here...
Why can't userspace impose those same limits using a (TOD, value) tuple?
Userspace can still look at that TOD from before the brownout period
started, and declare that is too far in the past.

If the event happens *after* userspace has decided that the migration was
quick enough, but before the vCPUs are actually running again, even the
offset based interface doesn't protect against that.

> Apologies for grandstanding, but clocks has been a real source of pain
> during migration. I do agree that the documented algorithm is a mess at
> the moment, given that there's no good way for userspace to transform
> host_tsc -> guest_tsc. Poking the host TSC frequency out in sysfs is
> nice to have, but probably not ABI to hang this whole thing off of.
>
> What do you folks think about having a new R/O vCPU attribute that
> returns a { TOD, guest_tsc } pair? I believe that would immediately
> satisfy the needs of upstream to implement clock-advancing live
> migration.

Hm, I need to do some more thinking here. I poked at this because for TSC
scaling even before we think about clock jumps it was just utterly hosed —
userspace naively just creates a bunch of vCPUs and sets their TSC
frequency + value, and they all end up with unsynced TSC values.

But coincidentally since then I have started having conversations with
people who really want the guest to have an immediate knowledge of the
adjtimex maxerror etc. on the new host immediately after the migration.
Maybe the "if the migration isn't fast enough then let the guest know it's
now unsynced" is OK, but I'll need to work out what "immediately" means
when we have a guest userspace component involved in it.

I'll need to do that with a real screen and keyboard though, and fingers
that aren't freezing as I sit by a 9-year-old's hockey training...

-- 
dwmw2

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-20  8:52                   ` Oliver Upton
  2022-03-20  9:46                     ` David Woodhouse
@ 2022-03-20 13:39                     ` Paolo Bonzini
  2022-03-21  0:51                       ` Oliver Upton
  2022-03-21 12:16                       ` David Woodhouse
  1 sibling, 2 replies; 33+ messages in thread
From: Paolo Bonzini @ 2022-03-20 13:39 UTC (permalink / raw)
  To: Oliver Upton
  Cc: David Woodhouse, kvm, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel

On 3/20/22 09:52, Oliver Upton wrote:
> What do you folks think about having a new R/O vCPU attribute that
> returns a { TOD, guest_tsc } pair? I believe that would immediately
> satisfy the needs of upstream to implement clock-advancing live
> migration.

I don't think this adds much.  The missing link is on the destination 
side, not the source side.

To recap, the data that needs to be migrated from source to destination 
is the hostTSC+hostTOD pairing (returned by KVM_GET_CLOCK) plus one of 
each of the following:

* either guestTSCOffset or a guestTSC synced with the hostTSC

* either guestTODOffset or a guestTOD synced with the hostTOD.

* either guestTSCScale or hostTSCFreq

Right now we have guestTSCOffset as a vCPU attribute, we have guestTOD 
returned by KVM_GET_CLOCK, and we plan to have hostTSCFreq in sysfs. 
It's a bit mix-and-match, but it's already a 5-tuple that the 
destination can use.  What's missing is a ioctl on the destination side 
that relieves userspace from having to do the math.

Adding a nicer API just means choosing a different 5-tuple that the 
source side sends over to the destination, but it isn't particularly 
useful if userspace still has to do the math.

Paolo

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-20  9:46                     ` David Woodhouse
@ 2022-03-21  0:38                       ` Oliver Upton
  2022-03-21 19:43                         ` David Woodhouse
  0 siblings, 1 reply; 33+ messages in thread
From: Oliver Upton @ 2022-03-21  0:38 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Paolo Bonzini, kvm, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel

On Sun, Mar 20, 2022 at 09:46:35AM -0000, David Woodhouse wrote:
> 
> > The offset interface completely punts the decision around guest clocks
> > to userspace. We (KVM) have absolutely no idea what userspace is about
> > to do with the guest. The guest could be paused for 5 seconds or 5
> > years. Encouraging host userspace to just read/write a { TOD, TSC } pair
> > and let KVM do the heavy lifting could completely wreck the guest's
> > monotonic clock.
> >
> > Additionally, it is impossible for userspace to enforce policy/limits on
> > how much to time travel a guest with a value-based interface. Any event
> > could sneak in between the time userspace checks the value and KVM sets
> > the L1 offset. Offsets are idempotent and will still uphold userspace's
> > intentions even if an inordinate amount of time elapses until KVM
> > processes it.
> 
> Thanks for the detailed explanation. One part which confuses me here...
> Why can't userspace impose those same limits using a (TOD, value) tuple?
> Userspace can still look at that TOD from before the brownout period
> started, and declare that is too far in the past.
> 
> If the event happens *after* userspace has decided that the migration was
> quick enough, but before the vCPUs are actually running again, even the
> offset based interface doesn't protect against that.

Right, if nastiness happens after the offset was calculated, the guest
is still going to feel it. The benefit is that we've at least stopped
the bleeding on the monotonic clock. Guests of KVM are probably happier
having a messed up realtime clock instead of a panicked kernel that
threw a fit over monotonic stepping too far.

> > Apologies for grandstanding, but clocks has been a real source of pain
> > during migration. I do agree that the documented algorithm is a mess at
> > the moment, given that there's no good way for userspace to transform
> > host_tsc -> guest_tsc. Poking the host TSC frequency out in sysfs is
> > nice to have, but probably not ABI to hang this whole thing off of.
> >
> > What do you folks think about having a new R/O vCPU attribute that
> > returns a { TOD, guest_tsc } pair? I believe that would immediately
> > satisfy the needs of upstream to implement clock-advancing live
> > migration.
> 
> Hm, I need to do some more thinking here. I poked at this because for TSC
> scaling even before we think about clock jumps it was just utterly hosed —
> userspace naively just creates a bunch of vCPUs and sets their TSC
> frequency + value, and they all end up with unsynced TSC values.

And thank you for addressing that. I think there is a broader
generalization that can be made about guest timekeeping that you've
started poking at with that patch set, which is that we should only
really care about these at a VM scope. There's nothing we can do
to cover up broken hardware, so if the host's TSCs are out of whack then
oh well.

To that end, we could have a single host<->guest offset that is used
across all vCPUs. Guest fiddling with TSC_ADJUST or even worse direct
writes to the TSC can then be treated as solely a part of guest state,
and get added on top of the host<->guest offset.

> But coincidentally since then I have started having conversations with
> people who really want the guest to have an immediate knowledge of the
> adjtimex maxerror etc. on the new host immediately after the migration.
> Maybe the "if the migration isn't fast enough then let the guest know it's
> now unsynced" is OK, but I'll need to work out what "immediately" means
> when we have a guest userspace component involved in it.

This has also been an area of interest to me. I think we've all seen the
many ways in which doing migrations behind the guest's can put software
in an extremely undesirable state on the other end. If those
conversations are taking place on the mailing lists, could you please CC
me?

Our (Google) TSC adjustment clamping and userspace notification mechanism
was a halfway kludge to keep things happy on the other end. And it
generally has worked well, but misses a fundamental point.

The hypervisor should tell the guest kernel about time travel and let it
cascade that information throughout the guest system. Regardless of what
we do to the TSC, we invariably destroy one of the two guest clocks along
the way. If we told the guest "you time traveled X seconds", it could
fold that into its own idea of real time. Guest kernel can then fire off
events to inform software that wants to keep up with clock changes, and
even a new event to let NTP know its probably running on different
hardware.

Time sucks :-)

--
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-20 13:39                     ` Paolo Bonzini
@ 2022-03-21  0:51                       ` Oliver Upton
  2022-03-21 12:36                         ` Paolo Bonzini
  2022-03-21 12:16                       ` David Woodhouse
  1 sibling, 1 reply; 33+ messages in thread
From: Oliver Upton @ 2022-03-21  0:51 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: David Woodhouse, kvm, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel

On Sun, Mar 20, 2022 at 02:39:34PM +0100, Paolo Bonzini wrote:
> On 3/20/22 09:52, Oliver Upton wrote:
> > What do you folks think about having a new R/O vCPU attribute that
> > returns a { TOD, guest_tsc } pair? I believe that would immediately
> > satisfy the needs of upstream to implement clock-advancing live
> > migration.
> 
> I don't think this adds much.  The missing link is on the destination side,
> not the source side.

I think it'll work:

 Source:
  - Pick a vCPU and save its { TOD, guest_TSC } pair
  - Save the tsc offset of every vCPU
  - Using all of the offsets, calculate the drift of all the follower
    vCPUs from the leader vCPU (from step 1)
  - Save the TSC frequency

 Destination:
  - Restore the TSC frequency
  - Read the { TOD, guest_TSC } pair for the first vCPU
  - Compare with the source value to work out delta_guest_TSC and
    delta_TOD
  - Apply delta_guest_TSC to all vCPUs in a VM
  - If you want to account for realtime, apply guest_tsc_freq *
    delta_TOD to every vCPU in the VM
  - Account for drift between leader/follower vCPUs

Userspace has some math to do, but IMO it needs to until we have a
better mechanism for helping the guest clean up a slow migration.
It does eliminate the need for doing TSC scaling in userspace, which
I think is the trickiest piece of it all.

Alternative could be to say that the VM has a single, authoritative {
TOD, guest_TSC } clock that can be read or written. Any vCPU offsets
then account for guest-induced drift in TSCs.

> To recap, the data that needs to be migrated from source to destination is
> the hostTSC+hostTOD pairing (returned by KVM_GET_CLOCK) plus one of each of
> the following:
> 
> * either guestTSCOffset or a guestTSC synced with the hostTSC
> 
> * either guestTODOffset or a guestTOD synced with the hostTOD.
> 
> * either guestTSCScale or hostTSCFreq
> 
> Right now we have guestTSCOffset as a vCPU attribute, we have guestTOD
> returned by KVM_GET_CLOCK, and we plan to have hostTSCFreq in sysfs. It's a
> bit mix-and-match, but it's already a 5-tuple that the destination can use.
> What's missing is a ioctl on the destination side that relieves userspace
> from having to do the math.

That ioctl will work fine, but userspace needs to accept all the
nastiness that ensues. If it yanks the guest too hard into the future
it'll need to pick up the pieces when the guest kernel panics.

--
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-20 13:39                     ` Paolo Bonzini
  2022-03-21  0:51                       ` Oliver Upton
@ 2022-03-21 12:16                       ` David Woodhouse
  2022-03-21 13:10                         ` Paolo Bonzini
  1 sibling, 1 reply; 33+ messages in thread
From: David Woodhouse @ 2022-03-21 12:16 UTC (permalink / raw)
  To: Paolo Bonzini, Oliver Upton
  Cc: kvm, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel

[-- Attachment #1: Type: text/plain, Size: 2402 bytes --]

On Sun, 2022-03-20 at 14:39 +0100, Paolo Bonzini wrote:
> On 3/20/22 09:52, Oliver Upton wrote:
> > What do you folks think about having a new R/O vCPU attribute that
> > returns a { TOD, guest_tsc } pair? I believe that would immediately
> > satisfy the needs of upstream to implement clock-advancing live
> > migration.
> 
> I don't think this adds much.  The missing link is on the destination 
> side, not the source side.
> 
> To recap, the data that needs to be migrated from source to destination 
> is the hostTSC+hostTOD pairing (returned by KVM_GET_CLOCK) plus one of 
> each of the following:
> 
> * either guestTSCOffset or a guestTSC synced with the hostTSC
> 
> * either guestTODOffset or a guestTOD synced with the hostTOD.
> 
> * either guestTSCScale or hostTSCFreq
> 
> Right now we have guestTSCOffset as a vCPU attribute, we have guestTOD 
> returned by KVM_GET_CLOCK, and we plan to have hostTSCFreq in sysfs. 
> It's a bit mix-and-match, but it's already a 5-tuple that the 
> destination can use.  What's missing is a ioctl on the destination side 
> that relieves userspace from having to do the math.
> 
> Adding a nicer API just means choosing a different 5-tuple that the 
> source side sends over to the destination, but it isn't particularly 
> useful if userspace still has to do the math.

Hm, I still don't really understand why it's a 5-tuple and why we care
about the *host* TSC at all.

Fundamentally, guest state is *guest* state. There is a 3-tuple of
 { NTP-synchronized time of day, Guest KVM clock, Guest TSC value } 

(plus a TSC offset from vCPU0, for each other vCPU perhaps.)

All else is redundant, and including host values is just wrong.

We can already do clock-advancing live migration by adding the
appropriate number of nanoseconds, and the appropriate number of TSC
ticks, to the values being restored on the destination host based on
the time at which the migrated instance is being restored there, as
compared with the originally stored time-of-day.

I understand why KVM would want to use the *same* host TSC value for
calculating the time-of-day as well as the guest KVM clock and the
guest TSC value — so that all three elements of that 3-tuple reflect
precisely the same moment in time.

I don't understand why the actual *value* of the host TSC is something
that userspace needs to see.

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-21  0:51                       ` Oliver Upton
@ 2022-03-21 12:36                         ` Paolo Bonzini
  2022-03-21 12:56                           ` David Woodhouse
  0 siblings, 1 reply; 33+ messages in thread
From: Paolo Bonzini @ 2022-03-21 12:36 UTC (permalink / raw)
  To: Oliver Upton
  Cc: David Woodhouse, kvm, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel

On 3/21/22 01:51, Oliver Upton wrote:
>>
>> Right now we have guestTSCOffset as a vCPU attribute, we have guestTOD
>> returned by KVM_GET_CLOCK, and we plan to have hostTSCFreq in sysfs. It's a
>> bit mix-and-match, but it's already a 5-tuple that the destination can use.
>> What's missing is a ioctl on the destination side that relieves userspace
>> from having to do the math.
> That ioctl will work fine, but userspace needs to accept all the
> nastiness that ensues. If it yanks the guest too hard into the future
> it'll need to pick up the pieces when the guest kernel panics.

It can do that before issuing the ioctl, by comparing the source and 
destination TOD, can't it?

Paolo


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-21 12:36                         ` Paolo Bonzini
@ 2022-03-21 12:56                           ` David Woodhouse
  0 siblings, 0 replies; 33+ messages in thread
From: David Woodhouse @ 2022-03-21 12:56 UTC (permalink / raw)
  To: Paolo Bonzini, Oliver Upton, Franke, Daniel
  Cc: kvm, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel

[-- Attachment #1: Type: text/plain, Size: 1586 bytes --]

On Mon, 2022-03-21 at 13:36 +0100, Paolo Bonzini wrote:
> On 3/21/22 01:51, Oliver Upton wrote:
> > > 
> > > Right now we have guestTSCOffset as a vCPU attribute, we have guestTOD
> > > returned by KVM_GET_CLOCK, and we plan to have hostTSCFreq in sysfs. It's a
> > > bit mix-and-match, but it's already a 5-tuple that the destination can use.
> > > What's missing is a ioctl on the destination side that relieves userspace
> > > from having to do the math.
> > That ioctl will work fine, but userspace needs to accept all the
> > nastiness that ensues. If it yanks the guest too hard into the future
> > it'll need to pick up the pieces when the guest kernel panics.
> 
> It can do that before issuing the ioctl, by comparing the source and 
> destination TOD, can't it?

That's how we do it. 

We currently take the *really* naïve approach of just letting each vCPU
thread serialize its own vCPU's TSC with the other MSRs, which means
they're slightly out of sync.

Then we calculate the TOD delta when resuming on the destination, and
add the corresponding number of nanoseconds to the KVM clock, and ticks
to each vCPU's MSR.

When each vCPU thread then *restores* its vCPU's MSRs, the 'within a
second of the last sync" bit kicks in and they are all synchronized
properly.

There's definitely a bit of slop there; I don't think we even use the
KVM_CLOCK_REALTIME flag when setting the guest's KVM clock. But those
are fixable by having a single ioctl which handles the 3-tuple of
{ TOD, KVM clock, TSC } at a single instance in time, I think?

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-21 12:16                       ` David Woodhouse
@ 2022-03-21 13:10                         ` Paolo Bonzini
  2022-03-21 14:59                           ` David Woodhouse
  0 siblings, 1 reply; 33+ messages in thread
From: Paolo Bonzini @ 2022-03-21 13:10 UTC (permalink / raw)
  To: David Woodhouse, Oliver Upton
  Cc: kvm, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel

On 3/21/22 13:16, David Woodhouse wrote:
> Hm, I still don't really understand why it's a 5-tuple and why we care
> about the *host* TSC at all.
> 
> Fundamentally, guest state is *guest* state. There is a 3-tuple of
>   { NTP-synchronized time of day, Guest KVM clock, Guest TSC value }
> 
> (plus a TSC offset from vCPU0, for each other vCPU perhaps.)
> 
> All else is redundant, and including host values is just wrong. [...]
> I don't understand why the actual *value* of the host TSC is something
> that userspace needs to see.

Ok, I understand now.

You're right, you don't need the value of the host TSC; you only need a 
{hostTOD, guestNS, guestTSC} tuple for every vCPU recorded on the source.

If all the frequencies are the same, that can be "packed" as {hostTOD, 
guestNS, anyTSC} plus N offsets.  The N offsets in turn can be 
KVM_VCPU_TSC_OFFSET if you use the hostTSC, or the offsets between vCPU0 
and the others if you use the vCPU0 guestTSC.

I think reasoning in terms of the host TSC is nicer in general, because 
it doesn't make vCPU0 special.  But apart from the aesthetics of having 
a "special" vCPU, making vCPU0 special is actually harder, because the 
TSC frequencies need not be the same for all vCPUs.  I think that is a 
mistake in the KVM API, but it was done long before I was involved (and 
long before I actually understood this stuff).

Paolo

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-21 13:10                         ` Paolo Bonzini
@ 2022-03-21 14:59                           ` David Woodhouse
  0 siblings, 0 replies; 33+ messages in thread
From: David Woodhouse @ 2022-03-21 14:59 UTC (permalink / raw)
  To: Paolo Bonzini, Oliver Upton
  Cc: kvm, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel

[-- Attachment #1: Type: text/plain, Size: 2687 bytes --]

On Mon, 2022-03-21 at 14:10 +0100, Paolo Bonzini wrote:
> On 3/21/22 13:16, David Woodhouse wrote:
> > Hm, I still don't really understand why it's a 5-tuple and why we care
> > about the *host* TSC at all.
> > 
> > Fundamentally, guest state is *guest* state. There is a 3-tuple of
> >   { NTP-synchronized time of day, Guest KVM clock, Guest TSC value }
> > 
> > (plus a TSC offset from vCPU0, for each other vCPU perhaps.)
> > 
> > All else is redundant, and including host values is just wrong. [...]
> > I don't understand why the actual *value* of the host TSC is something
> > that userspace needs to see.
> 
> Ok, I understand now.
> 
> You're right, you don't need the value of the host TSC; you only need a 
> {hostTOD, guestNS, guestTSC} tuple for every vCPU recorded on the source.

Actually, isn't that still redundant? All we really need is a *single*
{ hostTOD, guestNS } for the KVM clock, and then each vCPU has its own
{ guestNS, guestTSC } tuple.

> If all the frequencies are the same, that can be "packed" as {hostTOD, 
> guestNS, anyTSC} plus N offsets.  The N offsets in turn can be 
> KVM_VCPU_TSC_OFFSET if you use the hostTSC, or the offsets between vCPU0 
> and the others if you use the vCPU0 guestTSC.
> 
> I think reasoning in terms of the host TSC is nicer in general, because 
> it doesn't make vCPU0 special.  But apart from the aesthetics of having 
> a "special" vCPU, making vCPU0 special is actually harder, because the 
> TSC frequencies need not be the same for all vCPUs.  I think that is a 
> mistake in the KVM API, but it was done long before I was involved (and 
> long before I actually understood this stuff).

If each vCPU has its own { guestNS, guestTSC } tuple that actually
works out just fine even if they have different frequencies. The
*common* case would be that they are all at the same frequency and have
the same value at the same time. But other cases can be accommodated.

I'm not averse to *reasoning* in terms of the host TSC; I just don't
like exposing the actual numbers to userspace and forcing userspace to
access it through some new ABI just in order to translate some other
fundamental property (either the time of day, or the guest data) into
that domain. And I especially don't like considering it part of 'guest
state'.

Right now when I'm not frowning at TSC synchronisation issues, I'm
working on guest transparent live migration from actual Xen, to
KVM-pretending-to-be-Xen. That kind of insanity is only really possible
with a strict adherence to the design principle that "guest state is
guest state", without conflating it with host/implementation details :)

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-21  0:38                       ` Oliver Upton
@ 2022-03-21 19:43                         ` David Woodhouse
  2022-03-21 21:23                           ` Oliver Upton
  0 siblings, 1 reply; 33+ messages in thread
From: David Woodhouse @ 2022-03-21 19:43 UTC (permalink / raw)
  To: Oliver Upton, Franke, Daniel
  Cc: Paolo Bonzini, kvm, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel

[-- Attachment #1: Type: text/plain, Size: 1909 bytes --]

On Mon, 2022-03-21 at 00:38 +0000, Oliver Upton wrote:
> On Sun, Mar 20, 2022 at 09:46:35AM -0000, David Woodhouse wrote:
> > But coincidentally since then I have started having conversations with
> > people who really want the guest to have an immediate knowledge of the
> > adjtimex maxerror etc. on the new host immediately after the migration.
> > Maybe the "if the migration isn't fast enough then let the guest know it's
> > now unsynced" is OK, but I'll need to work out what "immediately" means
> > when we have a guest userspace component involved in it.
> 
> This has also been an area of interest to me. I think we've all seen the
> many ways in which doing migrations behind the guest's can put software
> in an extremely undesirable state on the other end. If those
> conversations are taking place on the mailing lists, could you please CC
> me?
> 
> Our (Google) TSC adjustment clamping and userspace notification mechanism
> was a halfway kludge to keep things happy on the other end. And it
> generally has worked well, but misses a fundamental point.
> 
> The hypervisor should tell the guest kernel about time travel and let it
> cascade that information throughout the guest system. Regardless of what
> we do to the TSC, we invariably destroy one of the two guest clocks along
> the way. If we told the guest "you time traveled X seconds", it could
> fold that into its own idea of real time. Guest kernel can then fire off
> events to inform software that wants to keep up with clock changes, and
> even a new event to let NTP know its probably running on different
> hardware.
> 
> Time sucks :-)

So, we already have PVCLOCK_GUEST_STOPPED which tells the guest that
its clock may have experienced a jump. Linux guests will use this to
kick various watchdogs to prevent them whining. Shouldn't we *also* be
driving the NTP reset from that same signal?

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-21 19:43                         ` David Woodhouse
@ 2022-03-21 21:23                           ` Oliver Upton
  0 siblings, 0 replies; 33+ messages in thread
From: Oliver Upton @ 2022-03-21 21:23 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Franke, Daniel, Paolo Bonzini, kvm, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel

On Mon, Mar 21, 2022 at 07:43:21PM +0000, David Woodhouse wrote:
> On Mon, 2022-03-21 at 00:38 +0000, Oliver Upton wrote:
> > On Sun, Mar 20, 2022 at 09:46:35AM -0000, David Woodhouse wrote:
> > > But coincidentally since then I have started having conversations with
> > > people who really want the guest to have an immediate knowledge of the
> > > adjtimex maxerror etc. on the new host immediately after the migration.
> > > Maybe the "if the migration isn't fast enough then let the guest know it's
> > > now unsynced" is OK, but I'll need to work out what "immediately" means
> > > when we have a guest userspace component involved in it.
> > 
> > This has also been an area of interest to me. I think we've all seen the
> > many ways in which doing migrations behind the guest's can put software
> > in an extremely undesirable state on the other end. If those
> > conversations are taking place on the mailing lists, could you please CC
> > me?
> > 
> > Our (Google) TSC adjustment clamping and userspace notification mechanism
> > was a halfway kludge to keep things happy on the other end. And it
> > generally has worked well, but misses a fundamental point.
> > 
> > The hypervisor should tell the guest kernel about time travel and let it
> > cascade that information throughout the guest system. Regardless of what
> > we do to the TSC, we invariably destroy one of the two guest clocks along
> > the way. If we told the guest "you time traveled X seconds", it could
> > fold that into its own idea of real time. Guest kernel can then fire off
> > events to inform software that wants to keep up with clock changes, and
> > even a new event to let NTP know its probably running on different
> > hardware.
> > 
> > Time sucks :-)
> 
> So, we already have PVCLOCK_GUEST_STOPPED which tells the guest that
> its clock may have experienced a jump. Linux guests will use this to
> kick various watchdogs to prevent them whining. Shouldn't we *also* be
> driving the NTP reset from that same signal?

Right, but I'd argue that interface has some problems too. It
depends on the guest polling instead of an interrupt from the
hypervisor. It also has no way of informing the kernel exactly how much
time has elapsed.

The whole point of all these hacks that we've done internally is that we,
the hypervisor, know full well how much real time hasv advanced during the
VM blackout. If we can at least let the guest know how much to fudge real
time, it can then poke NTP for better refinement. I worry about using NTP
as the sole source of truth for such a mechanism, since you'll need to go
out to the network and any reads until the response comes back are hosed.

--
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-06-30 11:58       ` David Woodhouse
@ 2022-07-05 14:43         ` David Woodhouse
  0 siblings, 0 replies; 33+ messages in thread
From: David Woodhouse @ 2022-07-05 14:43 UTC (permalink / raw)
  To: Oliver Upton, Thomas Gleixner, Veith, Simon
  Cc: Franke, Daniel, Paolo Bonzini, kvm, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel

[-- Attachment #1: Type: text/plain, Size: 2697 bytes --]

On Thu, 2022-06-30 at 12:58 +0100, David Woodhouse wrote:
>  The TSC of each vCPU#n was <Tₙ> at a given value of kvmclock <Kₙ>

That one isn't even particularly hard to do. At its core, it looks
something like this (thanks to Simon for fixing the arithmetic. My
first proof-of-concept was only a *few* orders of magnitude out...)

Simon then has a follow-on patch to wrap an ioctl around it and
actually allow the kvm_ns to be passed in.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1910e1e78b15..ddb67c9566e4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2629,7 +2629,7 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
 	kvm_track_tsc_matching(vcpu);
 }
 
-static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
+static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data, u64 *kvm_ns)
 {
 	struct kvm *kvm = vcpu->kvm;
 	u64 offset, ns, elapsed;
@@ -2638,8 +2638,23 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
 	bool synchronizing = false;
 
 	raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
-	offset = kvm_compute_l1_tsc_offset(vcpu, data);
 	ns = get_kvmclock_base_ns();
+	if (kvm_ns) {
+ 		/*
+ 		 * We have been provided a KVM clock reference time point at
+ 		 * which this TSC value was correct. Use this time point to
+		 * compensate for any delays that were incurred since that
+		 * TSC value was computed. Note that time is still passing;
+		 * in an ideal world the get_kvmclock_base_ns() above would
+		 * be based on the *same* TSC value that's hidden in the call
+		 * to kvm_compute_l1_tsc_offset() below.
+ 		 */
+ 		u32 tsc_khz = vcpu->arch.virtual_tsc_khz;
+ 		s64 delta_ns = ns + vcpu->kvm->arch.kvmclock_offset - *kvm_ns;
+ 		s64 delta_counts = delta_ns * tsc_khz / NSEC_PER_MSEC;
+ 		data += delta_counts;
+	}
+	offset = kvm_compute_l1_tsc_offset(vcpu, data);
 	elapsed = ns - kvm->arch.last_tsc_nsec;
 
 	if (vcpu->arch.virtual_tsc_khz) {
@@ -3581,7 +3596,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		break;
 	case MSR_IA32_TSC:
 		if (msr_info->host_initiated) {
-			kvm_synchronize_tsc(vcpu, data);
+			kvm_synchronize_tsc(vcpu, data, NULL);
 		} else {
 			u64 adj = kvm_compute_l1_tsc_offset(vcpu, data) - vcpu->arch.l1_tsc_offset;
 			adjust_tsc_offset_guest(vcpu, adj);
@@ -11392,7 +11407,7 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
 	if (mutex_lock_killable(&vcpu->mutex))
 		return;
 	vcpu_load(vcpu);
-	kvm_synchronize_tsc(vcpu, 0);
+	kvm_synchronize_tsc(vcpu, 0, NULL);
 	vcpu_put(vcpu);
 
 	/* poll control enabled by default */


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-29 16:02     ` Oliver Upton
  2022-03-29 19:34       ` Thomas Gleixner
@ 2022-06-30 11:58       ` David Woodhouse
  2022-07-05 14:43         ` David Woodhouse
  1 sibling, 1 reply; 33+ messages in thread
From: David Woodhouse @ 2022-06-30 11:58 UTC (permalink / raw)
  To: Oliver Upton, Thomas Gleixner, sveith
  Cc: Franke, Daniel, Paolo Bonzini, kvm, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel

[-- Attachment #1: Type: text/plain, Size: 3438 bytes --]

On Tue, 2022-03-29 at 09:02 -0700, Oliver Upton wrote:
> 
> There's a need to sound the alarm for NTP regardless of whether
> TOLERABLE_THRESHOLD is exceeded. David pointed out that the host
> advancing the guest clocks (delta injection or TSC advancement) could
> inject some error. Also, hardware has likely changed and the new parts
> will have their own errors as well.

I don't admit to pointing that out in that form because I don't accept
the use of the term "advance".

Clocks advance *themselves*. That's what clocks do.

When we perform a live update or live migration we might *adjust* those
clocks, calibrate, synchronise or restore them. But I'll eat my
keyboard before using the term "advance" for that. Even if our
adjustment is in a forward direction.

Let's consider the case of a live update — where we stop scheduling the
guest for a moment, kexec into the new kernel, then resume scheduling
the guest.

I assert strongly that from the guest point of view this is *no*
different to any other brief period of not being scheduled.

Yes, in practice we have a whole new kernel, a whole new KVM and set of
kvm_vcpus, and we've *restored* the state. And we have restored the
TSCs/clocks in those new kvm objects to precisely match what they were
before. Note: *match* not advance.

Before the kexec, there were a bunch of relationships between clocks,
mostly based on the host TSC Tₕ (assuming the case where that's stable
and reliable):

 • The kernel's idea of wallclock time was based on Tₕ, plus some
   offset and divided by some frequency. NTP tweaks those values over
   time but at any given instant there is a current value for them
   which is used to derive the wallblock time.

 • The kernel's idea of the guest kvmclock epoch (nanoseconds since the
   KVM started) was based on Tₕ and some other offset and hopefully the
   same frequency. 

 • The TSC of each vCPU was based on Tₕ, some offset and a TSC scaling
   factor.

After a live update, the host TSC Tₕ is just the same as it always was.
Not the same *value* of course; that was never the case from one tick
to the next anyway. It's the same, in that it continues to advance
*itself* at a consistent frequency as real time progresses, which is
what clocks do.

In the new kernel we just want all those other derivative clocks to
*also* be the same as before. That is, the offset and multipliers are
the *same* value. We're not "advancing" those clocks. We're
*preserving* them.

For live migration it's slightly harder because we don't have a
consistent host TSC to use as the basis. The best we can do is NTP-
synchronised wallclock time between the two hosts. And thus I think we
want *these* constants to be preserved across the migration:

 The KVM's kvmclock was <K> at a given wallclock time <W>

 The TSC of each vCPU#n was <Tₙ> at a given value of kvmclock <Kₙ>

In *this* case we are running on different hardware and the reliance on
the NTP wallclock time as the basis for preserving the guest clocks may
have introduced an error, as well as the fact that the hardware has
changed. So in this case we should indeed inform the guest that it
should consider itself out of NTP sync and start over, in *addition* to
making a best effort to preserve those clocks.

But there is no scope for the word "advance" to be used anywhere there
either.

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-29 16:02     ` Oliver Upton
@ 2022-03-29 19:34       ` Thomas Gleixner
  2022-06-30 11:58       ` David Woodhouse
  1 sibling, 0 replies; 33+ messages in thread
From: Thomas Gleixner @ 2022-03-29 19:34 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Franke, Daniel, David Woodhouse, Paolo Bonzini, kvm,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel

Oliver,

On Tue, Mar 29 2022 at 09:02, Oliver Upton wrote:
> On Tue, Mar 29, 2022 at 7:19 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> > Doing this the other way around (advance the TSC, tell the guest to fix
>> > MONOTONIC) is fundamentally wrong, as it violates two invariants of the
>> > monotonic clock. Monotonic counts during a migration, which really is a
>> > forced suspend. Additionally, you cannot step the monotonic clock.
>>
>> A migration _should_ have suspend semantics, but the forced suspend
>> which is done by migration today does not have proper defined semantics
>> at all.
>>
>> Also clock monotonic can be stepped forward under certain circumstances
>> and the kernel is very well able to handle it within well defined
>> limits. Think about scheduled out vCPUs. From their perspective clock
>> monotonic is stepping forwards.
>>
>
> Right. For better or worse, the kernel has been conditioned to
> tolerate small steps due to scheduling. There is just zero definition
> around how much slop is allowed.

From the experience with the MONO=BOOT experiment, I'd say anything < 1
second is unlikely to cause larger trouble, but there might be user
space applications/configurations which disagree :)

>> The problem starts when the 'force suspended' time becomes excessive, as
>> that causes the mass expiry of clock monotonic timers with all the nasty
>> side effects described in a3ed0e4393d6. In the worst case it's going to
>> exceed the catchup limit of non-suspended timekeeping (~440s for a TSC
>> @2GHz) which in fact breaks the world and some more.
>>
>> So let me go back to the use cases:
>>
>>  1) Regular freeze of a VM to disk and resume at some arbitrary point in
>>     the future.
>>
>>  2) Live migration
>>
>> In both cases the proper solution is to make the guest go into a well
>> defined state (suspend) and resume it on restore. Everything just works
>> because it is well defined.
>>
>> Though you say that there is a special case (not that I believe it):
>
> I believe the easier special case to articulate is when the hypervisor
> has already done its due diligence to warn the guest about a
> migration. Guest doesn't heed the warning and doesn't quiesce. The
> most predictable state at this point is probably just to kill the VM
> on the spot, but that is likely to be a _very_ tough sell :)

I'm all for it. It's very well defined.

> So assuming that it's still possible for a non-cooperative suspend
> (live migration, VM freeze, etc.) there's still a need to stop the
> bleeding. I think you touch on what that may look like:
>
>>   1) Trestart - Tstop < TOLERABLE_THRESHOLD
>>
>>      That's the easy case as you just can adjust TSC on restore by that
>>      amount on all vCPUs and be done with it. Just works like scheduling
>>      out all vCPUs for some time.
>>
>>   2) Trestart - Tstop >= TOLERABLE_THRESHOLD
>>
>>      Avoid adjusting TSC for the reasons above.
>
> Which naturally will prompt the question: what is the value of
> TOLERABLE_THRESHOLD? Speaking from experience (Google already does
> something similar, but without a good fallback for exceeding the
> threshold), there's ~zero science in deriving that value. But, IMO if
> it's at least documented we can make the shenanigans at least a bit
> more predictable. It also makes it very easy to define who
> (guest/host) is responsible for cleaning up the mess.

See above, but the hyperscalers with experience on heavy host overload
might have better information when keeping a vCPU scheduled out starts
to create problems in the guest.

> In absence of documentation there's an unlimited license for VM
> operators to do as they please and I fear we will forever perpetuate
> the pain of time in virt.

You can prevent that, by making 'Cooperate within time or die hard' the
policy. :)

>>              if (seq != get_migration_sequence())
>>                 do_something_smart();
>>              else
>>                 proceed_as_usual();
>
> Agreed pretty much the whole way through. There's no point in keeping
> NTP naive at this point.
>
> There's a need to sound the alarm for NTP regardless of whether
> TOLERABLE_THRESHOLD is exceeded. David pointed out that the host
> advancing the guest clocks (delta injection or TSC advancement) could
> inject some error. Also, hardware has likely changed and the new parts
> will have their own errors as well.

There is no reason why you can't use the #2 scheme for the #1 case
too:

>>         On destination host:
>>            Restore memory image
>>            Expose metadata in PV:
>>              - migration sequence number + 1

                - Flag whether Tout was compensated already via
                  TSC or just set Tout = 0

>>              - Tout (dest/source host delta of clock TAI)
>>            Run guest
>>
>>         Guest kernel:
>>
>>            - Keep track of the PV migration sequence number.
>>
>>              If it changed act accordingly by injecting the TAI delta,
>>              which updates NTP state, wakes TFD_TIMER_CANCEL_ON_SET,
>>              etc...

                if it was compensated via TSC already, it might be
                sufficient to just reset NTP state.

>>         NTP:
>>            - utilize the sequence counter information
                ....

OTOH, the question is whether it's worth it.

If we assume that the sane case is a cooperative guest and the forced
migration is the last resort, then we can just avoid the extra magic and
the discussion around the correct value for TOLERABLE_THRESHOLD
alltogether.

I suggest to start from a TOLERABLE_THRESHOLD=0 assumption to keep it
simple in the first step. Once this has been established, you can still
experiment with the threshold and figure out whether it matters.

In fact, doing the host side TSC compensation is just an excuse for VM
operators not to make the guest cooperative, because it might solve
their main problems for the vast majority of migrations.

Forcing them to doing it right is definitely the better option, which
means the 'Cooperate or die hard' policy is the best one you can
chose. :)

>>         That still will leave everything else exposed to
>>         CLOCK_REALTIME/TAI jumping forward, but there is nothing you can
>>         do about that and any application which cares about this has to
>>         be able to deal with it anyway.
>
> Right. There's no cure-all between hypervisor/guest kernel that could
> fix the problem for userspace entirely.

In the same way as there is no cure for time jumps caused by
settimeofday(), daylight saving changes, leap seconds etc., unless the
application is carefully designed to deal with that.

> Appreciate you chiming in on this topic yet again.

I still hope that this get's fixed _before_ I retire :)

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-29 14:19   ` Thomas Gleixner
@ 2022-03-29 16:02     ` Oliver Upton
  2022-03-29 19:34       ` Thomas Gleixner
  2022-06-30 11:58       ` David Woodhouse
  0 siblings, 2 replies; 33+ messages in thread
From: Oliver Upton @ 2022-03-29 16:02 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Franke, Daniel, David Woodhouse, Paolo Bonzini, kvm,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel

Hi Thomas,

On Tue, Mar 29, 2022 at 7:19 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> > Let's just assume for now that the hypervisor will *not* quiesce
> > the guest into an S2IDLE state before migration. I think quiesced
> > migrations are a good thing to have in the toolbelt, but there will
> > always be host-side problems that require us to migrate a VM off a host
> > immediately with no time to inform the guest.
>
> Quiesced migrations are the right thing on the principle of least
> surprise.

Absolutely no argument here. Furthermore, it is likely the laziest
solution as well since all the adjustment is done within the guest.

The one thing I envision is that it's highly unlikely that a migration
will block until the guest cooperates. If the guest doesn't act within
a certain threshold then sorry, you're going along for the ride
anyway.

But it is very easy to paint that as user error if advance notice was
given for the migration :)

> > It seems possible to block racing reads of REALTIME if we protect it with
> > a migration sequence counter. Host raises the sequence after a migration
> > when control is yielded back to the guest. The sequence is odd if an
> > update is pending. Guest increments the sequence again after the
> > interrupt handler accounts for time travel. That has the side effect of
> > blocking all realtime clock reads until the interrupt is handled. But
> > what are the value of those reads if we know they're wrong? There is
> > also the implication that said shared memory interface gets mapped
> > through to userspace for vDSO, haven't thought at all about those
> > implications yet.
>
> So you need two sequence counters to be checked similar to the
> pv_clock() case, which prevents the usage of TSC without the pv_clock()
> overhead.
>
> That also means that CLOCK_REALTIME, CLOCK_TAI and CLOCK_REALTIME_COARSE
> will need to become special cased in the VDSO and all realtime based
> syscall interfaces.
>
> I'm sure that will be well received from a performance perspective.

Lol. Yeah, I was rambling and this still falls apart for the case
where a clock read came in before the migration, which you raise
below.

> As I explained before, this does not solve the following:
>
>    T = now();
>    -----------> interruption of some sort (Tout)
>    use(T);
>
> Any use case which cares about Tout being less than a certain threshold
> has to be carefully designed to handle this. See also below.
>
> > Doing this the other way around (advance the TSC, tell the guest to fix
> > MONOTONIC) is fundamentally wrong, as it violates two invariants of the
> > monotonic clock. Monotonic counts during a migration, which really is a
> > forced suspend. Additionally, you cannot step the monotonic clock.
>
> A migration _should_ have suspend semantics, but the forced suspend
> which is done by migration today does not have proper defined semantics
> at all.
>
> Also clock monotonic can be stepped forward under certain circumstances
> and the kernel is very well able to handle it within well defined
> limits. Think about scheduled out vCPUs. From their perspective clock
> monotonic is stepping forwards.
>

Right. For better or worse, the kernel has been conditioned to
tolerate small steps due to scheduling. There is just zero definition
around how much slop is allowed.

> The problem starts when the 'force suspended' time becomes excessive, as
> that causes the mass expiry of clock monotonic timers with all the nasty
> side effects described in a3ed0e4393d6. In the worst case it's going to
> exceed the catchup limit of non-suspended timekeeping (~440s for a TSC
> @2GHz) which in fact breaks the world and some more.
>
> So let me go back to the use cases:
>
>  1) Regular freeze of a VM to disk and resume at some arbitrary point in
>     the future.
>
>  2) Live migration
>
> In both cases the proper solution is to make the guest go into a well
> defined state (suspend) and resume it on restore. Everything just works
> because it is well defined.
>
> Though you say that there is a special case (not that I believe it):

I believe the easier special case to articulate is when the hypervisor
has already done its due diligence to warn the guest about a
migration. Guest doesn't heed the warning and doesn't quiesce. The
most predictable state at this point is probably just to kill the VM
on the spot, but that is likely to be a _very_ tough sell :)

So assuming that it's still possible for a non-cooperative suspend
(live migration, VM freeze, etc.) there's still a need to stop the
bleeding. I think you touch on what that may look like:

>   1) Trestart - Tstop < TOLERABLE_THRESHOLD
>
>      That's the easy case as you just can adjust TSC on restore by that
>      amount on all vCPUs and be done with it. Just works like scheduling
>      out all vCPUs for some time.
>
>   2) Trestart - Tstop >= TOLERABLE_THRESHOLD
>
>      Avoid adjusting TSC for the reasons above.

Which naturally will prompt the question: what is the value of
TOLERABLE_THRESHOLD? Speaking from experience (Google already does
something similar, but without a good fallback for exceeding the
threshold), there's ~zero science in deriving that value. But, IMO if
it's at least documented we can make the shenanigans at least a bit
more predictable. It also makes it very easy to define who
(guest/host) is responsible for cleaning up the mess.

In absence of documentation there's an unlimited license for VM
operators to do as they please and I fear we will forever perpetuate
the pain of time in virt.

>      That leaves you with the options of
>
>      A) Make NTP unhappy as of today
>
>      B) Provide information about the fact that the vCPUs got scheduled
>         out for a long time and teach NTP to be smart about it.
>
>         Right now NTP decides to declare itself unstable when it
>         observes the time jump on CLOCK_REALTIME from the NTP server.
>
>         That's exactly the situation described above:
>
>         T = now();
>         -----------> interruption of some sort (Tout)
>         use(T);
>
>         NTP observes that T is inconsistent because it does not know
>         about Tout due to the fact that neither clock MONOTONIC nor
>         clock BOOTTIME advanced.
>
>         So here is where you can bring PV information into play by
>         providing a migration sequence number in PV shared memory.
>
>         On source host:
>            Stop the guest dead in it's tracks.
>            Record metadata:
>              - migration sequence number
>              - clock TAI as observed on the host
>            Transfer the image along with metadata
>
>         On destination host:
>            Restore memory image
>            Expose metadata in PV:
>              - migration sequence number + 1
>              - Tout (dest/source host delta of clock TAI)
>            Run guest
>
>         Guest kernel:
>
>            - Keep track of the PV migration sequence number.
>
>              If it changed act accordingly by injecting the TAI delta,
>              which updates NTP state, wakes TFD_TIMER_CANCEL_ON_SET,
>              etc...
>
>              We have to do that in the kernel because we cannot rely on
>              NTP being running.
>
>              That can be an explicit IPI injected from the host or just
>              polling from the tick. It doesn't matter much.
>
>            - Expose information to user space:
>                - Migration sequence counter
>
>            - Add some smarts to adjtimex() to prevent user space racing
>              against a pending migration update in the kernel.
>
>         NTP:
>            - utilize the sequence counter information
>
>              seq = get_migration_sequence();
>
>              do stuff();
>
>              if (seq != get_migration_sequence())
>                 do_something_smart();
>              else
>                 proceed_as_usual();

Agreed pretty much the whole way through. There's no point in keeping
NTP naive at this point.

There's a need to sound the alarm for NTP regardless of whether
TOLERABLE_THRESHOLD is exceeded. David pointed out that the host
advancing the guest clocks (delta injection or TSC advancement) could
inject some error. Also, hardware has likely changed and the new parts
will have their own errors as well.

>         That still will leave everything else exposed to
>         CLOCK_REALTIME/TAI jumping forward, but there is nothing you can
>         do about that and any application which cares about this has to
>         be able to deal with it anyway.

Right. There's no cure-all between hypervisor/guest kernel that could
fix the problem for userspace entirely.

Appreciate you chiming in on this topic yet again.

--
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-22 21:53 ` Oliver Upton
  2022-03-23 12:35   ` David Woodhouse
@ 2022-03-29 14:19   ` Thomas Gleixner
  2022-03-29 16:02     ` Oliver Upton
  1 sibling, 1 reply; 33+ messages in thread
From: Thomas Gleixner @ 2022-03-29 14:19 UTC (permalink / raw)
  To: Oliver Upton, Franke, Daniel
  Cc: David Woodhouse, Paolo Bonzini, kvm, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel

On Tue, Mar 22 2022 at 21:53, Oliver Upton wrote:
> On Tue, Mar 22, 2022 at 07:18:20PM +0000, Franke, Daniel wrote:
>> The KVM_PVCLOCK_STOPPED event should trigger a change in some of the
>> globals kept by kernel/time/ntp.c (which are visible to userspace through
>> adjtimex(2)). In particular, `time_esterror` and `time_maxerror` should get reset
>> to `NTP_PHASE_LIMIT` and time_status should get reset to `STA_UNSYNC`.
>
> I do not disagree that NTP needs to throw the book out after a live
> migration.
>
> But, the issue is how we convey that to the guest. KVM_PVCLOCK_STOPPED
> relies on the guest polling a shared structure, and who knows when the
> guest is going to check the structure again? If we inject an interrupt
> the guest is likely to check this state in a reasonable amount of time.
>
> Thomas, we're talking about how to not wreck time (as bad) under
> virtualization. I know this has been an area of interest to you for a
> while ;-) The idea is that the hypervisor should let the guest know
> about time travel.

Finally. It only took 10+ years of ranting :)

> Let's just assume for now that the hypervisor will *not* quiesce
> the guest into an S2IDLE state before migration. I think quiesced
> migrations are a good thing to have in the toolbelt, but there will
> always be host-side problems that require us to migrate a VM off a host
> immediately with no time to inform the guest.

Quiesced migrations are the right thing on the principle of least
surprise.

> Given that, we're deciding which clock is going to get wrecked during
> a migration and what the guest can do afterwards to clean it up.
> Whichever clock gets wrecked is going to have a window where reads race
> with the eventual fix, and could be completely wrong. My thoughts:
>
> We do not advance the TSC during a migration and notify the
> guest (interrupt, shared structure) about how much it has
> time traveled (delta_REALTIME). REALTIME is wrong until the interrupt
> is handled in the guest, but should fire off all of the existing
> mechanisms for a clock step. Userspace gets notified with
> TFD_TIMER_CANCEL_ON_SET. I believe you have proposed something similar
> as a way to make live migration less sinister from the guest
> perspective.

That still is an information _after_ the fact and not well defined.

> It seems possible to block racing reads of REALTIME if we protect it with
> a migration sequence counter. Host raises the sequence after a migration
> when control is yielded back to the guest. The sequence is odd if an
> update is pending. Guest increments the sequence again after the
> interrupt handler accounts for time travel. That has the side effect of
> blocking all realtime clock reads until the interrupt is handled. But
> what are the value of those reads if we know they're wrong? There is
> also the implication that said shared memory interface gets mapped
> through to userspace for vDSO, haven't thought at all about those
> implications yet.

So you need two sequence counters to be checked similar to the
pv_clock() case, which prevents the usage of TSC without the pv_clock()
overhead.

That also means that CLOCK_REALTIME, CLOCK_TAI and CLOCK_REALTIME_COARSE
will need to become special cased in the VDSO and all realtime based
syscall interfaces.

I'm sure that will be well received from a performance perspective.

Aside of that where do you put the limit? Just user space visible read
outs? What about the kernel? Ignore it in the kernel and hope that
nothing will break? That's going to create really hard to diagnose
issues, which I'm absolutely not interested in.

You can't expand this scheme to the kernel in general because the CPU
which should handle the update might be in an interrupt disabled region
reading clock REALTIME....

Also this would just make the race window smaller. It does not solve the
problem in a consistent way. Don't even think about it. It's just
another layer of duct tape over the existing ones.

As I explained before, this does not solve the following:

   T = now();
   -----------> interruption of some sort (Tout)
   use(T);

Any use case which cares about Tout being less than a certain threshold
has to be carefully designed to handle this. See also below.

> Doing this the other way around (advance the TSC, tell the guest to fix
> MONOTONIC) is fundamentally wrong, as it violates two invariants of the
> monotonic clock. Monotonic counts during a migration, which really is a
> forced suspend. Additionally, you cannot step the monotonic clock.

A migration _should_ have suspend semantics, but the forced suspend
which is done by migration today does not have proper defined semantics
at all.

Also clock monotonic can be stepped forward under certain circumstances
and the kernel is very well able to handle it within well defined
limits. Think about scheduled out vCPUs. From their perspective clock
monotonic is stepping forwards.

The problem starts when the 'force suspended' time becomes excessive, as
that causes the mass expiry of clock monotonic timers with all the nasty
side effects described in a3ed0e4393d6. In the worst case it's going to
exceed the catchup limit of non-suspended timekeeping (~440s for a TSC
@2GHz) which in fact breaks the world and some more.

So let me go back to the use cases:

 1) Regular freeze of a VM to disk and resume at some arbitrary point in
    the future.

 2) Live migration

In both cases the proper solution is to make the guest go into a well
defined state (suspend) and resume it on restore. Everything just works
because it is well defined.

Though you say that there is a special case (not that I believe it):

> but there will always be host-side problems that require us to migrate
> a VM off a host immediately with no time to inform the guest.

which is basically what we do today for the very wrong reasons. There you
have two situations:

  1) Trestart - Tstop < TOLERABLE_THRESHOLD

     That's the easy case as you just can adjust TSC on restore by that
     amount on all vCPUs and be done with it. Just works like scheduling
     out all vCPUs for some time.

  2) Trestart - Tstop >= TOLERABLE_THRESHOLD

     Avoid adjusting TSC for the reasons above. That leaves you with the
     options of

     A) Make NTP unhappy as of today

     B) Provide information about the fact that the vCPUs got scheduled
        out for a long time and teach NTP to be smart about it.

        Right now NTP decides to declare itself unstable when it
        observes the time jump on CLOCK_REALTIME from the NTP server.

        That's exactly the situation described above:

        T = now();
        -----------> interruption of some sort (Tout)
        use(T);

        NTP observes that T is inconsistent because it does not know
        about Tout due to the fact that neither clock MONOTONIC nor
        clock BOOTTIME advanced.

        So here is where you can bring PV information into play by
        providing a migration sequence number in PV shared memory.

        On source host:
           Stop the guest dead in it's tracks.
           Record metadata:
             - migration sequence number
             - clock TAI as observed on the host
           Transfer the image along with metadata

        On destination host:
           Restore memory image
           Expose metadata in PV:
             - migration sequence number + 1
             - Tout (dest/source host delta of clock TAI)
           Run guest

        Guest kernel:

           - Keep track of the PV migration sequence number.

             If it changed act accordingly by injecting the TAI delta,
             which updates NTP state, wakes TFD_TIMER_CANCEL_ON_SET,
             etc...

             We have to do that in the kernel because we cannot rely on
             NTP being running.

             That can be an explicit IPI injected from the host or just
             polling from the tick. It doesn't matter much.

           - Expose information to user space:
               - Migration sequence counter

           - Add some smarts to adjtimex() to prevent user space racing
             against a pending migration update in the kernel.

        NTP:
           - utilize the sequence counter information 

             seq = get_migration_sequence();

             do stuff();
        
             if (seq != get_migration_sequence())
             	do_something_smart();
             else
                proceed_as_usual();

        That still will leave everything else exposed to
        CLOCK_REALTIME/TAI jumping forward, but there is nothing you can
        do about that and any application which cares about this has to
        be able to deal with it anyway.

Hmm?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-25  9:03       ` David Woodhouse
@ 2022-03-25 17:47         ` Oliver Upton
  0 siblings, 0 replies; 33+ messages in thread
From: Oliver Upton @ 2022-03-25 17:47 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Franke, Daniel, Paolo Bonzini, kvm, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner

On Fri, Mar 25, 2022 at 09:03:50AM +0000, David Woodhouse wrote:
> On Wed, 2022-03-23 at 16:21 +0000, Oliver Upton wrote:
> > On Wed, Mar 23, 2022 at 12:35:17PM +0000, David Woodhouse wrote:
> > > And it all looks *just* like it would if the vCPUs happened not to be
> > > scheduled for a little while, because the host was busy.
> > 
> > We could continue to get away with TSC advancement, but the critical
> > part IMO is the upper bound. And what happens when we exceed it.
> > 
> > There is no authoritative documentation around what time looks like as a
> > guest of KVM, and futhermore what happens when a guest experiences time
> > travel. Now we're in a particularly undesirable situation where there
> > are at least three known definitions for time during a migration
> > (upstream QEMU, Google, Amazon) and it is ~impossible to program guest
> > software to anticipate our shenanigans.
> > 
> > If we are to do this right we probably need to agree on documented
> > behavior. If we decide that advancing TSCs is acceptable up to 'X'
> > seconds, guest kernels could take a change to relax expectations at
> > least up to this value.
> 
> I'm not sure I even buy this line of argument at all. I don't think of
> it as 'TSC advancement'. Or maybe I should put that the other way
> round: TSCs advance *all* the time; that's the whole point in them. 

The only reason I use the term 'TSC advancement' is to account for the
fact that the guest probably feels *very* differently about what the
advancement time actually is. The world around it has continued onwards
while it was out.

> All we do when we live-migrate is move the state from one host to
> another, completely intact. From the guest point of view we don't
> "advance" the TSC; we just set the guest TSC on the destination host to
> precisely¹ the same value that the guest TSC on the source host would
> have at that moment if we were to abort the migration and do a
> rollback. Either way (rollback or not), the vCPUs weren't *running* for
> a period of time, but time and TSCs are entirely the same.
> 
> In fact, on live update we literally *don't* change anything that the
> guest sees. The pvclock information that the guest gets is *precisely*
> the same as before, and the TSC has 'advanced' purely by nature of
> being precisely the same CPU just a short period of time in the
> future... because that's what TSCs do when time passes :)
> 
> Really, we aren't talking about a "maximum advancement of the TSC".
> We're talking about a "maximum period for which the vCPUs aren't
> scheduled".

But that is all built on the assumption that a guest is actually
being *live* migrated. That is entirely a userspace decision, and IMO we
can't glean what the VMM is trying to do just based on how it calls
the KVM APIs.

The only thing we know for certain is the guest state is serialized.
Where that data goes and how long it takes to be acted upon is none of
our business in the kernel.

> And I don't see why there need to be documented hard limits on that. I
> see it as a quality of implementation issue. If the host is in swap
> death and doesn't manage to run the guest vCPUs for seconds at a time,
> that's fairly crappy, but it's not strictly a *timekeeping* issue. And
> it's just the same if the final transition period of live migration (or
> the kexec time for live update) is excessive.

Serializing a VM to persistent storage and resuming it at a later date
is a completely valid use of the KVM APIs. Sure, if the guest goes down
for that long during a live migration then it has every right to scream
at the entity running the VM, but I would argue there are more
lifecycles outside of what our businesses typically do.

> > > > > The KVM_PVCLOCK_STOPPED event should trigger a change in some of the
> > > > > globals kept by kernel/time/ntp.c (which are visible to userspace through
> > > > > adjtimex(2)). In particular, `time_esterror` and `time_maxerror` should get reset
> > > > > to `NTP_PHASE_LIMIT` and time_status should get reset to `STA_UNSYNC`.
> > > > 
> > > > I do not disagree that NTP needs to throw the book out after a live
> > > > migration.
> > > > 
> 
> Right. To recap, this is because where I highlighted 'precisely¹'
> above, our accuracy is actually limited to the synchronisation of the
> wallclock time between the two hosts. If the guest thinks it has a more
> accurate NTP sync than either of the hosts do, we may have introduced
> an error which is more than the error bounds the guest thinks it has,
> and that may ultimately lead to data corruption in some circumstances.

Agreed. And on top of that I would state the guest is likely running on
a different oscillator with its own set of problems since we switched out
the underlying hardware. NTP would need to rediscipline the clock even
if the migration somehow bends our understanding of reality and happens
in zero time.

> This part is somewhat orthogonal to the above discussion, isn't it?
> Regardless of whether we 'step' the TSC or not, we need to signal the
> guest to know that it needs to consider itself unsynchronized (or, if
> we want to get really fancy, let it know the upper bounds of the error
> we just introduced. But let's not).

Right, whatever signal we add should be asserted regardless of whether
the TSCs kept ticking while the guest is out. If there is going to be a
shared structure/message/whatever that conveys delta_REALTIME, you could
just set it to zero if the hypervisor decided to handle time on its own.

> > > > But, the issue is how we convey that to the guest. KVM_PVCLOCK_STOPPED
> > > > relies on the guest polling a shared structure, and who knows when the
> > > > guest is going to check the structure again? If we inject an interrupt
> > > > the guest is likely to check this state in a reasonable amount of time.
> > > 
> > > Ah, but that's the point. A flag in shared memory can be checked
> > > whenever the guest needs to know that it's operating on valid state.
> > > Linux will check it *every* time from pvclock_clocksource_read().
> > > 
> > > As opposed to a separate interrupt which eventually gets processed some
> > > indefinite amount of time in the future.
> > 
> > There are a few annoying things with pvclock, though. It is a per-vCPU
> > structure, so special care must be taken to act exactly once on a
> > migration. Also, since commit 7539b174aef4 ("x86: kvmguest: use TSC
> > clocksource if invariant TSC is exposed") the guest kernel could pick
> > the TSC over the pvclock by default, so its hard to say when the pvclock
> > structure is checked again.
> 
> That commit appears to be assuming that the TSC *will* be "stepped", or
> as I call it "continue to advance normally at its normal frequency over
> elapsed time".

Maybe? The author's intent isn't the reason I cared about that commit. It
means we will not detect PVCLOCK_GUEST_STOPPED synchronously with any
clock read, and I imagine we're rather concerned about those reads until
the clock is refined. sched clock would probably be the one to find out.

And sure, changing the clocksource away from the pvclock may not be the
right call if we are going to use it to enlighten NTP, I just feel the
commit above codified that it isn't user error.

> >  This is what I had in mind when suggesting a doorbell is needed, as
> > there is no good way to know what clocksource the guest is using.
> 
> Yes, perhaps that might be necessary.
> 
> The concern is that by the time that doorbell interrupt has finally
> been delivered and processed, an inaccurate timestamp could *already*
> have been used on the wire in the some application's coherency
> protocol, and the guest's database could already be hosed.

Unless you get that software into a known-safe instruction boundary,
who's to say you haven't migrated the guest between the clock read
and eventual packet out to the world? The timestamp would be
pre-blackout.

There's also nothing to prevent any agent within the guest from reading
the clock before NTP refinement comes in. We don't block clock reads on
PVCLOCK_GUEST_STOPPED.

> But I don't see why we can't have both.  I think it makes sense for the
> guest kernel to ditch its NTP sync when it sees PVCLOCK_GUEST_STOPPED,
> but I'm not entirely averse to the existence of a doorbell mechanism
> such as you describe.

Sure, there's nothing stopping us from setting off the alarms when that
bit is set. But:

 - It isn't as helpful for guests using the TSC
 - It does not prevent guest kernel or userspace from getting completely
   wrong clock reads in the interim

Given this, I'm convinced there is significant room for improvement as
there is a significant window of time during which we really wish the
guest to not use its wall clock.

> > > I'll give you the assertion that migrations aren't completely
> > > invisible, but I still think they should be *equivalent* to the vCPU
> > > just not being scheduled for a moment.
> > 
> > I sure hope that migrations are fast enough that it is indistinguishable
> > from scheduler pressure. I think the situations where that is not the
> > case are particularly interesting. Defining a limit and having a
> > mechanism for remedial action could make things more predictable for
> > guest software.
> > 
> > But agree, and shame on us for the broken virtual hardware when that
> > isn't the case :-)
> 
> Right. The "how to tell the guest that it needs to ditch its NTP sync
> status" question needs some more thought
>
> but in the short term I think
> it makes sense to add get/set TSC mechanisms which work like
> KVM_[SG]ET_CLOCK with the KVM_CLOCK_REALTIME flag.
> 
> Instead of realtime though, it should use the KVM clock. So reading a
> TSC returns a { kvm_ns, TSC value } tuple, and setting it will advance
> the value by the appropriate amount just as we advance the KVM clock
> for the KVM_CLOCK_REALTIME delta.

I'm all ears for changes to the TSC mechanisms, I just feel rather
strongly that it needs to support all possible use cases, not just live
migration. Further, KVM doesn't know what userspace is doing and I really
hope that userspace knows what userspace is doing.

If it does, maybe that's the right place to decide how to bend reality.
If we fully commit to KVM advancing clocks for an unbounded amount of
time then there needs to be a *giant* disclaimer for the case of pulling
a VM out of cold storage. Writing back exactly what you saved will in all
likelihood detonate the guest.

I just think it is highly desirable that anything we get upstream is
actually usable by upstream. Even though the hyperscalers only care about
situations where we're likely to immediately resume the guest, there are
other use cases to think about. The TSC offset attributes fit the
immediate interests of my employer, but was only halfway there for
upstream. I'd like to do right on that, because hopefully then I'm not
the only person testing it.

--
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-23 16:21     ` Oliver Upton
@ 2022-03-25  9:03       ` David Woodhouse
  2022-03-25 17:47         ` Oliver Upton
  0 siblings, 1 reply; 33+ messages in thread
From: David Woodhouse @ 2022-03-25  9:03 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Franke, Daniel, Paolo Bonzini, kvm, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 6863 bytes --]

On Wed, 2022-03-23 at 16:21 +0000, Oliver Upton wrote:
> On Wed, Mar 23, 2022 at 12:35:17PM +0000, David Woodhouse wrote:
> > And it all looks *just* like it would if the vCPUs happened not to be
> > scheduled for a little while, because the host was busy.
> 
> We could continue to get away with TSC advancement, but the critical
> part IMO is the upper bound. And what happens when we exceed it.
> 
> There is no authoritative documentation around what time looks like as a
> guest of KVM, and futhermore what happens when a guest experiences time
> travel. Now we're in a particularly undesirable situation where there
> are at least three known definitions for time during a migration
> (upstream QEMU, Google, Amazon) and it is ~impossible to program guest
> software to anticipate our shenanigans.
> 
> If we are to do this right we probably need to agree on documented
> behavior. If we decide that advancing TSCs is acceptable up to 'X'
> seconds, guest kernels could take a change to relax expectations at
> least up to this value.

I'm not sure I even buy this line of argument at all. I don't think of
it as 'TSC advancement'. Or maybe I should put that the other way
round: TSCs advance *all* the time; that's the whole point in them. 

All we do when we live-migrate is move the state from one host to
another, completely intact. From the guest point of view we don't
"advance" the TSC; we just set the guest TSC on the destination host to
precisely¹ the same value that the guest TSC on the source host would
have at that moment if we were to abort the migration and do a
rollback. Either way (rollback or not), the vCPUs weren't *running* for
a period of time, but time and TSCs are entirely the same.

In fact, on live update we literally *don't* change anything that the
guest sees. The pvclock information that the guest gets is *precisely*
the same as before, and the TSC has 'advanced' purely by nature of
being precisely the same CPU just a short period of time in the
future... because that's what TSCs do when time passes :)

Really, we aren't talking about a "maximum advancement of the TSC".
We're talking about a "maximum period for which the vCPUs aren't
scheduled".

And I don't see why there need to be documented hard limits on that. I
see it as a quality of implementation issue. If the host is in swap
death and doesn't manage to run the guest vCPUs for seconds at a time,
that's fairly crappy, but it's not strictly a *timekeeping* issue. And
it's just the same if the final transition period of live migration (or
the kexec time for live update) is excessive.

> > > > The KVM_PVCLOCK_STOPPED event should trigger a change in some of the
> > > > globals kept by kernel/time/ntp.c (which are visible to userspace through
> > > > adjtimex(2)). In particular, `time_esterror` and `time_maxerror` should get reset
> > > > to `NTP_PHASE_LIMIT` and time_status should get reset to `STA_UNSYNC`.
> > > 
> > > I do not disagree that NTP needs to throw the book out after a live
> > > migration.
> > > 

Right. To recap, this is because where I highlighted 'precisely¹'
above, our accuracy is actually limited to the synchronisation of the
wallclock time between the two hosts. If the guest thinks it has a more
accurate NTP sync than either of the hosts do, we may have introduced
an error which is more than the error bounds the guest thinks it has,
and that may ultimately lead to data corruption in some circumstances.

This part is somewhat orthogonal to the above discussion, isn't it?
Regardless of whether we 'step' the TSC or not, we need to signal the
guest to know that it needs to consider itself unsynchronized (or, if
we want to get really fancy, let it know the upper bounds of the error
we just introduced. But let's not).

> > > But, the issue is how we convey that to the guest. KVM_PVCLOCK_STOPPED
> > > relies on the guest polling a shared structure, and who knows when the
> > > guest is going to check the structure again? If we inject an interrupt
> > > the guest is likely to check this state in a reasonable amount of time.
> > 
> > Ah, but that's the point. A flag in shared memory can be checked
> > whenever the guest needs to know that it's operating on valid state.
> > Linux will check it *every* time from pvclock_clocksource_read().
> > 
> > As opposed to a separate interrupt which eventually gets processed some
> > indefinite amount of time in the future.
> 
> There are a few annoying things with pvclock, though. It is a per-vCPU
> structure, so special care must be taken to act exactly once on a
> migration. Also, since commit 7539b174aef4 ("x86: kvmguest: use TSC
> clocksource if invariant TSC is exposed") the guest kernel could pick
> the TSC over the pvclock by default, so its hard to say when the pvclock
> structure is checked again.

That commit appears to be assuming that the TSC *will* be "stepped", or
as I call it "continue to advance normally at its normal frequency over
elapsed time".

>  This is what I had in mind when suggesting a doorbell is needed, as
> there is no good way to know what clocksource the guest is using.

Yes, perhaps that might be necessary.

The concern is that by the time that doorbell interrupt has finally
been delivered and processed, an inaccurate timestamp could *already*
have been used on the wire in the some application's coherency
protocol, and the guest's database could already be hosed.

But I don't see why we can't have both.  I think it makes sense for the
guest kernel to ditch its NTP sync when it sees PVCLOCK_GUEST_STOPPED,
but I'm not entirely averse to the existence of a doorbell mechanism
such as you describe.

> > I'll give you the assertion that migrations aren't completely
> > invisible, but I still think they should be *equivalent* to the vCPU
> > just not being scheduled for a moment.
> 
> I sure hope that migrations are fast enough that it is indistinguishable
> from scheduler pressure. I think the situations where that is not the
> case are particularly interesting. Defining a limit and having a
> mechanism for remedial action could make things more predictable for
> guest software.
> 
> But agree, and shame on us for the broken virtual hardware when that
> isn't the case :-)

Right. The "how to tell the guest that it needs to ditch its NTP sync
status" question needs some more thought but in the short term I think
it makes sense to add get/set TSC mechanisms which work like
KVM_[SG]ET_CLOCK with the KVM_CLOCK_REALTIME flag.

Instead of realtime though, it should use the KVM clock. So reading a
TSC returns a { kvm_ns, TSC value } tuple, and setting it will advance
the value by the appropriate amount just as we advance the KVM clock
for the KVM_CLOCK_REALTIME delta.

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-23 12:35   ` David Woodhouse
@ 2022-03-23 16:21     ` Oliver Upton
  2022-03-25  9:03       ` David Woodhouse
  0 siblings, 1 reply; 33+ messages in thread
From: Oliver Upton @ 2022-03-23 16:21 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Franke, Daniel, Paolo Bonzini, kvm, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner

On Wed, Mar 23, 2022 at 12:35:17PM +0000, David Woodhouse wrote:
> On Tue, 2022-03-22 at 21:53 +0000, Oliver Upton wrote:
> > But what happens to CLOCK_MONOTONIC in this case? We are still accepting
> > the fact that live migrations destroy CLOCK_MONOTONIC if we directly
> > advance the guest TSCs to account for elapsed time. The definition of
> > CLOCK_MONOTONIC is that the clock does not count while the system is
> > suspended. From the viewpoint of the guest, a live migration appears to
> > be a forced suspend operation at an arbitrary instruction boundary.
> > There is no realistic way for the guest to give the illusion that
> > MONOTONIC has stopped without help from the hypervisor.
> 
> I'm a little lost there. CLOCK_MONOTONIC is *permitted* to stop when
> the guest is suspended, but it's not *mandatory*, surely?
> 
> I can buy your assertion that the brownout period of a live migration
> (or the time for the host kernel to kexec in the case of live update) 
> can be considered a suspend... but regardless of whether that makes it
> mandatory to stop the clocks, I prefer to see it a different way. 
> 
> In normal operation — especially with CPU overcommit and/or throttling
> — there are times when none of the guest's vCPUs will be scheduled for
> short periods of time. We don't *have* to stop CLOCK_MONOTONIC when
> that happens, do we?
> 

You're absolutely right. We've at least accepted MONOTONIC behaving the
way it does for time lost to host scheduling, and expose this through
steal_time for the guest scheduler.

> If we want live migration to be guest-transparent, shouldn't we treat
> is as much as possible as one of those times when the vCPUs just happen
> not to be running for a moment?

There is still a subtle difference between host scheduler pressure and
live migration. Its hard to crisply state whether or not the VM is
actually suspended, as any one of its vCPU threads could actually be
running. Migration is one of those events where we positively know the
guest isn't running at all.

> On a live update, where the host does a kexec and then resumes the
> guest state, the host TSC reference is precisely the same as before the
> upate. We basically don't want to change *anything* that the guest sees
> in its pvclock information. In fact, we have a local patch to
> 'KVM_SET_CLOCK_FROM_GUEST' for the live update case, which ensures
> exactly that. We then add a delta to the guest TSC as we create each
> vCPU in the new KVM; the *offset* interface would be beneficial to us
> here (since that offset doesn't change) but we're not using it yet.
> 
> For live migration, the same applies — we can just add a delta to the
> clock and the guest TSC values, commensurate with the amount of
> wallclock time that elapsed from serialisation on the source host, to
> deserialisation on the destination.
> 
> And it all looks *just* like it would if the vCPUs happened not to be
> scheduled for a little while, because the host was busy.

We could continue to get away with TSC advancement, but the critical
part IMO is the upper bound. And what happens when we exceed it.

There is no authoritative documentation around what time looks like as a
guest of KVM, and futhermore what happens when a guest experiences time
travel. Now we're in a particularly undesirable situation where there
are at least three known definitions for time during a migration
(upstream QEMU, Google, Amazon) and it is ~impossible to program guest
software to anticipate our shenanigans.

If we are to do this right we probably need to agree on documented
behavior. If we decide that advancing TSCs is acceptable up to 'X'
seconds, guest kernels could take a change to relax expectations at
least up to this value.

> > > The KVM_PVCLOCK_STOPPED event should trigger a change in some of the
> > > globals kept by kernel/time/ntp.c (which are visible to userspace through
> > > adjtimex(2)). In particular, `time_esterror` and `time_maxerror` should get reset
> > > to `NTP_PHASE_LIMIT` and time_status should get reset to `STA_UNSYNC`.
> > 
> > I do not disagree that NTP needs to throw the book out after a live
> > migration.
> > 
> > But, the issue is how we convey that to the guest. KVM_PVCLOCK_STOPPED
> > relies on the guest polling a shared structure, and who knows when the
> > guest is going to check the structure again? If we inject an interrupt
> > the guest is likely to check this state in a reasonable amount of time.
> 
> Ah, but that's the point. A flag in shared memory can be checked
> whenever the guest needs to know that it's operating on valid state.
> Linux will check it *every* time from pvclock_clocksource_read().
> 
> As opposed to a separate interrupt which eventually gets processed some
> indefinite amount of time in the future.

There are a few annoying things with pvclock, though. It is a per-vCPU
structure, so special care must be taken to act exactly once on a
migration. Also, since commit 7539b174aef4 ("x86: kvmguest: use TSC
clocksource if invariant TSC is exposed") the guest kernel could pick
the TSC over the pvclock by default, so its hard to say when the pvclock
structure is checked again. This is what I had in mind when suggesting a
doorbell is needed, as there is no good way to know what clocksource the
guest is using.

> > Doing this the other way around (advance the TSC, tell the guest to fix
> > MONOTONIC) is fundamentally wrong, as it violates two invariants of the
> > monotonic clock. Monotonic counts during a migration, which really is a
> > forced suspend. Additionally, you cannot step the monotonic clock.
> > 
> I dont understand why we can't "step the monotonic clock". Any time
> merely refrain from scheduling the vCPUs for any period of time, that
> is surely indistinguishable from a "step" in the monotonic clock,
> surely?

Right, there is some nebulous threshold that we've defined as an
acceptable amount of time to 'step' the monotonic clock. I think that
everything to date is built around the assumption that it is a small
amount of time of O(timeslice). Pinning down the upper bound should at
least make clocks more predictable on virtualization.

> > Sorry to revisit this conversation yet again. Virtualization isn't going
> > away any time soon and the illusion that migrations are invisible to the
> > guest is simply not true.
> 
> I'll give you the assertion that migrations aren't completely
> invisible, but I still think they should be *equivalent* to the vCPU
> just not being scheduled for a moment.

I sure hope that migrations are fast enough that it is indistinguishable
from scheduler pressure. I think the situations where that is not the
case are particularly interesting. Defining a limit and having a
mechanism for remedial action could make things more predictable for
guest software.

But agree, and shame on us for the broken virtual hardware when that
isn't the case :-)

--
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-22 21:53 ` Oliver Upton
@ 2022-03-23 12:35   ` David Woodhouse
  2022-03-23 16:21     ` Oliver Upton
  2022-03-29 14:19   ` Thomas Gleixner
  1 sibling, 1 reply; 33+ messages in thread
From: David Woodhouse @ 2022-03-23 12:35 UTC (permalink / raw)
  To: Oliver Upton, Franke, Daniel
  Cc: Paolo Bonzini, kvm, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 4304 bytes --]

On Tue, 2022-03-22 at 21:53 +0000, Oliver Upton wrote:
> But what happens to CLOCK_MONOTONIC in this case? We are still accepting
> the fact that live migrations destroy CLOCK_MONOTONIC if we directly
> advance the guest TSCs to account for elapsed time. The definition of
> CLOCK_MONOTONIC is that the clock does not count while the system is
> suspended. From the viewpoint of the guest, a live migration appears to
> be a forced suspend operation at an arbitrary instruction boundary.
> There is no realistic way for the guest to give the illusion that
> MONOTONIC has stopped without help from the hypervisor.

I'm a little lost there. CLOCK_MONOTONIC is *permitted* to stop when
the guest is suspended, but it's not *mandatory*, surely?

I can buy your assertion that the brownout period of a live migration
(or the time for the host kernel to kexec in the case of live update) 
can be considered a suspend... but regardless of whether that makes it
mandatory to stop the clocks, I prefer to see it a different way. 

In normal operation — especially with CPU overcommit and/or throttling
— there are times when none of the guest's vCPUs will be scheduled for
short periods of time. We don't *have* to stop CLOCK_MONOTONIC when
that happens, do we?

If we want live migration to be guest-transparent, shouldn't we treat
is as much as possible as one of those times when the vCPUs just happen
not to be running for a moment?

On a live update, where the host does a kexec and then resumes the
guest state, the host TSC reference is precisely the same as before the
upate. We basically don't want to change *anything* that the guest sees
in its pvclock information. In fact, we have a local patch to
'KVM_SET_CLOCK_FROM_GUEST' for the live update case, which ensures
exactly that. We then add a delta to the guest TSC as we create each
vCPU in the new KVM; the *offset* interface would be beneficial to us
here (since that offset doesn't change) but we're not using it yet.

For live migration, the same applies — we can just add a delta to the
clock and the guest TSC values, commensurate with the amount of
wallclock time that elapsed from serialisation on the source host, to
deserialisation on the destination.

And it all looks *just* like it would if the vCPUs happened not to be
scheduled for a little while, because the host was busy.

> > The KVM_PVCLOCK_STOPPED event should trigger a change in some of the
> > globals kept by kernel/time/ntp.c (which are visible to userspace through
> > adjtimex(2)). In particular, `time_esterror` and `time_maxerror` should get reset
> > to `NTP_PHASE_LIMIT` and time_status should get reset to `STA_UNSYNC`.
> 
> I do not disagree that NTP needs to throw the book out after a live
> migration.
> 
> But, the issue is how we convey that to the guest. KVM_PVCLOCK_STOPPED
> relies on the guest polling a shared structure, and who knows when the
> guest is going to check the structure again? If we inject an interrupt
> the guest is likely to check this state in a reasonable amount of time.

Ah, but that's the point. A flag in shared memory can be checked
whenever the guest needs to know that it's operating on valid state.
Linux will check it *every* time from pvclock_clocksource_read().

As opposed to a separate interrupt which eventually gets processed some
indefinite amount of time in the future.

> Doing this the other way around (advance the TSC, tell the guest to fix
> MONOTONIC) is fundamentally wrong, as it violates two invariants of the
> monotonic clock. Monotonic counts during a migration, which really is a
> forced suspend. Additionally, you cannot step the monotonic clock.
> 
I dont understand why we can't "step the monotonic clock". Any time
merely refrain from scheduling the vCPUs for any period of time, that
is surely indistinguishable from a "step" in the monotonic clock,
surely?

> Sorry to revisit this conversation yet again. Virtualization isn't going
> away any time soon and the illusion that migrations are invisible to the
> guest is simply not true.

I'll give you the assertion that migrations aren't completely
invisible, but I still think they should be *equivalent* to the vCPU
just not being scheduled for a moment.

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-22 19:18 Franke, Daniel
@ 2022-03-22 21:53 ` Oliver Upton
  2022-03-23 12:35   ` David Woodhouse
  2022-03-29 14:19   ` Thomas Gleixner
  0 siblings, 2 replies; 33+ messages in thread
From: Oliver Upton @ 2022-03-22 21:53 UTC (permalink / raw)
  To: Franke, Daniel
  Cc: David Woodhouse, Paolo Bonzini, kvm, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner

Hi Daniel,

As we're venturing into the realm of timekeeping and he has noted
the evils of virtualization several times, I've CC'ed Thomas on this
mail. Some comments inline on the ongoing discussion.

On Tue, Mar 22, 2022 at 07:18:20PM +0000, Franke, Daniel wrote:
> On 3/21/22, 5:24 PM, "Oliver Upton" <oupton@google.com> wrote:
>  >  Right, but I'd argue that interface has some problems too. It
>  >   depends on the guest polling instead of an interrupt from the
>  >   hypervisor. It also has no way of informing the kernel exactly how much
>  >   time has elapsed.
> 
>  >   The whole point of all these hacks that we've done internally is that we,
>  >   the hypervisor, know full well how much real time hasv advanced during the
>  >   VM blackout. If we can at least let the guest know how much to fudge real
>  >   time, it can then poke NTP for better refinement. I worry about using NTP
>  >   as the sole source of truth for such a mechanism, since you'll need to go
>  >   out to the network and any reads until the response comes back are hosed.
> 
> (I'm a kernel newbie, so please excuse any ignorance with respect to kernel
> Internals or kernel/hypervisor interfaces.)

Welcome :-)

> We can have it both ways, I think. Let the hypervisor manipulate the guest TSC
> so as to keep the guest kernel's idea of real time as accurate as possible 
> without any awareness required on the guest's side. *Also* give the guest kernel
> a notification in the form of a KVM_PVCLOCK_STOPPED event or whatever else,
> and let the kernel propagate this notification to userspace so that the NTP
> daemon can recombobulate itself as quickly as possible, treating whatever TSC
> adjustment was received as best-effort only.

But what happens to CLOCK_MONOTONIC in this case? We are still accepting
the fact that live migrations destroy CLOCK_MONOTONIC if we directly
advance the guest TSCs to account for elapsed time. The definition of
CLOCK_MONOTONIC is that the clock does not count while the system is
suspended. From the viewpoint of the guest, a live migration appears to
be a forced suspend operation at an arbitrary instruction boundary.
There is no realistic way for the guest to give the illusion that
MONOTONIC has stopped without help from the hypervisor.

> The KVM_PVCLOCK_STOPPED event should trigger a change in some of the
> globals kept by kernel/time/ntp.c (which are visible to userspace through
> adjtimex(2)). In particular, `time_esterror` and `time_maxerror` should get reset
> to `NTP_PHASE_LIMIT` and time_status should get reset to `STA_UNSYNC`.

I do not disagree that NTP needs to throw the book out after a live
migration.

But, the issue is how we convey that to the guest. KVM_PVCLOCK_STOPPED
relies on the guest polling a shared structure, and who knows when the
guest is going to check the structure again? If we inject an interrupt
the guest is likely to check this state in a reasonable amount of time.

Thomas, we're talking about how to not wreck time (as bad) under
virtualization. I know this has been an area of interest to you for a
while ;-) The idea is that the hypervisor should let the guest know
about time travel.

Let's just assume for now that the hypervisor will *not* quiesce
the guest into an S2IDLE state before migration. I think quiesced
migrations are a good thing to have in the toolbelt, but there will
always be host-side problems that require us to migrate a VM off a host
immediately with no time to inform the guest.

Given that, we're deciding which clock is going to get wrecked during
a migration and what the guest can do afterwards to clean it up.
Whichever clock gets wrecked is going to have a window where reads race
with the eventual fix, and could be completely wrong. My thoughts:

We do not advance the TSC during a migration and notify the
guest (interrupt, shared structure) about how much it has
time traveled (delta_REALTIME). REALTIME is wrong until the interrupt
is handled in the guest, but should fire off all of the existing
mechanisms for a clock step. Userspace gets notified with
TFD_TIMER_CANCEL_ON_SET. I believe you have proposed something similar
as a way to make live migration less sinister from the guest
perspective.

It seems possible to block racing reads of REALTIME if we protect it with
a migration sequence counter. Host raises the sequence after a migration
when control is yielded back to the guest. The sequence is odd if an
update is pending. Guest increments the sequence again after the
interrupt handler accounts for time travel. That has the side effect of
blocking all realtime clock reads until the interrupt is handled. But
what are the value of those reads if we know they're wrong? There is
also the implication that said shared memory interface gets mapped
through to userspace for vDSO, haven't thought at all about those
implications yet.

Doing this the other way around (advance the TSC, tell the guest to fix
MONOTONIC) is fundamentally wrong, as it violates two invariants of the
monotonic clock. Monotonic counts during a migration, which really is a
forced suspend. Additionally, you cannot step the monotonic clock.

Thoughts?

Sorry to revisit this conversation yet again. Virtualization isn't going
away any time soon and the illusion that migrations are invisible to the
guest is simply not true.

--
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
@ 2022-03-22 19:18 Franke, Daniel
  2022-03-22 21:53 ` Oliver Upton
  0 siblings, 1 reply; 33+ messages in thread
From: Franke, Daniel @ 2022-03-22 19:18 UTC (permalink / raw)
  To: Oliver Upton, David Woodhouse
  Cc: Paolo Bonzini, kvm, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel

On 3/21/22, 5:24 PM, "Oliver Upton" <oupton@google.com> wrote:
 >  Right, but I'd argue that interface has some problems too. It
 >   depends on the guest polling instead of an interrupt from the
 >   hypervisor. It also has no way of informing the kernel exactly how much
 >   time has elapsed.

 >   The whole point of all these hacks that we've done internally is that we,
 >   the hypervisor, know full well how much real time hasv advanced during the
 >   VM blackout. If we can at least let the guest know how much to fudge real
 >   time, it can then poke NTP for better refinement. I worry about using NTP
 >   as the sole source of truth for such a mechanism, since you'll need to go
 >   out to the network and any reads until the response comes back are hosed.

(I'm a kernel newbie, so please excuse any ignorance with respect to kernel
Internals or kernel/hypervisor interfaces.)

We can have it both ways, I think. Let the hypervisor manipulate the guest TSC
so as to keep the guest kernel's idea of real time as accurate as possible 
without any awareness required on the guest's side. *Also* give the guest kernel
a notification in the form of a KVM_PVCLOCK_STOPPED event or whatever else,
and let the kernel propagate this notification to userspace so that the NTP
daemon can recombobulate itself as quickly as possible, treating whatever TSC
adjustment was received as best-effort only.

The KVM_PVCLOCK_STOPPED event should trigger a change in some of the
globals kept by kernel/time/ntp.c (which are visible to userspace through
adjtimex(2)). In particular, `time_esterror` and `time_maxerror` should get reset
to `NTP_PHASE_LIMIT` and time_status should get reset to `STA_UNSYNC`.

The aforementioned fields get overwritten by the NTP daemon every polling
interval, and whatever notification gets sent to userspace is going to be
asynchronous, so we need to avoid race conditions wherein userspace clobbers
them with its outdated idea of their correct value before the notification arrives.
I propose allocating one of the unused fields at the end of `struct timex` as a
`clockver` field. This field will be 0 at boot time, and get incremented after
clock discontinuity such as a KVM blackout, an ACPI suspend-to-RAM event,
or a manual setting of the clock through clock_settime(3) or similar. Augment
`modes` with an `ADJ_CLOCKVER` bit, with the semantics "make this call fail
If the `clockver` I passed in is not current".

As for the form of the notification to userspace... dunno, I think a netlink
interface makes the most sense? I'm not too opinionated here. Whatever we
decide, I'll be happy to contribute the patches to chrony and ntpd. The NTP
daemon should handle this notification by essentially resetting its state to
how it was at first startup, considering itself unsynchronized and immediately
kicking off new queries to all its peers.

One other thing the NTP daemon might want to do with this notification is
clobber its drift file, but only if it's running on different hardware than before.
But this requires getting that bit of information from the hypervisor to the
kernel and I haven't thought about how this should work.

There's a minor problem remaining here, which impacts NTP servers but not
clients. NTP servers don't use adjtimex to get the timestamps the use to
populate NTP packets; they just use clock_gettime, because the latter is a
VDSO call and the former is not. That means that they may not learn of the
`clockver` step in time to answer any queries that come in immediately
post-blackout. So, the server may briefly become a falseticker if the
hypervisor did too poor a job with the TSC step. I don't think this is a big
deal or something that needs to be addressed in the first iteration. The
long-term solution is to make adjtimex also be a VDSO when `modes` is
0, so that NTP daemons can use it for everything.

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2022-07-05 14:43 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-16  4:53 [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm Oliver Upton
2022-03-16  7:47 ` Paolo Bonzini
2022-03-18 18:39   ` Oliver Upton
2022-03-19  7:52     ` Paolo Bonzini
2022-03-19  7:59       ` Oliver Upton
2022-03-19  8:08         ` David Woodhouse
2022-03-19 11:54           ` Paolo Bonzini
2022-03-19 13:00             ` Paolo Bonzini
2022-03-19 13:13               ` David Woodhouse
2022-03-20  8:10                 ` Paolo Bonzini
2022-03-20  8:52                   ` Oliver Upton
2022-03-20  9:46                     ` David Woodhouse
2022-03-21  0:38                       ` Oliver Upton
2022-03-21 19:43                         ` David Woodhouse
2022-03-21 21:23                           ` Oliver Upton
2022-03-20 13:39                     ` Paolo Bonzini
2022-03-21  0:51                       ` Oliver Upton
2022-03-21 12:36                         ` Paolo Bonzini
2022-03-21 12:56                           ` David Woodhouse
2022-03-21 12:16                       ` David Woodhouse
2022-03-21 13:10                         ` Paolo Bonzini
2022-03-21 14:59                           ` David Woodhouse
2022-03-22 19:18 Franke, Daniel
2022-03-22 21:53 ` Oliver Upton
2022-03-23 12:35   ` David Woodhouse
2022-03-23 16:21     ` Oliver Upton
2022-03-25  9:03       ` David Woodhouse
2022-03-25 17:47         ` Oliver Upton
2022-03-29 14:19   ` Thomas Gleixner
2022-03-29 16:02     ` Oliver Upton
2022-03-29 19:34       ` Thomas Gleixner
2022-06-30 11:58       ` David Woodhouse
2022-07-05 14:43         ` David Woodhouse

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.