* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
@ 2022-03-22 19:18 Franke, Daniel
  2022-03-22 21:53 ` Oliver Upton
From: Franke, Daniel @ 2022-03-22 19:18 UTC (permalink / raw)
  To: Oliver Upton, David Woodhouse
  Cc: Paolo Bonzini, kvm, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel

On 3/21/22, 5:24 PM, "Oliver Upton" <oupton@google.com> wrote:
 >  Right, but I'd argue that interface has some problems too. It
 >   depends on the guest polling instead of an interrupt from the
 >   hypervisor. It also has no way of informing the kernel exactly how much
 >   time has elapsed.

 >   The whole point of all these hacks that we've done internally is that we,
 >   the hypervisor, know full well how much real time has advanced during the
 >   VM blackout. If we can at least let the guest know how much to fudge real
 >   time, it can then poke NTP for better refinement. I worry about using NTP
 >   as the sole source of truth for such a mechanism, since you'll need to go
 >   out to the network and any reads until the response comes back are hosed.

(I'm a kernel newbie, so please excuse any ignorance with respect to kernel
internals or kernel/hypervisor interfaces.)

We can have it both ways, I think. Let the hypervisor manipulate the guest TSC
so as to keep the guest kernel's idea of real time as accurate as possible 
without any awareness required on the guest's side. *Also* give the guest kernel
a notification in the form of a KVM_PVCLOCK_STOPPED event or whatever else,
and let the kernel propagate this notification to userspace so that the NTP
daemon can recombobulate itself as quickly as possible, treating whatever TSC
adjustment was received as best-effort only.

The KVM_PVCLOCK_STOPPED event should trigger a change in some of the
globals kept by kernel/time/ntp.c (which are visible to userspace through
adjtimex(2)). In particular, `time_esterror` and `time_maxerror` should get reset
to `NTP_PHASE_LIMIT` and time_status should get reset to `STA_UNSYNC`.
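
To make that concrete, here is a rough sketch of what such a handler in
kernel/time/ntp.c might look like; the function name and whatever call site
the notification lands in are made up, but the globals are the real ones
exposed through adjtimex(2):

    /* Hypothetical handler, called when the kernel learns of a clock
     * discontinuity (e.g. the proposed KVM_PVCLOCK_STOPPED notification).
     * It mirrors what ntp.c already does when synchronization is lost.
     */
    static void ntp_clock_discontinuity(void)
    {
            time_status |= STA_UNSYNC;              /* no longer synchronized */
            time_esterror = NTP_PHASE_LIMIT;        /* error estimates are now worthless */
            time_maxerror = NTP_PHASE_LIMIT;
    }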

The aforementioned fields get overwritten by the NTP daemon every polling
interval, and whatever notification gets sent to userspace is going to be
asynchronous, so we need to avoid race conditions wherein userspace clobbers
them with its outdated idea of their correct value before the notification arrives.
I propose allocating one of the unused fields at the end of `struct timex` as a
`clockver` field. This field will be 0 at boot time, and get incremented after
any clock discontinuity, such as a KVM blackout, an ACPI suspend-to-RAM event,
or a manual setting of the clock through clock_settime(3) or similar. Augment
`modes` with an `ADJ_CLOCKVER` bit, with the semantics "make this call fail
if the `clockver` I passed in is not current".
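
A sketch of the kernel side, purely to illustrate the semantics; the
ADJ_CLOCKVER value, the `clockver` member, and the `clock_version` counter are
all hypothetical:

    #define ADJ_CLOCKVER    0x0200  /* hypothetical mode bit: fail if clockver is stale */

    /* in do_adjtimex(), before any state is modified: */
    if ((txc->modes & ADJ_CLOCKVER) &&
        txc->clockver != READ_ONCE(clock_version))
            return -EAGAIN;         /* a discontinuity happened since the caller last looked */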

As for the form of the notification to userspace... dunno, I think a netlink
interface makes the most sense? I'm not too opinionated here. Whatever we
decide, I'll be happy to contribute the patches to chrony and ntpd. The NTP
daemon should handle this notification by essentially resetting its state to
how it was at first startup, considering itself unsynchronized and immediately
kicking off new queries to all its peers.

One other thing the NTP daemon might want to do with this notification is
clobber its drift file, but only if it's running on different hardware than before.
But this requires getting that bit of information from the hypervisor to the
kernel and I haven't thought about how this should work.

There's a minor problem remaining here, which impacts NTP servers but not
clients. NTP servers don't use adjtimex to get the timestamps they use to
populate NTP packets; they just use clock_gettime, because the latter is a
vDSO call and the former is not. That means that they may not learn of the
`clockver` step in time to answer any queries that come in immediately
post-blackout. So, the server may briefly become a falseticker if the
hypervisor did too poor a job with the TSC step. I don't think this is a big
deal or something that needs to be addressed in the first iteration. The
long-term solution is to make adjtimex also a vDSO call when `modes` is
0, so that NTP daemons can use it for everything.



* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-22 19:18 [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm Franke, Daniel
@ 2022-03-22 21:53 ` Oliver Upton
  2022-03-23 12:35   ` David Woodhouse
  2022-03-29 14:19   ` Thomas Gleixner
From: Oliver Upton @ 2022-03-22 21:53 UTC (permalink / raw)
  To: Franke, Daniel
  Cc: David Woodhouse, Paolo Bonzini, kvm, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner

Hi Daniel,

As we're venturing into the realm of timekeeping and he has noted
the evils of virtualization several times, I've CC'ed Thomas on this
mail. Some comments inline on the ongoing discussion.

On Tue, Mar 22, 2022 at 07:18:20PM +0000, Franke, Daniel wrote:
> On 3/21/22, 5:24 PM, "Oliver Upton" <oupton@google.com> wrote:
>  >  Right, but I'd argue that interface has some problems too. It
>  >   depends on the guest polling instead of an interrupt from the
>  >   hypervisor. It also has no way of informing the kernel exactly how much
>  >   time has elapsed.
> 
>  >   The whole point of all these hacks that we've done internally is that we,
>  >   the hypervisor, know full well how much real time has advanced during the
>  >   VM blackout. If we can at least let the guest know how much to fudge real
>  >   time, it can then poke NTP for better refinement. I worry about using NTP
>  >   as the sole source of truth for such a mechanism, since you'll need to go
>  >   out to the network and any reads until the response comes back are hosed.
> 
> (I'm a kernel newbie, so please excuse any ignorance with respect to kernel
> internals or kernel/hypervisor interfaces.)

Welcome :-)

> We can have it both ways, I think. Let the hypervisor manipulate the guest TSC
> so as to keep the guest kernel's idea of real time as accurate as possible 
> without any awareness required on the guest's side. *Also* give the guest kernel
> a notification in the form of a KVM_PVCLOCK_STOPPED event or whatever else,
> and let the kernel propagate this notification to userspace so that the NTP
> daemon can recombobulate itself as quickly as possible, treating whatever TSC
> adjustment was received as best-effort only.

But what happens to CLOCK_MONOTONIC in this case? We are still accepting
the fact that live migrations destroy CLOCK_MONOTONIC if we directly
advance the guest TSCs to account for elapsed time. The definition of
CLOCK_MONOTONIC is that the clock does not count while the system is
suspended. From the viewpoint of the guest, a live migration appears to
be a forced suspend operation at an arbitrary instruction boundary.
There is no realistic way for the guest to give the illusion that
MONOTONIC has stopped without help from the hypervisor.

> The KVM_PVCLOCK_STOPPED event should trigger a change in some of the
> globals kept by kernel/time/ntp.c (which are visible to userspace through
> adjtimex(2)). In particular, `time_esterror` and `time_maxerror` should get reset
> to `NTP_PHASE_LIMIT` and time_status should get reset to `STA_UNSYNC`.

I do not disagree that NTP needs to throw the book out after a live
migration.

But, the issue is how we convey that to the guest. KVM_PVCLOCK_STOPPED
relies on the guest polling a shared structure, and who knows when the
guest is going to check the structure again? If we inject an interrupt
the guest is likely to check this state in a reasonable amount of time.

Thomas, we're talking about how to not wreck time (as bad) under
virtualization. I know this has been an area of interest to you for a
while ;-) The idea is that the hypervisor should let the guest know
about time travel.

Let's just assume for now that the hypervisor will *not* quiesce
the guest into an S2IDLE state before migration. I think quiesced
migrations are a good thing to have in the toolbelt, but there will
always be host-side problems that require us to migrate a VM off a host
immediately with no time to inform the guest.

Given that, we're deciding which clock is going to get wrecked during
a migration and what the guest can do afterwards to clean it up.
Whichever clock gets wrecked is going to have a window where reads race
with the eventual fix, and could be completely wrong. My thoughts:

We do not advance the TSC during a migration and notify the
guest (interrupt, shared structure) about how much it has
time traveled (delta_REALTIME). REALTIME is wrong until the interrupt
is handled in the guest, but should fire off all of the existing
mechanisms for a clock step. Userspace gets notified with
TFD_TIMER_CANCEL_ON_SET. I believe you have proposed something similar
as a way to make live migration less sinister from the guest
perspective.
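
(For reference, that wakeup mechanism already exists today; a userspace sketch
of how a daemon gets notified when REALTIME is stepped, nothing new being
proposed here:)

    #include <sys/timerfd.h>
    #include <errno.h>
    #include <limits.h>
    #include <stdint.h>
    #include <unistd.h>

    /* Arm a CLOCK_REALTIME timerfd far in the future with
     * TFD_TIMER_CANCEL_ON_SET; read() then fails with ECANCELED whenever
     * the realtime clock is stepped, e.g. after a migration.
     */
    static int wait_for_clock_step(void)
    {
            int fd = timerfd_create(CLOCK_REALTIME, 0);
            struct itimerspec its = { .it_value.tv_sec = INT_MAX };
            uint64_t ticks;

            timerfd_settime(fd, TFD_TIMER_ABSTIME | TFD_TIMER_CANCEL_ON_SET,
                            &its, NULL);
            if (read(fd, &ticks, sizeof(ticks)) < 0 && errno == ECANCELED)
                    return 1;       /* the clock was stepped under us */
            return 0;
    }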

It seems possible to block racing reads of REALTIME if we protect it with
a migration sequence counter. Host raises the sequence after a migration
when control is yielded back to the guest. The sequence is odd if an
update is pending. Guest increments the sequence again after the
interrupt handler accounts for time travel. That has the side effect of
blocking all realtime clock reads until the interrupt is handled. But
what is the value of those reads if we know they're wrong? There is
also the implication that said shared memory interface gets mapped
through to userspace for the vDSO; I haven't thought at all about those
implications yet.
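
For illustration, the read side could look roughly like this (all names made
up, and this glosses over the vDSO mapping question entirely):

    /* mig_seq lives in memory shared with the hypervisor. An odd value
     * means a time-travel fixup is pending, so realtime reads spin until
     * the guest's interrupt handler has applied delta_REALTIME and made
     * the count even again.
     */
    u64 read_realtime_ns(void)
    {
            u32 seq;
            u64 ns;

            do {
                    seq = READ_ONCE(shared->mig_seq);
                    while (seq & 1) {       /* update pending: block the read */
                            cpu_relax();
                            seq = READ_ONCE(shared->mig_seq);
                    }
                    ns = ktime_get_real_ns();
            } while (READ_ONCE(shared->mig_seq) != seq);

            return ns;
    }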

Doing this the other way around (advance the TSC, tell the guest to fix
MONOTONIC) is fundamentally wrong, as it violates two invariants of the
monotonic clock. Monotonic counts during a migration, which really is a
forced suspend. Additionally, you cannot step the monotonic clock.

Thoughts?

Sorry to revisit this conversation yet again. Virtualization isn't going
away any time soon and the illusion that migrations are invisible to the
guest is simply not true.

--
Thanks,
Oliver


* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-22 21:53 ` Oliver Upton
@ 2022-03-23 12:35   ` David Woodhouse
  2022-03-23 16:21     ` Oliver Upton
  2022-03-29 14:19   ` Thomas Gleixner
From: David Woodhouse @ 2022-03-23 12:35 UTC (permalink / raw)
  To: Oliver Upton, Franke, Daniel
  Cc: Paolo Bonzini, kvm, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner


On Tue, 2022-03-22 at 21:53 +0000, Oliver Upton wrote:
> But what happens to CLOCK_MONOTONIC in this case? We are still accepting
> the fact that live migrations destroy CLOCK_MONOTONIC if we directly
> advance the guest TSCs to account for elapsed time. The definition of
> CLOCK_MONOTONIC is that the clock does not count while the system is
> suspended. From the viewpoint of the guest, a live migration appears to
> be a forced suspend operation at an arbitrary instruction boundary.
> There is no realistic way for the guest to give the illusion that
> MONOTONIC has stopped without help from the hypervisor.

I'm a little lost there. CLOCK_MONOTONIC is *permitted* to stop when
the guest is suspended, but it's not *mandatory*, surely?

I can buy your assertion that the brownout period of a live migration
(or the time for the host kernel to kexec in the case of live update) 
can be considered a suspend... but regardless of whether that makes it
mandatory to stop the clocks, I prefer to see it a different way. 

In normal operation — especially with CPU overcommit and/or throttling
— there are times when none of the guest's vCPUs will be scheduled for
short periods of time. We don't *have* to stop CLOCK_MONOTONIC when
that happens, do we?

If we want live migration to be guest-transparent, shouldn't we treat
it as much as possible as one of those times when the vCPUs just happen
not to be running for a moment?

On a live update, where the host does a kexec and then resumes the
guest state, the host TSC reference is precisely the same as before the
update. We basically don't want to change *anything* that the guest sees
in its pvclock information. In fact, we have a local patch to
'KVM_SET_CLOCK_FROM_GUEST' for the live update case, which ensures
exactly that. We then add a delta to the guest TSC as we create each
vCPU in the new KVM; the *offset* interface would be beneficial to us
here (since that offset doesn't change) but we're not using it yet.

For live migration, the same applies — we can just add a delta to the
clock and the guest TSC values, commensurate with the amount of
wallclock time that elapsed from serialisation on the source host, to
deserialisation on the destination.

And it all looks *just* like it would if the vCPUs happened not to be
scheduled for a little while, because the host was busy.
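
The arithmetic for that delta is trivial; something along these lines on the
VMM side, where elapsed_ns is the wallclock time between serialisation and
deserialisation (names are illustrative, not an existing interface):

    /* Guest TSC cycles corresponding to the wallclock time the VM was out. */
    static uint64_t tsc_delta_for_elapsed(uint64_t elapsed_ns, uint64_t tsc_khz)
    {
            /* cycles = ns * kHz / 10^6; widen to 128 bits to avoid overflow */
            return (uint64_t)(((unsigned __int128)elapsed_ns * tsc_khz) / 1000000);
    }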


> > The KVM_PVCLOCK_STOPPED event should trigger a change in some of the
> > globals kept by kernel/time/ntp.c (which are visible to userspace through
> > adjtimex(2)). In particular, `time_esterror` and `time_maxerror` should get reset
> > to `NTP_PHASE_LIMIT` and time_status should get reset to `STA_UNSYNC`.
> 
> I do not disagree that NTP needs to throw the book out after a live
> migration.
> 
> But, the issue is how we convey that to the guest. KVM_PVCLOCK_STOPPED
> relies on the guest polling a shared structure, and who knows when the
> guest is going to check the structure again? If we inject an interrupt
> the guest is likely to check this state in a reasonable amount of time.

Ah, but that's the point. A flag in shared memory can be checked
whenever the guest needs to know that it's operating on valid state.
Linux will check it *every* time from pvclock_clocksource_read().

As opposed to a separate interrupt which eventually gets processed some
indefinite amount of time in the future.
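
For reference, the check in question is roughly this shape (simplified from
the pvclock read path):

    /* Every pvclock read can observe the flag, so the guest notices a
     * stop/resume on its very next clock read rather than waiting for an
     * interrupt to be delivered and handled.
     */
    if (unlikely(src->flags & PVCLOCK_GUEST_STOPPED)) {
            src->flags &= ~PVCLOCK_GUEST_STOPPED;
            /* ...react here: e.g. mark the NTP state unsynchronized... */
    }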

> Doing this the other way around (advance the TSC, tell the guest to fix
> MONOTONIC) is fundamentally wrong, as it violates two invariants of the
> monotonic clock. Monotonic counts during a migration, which really is a
> forced suspend. Additionally, you cannot step the monotonic clock.
> 
I don't understand why we can't "step the monotonic clock". Any time we
merely refrain from scheduling the vCPUs for any period of time, that
is surely indistinguishable from a "step" in the monotonic clock?

> Sorry to revisit this conversation yet again. Virtualization isn't going
> away any time soon and the illusion that migrations are invisible to the
> guest is simply not true.

I'll give you the assertion that migrations aren't completely
invisible, but I still think they should be *equivalent* to the vCPU
just not being scheduled for a moment.




* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-23 12:35   ` David Woodhouse
@ 2022-03-23 16:21     ` Oliver Upton
  2022-03-25  9:03       ` David Woodhouse
From: Oliver Upton @ 2022-03-23 16:21 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Franke, Daniel, Paolo Bonzini, kvm, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner

On Wed, Mar 23, 2022 at 12:35:17PM +0000, David Woodhouse wrote:
> On Tue, 2022-03-22 at 21:53 +0000, Oliver Upton wrote:
> > But what happens to CLOCK_MONOTONIC in this case? We are still accepting
> > the fact that live migrations destroy CLOCK_MONOTONIC if we directly
> > advance the guest TSCs to account for elapsed time. The definition of
> > CLOCK_MONOTONIC is that the clock does not count while the system is
> > suspended. From the viewpoint of the guest, a live migration appears to
> > be a forced suspend operation at an arbitrary instruction boundary.
> > There is no realistic way for the guest to give the illusion that
> > MONOTONIC has stopped without help from the hypervisor.
> 
> I'm a little lost there. CLOCK_MONOTONIC is *permitted* to stop when
> the guest is suspended, but it's not *mandatory*, surely?
> 
> I can buy your assertion that the brownout period of a live migration
> (or the time for the host kernel to kexec in the case of live update) 
> can be considered a suspend... but regardless of whether that makes it
> mandatory to stop the clocks, I prefer to see it a different way. 
> 
> In normal operation — especially with CPU overcommit and/or throttling
> — there are times when none of the guest's vCPUs will be scheduled for
> short periods of time. We don't *have* to stop CLOCK_MONOTONIC when
> that happens, do we?
> 

You're absolutely right. We've at least accepted MONOTONIC behaving the
way it does for time lost to host scheduling, and expose this through
steal_time for the guest scheduler.

> If we want live migration to be guest-transparent, shouldn't we treat
> it as much as possible as one of those times when the vCPUs just happen
> not to be running for a moment?

There is still a subtle difference between host scheduler pressure and
live migration. It's hard to crisply state whether or not the VM is
actually suspended, as any one of its vCPU threads could actually be
running. Migration is one of those events where we positively know the
guest isn't running at all.

> On a live update, where the host does a kexec and then resumes the
> guest state, the host TSC reference is precisely the same as before the
> update. We basically don't want to change *anything* that the guest sees
> in its pvclock information. In fact, we have a local patch to
> 'KVM_SET_CLOCK_FROM_GUEST' for the live update case, which ensures
> exactly that. We then add a delta to the guest TSC as we create each
> vCPU in the new KVM; the *offset* interface would be beneficial to us
> here (since that offset doesn't change) but we're not using it yet.
> 
> For live migration, the same applies — we can just add a delta to the
> clock and the guest TSC values, commensurate with the amount of
> wallclock time that elapsed from serialisation on the source host, to
> deserialisation on the destination.
> 
> And it all looks *just* like it would if the vCPUs happened not to be
> scheduled for a little while, because the host was busy.

We could continue to get away with TSC advancement, but the critical
part IMO is the upper bound. And what happens when we exceed it.

There is no authoritative documentation around what time looks like as a
guest of KVM, and furthermore what happens when a guest experiences time
travel. Now we're in a particularly undesirable situation where there
are at least three known definitions for time during a migration
(upstream QEMU, Google, Amazon) and it is ~impossible to program guest
software to anticipate our shenanigans.

If we are to do this right we probably need to agree on documented
behavior. If we decide that advancing TSCs is acceptable up to 'X'
seconds, guest kernels could take a change to relax expectations at
least up to this value.

> > > The KVM_PVCLOCK_STOPPED event should trigger a change in some of the
> > > globals kept by kernel/time/ntp.c (which are visible to userspace through
> > > adjtimex(2)). In particular, `time_esterror` and `time_maxerror` should get reset
> > > to `NTP_PHASE_LIMIT` and time_status should get reset to `STA_UNSYNC`.
> > 
> > I do not disagree that NTP needs to throw the book out after a live
> > migration.
> > 
> > But, the issue is how we convey that to the guest. KVM_PVCLOCK_STOPPED
> > relies on the guest polling a shared structure, and who knows when the
> > guest is going to check the structure again? If we inject an interrupt
> > the guest is likely to check this state in a reasonable amount of time.
> 
> Ah, but that's the point. A flag in shared memory can be checked
> whenever the guest needs to know that it's operating on valid state.
> Linux will check it *every* time from pvclock_clocksource_read().
> 
> As opposed to a separate interrupt which eventually gets processed some
> indefinite amount of time in the future.

There are a few annoying things with pvclock, though. It is a per-vCPU
structure, so special care must be taken to act exactly once on a
migration. Also, since commit 7539b174aef4 ("x86: kvmguest: use TSC
clocksource if invariant TSC is exposed") the guest kernel could pick
the TSC over the pvclock by default, so it's hard to say when the pvclock
structure is checked again. This is what I had in mind when suggesting a
doorbell is needed, as there is no good way to know what clocksource the
guest is using.

> > Doing this the other way around (advance the TSC, tell the guest to fix
> > MONOTONIC) is fundamentally wrong, as it violates two invariants of the
> > monotonic clock. Monotonic counts during a migration, which really is a
> > forced suspend. Additionally, you cannot step the monotonic clock.
> > 
> I don't understand why we can't "step the monotonic clock". Any time we
> merely refrain from scheduling the vCPUs for any period of time, that
> is surely indistinguishable from a "step" in the monotonic clock?

Right, there is some nebulous threshold that we've defined as an
acceptable amount of time to 'step' the monotonic clock. I think that
everything to date is built around the assumption that it is a small
amount of time of O(timeslice). Pinning down the upper bound should at
least make clocks more predictable on virtualization.

> > Sorry to revisit this conversation yet again. Virtualization isn't going
> > away any time soon and the illusion that migrations are invisible to the
> > guest is simply not true.
> 
> I'll give you the assertion that migrations aren't completely
> invisible, but I still think they should be *equivalent* to the vCPU
> just not being scheduled for a moment.

I sure hope that migrations are fast enough that it is indistinguishable
from scheduler pressure. I think the situations where that is not the
case are particularly interesting. Defining a limit and having a
mechanism for remedial action could make things more predictable for
guest software.

But agree, and shame on us for the broken virtual hardware when that
isn't the case :-)

--
Thanks,
Oliver



* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-23 16:21     ` Oliver Upton
@ 2022-03-25  9:03       ` David Woodhouse
  2022-03-25 17:47         ` Oliver Upton
From: David Woodhouse @ 2022-03-25  9:03 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Franke, Daniel, Paolo Bonzini, kvm, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner


On Wed, 2022-03-23 at 16:21 +0000, Oliver Upton wrote:
> On Wed, Mar 23, 2022 at 12:35:17PM +0000, David Woodhouse wrote:
> > And it all looks *just* like it would if the vCPUs happened not to be
> > scheduled for a little while, because the host was busy.
> 
> We could continue to get away with TSC advancement, but the critical
> part IMO is the upper bound. And what happens when we exceed it.
> 
> There is no authoritative documentation around what time looks like as a
> guest of KVM, and furthermore what happens when a guest experiences time
> travel. Now we're in a particularly undesirable situation where there
> are at least three known definitions for time during a migration
> (upstream QEMU, Google, Amazon) and it is ~impossible to program guest
> software to anticipate our shenanigans.
> 
> If we are to do this right we probably need to agree on documented
> behavior. If we decide that advancing TSCs is acceptable up to 'X'
> seconds, guest kernels could take a change to relax expectations at
> least up to this value.

I'm not sure I even buy this line of argument at all. I don't think of
it as 'TSC advancement'. Or maybe I should put that the other way
round: TSCs advance *all* the time; that's the whole point in them. 

All we do when we live-migrate is move the state from one host to
another, completely intact. From the guest point of view we don't
"advance" the TSC; we just set the guest TSC on the destination host to
precisely¹ the same value that the guest TSC on the source host would
have at that moment if we were to abort the migration and do a
rollback. Either way (rollback or not), the vCPUs weren't *running* for
a period of time, but time and TSCs are entirely the same.

In fact, on live update we literally *don't* change anything that the
guest sees. The pvclock information that the guest gets is *precisely*
the same as before, and the TSC has 'advanced' purely by nature of
being precisely the same CPU just a short period of time in the
future... because that's what TSCs do when time passes :)

Really, we aren't talking about a "maximum advancement of the TSC".
We're talking about a "maximum period for which the vCPUs aren't
scheduled".

And I don't see why there need to be documented hard limits on that. I
see it as a quality of implementation issue. If the host is in swap
death and doesn't manage to run the guest vCPUs for seconds at a time,
that's fairly crappy, but it's not strictly a *timekeeping* issue. And
it's just the same if the final transition period of live migration (or
the kexec time for live update) is excessive.

> > > > The KVM_PVCLOCK_STOPPED event should trigger a change in some of the
> > > > globals kept by kernel/time/ntp.c (which are visible to userspace through
> > > > adjtimex(2)). In particular, `time_esterror` and `time_maxerror` should get reset
> > > > to `NTP_PHASE_LIMIT` and time_status should get reset to `STA_UNSYNC`.
> > > 
> > > I do not disagree that NTP needs to throw the book out after a live
> > > migration.
> > > 

Right. To recap, this is because where I highlighted 'precisely¹'
above, our accuracy is actually limited to the synchronisation of the
wallclock time between the two hosts. If the guest thinks it has a more
accurate NTP sync than either of the hosts do, we may have introduced
an error which is more than the error bounds the guest thinks it has,
and that may ultimately lead to data corruption in some circumstances.

This part is somewhat orthogonal to the above discussion, isn't it?
Regardless of whether we 'step' the TSC or not, we need to signal the
guest to know that it needs to consider itself unsynchronized (or, if
we want to get really fancy, let it know the upper bounds of the error
we just introduced. But let's not).

> > > But, the issue is how we convey that to the guest. KVM_PVCLOCK_STOPPED
> > > relies on the guest polling a shared structure, and who knows when the
> > > guest is going to check the structure again? If we inject an interrupt
> > > the guest is likely to check this state in a reasonable amount of time.
> > 
> > Ah, but that's the point. A flag in shared memory can be checked
> > whenever the guest needs to know that it's operating on valid state.
> > Linux will check it *every* time from pvclock_clocksource_read().
> > 
> > As opposed to a separate interrupt which eventually gets processed some
> > indefinite amount of time in the future.
> 
> There are a few annoying things with pvclock, though. It is a per-vCPU
> structure, so special care must be taken to act exactly once on a
> migration. Also, since commit 7539b174aef4 ("x86: kvmguest: use TSC
> clocksource if invariant TSC is exposed") the guest kernel could pick
> the TSC over the pvclock by default, so it's hard to say when the pvclock
> structure is checked again.

That commit appears to be assuming that the TSC *will* be "stepped", or
as I call it "continue to advance normally at its normal frequency over
elapsed time".

>  This is what I had in mind when suggesting a doorbell is needed, as
> there is no good way to know what clocksource the guest is using.

Yes, perhaps that might be necessary.

The concern is that by the time that doorbell interrupt has finally
been delivered and processed, an inaccurate timestamp could *already*
have been used on the wire in some application's coherency
protocol, and the guest's database could already be hosed.

But I don't see why we can't have both.  I think it makes sense for the
guest kernel to ditch its NTP sync when it sees PVCLOCK_GUEST_STOPPED,
but I'm not entirely averse to the existence of a doorbell mechanism
such as you describe.

> > I'll give you the assertion that migrations aren't completely
> > invisible, but I still think they should be *equivalent* to the vCPU
> > just not being scheduled for a moment.
> 
> I sure hope that migrations are fast enough that it is indistinguishable
> from scheduler pressure. I think the situations where that is not the
> case are particularly interesting. Defining a limit and having a
> mechanism for remedial action could make things more predictable for
> guest software.
> 
> But agree, and shame on us for the broken virtual hardware when that
> isn't the case :-)

Right. The "how to tell the guest that it needs to ditch its NTP sync
status" question needs some more thought but in the short term I think
it makes sense to add get/set TSC mechanisms which work like
KVM_[SG]ET_CLOCK with the KVM_CLOCK_REALTIME flag.

Instead of realtime though, it should use the KVM clock. So reading a
TSC returns a { kvm_ns, TSC value } tuple, and setting it will advance
the value by the appropriate amount just as we advance the KVM clock
for the KVM_CLOCK_REALTIME delta.
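
One possible shape for that, purely illustrative (neither the structure nor
any ioctl name here exists today):

    struct kvm_vcpu_tsc_state {
            __u64 kvm_ns;   /* KVM clock at the instant the TSC was sampled */
            __u64 tsc;      /* guest TSC value at that same instant */
    };

    /*
     * On restore, the saved TSC is advanced by the KVM clock delta,
     * analogous to the existing KVM_CLOCK_REALTIME handling:
     *
     *     tsc_now = saved.tsc + (kvm_ns_now - saved.kvm_ns) * tsc_khz / 10^6
     */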





* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-25  9:03       ` David Woodhouse
@ 2022-03-25 17:47         ` Oliver Upton
From: Oliver Upton @ 2022-03-25 17:47 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Franke, Daniel, Paolo Bonzini, kvm, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner

On Fri, Mar 25, 2022 at 09:03:50AM +0000, David Woodhouse wrote:
> On Wed, 2022-03-23 at 16:21 +0000, Oliver Upton wrote:
> > On Wed, Mar 23, 2022 at 12:35:17PM +0000, David Woodhouse wrote:
> > > And it all looks *just* like it would if the vCPUs happened not to be
> > > scheduled for a little while, because the host was busy.
> > 
> > We could continue to get away with TSC advancement, but the critical
> > part IMO is the upper bound. And what happens when we exceed it.
> > 
> > There is no authoritative documentation around what time looks like as a
> > guest of KVM, and furthermore what happens when a guest experiences time
> > travel. Now we're in a particularly undesirable situation where there
> > are at least three known definitions for time during a migration
> > (upstream QEMU, Google, Amazon) and it is ~impossible to program guest
> > software to anticipate our shenanigans.
> > 
> > If we are to do this right we probably need to agree on documented
> > behavior. If we decide that advancing TSCs is acceptable up to 'X'
> > seconds, guest kernels could take a change to relax expectations at
> > least up to this value.
> 
> I'm not sure I even buy this line of argument at all. I don't think of
> it as 'TSC advancement'. Or maybe I should put that the other way
> round: TSCs advance *all* the time; that's the whole point in them. 

The only reason I use the term 'TSC advancement' is to account for the
fact that the guest probably feels *very* differently about what the
advancement time actually is. The world around it has continued onwards
while it was out.

> All we do when we live-migrate is move the state from one host to
> another, completely intact. From the guest point of view we don't
> "advance" the TSC; we just set the guest TSC on the destination host to
> precisely¹ the same value that the guest TSC on the source host would
> have at that moment if we were to abort the migration and do a
> rollback. Either way (rollback or not), the vCPUs weren't *running* for
> a period of time, but time and TSCs are entirely the same.
> 
> In fact, on live update we literally *don't* change anything that the
> guest sees. The pvclock information that the guest gets is *precisely*
> the same as before, and the TSC has 'advanced' purely by nature of
> being precisely the same CPU just a short period of time in the
> future... because that's what TSCs do when time passes :)
> 
> Really, we aren't talking about a "maximum advancement of the TSC".
> We're talking about a "maximum period for which the vCPUs aren't
> scheduled".

But that is all built on the assumption that a guest is actually
being *live* migrated. That is entirely a userspace decision, and IMO we
can't glean what the VMM is trying to do just based on how it calls
the KVM APIs.

The only thing we know for certain is the guest state is serialized.
Where that data goes and how long it takes to be acted upon is none of
our business in the kernel.

> And I don't see why there need to be documented hard limits on that. I
> see it as a quality of implementation issue. If the host is in swap
> death and doesn't manage to run the guest vCPUs for seconds at a time,
> that's fairly crappy, but it's not strictly a *timekeeping* issue. And
> it's just the same if the final transition period of live migration (or
> the kexec time for live update) is excessive.

Serializing a VM to persistent storage and resuming it at a later date
is a completely valid use of the KVM APIs. Sure, if the guest goes down
for that long during a live migration then it has every right to scream
at the entity running the VM, but I would argue there are more
lifecycles outside of what our businesses typically do.

> > > > > The KVM_PVCLOCK_STOPPED event should trigger a change in some of the
> > > > > globals kept by kernel/time/ntp.c (which are visible to userspace through
> > > > > adjtimex(2)). In particular, `time_esterror` and `time_maxerror` should get reset
> > > > > to `NTP_PHASE_LIMIT` and time_status should get reset to `STA_UNSYNC`.
> > > > 
> > > > I do not disagree that NTP needs to throw the book out after a live
> > > > migration.
> > > > 
> 
> Right. To recap, this is because where I highlighted 'precisely¹'
> above, our accuracy is actually limited to the synchronisation of the
> wallclock time between the two hosts. If the guest thinks it has a more
> accurate NTP sync than either of the hosts do, we may have introduced
> an error which is more than the error bounds the guest thinks it has,
> and that may ultimately lead to data corruption in some circumstances.

Agreed. And on top of that I would state the guest is likely running on
a different oscillator with its own set of problems since we switched out
the underlying hardware. NTP would need to rediscipline the clock even
if the migration somehow bends our understanding of reality and happens
in zero time.

> This part is somewhat orthogonal to the above discussion, isn't it?
> Regardless of whether we 'step' the TSC or not, we need to signal the
> guest to know that it needs to consider itself unsynchronized (or, if
> we want to get really fancy, let it know the upper bounds of the error
> we just introduced. But let's not).

Right, whatever signal we add should be asserted regardless of whether
the TSCs kept ticking while the guest is out. If there is going to be a
shared structure/message/whatever that conveys delta_REALTIME, you could
just set it to zero if the hypervisor decided to handle time on its own.
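
For the sake of discussion, the shared structure could be as simple as this
(an entirely hypothetical layout):

    struct pv_migration_info {
            __u32 mig_seq;           /* bumped by the host on every migration */
            __u32 flags;
            __s64 delta_realtime_ns; /* how far REALTIME jumped; 0 if the host
                                      * already advanced the TSCs itself */
    };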

> > > > But, the issue is how we convey that to the guest. KVM_PVCLOCK_STOPPED
> > > > relies on the guest polling a shared structure, and who knows when the
> > > > guest is going to check the structure again? If we inject an interrupt
> > > > the guest is likely to check this state in a reasonable amount of time.
> > > 
> > > Ah, but that's the point. A flag in shared memory can be checked
> > > whenever the guest needs to know that it's operating on valid state.
> > > Linux will check it *every* time from pvclock_clocksource_read().
> > > 
> > > As opposed to a separate interrupt which eventually gets processed some
> > > indefinite amount of time in the future.
> > 
> > There are a few annoying things with pvclock, though. It is a per-vCPU
> > structure, so special care must be taken to act exactly once on a
> > migration. Also, since commit 7539b174aef4 ("x86: kvmguest: use TSC
> > clocksource if invariant TSC is exposed") the guest kernel could pick
> > the TSC over the pvclock by default, so it's hard to say when the pvclock
> > structure is checked again.
> 
> That commit appears to be assuming that the TSC *will* be "stepped", or
> as I call it "continue to advance normally at its normal frequency over
> elapsed time".

Maybe? The author's intent isn't the reason I cared about that commit. It
means we will not detect PVCLOCK_GUEST_STOPPED synchronously with any
clock read, and I imagine we're rather concerned about those reads until
the clock is refined. The sched clock would probably be the one to find out.

And sure, changing the clocksource away from the pvclock may not be the
right call if we are going to use it to enlighten NTP; I just feel the
commit above codified that it isn't user error.

> >  This is what I had in mind when suggesting a doorbell is needed, as
> > there is no good way to know what clocksource the guest is using.
> 
> Yes, perhaps that might be necessary.
> 
> The concern is that by the time that doorbell interrupt has finally
> been delivered and processed, an inaccurate timestamp could *already*
> have been used on the wire in some application's coherency
> protocol, and the guest's database could already be hosed.

Unless you get that software into a known-safe instruction boundary,
who's to say you haven't migrated the guest between the clock read
and eventual packet out to the world? The timestamp would be
pre-blackout.

There's also nothing to prevent any agent within the guest from reading
the clock before NTP refinement comes in. We don't block clock reads on
PVCLOCK_GUEST_STOPPED.

> But I don't see why we can't have both.  I think it makes sense for the
> guest kernel to ditch its NTP sync when it sees PVCLOCK_GUEST_STOPPED,
> but I'm not entirely averse to the existence of a doorbell mechanism
> such as you describe.

Sure, there's nothing stopping us from setting off the alarms when that
bit is set. But:

 - It isn't as helpful for guests using the TSC
 - It does not prevent guest kernel or userspace from getting completely
   wrong clock reads in the interim

Given this, I'm convinced there is significant room for improvement as
there is a significant window of time during which we really wish the
guest to not use its wall clock.

> > > I'll give you the assertion that migrations aren't completely
> > > invisible, but I still think they should be *equivalent* to the vCPU
> > > just not being scheduled for a moment.
> > 
> > I sure hope that migrations are fast enough that it is indistinguishable
> > from scheduler pressure. I think the situations where that is not the
> > case are particularly interesting. Defining a limit and having a
> > mechanism for remedial action could make things more predictable for
> > guest software.
> > 
> > But agree, and shame on us for the broken virtual hardware when that
> > isn't the case :-)
> 
> Right. The "how to tell the guest that it needs to ditch its NTP sync
> status" question needs some more thought
>
> but in the short term I think
> it makes sense to add get/set TSC mechanisms which work like
> KVM_[SG]ET_CLOCK with the KVM_CLOCK_REALTIME flag.
> 
> Instead of realtime though, it should use the KVM clock. So reading a
> TSC returns a { kvm_ns, TSC value } tuple, and setting it will advance
> the value by the appropriate amount just as we advance the KVM clock
> for the KVM_CLOCK_REALTIME delta.

I'm all ears for changes to the TSC mechanisms, but I just feel rather
strongly that they need to support all possible use cases, not just live
migration. Further, KVM doesn't know what userspace is doing, and I really
hope that userspace knows what userspace is doing.

If it does, maybe that's the right place to decide how to bend reality.
If we fully commit to KVM advancing clocks for an unbounded amount of
time then there needs to be a *giant* disclaimer for the case of pulling
a VM out of cold storage. Writing back exactly what you saved will in all
likelihood detonate the guest.

I just think it is highly desirable that anything we get upstream is
actually usable by upstream. Even though the hyperscalers only care about
situations where we're likely to immediately resume the guest, there are
other use cases to think about. The TSC offset attributes fit the
immediate interests of my employer, but were only halfway there for
upstream. I'd like to do right on that, because hopefully then I'm not
the only person testing it.

--
Thanks,
Oliver


* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-22 21:53 ` Oliver Upton
  2022-03-23 12:35   ` David Woodhouse
@ 2022-03-29 14:19   ` Thomas Gleixner
  2022-03-29 16:02     ` Oliver Upton
From: Thomas Gleixner @ 2022-03-29 14:19 UTC (permalink / raw)
  To: Oliver Upton, Franke, Daniel
  Cc: David Woodhouse, Paolo Bonzini, kvm, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel

On Tue, Mar 22 2022 at 21:53, Oliver Upton wrote:
> On Tue, Mar 22, 2022 at 07:18:20PM +0000, Franke, Daniel wrote:
>> The KVM_PVCLOCK_STOPPED event should trigger a change in some of the
>> globals kept by kernel/time/ntp.c (which are visible to userspace through
>> adjtimex(2)). In particular, `time_esterror` and `time_maxerror` should get reset
>> to `NTP_PHASE_LIMIT` and time_status should get reset to `STA_UNSYNC`.
>
> I do not disagree that NTP needs to throw the book out after a live
> migration.
>
> But, the issue is how we convey that to the guest. KVM_PVCLOCK_STOPPED
> relies on the guest polling a shared structure, and who knows when the
> guest is going to check the structure again? If we inject an interrupt
> the guest is likely to check this state in a reasonable amount of time.
>
> Thomas, we're talking about how to not wreck time (as bad) under
> virtualization. I know this has been an area of interest to you for a
> while ;-) The idea is that the hypervisor should let the guest know
> about time travel.

Finally. It only took 10+ years of ranting :)

> Let's just assume for now that the hypervisor will *not* quiesce
> the guest into an S2IDLE state before migration. I think quiesced
> migrations are a good thing to have in the toolbelt, but there will
> always be host-side problems that require us to migrate a VM off a host
> immediately with no time to inform the guest.

Quiesced migrations are the right thing on the principle of least
surprise.

> Given that, we're deciding which clock is going to get wrecked during
> a migration and what the guest can do afterwards to clean it up.
> Whichever clock gets wrecked is going to have a window where reads race
> with the eventual fix, and could be completely wrong. My thoughts:
>
> We do not advance the TSC during a migration and notify the
> guest (interrupt, shared structure) about how much it has
> time traveled (delta_REALTIME). REALTIME is wrong until the interrupt
> is handled in the guest, but should fire off all of the existing
> mechanisms for a clock step. Userspace gets notified with
> TFD_TIMER_CANCEL_ON_SET. I believe you have proposed something similar
> as a way to make live migration less sinister from the guest
> perspective.

That is still information _after_ the fact, and it is not well defined.

> It seems possible to block racing reads of REALTIME if we protect it with
> a migration sequence counter. Host raises the sequence after a migration
> when control is yielded back to the guest. The sequence is odd if an
> update is pending. Guest increments the sequence again after the
> interrupt handler accounts for time travel. That has the side effect of
> blocking all realtime clock reads until the interrupt is handled. But
> what is the value of those reads if we know they're wrong? There is
> also the implication that said shared memory interface gets mapped
> through to userspace for the vDSO; I haven't thought at all about those
> implications yet.

So you need two sequence counters to be checked similar to the
pv_clock() case, which prevents the usage of TSC without the pv_clock()
overhead.

That also means that CLOCK_REALTIME, CLOCK_TAI and CLOCK_REALTIME_COARSE
will need to become special cased in the VDSO and all realtime based
syscall interfaces.

I'm sure that will be well received from a performance perspective.

Aside from that, where do you put the limit? Just user-space-visible
readouts? What about the kernel? Ignore it in the kernel and hope that
nothing will break? That's going to create really hard to diagnose
issues, which I'm absolutely not interested in.

You can't expand this scheme to the kernel in general because the CPU
which should handle the update might be in an interrupt disabled region
reading clock REALTIME....

Also this would just make the race window smaller. It does not solve the
problem in a consistent way. Don't even think about it. It's just
another layer of duct tape over the existing ones.

As I explained before, this does not solve the following:

   T = now();
   -----------> interruption of some sort (Tout)
   use(T);

Any use case which cares about Tout being less than a certain threshold
has to be carefully designed to handle this. See also below.

> Doing this the other way around (advance the TSC, tell the guest to fix
> MONOTONIC) is fundamentally wrong, as it violates two invariants of the
> monotonic clock. Monotonic counts during a migration, which really is a
> forced suspend. Additionally, you cannot step the monotonic clock.

A migration _should_ have suspend semantics, but the forced suspend
which is done by migration today does not have properly defined semantics
at all.

Also clock monotonic can be stepped forward under certain circumstances
and the kernel is very well able to handle it within well defined
limits. Think about scheduled out vCPUs. From their perspective clock
monotonic is stepping forwards.

The problem starts when the 'force suspended' time becomes excessive, as
that causes the mass expiry of clock monotonic timers with all the nasty
side effects described in a3ed0e4393d6. In the worst case it's going to
exceed the catchup limit of non-suspended timekeeping (~440s for a TSC
@2GHz) which in fact breaks the world and some more.

So let me go back to the use cases:

 1) Regular freeze of a VM to disk and resume at some arbitrary point in
    the future.

 2) Live migration

In both cases the proper solution is to make the guest go into a well
defined state (suspend) and resume it on restore. Everything just works
because it is well defined.

Though you say that there is a special case (not that I believe it):

> but there will always be host-side problems that require us to migrate
> a VM off a host immediately with no time to inform the guest.

which is basically what we do today for the very wrong reasons. There you
have two situations:

  1) Trestart - Tstop < TOLERABLE_THRESHOLD

     That's the easy case as you just can adjust TSC on restore by that
     amount on all vCPUs and be done with it. Just works like scheduling
     out all vCPUs for some time.

  2) Trestart - Tstop >= TOLERABLE_THRESHOLD

     Avoid adjusting TSC for the reasons above. That leaves you with the
     options of

     A) Make NTP unhappy as of today

     B) Provide information about the fact that the vCPUs got scheduled
        out for a long time and teach NTP to be smart about it.

        Right now NTP decides to declare itself unstable when it
        observes the time jump on CLOCK_REALTIME from the NTP server.

        That's exactly the situation described above:

        T = now();
        -----------> interruption of some sort (Tout)
        use(T);

        NTP observes that T is inconsistent because it does not know
        about Tout due to the fact that neither clock MONOTONIC nor
        clock BOOTTIME advanced.

        So here is where you can bring PV information into play by
        providing a migration sequence number in PV shared memory.

        On source host:
           Stop the guest dead in its tracks.
           Record metadata:
             - migration sequence number
             - clock TAI as observed on the host
           Transfer the image along with metadata

        On destination host:
           Restore memory image
           Expose metadata in PV:
             - migration sequence number + 1
             - Tout (dest/source host delta of clock TAI)
           Run guest

        Guest kernel:

           - Keep track of the PV migration sequence number.

             If it changed act accordingly by injecting the TAI delta,
             which updates NTP state, wakes TFD_TIMER_CANCEL_ON_SET,
             etc...

             We have to do that in the kernel because we cannot rely on
             NTP being running.

             That can be an explicit IPI injected from the host or just
             polling from the tick. It doesn't matter much.

           - Expose information to user space:
               - Migration sequence counter

           - Add some smarts to adjtimex() to prevent user space racing
             against a pending migration update in the kernel.

        NTP:
           - utilize the sequence counter information 

             seq = get_migration_sequence();

             do stuff();
        
             if (seq != get_migration_sequence())
             	do_something_smart();
             else
                proceed_as_usual();

        That still will leave everything else exposed to
        CLOCK_REALTIME/TAI jumping forward, but there is nothing you can
        do about that and any application which cares about this has to
        be able to deal with it anyway.
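
A minimal sketch of the guest-kernel side of the above outline (the PV field
names and the offset-injection helper are stand-ins, not existing interfaces):

    static u32 handled_mig_seq;

    /* Called from the tick, or from an IPI injected by the host. */
    static void check_migration_update(void)
    {
            u32 seq = READ_ONCE(pv_meta->mig_seq);
            struct timespec64 delta;

            if (seq == handled_mig_seq)
                    return;

            /* Tout: dest/source host delta of clock TAI, provided by the host */
            delta = ns_to_timespec64(READ_ONCE(pv_meta->tai_delta_ns));

            /* Step REALTIME/TAI by Tout: updates NTP state, wakes
             * TFD_TIMER_CANCEL_ON_SET timers, fires clock-was-set notifiers. */
            inject_realtime_offset(&delta);

            handled_mig_seq = seq;  /* exposed to user space via adjtimex() */
    }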

Hmm?

Thanks,

        tglx


* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-29 14:19   ` Thomas Gleixner
@ 2022-03-29 16:02     ` Oliver Upton
  2022-03-29 19:34       ` Thomas Gleixner
  2022-06-30 11:58       ` David Woodhouse
From: Oliver Upton @ 2022-03-29 16:02 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Franke, Daniel, David Woodhouse, Paolo Bonzini, kvm,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel

Hi Thomas,

On Tue, Mar 29, 2022 at 7:19 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> > Let's just assume for now that the hypervisor will *not* quiesce
> > the guest into an S2IDLE state before migration. I think quiesced
> > migrations are a good thing to have in the toolbelt, but there will
> > always be host-side problems that require us to migrate a VM off a host
> > immediately with no time to inform the guest.
>
> Quiesced migrations are the right thing on the principle of least
> surprise.

Absolutely no argument here. Furthermore, it is likely the laziest
solution as well since all the adjustment is done within the guest.

The one thing I envision is that it's highly unlikely that a migration
will block until the guest cooperates. If the guest doesn't act within
a certain threshold then sorry, you're going along for the ride
anyway.

But it is very easy to paint that as user error if advance notice was
given for the migration :)

> > It seems possible to block racing reads of REALTIME if we protect it with
> > a migration sequence counter. Host raises the sequence after a migration
> > when control is yielded back to the guest. The sequence is odd if an
> > update is pending. Guest increments the sequence again after the
> > interrupt handler accounts for time travel. That has the side effect of
> > blocking all realtime clock reads until the interrupt is handled. But
> > what is the value of those reads if we know they're wrong? There is
> > also the implication that said shared memory interface gets mapped
> > through to userspace for the vDSO; I haven't thought at all about those
> > implications yet.
>
> So you need two sequence counters to be checked similar to the
> pv_clock() case, which prevents the usage of TSC without the pv_clock()
> overhead.
>
> That also means that CLOCK_REALTIME, CLOCK_TAI and CLOCK_REALTIME_COARSE
> will need to become special cased in the VDSO and all realtime based
> syscall interfaces.
>
> I'm sure that will be well received from a performance perspective.

Lol. Yeah, I was rambling and this still falls apart for the case
where a clock read came in before the migration, which you raise
below.

> As I explained before, this does not solve the following:
>
>    T = now();
>    -----------> interruption of some sort (Tout)
>    use(T);
>
> Any use case which cares about Tout being less than a certain threshold
> has to be carefully designed to handle this. See also below.
>
> > Doing this the other way around (advance the TSC, tell the guest to fix
> > MONOTONIC) is fundamentally wrong, as it violates two invariants of the
> > monotonic clock. Monotonic counts during a migration, which really is a
> > forced suspend. Additionally, you cannot step the monotonic clock.
>
> A migration _should_ have suspend semantics, but the forced suspend
> which is done by migration today does not have properly defined semantics
> at all.
>
> Also clock monotonic can be stepped forward under certain circumstances
> and the kernel is very well able to handle it within well defined
> limits. Think about scheduled out vCPUs. From their perspective clock
> monotonic is stepping forwards.
>

Right. For better or worse, the kernel has been conditioned to
tolerate small steps due to scheduling. There is just zero definition
around how much slop is allowed.

> The problem starts when the 'force suspended' time becomes excessive, as
> that causes the mass expiry of clock monotonic timers with all the nasty
> side effects described in a3ed0e4393d6. In the worst case it's going to
> exceed the catchup limit of non-suspended timekeeping (~440s for a TSC
> @2GHz) which in fact breaks the world and some more.
>
> So let me go back to the use cases:
>
>  1) Regular freeze of a VM to disk and resume at some arbitrary point in
>     the future.
>
>  2) Live migration
>
> In both cases the proper solution is to make the guest go into a well
> defined state (suspend) and resume it on restore. Everything just works
> because it is well defined.
>
> Though you say that there is a special case (not that I believe it):

I believe the easier special case to articulate is when the hypervisor
has already done its due diligence to warn the guest about a
migration. Guest doesn't heed the warning and doesn't quiesce. The
most predictable state at this point is probably just to kill the VM
on the spot, but that is likely to be a _very_ tough sell :)

So assuming that it's still possible for a non-cooperative suspend
(live migration, VM freeze, etc.) there's still a need to stop the
bleeding. I think you touch on what that may look like:

>   1) Trestart - Tstop < TOLERABLE_THRESHOLD
>
>      That's the easy case as you just can adjust TSC on restore by that
>      amount on all vCPUs and be done with it. Just works like scheduling
>      out all vCPUs for some time.
>
>   2) Trestart - Tstop >= TOLERABLE_THRESHOLD
>
>      Avoid adjusting TSC for the reasons above.

Which naturally will prompt the question: what is the value of
TOLERABLE_THRESHOLD? Speaking from experience (Google already does
something similar, but without a good fallback for exceeding the
threshold), there's ~zero science in deriving that value. But, IMO if
it's at least documented we can make the shenanigans at least a bit
more predictable. It also makes it very easy to define who
(guest/host) is responsible for cleaning up the mess.

In the absence of documentation there's an unlimited license for VM
operators to do as they please and I fear we will forever perpetuate
the pain of time in virt.
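
To make the guest/host split concrete, the host side of the restore
could look roughly like this (a sketch with made-up helper names, not
any existing KVM or VMM interface):

	/* Sketch only: VMM restore path, every helper here is hypothetical. */
	void restore_guest_time(struct vm *vm, int64_t blackout_ns)
	{
		if (blackout_ns < TOLERABLE_THRESHOLD_NS) {
			/*
			 * Short blackout: advance every vCPU's TSC by the
			 * same amount, so to the guest it looks like an
			 * ordinary scheduling delay.
			 */
			for_each_vcpu(vm, vcpu)
				vcpu_add_guest_tsc(vcpu,
					ns_to_guest_cycles(vm, blackout_ns));
		} else {
			/*
			 * Long blackout: leave the TSC alone, publish the
			 * lost real time in PV metadata and bump the sequence
			 * number so the guest kernel and NTP can clean up.
			 */
			pv_set_tout_ns(vm, blackout_ns);
			pv_bump_migration_seq(vm);
		}
	}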

>      That leaves you with the options of
>
>      A) Make NTP unhappy as of today
>
>      B) Provide information about the fact that the vCPUs got scheduled
>         out for a long time and teach NTP to be smart about it.
>
>         Right now NTP decides to declare itself unstable when it
>         observes the time jump on CLOCK_REALTIME from the NTP server.
>
>         That's exactly the situation described above:
>
>         T = now();
>         -----------> interruption of some sort (Tout)
>         use(T);
>
>         NTP observes that T is inconsistent because it does not know
>         about Tout due to the fact that neither clock MONOTONIC nor
>         clock BOOTTIME advanced.
>
>         So here is where you can bring PV information into play by
>         providing a migration sequence number in PV shared memory.
>
>         On source host:
>            Stop the guest dead in its tracks.
>            Record metadata:
>              - migration sequence number
>              - clock TAI as observed on the host
>            Transfer the image along with metadata
>
>         On destination host:
>            Restore memory image
>            Expose metadata in PV:
>              - migration sequence number + 1
>              - Tout (dest/source host delta of clock TAI)
>            Run guest
>
>         Guest kernel:
>
>            - Keep track of the PV migration sequence number.
>
>              If it changed act accordingly by injecting the TAI delta,
>              which updates NTP state, wakes TFD_TIMER_CANCEL_ON_SET,
>              etc...
>
>              We have to do that in the kernel because we cannot rely on
>              NTP being running.
>
>              That can be an explicit IPI injected from the host or just
>              polling from the tick. It doesn't matter much.
>
>            - Expose information to user space:
>                - Migration sequence counter
>
>            - Add some smarts to adjtimex() to prevent user space racing
>              against a pending migration update in the kernel.
>
>         NTP:
>            - utilize the sequence counter information
>
>              seq = get_migration_sequence();
>
>              do stuff();
>
>              if (seq != get_migration_sequence())
>                 do_something_smart();
>              else
>                 proceed_as_usual();

Agreed pretty much the whole way through. There's no point in keeping
NTP naive at this point.
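
On the NTP daemon side, the smarts could be as simple as this (sketch;
get_migration_sequence() stands in for whatever interface the kernel
ends up exposing, e.g. a new adjtimex field):

	/* Sketch: wrapped around the daemon's clock-discipline step. */
	uint64_t seq = get_migration_sequence();

	compute_and_apply_adjustment();	/* the usual adjtimex() machinery */

	if (get_migration_sequence() != seq) {
		/*
		 * A migration landed while we were working: the measurements
		 * and the adjustment just applied may be stale. Reset the
		 * discipline state, mark the clock unsynchronised and re-poll
		 * all sources immediately.
		 */
		reset_discipline_state();
		mark_unsynchronised();
		poll_all_sources_now();
	}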

There's a need to sound the alarm for NTP regardless of whether
TOLERABLE_THRESHOLD is exceeded. David pointed out that the host
advancing the guest clocks (delta injection or TSC advancement) could
inject some error. Also, hardware has likely changed and the new parts
will have their own errors as well.

>         That still will leave everything else exposed to
>         CLOCK_REALTIME/TAI jumping forward, but there is nothing you can
>         do about that and any application which cares about this has to
>         be able to deal with it anyway.

Right. There's no cure-all between hypervisor/guest kernel that could
fix the problem for userspace entirely.

Appreciate you chiming in on this topic yet again.

--
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-29 16:02     ` Oliver Upton
@ 2022-03-29 19:34       ` Thomas Gleixner
  2022-06-30 11:58       ` David Woodhouse
  1 sibling, 0 replies; 44+ messages in thread
From: Thomas Gleixner @ 2022-03-29 19:34 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Franke, Daniel, David Woodhouse, Paolo Bonzini, kvm,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel

Oliver,

On Tue, Mar 29 2022 at 09:02, Oliver Upton wrote:
> On Tue, Mar 29, 2022 at 7:19 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> > Doing this the other way around (advance the TSC, tell the guest to fix
>> > MONOTONIC) is fundamentally wrong, as it violates two invariants of the
>> > monotonic clock. Monotonic counts during a migration, which really is a
>> > forced suspend. Additionally, you cannot step the monotonic clock.
>>
>> A migration _should_ have suspend semantics, but the forced suspend
>> which is done by migration today does not have proper defined semantics
>> at all.
>>
>> Also clock monotonic can be stepped forward under certain circumstances
>> and the kernel is very well able to handle it within well defined
>> limits. Think about scheduled out vCPUs. From their perspective clock
>> monotonic is stepping forwards.
>>
>
> Right. For better or worse, the kernel has been conditioned to
> tolerate small steps due to scheduling. There is just zero definition
> around how much slop is allowed.

From the experience with the MONO=BOOT experiment, I'd say anything < 1
second is unlikely to cause larger trouble, but there might be user
space applications/configurations which disagree :)

>> The problem starts when the 'force suspended' time becomes excessive, as
>> that causes the mass expiry of clock monotonic timers with all the nasty
>> side effects described in a3ed0e4393d6. In the worst case it's going to
>> exceed the catchup limit of non-suspended timekeeping (~440s for a TSC
>> @2GHz) which in fact breaks the world and some more.
>>
>> So let me go back to the use cases:
>>
>>  1) Regular freeze of a VM to disk and resume at some arbitrary point in
>>     the future.
>>
>>  2) Live migration
>>
>> In both cases the proper solution is to make the guest go into a well
>> defined state (suspend) and resume it on restore. Everything just works
>> because it is well defined.
>>
>> Though you say that there is a special case (not that I believe it):
>
> I believe the easier special case to articulate is when the hypervisor
> has already done its due diligence to warn the guest about a
> migration. Guest doesn't heed the warning and doesn't quiesce. The
> most predictable state at this point is probably just to kill the VM
> on the spot, but that is likely to be a _very_ tough sell :)

I'm all for it. It's very well defined.

> So assuming that it's still possible for a non-cooperative suspend
> (live migration, VM freeze, etc.) there's still a need to stop the
> bleeding. I think you touch on what that may look like:
>
>>   1) Trestart - Tstop < TOLERABLE_THRESHOLD
>>
>>      That's the easy case as you just can adjust TSC on restore by that
>>      amount on all vCPUs and be done with it. Just works like scheduling
>>      out all vCPUs for some time.
>>
>>   2) Trestart - Tstop >= TOLERABLE_THRESHOLD
>>
>>      Avoid adjusting TSC for the reasons above.
>
> Which naturally will prompt the question: what is the value of
> TOLERABLE_THRESHOLD? Speaking from experience (Google already does
> something similar, but without a good fallback for exceeding the
> threshold), there's ~zero science in deriving that value. But, IMO if
> it's at least documented we can make the shenanigans at least a bit
> more predictable. It also makes it very easy to define who
> (guest/host) is responsible for cleaning up the mess.

See above, but the hyperscalers with experience on heavy host overload
might have better information when keeping a vCPU scheduled out starts
to create problems in the guest.

> In the absence of documentation there's an unlimited license for VM
> operators to do as they please and I fear we will forever perpetuate
> the pain of time in virt.

You can prevent that, by making 'Cooperate within time or die hard' the
policy. :)

>>              if (seq != get_migration_sequence())
>>                 do_something_smart();
>>              else
>>                 proceed_as_usual();
>
> Agreed pretty much the whole way through. There's no point in keeping
> NTP naive at this point.
>
> There's a need to sound the alarm for NTP regardless of whether
> TOLERABLE_THRESHOLD is exceeded. David pointed out that the host
> advancing the guest clocks (delta injection or TSC advancement) could
> inject some error. Also, hardware has likely changed and the new parts
> will have their own errors as well.

There is no reason why you can't use the #2 scheme for the #1 case
too:

>>         On destination host:
>>            Restore memory image
>>            Expose metadata in PV:
>>              - migration sequence number + 1

                - Flag whether Tout was compensated already via
                  TSC or just set Tout = 0

>>              - Tout (dest/source host delta of clock TAI)
>>            Run guest
>>
>>         Guest kernel:
>>
>>            - Keep track of the PV migration sequence number.
>>
>>              If it changed act accordingly by injecting the TAI delta,
>>              which updates NTP state, wakes TFD_TIMER_CANCEL_ON_SET,
>>              etc...

                if it was compensated via TSC already, it might be
                sufficient to just reset NTP state.
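
                Roughly, on the guest side (pure illustration, none of
                these PV fields exist today):

                    /* Polled from the tick or run from a host-injected IPI. */
                    static u64 last_seen_seq;

                    void check_migration(void)
                    {
                            u64 seq = READ_ONCE(pv_meta->migration_seq);

                            if (seq == last_seen_seq)
                                    return;
                            last_seen_seq = seq;

                            if (!READ_ONCE(pv_meta->tsc_compensated)) {
                                    /* Host did not touch the TSC: inject the
                                     * TAI delta into timekeeping, wake
                                     * TFD_TIMER_CANCEL_ON_SET waiters, etc. */
                                    inject_tai_delta_ns(READ_ONCE(pv_meta->tout_ns));
                            }

                            /* Either way, NTP state is stale now. */
                            reset_ntp_state();
                            notify_userspace(seq);
                    }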

>>         NTP:
>>            - utilize the sequence counter information
                ....

OTOH, the question is whether it's worth it.

If we assume that the sane case is a cooperative guest and the forced
migration is the last resort, then we can just avoid the extra magic and
the discussion around the correct value for TOLERABLE_THRESHOLD
altogether.

I suggest starting from a TOLERABLE_THRESHOLD=0 assumption to keep it
simple in the first step. Once this has been established, you can still
experiment with the threshold and figure out whether it matters.

In fact, doing the host side TSC compensation is just an excuse for VM
operators not to make the guest cooperative, because it might solve
their main problems for the vast majority of migrations.

Forcing them to do it right is definitely the better option, which
means the 'Cooperate or die hard' policy is the best one you can
choose. :)

>>         That still will leave everything else exposed to
>>         CLOCK_REALTIME/TAI jumping forward, but there is nothing you can
>>         do about that and any application which cares about this has to
>>         be able to deal with it anyway.
>
> Right. There's no cure-all between hypervisor/guest kernel that could
> fix the problem for userspace entirely.

In the same way as there is no cure for time jumps caused by
settimeofday(), daylight saving changes, leap seconds etc., unless the
application is carefully designed to deal with that.

> Appreciate you chiming in on this topic yet again.

I still hope that this gets fixed _before_ I retire :)

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-29 16:02     ` Oliver Upton
  2022-03-29 19:34       ` Thomas Gleixner
@ 2022-06-30 11:58       ` David Woodhouse
  2022-07-05 14:43         ` David Woodhouse
  1 sibling, 1 reply; 44+ messages in thread
From: David Woodhouse @ 2022-06-30 11:58 UTC (permalink / raw)
  To: Oliver Upton, Thomas Gleixner, sveith
  Cc: Franke, Daniel, Paolo Bonzini, kvm, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel


On Tue, 2022-03-29 at 09:02 -0700, Oliver Upton wrote:
> 
> There's a need to sound the alarm for NTP regardless of whether
> TOLERABLE_THRESHOLD is exceeded. David pointed out that the host
> advancing the guest clocks (delta injection or TSC advancement) could
> inject some error. Also, hardware has likely changed and the new parts
> will have their own errors as well.

I don't admit to pointing that out in that form because I don't accept
the use of the term "advance".

Clocks advance *themselves*. That's what clocks do.

When we perform a live update or live migration we might *adjust* those
clocks, calibrate, synchronise or restore them. But I'll eat my
keyboard before using the term "advance" for that. Even if our
adjustment is in a forward direction.

Let's consider the case of a live update — where we stop scheduling the
guest for a moment, kexec into the new kernel, then resume scheduling
the guest.

I assert strongly that from the guest point of view this is *no*
different to any other brief period of not being scheduled.

Yes, in practice we have a whole new kernel, a whole new KVM and set of
kvm_vcpus, and we've *restored* the state. And we have restored the
TSCs/clocks in those new kvm objects to precisely match what they were
before. Note: *match* not advance.

Before the kexec, there were a bunch of relationships between clocks,
mostly based on the host TSC Tₕ (assuming the case where that's stable
and reliable):

 • The kernel's idea of wallclock time was based on Tₕ, plus some
   offset and divided by some frequency. NTP tweaks those values over
   time but at any given instant there is a current value for them
   which is used to derive the wallclock time.

 • The kernel's idea of the guest kvmclock epoch (nanoseconds since the
   KVM started) was based on Tₕ and some other offset and hopefully the
   same frequency. 

 • The TSC of each vCPU was based on Tₕ, some offset and a TSC scaling
   factor.

After a live update, the host TSC Tₕ is just the same as it always was.
Not the same *value* of course; that was never the case from one tick
to the next anyway. It's the same, in that it continues to advance
*itself* at a consistent frequency as real time progresses, which is
what clocks do.

In the new kernel we just want all those other derivative clocks to
*also* be the same as before. That is, the offset and multipliers are
the *same* value. We're not "advancing" those clocks. We're
*preserving* them.


For live migration it's slightly harder because we don't have a
consistent host TSC to use as the basis. The best we can do is NTP-
synchronised wallclock time between the two hosts. And thus I think we
want *these* constants to be preserved across the migration:

 The KVM's kvmclock was <K> at a given wallclock time <W>

 The TSC of each vCPU#n was <Tₙ> at a given value of kvmclock <Kₙ>
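
Spelling that out (roughly, and ignoring scaling-ratio rounding, with
guest_tsc_khz being the guest TSC frequency in kHz): if the destination
observes its own wallclock time <W'> at the moment it resumes the
guest, it reconstructs

  kvmclock at resume:     K' = K + (W' - W)
  guest TSC of vCPU#n:    Tₙ' = Tₙ + (K' - Kₙ) * guest_tsc_khz / 10^6

and programs the kvmclock offset and each vCPU's TSC offset so that the
guest reads exactly those values at that instant. The only new error is
whatever error sits in (W' - W), i.e. the NTP sync between the hosts.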

In *this* case we are running on different hardware and the reliance on
the NTP wallclock time as the basis for preserving the guest clocks may
have introduced an error, as well as the fact that the hardware has
changed. So in this case we should indeed inform the guest that it
should consider itself out of NTP sync and start over, in *addition* to
making a best effort to preserve those clocks.

But there is no scope for the word "advance" to be used anywhere there
either.




^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-06-30 11:58       ` David Woodhouse
@ 2022-07-05 14:43         ` David Woodhouse
  2022-07-07 16:43           ` [PATCH 1/2] KVM: x86: add KVM clock time reference arg to kvm_write_tsc() Simon Veith
  0 siblings, 1 reply; 44+ messages in thread
From: David Woodhouse @ 2022-07-05 14:43 UTC (permalink / raw)
  To: Oliver Upton, Thomas Gleixner, Veith, Simon
  Cc: Franke, Daniel, Paolo Bonzini, kvm, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel


On Thu, 2022-06-30 at 12:58 +0100, David Woodhouse wrote:
>  The TSC of each vCPU#n was <Tₙ> at a given value of kvmclock <Kₙ>

That one isn't even particularly hard to do. At its core, it looks
something like this (thanks to Simon for fixing the arithmetic. My
first proof-of-concept was only a *few* orders of magnitude out...)

Simon then has a follow-on patch to wrap an ioctl around it and
actually allow the kvm_ns to be passed in.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1910e1e78b15..ddb67c9566e4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2629,7 +2629,7 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
 	kvm_track_tsc_matching(vcpu);
 }
 
-static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
+static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data, u64 *kvm_ns)
 {
 	struct kvm *kvm = vcpu->kvm;
 	u64 offset, ns, elapsed;
@@ -2638,8 +2638,23 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
 	bool synchronizing = false;
 
 	raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
-	offset = kvm_compute_l1_tsc_offset(vcpu, data);
 	ns = get_kvmclock_base_ns();
+	if (kvm_ns) {
+ 		/*
+ 		 * We have been provided a KVM clock reference time point at
+ 		 * which this TSC value was correct. Use this time point to
+		 * compensate for any delays that were incurred since that
+		 * TSC value was computed. Note that time is still passing;
+		 * in an ideal world the get_kvmclock_base_ns() above would
+		 * be based on the *same* TSC value that's hidden in the call
+		 * to kvm_compute_l1_tsc_offset() below.
+ 		 */
+ 		u32 tsc_khz = vcpu->arch.virtual_tsc_khz;
+ 		s64 delta_ns = ns + vcpu->kvm->arch.kvmclock_offset - *kvm_ns;
+ 		s64 delta_counts = delta_ns * tsc_khz / NSEC_PER_MSEC;
+ 		data += delta_counts;
+	}
+	offset = kvm_compute_l1_tsc_offset(vcpu, data);
 	elapsed = ns - kvm->arch.last_tsc_nsec;
 
 	if (vcpu->arch.virtual_tsc_khz) {
@@ -3581,7 +3596,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		break;
 	case MSR_IA32_TSC:
 		if (msr_info->host_initiated) {
-			kvm_synchronize_tsc(vcpu, data);
+			kvm_synchronize_tsc(vcpu, data, NULL);
 		} else {
 			u64 adj = kvm_compute_l1_tsc_offset(vcpu, data) - vcpu->arch.l1_tsc_offset;
 			adjust_tsc_offset_guest(vcpu, adj);
@@ -11392,7 +11407,7 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
 	if (mutex_lock_killable(&vcpu->mutex))
 		return;
 	vcpu_load(vcpu);
-	kvm_synchronize_tsc(vcpu, 0);
+	kvm_synchronize_tsc(vcpu, 0, NULL);
 	vcpu_put(vcpu);
 
 	/* poll control enabled by default */



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 1/2] KVM: x86: add KVM clock time reference arg to kvm_write_tsc()
  2022-07-05 14:43         ` David Woodhouse
@ 2022-07-07 16:43           ` Simon Veith
  2022-07-07 16:43             ` [PATCH 2/2] KVM: x86: add KVM_VCPU_TSC_VALUE attribute Simon Veith
  2022-07-29 21:14             ` [PATCH 1/2] KVM: x86: add KVM clock time reference arg to kvm_write_tsc() Sean Christopherson
  0 siblings, 2 replies; 44+ messages in thread
From: Simon Veith @ 2022-07-07 16:43 UTC (permalink / raw)
  To: dwmw2
  Cc: dff, jmattson, joro, kvm, oupton, pbonzini, seanjc, sveith, tglx,
	vkuznets, wanpengli, David Woodhouse

The Time Stamp Counter (TSC) value register can be set to an absolute
value using the KVM_SET_MSRS ioctl, which calls kvm_synchronize_tsc()
internally.

Since this is a per-vCPU register, and vCPUs are often managed by
separate threads, setting a uniform TSC value across all vCPUs is
challenging: After live migration, for example, the TSC value may need
to be adjusted to account for the migration downtime. Ideally, we would
want each vCPU to be adjusted by the same offset; but if we compute the
offset centrally, the TSC value may become out of date due to scheduling
delays by the time that each vCPU thread gets around to issuing
KVM_SET_MSRS.

In preparation for the next patch, this change adds an optional, KVM
clock based time reference argument to kvm_synchronize_tsc(). This
argument, if present, is understood to mean "the TSC value being written
was valid at this corresponding KVM clock time point".

kvm_synchronize_tsc() will then use this clock reference to adjust the
TSC value being written for any delays that have been incurred since the
provided TSC value was valid.
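
As a concrete, made-up example: with virtual_tsc_khz = 2,000,000 (a
2 GHz guest TSC), a write that lands 5 ms after the reference point sees
a delta_ns of roughly 5,000,000 and is bumped by 5,000,000 * 2,000,000 /
10^6 = 10,000,000 cycles, so every vCPU ends up at the same effective
TSC no matter when its thread got around to issuing the write.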

Co-developed-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Simon Veith <sveith@amazon.de>
---
 arch/x86/kvm/x86.c | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1910e1e78b15..a44d083f1bf9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2629,7 +2629,7 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
 	kvm_track_tsc_matching(vcpu);
 }
 
-static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
+static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data, u64 *kvm_ns)
 {
 	struct kvm *kvm = vcpu->kvm;
 	u64 offset, ns, elapsed;
@@ -2638,12 +2638,24 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
 	bool synchronizing = false;
 
 	raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
-	offset = kvm_compute_l1_tsc_offset(vcpu, data);
 	ns = get_kvmclock_base_ns();
+
+	if (kvm_ns) {
+		/*
+		 * We have been provided a KVM clock reference time point at
+		 * which this TSC value was correct.
+		 * Use this time point to compensate for any delays that were
+		 * incurred since that TSC value was valid.
+		 */
+		s64 delta_ns = ns + vcpu->kvm->arch.kvmclock_offset - *kvm_ns;
+		data += nsec_to_cycles(vcpu, (u64)delta_ns);
+	}
+
+	offset = kvm_compute_l1_tsc_offset(vcpu, data);
 	elapsed = ns - kvm->arch.last_tsc_nsec;
 
 	if (vcpu->arch.virtual_tsc_khz) {
-		if (data == 0) {
+		if (data == 0 && !kvm_ns) {
 			/*
 			 * detection of vcpu initialization -- need to sync
 			 * with other vCPUs. This particularly helps to keep
@@ -3581,7 +3593,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		break;
 	case MSR_IA32_TSC:
 		if (msr_info->host_initiated) {
-			kvm_synchronize_tsc(vcpu, data);
+			kvm_synchronize_tsc(vcpu, data, NULL);
 		} else {
 			u64 adj = kvm_compute_l1_tsc_offset(vcpu, data) - vcpu->arch.l1_tsc_offset;
 			adjust_tsc_offset_guest(vcpu, adj);
@@ -11392,7 +11404,7 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
 	if (mutex_lock_killable(&vcpu->mutex))
 		return;
 	vcpu_load(vcpu);
-	kvm_synchronize_tsc(vcpu, 0);
+	kvm_synchronize_tsc(vcpu, 0, NULL);
 	vcpu_put(vcpu);
 
 	/* poll control enabled by default */
-- 
2.25.1








^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 2/2] KVM: x86: add KVM_VCPU_TSC_VALUE attribute
  2022-07-07 16:43           ` [PATCH 1/2] KVM: x86: add KVM clock time reference arg to kvm_write_tsc() Simon Veith
@ 2022-07-07 16:43             ` Simon Veith
  2022-07-29 21:21               ` Sean Christopherson
  2022-07-29 21:14             ` [PATCH 1/2] KVM: x86: add KVM clock time reference arg to kvm_write_tsc() Sean Christopherson
  1 sibling, 1 reply; 44+ messages in thread
From: Simon Veith @ 2022-07-07 16:43 UTC (permalink / raw)
  To: dwmw2
  Cc: dff, jmattson, joro, kvm, oupton, pbonzini, seanjc, sveith, tglx,
	vkuznets, wanpengli

With the previous commit having added a KVM clock time reference to
kvm_synchronize_tsc(), this patch adds a new TSC attribute
KVM_VCPU_TSC_VALUE that allows for setting the TSC value in a way that
is unaffected by scheduling delays.

Userspace provides a struct kvm_vcpu_tsc_value consisting of a matched
pair of ( guest TSC value, KVM clock value ). The TSC value that will
ultimately be written is adjusted to account for the time which has
elapsed since the given KVM clock time point.

In order to allow userspace to retrieve an accurate time reference
atomically, without being affected by scheduling delays between
KVM_GET_CLOCK and KVM_GET_MSRS, the KVM_GET_DEVICE_ATTR implementation
for this attribute uses get_kvmclock() internally and returns a struct
kvm_vcpu_tsc_value with both values in one go. If get_kvmclock()
supports the KVM_CLOCK_HOST_TSC flag, the two will be based on one and
the same host TSC reading.

Signed-off-by: Simon Veith <sveith@amazon.de>
---
 Documentation/virt/kvm/devices/vcpu.rst | 22 +++++++++++++
 arch/x86/include/uapi/asm/kvm.h         |  7 +++++
 arch/x86/kvm/x86.c                      | 41 +++++++++++++++++++++++++
 tools/arch/x86/include/uapi/asm/kvm.h   |  7 +++++
 4 files changed, 77 insertions(+)

diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
index 716aa3edae14..a9506d902b0a 100644
--- a/Documentation/virt/kvm/devices/vcpu.rst
+++ b/Documentation/virt/kvm/devices/vcpu.rst
@@ -263,3 +263,25 @@ From the destination VMM process:
 
 7. Write the KVM_VCPU_TSC_OFFSET attribute for every vCPU with the
    respective value derived in the previous step.
+
+4.2 ATTRIBUTE: KVM_VCPU_TSC_VALUE
+
+:Parameters: kvm_device_attr.addr points to a struct kvm_vcpu_tsc_value
+
+Returns:
+
+	 ======= ======================================
+	 -EFAULT Error reading/writing the provided
+		 parameter address.
+	 -ENXIO  Attribute not supported
+	 ======= ======================================
+
+Gets or sets a matched pair of guest TSC value and KVM clock time point.
+
+When setting the TSC value through this attribute, a corresponding KVM clock
+reference time point (as retrieved by KVM_GET_CLOCK in the clock field) must be
+provided.
+
+The actual TSC value written will be adjusted based on the time that has
+elapsed since the provided reference time point, taking TSC scaling into
+account.
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 21614807a2cb..79de9e34cfa8 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -525,5 +525,12 @@ struct kvm_pmu_event_filter {
 /* for KVM_{GET,SET,HAS}_DEVICE_ATTR */
 #define KVM_VCPU_TSC_CTRL 0 /* control group for the timestamp counter (TSC) */
 #define   KVM_VCPU_TSC_OFFSET 0 /* attribute for the TSC offset */
+#define   KVM_VCPU_TSC_VALUE 1 /* attribute for the TSC value */
+
+/* for KVM_VCPU_TSC_VALUE */
+struct kvm_vcpu_tsc_value {
+	__u64 tsc_val;
+	__u64 kvm_ns;
+};
 
 #endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a44d083f1bf9..ed8c2729eae2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5169,6 +5169,7 @@ static int kvm_arch_tsc_has_attr(struct kvm_vcpu *vcpu,
 
 	switch (attr->attr) {
 	case KVM_VCPU_TSC_OFFSET:
+	case KVM_VCPU_TSC_VALUE:
 		r = 0;
 		break;
 	default:
@@ -5194,6 +5195,32 @@ static int kvm_arch_tsc_get_attr(struct kvm_vcpu *vcpu,
 			break;
 		r = 0;
 		break;
+	case KVM_VCPU_TSC_VALUE: {
+		struct kvm_vcpu_tsc_value __user *tsc_value_arg;
+		struct kvm_vcpu_tsc_value tsc_value;
+		struct kvm_clock_data kvm_clock;
+		u64 host_tsc, guest_tsc, ratio, offset;
+
+		get_kvmclock(vcpu->kvm, &kvm_clock);
+		if (kvm_clock.flags & KVM_CLOCK_HOST_TSC)
+			host_tsc = kvm_clock.host_tsc;
+		else
+			host_tsc = rdtsc();
+
+		ratio = vcpu->arch.l1_tsc_scaling_ratio;
+		offset = vcpu->arch.l1_tsc_offset;
+		guest_tsc = kvm_scale_tsc(host_tsc, ratio) + offset;
+
+		tsc_value.kvm_ns = kvm_clock.clock;
+		tsc_value.tsc_val = guest_tsc;
+
+		tsc_value_arg = (struct kvm_vcpu_tsc_value __user *)uaddr;
+		r = -EFAULT;
+		if (copy_to_user(tsc_value_arg, &tsc_value, sizeof(tsc_value)))
+			break;
+		r = 0;
+		break;
+	}
 	default:
 		r = -ENXIO;
 	}
@@ -5236,6 +5263,20 @@ static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu,
 		r = 0;
 		break;
 	}
+	case KVM_VCPU_TSC_VALUE: {
+		struct kvm_vcpu_tsc_value __user *tsc_value_arg;
+		struct kvm_vcpu_tsc_value tsc_value;
+
+		tsc_value_arg = (struct kvm_vcpu_tsc_value __user *)uaddr;
+		r = -EFAULT;
+		if (copy_from_user(&tsc_value, tsc_value_arg, sizeof(tsc_value)))
+			break;
+
+		kvm_synchronize_tsc(vcpu, tsc_value.tsc_val, &tsc_value.kvm_ns);
+
+		r = 0;
+		break;
+	}
 	default:
 		r = -ENXIO;
 	}
diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
index 21614807a2cb..79de9e34cfa8 100644
--- a/tools/arch/x86/include/uapi/asm/kvm.h
+++ b/tools/arch/x86/include/uapi/asm/kvm.h
@@ -525,5 +525,12 @@ struct kvm_pmu_event_filter {
 /* for KVM_{GET,SET,HAS}_DEVICE_ATTR */
 #define KVM_VCPU_TSC_CTRL 0 /* control group for the timestamp counter (TSC) */
 #define   KVM_VCPU_TSC_OFFSET 0 /* attribute for the TSC offset */
+#define   KVM_VCPU_TSC_VALUE 1 /* attribute for the TSC value */
+
+/* for KVM_VCPU_TSC_VALUE */
+struct kvm_vcpu_tsc_value {
+	__u64 tsc_val;
+	__u64 kvm_ns;
+};
 
 #endif /* _ASM_X86_KVM_H */
-- 
2.25.1








^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/2] KVM: x86: add KVM clock time reference arg to kvm_write_tsc()
  2022-07-07 16:43           ` [PATCH 1/2] KVM: x86: add KVM clock time reference arg to kvm_write_tsc() Simon Veith
  2022-07-07 16:43             ` [PATCH 2/2] KVM: x86: add KVM_VCPU_TSC_VALUE attribute Simon Veith
@ 2022-07-29 21:14             ` Sean Christopherson
  2023-02-02 16:35               ` David Woodhouse
  1 sibling, 1 reply; 44+ messages in thread
From: Sean Christopherson @ 2022-07-29 21:14 UTC (permalink / raw)
  To: Simon Veith
  Cc: dwmw2, dff, jmattson, joro, kvm, oupton, pbonzini, tglx,
	vkuznets, wanpengli, David Woodhouse

On Thu, Jul 07, 2022, Simon Veith wrote:
> The Time Stamp Counter (TSC) value register can be set to an absolute
> value using the KVM_SET_MSRS ioctl, which calls kvm_synchronize_tsc()
> internally.
> 
> Since this is a per-vCPU register, and vCPUs are often managed by
> separate threads, setting a uniform TSC value across all vCPUs is
> challenging: After live migration, for example, the TSC value may need
> to be adjusted to account for the migration downtime. Ideally, we would
> want each vCPU to be adjusted by the same offset; but if we compute the
> offset centrally, the TSC value may become out of date due to scheduling
> delays by the time that each vCPU thread gets around to issuing
> KVM_SET_MSRS.
> 
> In preparation for the next patch, this change adds an optional, KVM
> clock based time reference argument to kvm_synchronize_tsc(). This
> argument, if present, is understood to mean "the TSC value being written
> was valid at this corresponding KVM clock time point".


Given that commit 828ca89628bf ("KVM: x86: Expose TSC offset controls to userspace")
was merged less than a year ago, it would be helpful to explicitly call out why
KVM_VCPU_TSC_CTRL doesn't work, and why that sub-ioctl can't be extended/modified to
make it work.

> kvm_synchronize_tsc() will then use this clock reference to adjust the
> TSC value being written for any delays that have been incurred since the
> provided TSC value was valid.
> 
> Co-developed-by: David Woodhouse <dwmw@amazon.co.uk>

Needs David's SOB.  And this patch should be squashed with the next patch.  The
kvm_ns != NULL path is dead code here, e.g. even if the logic is wildly broken,
bisection will blame the patch that actually passes a non-null reference.  Neither
patch is big enough to warrant splitting in this way.

And more importantly, the next patch's changelog provides a thorough description
of what it's doing, but there's very little in there that describes _why_ KVM
wants to provide this functionality.

> Signed-off-by: Simon Veith <sveith@amazon.de>
> ---
>  arch/x86/kvm/x86.c | 22 +++++++++++++++++-----
>  1 file changed, 17 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 1910e1e78b15..a44d083f1bf9 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2629,7 +2629,7 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
>  	kvm_track_tsc_matching(vcpu);
>  }
>  
> -static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
> +static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data, u64 *kvm_ns)
>  {
>  	struct kvm *kvm = vcpu->kvm;
>  	u64 offset, ns, elapsed;
> @@ -2638,12 +2638,24 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
>  	bool synchronizing = false;
>  
>  	raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
> -	offset = kvm_compute_l1_tsc_offset(vcpu, data);
>  	ns = get_kvmclock_base_ns();
> +
> +	if (kvm_ns) {
> +		/*
> +		 * We have been provided a KVM clock reference time point at

Avoid pronouns, e.g. there's no need to say "We have been provided", just state
what kvm_ns is and how it's used.

> +		 * which this TSC value was correct.
> +		 * Use this time point to compensate for any delays that were
> +		 * incurred since that TSC value was valid.
> +		 */
> +		s64 delta_ns = ns + vcpu->kvm->arch.kvmclock_offset - *kvm_ns;
> +		data += nsec_to_cycles(vcpu, (u64)delta_ns);
> +	}

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/2] KVM: x86: add KVM_VCPU_TSC_VALUE attribute
  2022-07-07 16:43             ` [PATCH 2/2] KVM: x86: add KVM_VCPU_TSC_VALUE attribute Simon Veith
@ 2022-07-29 21:21               ` Sean Christopherson
  0 siblings, 0 replies; 44+ messages in thread
From: Sean Christopherson @ 2022-07-29 21:21 UTC (permalink / raw)
  To: Simon Veith
  Cc: dwmw2, dff, jmattson, joro, kvm, oupton, pbonzini, tglx,
	vkuznets, wanpengli

On Thu, Jul 07, 2022, Simon Veith wrote:
> With the previous commit having added a KVM clock time reference to

Moot point if the two patches are squashed, but avoid "previous commit" and "next
commit" and instead use less precise language, e.g. "Now that TSC synchronization
can account for a KVM clock time reference, blah blah blah".

> kvm_synchronize_tsc(), this patch adds a new TSC attribute

Don't use "this patch", the reader knows it's a patch.  Just state what's being
done in imperative mood, e.g. "add a pair of ioctls to allow getting and setting ..."

> KVM_VCPU_TSC_VALUE that allows for setting the TSC value in a way that
> is unaffected by scheduling delays.
> 
> Userspace provides a struct kvm_vcpu_tsc_value consisting of a matched
> pair of ( guest TSC value, KVM clock value ). The TSC value that will
> ultimately be written is adjusted to account for the time which has
> elapsed since the given KVM clock time point.
> 
> In order to allow userspace to retrieve an accurate time reference
> atomically, without being affected by scheduling delays between
> KVM_GET_CLOCK and KVM_GET_MSRS, the KVM_GET_DEVICE_ATTR implementation
> for this attribute uses get_kvmclock() internally and returns a struct
> kvm_vcpu_tsc_value with both values in one go. If get_kvmclock()
> supports the KVM_CLOCK_HOST_TSC flag, the two will be based on one and
> the same host TSC reading.
> 
> Signed-off-by: Simon Veith <sveith@amazon.de>
> ---

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/2] KVM: x86: add KVM clock time reference arg to kvm_write_tsc()
  2022-07-29 21:14             ` [PATCH 1/2] KVM: x86: add KVM clock time reference arg to kvm_write_tsc() Sean Christopherson
@ 2023-02-02 16:35               ` David Woodhouse
  2023-02-02 16:59                 ` [PATCH v2] KVM: x86: add KVM_VCPU_TSC_VALUE attribute Simon Veith
  0 siblings, 1 reply; 44+ messages in thread
From: David Woodhouse @ 2023-02-02 16:35 UTC (permalink / raw)
  To: Sean Christopherson, Simon Veith
  Cc: dff, jmattson, joro, kvm, oupton, pbonzini, tglx, vkuznets, wanpengli


On Fri, 2022-07-29 at 21:14 +0000, Sean Christopherson wrote:
> On Thu, Jul 07, 2022, Simon Veith wrote:
> > The Time Stamp Counter (TSC) value register can be set to an absolute
> > value using the KVM_SET_MSRS ioctl, which calls kvm_synchronize_tsc()
> > internally.
> > 
> > Since this is a per-vCPU register, and vCPUs are often managed by
> > separate threads, setting a uniform TSC value across all vCPUs is
> > challenging: After live migration, for example, the TSC value may need
> > to be adjusted to account for the migration downtime. Ideally, we would
> > want each vCPU to be adjusted by the same offset; but if we compute the
> > offset centrally, the TSC value may become out of date due to scheduling
> > delays by the time that each vCPU thread gets around to issuing
> > KVM_SET_MSRS.
> > 
> > In preparation for the next patch, this change adds an optional, KVM
> > clock based time reference argument to kvm_synchronize_tsc(). This
> > argument, if present, is understood to mean "the TSC value being written
> > was valid at this corresponding KVM clock time point".
> 
> 
> Given that commit 828ca89628bf ("KVM: x86: Expose TSC offset controls to userspace")
> was merged less than a year ago, it would be helpful to explicitly call out why
> KVM_VCPU_TSC_CTRL doesn't work, and why that sub-ioctl can't be extended/modified to
> make it work.
> 
> > kvm_synchronize_tsc() will then use this clock reference to adjust the
> > TSC value being written for any delays that have been incurred since the
> > provided TSC value was valid.
> > 
> > Co-developed-by: David Woodhouse <dwmw@amazon.co.uk>
> 
> Needs David's SOB. 

For the record (and for the half-baked "do it like this, except in a
way that might stand a chance of compiling..." that I threw at Simon on
Slack and didn't necessarily even expect him to credit me for
authorship)

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v2] KVM: x86: add KVM_VCPU_TSC_VALUE attribute
  2023-02-02 16:35               ` David Woodhouse
@ 2023-02-02 16:59                 ` Simon Veith
  2023-03-15 19:57                   ` Sean Christopherson
  0 siblings, 1 reply; 44+ messages in thread
From: Simon Veith @ 2023-02-02 16:59 UTC (permalink / raw)
  To: dwmw2
  Cc: dff, jmattson, joro, kvm, oupton, pbonzini, seanjc, sveith, tglx,
	vkuznets, wanpengli

The Time Stamp Counter (TSC) value register can be set to an absolute
value using the KVM_SET_MSRS ioctl. Since this is a per-vCPU register,
and vCPUs are often managed by separate threads, setting a uniform TSC
value across all vCPUs is challenging: After live update or live
migration, the TSC value may need to be adjusted to account for the
incurred downtime.

Ideally, we want such an adjustment to happen uniformly across all
vCPUs; however, if we compute the offset centrally, the TSC value may
become out of date due to scheduling delays by the time that each vCPU
thread gets around to issuing KVM_SET_MSRS. Such delays can lead to
unaccounted pause time (the guest observes that its system clock has
fallen behind the NTP reference time).

To avoid such inaccuracies from the use of KVM_SET_MSRS, there is an
alternative attribute KVM_VCPU_TSC_OFFSET which, rather than setting an
absolute TSC value, defines it in terms of the offset to be applied
relative to the host's TSC value. Using this attribute, the TSC can be
adjusted reliably by userspace, but only if TSC scaling remains
unchanged, i.e., in the case of live update on the same host, and not
when live migrating an instance between hosts with different TSC scaling
parameters.

In the case of live migration, using the KVM_VCPU_TSC_OFFSET approach to
preserve the TSC value and apply a known offset would require
duplicating the TSC scaling computations in userspace to account for
frequency differences between source and destination TSCs.

Hence, if userspace wants to set the TSC to some known value without
having to deal with TSC scaling, and while also being resilient against
scheduling delays, neither KVM_SET_MSRS nor KVM_VCPU_TSC_OFFSET are
suitable options.

Add a new TSC attribute KVM_VCPU_TSC_VALUE that allows for setting the
TSC value in a way that is unaffected by scheduling delays, handling TSC
scaling internally.

Add an optional, KVM clock based time reference argument to
kvm_synchronize_tsc(). This argument, if present, is understood to mean
"the TSC value being written was valid at this corresponding KVM clock
time point".

Userspace provides a struct kvm_vcpu_tsc_value consisting of a matched
pair of ( guest TSC value, KVM clock value ). The TSC value that will
ultimately be written is adjusted to account for the time which has
elapsed since the given KVM clock time point.

In order to allow userspace to retrieve an accurate time reference
atomically, without being affected by scheduling delays between
KVM_GET_CLOCK and KVM_GET_MSRS, the KVM_GET_DEVICE_ATTR implementation
for this attribute uses get_kvmclock() internally and returns a struct
kvm_vcpu_tsc_value with both values in one go. If get_kvmclock()
supports the KVM_CLOCK_HOST_TSC flag, the two will be based on one and
the same host TSC reading.
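
For illustration, the intended userspace flow is roughly the following
(sketch; error handling and the migration transport are omitted, and the
KVM clock itself is assumed to have been restored with KVM_SET_CLOCK
beforehand):

	/* Source, per vCPU: snapshot the matched pair atomically. */
	struct kvm_vcpu_tsc_value val;
	struct kvm_device_attr attr = {
		.group = KVM_VCPU_TSC_CTRL,
		.attr  = KVM_VCPU_TSC_VALUE,
		.addr  = (__u64)(unsigned long)&val,
	};

	ioctl(vcpu_fd, KVM_GET_DEVICE_ATTR, &attr);
	/* ... transfer val.tsc_val and val.kvm_ns to the destination ... */

	/* Destination, per vCPU: write the pair back unchanged. KVM adds
	 * nsec_to_cycles(current kvmclock - val.kvm_ns) internally, so
	 * per-thread scheduling delays no longer skew the result. */
	ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr);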

Co-developed-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Simon Veith <sveith@amazon.de>
---
V2:
 - Squashed into a single change
 - Added justification for introducing a new interface
 - Added missing Signed-off-by
 - Re-worded comment

 Documentation/virt/kvm/devices/vcpu.rst | 22 +++++++++
 arch/x86/include/uapi/asm/kvm.h         |  7 +++
 arch/x86/kvm/x86.c                      | 63 +++++++++++++++++++++++--
 tools/arch/x86/include/uapi/asm/kvm.h   |  7 +++
 4 files changed, 94 insertions(+), 5 deletions(-)

diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
index 31f14ec4a65b..240a3646947c 100644
--- a/Documentation/virt/kvm/devices/vcpu.rst
+++ b/Documentation/virt/kvm/devices/vcpu.rst
@@ -265,3 +265,25 @@ From the destination VMM process:
 
 7. Write the KVM_VCPU_TSC_OFFSET attribute for every vCPU with the
    respective value derived in the previous step.
+
+4.2 ATTRIBUTE: KVM_VCPU_TSC_VALUE
+
+:Parameters: kvm_device_attr.addr points to a struct kvm_vcpu_tsc_value
+
+Returns:
+
+	 ======= ======================================
+	 -EFAULT Error reading/writing the provided
+		 parameter address.
+	 -ENXIO  Attribute not supported
+	 ======= ======================================
+
+Gets or sets a matched pair of guest TSC value and KVM clock time point.
+
+When setting the TSC value through this attribute, a corresponding KVM clock
+reference time point (as retrieved by KVM_GET_CLOCK in the clock field) must be
+provided.
+
+The actual TSC value written will be adjusted based on the time that has
+elapsed since the provided reference time point, taking TSC scaling into
+account.
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index e48deab8901d..f99bdb959b54 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -528,5 +528,12 @@ struct kvm_pmu_event_filter {
 /* for KVM_{GET,SET,HAS}_DEVICE_ATTR */
 #define KVM_VCPU_TSC_CTRL 0 /* control group for the timestamp counter (TSC) */
 #define   KVM_VCPU_TSC_OFFSET 0 /* attribute for the TSC offset */
+#define   KVM_VCPU_TSC_VALUE 1 /* attribute for the TSC value */
+
+/* for KVM_VCPU_TSC_VALUE */
+struct kvm_vcpu_tsc_value {
+	__u64 tsc_val;
+	__u64 kvm_ns;
+};
 
 #endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index da4bbd043a7b..b174200c909b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2655,7 +2655,7 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
 	kvm_track_tsc_matching(vcpu);
 }
 
-static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
+static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data, u64 *kvm_ns)
 {
 	struct kvm *kvm = vcpu->kvm;
 	u64 offset, ns, elapsed;
@@ -2664,12 +2664,24 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
 	bool synchronizing = false;
 
 	raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
-	offset = kvm_compute_l1_tsc_offset(vcpu, data);
 	ns = get_kvmclock_base_ns();
+
+	if (kvm_ns) {
+		/*
+		 * kvm_ns is the KVM clock reference time point at which this
+		 * TSC value was correct. Use this time point to compensate for
+		 * any delays that have been incurred since that TSC value was
+		 * valid.
+		 */
+		s64 delta_ns = ns + vcpu->kvm->arch.kvmclock_offset - *kvm_ns;
+		data += nsec_to_cycles(vcpu, (u64)delta_ns);
+	}
+
+	offset = kvm_compute_l1_tsc_offset(vcpu, data);
 	elapsed = ns - kvm->arch.last_tsc_nsec;
 
 	if (vcpu->arch.virtual_tsc_khz) {
-		if (data == 0) {
+		if (data == 0 && !kvm_ns) {
 			/*
 			 * detection of vcpu initialization -- need to sync
 			 * with other vCPUs. This particularly helps to keep
@@ -3672,7 +3684,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		break;
 	case MSR_IA32_TSC:
 		if (msr_info->host_initiated) {
-			kvm_synchronize_tsc(vcpu, data);
+			kvm_synchronize_tsc(vcpu, data, NULL);
 		} else {
 			u64 adj = kvm_compute_l1_tsc_offset(vcpu, data) - vcpu->arch.l1_tsc_offset;
 			adjust_tsc_offset_guest(vcpu, adj);
@@ -5375,6 +5387,7 @@ static int kvm_arch_tsc_has_attr(struct kvm_vcpu *vcpu,
 
 	switch (attr->attr) {
 	case KVM_VCPU_TSC_OFFSET:
+	case KVM_VCPU_TSC_VALUE:
 		r = 0;
 		break;
 	default:
@@ -5400,6 +5413,32 @@ static int kvm_arch_tsc_get_attr(struct kvm_vcpu *vcpu,
 			break;
 		r = 0;
 		break;
+	case KVM_VCPU_TSC_VALUE: {
+		struct kvm_vcpu_tsc_value __user *tsc_value_arg;
+		struct kvm_vcpu_tsc_value tsc_value;
+		struct kvm_clock_data kvm_clock;
+		u64 host_tsc, guest_tsc, ratio, offset;
+
+		get_kvmclock(vcpu->kvm, &kvm_clock);
+		if (kvm_clock.flags & KVM_CLOCK_HOST_TSC)
+			host_tsc = kvm_clock.host_tsc;
+		else
+			host_tsc = rdtsc();
+
+		ratio = vcpu->arch.l1_tsc_scaling_ratio;
+		offset = vcpu->arch.l1_tsc_offset;
+		guest_tsc = kvm_scale_tsc(host_tsc, ratio) + offset;
+
+		tsc_value.kvm_ns = kvm_clock.clock;
+		tsc_value.tsc_val = guest_tsc;
+
+		tsc_value_arg = (struct kvm_vcpu_tsc_value __user *)uaddr;
+		r = -EFAULT;
+		if (copy_to_user(tsc_value_arg, &tsc_value, sizeof(tsc_value)))
+			break;
+		r = 0;
+		break;
+	}
 	default:
 		r = -ENXIO;
 	}
@@ -5442,6 +5481,20 @@ static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu,
 		r = 0;
 		break;
 	}
+	case KVM_VCPU_TSC_VALUE: {
+		struct kvm_vcpu_tsc_value __user *tsc_value_arg;
+		struct kvm_vcpu_tsc_value tsc_value;
+
+		tsc_value_arg = (struct kvm_vcpu_tsc_value __user *)uaddr;
+		r = -EFAULT;
+		if (copy_from_user(&tsc_value, tsc_value_arg, sizeof(tsc_value)))
+			break;
+
+		kvm_synchronize_tsc(vcpu, tsc_value.tsc_val, &tsc_value.kvm_ns);
+
+		r = 0;
+		break;
+	}
 	default:
 		r = -ENXIO;
 	}
@@ -11668,7 +11721,7 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
 	if (mutex_lock_killable(&vcpu->mutex))
 		return;
 	vcpu_load(vcpu);
-	kvm_synchronize_tsc(vcpu, 0);
+	kvm_synchronize_tsc(vcpu, 0, NULL);
 	vcpu_put(vcpu);
 
 	/* poll control enabled by default */
diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
index e48deab8901d..f99bdb959b54 100644
--- a/tools/arch/x86/include/uapi/asm/kvm.h
+++ b/tools/arch/x86/include/uapi/asm/kvm.h
@@ -528,5 +528,12 @@ struct kvm_pmu_event_filter {
 /* for KVM_{GET,SET,HAS}_DEVICE_ATTR */
 #define KVM_VCPU_TSC_CTRL 0 /* control group for the timestamp counter (TSC) */
 #define   KVM_VCPU_TSC_OFFSET 0 /* attribute for the TSC offset */
+#define   KVM_VCPU_TSC_VALUE 1 /* attribute for the TSC value */
+
+/* for KVM_VCPU_TSC_VALUE */
+struct kvm_vcpu_tsc_value {
+	__u64 tsc_val;
+	__u64 kvm_ns;
+};
 
 #endif /* _ASM_X86_KVM_H */

base-commit: 9f266ccaa2f5228bfe67ad58a94ca4e0109b954a
-- 
2.34.1








^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v2] KVM: x86: add KVM_VCPU_TSC_VALUE attribute
  2023-02-02 16:59                 ` [PATCH v2] KVM: x86: add KVM_VCPU_TSC_VALUE attribute Simon Veith
@ 2023-03-15 19:57                   ` Sean Christopherson
  2023-03-23 23:26                     ` David Woodhouse
  2023-03-24 11:22                     ` Paolo Bonzini
  0 siblings, 2 replies; 44+ messages in thread
From: Sean Christopherson @ 2023-03-15 19:57 UTC (permalink / raw)
  To: Simon Veith
  Cc: dwmw2, dff, jmattson, joro, kvm, oupton, pbonzini, tglx,
	vkuznets, wanpengli

Please don't send vN+1 In-Reply-To vN, the threading messes up lore and other
tooling, and becomes really problematic for humans when N gets large.

On Thu, Feb 02, 2023, Simon Veith wrote:
> In the case of live migration, using the KVM_VCPU_TSC_OFFSET approach to
> preserve the TSC value and apply a known offset would require
> duplicating the TSC scaling computations in userspace to account for
> frequency differences between source and destination TSCs.
> 
> Hence, if userspace wants to set the TSC to some known value without
> having to deal with TSC scaling, and while also being resilient against
> scheduling delays, neither KVM_SET_MSRS nor KVM_VCPU_TSC_OFFSET are
> suitable options.

Requiring userspace to handle certain aspects of TSC scaling doesn't seem
particularly onerous, at least not relative to all the other time insanity.  In
other words, why should KVM take on more complexity and a mostly-redundant uAPI?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2] KVM: x86: add KVM_VCPU_TSC_VALUE attribute
  2023-03-15 19:57                   ` Sean Christopherson
@ 2023-03-23 23:26                     ` David Woodhouse
  2023-03-24 11:22                     ` Paolo Bonzini
  1 sibling, 0 replies; 44+ messages in thread
From: David Woodhouse @ 2023-03-23 23:26 UTC (permalink / raw)
  To: Sean Christopherson, Simon Veith
  Cc: dff, jmattson, joro, kvm, oupton, pbonzini, tglx, vkuznets, wanpengli


On Wed, 2023-03-15 at 12:57 -0700, Sean Christopherson wrote:
> 
> On Thu, Feb 02, 2023, Simon Veith wrote:
> > In the case of live migration, using the KVM_VCPU_TSC_OFFSET approach to
> > preserve the TSC value and apply a known offset would require
> > duplicating the TSC scaling computations in userspace to account for
> > frequency differences between source and destination TSCs.
> > 
> > Hence, if userspace wants to set the TSC to some known value without
> > having to deal with TSC scaling, and while also being resilient against
> > scheduling delays, neither KVM_SET_MSRS nor KVM_VCPU_TSC_OFFSET are
> > suitable options.
> 
> Requiring userspace to handle certain aspects of TSC scaling doesn't seem
> particularly onerous, at least not relative to all the other time insanity.  In
> other words, why should KVM take on more complexity and a mostly-redundant uAPI?

Because it offers userspace a simple and easy to use API. You get the
time information, you put the time information.

Now, if you want to *document* the actual calculations that userspace
would be forced to do just to preserve the existing relationship
between the guest TSC and the time of day / host TSC (for LM / LU
respectively).... well, you can try, but it exceeded my "if it needs
documenting this much, fix it first" threshold.

Does userspace even *have* what it needs to do this *right*? 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2] KVM: x86: add KVM_VCPU_TSC_VALUE attribute
  2023-03-15 19:57                   ` Sean Christopherson
  2023-03-23 23:26                     ` David Woodhouse
@ 2023-03-24 11:22                     ` Paolo Bonzini
  2023-03-24 13:08                       ` David Woodhouse
  2023-09-13 14:08                       ` David Woodhouse
  1 sibling, 2 replies; 44+ messages in thread
From: Paolo Bonzini @ 2023-03-24 11:22 UTC (permalink / raw)
  To: Sean Christopherson, Simon Veith
  Cc: dwmw2, dff, jmattson, joro, kvm, oupton, tglx, vkuznets, wanpengli

On 3/15/23 20:57, Sean Christopherson wrote:
>> In the case of live migration, using the KVM_VCPU_TSC_OFFSET approach to
>> preserve the TSC value and apply a known offset would require
>> duplicating the TSC scaling computations in userspace to account for
>> frequency differences between source and destination TSCs.
>>
>> Hence, if userspace wants to set the TSC to some known value without
>> having to deal with TSC scaling, and while also being resilient against
>> scheduling delays, neither KVM_SET_MSRS nor KVM_VCPU_TSC_OFFSET are
>> suitable options.
>
> Requiring userspace to handle certain aspects of TSC scaling doesn't seem
> particularly onerous, at least not relative to all the other time insanity.  In
> other words, why should KVM take on more complexity and a mostly-redundant uAPI?

Yeah, it seems like the problem is that KVM_GET_CLOCK returns host
unscaled TSC units (which was done because the guest TSC frequency is at 
least in theory per-CPU, and KVM_GET_CLOCK is a vm ioctl)?

Perhaps it's more important (uAPI-wise) for KVM to return the precise 
guest/host TSC ratio via a vcpu device attribute?

Paolo


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2] KVM: x86: add KVM_VCPU_TSC_VALUE attribute
  2023-03-24 11:22                     ` Paolo Bonzini
@ 2023-03-24 13:08                       ` David Woodhouse
  2023-09-13 14:08                       ` David Woodhouse
  1 sibling, 0 replies; 44+ messages in thread
From: David Woodhouse @ 2023-03-24 13:08 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Simon Veith
  Cc: dff, jmattson, joro, kvm, oupton, tglx, vkuznets, wanpengli


On Fri, 2023-03-24 at 12:22 +0100, Paolo Bonzini wrote:
> On 3/15/23 20:57, Sean Christopherson wrote:
> > > In the case of live migration, using the KVM_VCPU_TSC_OFFSET approach to
> > > preserve the TSC value and apply a known offset would require
> > > duplicating the TSC scaling computations in userspace to account for
> > > frequency differences between source and destination TSCs.
> > > 
> > > Hence, if userspace wants to set the TSC to some known value without
> > > having to deal with TSC scaling, and while also being resilient against
> > > scheduling delays, neither KVM_SET_MSRS nor KVM_VCPU_TSC_OFFSET are
> > > suitable options.
> > 
> > Requiring userspace to handle certain aspects of TSC scaling doesn't seem
> > particularly onerous, at least not relative to all the other time insanity.  In
> > other words, why should KVM take on more complexity and a mostly-redundant uAPI?
> 
> Yeah, it seems like the problem is that KVM_GET_CLOCK returns host
> unscaled TSC units (which was done because the guest TSC frequency is at 
> least in theory per-CPU, and KVM_GET_CLOCK is a vm ioctl)?
> 
> Perhaps it's more important (uAPI-wise) for KVM to return the precise
> guest/host TSC ratio via a vcpu device attribute?
> 

My criteria for this are that in the case of a live update (serialize
guest, kexec, resume guest on precisely the same hardware a few
milliseconds of steal time later), the guest clocks (KVM_CLOCK, TSC,
etc) shall be *precisely* the same as before with no slop. The same
offset, the same scaling. And ideally precisely the same values
advertised in the pvclock data to the guest.

In the case of live migration, we can't be cycle-accurate because of
course we're limited to the accuracy of the NTP sync between hosts. But
that is the *only* inaccuracy we shall incur, because we can express
clocks in terms of each other, e.g. <KVM clock was X at time of day Y>
and then e.g. <TSC was W at KVM clock X> is unchanged from before.

It's OK to expect userspace to do some calculation, as long as we never
expect userspace to magically perform calculations based on some
concept of "now", and to call kernel APIs, without any actual time
elapsing while it does so.

I said it's OK to expect userspace to do *some* calculation. But that
should be clearly documented, *and* when we document it, that
documentation shouldn't codify too much of the kernel's internal
relationships between clocks, and shouldn't make us ashamed to be
kernel engineers. 

We tried documenting it, in
https://lore.kernel.org/all/20220316045308.2313184-1-oupton@google.com/

I don't quite know how to summarise that thread, other than "it's too
broken; let's fix it first and *then* document it".

But if it can be done, I'm happy for someone to fix the documentation
in a way which describes how to meet the above criteria using the
existing kernel APIs. And then perhaps we can make a decision about
just how ashamed of ourselves we should be, and whether we want to
provide a better, easier API for userspace to use.





^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2] KVM: x86: add KVM_VCPU_TSC_VALUE attribute
  2023-03-24 11:22                     ` Paolo Bonzini
  2023-03-24 13:08                       ` David Woodhouse
@ 2023-09-13 14:08                       ` David Woodhouse
  1 sibling, 0 replies; 44+ messages in thread
From: David Woodhouse @ 2023-09-13 14:08 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Simon Veith
  Cc: dff, jmattson, joro, kvm, oupton, tglx, vkuznets, wanpengli

[-- Attachment #1: Type: text/plain, Size: 1356 bytes --]

On Fri, 2023-03-24 at 12:22 +0100, Paolo Bonzini wrote:
> On 3/15/23 20:57, Sean Christopherson wrote:
> > > In the case of live migration, using the KVM_VCPU_TSC_OFFSET approach to
> > > preserve the TSC value and apply a known offset would require
> > > duplicating the TSC scaling computations in userspace to account for
> > > frequency differences between source and destination TSCs.
> > > 
> > > Hence, if userspace wants to set the TSC to some known value without
> > > having to deal with TSC scaling, and while also being resilient against
> > > scheduling delays, neither KVM_SET_MSRS nor KVM_VCPU_TSC_VALUE are
> > > suitable options.
> > 
> > Requiring userspace to handle certain aspects of TSC scaling doesn't seem
> > particularly onerous, at least not relative to all the other time insanity.  In
> > other words, why should KVM take on more complexity and a mostly-redundant uAPI?
> 
> Yeah, it seems like the problem is that KVM_GET_CLOCK returns the host
> TSC in unscaled units (which was done because the guest TSC frequency is at
> least in theory per-CPU, and KVM_GET_CLOCK is a vm ioctl)?
> 
> Perhaps it's more important (uAPI-wise) for KVM to return the precise
> guest/host TSC ratio via a vcpu device attribute?

Perhaps. Although that really does suck as a userspace experience. See
the patch I just sent.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-21 19:43                         ` David Woodhouse
@ 2022-03-21 21:23                           ` Oliver Upton
  0 siblings, 0 replies; 44+ messages in thread
From: Oliver Upton @ 2022-03-21 21:23 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Franke, Daniel, Paolo Bonzini, kvm, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel

On Mon, Mar 21, 2022 at 07:43:21PM +0000, David Woodhouse wrote:
> On Mon, 2022-03-21 at 00:38 +0000, Oliver Upton wrote:
> > On Sun, Mar 20, 2022 at 09:46:35AM -0000, David Woodhouse wrote:
> > > But coincidentally since then I have started having conversations with
> > > people who really want the guest to have an immediate knowledge of the
> > > adjtimex maxerror etc. on the new host immediately after the migration.
> > > Maybe the "if the migration isn't fast enough then let the guest know it's
> > > now unsynced" is OK, but I'll need to work out what "immediately" means
> > > when we have a guest userspace component involved in it.
> > 
> > This has also been an area of interest to me. I think we've all seen the
> > many ways in which doing migrations behind the guest's back can put software
> > in an extremely undesirable state on the other end. If those
> > conversations are taking place on the mailing lists, could you please CC
> > me?
> > 
> > Our (Google) TSC adjustment clamping and userspace notification mechanism
> > was a halfway kludge to keep things happy on the other end. And it
> > generally has worked well, but misses a fundamental point.
> > 
> > The hypervisor should tell the guest kernel about time travel and let it
> > cascade that information throughout the guest system. Regardless of what
> > we do to the TSC, we invariably destroy one of the two guest clocks along
> > the way. If we told the guest "you time traveled X seconds", it could
> > fold that into its own idea of real time. Guest kernel can then fire off
> > events to inform software that wants to keep up with clock changes, and
> > even a new event to let NTP know it's probably running on different
> > hardware.
> > 
> > Time sucks :-)
> 
> So, we already have PVCLOCK_GUEST_STOPPED which tells the guest that
> its clock may have experienced a jump. Linux guests will use this to
> kick various watchdogs to prevent them whining. Shouldn't we *also* be
> driving the NTP reset from that same signal?

Right, but I'd argue that interface has some problems too. It
depends on the guest polling instead of an interrupt from the
hypervisor. It also has no way of informing the kernel exactly how much
time has elapsed.

The whole point of all these hacks that we've done internally is that we,
the hypervisor, know full well how much real time has advanced during the
VM blackout. If we can at least let the guest know how much to fudge real
time, it can then poke NTP for better refinement. I worry about using NTP
as the sole source of truth for such a mechanism, since you'll need to go
out to the network and any reads until the response comes back are hosed.

--
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-21  0:38                       ` Oliver Upton
@ 2022-03-21 19:43                         ` David Woodhouse
  2022-03-21 21:23                           ` Oliver Upton
  0 siblings, 1 reply; 44+ messages in thread
From: David Woodhouse @ 2022-03-21 19:43 UTC (permalink / raw)
  To: Oliver Upton, Franke, Daniel
  Cc: Paolo Bonzini, kvm, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel

[-- Attachment #1: Type: text/plain, Size: 1909 bytes --]

On Mon, 2022-03-21 at 00:38 +0000, Oliver Upton wrote:
> On Sun, Mar 20, 2022 at 09:46:35AM -0000, David Woodhouse wrote:
> > But coincidentally since then I have started having conversations with
> > people who really want the guest to have an immediate knowledge of the
> > adjtimex maxerror etc. on the new host immediately after the migration.
> > Maybe the "if the migration isn't fast enough then let the guest know it's
> > now unsynced" is OK, but I'll need to work out what "immediately" means
> > when we have a guest userspace component involved in it.
> 
> This has also been an area of interest to me. I think we've all seen the
> many ways in which doing migrations behind the guest's back can put software
> in an extremely undesirable state on the other end. If those
> conversations are taking place on the mailing lists, could you please CC
> me?
> 
> Our (Google) TSC adjustment clamping and userspace notification mechanism
> was a halfway kludge to keep things happy on the other end. And it
> generally has worked well, but misses a fundamental point.
> 
> The hypervisor should tell the guest kernel about time travel and let it
> cascade that information throughout the guest system. Regardless of what
> we do to the TSC, we invariably destroy one of the two guest clocks along
> the way. If we told the guest "you time traveled X seconds", it could
> fold that into its own idea of real time. Guest kernel can then fire off
> events to inform software that wants to keep up with clock changes, and
> even a new event to let NTP know it's probably running on different
> hardware.
> 
> Time sucks :-)

So, we already have PVCLOCK_GUEST_STOPPED which tells the guest that
its clock may have experienced a jump. Linux guests will use this to
kick various watchdogs to prevent them whining. Shouldn't we *also* be
driving the NTP reset from that same signal?


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-21 13:10                         ` Paolo Bonzini
@ 2022-03-21 14:59                           ` David Woodhouse
  0 siblings, 0 replies; 44+ messages in thread
From: David Woodhouse @ 2022-03-21 14:59 UTC (permalink / raw)
  To: Paolo Bonzini, Oliver Upton
  Cc: kvm, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel

[-- Attachment #1: Type: text/plain, Size: 2687 bytes --]

On Mon, 2022-03-21 at 14:10 +0100, Paolo Bonzini wrote:
> On 3/21/22 13:16, David Woodhouse wrote:
> > Hm, I still don't really understand why it's a 5-tuple and why we care
> > about the *host* TSC at all.
> > 
> > Fundamentally, guest state is *guest* state. There is a 3-tuple of
> >   { NTP-synchronized time of day, Guest KVM clock, Guest TSC value }
> > 
> > (plus a TSC offset from vCPU0, for each other vCPU perhaps.)
> > 
> > All else is redundant, and including host values is just wrong. [...]
> > I don't understand why the actual *value* of the host TSC is something
> > that userspace needs to see.
> 
> Ok, I understand now.
> 
> You're right, you don't need the value of the host TSC; you only need a 
> {hostTOD, guestNS, guestTSC} tuple for every vCPU recorded on the source.

Actually, isn't that still redundant? All we really need is a *single*
{ hostTOD, guestNS } for the KVM clock, and then each vCPU has its own
{ guestNS, guestTSC } tuple.
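
As a rough sketch (hypothetical struct names, not an existing uAPI), that
per-VM pair plus per-vCPU tuple could be carried as something like:

#include <stdint.h>

/* One snapshot for the whole VM, taken at a single instant. */
struct vm_clock_snapshot {
	uint64_t realtime_ns;	/* CLOCK_REALTIME at the sampling instant */
	uint64_t kvmclock_ns;	/* KVM clock at the same instant */
};

/* One per vCPU; tolerates per-vCPU TSC frequencies. */
struct vcpu_tsc_snapshot {
	uint64_t kvmclock_ns;	/* KVM clock at the sampling instant */
	uint64_t guest_tsc;	/* guest-visible TSC at that same instant */
	uint32_t tsc_khz;	/* this vCPU's guest TSC frequency */
};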

> If all the frequencies are the same, that can be "packed" as {hostTOD, 
> guestNS, anyTSC} plus N offsets.  The N offsets in turn can be 
> KVM_VCPU_TSC_OFFSET if you use the hostTSC, or the offsets between vCPU0 
> and the others if you use the vCPU0 guestTSC.
> 
> I think reasoning in terms of the host TSC is nicer in general, because 
> it doesn't make vCPU0 special.  But apart from the aesthetics of having 
> a "special" vCPU, making vCPU0 special is actually harder, because the 
> TSC frequencies need not be the same for all vCPUs.  I think that is a 
> mistake in the KVM API, but it was done long before I was involved (and 
> long before I actually understood this stuff).

If each vCPU has its own { guestNS, guestTSC } tuple, that actually
works out just fine even if they have different frequencies. The
*common* case would be that they are all at the same frequency and have
the same value at the same time. But other cases can be accommodated.

I'm not averse to *reasoning* in terms of the host TSC; I just don't
like exposing the actual numbers to userspace and forcing userspace to
access it through some new ABI just in order to translate some other
fundamental property (either the time of day, or the guest data) into
that domain. And I especially don't like considering it part of 'guest
state'.

Right now when I'm not frowning at TSC synchronisation issues, I'm
working on guest transparent live migration from actual Xen, to
KVM-pretending-to-be-Xen. That kind of insanity is only really possible
with a strict adherence to the design principle that "guest state is
guest state", without conflating it with host/implementation details :)


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-21 12:16                       ` David Woodhouse
@ 2022-03-21 13:10                         ` Paolo Bonzini
  2022-03-21 14:59                           ` David Woodhouse
  0 siblings, 1 reply; 44+ messages in thread
From: Paolo Bonzini @ 2022-03-21 13:10 UTC (permalink / raw)
  To: David Woodhouse, Oliver Upton
  Cc: kvm, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel

On 3/21/22 13:16, David Woodhouse wrote:
> Hm, I still don't really understand why it's a 5-tuple and why we care
> about the *host* TSC at all.
> 
> Fundamentally, guest state is *guest* state. There is a 3-tuple of
>   { NTP-synchronized time of day, Guest KVM clock, Guest TSC value }
> 
> (plus a TSC offset from vCPU0, for each other vCPU perhaps.)
> 
> All else is redundant, and including host values is just wrong. [...]
> I don't understand why the actual *value* of the host TSC is something
> that userspace needs to see.

Ok, I understand now.

You're right, you don't need the value of the host TSC; you only need a 
{hostTOD, guestNS, guestTSC} tuple for every vCPU recorded on the source.

If all the frequencies are the same, that can be "packed" as {hostTOD, 
guestNS, anyTSC} plus N offsets.  The N offsets in turn can be 
KVM_VCPU_TSC_OFFSET if you use the hostTSC, or the offsets between vCPU0 
and the others if you use the vCPU0 guestTSC.

I think reasoning in terms of the host TSC is nicer in general, because 
it doesn't make vCPU0 special.  But apart from the aesthetics of having 
a "special" vCPU, making vCPU0 special is actually harder, because the 
TSC frequencies need not be the same for all vCPUs.  I think that is a 
mistake in the KVM API, but it was done long before I was involved (and 
long before I actually understood this stuff).

Paolo


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-21 12:36                         ` Paolo Bonzini
@ 2022-03-21 12:56                           ` David Woodhouse
  0 siblings, 0 replies; 44+ messages in thread
From: David Woodhouse @ 2022-03-21 12:56 UTC (permalink / raw)
  To: Paolo Bonzini, Oliver Upton, Franke, Daniel
  Cc: kvm, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel

[-- Attachment #1: Type: text/plain, Size: 1586 bytes --]

On Mon, 2022-03-21 at 13:36 +0100, Paolo Bonzini wrote:
> On 3/21/22 01:51, Oliver Upton wrote:
> > > 
> > > Right now we have guestTSCOffset as a vCPU attribute, we have guestTOD
> > > returned by KVM_GET_CLOCK, and we plan to have hostTSCFreq in sysfs. It's a
> > > bit mix-and-match, but it's already a 5-tuple that the destination can use.
> > > What's missing is an ioctl on the destination side that relieves userspace
> > > from having to do the math.
> > That ioctl will work fine, but userspace needs to accept all the
> > nastiness that ensues. If it yanks the guest too hard into the future
> > it'll need to pick up the pieces when the guest kernel panics.
> 
> It can do that before issuing the ioctl, by comparing the source and 
> destination TOD, can't it?

That's how we do it. 

We currently take the *really* naïve approach of just letting each vCPU
thread serialize its own vCPU's TSC with the other MSRs, which means
they're slightly out of sync.

Then we calculate the TOD delta when resuming on the destination, and
add the corresponding number of nanoseconds to the KVM clock, and ticks
to each vCPU's MSR.

When each vCPU thread then *restores* its vCPU's MSRs, the "within a
second of the last sync" bit kicks in and they are all synchronized
properly.

There's definitely a bit of slop there; I don't think we even use the
KVM_CLOCK_REALTIME flag when setting the guest's KVM clock. But those
are fixable by having a single ioctl which handles the 3-tuple of
{ TOD, KVM clock, TSC } at a single instance in time, I think?
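
As a sketch of that sequence (all names hypothetical, and assuming the
guest TSC frequency is unchanged across the blackout), the adjustment
applied before the per-vCPU restore boils down to:

#include <stddef.h>
#include <stdint.h>

static void advance_saved_clocks(uint64_t src_realtime_ns,
				 uint64_t dst_realtime_ns,
				 uint64_t *kvmclock_ns,
				 uint64_t *vcpu_tsc, size_t nr_vcpus,
				 uint32_t guest_tsc_khz)
{
	uint64_t elapsed_ns = dst_realtime_ns - src_realtime_ns;
	/* Ticks at the guest frequency: kHz * ns / 1e6. */
	uint64_t elapsed_ticks =
		(uint64_t)(((__uint128_t)elapsed_ns * guest_tsc_khz) / 1000000u);
	size_t i;

	*kvmclock_ns += elapsed_ns;		/* fed to KVM_SET_CLOCK */
	for (i = 0; i < nr_vcpus; i++)
		vcpu_tsc[i] += elapsed_ticks;	/* fed to each vCPU's TSC restore */
}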




^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-21  0:51                       ` Oliver Upton
@ 2022-03-21 12:36                         ` Paolo Bonzini
  2022-03-21 12:56                           ` David Woodhouse
  0 siblings, 1 reply; 44+ messages in thread
From: Paolo Bonzini @ 2022-03-21 12:36 UTC (permalink / raw)
  To: Oliver Upton
  Cc: David Woodhouse, kvm, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel

On 3/21/22 01:51, Oliver Upton wrote:
>>
>> Right now we have guestTSCOffset as a vCPU attribute, we have guestTOD
>> returned by KVM_GET_CLOCK, and we plan to have hostTSCFreq in sysfs. It's a
>> bit mix-and-match, but it's already a 5-tuple that the destination can use.
> > What's missing is an ioctl on the destination side that relieves userspace
>> from having to do the math.
> That ioctl will work fine, but userspace needs to accept all the
> nastiness that ensues. If it yanks the guest too hard into the future
> it'll need to pick up the pieces when the guest kernel panics.

It can do that before issuing the ioctl, by comparing the source and 
destination TOD, can't it?

Paolo


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-20 13:39                     ` Paolo Bonzini
  2022-03-21  0:51                       ` Oliver Upton
@ 2022-03-21 12:16                       ` David Woodhouse
  2022-03-21 13:10                         ` Paolo Bonzini
  1 sibling, 1 reply; 44+ messages in thread
From: David Woodhouse @ 2022-03-21 12:16 UTC (permalink / raw)
  To: Paolo Bonzini, Oliver Upton
  Cc: kvm, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel

[-- Attachment #1: Type: text/plain, Size: 2402 bytes --]

On Sun, 2022-03-20 at 14:39 +0100, Paolo Bonzini wrote:
> On 3/20/22 09:52, Oliver Upton wrote:
> > What do you folks think about having a new R/O vCPU attribute that
> > returns a { TOD, guest_tsc } pair? I believe that would immediately
> > satisfy the needs of upstream to implement clock-advancing live
> > migration.
> 
> I don't think this adds much.  The missing link is on the destination 
> side, not the source side.
> 
> To recap, the data that needs to be migrated from source to destination 
> is the hostTSC+hostTOD pairing (returned by KVM_GET_CLOCK) plus one of 
> each of the following:
> 
> * either guestTSCOffset or a guestTSC synced with the hostTSC
> 
> * either guestTODOffset or a guestTOD synced with the hostTOD.
> 
> * either guestTSCScale or hostTSCFreq
> 
> Right now we have guestTSCOffset as a vCPU attribute, we have guestTOD 
> returned by KVM_GET_CLOCK, and we plan to have hostTSCFreq in sysfs. 
> It's a bit mix-and-match, but it's already a 5-tuple that the 
> destination can use.  What's missing is an ioctl on the destination side
> that relieves userspace from having to do the math.
> 
> Adding a nicer API just means choosing a different 5-tuple that the 
> source side sends over to the destination, but it isn't particularly 
> useful if userspace still has to do the math.

Hm, I still don't really understand why it's a 5-tuple and why we care
about the *host* TSC at all.

Fundamentally, guest state is *guest* state. There is a 3-tuple of
 { NTP-synchronized time of day, Guest KVM clock, Guest TSC value } 

(plus a TSC offset from vCPU0, for each other vCPU perhaps.)

All else is redundant, and including host values is just wrong.

We can already do clock-advancing live migration by adding the
appropriate number of nanoseconds, and the appropriate number of TSC
ticks, to the values being restored on the destination host based on
the time at which the migrated instance is being restored there, as
compared with the originally stored time-of-day.

I understand why KVM would want to use the *same* host TSC value for
calculating the time-of-day as well as the guest KVM clock and the
guest TSC value — so that all three elements of that 3-tuple reflect
precisely the same moment in time.
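
A sketch of that derivation (made-up bookkeeping fields, not the kernel's
real ones), with all three values hanging off one host TSC read:

#include <stdint.h>

struct host_timebase {
	uint64_t tsc_base;		/* host TSC at the last update */
	uint64_t realtime_base_ns;	/* CLOCK_REALTIME at tsc_base */
	uint64_t kvmclock_base_ns;	/* KVM clock at tsc_base */
	uint64_t host_tsc_hz;		/* host TSC frequency */
};

static void sample_triple(const struct host_timebase *tb,
			  uint64_t host_tsc_now, uint64_t guest_tsc_hz,
			  uint64_t guest_tsc_offset,
			  uint64_t *realtime_ns, uint64_t *kvmclock_ns,
			  uint64_t *guest_tsc)
{
	uint64_t delta = host_tsc_now - tb->tsc_base;
	uint64_t delta_ns =
		(uint64_t)(((__uint128_t)delta * 1000000000ull) / tb->host_tsc_hz);

	/* All three refer to the single host_tsc_now sample. */
	*realtime_ns = tb->realtime_base_ns + delta_ns;
	*kvmclock_ns = tb->kvmclock_base_ns + delta_ns;
	*guest_tsc   = (uint64_t)(((__uint128_t)host_tsc_now * guest_tsc_hz) /
				  tb->host_tsc_hz) + guest_tsc_offset;
}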

I don't understand why the actual *value* of the host TSC is something
that userspace needs to see.



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-20 13:39                     ` Paolo Bonzini
@ 2022-03-21  0:51                       ` Oliver Upton
  2022-03-21 12:36                         ` Paolo Bonzini
  2022-03-21 12:16                       ` David Woodhouse
  1 sibling, 1 reply; 44+ messages in thread
From: Oliver Upton @ 2022-03-21  0:51 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: David Woodhouse, kvm, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel

On Sun, Mar 20, 2022 at 02:39:34PM +0100, Paolo Bonzini wrote:
> On 3/20/22 09:52, Oliver Upton wrote:
> > What do you folks think about having a new R/O vCPU attribute that
> > returns a { TOD, guest_tsc } pair? I believe that would immediately
> > satisfy the needs of upstream to implement clock-advancing live
> > migration.
> 
> I don't think this adds much.  The missing link is on the destination side,
> not the source side.

I think it'll work:

 Source:
  - Pick a vCPU and save its { TOD, guest_TSC } pair
  - Save the tsc offset of every vCPU
  - Using all of the offsets, calculate the drift of all the follower
    vCPUs from the leader vCPU (from step 1)
  - Save the TSC frequency

 Destination:
  - Restore the TSC frequency
  - Read the { TOD, guest_TSC } pair for the first vCPU
  - Compare with the source value to work out delta_guest_TSC and
    delta_TOD
  - Apply delta_guest_TSC to all vCPUs in a VM
  - If you want to account for realtime, apply guest_tsc_freq *
    delta_TOD to every vCPU in the VM
  - Account for drift between leader/follower vCPUs

Userspace has some math to do, but IMO it needs to until we have a
better mechanism for helping the guest clean up a slow migration.
It does eliminate the need for doing TSC scaling in userspace, which
I think is the trickiest piece of it all.
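
As a sketch of the destination-side arithmetic above (illustrative names,
not an API), the per-vCPU target TSC values, i.e. what each vCPU's guest
TSC should read at the instant the destination pair was sampled, would be:

#include <stddef.h>
#include <stdint.h>

static uint64_t ns_to_guest_ticks(uint64_t ns, uint32_t guest_tsc_khz)
{
	return (uint64_t)(((__uint128_t)ns * guest_tsc_khz) / 1000000u);
}

static void dest_target_tscs(uint64_t src_tod_ns, uint64_t src_leader_tsc,
			     uint64_t dst_tod_ns,
			     const int64_t *drift_from_leader,
			     uint64_t *target_tsc, size_t nr_vcpus,
			     uint32_t guest_tsc_khz, int account_realtime)
{
	uint64_t elapsed_ticks = account_realtime ?
		ns_to_guest_ticks(dst_tod_ns - src_tod_ns, guest_tsc_khz) : 0;
	size_t i;

	for (i = 0; i < nr_vcpus; i++)
		target_tsc[i] = src_leader_tsc + elapsed_ticks +
				(uint64_t)drift_from_leader[i];
}

Each target value then still has to be turned into a per-vCPU offset
against the destination host TSC, which is the piece a kernel-side helper
could absorb.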

Alternative could be to say that the VM has a single, authoritative {
TOD, guest_TSC } clock that can be read or written. Any vCPU offsets
then account for guest-induced drift in TSCs.

> To recap, the data that needs to be migrated from source to destination is
> the hostTSC+hostTOD pairing (returned by KVM_GET_CLOCK) plus one of each of
> the following:
> 
> * either guestTSCOffset or a guestTSC synced with the hostTSC
> 
> * either guestTODOffset or a guestTOD synced with the hostTOD.
> 
> * either guestTSCScale or hostTSCFreq
> 
> Right now we have guestTSCOffset as a vCPU attribute, we have guestTOD
> returned by KVM_GET_CLOCK, and we plan to have hostTSCFreq in sysfs. It's a
> bit mix-and-match, but it's already a 5-tuple that the destination can use.
> What's missing is an ioctl on the destination side that relieves userspace
> from having to do the math.

That ioctl will work fine, but userspace needs to accept all the
nastiness that ensues. If it yanks the guest too hard into the future
it'll need to pick up the pieces when the guest kernel panics.

--
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-20  9:46                     ` David Woodhouse
@ 2022-03-21  0:38                       ` Oliver Upton
  2022-03-21 19:43                         ` David Woodhouse
  0 siblings, 1 reply; 44+ messages in thread
From: Oliver Upton @ 2022-03-21  0:38 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Paolo Bonzini, kvm, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel

On Sun, Mar 20, 2022 at 09:46:35AM -0000, David Woodhouse wrote:
> 
> > The offset interface completely punts the decision around guest clocks
> > to userspace. We (KVM) have absolutely no idea what userspace is about
> > to do with the guest. The guest could be paused for 5 seconds or 5
> > years. Encouraging host userspace to just read/write a { TOD, TSC } pair
> > and let KVM do the heavy lifting could completely wreck the guest's
> > monotonic clock.
> >
> > Additionally, it is impossible for userspace to enforce policy/limits on
> > how much to time travel a guest with a value-based interface. Any event
> > could sneak in between the time userspace checks the value and KVM sets
> > the L1 offset. Offsets are idempotent and will still uphold userspace's
> > intentions even if an inordinate amount of time elapses until KVM
> > processes it.
> 
> Thanks for the detailed explanation. One part which confuses me here...
> Why can't userspace impose those same limits using a (TOD, value) tuple?
> Userspace can still look at that TOD from before the brownout period
> started, and declare that is too far in the past.
> 
> If the event happens *after* userspace has decided that the migration was
> quick enough, but before the vCPUs are actually running again, even the
> offset based interface doesn't protect against that.

Right, if nastiness happens after the offset was calculated, the guest
is still going to feel it. The benefit is that we've at least stopped
the bleeding on the monotonic clock. Guests of KVM are probably happier
having a messed up realtime clock instead of a panicked kernel that
threw a fit over monotonic stepping too far.

> > Apologies for grandstanding, but clocks have been a real source of pain
> > during migration. I do agree that the documented algorithm is a mess at
> > the moment, given that there's no good way for userspace to transform
> > host_tsc -> guest_tsc. Poking the host TSC frequency out in sysfs is
> > nice to have, but probably not ABI to hang this whole thing off of.
> >
> > What do you folks think about having a new R/O vCPU attribute that
> > returns a { TOD, guest_tsc } pair? I believe that would immediately
> > satisfy the needs of upstream to implement clock-advancing live
> > migration.
> 
> Hm, I need to do some more thinking here. I poked at this because for TSC
> scaling even before we think about clock jumps it was just utterly hosed —
> userspace naively just creates a bunch of vCPUs and sets their TSC
> frequency + value, and they all end up with unsynced TSC values.

And thank you for addressing that. I think there is a broader
generalization that can be made about guest timekeeping that you've
started poking at with that patch set, which is that we should only
really care about these at a VM scope. There's nothing we can do
to cover up broken hardware, so if the host's TSCs are out of whack then
oh well.

To that end, we could have a single host<->guest offset that is used
across all vCPUs. Guest fiddling with TSC_ADJUST or even worse direct
writes to the TSC can then be treated as solely a part of guest state,
and get added on top of the host<->guest offset.
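
Something like the following split (entirely made-up names) is what I have
in mind:

#include <stdint.h>

struct vm_tsc_state {
	int64_t host_guest_offset;	/* one offset for the whole VM, host-owned */
};

struct vcpu_tsc_state {
	int64_t guest_adjust;		/* TSC_ADJUST / direct TSC writes, guest-owned */
};

static uint64_t guest_tsc(uint64_t scaled_host_tsc,
			  const struct vm_tsc_state *vm,
			  const struct vcpu_tsc_state *vcpu)
{
	return scaled_host_tsc + (uint64_t)vm->host_guest_offset +
	       (uint64_t)vcpu->guest_adjust;
}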

> But coincidentally since then I have started having conversations with
> people who really want the guest to have an immediate knowledge of the
> adjtimex maxerror etc. on the new host immediately after the migration.
> Maybe the "if the migration isn't fast enough then let the guest know it's
> now unsynced" is OK, but I'll need to work out what "immediately" means
> when we have a guest userspace component involved in it.

This has also been an area of interest to me. I think we've all seen the
many ways in which doing migrations behind the guest's back can put software
in an extremely undesirable state on the other end. If those
conversations are taking place on the mailing lists, could you please CC
me?

Our (Google) TSC adjustment clamping and userspace notification mechanism
was a halfway kludge to keep things happy on the other end. And it
generally has worked well, but misses a fundamental point.

The hypervisor should tell the guest kernel about time travel and let it
cascade that information throughout the guest system. Regardless of what
we do to the TSC, we invariably destroy one of the two guest clocks along
the way. If we told the guest "you time traveled X seconds", it could
fold that into its own idea of real time. Guest kernel can then fire off
events to inform software that wants to keep up with clock changes, and
even a new event to let NTP know it's probably running on different
hardware.

Time sucks :-)

--
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-20  8:52                   ` Oliver Upton
  2022-03-20  9:46                     ` David Woodhouse
@ 2022-03-20 13:39                     ` Paolo Bonzini
  2022-03-21  0:51                       ` Oliver Upton
  2022-03-21 12:16                       ` David Woodhouse
  1 sibling, 2 replies; 44+ messages in thread
From: Paolo Bonzini @ 2022-03-20 13:39 UTC (permalink / raw)
  To: Oliver Upton
  Cc: David Woodhouse, kvm, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel

On 3/20/22 09:52, Oliver Upton wrote:
> What do you folks think about having a new R/O vCPU attribute that
> returns a { TOD, guest_tsc } pair? I believe that would immediately
> satisfy the needs of upstream to implement clock-advancing live
> migration.

I don't think this adds much.  The missing link is on the destination 
side, not the source side.

To recap, the data that needs to be migrated from source to destination 
is the hostTSC+hostTOD pairing (returned by KVM_GET_CLOCK) plus one of 
each of the following:

* either guestTSCOffset or a guestTSC synced with the hostTSC

* either guestTODOffset or a guestTOD synced with the hostTOD.

* either guestTSCScale or hostTSCFreq

Right now we have guestTSCOffset as a vCPU attribute, we have guestTOD 
returned by KVM_GET_CLOCK, and we plan to have hostTSCFreq in sysfs. 
It's a bit mix-and-match, but it's already a 5-tuple that the 
destination can use.  What's missing is an ioctl on the destination side
that relieves userspace from having to do the math.

Adding a nicer API just means choosing a different 5-tuple that the 
source side sends over to the destination, but it isn't particularly 
useful if userspace still has to do the math.

Paolo


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-20  8:52                   ` Oliver Upton
@ 2022-03-20  9:46                     ` David Woodhouse
  2022-03-21  0:38                       ` Oliver Upton
  2022-03-20 13:39                     ` Paolo Bonzini
  1 sibling, 1 reply; 44+ messages in thread
From: David Woodhouse @ 2022-03-20  9:46 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Paolo Bonzini, David Woodhouse, kvm, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel


> The offset interface completely punts the decision around guest clocks
> to userspace. We (KVM) have absolutely no idea what userspace is about
> to do with the guest. The guest could be paused for 5 seconds or 5
> years. Encouraging host userspace to just read/write a { TOD, TSC } pair
> and let KVM do the heavy lifting could completely wreck the guest's
> monotonic clock.
>
> Additionally, it is impossible for userspace to enforce policy/limits on
> how much to time travel a guest with a value-based interface. Any event
> could sneak in between the time userspace checks the value and KVM sets
> the L1 offset. Offsets are idempotent and will still uphold userspace's
> intentions even if an inordinate amount of time elapses until KVM
> processes it.

Thanks for the detailed explanation. One part which confuses me here...
Why can't userspace impose those same limits using a (TOD, value) tuple?
Userspace can still look at that TOD from before the brownout period
started, and declare that is too far in the past.

If the event happens *after* userspace has decided that the migration was
quick enough, but before the vCPUs are actually running again, even the
offset based interface doesn't protect against that.


> Apologies for grandstanding, but clocks have been a real source of pain
> during migration. I do agree that the documented algorithm is a mess at
> the moment, given that there's no good way for userspace to transform
> host_tsc -> guest_tsc. Poking the host TSC frequency out in sysfs is
> nice to have, but probably not ABI to hang this whole thing off of.
>
> What do you folks think about having a new R/O vCPU attribute that
> returns a { TOD, guest_tsc } pair? I believe that would immediately
> satisfy the needs of upstream to implement clock-advancing live
> migration.

Hm, I need to do some more thinking here. I poked at this because for TSC
scaling even before we think about clock jumps it was just utterly hosed —
userspace naively just creates a bunch of vCPUs and sets their TSC
frequency + value, and they all end up with unsynced TSC values.

But coincidentally since then I have started having conversations with
people who really want the guest to have an immediate knowledge of the
adjtimex maxerror etc. on the new host immediately after the migration.
Maybe the "if the migration isn't fast enough then let the guest know it's
now unsynced" is OK, but I'll need to work out what "immediately" means
when we have a guest userspace component involved in it.

I'll need to do that with a real screen and keyboard though, and fingers
that aren't freezing as I sit by a 9-year-old's hockey training...

-- 
dwmw2


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-20  8:10                 ` Paolo Bonzini
@ 2022-03-20  8:52                   ` Oliver Upton
  2022-03-20  9:46                     ` David Woodhouse
  2022-03-20 13:39                     ` Paolo Bonzini
  0 siblings, 2 replies; 44+ messages in thread
From: Oliver Upton @ 2022-03-20  8:52 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: David Woodhouse, kvm, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel

On Sun, Mar 20, 2022 at 09:10:15AM +0100, Paolo Bonzini wrote:
> On 3/19/22 14:13, David Woodhouse wrote:
> > 
> > 
> > > On 3/19/22 12:54, Paolo Bonzini wrote:
> > > > On 3/19/22 09:08, David Woodhouse wrote:
> > > > > If a basic API requires this much documentation, my instinct is to
> > > > > *fix* it with fire first, then document what's left.
> > > > I agree, but you're missing all the improvements that went in together
> > > > with the offset API in order to enable the ugly algorithm.
> > > > 
> > > > > A userspace-friendly API for migration would be more like KVM on the
> > > > > source host giving me { TIME_OF_DAY, TSC } and then all I have to do on
> > > > > the destination host (after providing the TSC frequency) is give it
> > > > > precisely the same data.
> > > 
> > > I guess you meant {hostTimeOfDay, hostTSC} _plus_ the constant
> > > {guestTSCScale, guestTSCOffset, guestTimeOfDayOffset}.  That would work,
> > > and in that case it wouldn't even be KVM returning that host information.
> > 
> > I would have said nobody cares about the host TSC value and frequency.
> > That is for KVM to know and deal with internally.
> 
> There are two schools as to how to do migration.  The QEMU school is to just
> load back the guest TOD and TSC and let NTP resync.  They had better be
> synced, but a difference of a few microseconds might not matter.
> 
> This has the advantage of not showing the guest that there was a pause.
> QEMU is doing it this way due to not having postcopy live migration for a
> long time; precopy is subject to longer brownout between source and
> destination, which might result in soft lockups.  Apart from this it really
> has only disadvantages.
> 
> The Google school has the destination come up with the guest TOD and TSC
> that takes into account the length of the brownout phase.  This is where the
> algorithm in Documentation/ comes into play, and why you need the host pair
> as well.  Actually Google does not use it because they already have precise
> time available to userspace as part of Spanner.  Maybe so does Amazon (?),
> but for the rest of the world the host {TOD, TSC} pair is required to
> compute what the guest TSC "should look like" on the destination.

Hey, beat me to the punch :) Paolo is pretty much spot on, but there are
a few additional details here that I believe are relevant.

I really don't think we want to effectively step the guest's monotonic
clock if at all possible. It hurts when you do this for large windows,
and leads to soft lockups as you've noted above. Nonetheless, it's a
kludgy way to advance the guest's realtime clock without informing it
that it is about to experience time travel.

Given all of this, there is a limit to how much we advance the TSC in
the Google school. If this limit is exceeded we refuse to step the TSC
further and inform the guest it has experienced time travel [1]. It is
an attempt to bridge the gap and avoid completely laying waste to guest
clocks while hiding the migration if we're confident it was smooth
enough. Beyond that, guest userspace wants to be apprised of time
travel as well (TFD_TIMER_CANCEL_ON_SET). Having the guest clean up a
messy migration ensures that this all 'just works'.

The offset interface completely punts the decision around guest clocks
to userspace. We (KVM) have absolutely no idea what userspace is about
to do with the guest. The guest could be paused for 5 seconds or 5
years. Encouraging host userspace to just read/write a { TOD, TSC } pair
and let KVM do the heavy lifting could completely wreck the guest's
monotonic clock.

Additionally, it is impossible for userspace to enforce policy/limits on
how much to time travel a guest with a value-based interface. Any event
could sneak in between the time userspace checks the value and KVM sets
the L1 offset. Offsets are idempotent and will still uphold userspace's
intentions even if an inordinate amount of time elapses until KVM
processes it.
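
A toy way to see the difference (just the arithmetic, not a KVM
interface):

#include <stdint.h>

static uint64_t apply_value(uint64_t stale_desired_guest_tsc)
{
	/* Whenever this finally runs, the guest TSC is forced back to a
	 * value computed earlier: the error equals the processing delay. */
	return stale_desired_guest_tsc;
}

static uint64_t apply_offset(uint64_t host_tsc_when_applied, int64_t offset)
{
	/* guest = host + offset holds no matter how late this runs, because
	 * the host TSC kept ticking through the delay. */
	return host_tsc_when_applied + (uint64_t)offset;
}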

Apologies for grandstanding, but clocks have been a real source of pain
during migration. I do agree that the documented algorithm is a mess at
the moment, given that there's no good way for userspace to transform
host_tsc -> guest_tsc. Poking the host TSC frequency out in sysfs is
nice to have, but probably not ABI to hang this whole thing off of.

What do you folks think about having a new R/O vCPU attribute that
returns a { TOD, guest_tsc } pair? I believe that would immediately
satisfy the needs of upstream to implement clock-advancing live
migration.

[1]: https://github.com/GoogleCloudPlatform/guest-agent/blob/main/google_guest_agent/clock.go
--
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-19 13:13               ` David Woodhouse
@ 2022-03-20  8:10                 ` Paolo Bonzini
  2022-03-20  8:52                   ` Oliver Upton
  0 siblings, 1 reply; 44+ messages in thread
From: Paolo Bonzini @ 2022-03-20  8:10 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Oliver Upton, kvm, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel

On 3/19/22 14:13, David Woodhouse wrote:
> 
> 
>> On 3/19/22 12:54, Paolo Bonzini wrote:
>>> On 3/19/22 09:08, David Woodhouse wrote:
>>>> If a basic API requires this much documentation, my instinct is to
>>>> *fix* it with fire first, then document what's left.
>>> I agree, but you're missing all the improvements that went in together
>>> with the offset API in order to enable the ugly algorithm.
>>>
>>>> A userspace-friendly API for migration would be more like KVM on the
>>>> source host giving me { TIME_OF_DAY, TSC } and then all I have to do on
>>>> the destination host (after providing the TSC frequency) is give it
>>>> precisely the same data.
>>
>> I guess you meant {hostTimeOfDay, hostTSC} _plus_ the constant
>> {guestTSCScale, guestTSCOffset, guestTimeOfDayOffset}.  That would work,
>> and in that case it wouldn't even be KVM returning that host information.
> 
> I would have said nobody cares about the host TSC value and frequency.
> That is for KVM to know and deal with internally.

There are two schools as to how to do migration.  The QEMU school is to 
just load back the guest TOD and TSC and let NTP resync.  They had 
better be synced, but a difference of a few microseconds might not matter.

This has the advantage of not showing the guest that there was a pause. 
  QEMU is doing it this way due to not having postcopy live migration 
for a long time; precopy is subject to longer brownout between source 
and destination, which might result in soft lockups.  Apart from this it 
really has only disadvantages.

The Google school has the destination come up with the guest TOD and TSC 
that takes into account the length of the brownout phase.  This is where 
the algorithm in Documentation/ comes into play, and why you need the 
host pair as well.  Actually Google does not use it because they already 
have precise time available to userspace as part of Spanner.  Maybe so 
does Amazon (?), but for the rest of the world the host {TOD, TSC} pair 
is required to compute what the guest TSC "should look like" on the 
destination.

Paolo

> For guest migration it should be as simple as "guest TSC frequency is <F>,
> and the TSC value was <X> at (wallclock time <T>|KVM_CLOCK time <T>)".
> 
> Not sure I have an opinion on whether the objective time reference is the
> timeofday clock or the KVM clock.
> 
> 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-19 13:00             ` Paolo Bonzini
@ 2022-03-19 13:13               ` David Woodhouse
  2022-03-20  8:10                 ` Paolo Bonzini
  0 siblings, 1 reply; 44+ messages in thread
From: David Woodhouse @ 2022-03-19 13:13 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: David Woodhouse, Oliver Upton, kvm, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel



> On 3/19/22 12:54, Paolo Bonzini wrote:
>> On 3/19/22 09:08, David Woodhouse wrote:
>>> If a basic API requires this much documentation, my instinct is to
>>> *fix* it with fire first, then document what's left.
>> I agree, but you're missing all the improvements that went in together
>> with the offset API in order to enable the ugly algorithm.
>>
>>> A userspace-friendly API for migration would be more like KVM on the
>>> source host giving me { TIME_OF_DAY, TSC } and then all I have to do on
>>> the destination host (after providing the TSC frequency) is give it
>>> precisely the same data.
>>
>> Again I agree but it would have to be {hostTimeOfDay, hostTSC,
>> hostTSCFrequency, guestTimeOfDay}.
>
> Ah, I guess you meant {hostTimeOfDay, hostTSC} _plus_ the constant
> {guestTSCScale, guestTSCOffset, guestTimeOfDayOffset}.  That would work,
> and in that case it wouldn't even be KVM returning that host information.
>
> In fact {hostTimeOfDay, hostTSC, hostTSCFrequency, guestTimeOfDay} is
> not enough, you also need the guestTSCFrequency and guestTSC (or
> equivalently the scale/offset pair).

I would have said nobody cares about the host TSC value and frequency.
That is for KVM to know and deal with internally.

For guest migration it should be as simple as "guest TSC frequency is <F>,
and the TSC value was <X> at (wallclock time <T>|KVM_CLOCK time <T>)".

Not sure I have an opinion on whether the objective time reference is the
timeofday clock or the KVM clock.


-- 
dwmw2


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-19 11:54           ` Paolo Bonzini
@ 2022-03-19 13:00             ` Paolo Bonzini
  2022-03-19 13:13               ` David Woodhouse
  0 siblings, 1 reply; 44+ messages in thread
From: Paolo Bonzini @ 2022-03-19 13:00 UTC (permalink / raw)
  To: David Woodhouse, Oliver Upton
  Cc: kvm, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel

On 3/19/22 12:54, Paolo Bonzini wrote:
> On 3/19/22 09:08, David Woodhouse wrote:
>> If a basic API requires this much documentation, my instinct is to
>> *fix* it with fire first, then document what's left.
> I agree, but you're missing all the improvements that went in together 
> with the offset API in order to enable the ugly algorithm.
> 
>> A userspace-friendly API for migration would be more like KVM on the
>> source host giving me { TIME_OF_DAY, TSC } and then all I have to do on
>> the destination host (after providing the TSC frequency) is give it
>> precisely the same data.
> 
> Again I agree but it would have to be {hostTimeOfDay, hostTSC, 
> hostTSCFrequency, guestTimeOfDay}.

Ah, I guess you meant {hostTimeOfDay, hostTSC} _plus_ the constant 
{guestTSCScale, guestTSCOffset, guestTimeOfDayOffset}.  That would work, 
and in that case it wouldn't even be KVM returning that host information.

In fact {hostTimeOfDay, hostTSC, hostTSCFrequency, guestTimeOfDay} is 
not enough, you also need the guestTSCFrequency and guestTSC (or 
equivalently the scale/offset pair).

Paolo


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-19  8:08         ` David Woodhouse
@ 2022-03-19 11:54           ` Paolo Bonzini
  2022-03-19 13:00             ` Paolo Bonzini
  0 siblings, 1 reply; 44+ messages in thread
From: Paolo Bonzini @ 2022-03-19 11:54 UTC (permalink / raw)
  To: David Woodhouse, Oliver Upton
  Cc: kvm, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel

On 3/19/22 09:08, David Woodhouse wrote:
> If a basic API requires this much documentation, my instinct is to
> *fix* it with fire first, then document what's left.
I agree, but you're missing all the improvements that went in together 
with the offset API in order to enable the ugly algorithm.

> A userspace-friendly API for migration would be more like KVM on the
> source host giving me { TIME_OF_DAY, TSC } and then all I have to do on
> the destination host (after providing the TSC frequency) is give it
> precisely the same data.

Again I agree but it would have to be {hostTimeOfDay, hostTSC, 
hostTSCFrequency, guestTimeOfDay}.

> For migration between hosts, the offset-based API doesn't seem to be
> much of an improvement over what we had before.

The improvement is that KVM now provides a consistent {hostTimeOfDay, 
hostTSC, guestTimeOfDay} triple, so the "get" part is already okay.  I 
agree that the "set" leaves a substantial amount of work to the guest, 
because *using* that information is left to userspace rather than being 
done in KVM.  But at least it is correct.

So we only lack a way to retrieve hostTSCFrequency to have an ugly but 
fully correct source-side API.  Then we can add a new destination-side 
ioctl (possibly an extension of KVM_SET_CLOCK?) that takes a 
{sourceTimeOfDay, sourceTSC, sourceTSCFrequency, guestTimeOfDay} tuple 
and relieves userspace from having to do the math.
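
Purely as a strawman (nothing like this exists in the uAPI today), the
tuple such an ioctl would take might look like:

#include <stdint.h>

struct kvm_set_clock_from_source {	/* hypothetical */
	uint64_t source_realtime_ns;	/* sourceTimeOfDay (CLOCK_REALTIME) */
	uint64_t source_host_tsc;	/* sourceTSC */
	uint64_t source_host_tsc_hz;	/* sourceTSCFrequency */
	uint64_t source_kvmclock_ns;	/* guestTimeOfDay (KVM clock) */
};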

Paolo


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-19  7:59       ` Oliver Upton
@ 2022-03-19  8:08         ` David Woodhouse
  2022-03-19 11:54           ` Paolo Bonzini
  0 siblings, 1 reply; 44+ messages in thread
From: David Woodhouse @ 2022-03-19  8:08 UTC (permalink / raw)
  To: Oliver Upton, Paolo Bonzini
  Cc: kvm, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel

[-- Attachment #1: Type: text/plain, Size: 945 bytes --]

On Sat, 2022-03-19 at 07:59 +0000, Oliver Upton wrote:
> I'll have some time to look at it next week and send upstream. I was
> somewhat wondering if there were any other ideas here, but sysfs does seem
> sensible :)

I can't say this fills me with joy. If a basic API requires this much
documentation, my instinct is to *fix* it with fire first, then
document what's left.

A userspace-friendly API for migration would be more like KVM on the
source host giving me { TIME_OF_DAY, TSC } and then all I have to do on
the destination host (after providing the TSC frequency) is give it
precisely the same data.

We depend on source and destination clocks being kept in sync with NTP
so that the TIME_OF_DAY is meaningful, but that's fairly fundamental;
that was *always* going to be our reference¹, right?

For migration between hosts, the offset-based API doesn't seem to be
much of an improvement over what we had before.




^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-19  7:52     ` Paolo Bonzini
@ 2022-03-19  7:59       ` Oliver Upton
  2022-03-19  8:08         ` David Woodhouse
  0 siblings, 1 reply; 44+ messages in thread
From: Oliver Upton @ 2022-03-19  7:59 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, David Woodhouse, David Woodhouse

On Sat, Mar 19, 2022 at 08:52:56AM +0100, Paolo Bonzini wrote:
> On 3/18/22 19:39, Oliver Upton wrote:
> > Hi Paolo,
> > 
> > On Wed, Mar 16, 2022 at 08:47:59AM +0100, Paolo Bonzini wrote:
> > > On 3/16/22 05:53, Oliver Upton wrote:
> > > > The VMM has control of both the guest's TSC scale and offset. Extend the
> > > > described migration algorithm in the KVM_VCPU_TSC_OFFSET documentation
> > > > to cover TSC scaling.
> > > > 
> > > > Reported-by: David Woodhouse<dwmw@amazon.co.uk>
> > > > Signed-off-by: Oliver Upton<oupton@google.com>
> > > > ---
> > > > 
> > > > Applies to kvm/queue (references KVM_{GET,SET}_TSC_KHZ on a VM fd).
> > > 
> > > A few more things that have to be changed:
> > > 
> > > > 1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (tsc_src),
> > > >     kvmclock nanoseconds (guest_src), and host CLOCK_REALTIME nanoseconds
> > > >     (host_src).
> > > > 
> > > 
> > > One of two changes:
> > > 
> > > a) Add "Multiply tsc_src by guest_freq / src_freq to obtain scaled_tsc_src",
> > > add a new device attribute for the host TSC frequency.
> > > 
> > > b) Add "Multiply tsc_src by src_ratio to obtain scaled_tsc_src", add a new
> > > device attribute for the guest_frequency/host_frequency ratio.
> > > 
> > > A third would be scaling the host TSC frequency in KVM_GETCLOCK, but that's
> > > confusing IMO.
> > 
> > Agreed -- I think kvmclock should remain as is.
> > 
> > A fourth would be to expose the host's TSC frequency outside of KVM
> > since we're really in the business of guest features, not host ones :-)
> > We already have a patch that does this internally, and it's visible in
> > some of our open source userspace libraries [1].
> 
> Yeah, it was a bit of a cop out on my part but adding it to sysfs would be
> nice.  Please tell me if any of you are going to send a patch, or even just
> share it so that I can handle the upstream submission.

I'll have some time to look at it next week and send upstream. I was
somewhat wondering if there were any other ideas here, but sysfs does seem
sensible :)

--
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-18 18:39   ` Oliver Upton
@ 2022-03-19  7:52     ` Paolo Bonzini
  2022-03-19  7:59       ` Oliver Upton
  0 siblings, 1 reply; 44+ messages in thread
From: Paolo Bonzini @ 2022-03-19  7:52 UTC (permalink / raw)
  To: Oliver Upton
  Cc: kvm, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, David Woodhouse, David Woodhouse

On 3/18/22 19:39, Oliver Upton wrote:
> Hi Paolo,
> 
> On Wed, Mar 16, 2022 at 08:47:59AM +0100, Paolo Bonzini wrote:
>> On 3/16/22 05:53, Oliver Upton wrote:
>>> The VMM has control of both the guest's TSC scale and offset. Extend the
>>> described migration algorithm in the KVM_VCPU_TSC_OFFSET documentation
>>> to cover TSC scaling.
>>>
>>> Reported-by: David Woodhouse<dwmw@amazon.co.uk>
>>> Signed-off-by: Oliver Upton<oupton@google.com>
>>> ---
>>>
>>> Applies to kvm/queue (references KVM_{GET,SET}_TSC_KHZ on a VM fd).
>>
>> A few more things that have to be changed:
>>
>>> 1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (tsc_src),
>>>     kvmclock nanoseconds (guest_src), and host CLOCK_REALTIME nanoseconds
>>>     (host_src).
>>>
>>
>> One of two changes:
>>
>> a) Add "Multiply tsc_src by guest_freq / src_freq to obtain scaled_tsc_src",
>> add a new device attribute for the host TSC frequency.
>>
>> b) Add "Multiply tsc_src by src_ratio to obtain scaled_tsc_src", add a new
>> device attribute for the guest_frequency/host_frequency ratio.
>>
>> A third would be scaling the host TSC frequency in KVM_GETCLOCK, but that's
>> confusing IMO.
> 
> Agreed -- I think kvmclock should remain as is.
> 
> A fourth would be to expose the host's TSC frequency outside of KVM
> since we're really in the business of guest features, not host ones :-)
> We already have a patch that does this internally, and it's visible in
> some of our open source userspace libraries [1].

Yeah, it was a bit of a cop out on my part but adding it to sysfs would 
be nice.  Please tell me if any of you are going to send a patch, or even 
just share it so that I can handle the upstream submission.

Paolo

> 
> That said, I do not have any immediate reservations about adding an
> attribute as the easy way out.
> 
> Ack on the rest of the feedback, I'll touch up the documentation further
> once we figure out TSC frequency exposure.
> 
> [1]: https://github.com/abseil/abseil-cpp/blob/c33f21f86a14216336b87d48e9b552a13985b2bc/absl/base/internal/sysinfo.cc#L311
> 
> --
> Thanks,
> Oliver
> 


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-16  7:47 ` Paolo Bonzini
@ 2022-03-18 18:39   ` Oliver Upton
  2022-03-19  7:52     ` Paolo Bonzini
  0 siblings, 1 reply; 44+ messages in thread
From: Oliver Upton @ 2022-03-18 18:39 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, David Woodhouse, David Woodhouse

Hi Paolo,

On Wed, Mar 16, 2022 at 08:47:59AM +0100, Paolo Bonzini wrote:
> On 3/16/22 05:53, Oliver Upton wrote:
> > The VMM has control of both the guest's TSC scale and offset. Extend the
> > described migration algorithm in the KVM_VCPU_TSC_OFFSET documentation
> > to cover TSC scaling.
> > 
> > Reported-by: David Woodhouse<dwmw@amazon.co.uk>
> > Signed-off-by: Oliver Upton<oupton@google.com>
> > ---
> > 
> > Applies to kvm/queue (references KVM_{GET,SET}_TSC_KHZ on a VM fd).
> 
> A few more things that have to be changed:
> 
> > 1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (tsc_src),
> >    kvmclock nanoseconds (guest_src), and host CLOCK_REALTIME nanoseconds
> >    (host_src).
> > 
> 
> One of two changes:
> 
> a) Add "Multiply tsc_src by guest_freq / src_freq to obtain scaled_tsc_src",
> add a new device attribute for the host TSC frequency.
> 
> b) Add "Multiply tsc_src by src_ratio to obtain scaled_tsc_src", add a new
> device attribute for the guest_frequency/host_frequency ratio.
> 
> A third would be scaling the host TSC frequency in KVM_GETCLOCK, but that's
> confusing IMO.

Agreed -- I think kvmclock should remain as is.

A fourth would be to expose the host's TSC frequency outside of KVM
since we're really in the business of guest features, not host ones :-)
We already have a patch that does this internally, and it's visible in
some of our open source userspace libraries [1].

That said, I do not have any immediate reservations about adding an
attribute as the easy way out.

Ack on the rest of the feedback, I'll touch up the documentation further
once we figure out TSC frequency exposure.

[1]: https://github.com/abseil/abseil-cpp/blob/c33f21f86a14216336b87d48e9b552a13985b2bc/absl/base/internal/sysinfo.cc#L311

--
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-16  4:53 [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm Oliver Upton
@ 2022-03-16  7:47 ` Paolo Bonzini
  2022-03-18 18:39   ` Oliver Upton
  0 siblings, 1 reply; 44+ messages in thread
From: Paolo Bonzini @ 2022-03-16  7:47 UTC (permalink / raw)
  To: Oliver Upton, kvm
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, David Woodhouse, David Woodhouse

On 3/16/22 05:53, Oliver Upton wrote:
> The VMM has control of both the guest's TSC scale and offset. Extend the
> described migration algorithm in the KVM_VCPU_TSC_OFFSET documentation
> to cover TSC scaling.
> 
> Reported-by: David Woodhouse<dwmw@amazon.co.uk>
> Signed-off-by: Oliver Upton<oupton@google.com>
> ---
> 
> Applies to kvm/queue (references KVM_{GET,SET}_TSC_KHZ on a VM fd).

A few more things that have to be changed:

> 1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (tsc_src),
>    kvmclock nanoseconds (guest_src), and host CLOCK_REALTIME nanoseconds
>    (host_src).
> 

One of two changes:

a) Add "Multiply tsc_src by guest_freq / src_freq to obtain 
scaled_tsc_src", add a new device attribute for the host TSC frequency.

b) Add "Multiply tsc_src by src_ratio to obtain scaled_tsc_src", add a 
new device attribute for the guest_frequency/host_frequency ratio.

A third would be scaling the host TSC frequency in KVM_GET_CLOCK, but 
that's confusing IMO.

> 3. Invoke the KVM_GET_TSC_KHZ ioctl to record the frequency of the
>    guest's TSC (freq).

Replace freq with guest_freq.

> 6. Adjust the guest TSC offsets for every vCPU to account for (1) time
>    elapsed since recording state and (2) difference in TSCs between the
>    source and destination machine:
> 
>    ofs_dst[i] = ofs_src[i] -
>      (guest_src - guest_dest) * freq +
>      (tsc_src - tsc_dest)
> 

Replace freq with guest_freq.

Replace tsc_src with scaled_tsc_src; replace tsc_dest with tsc_dest * 
guest_freq / dest_freq.

>    ("ofs[i] + tsc - guest * freq" is the guest TSC value corresponding to

Replace with "ofs[i] + tsc * guest_freq / host_freq - guest * 
guest_freq", or "ofs[i] + tsc * ratio - guest * guest_freq".

Paolo



* [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
@ 2022-03-16  4:53 Oliver Upton
  2022-03-16  7:47 ` Paolo Bonzini
  0 siblings, 1 reply; 44+ messages in thread
From: Oliver Upton @ 2022-03-16  4:53 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, David Woodhouse, Oliver Upton,
	David Woodhouse

The VMM has control of both the guest's TSC scale and offset. Extend the
described migration algorithm in the KVM_VCPU_TSC_OFFSET documentation
to cover TSC scaling.

Reported-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Oliver Upton <oupton@google.com>
---

Applies to kvm/queue (references KVM_{GET,SET}_TSC_KHZ on a VM fd).

Parent commit: 2ca1ba339ed8 ("KVM: x86: Test case for TSC scaling and offset sync")

 Documentation/virt/kvm/devices/vcpu.rst | 26 ++++++++++++++++---------
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
index 60a29972d3f1..85199e9b7f8d 100644
--- a/Documentation/virt/kvm/devices/vcpu.rst
+++ b/Documentation/virt/kvm/devices/vcpu.rst
@@ -180,13 +180,18 @@ Returns:
 	 ======= ======================================
 
 Specifies the guest's TSC offset relative to the host's TSC. The guest's
-TSC is then derived by the following equation:
+TSC is then derived by the following equation where tsc_scale is the
+ratio between the VM and host TSC frequencies:
 
-  guest_tsc = host_tsc + KVM_VCPU_TSC_OFFSET
+  guest_tsc = (host_tsc * tsc_scale) + KVM_VCPU_TSC_OFFSET
 
-This attribute is useful to adjust the guest's TSC on live migration,
-so that the TSC counts the time during which the VM was paused. The
-following describes a possible algorithm to use for this purpose.
+The VM's TSC frequency is configured with the KVM_{GET,SET}_TSC_KHZ
+vCPU or VM ioctls.
+
+The KVM_VCPU_TSC_OFFSET attribute is useful to adjust the guest's TSC
+on live migration, so that the TSC counts the time during which the VM
+was paused. The following describes a possible algorithm to use for this
+purpose.
 
 From the source VMM process:
 
@@ -202,7 +207,10 @@ From the source VMM process:
 
 From the destination VMM process:
 
-4. Invoke the KVM_SET_CLOCK ioctl, providing the source nanoseconds from
+4. Invoke the KVM_SET_TSC_KHZ ioctl to set the guest TSC frequency to
+   the value recorded in step 3 (freq).
+
+5. Invoke the KVM_SET_CLOCK ioctl, providing the source nanoseconds from
    kvmclock (guest_src) and CLOCK_REALTIME (host_src) in their respective
    fields.  Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
    structure.
@@ -214,10 +222,10 @@ From the destination VMM process:
    between the source pausing the VMs and the destination executing
    steps 4-7.
 
-5. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (tsc_dest) and
+6. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (tsc_dest) and
    kvmclock nanoseconds (guest_dest).
 
-6. Adjust the guest TSC offsets for every vCPU to account for (1) time
+7. Adjust the guest TSC offsets for every vCPU to account for (1) time
    elapsed since recording state and (2) difference in TSCs between the
    source and destination machine:
 
@@ -229,5 +237,5 @@ From the destination VMM process:
    a time of 0 in kvmclock.  The above formula ensures that it is the
    same on the destination as it was on the source).
 
-7. Write the KVM_VCPU_TSC_OFFSET attribute for every vCPU with the
+8. Write the KVM_VCPU_TSC_OFFSET attribute for every vCPU with the
    respective value derived in the previous step.
-- 
2.35.1.723.g4982287a31-goog
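
[Editorial aside, not part of the patch: a rough sketch of how the
destination-side steps map onto the ioctls involved. Error handling, the
KVM_GET_CLOCK readback and the offset arithmetic of steps 6-7 are omitted,
and a <linux/kvm.h> new enough for KVM_CLOCK_REALTIME and the
KVM_VCPU_TSC_CTRL attribute group is assumed.]

  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /* Destination-side sketch for one vCPU: steps 4, 5 and 8 of the
   * algorithm above.  ofs_dst is assumed to have been computed already
   * from the KVM_GET_CLOCK readback (steps 6-7). */
  static void restore_guest_tsc(int vm_fd, int vcpu_fd,
                                unsigned long guest_khz,
                                uint64_t guest_src_ns,
                                uint64_t realtime_src_ns,
                                int64_t ofs_dst)
  {
          struct kvm_clock_data clock = { 0 };
          struct kvm_device_attr attr = { 0 };

          /* Step 4: guest TSC frequency recorded on the source. */
          ioctl(vcpu_fd, KVM_SET_TSC_KHZ, guest_khz);

          /* Step 5: restore kvmclock; KVM_CLOCK_REALTIME lets KVM advance
           * it by the CLOCK_REALTIME delta since the source snapshot. */
          clock.clock    = guest_src_ns;
          clock.realtime = realtime_src_ns;
          clock.flags    = KVM_CLOCK_REALTIME;
          ioctl(vm_fd, KVM_SET_CLOCK, &clock);

          /* Step 8: write the adjusted per-vCPU offset. */
          attr.group = KVM_VCPU_TSC_CTRL;
          attr.attr  = KVM_VCPU_TSC_OFFSET;
          attr.addr  = (uint64_t)(uintptr_t)&ofs_dst;
          ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr);
  }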



Thread overview: 44+ messages
2022-03-22 19:18 [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm Franke, Daniel
2022-03-22 21:53 ` Oliver Upton
2022-03-23 12:35   ` David Woodhouse
2022-03-23 16:21     ` Oliver Upton
2022-03-25  9:03       ` David Woodhouse
2022-03-25 17:47         ` Oliver Upton
2022-03-29 14:19   ` Thomas Gleixner
2022-03-29 16:02     ` Oliver Upton
2022-03-29 19:34       ` Thomas Gleixner
2022-06-30 11:58       ` David Woodhouse
2022-07-05 14:43         ` David Woodhouse
2022-07-07 16:43           ` [PATCH 1/2] KVM: x86: add KVM clock time reference arg to kvm_write_tsc() Simon Veith
2022-07-07 16:43             ` [PATCH 2/2] KVM: x86: add KVM_VCPU_TSC_VALUE attribute Simon Veith
2022-07-29 21:21               ` Sean Christopherson
2022-07-29 21:14             ` [PATCH 1/2] KVM: x86: add KVM clock time reference arg to kvm_write_tsc() Sean Christopherson
2023-02-02 16:35               ` David Woodhouse
2023-02-02 16:59                 ` [PATCH v2] KVM: x86: add KVM_VCPU_TSC_VALUE attribute Simon Veith
2023-03-15 19:57                   ` Sean Christopherson
2023-03-23 23:26                     ` David Woodhouse
2023-03-24 11:22                     ` Paolo Bonzini
2023-03-24 13:08                       ` David Woodhouse
2023-09-13 14:08                       ` David Woodhouse
  -- strict thread matches above, loose matches on Subject: below --
2022-03-16  4:53 [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm Oliver Upton
2022-03-16  7:47 ` Paolo Bonzini
2022-03-18 18:39   ` Oliver Upton
2022-03-19  7:52     ` Paolo Bonzini
2022-03-19  7:59       ` Oliver Upton
2022-03-19  8:08         ` David Woodhouse
2022-03-19 11:54           ` Paolo Bonzini
2022-03-19 13:00             ` Paolo Bonzini
2022-03-19 13:13               ` David Woodhouse
2022-03-20  8:10                 ` Paolo Bonzini
2022-03-20  8:52                   ` Oliver Upton
2022-03-20  9:46                     ` David Woodhouse
2022-03-21  0:38                       ` Oliver Upton
2022-03-21 19:43                         ` David Woodhouse
2022-03-21 21:23                           ` Oliver Upton
2022-03-20 13:39                     ` Paolo Bonzini
2022-03-21  0:51                       ` Oliver Upton
2022-03-21 12:36                         ` Paolo Bonzini
2022-03-21 12:56                           ` David Woodhouse
2022-03-21 12:16                       ` David Woodhouse
2022-03-21 13:10                         ` Paolo Bonzini
2022-03-21 14:59                           ` David Woodhouse
