* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
@ 2022-03-22 19:18 Franke, Daniel
  2022-03-22 21:53 ` Oliver Upton
  0 siblings, 1 reply; 22+ messages in thread
From: Franke, Daniel @ 2022-03-22 19:18 UTC (permalink / raw)
  To: Oliver Upton, David Woodhouse
  Cc: Paolo Bonzini, kvm, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel

On 3/21/22, 5:24 PM, "Oliver Upton" <oupton@google.com> wrote:
 >  Right, but I'd argue that interface has some problems too. It
 >   depends on the guest polling instead of an interrupt from the
 >   hypervisor. It also has no way of informing the kernel exactly how much
 >   time has elapsed.

 >   The whole point of all these hacks that we've done internally is that we,
 >   the hypervisor, know full well how much real time has advanced during the
 >   VM blackout. If we can at least let the guest know how much to fudge real
 >   time, it can then poke NTP for better refinement. I worry about using NTP
 >   as the sole source of truth for such a mechanism, since you'll need to go
 >   out to the network and any reads until the response comes back are hosed.

(I'm a kernel newbie, so please excuse any ignorance with respect to kernel
internals or kernel/hypervisor interfaces.)

We can have it both ways, I think. Let the hypervisor manipulate the guest TSC
so as to keep the guest kernel's idea of real time as accurate as possible 
without any awareness required on the guest's side. *Also* give the guest kernel
a notification in the form of a KVM_PVCLOCK_STOPPED event or whatever else,
and let the kernel propagate this notification to userspace so that the NTP
daemon can recombobulate itself as quickly as possible, treating whatever TSC
adjustment was received as best-effort only.

The KVM_PVCLOCK_STOPPED event should trigger a change in some of the
globals kept by kernel/time/ntp.c (which are visible to userspace through
adjtimex(2)). In particular, `time_esterror` and `time_maxerror` should get reset
to `NTP_PHASE_LIMIT` and `time_status` should get set to `STA_UNSYNC`.

The aforementioned fields get overwritten by the NTP daemon every polling
interval, and whatever notification gets sent to userspace is going to be
asynchronous, so we need to avoid race conditions wherein userspace clobbers
them with its outdated idea of their correct value before the notification arrives.
I propose allocating one of the unused fields at the end of `struct timex` as a
`clockver` field. This field will be 0 at boot time and get incremented after any
clock discontinuity, such as a KVM blackout, an ACPI suspend-to-RAM event,
or a manual setting of the clock through clock_settime(3) or similar. Augment
`modes` with an `ADJ_CLOCKVER` bit, with the semantics "make this call fail
if the `clockver` I passed in is not current".

As for the form of the notification to userspace... dunno, I think a netlink
interface makes the most sense? I'm not too opinionated here. Whatever we
decide, I'll be happy to contribute the patches to chrony and ntpd. The NTP
daemon should handle this notification by essentially resetting its state to
how it was at first startup, considering itself unsynchronized and immediately
kicking off new queries to all its peers.

One other thing the NTP daemon might want to do with this notification is
clobber its drift file, but only if it's running on different hardware than before.
But this requires getting that bit of information from the hypervisor to the
kernel and I haven't thought about how this should work.

There's a minor problem remaining here, which impacts NTP servers but not
clients. NTP servers don't use adjtimex to get the timestamps they use to
populate NTP packets; they just use clock_gettime, because the latter is a
VDSO call and the former is not. That means that they may not learn of the
`clockver` step in time to answer any queries that come in immediately
post-blackout. So, the server may briefly become a falseticker if the
hypervisor did too poor a job with the TSC step. I don't think this is a big
deal or something that needs to be addressed in the first iteration. The
long-term solution is to make adjtimex a vDSO call when `modes` is
0, so that NTP daemons can use it for everything.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-22 19:18 [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm Franke, Daniel
@ 2022-03-22 21:53 ` Oliver Upton
  2022-03-23 12:35   ` David Woodhouse
  2022-03-29 14:19   ` Thomas Gleixner
  0 siblings, 2 replies; 22+ messages in thread
From: Oliver Upton @ 2022-03-22 21:53 UTC (permalink / raw)
  To: Franke, Daniel
  Cc: David Woodhouse, Paolo Bonzini, kvm, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner

Hi Daniel,

As we're venturing into the realm of timekeeping and he has noted
the evils of virtualization several times, I've CC'ed Thomas on this
mail. Some comments inline on the ongoing discussion.

On Tue, Mar 22, 2022 at 07:18:20PM +0000, Franke, Daniel wrote:
> On 3/21/22, 5:24 PM, "Oliver Upton" <oupton@google.com> wrote:
>  >  Right, but I'd argue that interface has some problems too. It
>  >   depends on the guest polling instead of an interrupt from the
>  >   hypervisor. It also has no way of informing the kernel exactly how much
>  >   time has elapsed.
> 
>  >   The whole point of all these hacks that we've done internally is that we,
>  >   the hypervisor, know full well how much real time has advanced during the
>  >   VM blackout. If we can at least let the guest know how much to fudge real
>  >   time, it can then poke NTP for better refinement. I worry about using NTP
>  >   as the sole source of truth for such a mechanism, since you'll need to go
>  >   out to the network and any reads until the response comes back are hosed.
> 
> (I'm a kernel newbie, so please excuse any ignorance with respect to kernel
> internals or kernel/hypervisor interfaces.)

Welcome :-)

> We can have it both ways, I think. Let the hypervisor manipulate the guest TSC
> so as to keep the guest kernel's idea of real time as accurate as possible 
> without any awareness required on the guest's side. *Also* give the guest kernel
> a notification in the form of a KVM_PVCLOCK_STOPPED event or whatever else,
> and let the kernel propagate this notification to userspace so that the NTP
> daemon can recombobulate itself as quickly as possible, treating whatever TSC
> adjustment was received as best-effort only.

But what happens to CLOCK_MONOTONIC in this case? We are still accepting
the fact that live migrations destroy CLOCK_MONOTONIC if we directly
advance the guest TSCs to account for elapsed time. The definition of
CLOCK_MONOTONIC is that the clock does not count while the system is
suspended. From the viewpoint of the guest, a live migration appears to
be a forced suspend operation at an arbitrary instruction boundary.
There is no realistic way for the guest to give the illusion that
MONOTONIC has stopped without help from the hypervisor.

> The KVM_PVCLOCK_STOPPED event should trigger a change in some of the
> globals kept by kernel/time/ntp.c (which are visible to userspace through
> adjtimex(2)). In particular, `time_esterror` and `time_maxerror` should get reset
> to `NTP_PHASE_LIMIT` and time_status should get reset to `STA_UNSYNC`.

I do not disagree that NTP needs to throw the book out after a live
migration.

But, the issue is how we convey that to the guest. KVM_PVCLOCK_STOPPED
relies on the guest polling a shared structure, and who knows when the
guest is going to check the structure again? If we inject an interrupt
the guest is likely to check this state in a reasonable amount of time.

Thomas, we're talking about how to not wreck time (as badly) under
virtualization. I know this has been an area of interest to you for a
while ;-) The idea is that the hypervisor should let the guest know
about time travel.

Let's just assume for now that the hypervisor will *not* quiesce
the guest into an S2IDLE state before migration. I think quiesced
migrations are a good thing to have in the toolbelt, but there will
always be host-side problems that require us to migrate a VM off a host
immediately with no time to inform the guest.

Given that, we're deciding which clock is going to get wrecked during
a migration and what the guest can do afterwards to clean it up.
Whichever clock gets wrecked is going to have a window where reads race
with the eventual fix, and could be completely wrong. My thoughts:

We do not advance the TSC during a migration, and we notify the
guest (interrupt, shared structure) about how much time it has
traveled (delta_REALTIME). REALTIME is wrong until the interrupt
is handled in the guest, but should fire off all of the existing
mechanisms for a clock step. Userspace gets notified with
TFD_TIMER_CANCEL_ON_SET. I believe you have proposed something similar
as a way to make live migration less sinister from the guest
perspective.

It seems possible to block racing reads of REALTIME if we protect it with
a migration sequence counter. Host raises the sequence after a migration
when control is yielded back to the guest. The sequence is odd if an
update is pending. Guest increments the sequence again after the
interrupt handler accounts for time travel. That has the side effect of
blocking all realtime clock reads until the interrupt is handled. But
what is the value of those reads if we know they're wrong? There is
also the implication that said shared memory interface gets mapped
through to userspace for the vDSO; I haven't thought through those
implications yet.

Doing this the other way around (advance the TSC, tell the guest to fix
MONOTONIC) is fundamentally wrong, as it violates two invariants of the
monotonic clock. Monotonic counts during a migration, which really is a
forced suspend. Additionally, you cannot step the monotonic clock.

Thoughts?

Sorry to revisit this conversation yet again. Virtualization isn't going
away any time soon and the illusion that migrations are invisible to the
guest is simply not true.

--
Thanks,
Oliver


* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-22 21:53 ` Oliver Upton
@ 2022-03-23 12:35   ` David Woodhouse
  2022-03-23 16:21     ` Oliver Upton
  2022-03-29 14:19   ` Thomas Gleixner
  1 sibling, 1 reply; 22+ messages in thread
From: David Woodhouse @ 2022-03-23 12:35 UTC (permalink / raw)
  To: Oliver Upton, Franke, Daniel
  Cc: Paolo Bonzini, kvm, Sean Christopherson, Vitaly Kuznetsov,
	Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner


On Tue, 2022-03-22 at 21:53 +0000, Oliver Upton wrote:
> But what happens to CLOCK_MONOTONIC in this case? We are still accepting
> the fact that live migrations destroy CLOCK_MONOTONIC if we directly
> advance the guest TSCs to account for elapsed time. The definition of
> CLOCK_MONOTONIC is that the clock does not count while the system is
> suspended. From the viewpoint of the guest, a live migration appears to
> be a forced suspend operation at an arbitrary instruction boundary.
> There is no realistic way for the guest to give the illusion that
> MONOTONIC has stopped without help from the hypervisor.

I'm a little lost there. CLOCK_MONOTONIC is *permitted* to stop when
the guest is suspended, but it's not *mandatory*, surely?

I can buy your assertion that the brownout period of a live migration
(or the time for the host kernel to kexec in the case of live update) 
can be considered a suspend... but regardless of whether that makes it
mandatory to stop the clocks, I prefer to see it a different way. 

In normal operation — especially with CPU overcommit and/or throttling
— there are times when none of the guest's vCPUs will be scheduled for
short periods of time. We don't *have* to stop CLOCK_MONOTONIC when
that happens, do we?

If we want live migration to be guest-transparent, shouldn't we treat
it as much as possible as one of those times when the vCPUs just happen
not to be running for a moment?

On a live update, where the host does a kexec and then resumes the
guest state, the host TSC reference is precisely the same as before the
upate. We basically don't want to change *anything* that the guest sees
in its pvclock information. In fact, we have a local patch to
'KVM_SET_CLOCK_FROM_GUEST' for the live update case, which ensures
exactly that. We then add a delta to the guest TSC as we create each
vCPU in the new KVM; the *offset* interface would be beneficial to us
here (since that offset doesn't change) but we're not using it yet.

For live migration, the same applies — we can just add a delta to the
clock and the guest TSC values, commensurate with the amount of
wallclock time that elapsed from serialisation on the source host, to
deserialisation on the destination.

And it all looks *just* like it would if the vCPUs happened not to be
scheduled for a little while, because the host was busy.


> > The KVM_PVCLOCK_STOPPED event should trigger a change in some of the
> > globals kept by kernel/time/ntp.c (which are visible to userspace through
> > adjtimex(2)). In particular, `time_esterror` and `time_maxerror` should get reset
> > to `NTP_PHASE_LIMIT` and time_status should get reset to `STA_UNSYNC`.
> 
> I do not disagree that NTP needs to throw the book out after a live
> migration.
> 
> But, the issue is how we convey that to the guest. KVM_PVCLOCK_STOPPED
> relies on the guest polling a shared structure, and who knows when the
> guest is going to check the structure again? If we inject an interrupt
> the guest is likely to check this state in a reasonable amount of time.

Ah, but that's the point. A flag in shared memory can be checked
whenever the guest needs to know that it's operating on valid state.
Linux will check it *every* time from pvclock_clocksource_read().

As opposed to a separate interrupt which eventually gets processed some
indefinite amount of time in the future.

> Doing this the other way around (advance the TSC, tell the guest to fix
> MONOTONIC) is fundamentally wrong, as it violates two invariants of the
> monotonic clock. Monotonic counts during a migration, which really is a
> forced suspend. Additionally, you cannot step the monotonic clock.
> 
I don't understand why we can't "step the monotonic clock". Any time we
merely refrain from scheduling the vCPUs for a period of time, that
is surely indistinguishable from a "step" in the monotonic clock?

> Sorry to revisit this conversation yet again. Virtualization isn't going
> away any time soon and the illusion that migrations are invisible to the
> guest is simply not true.

I'll give you the assertion that migrations aren't completely
invisible, but I still think they should be *equivalent* to the vCPU
just not being scheduled for a moment.




* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-23 12:35   ` David Woodhouse
@ 2022-03-23 16:21     ` Oliver Upton
  2022-03-25  9:03       ` David Woodhouse
  0 siblings, 1 reply; 22+ messages in thread
From: Oliver Upton @ 2022-03-23 16:21 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Franke, Daniel, Paolo Bonzini, kvm, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner

On Wed, Mar 23, 2022 at 12:35:17PM +0000, David Woodhouse wrote:
> On Tue, 2022-03-22 at 21:53 +0000, Oliver Upton wrote:
> > But what happens to CLOCK_MONOTONIC in this case? We are still accepting
> > the fact that live migrations destroy CLOCK_MONOTONIC if we directly
> > advance the guest TSCs to account for elapsed time. The definition of
> > CLOCK_MONOTONIC is that the clock does not count while the system is
> > suspended. From the viewpoint of the guest, a live migration appears to
> > be a forced suspend operation at an arbitrary instruction boundary.
> > There is no realistic way for the guest to give the illusion that
> > MONOTONIC has stopped without help from the hypervisor.
> 
> I'm a little lost there. CLOCK_MONOTONIC is *permitted* to stop when
> the guest is suspended, but it's not *mandatory*, surely?
> 
> I can buy your assertion that the brownout period of a live migration
> (or the time for the host kernel to kexec in the case of live update) 
> can be considered a suspend... but regardless of whether that makes it
> mandatory to stop the clocks, I prefer to see it a different way. 
> 
> In normal operation — especially with CPU overcommit and/or throttling
> — there are times when none of the guest's vCPUs will be scheduled for
> short periods of time. We don't *have* to stop CLOCK_MONOTONIC when
> that happens, do we?
> 

You're absolutely right. We've at least accepted MONOTONIC behaving the
way it does for time lost to host scheduling, and expose this through
steal_time for the guest scheduler.

> If we want live migration to be guest-transparent, shouldn't we treat
> it as much as possible as one of those times when the vCPUs just happen
> not to be running for a moment?

There is still a subtle difference between host scheduler pressure and
live migration. It's hard to crisply state whether or not the VM is
actually suspended, as any one of its vCPU threads could actually be
running. Migration is one of those events where we positively know the
guest isn't running at all.

> On a live update, where the host does a kexec and then resumes the
> guest state, the host TSC reference is precisely the same as before the
> update. We basically don't want to change *anything* that the guest sees
> in its pvclock information. In fact, we have a local patch to
> 'KVM_SET_CLOCK_FROM_GUEST' for the live update case, which ensures
> exactly that. We then add a delta to the guest TSC as we create each
> vCPU in the new KVM; the *offset* interface would be beneficial to us
> here (since that offset doesn't change) but we're not using it yet.
> 
> For live migration, the same applies — we can just add a delta to the
> clock and the guest TSC values, commensurate with the amount of
> wallclock time that elapsed from serialisation on the source host, to
> deserialisation on the destination.
> 
> And it all looks *just* like it would if the vCPUs happened not to be
> scheduled for a little while, because the host was busy.

We could continue to get away with TSC advancement, but the critical
part IMO is the upper bound. And what happens when we exceed it.

There is no authoritative documentation around what time looks like as a
guest of KVM, and furthermore what happens when a guest experiences time
travel. Now we're in a particularly undesirable situation where there
are at least three known definitions for time during a migration
(upstream QEMU, Google, Amazon) and it is ~impossible to program guest
software to anticipate our shenanigans.

If we are to do this right we probably need to agree on documented
behavior. If we decide that advancing TSCs is acceptable up to 'X'
seconds, guest kernels could take a change to relax expectations at
least up to this value.

> > > The KVM_PVCLOCK_STOPPED event should trigger a change in some of the
> > > globals kept by kernel/time/ntp.c (which are visible to userspace through
> > > adjtimex(2)). In particular, `time_esterror` and `time_maxerror` should get reset
> > > to `NTP_PHASE_LIMIT` and time_status should get reset to `STA_UNSYNC`.
> > 
> > I do not disagree that NTP needs to throw the book out after a live
> > migration.
> > 
> > But, the issue is how we convey that to the guest. KVM_PVCLOCK_STOPPED
> > relies on the guest polling a shared structure, and who knows when the
> > guest is going to check the structure again? If we inject an interrupt
> > the guest is likely to check this state in a reasonable amount of time.
> 
> Ah, but that's the point. A flag in shared memory can be checked
> whenever the guest needs to know that it's operating on valid state.
> Linux will check it *every* time from pvclock_clocksource_read().
> 
> As opposed to a separate interrupt which eventually gets processed some
> indefinite amount of time in the future.

There are a few annoying things with pvclock, though. It is a per-vCPU
structure, so special care must be taken to act exactly once on a
migration. Also, since commit 7539b174aef4 ("x86: kvmguest: use TSC
clocksource if invariant TSC is exposed") the guest kernel could pick
the TSC over the pvclock by default, so it's hard to say when the pvclock
structure is checked again. This is what I had in mind when suggesting a
doorbell is needed, as there is no good way to know what clocksource the
guest is using.

> > Doing this the other way around (advance the TSC, tell the guest to fix
> > MONOTONIC) is fundamentally wrong, as it violates two invariants of the
> > monotonic clock. Monotonic counts during a migration, which really is a
> > forced suspend. Additionally, you cannot step the monotonic clock.
> > 
> I don't understand why we can't "step the monotonic clock". Any time we
> merely refrain from scheduling the vCPUs for a period of time, that
> is surely indistinguishable from a "step" in the monotonic clock?

Right, there is some nebulous threshold that we've defined as an
acceptable amount of time to 'step' the monotonic clock. I think that
everything to date is built around the assumption that it is a small
amount of time of O(timeslice). Pinning down the upper bound should at
least make clocks more predictable on virtualization.

> > Sorry to revisit this conversation yet again. Virtualization isn't going
> > away any time soon and the illusion that migrations are invisible to the
> > guest is simply not true.
> 
> I'll give you the assertion that migrations aren't completely
> invisible, but I still think they should be *equivalent* to the vCPU
> just not being scheduled for a moment.

I sure hope that migrations are fast enough that it is indistinguishable
from scheduler pressure. I think the situations where that is not the
case are particularly interesting. Defining a limit and having a
mechanism for remedial action could make things more predictable for
guest software.

But agree, and shame on us for the broken virtual hardware when that
isn't the case :-)

--
Thanks,
Oliver



* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-23 16:21     ` Oliver Upton
@ 2022-03-25  9:03       ` David Woodhouse
  2022-03-25 17:47         ` Oliver Upton
  0 siblings, 1 reply; 22+ messages in thread
From: David Woodhouse @ 2022-03-25  9:03 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Franke, Daniel, Paolo Bonzini, kvm, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner


On Wed, 2022-03-23 at 16:21 +0000, Oliver Upton wrote:
> On Wed, Mar 23, 2022 at 12:35:17PM +0000, David Woodhouse wrote:
> > And it all looks *just* like it would if the vCPUs happened not to be
> > scheduled for a little while, because the host was busy.
> 
> We could continue to get away with TSC advancement, but the critical
> part IMO is the upper bound. And what happens when we exceed it.
> 
> There is no authoritative documentation around what time looks like as a
> guest of KVM, and furthermore what happens when a guest experiences time
> travel. Now we're in a particularly undesirable situation where there
> are at least three known definitions for time during a migration
> (upstream QEMU, Google, Amazon) and it is ~impossible to program guest
> software to anticipate our shenanigans.
> 
> If we are to do this right we probably need to agree on documented
> behavior. If we decide that advancing TSCs is acceptable up to 'X'
> seconds, guest kernels could take a change to relax expectations at
> least up to this value.

I'm not sure I even buy this line of argument at all. I don't think of
it as 'TSC advancement'. Or maybe I should put that the other way
round: TSCs advance *all* the time; that's the whole point in them. 

All we do when we live-migrate is move the state from one host to
another, completely intact. From the guest point of view we don't
"advance" the TSC; we just set the guest TSC on the destination host to
precisely¹ the same value that the guest TSC on the source host would
have at that moment if we were to abort the migration and do a
rollback. Either way (rollback or not), the vCPUs weren't *running* for
a period of time, but time and TSCs are entirely the same.

In fact, on live update we literally *don't* change anything that the
guest sees. The pvclock information that the guest gets is *precisely*
the same as before, and the TSC has 'advanced' purely by nature of
being precisely the same CPU just a short period of time in the
future... because that's what TSCs do when time passes :)

Really, we aren't talking about a "maximum advancement of the TSC".
We're talking about a "maximum period for which the vCPUs aren't
scheduled".

And I don't see why there need to be documented hard limits on that. I
see it as a quality of implementation issue. If the host is in swap
death and doesn't manage to run the guest vCPUs for seconds at a time,
that's fairly crappy, but it's not strictly a *timekeeping* issue. And
it's just the same if the final transition period of live migration (or
the kexec time for live update) is excessive.

> > > > The KVM_PVCLOCK_STOPPED event should trigger a change in some of the
> > > > globals kept by kernel/time/ntp.c (which are visible to userspace through
> > > > adjtimex(2)). In particular, `time_esterror` and `time_maxerror` should get reset
> > > > to `NTP_PHASE_LIMIT` and time_status should get reset to `STA_UNSYNC`.
> > > 
> > > I do not disagree that NTP needs to throw the book out after a live
> > > migration.
> > > 

Right. To recap, this is because where I highlighted 'precisely¹'
above, our accuracy is actually limited to the synchronisation of the
wallclock time between the two hosts. If the guest thinks it has a more
accurate NTP sync than either of the hosts do, we may have introduced
an error which is more than the error bounds the guest thinks it has,
and that may ultimately lead to data corruption in some circumstances.

This part is somewhat orthogonal to the above discussion, isn't it?
Regardless of whether we 'step' the TSC or not, we need to signal the
guest to know that it needs to consider itself unsynchronized (or, if
we want to get really fancy, let it know the upper bounds of the error
we just introduced. But let's not).

> > > But, the issue is how we convey that to the guest. KVM_PVCLOCK_STOPPED
> > > relies on the guest polling a shared structure, and who knows when the
> > > guest is going to check the structure again? If we inject an interrupt
> > > the guest is likely to check this state in a reasonable amount of time.
> > 
> > Ah, but that's the point. A flag in shared memory can be checked
> > whenever the guest needs to know that it's operating on valid state.
> > Linux will check it *every* time from pvclock_clocksource_read().
> > 
> > As opposed to a separate interrupt which eventually gets processed some
> > indefinite amount of time in the future.
> 
> There are a few annoying things with pvclock, though. It is a per-vCPU
> structure, so special care must be taken to act exactly once on a
> migration. Also, since commit 7539b174aef4 ("x86: kvmguest: use TSC
> clocksource if invariant TSC is exposed") the guest kernel could pick
> the TSC over the pvclock by default, so it's hard to say when the pvclock
> structure is checked again.

That commit appears to be assuming that the TSC *will* be "stepped", or
as I call it "continue to advance normally at its normal frequency over
elapsed time".

>  This is what I had in mind when suggesting a doorbell is needed, as
> there is no good way to know what clocksource the guest is using.

Yes, perhaps that might be necessary.

The concern is that by the time that doorbell interrupt has finally
been delivered and processed, an inaccurate timestamp could *already*
have been used on the wire in some application's coherency
protocol, and the guest's database could already be hosed.

But I don't see why we can't have both.  I think it makes sense for the
guest kernel to ditch its NTP sync when it sees PVCLOCK_GUEST_STOPPED,
but I'm not entirely averse to the existence of a doorbell mechanism
such as you describe.

> > I'll give you the assertion that migrations aren't completely
> > invisible, but I still think they should be *equivalent* to the vCPU
> > just not being scheduled for a moment.
> 
> I sure hope that migrations are fast enough that it is indistinguishable
> from scheduler pressure. I think the situations where that is not the
> case are particularly interesting. Defining a limit and having a
> mechanism for remedial action could make things more predictable for
> guest software.
> 
> But agree, and shame on us for the broken virtual hardware when that
> isn't the case :-)

Right. The "how to tell the guest that it needs to ditch its NTP sync
status" question needs some more thought but in the short term I think
it makes sense to add get/set TSC mechanisms which work like
KVM_[SG]ET_CLOCK with the KVM_CLOCK_REALTIME flag.

Instead of realtime though, it should use the KVM clock. So reading a
TSC returns a { kvm_ns, TSC value } tuple, and setting it will advance
the value by the appropriate amount just as we advance the KVM clock
for the KVM_CLOCK_REALTIME delta.





* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-25  9:03       ` David Woodhouse
@ 2022-03-25 17:47         ` Oliver Upton
  0 siblings, 0 replies; 22+ messages in thread
From: Oliver Upton @ 2022-03-25 17:47 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Franke, Daniel, Paolo Bonzini, kvm, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
	Thomas Gleixner

On Fri, Mar 25, 2022 at 09:03:50AM +0000, David Woodhouse wrote:
> On Wed, 2022-03-23 at 16:21 +0000, Oliver Upton wrote:
> > On Wed, Mar 23, 2022 at 12:35:17PM +0000, David Woodhouse wrote:
> > > And it all looks *just* like it would if the vCPUs happened not to be
> > > scheduled for a little while, because the host was busy.
> > 
> > We could continue to get away with TSC advancement, but the critical
> > part IMO is the upper bound. And what happens when we exceed it.
> > 
> > There is no authoritative documentation around what time looks like as a
> > guest of KVM, and furthermore what happens when a guest experiences time
> > travel. Now we're in a particularly undesirable situation where there
> > are at least three known definitions for time during a migration
> > (upstream QEMU, Google, Amazon) and it is ~impossible to program guest
> > software to anticipate our shenanigans.
> > 
> > If we are to do this right we probably need to agree on documented
> > behavior. If we decide that advancing TSCs is acceptable up to 'X'
> > seconds, guest kernels could take a change to relax expectations at
> > least up to this value.
> 
> I'm not sure I even buy this line of argument at all. I don't think of
> it as 'TSC advancement'. Or maybe I should put that the other way
> round: TSCs advance *all* the time; that's the whole point in them. 

The only reason I use the term 'TSC advancement' is to account for the
fact that the guest probably feels *very* differently about what the
advancement time actually is. The world around it has continued onwards
while it was out.

> All we do when we live-migrate is move the state from one host to
> another, completely intact. From the guest point of view we don't
> "advance" the TSC; we just set the guest TSC on the destination host to
> precisely¹ the same value that the guest TSC on the source host would
> have at that moment if we were to abort the migration and do a
> rollback. Either way (rollback or not), the vCPUs weren't *running* for
> a period of time, but time and TSCs are entirely the same.
> 
> In fact, on live update we literally *don't* change anything that the
> guest sees. The pvclock information that the guest gets is *precisely*
> the same as before, and the TSC has 'advanced' purely by nature of
> being precisely the same CPU just a short period of time in the
> future... because that's what TSCs do when time passes :)
> 
> Really, we aren't talking about a "maximum advancement of the TSC".
> We're talking about a "maximum period for which the vCPUs aren't
> scheduled".

But that is all built on the assumption that a guest is actually
being *live* migrated. That is entirely a userspace decision, and IMO we
can't glean what the VMM is trying to do just based on how it calls
the KVM APIs.

The only thing we know for certain is the guest state is serialized.
Where that data goes and how long it takes to be acted upon is none of
our business in the kernel.

> And I don't see why there need to be documented hard limits on that. I
> see it as a quality of implementation issue. If the host is in swap
> death and doesn't manage to run the guest vCPUs for seconds at a time,
> that's fairly crappy, but it's not strictly a *timekeeping* issue. And
> it's just the same if the final transition period of live migration (or
> the kexec time for live update) is excessive.

Serializing a VM to persistent storage and resuming it at a later date
is a completely valid use of the KVM APIs. Sure, if the guest goes down
for that long during a live migration then it has every right to scream
at the entity running the VM, but I would argue there are more
lifecycles outside of what our businesses typically do.

> > > > > The KVM_PVCLOCK_STOPPED event should trigger a change in some of the
> > > > > globals kept by kernel/time/ntp.c (which are visible to userspace through
> > > > > adjtimex(2)). In particular, `time_esterror` and `time_maxerror` should get reset
> > > > > to `NTP_PHASE_LIMIT` and time_status should get reset to `STA_UNSYNC`.
> > > > 
> > > > I do not disagree that NTP needs to throw the book out after a live
> > > > migration.
> > > > 
> 
> Right. To recap, this is because where I highlighted 'precisely¹'
> above, our accuracy is actually limited to the synchronisation of the
> wallclock time between the two hosts. If the guest thinks it has a more
> accurate NTP sync than either of the hosts do, we may have introduced
> an error which is more than the error bounds the guest thinks it has,
> and that may ultimately lead to data corruption in some circumstances.

Agreed. And on top of that I would state the guest is likely running on
a different oscillator with its own set of problems since we switched out
the underlying hardware. NTP would need to rediscipline the clock even
if the migration somehow bends our understanding of reality and happens
in zero time.

> This part is somewhat orthogonal to the above discussion, isn't it?
> Regardless of whether we 'step' the TSC or not, we need to signal the
> guest to know that it needs to consider itself unsynchronized (or, if
> we want to get really fancy, let it know the upper bounds of the error
> we just introduced. But let's not).

Right, whatever signal we add should be asserted regardless of whether
the TSCs kept ticking while the guest is out. If there is going to be a
shared structure/message/whatever that conveys delta_REALTIME, you could
just set it to zero if the hypervisor decided to handle time on its own.
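For concreteness, a shared structure carrying delta_REALTIME could look something like the sketch below. Every name here is hypothetical; no such ABI exists in KVM today, and the zero-delta case is exactly the "hypervisor handled time on its own" case above:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical layout for a shared page conveying time travel to the
 * guest. None of these names exist in the KVM ABI today.
 */
struct migration_time_info {
	uint32_t sequence;          /* bumped by the host after each migration */
	uint32_t flags;
	int64_t  delta_realtime_ns; /* 0 if the host already advanced the TSC */
};

/*
 * Guest-side helper: returns the REALTIME step to apply, consuming the
 * event so it is acted on exactly once.
 */
static int64_t consume_time_travel(struct migration_time_info *mti,
				   uint32_t *last_seen_seq)
{
	if (mti->sequence == *last_seen_seq)
		return 0;	/* no migration since the last check */
	*last_seen_seq = mti->sequence;
	return mti->delta_realtime_ns;
}
```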

> > > > But, the issue is how we convey that to the guest. KVM_PVCLOCK_STOPPED
> > > > relies on the guest polling a shared structure, and who knows when the
> > > > guest is going to check the structure again? If we inject an interrupt
> > > > the guest is likely to check this state in a reasonable amount of time.
> > > 
> > > Ah, but that's the point. A flag in shared memory can be checked
> > > whenever the guest needs to know that it's operating on valid state.
> > > Linux will check it *every* time from pvclock_clocksource_read().
> > > 
> > > As opposed to a separate interrupt which eventually gets processed some
> > > indefinite amount of time in the future.
> > 
> > There are a few annoying things with pvclock, though. It is a per-vCPU
> > structure, so special care must be taken to act exactly once on a
> > migration. Also, since commit 7539b174aef4 ("x86: kvmguest: use TSC
> > clocksource if invariant TSC is exposed") the guest kernel could pick
> the TSC over the pvclock by default, so it's hard to say when the pvclock
> > structure is checked again.
> 
> That commit appears to be assuming that the TSC *will* be "stepped", or
> as I call it "continue to advance normally at its normal frequency over
> elapsed time".

Maybe? The author's intent isn't the reason I cared about that commit. It
means we will not detect PVCLOCK_GUEST_STOPPED synchronously with any
clock read, and I imagine we're rather concerned about those reads until
the clock is refined. sched clock would probably be the one to find out.

And sure, changing the clocksource away from the pvclock may not be the
right call if we are going to use it to enlighten NTP, I just feel the
commit above codified that it isn't user error.
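For reference, the polling model being discussed amounts to something like the simplified stand-in below. Only the PVCLOCK_GUEST_STOPPED bit value matches the real pvclock ABI; the structure and function are reduced sketches of what the guest does around pvclock_clocksource_read() / kvm_check_and_clear_guest_paused(), and they illustrate why the flag goes unnoticed once the guest stops reading the pvclock:

```c
#include <assert.h>
#include <stdint.h>

#define PVCLOCK_GUEST_STOPPED	(1 << 1)	/* matches the pvclock ABI bit */

/* Reduced stand-in for struct pvclock_vcpu_time_info (per-vCPU). */
struct pv_time {
	uint32_t version;
	uint8_t  flags;
};

static int guest_was_stopped;	/* stand-in for the kernel-side reaction */

/*
 * The flag is only observed when the structure is actually read, which
 * is the limitation above: a guest on the raw TSC clocksource may not
 * look at this for a long time.
 */
static int check_and_clear_guest_stopped(struct pv_time *src)
{
	if (!(src->flags & PVCLOCK_GUEST_STOPPED))
		return 0;
	src->flags &= ~PVCLOCK_GUEST_STOPPED;
	guest_was_stopped = 1;	/* kernel would poke NTP/watchdog state here */
	return 1;
}
```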

> >  This is what I had in mind when suggesting a doorbell is needed, as
> > there is no good way to know what clocksource the guest is using.
> 
> Yes, perhaps that might be necessary.
> 
> The concern is that by the time that doorbell interrupt has finally
> been delivered and processed, an inaccurate timestamp could *already*
> have been used on the wire in some application's coherency
> protocol, and the guest's database could already be hosed.

Unless you get that software into a known-safe instruction boundary,
who's to say you haven't migrated the guest between the clock read
and eventual packet out to the world? The timestamp would be
pre-blackout.

There's also nothing to prevent any agent within the guest from reading
the clock before NTP refinement comes in. We don't block clock reads on
PVCLOCK_GUEST_STOPPED.

> But I don't see why we can't have both.  I think it makes sense for the
> guest kernel to ditch its NTP sync when it sees PVCLOCK_GUEST_STOPPED,
> but I'm not entirely averse to the existence of a doorbell mechanism
> such as you describe.

Sure, there's nothing stopping us from setting off the alarms when that
bit is set. But:

 - It isn't as helpful for guests using the TSC
 - It does not prevent guest kernel or userspace from getting completely
   wrong clock reads in the interim

Given this, I'm convinced there is real room for improvement, as there is
a significant window of time during which we really wish the guest not to
use its wall clock.

> > > I'll give you the assertion that migrations aren't completely
> > > invisible, but I still think they should be *equivalent* to the vCPU
> > > just not being scheduled for a moment.
> > 
> > I sure hope that migrations are fast enough that it is indistinguishable
> > from scheduler pressure. I think the situations where that is not the
> > case are particularly interesting. Defining a limit and having a
> > mechanism for remedial action could make things more predictable for
> > guest software.
> > 
> > But agree, and shame on us for the broken virtual hardware when that
> > isn't the case :-)
> 
> Right. The "how to tell the guest that it needs to ditch its NTP sync
> status" question needs some more thought
>
> but in the short term I think
> it makes sense to add get/set TSC mechanisms which work like
> KVM_[SG]ET_CLOCK with the KVM_CLOCK_REALTIME flag.
> 
> Instead of realtime though, it should use the KVM clock. So reading a
> TSC returns a { kvm_ns, TSC value } tuple, and setting it will advance
> the value by the appropriate amount just as we advance the KVM clock
> for the KVM_CLOCK_REALTIME delta.

I'm all ears for changes to the TSC mechanisms, I just feel rather
strongly that it needs to support all possible use cases, not just live
migration. Further, KVM doesn't know what userspace is doing and I really
hope that userspace knows what userspace is doing.

If it does, maybe that's the right place to decide how to bend reality.
If we fully commit to KVM advancing clocks for an unbounded amount of
time then there needs to be a *giant* disclaimer for the case of pulling
a VM out of cold storage. Writing back exactly what you saved will in all
likelihood detonate the guest.

I just think it is highly desirable that anything we get upstream is
actually usable by upstream. Even though the hyperscalers only care about
situations where we're likely to immediately resume the guest, there are
other use cases to think about. The TSC offset attributes fit the
immediate interests of my employer, but were only halfway there for
upstream. I'd like to do right on that, because hopefully then I'm not
the only person testing it.

--
Thanks,
Oliver


* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-22 21:53 ` Oliver Upton
  2022-03-23 12:35   ` David Woodhouse
@ 2022-03-29 14:19   ` Thomas Gleixner
  2022-03-29 16:02     ` Oliver Upton
  1 sibling, 1 reply; 22+ messages in thread
From: Thomas Gleixner @ 2022-03-29 14:19 UTC (permalink / raw)
  To: Oliver Upton, Franke, Daniel
  Cc: David Woodhouse, Paolo Bonzini, kvm, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel

On Tue, Mar 22 2022 at 21:53, Oliver Upton wrote:
> On Tue, Mar 22, 2022 at 07:18:20PM +0000, Franke, Daniel wrote:
>> The KVM_PVCLOCK_STOPPED event should trigger a change in some of the
>> globals kept by kernel/time/ntp.c (which are visible to userspace through
>> adjtimex(2)). In particular, `time_esterror` and `time_maxerror` should get reset
>> to `NTP_PHASE_LIMIT` and time_status should get reset to `STA_UNSYNC`.
>
> I do not disagree that NTP needs to throw the book out after a live
> migration.
>
> But, the issue is how we convey that to the guest. KVM_PVCLOCK_STOPPED
> relies on the guest polling a shared structure, and who knows when the
> guest is going to check the structure again? If we inject an interrupt
> the guest is likely to check this state in a reasonable amount of time.
>
> Thomas, we're talking about how to not wreck time (as bad) under
> virtualization. I know this has been an area of interest to you for a
> while ;-) The idea is that the hypervisor should let the guest know
> about time travel.

Finally. It only took 10+ years of ranting :)

> Let's just assume for now that the hypervisor will *not* quiesce
> the guest into an S2IDLE state before migration. I think quiesced
> migrations are a good thing to have in the toolbelt, but there will
> always be host-side problems that require us to migrate a VM off a host
> immediately with no time to inform the guest.

Quiesced migrations are the right thing on the principle of least
surprise.

> Given that, we're deciding which clock is going to get wrecked during
> a migration and what the guest can do afterwards to clean it up.
> Whichever clock gets wrecked is going to have a window where reads race
> with the eventual fix, and could be completely wrong. My thoughts:
>
> We do not advance the TSC during a migration and notify the
> guest (interrupt, shared structure) about how much it has
> time traveled (delta_REALTIME). REALTIME is wrong until the interrupt
> is handled in the guest, but should fire off all of the existing
> mechanisms for a clock step. Userspace gets notified with
> TFD_TIMER_CANCEL_ON_SET. I believe you have proposed something similar
> as a way to make live migration less sinister from the guest
> perspective.

That is still information _after_ the fact, and not well defined.

> It seems possible to block racing reads of REALTIME if we protect it with
> a migration sequence counter. Host raises the sequence after a migration
> when control is yielded back to the guest. The sequence is odd if an
> update is pending. Guest increments the sequence again after the
> interrupt handler accounts for time travel. That has the side effect of
> blocking all realtime clock reads until the interrupt is handled. But
> what is the value of those reads if we know they're wrong? There is
> also the implication that said shared memory interface gets mapped
> through to userspace for vDSO, haven't thought at all about those
> implications yet.

So you need two sequence counters to be checked similar to the
pv_clock() case, which prevents the usage of TSC without the pv_clock()
overhead.
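A sketch of what that doubled-up read loop would look like follows. The layout is hypothetical, modeled loosely on the pvclock version check: both counters are odd while an update is in flight, so the reader spins until the host (or the guest's migration handler) finishes:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical shared page; nothing like this exists in KVM today. */
struct clock_page {
	uint32_t version;	/* odd while the host updates, as in pvclock */
	uint32_t migration_seq;	/* odd while a migration fixup is pending */
	uint64_t realtime_ns;
};

static uint64_t read_realtime(const volatile struct clock_page *cp)
{
	uint32_t v, m;
	uint64_t t;

	do {
		v = cp->version;
		m = cp->migration_seq;
		/* Both counters must be even (no update in flight). */
		t = cp->realtime_ns;
	} while ((v & 1) || (m & 1) ||
		 v != cp->version || m != cp->migration_seq);
	return t;
}
```

This is exactly the overhead being objected to: every CLOCK_REALTIME read, in the vDSO and in the kernel, now pays for two counter checks and can block indefinitely.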

That also means that CLOCK_REALTIME, CLOCK_TAI and CLOCK_REALTIME_COARSE
will need to become special cased in the VDSO and all realtime based
syscall interfaces.

I'm sure that will be well received from a performance perspective.

Aside of that where do you put the limit? Just user space visible read
outs? What about the kernel? Ignore it in the kernel and hope that
nothing will break? That's going to create really hard to diagnose
issues, which I'm absolutely not interested in.

You can't expand this scheme to the kernel in general because the CPU
which should handle the update might be in an interrupt disabled region
reading clock REALTIME....

Also this would just make the race window smaller. It does not solve the
problem in a consistent way. Don't even think about it. It's just
another layer of duct tape over the existing ones.

As I explained before, this does not solve the following:

   T = now();
   -----------> interruption of some sort (Tout)
   use(T);

Any use case which cares about Tout being less than a certain threshold
has to be carefully designed to handle this. See also below.

> Doing this the other way around (advance the TSC, tell the guest to fix
> MONOTONIC) is fundamentally wrong, as it violates two invariants of the
> monotonic clock. Monotonic counts during a migration, which really is a
> forced suspend. Additionally, you cannot step the monotonic clock.

A migration _should_ have suspend semantics, but the forced suspend
which is done by migration today does not have properly defined semantics
at all.

Also clock monotonic can be stepped forward under certain circumstances
and the kernel is very well able to handle it within well defined
limits. Think about scheduled out vCPUs. From their perspective clock
monotonic is stepping forwards.

The problem starts when the 'force suspended' time becomes excessive, as
that causes the mass expiry of clock monotonic timers with all the nasty
side effects described in a3ed0e4393d6. In the worst case it's going to
exceed the catchup limit of non-suspended timekeeping (~440s for a TSC
@2GHz) which in fact breaks the world and some more.

So let me go back to the use cases:

 1) Regular freeze of a VM to disk and resume at some arbitrary point in
    the future.

 2) Live migration

In both cases the proper solution is to make the guest go into a well
defined state (suspend) and resume it on restore. Everything just works
because it is well defined.

Though you say that there is a special case (not that I believe it):

> but there will always be host-side problems that require us to migrate
> a VM off a host immediately with no time to inform the guest.

which is basically what we do today for the very wrong reasons. There you
have two situations:

  1) Trestart - Tstop < TOLERABLE_THRESHOLD

     That's the easy case as you just can adjust TSC on restore by that
     amount on all vCPUs and be done with it. Just works like scheduling
     out all vCPUs for some time.

  2) Trestart - Tstop >= TOLERABLE_THRESHOLD

     Avoid adjusting TSC for the reasons above. That leaves you with the
     options of

     A) Make NTP unhappy as of today

     B) Provide information about the fact that the vCPUs got scheduled
        out for a long time and teach NTP to be smart about it.

        Right now NTP decides to declare itself unstable when it
        observes the time jump on CLOCK_REALTIME from the NTP server.

        That's exactly the situation described above:

        T = now();
        -----------> interruption of some sort (Tout)
        use(T);

        NTP observes that T is inconsistent because it does not know
        about Tout due to the fact that neither clock MONOTONIC nor
        clock BOOTTIME advanced.

        So here is where you can bring PV information into play by
        providing a migration sequence number in PV shared memory.

        On source host:
           Stop the guest dead in its tracks.
           Record metadata:
             - migration sequence number
             - clock TAI as observed on the host
           Transfer the image along with metadata

        On destination host:
           Restore memory image
           Expose metadata in PV:
             - migration sequence number + 1
             - Tout (dest/source host delta of clock TAI)
           Run guest

        Guest kernel:

           - Keep track of the PV migration sequence number.

             If it changed act accordingly by injecting the TAI delta,
             which updates NTP state, wakes TFD_TIMER_CANCEL_ON_SET,
             etc...

             We have to do that in the kernel because we cannot rely on
             NTP being running.

             That can be an explicit IPI injected from the host or just
             polling from the tick. It doesn't matter much.

           - Expose information to user space:
               - Migration sequence counter

           - Add some smarts to adjtimex() to prevent user space racing
             against a pending migration update in the kernel.

        NTP:
           - utilize the sequence counter information 

             seq = get_migration_sequence();

             do stuff();
        
             if (seq != get_migration_sequence())
             	do_something_smart();
             else
                proceed_as_usual();

        That still will leave everything else exposed to
        CLOCK_REALTIME/TAI jumping forward, but there is nothing you can
        do about that and any application which cares about this has to
        be able to deal with it anyway.
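Condensing the guest-kernel side of the outline above into a sketch (all structure and function names are made up for illustration; the real work of updating NTP state and waking TFD_TIMER_CANCEL_ON_SET waiters is elided into flags):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical PV metadata exposed by the destination host. */
struct pv_migration_meta {
	uint64_t sequence;	/* incremented on each restore */
	int64_t  tai_delta_ns;	/* Tout: dest/source host delta of clock TAI */
};

/* Hypothetical guest-side timekeeping state. */
struct guest_time_state {
	uint64_t last_seq;
	int64_t  realtime_offset_ns;	/* REALTIME/TAI offset from MONOTONIC */
	int	 ntp_synced;		/* cleared here, like setting STA_UNSYNC */
};

/* Called from the tick or an explicit IPI, per the outline above. */
static int handle_migration(const struct pv_migration_meta *meta,
			    struct guest_time_state *st)
{
	if (meta->sequence == st->last_seq)
		return 0;
	st->last_seq = meta->sequence;
	st->realtime_offset_ns += meta->tai_delta_ns;	/* inject TAI delta */
	st->ntp_synced = 0;	/* NTP must rediscipline from scratch */
	return 1;		/* caller wakes TFD_TIMER_CANCEL_ON_SET etc. */
}
```

The sequence number is also what NTP polls around its own transactions, as in the `get_migration_sequence()` pseudocode above.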

Hmm?

Thanks,

        tglx


* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-29 14:19   ` Thomas Gleixner
@ 2022-03-29 16:02     ` Oliver Upton
  2022-03-29 19:34       ` Thomas Gleixner
  2022-06-30 11:58       ` David Woodhouse
  0 siblings, 2 replies; 22+ messages in thread
From: Oliver Upton @ 2022-03-29 16:02 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Franke, Daniel, David Woodhouse, Paolo Bonzini, kvm,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel

Hi Thomas,

On Tue, Mar 29, 2022 at 7:19 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> > Let's just assume for now that the hypervisor will *not* quiesce
> > the guest into an S2IDLE state before migration. I think quiesced
> > migrations are a good thing to have in the toolbelt, but there will
> > always be host-side problems that require us to migrate a VM off a host
> > immediately with no time to inform the guest.
>
> Quiesced migrations are the right thing on the principle of least
> surprise.

Absolutely no argument here. Furthermore, it is likely the laziest
solution, since all the adjustment is done within the guest.

The one thing I envision is that it's highly unlikely that a migration
will block until the guest cooperates. If the guest doesn't act within
a certain threshold then sorry, you're going along for the ride
anyway.

But it is very easy to paint that as user error if advance notice was
given for the migration :)

> > It seems possible to block racing reads of REALTIME if we protect it with
> > a migration sequence counter. Host raises the sequence after a migration
> > when control is yielded back to the guest. The sequence is odd if an
> > update is pending. Guest increments the sequence again after the
> > interrupt handler accounts for time travel. That has the side effect of
> > blocking all realtime clock reads until the interrupt is handled. But
> > what is the value of those reads if we know they're wrong? There is
> > also the implication that said shared memory interface gets mapped
> > through to userspace for vDSO, haven't thought at all about those
> > implications yet.
>
> So you need two sequence counters to be checked similar to the
> pv_clock() case, which prevents the usage of TSC without the pv_clock()
> overhead.
>
> That also means that CLOCK_REALTIME, CLOCK_TAI and CLOCK_REALTIME_COARSE
> will need to become special cased in the VDSO and all realtime based
> syscall interfaces.
>
> I'm sure that will be well received from a performance perspective.

Lol. Yeah, I was rambling and this still falls apart for the case
where a clock read came in before the migration, which you raise
below.

> As I explained before, this does not solve the following:
>
>    T = now();
>    -----------> interruption of some sort (Tout)
>    use(T);
>
> Any use case which cares about Tout being less than a certain threshold
> has to be carefully designed to handle this. See also below.
>
> > Doing this the other way around (advance the TSC, tell the guest to fix
> > MONOTONIC) is fundamentally wrong, as it violates two invariants of the
> > monotonic clock. Monotonic counts during a migration, which really is a
> > forced suspend. Additionally, you cannot step the monotonic clock.
>
> A migration _should_ have suspend semantics, but the forced suspend
> which is done by migration today does not have proper defined semantics
> at all.
>
> Also clock monotonic can be stepped forward under certain circumstances
> and the kernel is very well able to handle it within well defined
> limits. Think about scheduled out vCPUs. From their perspective clock
> monotonic is stepping forwards.
>

Right. For better or worse, the kernel has been conditioned to
tolerate small steps due to scheduling. There is just zero definition
around how much slop is allowed.

> The problem starts when the 'force suspended' time becomes excessive, as
> that causes the mass expiry of clock monotonic timers with all the nasty
> side effects described in a3ed0e4393d6. In the worst case it's going to
> exceed the catchup limit of non-suspended timekeeping (~440s for a TSC
> @2GHz) which in fact breaks the world and some more.
>
> So let me go back to the use cases:
>
>  1) Regular freeze of a VM to disk and resume at some arbitrary point in
>     the future.
>
>  2) Live migration
>
> In both cases the proper solution is to make the guest go into a well
> defined state (suspend) and resume it on restore. Everything just works
> because it is well defined.
>
> Though you say that there is a special case (not that I believe it):

I believe the easier special case to articulate is when the hypervisor
has already done its due diligence to warn the guest about a
migration. Guest doesn't heed the warning and doesn't quiesce. The
most predictable state at this point is probably just to kill the VM
on the spot, but that is likely to be a _very_ tough sell :)

So assuming that it's still possible for a non-cooperative suspend
(live migration, VM freeze, etc.) there's still a need to stop the
bleeding. I think you touch on what that may look like:

>   1) Trestart - Tstop < TOLERABLE_THRESHOLD
>
>      That's the easy case as you just can adjust TSC on restore by that
>      amount on all vCPUs and be done with it. Just works like scheduling
>      out all vCPUs for some time.
>
>   2) Trestart - Tstop >= TOLERABLE_THRESHOLD
>
>      Avoid adjusting TSC for the reasons above.

Which naturally will prompt the question: what is the value of
TOLERABLE_THRESHOLD? Speaking from experience (Google already does
something similar, but without a good fallback for exceeding the
threshold), there's ~zero science in deriving that value. But, IMO if
it's at least documented we can make the shenanigans a bit more
predictable. It also makes it very easy to define who
(guest/host) is responsible for cleaning up the mess.

In absence of documentation there's an unlimited license for VM
operators to do as they please and I fear we will forever perpetuate
the pain of time in virt.
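As a sketch of the host-side policy under discussion (the threshold value here is a placeholder, which is exactly the undocumented knob being debated; the structure and names are made up):

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder; deriving a real value is the open question above. */
#define TOLERABLE_THRESHOLD_NS	(2LL * 1000000000)

/* Hypothetical restore decision computed by the destination host. */
struct restore_plan {
	int64_t tsc_advance_ns;	/* advance guest TSCs by this much */
	int64_t tout_ns;	/* report this to the guest as time travel */
};

static struct restore_plan plan_restore(int64_t blackout_ns)
{
	struct restore_plan p = { 0, 0 };

	if (blackout_ns < TOLERABLE_THRESHOLD_NS)
		p.tsc_advance_ns = blackout_ns;	/* case 1: looks like scheduling delay */
	else
		p.tout_ns = blackout_ns;	/* case 2: tell the guest instead */
	return p;
}
```

Documenting the threshold makes exactly this branch predictable for guest software: below it, the guest sees normal scheduling jitter; above it, the guest is told to clean up.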

>      That leaves you with the options of
>
>      A) Make NTP unhappy as of today
>
>      B) Provide information about the fact that the vCPUs got scheduled
>         out for a long time and teach NTP to be smart about it.
>
>         Right now NTP decides to declare itself unstable when it
>         observes the time jump on CLOCK_REALTIME from the NTP server.
>
>         That's exactly the situation described above:
>
>         T = now();
>         -----------> interruption of some sort (Tout)
>         use(T);
>
>         NTP observes that T is inconsistent because it does not know
>         about Tout due to the fact that neither clock MONOTONIC nor
>         clock BOOTTIME advanced.
>
>         So here is where you can bring PV information into play by
>         providing a migration sequence number in PV shared memory.
>
>         On source host:
>            Stop the guest dead in its tracks.
>            Record metadata:
>              - migration sequence number
>              - clock TAI as observed on the host
>            Transfer the image along with metadata
>
>         On destination host:
>            Restore memory image
>            Expose metadata in PV:
>              - migration sequence number + 1
>              - Tout (dest/source host delta of clock TAI)
>            Run guest
>
>         Guest kernel:
>
>            - Keep track of the PV migration sequence number.
>
>              If it changed act accordingly by injecting the TAI delta,
>              which updates NTP state, wakes TFD_TIMER_CANCEL_ON_SET,
>              etc...
>
>              We have to do that in the kernel because we cannot rely on
>              NTP being running.
>
>              That can be an explicit IPI injected from the host or just
>              polling from the tick. It doesn't matter much.
>
>            - Expose information to user space:
>                - Migration sequence counter
>
>            - Add some smarts to adjtimex() to prevent user space racing
>              against a pending migration update in the kernel.
>
>         NTP:
>            - utilize the sequence counter information
>
>              seq = get_migration_sequence();
>
>              do stuff();
>
>              if (seq != get_migration_sequence())
>                 do_something_smart();
>              else
>                 proceed_as_usual();

Agreed pretty much the whole way through. There's no point in keeping
NTP naive at this point.

There's a need to sound the alarm for NTP regardless of whether
TOLERABLE_THRESHOLD is exceeded. David pointed out that the host
advancing the guest clocks (delta injection or TSC advancement) could
inject some error. Also, hardware has likely changed and the new parts
will have their own errors as well.

>         That still will leave everything else exposed to
>         CLOCK_REALTIME/TAI jumping forward, but there is nothing you can
>         do about that and any application which cares about this has to
>         be able to deal with it anyway.

Right. There's no cure-all between hypervisor/guest kernel that could
fix the problem for userspace entirely.

Appreciate you chiming in on this topic yet again.

--
Thanks,
Oliver


* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-29 16:02     ` Oliver Upton
@ 2022-03-29 19:34       ` Thomas Gleixner
  2022-06-30 11:58       ` David Woodhouse
  1 sibling, 0 replies; 22+ messages in thread
From: Thomas Gleixner @ 2022-03-29 19:34 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Franke, Daniel, David Woodhouse, Paolo Bonzini, kvm,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel

Oliver,

On Tue, Mar 29 2022 at 09:02, Oliver Upton wrote:
> On Tue, Mar 29, 2022 at 7:19 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> > Doing this the other way around (advance the TSC, tell the guest to fix
>> > MONOTONIC) is fundamentally wrong, as it violates two invariants of the
>> > monotonic clock. Monotonic counts during a migration, which really is a
>> > forced suspend. Additionally, you cannot step the monotonic clock.
>>
>> A migration _should_ have suspend semantics, but the forced suspend
>> which is done by migration today does not have proper defined semantics
>> at all.
>>
>> Also clock monotonic can be stepped forward under certain circumstances
>> and the kernel is very well able to handle it within well defined
>> limits. Think about scheduled out vCPUs. From their perspective clock
>> monotonic is stepping forwards.
>>
>
> Right. For better or worse, the kernel has been conditioned to
> tolerate small steps due to scheduling. There is just zero definition
> around how much slop is allowed.

From the experience with the MONO=BOOT experiment, I'd say anything < 1
second is unlikely to cause larger trouble, but there might be user
space applications/configurations which disagree :)

>> The problem starts when the 'force suspended' time becomes excessive, as
>> that causes the mass expiry of clock monotonic timers with all the nasty
>> side effects described in a3ed0e4393d6. In the worst case it's going to
>> exceed the catchup limit of non-suspended timekeeping (~440s for a TSC
>> @2GHz) which in fact breaks the world and some more.
>>
>> So let me go back to the use cases:
>>
>>  1) Regular freeze of a VM to disk and resume at some arbitrary point in
>>     the future.
>>
>>  2) Live migration
>>
>> In both cases the proper solution is to make the guest go into a well
>> defined state (suspend) and resume it on restore. Everything just works
>> because it is well defined.
>>
>> Though you say that there is a special case (not that I believe it):
>
> I believe the easier special case to articulate is when the hypervisor
> has already done its due diligence to warn the guest about a
> migration. Guest doesn't heed the warning and doesn't quiesce. The
> most predictable state at this point is probably just to kill the VM
> on the spot, but that is likely to be a _very_ tough sell :)

I'm all for it. It's very well defined.

> So assuming that it's still possible for a non-cooperative suspend
> (live migration, VM freeze, etc.) there's still a need to stop the
> bleeding. I think you touch on what that may look like:
>
>>   1) Trestart - Tstop < TOLERABLE_THRESHOLD
>>
>>      That's the easy case as you just can adjust TSC on restore by that
>>      amount on all vCPUs and be done with it. Just works like scheduling
>>      out all vCPUs for some time.
>>
>>   2) Trestart - Tstop >= TOLERABLE_THRESHOLD
>>
>>      Avoid adjusting TSC for the reasons above.
>
> Which naturally will prompt the question: what is the value of
> TOLERABLE_THRESHOLD? Speaking from experience (Google already does
> something similar, but without a good fallback for exceeding the
> threshold), there's ~zero science in deriving that value. But, IMO if
> it's at least documented we can make the shenanigans at least a bit
> more predictable. It also makes it very easy to define who
> (guest/host) is responsible for cleaning up the mess.

See above, but the hyperscalers with experience on heavy host overload
might have better information about when keeping a vCPU scheduled out
starts to create problems in the guest.
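
Concretely, the two restore cases sketch out like this (hypothetical
names only; per the above there is ~zero science in the actual
threshold value):

```c
#include <stdint.h>

/* Sketch of the restore-time decision: below the threshold, advance
 * the guest TSCs by the blackout time; at or above it, leave the
 * TSCs alone and notify the guest instead. The threshold value is a
 * placeholder. */
#define TOLERABLE_THRESHOLD_NS 1000000000ULL /* placeholder: 1 second */

enum restore_action {
	ADJUST_TSC,   /* case 1: adjust TSC by Trestart - Tstop */
	NOTIFY_GUEST, /* case 2: avoid adjusting TSC, tell the guest */
};

static enum restore_action restore_policy(uint64_t t_stop_ns,
					  uint64_t t_restart_ns)
{
	return (t_restart_ns - t_stop_ns < TOLERABLE_THRESHOLD_NS)
		? ADJUST_TSC : NOTIFY_GUEST;
}
```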

> In absence of documentation there's an unlimited license for VM
> operators to do as they please and I fear we will forever perpetuate
> the pain of time in virt.

You can prevent that, by making 'Cooperate within time or die hard' the
policy. :)

>>              if (seq != get_migration_sequence())
>>                 do_something_smart();
>>              else
>>                 proceed_as_usual();
>
> Agreed pretty much the whole way through. There's no point in keeping
> NTP naive at this point.
>
> There's a need to sound the alarm for NTP regardless of whether
> TOLERABLE_THRESHOLD is exceeded. David pointed out that the host
> advancing the guest clocks (delta injection or TSC advancement) could
> inject some error. Also, hardware has likely changed and the new parts
> will have their own errors as well.

There is no reason why you can't use the #2 scheme for the #1 case
too:

>>         On destination host:
>>            Restore memory image
>>            Expose metadata in PV:
>>              - migration sequence number + 1

                - Flag whether Tout was compensated already via
                  TSC or just set Tout = 0

>>              - Tout (dest/source host delta of clock TAI)
>>            Run guest
>>
>>         Guest kernel:
>>
>>            - Keep track of the PV migration sequence number.
>>
>>              If it changed act accordingly by injecting the TAI delta,
>>              which updates NTP state, wakes TFD_TIMER_CANCEL_ON_SET,
>>              etc...

                if it was compensated via TSC already, it might be
                sufficient to just reset NTP state.

>>         NTP:
>>            - utilize the sequence counter information
                ....
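
Putting the guest-side part of that outline into code, it might look
like the following sketch. None of these names exist in KVM or the
kernel today; the two action helpers are stand-ins for the real work
(injecting the TAI delta updates NTP state, wakes
TFD_TIMER_CANCEL_ON_SET waiters, etc.):

```c
#include <stdint.h>

/* Hypothetical PV metadata exposed by the host. */
struct pv_migration_info {
	uint32_t seq;             /* migration sequence number */
	int64_t  tai_delta_ns;    /* dest/source host delta of clock TAI */
	uint8_t  tsc_compensated; /* Tout already injected via TSC? */
};

/* Stand-ins for the real actions, instrumented for illustration. */
static int64_t injected_tai_delta_ns;
static int     ntp_state_resets;

static void inject_tai_delta(int64_t d) { injected_tai_delta_ns += d; }
static void reset_ntp_state(void)       { ntp_state_resets++; }

static uint32_t last_seq;

static void guest_check_migration(const struct pv_migration_info *pv)
{
	if (pv->seq == last_seq)
		return;			/* proceed_as_usual() */

	last_seq = pv->seq;
	if (pv->tsc_compensated)
		reset_ntp_state();	/* delta already folded into TSC */
	else
		inject_tai_delta(pv->tai_delta_ns);
}
```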

OTOH, the question is whether it's worth it.

If we assume that the sane case is a cooperative guest and the forced
migration is the last resort, then we can just avoid the extra magic and
the discussion around the correct value for TOLERABLE_THRESHOLD
altogether.

I suggest starting from a TOLERABLE_THRESHOLD=0 assumption to keep it
simple in the first step. Once this has been established, you can still
experiment with the threshold and figure out whether it matters.

In fact, doing the host side TSC compensation is just an excuse for VM
operators not to make the guest cooperative, because it might solve
their main problems for the vast majority of migrations.

Forcing them to do it right is definitely the better option, which
means the 'Cooperate or die hard' policy is the best one you can
choose. :)

>>         That still will leave everything else exposed to
>>         CLOCK_REALTIME/TAI jumping forward, but there is nothing you can
>>         do about that and any application which cares about this has to
>>         be able to deal with it anyway.
>
> Right. There's no cure-all between hypervisor/guest kernel that could
> fix the problem for userspace entirely.

In the same way as there is no cure for time jumps caused by
settimeofday(), daylight saving changes, leap seconds etc., unless the
application is carefully designed to deal with that.

> Appreciate you chiming in on this topic yet again.

I still hope that this gets fixed _before_ I retire :)

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-03-29 16:02     ` Oliver Upton
  2022-03-29 19:34       ` Thomas Gleixner
@ 2022-06-30 11:58       ` David Woodhouse
  2022-07-05 14:43         ` David Woodhouse
  1 sibling, 1 reply; 22+ messages in thread
From: David Woodhouse @ 2022-06-30 11:58 UTC (permalink / raw)
  To: Oliver Upton, Thomas Gleixner, sveith
  Cc: Franke, Daniel, Paolo Bonzini, kvm, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel


On Tue, 2022-03-29 at 09:02 -0700, Oliver Upton wrote:
> 
> There's a need to sound the alarm for NTP regardless of whether
> TOLERABLE_THRESHOLD is exceeded. David pointed out that the host
> advancing the guest clocks (delta injection or TSC advancement) could
> inject some error. Also, hardware has likely changed and the new parts
> will have their own errors as well.

I don't admit to pointing that out in that form because I don't accept
the use of the term "advance".

Clocks advance *themselves*. That's what clocks do.

When we perform a live update or live migration we might *adjust* those
clocks, calibrate, synchronise or restore them. But I'll eat my
keyboard before using the term "advance" for that. Even if our
adjustment is in a forward direction.

Let's consider the case of a live update — where we stop scheduling the
guest for a moment, kexec into the new kernel, then resume scheduling
the guest.

I assert strongly that from the guest point of view this is *no*
different to any other brief period of not being scheduled.

Yes, in practice we have a whole new kernel, a whole new KVM and set of
kvm_vcpus, and we've *restored* the state. And we have restored the
TSCs/clocks in those new kvm objects to precisely match what they were
before. Note: *match* not advance.

Before the kexec, there were a bunch of relationships between clocks,
mostly based on the host TSC Tₕ (assuming the case where that's stable
and reliable):

 • The kernel's idea of wallclock time was based on Tₕ, plus some
   offset and divided by some frequency. NTP tweaks those values over
   time but at any given instant there is a current value for them
   which is used to derive the wallclock time.

 • The kernel's idea of the guest kvmclock epoch (nanoseconds since the
   KVM started) was based on Tₕ and some other offset and hopefully the
   same frequency. 

 • The TSC of each vCPU was based on Tₕ, some offset and a TSC scaling
   factor.
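
Each of those is just an affine function of the host TSC Tₕ. As a
deliberately simplified sketch (a double stands in for KVM's
fixed-point scaling ratios and the timekeeper's mult/shift pair):

```c
#include <stdint.h>

/* The per-host constants relating the derived clocks to Th. */
struct host_clocks {
	double   wall_ns_per_cycle;  /* NTP-disciplined frequency */
	int64_t  wall_offset_ns;     /* wallclock = Th*f + offset */
	int64_t  kvmclock_offset_ns; /* kvmclock  = Th*f + offset */
	double   tsc_ratio;          /* guest/host TSC frequency ratio */
	uint64_t tsc_offset;         /* guest TSC = Th*ratio + offset */
};

static int64_t wallclock_ns(const struct host_clocks *c, uint64_t th)
{
	return (int64_t)((double)th * c->wall_ns_per_cycle) + c->wall_offset_ns;
}

static int64_t kvmclock_ns(const struct host_clocks *c, uint64_t th)
{
	return (int64_t)((double)th * c->wall_ns_per_cycle) + c->kvmclock_offset_ns;
}

static uint64_t guest_tsc(const struct host_clocks *c, uint64_t th)
{
	return (uint64_t)((double)th * c->tsc_ratio) + c->tsc_offset;
}
```

A live update preserves this struct verbatim in the new kernel; only
Tₕ itself keeps advancing.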

After a live update, the host TSC Tₕ is just the same as it always was.
Not the same *value* of course; that was never the case from one tick
to the next anyway. It's the same, in that it continues to advance
*itself* at a consistent frequency as real time progresses, which is
what clocks do.

In the new kernel we just want all those other derivative clocks to
*also* be the same as before. That is, the offset and multipliers are
the *same* value. We're not "advancing" those clocks. We're
*preserving* them.


For live migration it's slightly harder because we don't have a
consistent host TSC to use as the basis. The best we can do is NTP-
synchronised wallclock time between the two hosts. And thus I think we
want *these* constants to be preserved across the migration:

 The KVM's kvmclock was <K> at a given wallclock time <W>

 The TSC of each vCPU#n was <Tₙ> at a given value of kvmclock <Kₙ>
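
The second invariant translates directly into the adjustment the
destination has to make: given that vCPU TSC Tₙ was correct at
kvmclock Kₙ, the value to load once the destination's kvmclock reads
some later time is Tₙ advanced at the guest's TSC frequency. A sketch
(hypothetical helper; overflow handling for huge deltas ignored):

```c
#include <stdint.h>

#define NSEC_PER_MSEC 1000000LL

/* Preserve "TSC was tn at kvmclock kn": advance tn by the kvmclock
 * time elapsed since kn, at the guest's virtual TSC frequency
 * (tsc_khz is cycles per millisecond). */
static uint64_t tsc_at_kvmclock(uint64_t tn, int64_t kn, int64_t k_now,
				uint32_t tsc_khz)
{
	int64_t delta_ns = k_now - kn;

	return tn + (uint64_t)(delta_ns * (int64_t)tsc_khz / NSEC_PER_MSEC);
}
```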

In *this* case we are running on different hardware and the reliance on
the NTP wallclock time as the basis for preserving the guest clocks may
have introduced an error, as well as the fact that the hardware has
changed. So in this case we should indeed inform the guest that it
should consider itself out of NTP sync and start over, in *addition* to
making a best effort to preserve those clocks.

But there is no scope for the word "advance" to be used anywhere there
either.





* Re: [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm
  2022-06-30 11:58       ` David Woodhouse
@ 2022-07-05 14:43         ` David Woodhouse
  2022-07-07 16:43           ` [PATCH 1/2] KVM: x86: add KVM clock time reference arg to kvm_write_tsc() Simon Veith
  0 siblings, 1 reply; 22+ messages in thread
From: David Woodhouse @ 2022-07-05 14:43 UTC (permalink / raw)
  To: Oliver Upton, Thomas Gleixner, Veith, Simon
  Cc: Franke, Daniel, Paolo Bonzini, kvm, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel


On Thu, 2022-06-30 at 12:58 +0100, David Woodhouse wrote:
>  The TSC of each vCPU#n was <Tₙ> at a given value of kvmclock <Kₙ>

That one isn't even particularly hard to do. At its core, it looks
something like this (thanks to Simon for fixing the arithmetic. My
first proof-of-concept was only a *few* orders of magnitude out...)

Simon then has a follow-on patch to wrap an ioctl around it and
actually allow the kvm_ns to be passed in.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1910e1e78b15..ddb67c9566e4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2629,7 +2629,7 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
 	kvm_track_tsc_matching(vcpu);
 }
 
-static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
+static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data, u64 *kvm_ns)
 {
 	struct kvm *kvm = vcpu->kvm;
 	u64 offset, ns, elapsed;
@@ -2638,8 +2638,23 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
 	bool synchronizing = false;
 
 	raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
-	offset = kvm_compute_l1_tsc_offset(vcpu, data);
 	ns = get_kvmclock_base_ns();
+	if (kvm_ns) {
+ 		/*
+ 		 * We have been provided a KVM clock reference time point at
+ 		 * which this TSC value was correct. Use this time point to
+		 * compensate for any delays that were incurred since that
+		 * TSC value was computed. Note that time is still passing;
+		 * in an ideal world the get_kvmclock_base_ns() above would
+		 * be based on the *same* TSC value that's hidden in the call
+		 * to kvm_compute_l1_tsc_offset() below.
+ 		 */
+ 		u32 tsc_khz = vcpu->arch.virtual_tsc_khz;
+ 		s64 delta_ns = ns + vcpu->kvm->arch.kvmclock_offset - *kvm_ns;
+ 		s64 delta_counts = delta_ns * tsc_khz / NSEC_PER_MSEC;
+ 		data += delta_counts;
+	}
+	offset = kvm_compute_l1_tsc_offset(vcpu, data);
 	elapsed = ns - kvm->arch.last_tsc_nsec;
 
 	if (vcpu->arch.virtual_tsc_khz) {
@@ -3581,7 +3596,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		break;
 	case MSR_IA32_TSC:
 		if (msr_info->host_initiated) {
-			kvm_synchronize_tsc(vcpu, data);
+			kvm_synchronize_tsc(vcpu, data, NULL);
 		} else {
 			u64 adj = kvm_compute_l1_tsc_offset(vcpu, data) - vcpu->arch.l1_tsc_offset;
 			adjust_tsc_offset_guest(vcpu, adj);
@@ -11392,7 +11407,7 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
 	if (mutex_lock_killable(&vcpu->mutex))
 		return;
 	vcpu_load(vcpu);
-	kvm_synchronize_tsc(vcpu, 0);
+	kvm_synchronize_tsc(vcpu, 0, NULL);
 	vcpu_put(vcpu);
 
 	/* poll control enabled by default */




* [PATCH 1/2] KVM: x86: add KVM clock time reference arg to kvm_write_tsc()
  2022-07-05 14:43         ` David Woodhouse
@ 2022-07-07 16:43           ` Simon Veith
  2022-07-07 16:43             ` [PATCH 2/2] KVM: x86: add KVM_VCPU_TSC_VALUE attribute Simon Veith
  2022-07-29 21:14             ` [PATCH 1/2] KVM: x86: add KVM clock time reference arg to kvm_write_tsc() Sean Christopherson
  0 siblings, 2 replies; 22+ messages in thread
From: Simon Veith @ 2022-07-07 16:43 UTC (permalink / raw)
  To: dwmw2
  Cc: dff, jmattson, joro, kvm, oupton, pbonzini, seanjc, sveith, tglx,
	vkuznets, wanpengli, David Woodhouse

The Time Stamp Counter (TSC) value register can be set to an absolute
value using the KVM_SET_MSRS ioctl, which calls kvm_synchronize_tsc()
internally.

Since this is a per-vCPU register, and vCPUs are often managed by
separate threads, setting a uniform TSC value across all vCPUs is
challenging: After live migration, for example, the TSC value may need
to be adjusted to account for the migration downtime. Ideally, we would
want each vCPU to be adjusted by the same offset; but if we compute the
offset centrally, the TSC value may become out of date due to scheduling
delays by the time that each vCPU thread gets around to issuing
KVM_SET_MSRS.

In preparation for the next patch, this change adds an optional, KVM
clock based time reference argument to kvm_synchronize_tsc(). This
argument, if present, is understood to mean "the TSC value being written
was valid at this corresponding KVM clock time point".

kvm_synchronize_tsc() will then use this clock reference to adjust the
TSC value being written for any delays that have been incurred since the
provided TSC value was valid.

Co-developed-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Simon Veith <sveith@amazon.de>
---
 arch/x86/kvm/x86.c | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1910e1e78b15..a44d083f1bf9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2629,7 +2629,7 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
 	kvm_track_tsc_matching(vcpu);
 }
 
-static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
+static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data, u64 *kvm_ns)
 {
 	struct kvm *kvm = vcpu->kvm;
 	u64 offset, ns, elapsed;
@@ -2638,12 +2638,24 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
 	bool synchronizing = false;
 
 	raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
-	offset = kvm_compute_l1_tsc_offset(vcpu, data);
 	ns = get_kvmclock_base_ns();
+
+	if (kvm_ns) {
+		/*
+		 * We have been provided a KVM clock reference time point at
+		 * which this TSC value was correct.
+		 * Use this time point to compensate for any delays that were
+		 * incurred since that TSC value was valid.
+		 */
+		s64 delta_ns = ns + vcpu->kvm->arch.kvmclock_offset - *kvm_ns;
+		data += nsec_to_cycles(vcpu, (u64)delta_ns);
+	}
+
+	offset = kvm_compute_l1_tsc_offset(vcpu, data);
 	elapsed = ns - kvm->arch.last_tsc_nsec;
 
 	if (vcpu->arch.virtual_tsc_khz) {
-		if (data == 0) {
+		if (data == 0 && !kvm_ns) {
 			/*
 			 * detection of vcpu initialization -- need to sync
 			 * with other vCPUs. This particularly helps to keep
@@ -3581,7 +3593,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		break;
 	case MSR_IA32_TSC:
 		if (msr_info->host_initiated) {
-			kvm_synchronize_tsc(vcpu, data);
+			kvm_synchronize_tsc(vcpu, data, NULL);
 		} else {
 			u64 adj = kvm_compute_l1_tsc_offset(vcpu, data) - vcpu->arch.l1_tsc_offset;
 			adjust_tsc_offset_guest(vcpu, adj);
@@ -11392,7 +11404,7 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
 	if (mutex_lock_killable(&vcpu->mutex))
 		return;
 	vcpu_load(vcpu);
-	kvm_synchronize_tsc(vcpu, 0);
+	kvm_synchronize_tsc(vcpu, 0, NULL);
 	vcpu_put(vcpu);
 
 	/* poll control enabled by default */
-- 
2.25.1




Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879





* [PATCH 2/2] KVM: x86: add KVM_VCPU_TSC_VALUE attribute
  2022-07-07 16:43           ` [PATCH 1/2] KVM: x86: add KVM clock time reference arg to kvm_write_tsc() Simon Veith
@ 2022-07-07 16:43             ` Simon Veith
  2022-07-29 21:21               ` Sean Christopherson
  2022-07-29 21:14             ` [PATCH 1/2] KVM: x86: add KVM clock time reference arg to kvm_write_tsc() Sean Christopherson
  1 sibling, 1 reply; 22+ messages in thread
From: Simon Veith @ 2022-07-07 16:43 UTC (permalink / raw)
  To: dwmw2
  Cc: dff, jmattson, joro, kvm, oupton, pbonzini, seanjc, sveith, tglx,
	vkuznets, wanpengli

With the previous commit having added a KVM clock time reference to
kvm_synchronize_tsc(), this patch adds a new TSC attribute
KVM_VCPU_TSC_VALUE that allows for setting the TSC value in a way that
is unaffected by scheduling delays.

Userspace provides a struct kvm_vcpu_tsc_value consisting of a matched
pair of ( guest TSC value, KVM clock value ). The TSC value that will
ultimately be written is adjusted to account for the time which has
elapsed since the given KVM clock time point.

In order to allow userspace to retrieve an accurate time reference
atomically, without being affected by scheduling delays between
KVM_GET_CLOCK and KVM_GET_MSRS, the KVM_GET_DEVICE_ATTR implementation
for this attribute uses get_kvmclock() internally and returns a struct
kvm_vcpu_tsc_value with both values in one go. If get_kvmclock()
supports the KVM_CLOCK_HOST_TSC flag, the two will be based on one and
the same host TSC reading.

Signed-off-by: Simon Veith <sveith@amazon.de>
---
 Documentation/virt/kvm/devices/vcpu.rst | 22 +++++++++++++
 arch/x86/include/uapi/asm/kvm.h         |  7 +++++
 arch/x86/kvm/x86.c                      | 41 +++++++++++++++++++++++++
 tools/arch/x86/include/uapi/asm/kvm.h   |  7 +++++
 4 files changed, 77 insertions(+)

diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
index 716aa3edae14..a9506d902b0a 100644
--- a/Documentation/virt/kvm/devices/vcpu.rst
+++ b/Documentation/virt/kvm/devices/vcpu.rst
@@ -263,3 +263,25 @@ From the destination VMM process:
 
 7. Write the KVM_VCPU_TSC_OFFSET attribute for every vCPU with the
    respective value derived in the previous step.
+
+4.2 ATTRIBUTE: KVM_VCPU_TSC_VALUE
+
+:Parameters: kvm_device_attr.addr points to a struct kvm_vcpu_tsc_value
+
+Returns:
+
+	 ======= ======================================
+	 -EFAULT Error reading/writing the provided
+		 parameter address.
+	 -ENXIO  Attribute not supported
+	 ======= ======================================
+
+Gets or sets a matched pair of guest TSC value and KVM clock time point.
+
+When setting the TSC value through this attribute, a corresponding KVM clock
+reference time point (as retrieved by KVM_GET_CLOCK in the clock field) must be
+provided.
+
+The actual TSC value written will be adjusted based on the time that has
+elapsed since the provided reference time point, taking TSC scaling into
+account.
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 21614807a2cb..79de9e34cfa8 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -525,5 +525,12 @@ struct kvm_pmu_event_filter {
 /* for KVM_{GET,SET,HAS}_DEVICE_ATTR */
 #define KVM_VCPU_TSC_CTRL 0 /* control group for the timestamp counter (TSC) */
 #define   KVM_VCPU_TSC_OFFSET 0 /* attribute for the TSC offset */
+#define   KVM_VCPU_TSC_VALUE 1 /* attribute for the TSC value */
+
+/* for KVM_VCPU_TSC_VALUE */
+struct kvm_vcpu_tsc_value {
+	__u64 tsc_val;
+	__u64 kvm_ns;
+};
 
 #endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a44d083f1bf9..ed8c2729eae2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5169,6 +5169,7 @@ static int kvm_arch_tsc_has_attr(struct kvm_vcpu *vcpu,
 
 	switch (attr->attr) {
 	case KVM_VCPU_TSC_OFFSET:
+	case KVM_VCPU_TSC_VALUE:
 		r = 0;
 		break;
 	default:
@@ -5194,6 +5195,32 @@ static int kvm_arch_tsc_get_attr(struct kvm_vcpu *vcpu,
 			break;
 		r = 0;
 		break;
+	case KVM_VCPU_TSC_VALUE: {
+		struct kvm_vcpu_tsc_value __user *tsc_value_arg;
+		struct kvm_vcpu_tsc_value tsc_value;
+		struct kvm_clock_data kvm_clock;
+		u64 host_tsc, guest_tsc, ratio, offset;
+
+		get_kvmclock(vcpu->kvm, &kvm_clock);
+		if (kvm_clock.flags & KVM_CLOCK_HOST_TSC)
+			host_tsc = kvm_clock.host_tsc;
+		else
+			host_tsc = rdtsc();
+
+		ratio = vcpu->arch.l1_tsc_scaling_ratio;
+		offset = vcpu->arch.l1_tsc_offset;
+		guest_tsc = kvm_scale_tsc(host_tsc, ratio) + offset;
+
+		tsc_value.kvm_ns = kvm_clock.clock;
+		tsc_value.tsc_val = guest_tsc;
+
+		tsc_value_arg = (struct kvm_vcpu_tsc_value __user *)uaddr;
+		r = -EFAULT;
+		if (copy_to_user(tsc_value_arg, &tsc_value, sizeof(tsc_value)))
+			break;
+		r = 0;
+		break;
+	}
 	default:
 		r = -ENXIO;
 	}
@@ -5236,6 +5263,20 @@ static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu,
 		r = 0;
 		break;
 	}
+	case KVM_VCPU_TSC_VALUE: {
+		struct kvm_vcpu_tsc_value __user *tsc_value_arg;
+		struct kvm_vcpu_tsc_value tsc_value;
+
+		tsc_value_arg = (struct kvm_vcpu_tsc_value __user *)uaddr;
+		r = -EFAULT;
+		if (copy_from_user(&tsc_value, tsc_value_arg, sizeof(tsc_value)))
+			break;
+
+		kvm_synchronize_tsc(vcpu, tsc_value.tsc_val, &tsc_value.kvm_ns);
+
+		r = 0;
+		break;
+	}
 	default:
 		r = -ENXIO;
 	}
diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
index 21614807a2cb..79de9e34cfa8 100644
--- a/tools/arch/x86/include/uapi/asm/kvm.h
+++ b/tools/arch/x86/include/uapi/asm/kvm.h
@@ -525,5 +525,12 @@ struct kvm_pmu_event_filter {
 /* for KVM_{GET,SET,HAS}_DEVICE_ATTR */
 #define KVM_VCPU_TSC_CTRL 0 /* control group for the timestamp counter (TSC) */
 #define   KVM_VCPU_TSC_OFFSET 0 /* attribute for the TSC offset */
+#define   KVM_VCPU_TSC_VALUE 1 /* attribute for the TSC value */
+
+/* for KVM_VCPU_TSC_VALUE */
+struct kvm_vcpu_tsc_value {
+	__u64 tsc_val;
+	__u64 kvm_ns;
+};
 
 #endif /* _ASM_X86_KVM_H */
-- 
2.25.1








* Re: [PATCH 1/2] KVM: x86: add KVM clock time reference arg to kvm_write_tsc()
  2022-07-07 16:43           ` [PATCH 1/2] KVM: x86: add KVM clock time reference arg to kvm_write_tsc() Simon Veith
  2022-07-07 16:43             ` [PATCH 2/2] KVM: x86: add KVM_VCPU_TSC_VALUE attribute Simon Veith
@ 2022-07-29 21:14             ` Sean Christopherson
  2023-02-02 16:35               ` David Woodhouse
  1 sibling, 1 reply; 22+ messages in thread
From: Sean Christopherson @ 2022-07-29 21:14 UTC (permalink / raw)
  To: Simon Veith
  Cc: dwmw2, dff, jmattson, joro, kvm, oupton, pbonzini, tglx,
	vkuznets, wanpengli, David Woodhouse

On Thu, Jul 07, 2022, Simon Veith wrote:
> The Time Stamp Counter (TSC) value register can be set to an absolute
> value using the KVM_SET_MSRS ioctl, which calls kvm_synchronize_tsc()
> internally.
> 
> Since this is a per-vCPU register, and vCPUs are often managed by
> separate threads, setting a uniform TSC value across all vCPUs is
> challenging: After live migration, for example, the TSC value may need
> to be adjusted to account for the migration downtime. Ideally, we would
> want each vCPU to be adjusted by the same offset; but if we compute the
> offset centrally, the TSC value may become out of date due to scheduling
> delays by the time that each vCPU thread gets around to issuing
> KVM_SET_MSRS.
> 
> In preparation for the next patch, this change adds an optional, KVM
> clock based time reference argument to kvm_synchronize_tsc(). This
> argument, if present, is understood to mean "the TSC value being written
> was valid at this corresponding KVM clock time point".


Given that commit 828ca89628bf ("KVM: x86: Expose TSC offset controls to userspace")
was merged less than a year ago, it would be helpful to explicitly call out why
KVM_VCPU_TSC_CTRL doesn't work, and why that sub-ioctl can't be extended/modified to
make it work.

> kvm_synchronize_tsc() will then use this clock reference to adjust the
> TSC value being written for any delays that have been incurred since the
> provided TSC value was valid.
> 
> Co-developed-by: David Woodhouse <dwmw@amazon.co.uk>

Needs David's SOB.  And this patch should be squashed with the next patch.  The
kvm_ns != NULL path is dead code here, e.g. even if the logic is wildly broken,
bisection will blame the patch that actually passes a non-null reference.  Neither
patch is big enough to warrant splitting in this way.

And more importantly, the next patch's changelog provides a thorough description
of what it's doing, but there's very little in there that describes _why_ KVM
wants to provide this functionality.

> Signed-off-by: Simon Veith <sveith@amazon.de>
> ---
>  arch/x86/kvm/x86.c | 22 +++++++++++++++++-----
>  1 file changed, 17 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 1910e1e78b15..a44d083f1bf9 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2629,7 +2629,7 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
>  	kvm_track_tsc_matching(vcpu);
>  }
>  
> -static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
> +static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data, u64 *kvm_ns)
>  {
>  	struct kvm *kvm = vcpu->kvm;
>  	u64 offset, ns, elapsed;
> @@ -2638,12 +2638,24 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
>  	bool synchronizing = false;
>  
>  	raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
> -	offset = kvm_compute_l1_tsc_offset(vcpu, data);
>  	ns = get_kvmclock_base_ns();
> +
> +	if (kvm_ns) {
> +		/*
> +		 * We have been provided a KVM clock reference time point at

Avoid pronouns, e.g. there's no need to say "We have been provided", just state
what kvm_ns is and how it's used.

> +		 * which this TSC value was correct.
> +		 * Use this time point to compensate for any delays that were
> +		 * incurred since that TSC value was valid.
> +		 */
> +		s64 delta_ns = ns + vcpu->kvm->arch.kvmclock_offset - *kvm_ns;
> +		data += nsec_to_cycles(vcpu, (u64)delta_ns);
> +	}


* Re: [PATCH 2/2] KVM: x86: add KVM_VCPU_TSC_VALUE attribute
  2022-07-07 16:43             ` [PATCH 2/2] KVM: x86: add KVM_VCPU_TSC_VALUE attribute Simon Veith
@ 2022-07-29 21:21               ` Sean Christopherson
  0 siblings, 0 replies; 22+ messages in thread
From: Sean Christopherson @ 2022-07-29 21:21 UTC (permalink / raw)
  To: Simon Veith
  Cc: dwmw2, dff, jmattson, joro, kvm, oupton, pbonzini, tglx,
	vkuznets, wanpengli

On Thu, Jul 07, 2022, Simon Veith wrote:
> With the previous commit having added a KVM clock time reference to

Moot point if the two patches are squashed, but avoid "previous commit" and "next
commit" and instead use less precise language, e.g. "Now that TSC synchronization
can account for a KVM clock time reference, blah blah blah".

> kvm_synchronize_tsc(), this patch adds a new TSC attribute

Don't use "this patch", the reader knows it's a patch.  Just state what's being
done in imperative mood, e.g. "add a pair of ioctls to allow getting and setting ..."

> KVM_VCPU_TSC_VALUE that allows for setting the TSC value in a way that
> is unaffected by scheduling delays.
> 
> Userspace provides a struct kvm_vcpu_tsc_value consisting of a matched
> pair of ( guest TSC value, KVM clock value ). The TSC value that will
> ultimately be written is adjusted to account for the time which has
> elapsed since the given KVM clock time point.
> 
> In order to allow userspace to retrieve an accurate time reference
> atomically, without being affected by scheduling delays between
> KVM_GET_CLOCK and KVM_GET_MSRS, the KVM_GET_DEVICE_ATTR implementation
> for this attribute uses get_kvmclock() internally and returns a struct
> kvm_vcpu_tsc_value with both values in one go. If get_kvmclock()
> supports the KVM_CLOCK_HOST_TSC flag, the two will be based on one and
> the same host TSC reading.
> 
> Signed-off-by: Simon Veith <sveith@amazon.de>
> ---
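
For reference, the userspace side of the flow described in the quoted
changelog might look roughly like the sketch below. The uapi bits
(KVM_VCPU_TSC_VALUE, struct kvm_vcpu_tsc_value) exist only in this
series, so they are copied locally rather than taken from
<linux/kvm.h>, and KVM_SET_DEVICE_ATTR's request value is hard-coded
from its _IOW(KVMIO, 0xe1, struct kvm_device_attr) encoding:

```c
#include <stdint.h>
#include <sys/ioctl.h>

/* Local stand-ins for the uapi definitions. */
#define KVM_VCPU_TSC_CTRL  0
#define KVM_VCPU_TSC_VALUE 1

struct kvm_vcpu_tsc_value {
	uint64_t tsc_val;
	uint64_t kvm_ns;
};

struct kvm_device_attr {
	uint32_t flags;
	uint32_t group;
	uint64_t attr;
	uint64_t addr;
};

/* _IOW(0xAE, 0xe1, struct kvm_device_attr); from <linux/kvm.h> in
 * real code. */
#define KVM_SET_DEVICE_ATTR 0x4018aee1u

static struct kvm_device_attr tsc_value_attr(const struct kvm_vcpu_tsc_value *v)
{
	struct kvm_device_attr a = {
		.group = KVM_VCPU_TSC_CTRL,
		.attr  = KVM_VCPU_TSC_VALUE,
		.addr  = (uint64_t)(uintptr_t)v,
	};
	return a;
}

/* One (tsc_val, kvm_ns) pair read on the source can be handed to every
 * destination vCPU thread; kvm_synchronize_tsc() then compensates for
 * whatever scheduling delay each thread incurs. */
static int set_tsc_value(int vcpu_fd, const struct kvm_vcpu_tsc_value *v)
{
	struct kvm_device_attr a = tsc_value_attr(v);

	return ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &a);
}
```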


* Re: [PATCH 1/2] KVM: x86: add KVM clock time reference arg to kvm_write_tsc()
  2022-07-29 21:14             ` [PATCH 1/2] KVM: x86: add KVM clock time reference arg to kvm_write_tsc() Sean Christopherson
@ 2023-02-02 16:35               ` David Woodhouse
  2023-02-02 16:59                 ` [PATCH v2] KVM: x86: add KVM_VCPU_TSC_VALUE attribute Simon Veith
  0 siblings, 1 reply; 22+ messages in thread
From: David Woodhouse @ 2023-02-02 16:35 UTC (permalink / raw)
  To: Sean Christopherson, Simon Veith
  Cc: dff, jmattson, joro, kvm, oupton, pbonzini, tglx, vkuznets, wanpengli

[-- Attachment #1: Type: text/plain, Size: 1952 bytes --]

On Fri, 2022-07-29 at 21:14 +0000, Sean Christopherson wrote:
> On Thu, Jul 07, 2022, Simon Veith wrote:
> > The Time Stamp Counter (TSC) value register can be set to an absolute
> > value using the KVM_SET_MSRS ioctl, which calls kvm_synchronize_tsc()
> > internally.
> > 
> > Since this is a per-vCPU register, and vCPUs are often managed by
> > separate threads, setting a uniform TSC value across all vCPUs is
> > challenging: After live migration, for example, the TSC value may need
> > to be adjusted to account for the migration downtime. Ideally, we would
> > want each vCPU to be adjusted by the same offset; but if we compute the
> > offset centrally, the TSC value may become out of date due to scheduling
> > delays by the time that each vCPU thread gets around to issuing
> > KVM_SET_MSRS.
> > 
> > In preparation for the next patch, this change adds an optional, KVM
> > clock based time reference argument to kvm_synchronize_tsc(). This
> > argument, if present, is understood to mean "the TSC value being written
> > was valid at this corresponding KVM clock time point".
> 
> 
> Given that commit 828ca89628bf ("KVM: x86: Expose TSC offset controls to userspace")
> was merged less than a year ago, it would be helpful to explicitly call out why
> KVM_VCPU_TSC_CTRL doesn't work, and why that sub-ioctl can't be extended/modified to
> make it work.
> 
> > kvm_synchronize_tsc() will then use this clock reference to adjust the
> > TSC value being written for any delays that have been incurred since the
> > provided TSC value was valid.
> > 
> > Co-developed-by: David Woodhouse <dwmw@amazon.co.uk>
> 
> Needs David's SOB. 

For the record (and for the half-baked "do it like this, except in a
way that might stand a chance of compiling..." that I threw at Simon on
Slack and didn't necessarily even expect him to credit me for
authorship)

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>



* [PATCH v2] KVM: x86: add KVM_VCPU_TSC_VALUE attribute
  2023-02-02 16:35               ` David Woodhouse
@ 2023-02-02 16:59                 ` Simon Veith
  2023-03-15 19:57                   ` Sean Christopherson
  0 siblings, 1 reply; 22+ messages in thread
From: Simon Veith @ 2023-02-02 16:59 UTC (permalink / raw)
  To: dwmw2
  Cc: dff, jmattson, joro, kvm, oupton, pbonzini, seanjc, sveith, tglx,
	vkuznets, wanpengli

The Time Stamp Counter (TSC) value register can be set to an absolute
value using the KVM_SET_MSRS ioctl. Since this is a per-vCPU register,
and vCPUs are often managed by separate threads, setting a uniform TSC
value across all vCPUs is challenging: After a live update or live
migration, the TSC value may need to be adjusted to account for the
incurred downtime.

Ideally, we want such an adjustment to happen uniformly across all
vCPUs; however, if we compute the offset centrally, the TSC value may
become out of date due to scheduling delays by the time that each vCPU
thread gets around to issuing KVM_SET_MSRS. Such delays can lead to
unaccounted pause time (the guest observes that its system clock has
fallen behind the NTP reference time).

To avoid such inaccuracies from the use of KVM_SET_MSRS, there is an
alternative attribute KVM_VCPU_TSC_OFFSET which, rather than setting an
absolute TSC value, defines it in terms of the offset to be applied
relative to the host's TSC value. Using this attribute, the TSC can be
adjusted reliably by userspace, but only if TSC scaling remains
unchanged, i.e., in the case of a live update on the same host, and not
when live migrating an instance between hosts with different TSC scaling
parameters.

In the case of live migration, using the KVM_VCPU_TSC_OFFSET approach to
preserve the TSC value and apply a known offset would require
duplicating the TSC scaling computations in userspace to account for
frequency differences between source and destination TSCs.

Hence, if userspace wants to set the TSC to some known value without
having to deal with TSC scaling, and while also being resilient against
scheduling delays, neither KVM_SET_MSRS nor KVM_VCPU_TSC_OFFSET is a
suitable option.

Add a new TSC attribute KVM_VCPU_TSC_VALUE that allows for setting the
TSC value in a way that is unaffected by scheduling delays, handling TSC
scaling internally.

Add an optional, KVM clock based time reference argument to
kvm_synchronize_tsc(). This argument, if present, is understood to mean
"the TSC value being written was valid at this corresponding KVM clock
time point".

Userspace provides a struct kvm_vcpu_tsc_value consisting of a matched
pair of (guest TSC value, KVM clock value). The TSC value that will
ultimately be written is adjusted to account for the time which has
elapsed since the given KVM clock time point.

In order to allow userspace to retrieve an accurate time reference
atomically, without being affected by scheduling delays between
KVM_GET_CLOCK and KVM_GET_MSRS, the KVM_GET_DEVICE_ATTR implementation
for this attribute uses get_kvmclock() internally and returns a struct
kvm_vcpu_tsc_value with both values in one go. If get_kvmclock()
reports the KVM_CLOCK_HOST_TSC flag, the two values will be based on
one and the same host TSC reading.

Co-developed-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Simon Veith <sveith@amazon.de>
---
V2:
 - Squashed into a single change
 - Added justification for introducing a new interface
 - Added missing Signed-off-by
 - Re-worded comment

 Documentation/virt/kvm/devices/vcpu.rst | 22 +++++++++
 arch/x86/include/uapi/asm/kvm.h         |  7 +++
 arch/x86/kvm/x86.c                      | 63 +++++++++++++++++++++++--
 tools/arch/x86/include/uapi/asm/kvm.h   |  7 +++
 4 files changed, 94 insertions(+), 5 deletions(-)

diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
index 31f14ec4a65b..240a3646947c 100644
--- a/Documentation/virt/kvm/devices/vcpu.rst
+++ b/Documentation/virt/kvm/devices/vcpu.rst
@@ -265,3 +265,25 @@ From the destination VMM process:
 
 7. Write the KVM_VCPU_TSC_OFFSET attribute for every vCPU with the
    respective value derived in the previous step.
+
+4.2 ATTRIBUTE: KVM_VCPU_TSC_VALUE
+
+:Parameters: kvm_device_attr.addr points to a struct kvm_vcpu_tsc_value
+
+Returns:
+
+	 ======= ======================================
+	 -EFAULT Error reading/writing the provided
+		 parameter address.
+	 -ENXIO  Attribute not supported
+	 ======= ======================================
+
+Gets or sets a matched pair of guest TSC value and KVM clock time point.
+
+When setting the TSC value through this attribute, a corresponding KVM clock
+reference time point (as retrieved by KVM_GET_CLOCK in the clock field) must be
+provided.
+
+The actual TSC value written will be adjusted based on the time that has
+elapsed since the provided reference time point, taking TSC scaling into
+account.
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index e48deab8901d..f99bdb959b54 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -528,5 +528,12 @@ struct kvm_pmu_event_filter {
 /* for KVM_{GET,SET,HAS}_DEVICE_ATTR */
 #define KVM_VCPU_TSC_CTRL 0 /* control group for the timestamp counter (TSC) */
 #define   KVM_VCPU_TSC_OFFSET 0 /* attribute for the TSC offset */
+#define   KVM_VCPU_TSC_VALUE 1 /* attribute for the TSC value */
+
+/* for KVM_VCPU_TSC_VALUE */
+struct kvm_vcpu_tsc_value {
+	__u64 tsc_val;
+	__u64 kvm_ns;
+};
 
 #endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index da4bbd043a7b..b174200c909b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2655,7 +2655,7 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
 	kvm_track_tsc_matching(vcpu);
 }
 
-static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
+static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data, u64 *kvm_ns)
 {
 	struct kvm *kvm = vcpu->kvm;
 	u64 offset, ns, elapsed;
@@ -2664,12 +2664,24 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
 	bool synchronizing = false;
 
 	raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
-	offset = kvm_compute_l1_tsc_offset(vcpu, data);
 	ns = get_kvmclock_base_ns();
+
+	if (kvm_ns) {
+		/*
+		 * kvm_ns is the KVM clock reference time point at which this
+		 * TSC value was correct. Use this time point to compensate for
+		 * any delays that have been incurred since that TSC value was
+		 * valid.
+		 */
+		s64 delta_ns = ns + vcpu->kvm->arch.kvmclock_offset - *kvm_ns;
+		data += nsec_to_cycles(vcpu, (u64)delta_ns);
+	}
+
+	offset = kvm_compute_l1_tsc_offset(vcpu, data);
 	elapsed = ns - kvm->arch.last_tsc_nsec;
 
 	if (vcpu->arch.virtual_tsc_khz) {
-		if (data == 0) {
+		if (data == 0 && !kvm_ns) {
 			/*
 			 * detection of vcpu initialization -- need to sync
 			 * with other vCPUs. This particularly helps to keep
@@ -3672,7 +3684,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		break;
 	case MSR_IA32_TSC:
 		if (msr_info->host_initiated) {
-			kvm_synchronize_tsc(vcpu, data);
+			kvm_synchronize_tsc(vcpu, data, NULL);
 		} else {
 			u64 adj = kvm_compute_l1_tsc_offset(vcpu, data) - vcpu->arch.l1_tsc_offset;
 			adjust_tsc_offset_guest(vcpu, adj);
@@ -5375,6 +5387,7 @@ static int kvm_arch_tsc_has_attr(struct kvm_vcpu *vcpu,
 
 	switch (attr->attr) {
 	case KVM_VCPU_TSC_OFFSET:
+	case KVM_VCPU_TSC_VALUE:
 		r = 0;
 		break;
 	default:
@@ -5400,6 +5413,32 @@ static int kvm_arch_tsc_get_attr(struct kvm_vcpu *vcpu,
 			break;
 		r = 0;
 		break;
+	case KVM_VCPU_TSC_VALUE: {
+		struct kvm_vcpu_tsc_value __user *tsc_value_arg;
+		struct kvm_vcpu_tsc_value tsc_value;
+		struct kvm_clock_data kvm_clock;
+		u64 host_tsc, guest_tsc, ratio, offset;
+
+		get_kvmclock(vcpu->kvm, &kvm_clock);
+		if (kvm_clock.flags & KVM_CLOCK_HOST_TSC)
+			host_tsc = kvm_clock.host_tsc;
+		else
+			host_tsc = rdtsc();
+
+		ratio = vcpu->arch.l1_tsc_scaling_ratio;
+		offset = vcpu->arch.l1_tsc_offset;
+		guest_tsc = kvm_scale_tsc(host_tsc, ratio) + offset;
+
+		tsc_value.kvm_ns = kvm_clock.clock;
+		tsc_value.tsc_val = guest_tsc;
+
+		tsc_value_arg = (struct kvm_vcpu_tsc_value __user *)uaddr;
+		r = -EFAULT;
+		if (copy_to_user(tsc_value_arg, &tsc_value, sizeof(tsc_value)))
+			break;
+		r = 0;
+		break;
+	}
 	default:
 		r = -ENXIO;
 	}
@@ -5442,6 +5481,20 @@ static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu,
 		r = 0;
 		break;
 	}
+	case KVM_VCPU_TSC_VALUE: {
+		struct kvm_vcpu_tsc_value __user *tsc_value_arg;
+		struct kvm_vcpu_tsc_value tsc_value;
+
+		tsc_value_arg = (struct kvm_vcpu_tsc_value __user *)uaddr;
+		r = -EFAULT;
+		if (copy_from_user(&tsc_value, tsc_value_arg, sizeof(tsc_value)))
+			break;
+
+		kvm_synchronize_tsc(vcpu, tsc_value.tsc_val, &tsc_value.kvm_ns);
+
+		r = 0;
+		break;
+	}
 	default:
 		r = -ENXIO;
 	}
@@ -11668,7 +11721,7 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
 	if (mutex_lock_killable(&vcpu->mutex))
 		return;
 	vcpu_load(vcpu);
-	kvm_synchronize_tsc(vcpu, 0);
+	kvm_synchronize_tsc(vcpu, 0, NULL);
 	vcpu_put(vcpu);
 
 	/* poll control enabled by default */
diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
index e48deab8901d..f99bdb959b54 100644
--- a/tools/arch/x86/include/uapi/asm/kvm.h
+++ b/tools/arch/x86/include/uapi/asm/kvm.h
@@ -528,5 +528,12 @@ struct kvm_pmu_event_filter {
 /* for KVM_{GET,SET,HAS}_DEVICE_ATTR */
 #define KVM_VCPU_TSC_CTRL 0 /* control group for the timestamp counter (TSC) */
 #define   KVM_VCPU_TSC_OFFSET 0 /* attribute for the TSC offset */
+#define   KVM_VCPU_TSC_VALUE 1 /* attribute for the TSC value */
+
+/* for KVM_VCPU_TSC_VALUE */
+struct kvm_vcpu_tsc_value {
+	__u64 tsc_val;
+	__u64 kvm_ns;
+};
 
 #endif /* _ASM_X86_KVM_H */

base-commit: 9f266ccaa2f5228bfe67ad58a94ca4e0109b954a
-- 
2.34.1




Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879





* Re: [PATCH v2] KVM: x86: add KVM_VCPU_TSC_VALUE attribute
  2023-02-02 16:59                 ` [PATCH v2] KVM: x86: add KVM_VCPU_TSC_VALUE attribute Simon Veith
@ 2023-03-15 19:57                   ` Sean Christopherson
  2023-03-23 23:26                     ` David Woodhouse
  2023-03-24 11:22                     ` Paolo Bonzini
  0 siblings, 2 replies; 22+ messages in thread
From: Sean Christopherson @ 2023-03-15 19:57 UTC (permalink / raw)
  To: Simon Veith
  Cc: dwmw2, dff, jmattson, joro, kvm, oupton, pbonzini, tglx,
	vkuznets, wanpengli

Please don't send vN+1 In-Reply-To vN; the threading messes up lore and other
tooling, and becomes really problematic for humans when N gets large.

On Thu, Feb 02, 2023, Simon Veith wrote:
> In the case of live migration, using the KVM_VCPU_TSC_OFFSET approach to
> preserve the TSC value and apply a known offset would require
> duplicating the TSC scaling computations in userspace to account for
> frequency differences between source and destination TSCs.
> 
> Hence, if userspace wants to set the TSC to some known value without
> having to deal with TSC scaling, and while also being resilient against
> scheduling delays, neither KVM_SET_MSRS nor KVM_VCPU_TSC_OFFSET is a
> suitable option.

Requiring userspace to handle certain aspects of TSC scaling doesn't seem
particularly onerous, at least not relative to all the other time insanity.  In
other words, why should KVM take on more complexity and a mostly-redundant uAPI?


* Re: [PATCH v2] KVM: x86: add KVM_VCPU_TSC_VALUE attribute
  2023-03-15 19:57                   ` Sean Christopherson
@ 2023-03-23 23:26                     ` David Woodhouse
  2023-03-24 11:22                     ` Paolo Bonzini
  1 sibling, 0 replies; 22+ messages in thread
From: David Woodhouse @ 2023-03-23 23:26 UTC (permalink / raw)
  To: Sean Christopherson, Simon Veith
  Cc: dff, jmattson, joro, kvm, oupton, pbonzini, tglx, vkuznets, wanpengli

On Wed, 2023-03-15 at 12:57 -0700, Sean Christopherson wrote:
> 
> On Thu, Feb 02, 2023, Simon Veith wrote:
> > In the case of live migration, using the KVM_VCPU_TSC_OFFSET approach to
> > preserve the TSC value and apply a known offset would require
> > duplicating the TSC scaling computations in userspace to account for
> > frequency differences between source and destination TSCs.
> > 
> > Hence, if userspace wants to set the TSC to some known value without
> > having to deal with TSC scaling, and while also being resilient against
> > scheduling delays, neither KVM_SET_MSRS nor KVM_VCPU_TSC_OFFSET is a
> > suitable option.
> 
> Requiring userspace to handle certain aspects of TSC scaling doesn't seem
> particularly onerous, at least not relative to all the other time insanity.  In
> other words, why should KVM take on more complexity and a mostly-redundant uAPI?

Because it offers userspace a simple and easy-to-use API. You get the
time information, you put the time information.

Now, if you want to *document* the actual calculations that userspace
would be forced to do just to preserve the existing relationship
between the guest TSC and the time of day / host TSC (for LM / LU
respectively).... well, you can try, but it exceeded my "if it needs
documenting this much, fix it first" threshold.

Does userspace even *have* what it needs to do this *right*? 



* Re: [PATCH v2] KVM: x86: add KVM_VCPU_TSC_VALUE attribute
  2023-03-15 19:57                   ` Sean Christopherson
  2023-03-23 23:26                     ` David Woodhouse
@ 2023-03-24 11:22                     ` Paolo Bonzini
  2023-03-24 13:08                       ` David Woodhouse
  2023-09-13 14:08                       ` David Woodhouse
  1 sibling, 2 replies; 22+ messages in thread
From: Paolo Bonzini @ 2023-03-24 11:22 UTC (permalink / raw)
  To: Sean Christopherson, Simon Veith
  Cc: dwmw2, dff, jmattson, joro, kvm, oupton, tglx, vkuznets, wanpengli

On 3/15/23 20:57, Sean Christopherson wrote:
>> In the case of live migration, using the KVM_VCPU_TSC_OFFSET approach to
>> preserve the TSC value and apply a known offset would require
>> duplicating the TSC scaling computations in userspace to account for
>> frequency differences between source and destination TSCs.
>>
>> Hence, if userspace wants to set the TSC to some known value without
>> having to deal with TSC scaling, and while also being resilient against
>> scheduling delays, neither KVM_SET_MSRS nor KVM_VCPU_TSC_OFFSET is a
>> suitable option.
>
> Requiring userspace to handle certain aspects of TSC scaling doesn't seem
> particularly onerous, at least not relative to all the other time insanity.  In
> other words, why should KVM take on more complexity and a mostly-redundant uAPI?

Yeah, it seems like the problem is that KVM_GET_CLOCK returns unscaled 
host TSC units (which was done because the guest TSC frequency is at 
least in theory per-CPU, and KVM_GET_CLOCK is a vm ioctl)?

Perhaps it's more important (uAPI-wise) for KVM to return the precise 
guest/host TSC ratio via a vcpu device attribute?

Paolo



* Re: [PATCH v2] KVM: x86: add KVM_VCPU_TSC_VALUE attribute
  2023-03-24 11:22                     ` Paolo Bonzini
@ 2023-03-24 13:08                       ` David Woodhouse
  2023-09-13 14:08                       ` David Woodhouse
  1 sibling, 0 replies; 22+ messages in thread
From: David Woodhouse @ 2023-03-24 13:08 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Simon Veith
  Cc: dff, jmattson, joro, kvm, oupton, tglx, vkuznets, wanpengli

On Fri, 2023-03-24 at 12:22 +0100, Paolo Bonzini wrote:
> On 3/15/23 20:57, Sean Christopherson wrote:
> > > In the case of live migration, using the KVM_VCPU_TSC_OFFSET approach to
> > > preserve the TSC value and apply a known offset would require
> > > duplicating the TSC scaling computations in userspace to account for
> > > frequency differences between source and destination TSCs.
> > > 
> > > Hence, if userspace wants to set the TSC to some known value without
> > > having to deal with TSC scaling, and while also being resilient against
> > > scheduling delays, neither KVM_SET_MSRS nor KVM_VCPU_TSC_OFFSET is a
> > > suitable option.
> > 
> > Requiring userspace to handle certain aspects of TSC scaling doesn't seem
> > particularly onerous, at least not relative to all the other time insanity.  In
> > other words, why should KVM take on more complexity and a mostly-redundant uAPI?
> 
> Yeah, it seems like the problem is that KVM_GET_CLOCK return host 
> unscaled TSC units (which was done because the guest TSC frequency is at 
> least in theory per-CPU, and KVM_GET_CLOCK is a vm ioctl)?
> 
> Perhaps it's more important (uAPI-wise) for KVM to return the precise
> guest/host TSC ratio via a vcpu device attribute?
> 

My criteria for this are that in the case of a live update (serialize
guest, kexec, resume guest on precisely the same hardware a few
milliseconds of steal time later), the guest clocks (KVM_CLOCK, TSC,
etc) shall be *precisely* the same as before with no slop. The same
offset, the same scaling. And ideally precisely the same values
advertised in the pvclock data to the guest.

In the case of live migration, we can't be cycle-accurate because of
course we're limited to the accuracy of the NTP sync between hosts. But
that is the *only* inaccuracy we shall incur, because we can express
clocks in terms of each other, e.g. <KVM clock was X at time of day Y>
and then e.g. <TSC was W at KVM clock X> is unchanged from before.

It's OK to expect userspace to do some calculation, as long as we never
expect userspace to magically perform calculations based on some
concept of "now", and to call kernel APIs, without any actual time
elapsing while it does so.

I said it's OK to expect userspace to do *some* calculation. But that
should be clearly documented, *and* when we document it, that
documentation shouldn't codify too much of the kernel's internal
relationships between clocks, and shouldn't make us ashamed to be
kernel engineers. 

We tried documenting it, in
https://lore.kernel.org/all/20220316045308.2313184-1-oupton@google.com/

I don't quite know how to summarise that thread, other than "it's too
broken; let's fix it first and *then* document it".

But if it can be done, I'm happy for someone to fix the documentation
in a way which describes how to meet the above criteria using the
existing kernel APIs. And then perhaps we can make a decision about
just how ashamed of ourselves we should be, and whether we want to
provide a better, easier API for userspace to use.






* Re: [PATCH v2] KVM: x86: add KVM_VCPU_TSC_VALUE attribute
  2023-03-24 11:22                     ` Paolo Bonzini
  2023-03-24 13:08                       ` David Woodhouse
@ 2023-09-13 14:08                       ` David Woodhouse
  1 sibling, 0 replies; 22+ messages in thread
From: David Woodhouse @ 2023-09-13 14:08 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Simon Veith
  Cc: dff, jmattson, joro, kvm, oupton, tglx, vkuznets, wanpengli

On Fri, 2023-03-24 at 12:22 +0100, Paolo Bonzini wrote:
> On 3/15/23 20:57, Sean Christopherson wrote:
> > > In the case of live migration, using the KVM_VCPU_TSC_OFFSET approach to
> > > preserve the TSC value and apply a known offset would require
> > > duplicating the TSC scaling computations in userspace to account for
> > > frequency differences between source and destination TSCs.
> > > 
> > > Hence, if userspace wants to set the TSC to some known value without
> > > having to deal with TSC scaling, and while also being resilient against
> > > scheduling delays, neither KVM_SET_MSRS nor KVM_VCPU_TSC_OFFSET is a
> > > suitable option.
> > 
> > Requiring userspace to handle certain aspects of TSC scaling doesn't seem
> > particularly onerous, at least not relative to all the other time insanity.  In
> > other words, why should KVM take on more complexity and a mostly-redundant uAPI?
> 
> Yeah, it seems like the problem is that KVM_GET_CLOCK return host 
> unscaled TSC units (which was done because the guest TSC frequency is at 
> least in theory per-CPU, and KVM_GET_CLOCK is a vm ioctl)?
> 
> Perhaps it's more important (uAPI-wise) for KVM to return the precise
> guest/host TSC ratio via a vcpu device attribute?

Perhaps. Although that really does suck as a userspace experience. See
the patch I just sent.



end of thread, other threads:[~2023-09-13 14:11 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-22 19:18 [PATCH] Documentation: KVM: Describe guest TSC scaling in migration algorithm Franke, Daniel
2022-03-22 21:53 ` Oliver Upton
2022-03-23 12:35   ` David Woodhouse
2022-03-23 16:21     ` Oliver Upton
2022-03-25  9:03       ` David Woodhouse
2022-03-25 17:47         ` Oliver Upton
2022-03-29 14:19   ` Thomas Gleixner
2022-03-29 16:02     ` Oliver Upton
2022-03-29 19:34       ` Thomas Gleixner
2022-06-30 11:58       ` David Woodhouse
2022-07-05 14:43         ` David Woodhouse
2022-07-07 16:43           ` [PATCH 1/2] KVM: x86: add KVM clock time reference arg to kvm_write_tsc() Simon Veith
2022-07-07 16:43             ` [PATCH 2/2] KVM: x86: add KVM_VCPU_TSC_VALUE attribute Simon Veith
2022-07-29 21:21               ` Sean Christopherson
2022-07-29 21:14             ` [PATCH 1/2] KVM: x86: add KVM clock time reference arg to kvm_write_tsc() Sean Christopherson
2023-02-02 16:35               ` David Woodhouse
2023-02-02 16:59                 ` [PATCH v2] KVM: x86: add KVM_VCPU_TSC_VALUE attribute Simon Veith
2023-03-15 19:57                   ` Sean Christopherson
2023-03-23 23:26                     ` David Woodhouse
2023-03-24 11:22                     ` Paolo Bonzini
2023-03-24 13:08                       ` David Woodhouse
2023-09-13 14:08                       ` David Woodhouse
