* recalibrating x86 TSC during suspend/resume
From: Olaf Hering @ 2019-02-22 10:53 UTC
  To: John Stultz, Thomas Gleixner, Stephen Boyd; +Cc: linux-kernel

Is there a way to recalibrate the x86 TSC during a suspend/resume cycle?

While the frequency will remain the same on a laptop, it may (or rather:
it definitely will) differ if a VM is migrated from one host to another.
The hypervisor may choose to emulate the expected TSC frequency on the
destination host, but this emulation comes with a significant
performance cost. Therefore it would be good if the kernel re-evaluated
the environment during resume.

The specific use case I have is a workload within VMs that makes heavy
use of TSC. The kernel is booted with 'clocksource=tsc highres=off nohz=off'
because only this clocksource gives enough granularity. The default
paravirtualized clock will return the same values via
clock_gettime(CLOCK_MONOTONIC) if the timespan between two calls is too
short. This does not happen with 'clocksource=tsc'.
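
A minimal sketch of how the behavior described above could be checked
from userspace. It is not the code from the sysbench fork mentioned
later in the thread; the helper names timespec_diff_ns() and
count_non_advancing() are made up for illustration. It reads the clock
twice back to back and counts the pairs where the second reading did not
advance.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Signed difference b - a in nanoseconds. */
int64_t timespec_diff_ns(struct timespec a, struct timespec b)
{
	return (int64_t)(b.tv_sec - a.tv_sec) * 1000000000LL
	       + (int64_t)(b.tv_nsec - a.tv_nsec);
}

/*
 * Read the clock twice in a row, 'iterations' times, and count how often
 * the second reading is not later than the first one.
 * Returns -1 if clock_gettime() fails.
 */
long count_non_advancing(clockid_t clk, long iterations)
{
	struct timespec a, b;
	long hits = 0;

	assert(iterations >= 0);
	for (long i = 0; i < iterations; i++) {
		if (clock_gettime(clk, &a) != 0 || clock_gettime(clk, &b) != 0)
			return -1;
		if (timespec_diff_ns(a, b) <= 0)
			hits++;
	}
	return hits;
}
```

Calling count_non_advancing(CLOCK_MONOTONIC, 1000000) once per
clocksource would give a rough comparison. A non-zero count only says
that two readings were identical or went backwards; it does not say
which layer is responsible.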

Right now it is not possible to migrate VMs to hosts with different CPU
speeds. This leads to "islands" of identical hardware, and makes
maintenance of hosts harder than it needs to be. If the VM kernel were
able to cope with CPU/TSC frequency changes, the pool of potential
destination hosts would become significantly larger.

The current result of a migration with non-emulated TSC between hosts of
different speed is:

[   42.452258] clocksource: timekeeping watchdog on CPU1: Marking clocksource 'tsc' as unstable because the skew is too large:
[   42.452270] clocksource:                       'xen' wd_now: 6d34a86adb wd_last: 6d1dc51793 mask: ffffffffffffffff
[   42.452272] clocksource:                       'tsc' cs_now: 1fd2ce46bb cs_last: 1f95c4ca75 mask: ffffffffffffffff
[   42.452273] tsc: Marking TSC unstable due to clocksource watchdog

Thanks,
Olaf


* Re: recalibrating x86 TSC during suspend/resume
From: Thomas Gleixner @ 2019-02-22 11:44 UTC
  To: Olaf Hering; +Cc: John Stultz, Stephen Boyd, LKML, x86, Paolo Bonzini

On Fri, 22 Feb 2019, Olaf Hering wrote:
> Is there a way to recalibrate the x86 TSC during a suspend/resume cycle?

No.

> While the frequency will remain the same on a laptop, it may (or rather:
> it definitely will) differ if a VM is migrated from one host to another.
> The hypervisor may choose to emulate the expected TSC frequency on the
> destination host, but this emulation comes with a significant
> performance cost. Therefore it would be good if the kernel evaluates the
> environment during resume.
> 
> The specific use case I have is a workload within VMs that makes heavy
> use of TSC. The kernel is booted with 'clocksource=tsc highres=off nohz=off'
> because only this clocksource gives enough granularity. The default
> paravirtualized clock will return the same values via
> clock_gettime(CLOCK_MONOTONIC) if the timespan between two calls is too
> short. This does not happen with 'clocksource=tsc'.
> 
> Right now it is not possible to migrate VMs to hosts with different CPU
> speeds. This leads to "islands" of identical hardware, and makes
> maintenance of hosts harder than it needs to be. If the VM kernel would
> be able to cope with CPU/TSC frequency changes, the pool of potential
> destination hosts will become significantly larger.

The problem with recalibrating TSC on resume is that it would have to be

    1) quick

    2) accurate, so NTP does not get utterly unhappy.

Newer Intels support TSC scaling for VMX, which could solve the problem. It
affects TSC readout by:

	TSC = (read(HWTSC) * multiplier) >> 48

So you can standardize on a TSC frequency across a fleet. Not sure when
that was introduced and no idea whether it's available on AMD.
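
To make the formula concrete, here is a small sketch in plain C, not
hypervisor code. The helper names tsc_scale_multiplier() and tsc_scale()
are hypothetical, and it assumes the multiplier is the guest/host
frequency ratio held as a fixed-point value with 48 fractional bits. It
relies on the GCC/Clang unsigned __int128 extension for the wide
product.

```c
#include <assert.h>
#include <stdint.h>

/* Fractional bits of the multiplier, taken from the formula above. */
#define TSC_SCALE_FRAC_BITS 48

/*
 * Ratio guest_hz / host_hz as a fixed-point number with 48 fractional
 * bits. Assumption: the ratio is below 2^16, so the result fits 64 bits.
 */
uint64_t tsc_scale_multiplier(uint64_t guest_hz, uint64_t host_hz)
{
	assert(host_hz != 0);
	return (uint64_t)(((unsigned __int128)guest_hz << TSC_SCALE_FRAC_BITS)
			  / host_hz);
}

/* TSC = (HWTSC * multiplier) >> 48, computed with a 128-bit product. */
uint64_t tsc_scale(uint64_t hw_tsc, uint64_t multiplier)
{
	return (uint64_t)(((unsigned __int128)hw_tsc * multiplier)
			  >> TSC_SCALE_FRAC_BITS);
}
```

With a 3.0 GHz host and a fleet-wide guest frequency of 1.5 GHz the
multiplier is exactly 2^47, so 3000000000 hardware ticks read back as
1500000000 guest ticks. Ratios that are not powers of two get truncated
in the last bit.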

For a software solution we could try the following:

 1) Provide the raw TSC frequency of the host to the guest in some magic
    software defined MSR or CPUID. If there is an existing mechanism, use
    that.

 2) On resume check whether the MSR/CPUID is available and if so read out
    that information and check whether the frequency is the same as
    before. If not it is trivial enough to adjust the guest mult/shift
    values for both raw and NTP adjusted clocks before they are used again,
    i.e. before timekeeping_resume(). Need to look what's the best place,
    but probably the clocksource resume callback. Plus if TSC deadline
    timer is used, we'd need the same adjustment there.

    That's backward compatible, because if the MSR/CPUID is not there, then
    the recalibration is not tried.

Whether that is accurate enough or not to make NTP happy, I can't tell, but
it's definitely worth a try.
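
A userspace sketch of step 2, under loud assumptions: the host frequency
is already known in Hz (how it would be read from an MSR/CPUID is left
open, as in the text), and struct cyc2ns plus the cyc2ns_*() helpers are
hypothetical names, not kernel API. It only illustrates deriving new
mult/shift values so that ns = (cycles * mult) >> shift still holds
after the frequency changed.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical container for the cycles-to-nanoseconds factors. */
struct cyc2ns {
	uint32_t mult;
	uint32_t shift;
};

/* Choose mult, rounded to nearest, for a counter running at freq_hz. */
struct cyc2ns cyc2ns_from_hz(uint64_t freq_hz, uint32_t shift)
{
	struct cyc2ns c;
	uint64_t tmp;

	assert(freq_hz != 0 && shift < 32);
	tmp = ((NSEC_PER_SEC << shift) + freq_hz / 2) / freq_hz;
	assert(tmp <= UINT32_MAX);	/* too low a frequency for this shift */
	c.mult = (uint32_t)tmp;
	c.shift = shift;
	return c;
}

/* ns = (cycles * mult) >> shift, using a 128-bit intermediate product. */
uint64_t cyc2ns_convert(uint64_t cycles, struct cyc2ns c)
{
	return (uint64_t)(((unsigned __int128)cycles * c.mult) >> c.shift);
}

/*
 * Resume-time check: if the frequency differs from the one seen before
 * suspend, recompute the factors. Returns 1 if they changed, else 0.
 */
int cyc2ns_resume(struct cyc2ns *c, uint64_t old_hz, uint64_t new_hz)
{
	if (old_hz == new_hz)
		return 0;
	*c = cyc2ns_from_hz(new_hz, c->shift);
	return 1;
}
```

The rounding error of mult is what decides whether NTP stays happy, so
a real implementation would pick the largest shift that does not
overflow for the supported time range.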

Thanks,

	tglx



* Re: recalibrating x86 TSC during suspend/resume
From: Olaf Hering @ 2019-02-22 11:51 UTC
  To: Thomas Gleixner; +Cc: John Stultz, Stephen Boyd, LKML, x86, Paolo Bonzini

On Fri, 22 Feb 2019 12:44:39 +0100 (CET),
Thomas Gleixner <tglx@linutronix.de> wrote:

> Whether that is accurate enough or not to make NTP happy, I can't tell, but
> it's definitely worth a try.

Thanks Thomas, I will look into the suggestions.


Olaf


* Re: recalibrating x86 TSC during suspend/resume
From: Paolo Bonzini @ 2019-02-22 12:31 UTC
  To: Thomas Gleixner, Olaf Hering; +Cc: John Stultz, Stephen Boyd, LKML, x86

On 22/02/19 12:44, Thomas Gleixner wrote:
>> The specific use case I have is a workload within VMs that makes heavy
>> use of TSC. The kernel is booted with 'clocksource=tsc highres=off nohz=off'
>> because only this clocksource gives enough granularity. The default
>> paravirtualized clock will return the same values via
>> clock_gettime(CLOCK_MONOTONIC) if the timespan between two calls is too
>> short. This does not happen with 'clocksource=tsc'.

This shouldn't happen.  clock_gettime(CLOCK_MONOTONIC) should be
monotonically increasing.  Do you have a testcase?

The KVM clocksource is high-resolution and also TSC-based; the
difference is that it performs two multiplications instead of one.  The
first uses TSC parameters from the host.  The second, which is the one
in arch/x86/entry/vdso/vclock_gettime.c's do_hres function, will have a
1:1 multiplier (excluding adjtime shearing) because kvmclock already
returns nanoseconds.
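
The two stages can be sketched in userspace C as follows. The first
function follows the scaling used by the pvclock interface (shift the
TSC delta, then multiply by a 32-bit fraction); the second stands in for
the generic (value * mult) >> shift step. The function names are made up
for this sketch, and the parameter values in the usage note are
examples, not values read from a real host.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Stage 1: TSC delta to nanoseconds with host-provided parameters.
 * mul_frac is a 32-bit binary fraction; shift may be negative.
 */
uint64_t pvclock_scale_delta_sketch(uint64_t delta, uint32_t mul_frac,
				    int8_t shift)
{
	assert(shift > -64 && shift < 64);
	if (shift < 0)
		delta >>= -shift;
	else
		delta <<= shift;
	return (uint64_t)(((unsigned __int128)delta * mul_frac) >> 32);
}

/*
 * Stage 2: generic clocksource scaling. For a clock that already counts
 * nanoseconds, mult == 1 << shift, i.e. a 1:1 multiplier.
 */
uint64_t vdso_scale_sketch(uint64_t value, uint32_t mult, uint32_t shift)
{
	assert(shift < 32);
	return (uint64_t)(((unsigned __int128)value * mult) >> shift);
}
```

For a 2 GHz TSC, mul_frac = 0x80000000 (0.5) with shift = 0 turns 2000
cycles into 1000 ns, and stage 2 with mult = 1 << 8, shift = 8 returns
those 1000 ns unchanged.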

> Newer Intels support TSC scaling for VMX, which could solve the problem. It
> affects TSC readout by:
> 
> 	TSC = (read(HWTSC) * multiplier) >> 48
> 
> So you can standardize on a TSC frequency across a fleet. Not sure when
> that was introduced and no idea whether it's available on AMD.

It's Skylake (server parts only) or newer.  AMD instead has had it
(almost) forever.  QEMU 2.6 or newer will use it automatically across
live migration, if available.

> For a software solution we could try the following:
> 
>  1) Provide the raw TSC frequency of the host to the guest in some magic
>     software defined MSR or CPUID. If there is an existing mechanism, use
>     that.

This shouldn't be needed for two reasons:

1) you could also use kvmclock's provided mult/shift

2) I am not convinced that kvmclock has the behavior that Olaf mentions,
and if it does it would be a bug.

Paolo


* Re: recalibrating x86 TSC during suspend/resume
From: Olaf Hering @ 2019-02-22 14:28 UTC
  To: Paolo Bonzini; +Cc: Thomas Gleixner, John Stultz, Stephen Boyd, LKML, x86

On Fri, Feb 22, Paolo Bonzini wrote:

> On 22/02/19 12:44, Thomas Gleixner wrote:
> >> The specific use case I have is a workload within VMs that makes heavy
> >> use of TSC. The kernel is booted with 'clocksource=tsc highres=off nohz=off'
> >> because only this clocksource gives enough granularity. The default
> >> paravirtualized clock will return the same values via
> >> clock_gettime(CLOCK_MONOTONIC) if the timespan between two calls is too
> >> short. This does not happen with 'clocksource=tsc'.
> 
> This shouldn't happen.  clock_gettime(CLOCK_MONOTONIC) should be
> monotonically increasing.  Do you have a testcase?

Two years ago I tweaked sysbench to track the execution time of the
'memory' test:

https://github.com/olafhering/sysbench
https://github.com/olafhering/sysbench/blame/pv/src/tests/memory/sb_memory.c

The checks in diff_timespec() triggered with clocksource=xen, but I
cannot reproduce it right now with 5.0 and 4.4 based kernels. I have no
data on how KVM behaves. In the end the hypervisor was tweaked to tolerate
a certain jitter in expected TSC speed before emulation kicks in. Up to
~1 MHz would be ok to stay within the 500 PPM limit that ntpd can handle.
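
The arithmetic behind that tolerance, as a small sketch. The 2 GHz base
frequency in the usage note is an assumed example (the thread does not
state one), and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Deviation of actual_hz from expected_hz in parts per million,
 * truncated toward zero.
 */
uint64_t freq_skew_ppm(uint64_t expected_hz, uint64_t actual_hz)
{
	uint64_t diff;

	assert(expected_hz != 0);
	diff = actual_hz > expected_hz ? actual_hz - expected_hz
				       : expected_hz - actual_hz;
	return (uint64_t)(((unsigned __int128)diff * 1000000u) / expected_hz);
}

/* Hypothetical policy check against the 500 PPM limit mentioned above. */
int skew_within_ntp_limit(uint64_t expected_hz, uint64_t actual_hz)
{
	return freq_skew_ppm(expected_hz, actual_hz) <= 500;
}
```

On an assumed 2 GHz guest, a destination host that is 1 MHz off is
exactly at 500 PPM; on a 3 GHz guest the same 1 MHz is about 333 PPM,
so the tolerable absolute jitter scales with the TSC frequency.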

But now there is that "island" issue that needs to be resolved in one
way or another.

Olaf
