linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] RFC: Precise TSC migration
@ 2020-11-30 13:35 Maxim Levitsky
  2020-11-30 13:35 ` [PATCH 1/2] KVM: x86: implement KVM_SET_TSC_PRECISE/KVM_GET_TSC_PRECISE Maxim Levitsky
                   ` (5 more replies)
  0 siblings, 6 replies; 30+ messages in thread
From: Maxim Levitsky @ 2020-11-30 13:35 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Oliver Upton, Ingo Molnar, Sean Christopherson,
	Thomas Gleixner, open list, Marcelo Tosatti, Jonathan Corbet,
	Wanpeng Li, Borislav Petkov, Jim Mattson, H. Peter Anvin,
	open list:DOCUMENTATION, Joerg Roedel,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Vitaly Kuznetsov, Maxim Levitsky

Hi!

This is the first version of the work to make TSC migration more accurate,
as was defined by Paulo at:
https://www.spinics.net/lists/kvm/msg225525.html

I have a few thoughts about the kvm masterclock synchronization,
which relate to the Paulo's proposal that I implemented.

The idea of masterclock is that when the host TSC is synchronized
(or as kernel call it, stable), and the guest TSC is synchronized as well,
then we can base the kvmclock, on the same pair of
(host time in nsec, host tsc value), for all vCPUs.

This makes the random error in calculation of this value invariant
across vCPUS, and allows the guest to do kvmclock calculation in userspace
(vDSO) since kvmclock parameters are vCPU invariant.

To ensure that the guest tsc is synchronized we currently track host/guest tsc
writes, and enable the master clock only when roughly the same guest's TSC value
was written across all vCPUs.

Recently this was disabled by Paulo and I agree with this, because I think
that we indeed should only make the guest TSC synchronized by default
(including new hotplugged vCPUs) and not do any tsc synchronization beyond that.
(Trying to guess when the guest syncs the TSC can cause more harm that good).

Besides, Linux guests don't sync the TSC via IA32_TSC write,
but rather use IA32_TSC_ADJUST which currently doesn't participate
in the tsc sync heruistics.
And as far as I know, Linux guest is the primary (only?) user of the kvmclock.

I *do think* however that we should redefine KVM_CLOCK_TSC_STABLE
in the documentation to state that it only guarantees invariance if the guest
doesn't mess with its own TSC.

Also I think we should consider enabling the X86_FEATURE_TSC_RELIABLE
in the guest kernel, when kvm is detected to avoid the guest even from trying
to sync TSC on newly hotplugged vCPUs.

(The guest doesn't end up touching TSC_ADJUST usually, but it still might
in some cases due to scheduling of guest vCPUs)

(X86_FEATURE_TSC_RELIABLE short circuits tsc synchronization on CPU hotplug,
and TSC clocksource watchdog, and the later we might want to keep).

For host TSC writes, just as Paulo proposed we can still do the tsc sync,
unless the new code that I implemented is in use.

Few more random notes:

I have a weird feeling about using 'nsec since 1 January 1970'.
Common sense is telling me that a 64 bit value can hold about 580 years,
but still I see that it is more common to use timespec which is a (sec,nsec) pair.

I feel that 'kvm_get_walltime' that I added is a bit of a hack.
Some refactoring might improve things here.

For example making kvm_get_walltime_and_clockread work in non tsc case as well
might make the code cleaner.

Patches to enable this feature in qemu are in process of being sent to
qemu-devel mailing list.

Best regards,
       Maxim Levitsky

Maxim Levitsky (2):
  KVM: x86: implement KVM_SET_TSC_PRECISE/KVM_GET_TSC_PRECISE
  KVM: x86: introduce KVM_X86_QUIRK_TSC_HOST_ACCESS

 Documentation/virt/kvm/api.rst  | 56 +++++++++++++++++++++
 arch/x86/include/uapi/asm/kvm.h |  1 +
 arch/x86/kvm/x86.c              | 88 +++++++++++++++++++++++++++++++--
 include/uapi/linux/kvm.h        | 14 ++++++
 4 files changed, 154 insertions(+), 5 deletions(-)

-- 
2.26.2



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 1/2] KVM: x86: implement KVM_SET_TSC_PRECISE/KVM_GET_TSC_PRECISE
  2020-11-30 13:35 [PATCH 0/2] RFC: Precise TSC migration Maxim Levitsky
@ 2020-11-30 13:35 ` Maxim Levitsky
  2020-11-30 14:33   ` Paolo Bonzini
  2020-12-01 19:43   ` Thomas Gleixner
  2020-11-30 13:35 ` [PATCH 2/2] KVM: x86: introduce KVM_X86_QUIRK_TSC_HOST_ACCESS Maxim Levitsky
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 30+ messages in thread
From: Maxim Levitsky @ 2020-11-30 13:35 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Oliver Upton, Ingo Molnar, Sean Christopherson,
	Thomas Gleixner, open list, Marcelo Tosatti, Jonathan Corbet,
	Wanpeng Li, Borislav Petkov, Jim Mattson, H. Peter Anvin,
	open list:DOCUMENTATION, Joerg Roedel,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Vitaly Kuznetsov, Maxim Levitsky

These two new ioctls allow to more precisly capture and
restore guest's TSC state.

Both ioctls are meant to be used to accurately migrate guest TSC
even when there is a significant downtime during the migration.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 Documentation/virt/kvm/api.rst | 56 +++++++++++++++++++++++++++
 arch/x86/kvm/x86.c             | 69 ++++++++++++++++++++++++++++++++++
 include/uapi/linux/kvm.h       | 14 +++++++
 3 files changed, 139 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 70254eaa5229f..2f04aa8ecf119 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -4826,6 +4826,62 @@ If a vCPU is in running state while this ioctl is invoked, the vCPU may
 experience inconsistent filtering behavior on MSR accesses.
 
 
+4.127 KVM_GET_TSC_STATE
+----------------------------
+
+:Capability: KVM_CAP_PRECISE_TSC
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_tsc_info
+:Returns: 0 on success, < 0 on error
+
+::
+
+  #define KVM_TSC_INFO_TSC_ADJUST_VALID 1
+  struct kvm_tsc_info {
+	__u32 flags;
+	__u64 nsec;
+	__u64 tsc;
+	__u64 tsc_adjust;
+  };
+
+flags values for ``struct kvm_tsc_info``:
+
+``KVM_TSC_INFO_TSC_ADJUST_VALID``
+
+  ``tsc_adjust`` contains valid IA32_TSC_ADJUST value
+
+This ioctl allows user space to read guest's IA32_TSC, IA32_TSC_ADJUST,
+and the current value of host CLOCK_REALTIME clock in nanoseconds since unix
+epoch.
+
+
+4.128 KVM_SET_TSC_STATE
+----------------------------
+
+:Capability: KVM_CAP_PRECISE_TSC
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_tsc_info
+:Returns: 0 on success, < 0 on error
+
+::
+
+This ioctl allows to reconstruct the guest's IA32_TSC and TSC_ADJUST value
+from the state obtained in the past by KVM_GET_TSC_STATE on the same vCPU.
+
+KVM will adjust the guest TSC value by the time that passed between
+CLOCK_REALTIME timestamp saved in the struct and current value of
+CLOCK_REALTIME, and set guest's TSC to the new value.
+
+TSC_ADJUST is restored as is if KVM_TSC_INFO_TSC_ADJUST_VALID is set.
+
+It is assumed that either both ioctls will be run on the same machine,
+or that source and destination machines have synchronized clocks.
+
+As a special case, it is allowed to leave the timestamp in the struct to zero,
+in which case it will be ignored and the TSC will be restored exactly.
+
 5. The kvm_run structure
 ========================
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a3fdc16cfd6f3..4f0ae9cb14b8a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2438,6 +2438,21 @@ static bool kvm_get_walltime_and_clockread(struct timespec64 *ts,
 
 	return gtod_is_based_on_tsc(do_realtime(ts, tsc_timestamp));
 }
+
+
+static void kvm_get_walltime(u64 *walltime_ns, u64 *host_tsc)
+{
+	struct timespec64 ts;
+
+	if (kvm_get_walltime_and_clockread(&ts, host_tsc)) {
+		*walltime_ns = timespec64_to_ns(&ts);
+		return;
+	}
+
+	*host_tsc = rdtsc();
+	*walltime_ns = ktime_get_real_ns();
+}
+
 #endif
 
 /*
@@ -3757,6 +3772,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_X86_USER_SPACE_MSR:
 	case KVM_CAP_X86_MSR_FILTER:
 	case KVM_CAP_ENFORCE_PV_FEATURE_CPUID:
+#ifdef CONFIG_X86_64
+	case KVM_CAP_PRECISE_TSC:
+#endif
 		r = 1;
 		break;
 	case KVM_CAP_SYNC_REGS:
@@ -4999,6 +5017,57 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
 	case KVM_GET_SUPPORTED_HV_CPUID:
 		r = kvm_ioctl_get_supported_hv_cpuid(vcpu, argp);
 		break;
+#ifdef CONFIG_X86_64
+	case KVM_GET_TSC_STATE: {
+		struct kvm_tsc_state __user *user_tsc_state = argp;
+		struct kvm_tsc_state tsc_state;
+		u64 host_tsc;
+
+		memset(&tsc_state, 0, sizeof(tsc_state));
+
+		kvm_get_walltime(&tsc_state.nsec, &host_tsc);
+		tsc_state.tsc = kvm_read_l1_tsc(vcpu, host_tsc);
+
+		if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST)) {
+			tsc_state.tsc_adjust = vcpu->arch.ia32_tsc_adjust_msr;
+			tsc_state.flags |= KVM_TSC_STATE_TSC_ADJUST_VALID;
+		}
+
+		r = -EFAULT;
+		if (copy_to_user(user_tsc_state, &tsc_state, sizeof(tsc_state)))
+			goto out;
+		r = 0;
+		break;
+	}
+	case KVM_SET_TSC_STATE: {
+		struct kvm_tsc_state __user *user_tsc_state = argp;
+		struct kvm_tsc_state tsc_state;
+
+		u64 host_tsc, wall_nsec;
+		s64 diff;
+		u64 new_guest_tsc, new_guest_tsc_offset;
+
+		r = -EFAULT;
+		if (copy_from_user(&tsc_state, user_tsc_state, sizeof(tsc_state)))
+			goto out;
+
+		kvm_get_walltime(&wall_nsec, &host_tsc);
+		diff = wall_nsec - tsc_state.nsec;
+
+		if (diff < 0 || tsc_state.nsec == 0)
+			diff = 0;
+
+		new_guest_tsc = tsc_state.tsc + nsec_to_cycles(vcpu, diff);
+		new_guest_tsc_offset = new_guest_tsc - kvm_scale_tsc(vcpu, host_tsc);
+		kvm_vcpu_write_tsc_offset(vcpu, new_guest_tsc_offset);
+
+		if (tsc_state.flags & KVM_TSC_STATE_TSC_ADJUST_VALID)
+			if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST))
+				vcpu->arch.ia32_tsc_adjust_msr = tsc_state.tsc_adjust;
+		r = 0;
+		break;
+	}
+#endif
 	default:
 		r = -EINVAL;
 	}
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 886802b8ffba3..ee1bd5e7da964 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1056,6 +1056,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_ENFORCE_PV_FEATURE_CPUID 190
 #define KVM_CAP_SYS_HYPERV_CPUID 191
 #define KVM_CAP_DIRTY_LOG_RING 192
+#define KVM_CAP_PRECISE_TSC 193
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1169,6 +1170,15 @@ struct kvm_clock_data {
 	__u32 pad[9];
 };
 
+
+#define KVM_TSC_STATE_TSC_ADJUST_VALID 1
+struct kvm_tsc_state {
+	__u32 flags;
+	__u64 nsec;
+	__u64 tsc;
+	__u64 tsc_adjust;
+};
+
 /* For KVM_CAP_SW_TLB */
 
 #define KVM_MMU_FSL_BOOKE_NOHV		0
@@ -1563,6 +1573,10 @@ struct kvm_pv_cmd {
 /* Available with KVM_CAP_DIRTY_LOG_RING */
 #define KVM_RESET_DIRTY_RINGS		_IO(KVMIO, 0xc7)
 
+/* Available with KVM_CAP_PRECISE_TSC*/
+#define KVM_SET_TSC_STATE          _IOW(KVMIO,  0xc8, struct kvm_tsc_state)
+#define KVM_GET_TSC_STATE          _IOR(KVMIO,  0xc9, struct kvm_tsc_state)
+
 /* Secure Encrypted Virtualization command */
 enum sev_cmd_id {
 	/* Guest initialization commands */
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 2/2] KVM: x86: introduce KVM_X86_QUIRK_TSC_HOST_ACCESS
  2020-11-30 13:35 [PATCH 0/2] RFC: Precise TSC migration Maxim Levitsky
  2020-11-30 13:35 ` [PATCH 1/2] KVM: x86: implement KVM_SET_TSC_PRECISE/KVM_GET_TSC_PRECISE Maxim Levitsky
@ 2020-11-30 13:35 ` Maxim Levitsky
  2020-11-30 13:54   ` Paolo Bonzini
  2020-11-30 13:38 ` [PATCH 0/2] RFC: Precise TSC migration (summary) Maxim Levitsky
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 30+ messages in thread
From: Maxim Levitsky @ 2020-11-30 13:35 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Oliver Upton, Ingo Molnar, Sean Christopherson,
	Thomas Gleixner, open list, Marcelo Tosatti, Jonathan Corbet,
	Wanpeng Li, Borislav Petkov, Jim Mattson, H. Peter Anvin,
	open list:DOCUMENTATION, Joerg Roedel,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Vitaly Kuznetsov, Maxim Levitsky

This quirk reflects the fact that we currently treat MSR_IA32_TSC
and MSR_TSC_ADJUST access by the host (e.g qemu) in a way that is different
compared to an access from the guest.

For host's MSR_IA32_TSC read we currently always return L1 TSC value, and for
host's write we do the tsc synchronization.

For host's MSR_TSC_ADJUST write, we don't make the tsc 'jump' as we should
for this msr.

When the hypervisor uses the new TSC GET/SET state ioctls, all of this is no
longer needed, thus leave this enabled only with a quirk
which the hypervisor can disable.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/include/uapi/asm/kvm.h |  1 +
 arch/x86/kvm/x86.c              | 19 ++++++++++++++-----
 2 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 8e76d3701db3f..2a60fc6674164 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -404,6 +404,7 @@ struct kvm_sync_regs {
 #define KVM_X86_QUIRK_LAPIC_MMIO_HOLE	   (1 << 2)
 #define KVM_X86_QUIRK_OUT_7E_INC_RIP	   (1 << 3)
 #define KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT (1 << 4)
+#define KVM_X86_QUIRK_TSC_HOST_ACCESS      (1 << 5)
 
 #define KVM_STATE_NESTED_FORMAT_VMX	0
 #define KVM_STATE_NESTED_FORMAT_SVM	1
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4f0ae9cb14b8a..46a2111d54840 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3091,7 +3091,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		break;
 	case MSR_IA32_TSC_ADJUST:
 		if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST)) {
-			if (!msr_info->host_initiated) {
+			if (!msr_info->host_initiated ||
+			    !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_TSC_HOST_ACCESS)) {
 				s64 adj = data - vcpu->arch.ia32_tsc_adjust_msr;
 				adjust_tsc_offset_guest(vcpu, adj);
 			}
@@ -3118,7 +3119,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		vcpu->arch.msr_ia32_power_ctl = data;
 		break;
 	case MSR_IA32_TSC:
-		if (msr_info->host_initiated) {
+		if (msr_info->host_initiated &&
+		    kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_TSC_HOST_ACCESS)) {
 			kvm_synchronize_tsc(vcpu, data);
 		} else {
 			u64 adj = kvm_compute_tsc_offset(vcpu, data) - vcpu->arch.l1_tsc_offset;
@@ -3409,17 +3411,24 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		msr_info->data = vcpu->arch.msr_ia32_power_ctl;
 		break;
 	case MSR_IA32_TSC: {
+		u64 tsc_offset;
+
 		/*
 		 * Intel SDM states that MSR_IA32_TSC read adds the TSC offset
 		 * even when not intercepted. AMD manual doesn't explicitly
 		 * state this but appears to behave the same.
 		 *
-		 * On userspace reads and writes, however, we unconditionally
+		 * On userspace reads and writes, when KVM_X86_QUIRK_SPECIAL_TSC_READ
+		 * is present, however, we unconditionally
 		 * return L1's TSC value to ensure backwards-compatible
 		 * behavior for migration.
 		 */
-		u64 tsc_offset = msr_info->host_initiated ? vcpu->arch.l1_tsc_offset :
-							    vcpu->arch.tsc_offset;
+
+		if (msr_info->host_initiated &&
+		    kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_TSC_HOST_ACCESS))
+			tsc_offset = vcpu->arch.l1_tsc_offset;
+		else
+			tsc_offset = vcpu->arch.tsc_offset;
 
 		msr_info->data = kvm_scale_tsc(vcpu, rdtsc()) + tsc_offset;
 		break;
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] RFC: Precise TSC migration (summary)
  2020-11-30 13:35 [PATCH 0/2] RFC: Precise TSC migration Maxim Levitsky
  2020-11-30 13:35 ` [PATCH 1/2] KVM: x86: implement KVM_SET_TSC_PRECISE/KVM_GET_TSC_PRECISE Maxim Levitsky
  2020-11-30 13:35 ` [PATCH 2/2] KVM: x86: introduce KVM_X86_QUIRK_TSC_HOST_ACCESS Maxim Levitsky
@ 2020-11-30 13:38 ` Maxim Levitsky
  2020-11-30 16:54 ` [PATCH 0/2] RFC: Precise TSC migration Andy Lutomirski
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 30+ messages in thread
From: Maxim Levitsky @ 2020-11-30 13:38 UTC (permalink / raw)
  To: kvm
  Cc: Paolo Bonzini, Oliver Upton, Ingo Molnar, Sean Christopherson,
	Thomas Gleixner, open list, Marcelo Tosatti, Jonathan Corbet,
	Wanpeng Li, Borislav Petkov, Jim Mattson, H. Peter Anvin,
	open list:DOCUMENTATION, Joerg Roedel,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Vitaly Kuznetsov

[-- Attachment #1: Type: text/plain, Size: 92 bytes --]

This is the summary of few things that I think are relevant.

Best regards,
	Maxim Levitsky

[-- Attachment #2: tsc.md --]
[-- Type: text/markdown, Size: 4650 bytes --]

# Random unsynchronized ramblings about the TSC in KVM/Linux

## The KVM's master clock

Under assumption that

a. Host TSC is synchronized and stable (wasn't marked as unstable).

b. Guest TSC is synchronized:

   - When guest starts running, all its vCPUs start from TSC = 0.

   - If we hotplug a vCPU, its TSC is set to 0, but the 'kvm_synchronize_tsc'
     (it was used to be called kvm_write_tsc), practically speaking,
     just sets the TSC to the same value as other vCPUs are having right now.

     Later Linux will try to sync it again, but since it is already synchronized
     it won't do anything. Otherwise it uses IA_32_MSR_TSC_ADJUST to adjust it.
     
   - If the guest writes the TSC we try to detect if TSC is still synced.
     (We don't handle TSC adjustments done via IA_32_MSR_TSC_ADJUST).

Then the kvmclock is driven by a single pair of (nsecs, tsc) and that is used
to update the kvmclock on all vCPUs.

The advantage of this is that no random error can be introduced by calculating
this pair on each CPU.

Plus another advantage is that being vCPU invariant, guest's kvmclock
implementation can read it in userspace.
This is signaled by setting KVM_CLOCK_TSC_STABLE flag via kvmclock interface.

## KVM behavior when the host tsc is detected as unstable

* On each 'userspace' VM entry (aka 'vcpu_load'), we set guest tsc to its
  last value captured on last userspace VM exit,
  and we schedule a KVM clock update (KVM_REQ_GLOBAL_CLOCK_UPDATE)
  
* On each KVM clock update, we 'catchup' the tsc to the kernel clock.

* We don't use masterclock

## The TSC 'features' in the Linux kernel

Linux kernel has roughly speaking 4 cpu 'features' that define its treatment of TSC.
Some of these features are set from CPUID, some are set when certain CPU 
models are detected, and some are set when a specific hypervisor is detected.

* X86_FEATURE_TSC_KNOWN_FREQ
  This is the most harmless feature. It is set by various hypervisors,
  (including kvmclock), and for some Intel models, when TSC frequency can
  be obtained via an interface (PV, cpuid, msr, etc), rather than measuring it.

* X86_FEATURE_NONSTOP_TSC
  
  On real hardware, this feature is set when CPUID has the 'invtsc' bit set.
  And it tells the kernel that TSC doesn't stop in low power idle states.
  When absent, an attempt to enter a low power idle state (e.g C2) will mark
  the TSC as unstable.
  
  This feature has also a friend called X86_FEATURE_NONSTOP_TSC_S3,
  which doesn't do anything that is relevant to KVM.
  
  In a VM, on one hand the vCPU is interrupted often but as long as the host TSC
  is stable, the guest TSC should remain stable as well.
  
  (The guest TSC 'keeps on running' when the guest CPU is not 
  running, which is the same thing as the situation in which
  a real CPU is in low power/waiting state)
  
  However not exposing this bit to the guest doesn't cause much harm, since the
  guest usually doesn't use idle states, thus it never marks the TSC 
  as unstable due to lack of this bit.
  (the exception to that is cpu-pm=on thing, which is TBD)
  
 * X86_FEATURE_CONSTANT_TSC

  On real hardware this bit informs the kernel that the TSC frequency doesn't
  change with CPU frequency changes. If it is not set, on first cpufreq update,
  the tsc is marked as unstable.
  On real hardware this bit is also set by 'invtsc' CPUID bit, plus set on few
  older intel models which lack it but still have constant TSC.
   
  In a VM, once again there is no cpufreq driver, thus the lack of this bit 
  doesn't cause much harm.
  
  However if a VM is migrated to a host with a different CPU frequency,
  and TSC scaling is not supported then the guest will see a jump in the
  frequency.
  
  This is why qemu doesn't enable 'invtsc' by default, and blocks migration
  when enabled, unless you also tell explicitly what frequency the guest’s TSC
  should run at and in this case it will attempt to scale host TSC to this 
  value, using hardware means and I think qemu will fail vm start 
  if it fails.
  
* X86_FEATURE_TSC_RELIABLE
 
  This is feature only for virtualization (although I have seen some Atom
  specific code piggyback on it), and it makes the guest trust the TSC more
  (kind of 'trust us, you can count on TSC')
  
  It makes the guest do two things.
  1. Disable the clocksource watchdog (which is a periodic mechanism,
     by which the kernel compares TSC with other clocksources,
     and if it sees something fishy, it marks the tsc as unstable)
     
  2. Disable TSC sync on new vCPU hotplug, instead relying on hypervisor
     to sync it.
     (IMHO we should set this flag for KVM as well)




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/2] KVM: x86: introduce KVM_X86_QUIRK_TSC_HOST_ACCESS
  2020-11-30 13:35 ` [PATCH 2/2] KVM: x86: introduce KVM_X86_QUIRK_TSC_HOST_ACCESS Maxim Levitsky
@ 2020-11-30 13:54   ` Paolo Bonzini
  2020-11-30 14:11     ` Maxim Levitsky
  0 siblings, 1 reply; 30+ messages in thread
From: Paolo Bonzini @ 2020-11-30 13:54 UTC (permalink / raw)
  To: Maxim Levitsky, kvm
  Cc: Oliver Upton, Ingo Molnar, Sean Christopherson, Thomas Gleixner,
	open list, Marcelo Tosatti, Jonathan Corbet, Wanpeng Li,
	Borislav Petkov, Jim Mattson, H. Peter Anvin,
	open list:DOCUMENTATION, Joerg Roedel,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Vitaly Kuznetsov

On 30/11/20 14:35, Maxim Levitsky wrote:
> This quirk reflects the fact that we currently treat MSR_IA32_TSC
> and MSR_TSC_ADJUST access by the host (e.g qemu) in a way that is different
> compared to an access from the guest.
> 
> For host's MSR_IA32_TSC read we currently always return L1 TSC value, and for
> host's write we do the tsc synchronization.
> 
> For host's MSR_TSC_ADJUST write, we don't make the tsc 'jump' as we should
> for this msr.
> 
> When the hypervisor uses the new TSC GET/SET state ioctls, all of this is no
> longer needed, thus leave this enabled only with a quirk
> which the hypervisor can disable.
> 
> Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>

This needs to be covered by a variant of the existing selftests testcase 
(running the same guest code, but different host code of course).

Paolo

> ---
>   arch/x86/include/uapi/asm/kvm.h |  1 +
>   arch/x86/kvm/x86.c              | 19 ++++++++++++++-----
>   2 files changed, 15 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index 8e76d3701db3f..2a60fc6674164 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -404,6 +404,7 @@ struct kvm_sync_regs {
>   #define KVM_X86_QUIRK_LAPIC_MMIO_HOLE	   (1 << 2)
>   #define KVM_X86_QUIRK_OUT_7E_INC_RIP	   (1 << 3)
>   #define KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT (1 << 4)
> +#define KVM_X86_QUIRK_TSC_HOST_ACCESS      (1 << 5)
>   
>   #define KVM_STATE_NESTED_FORMAT_VMX	0
>   #define KVM_STATE_NESTED_FORMAT_SVM	1
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 4f0ae9cb14b8a..46a2111d54840 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3091,7 +3091,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>   		break;
>   	case MSR_IA32_TSC_ADJUST:
>   		if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST)) {
> -			if (!msr_info->host_initiated) {
> +			if (!msr_info->host_initiated ||
> +			    !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_TSC_HOST_ACCESS)) {
>   				s64 adj = data - vcpu->arch.ia32_tsc_adjust_msr;
>   				adjust_tsc_offset_guest(vcpu, adj);
>   			}
> @@ -3118,7 +3119,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>   		vcpu->arch.msr_ia32_power_ctl = data;
>   		break;
>   	case MSR_IA32_TSC:
> -		if (msr_info->host_initiated) {
> +		if (msr_info->host_initiated &&
> +		    kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_TSC_HOST_ACCESS)) {
>   			kvm_synchronize_tsc(vcpu, data);
>   		} else {
>   			u64 adj = kvm_compute_tsc_offset(vcpu, data) - vcpu->arch.l1_tsc_offset;
> @@ -3409,17 +3411,24 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>   		msr_info->data = vcpu->arch.msr_ia32_power_ctl;
>   		break;
>   	case MSR_IA32_TSC: {
> +		u64 tsc_offset;
> +
>   		/*
>   		 * Intel SDM states that MSR_IA32_TSC read adds the TSC offset
>   		 * even when not intercepted. AMD manual doesn't explicitly
>   		 * state this but appears to behave the same.
>   		 *
> -		 * On userspace reads and writes, however, we unconditionally
> +		 * On userspace reads and writes, when KVM_X86_QUIRK_SPECIAL_TSC_READ
> +		 * is present, however, we unconditionally
>   		 * return L1's TSC value to ensure backwards-compatible
>   		 * behavior for migration.
>   		 */
> -		u64 tsc_offset = msr_info->host_initiated ? vcpu->arch.l1_tsc_offset :
> -							    vcpu->arch.tsc_offset;
> +
> +		if (msr_info->host_initiated &&
> +		    kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_TSC_HOST_ACCESS))
> +			tsc_offset = vcpu->arch.l1_tsc_offset;
> +		else
> +			tsc_offset = vcpu->arch.tsc_offset;
>   
>   		msr_info->data = kvm_scale_tsc(vcpu, rdtsc()) + tsc_offset;
>   		break;
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/2] KVM: x86: introduce KVM_X86_QUIRK_TSC_HOST_ACCESS
  2020-11-30 13:54   ` Paolo Bonzini
@ 2020-11-30 14:11     ` Maxim Levitsky
  2020-11-30 14:15       ` Paolo Bonzini
  0 siblings, 1 reply; 30+ messages in thread
From: Maxim Levitsky @ 2020-11-30 14:11 UTC (permalink / raw)
  To: Paolo Bonzini, kvm
  Cc: Oliver Upton, Ingo Molnar, Sean Christopherson, Thomas Gleixner,
	open list, Marcelo Tosatti, Jonathan Corbet, Wanpeng Li,
	Borislav Petkov, Jim Mattson, H. Peter Anvin,
	open list:DOCUMENTATION, Joerg Roedel,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Vitaly Kuznetsov

On Mon, 2020-11-30 at 14:54 +0100, Paolo Bonzini wrote:
> On 30/11/20 14:35, Maxim Levitsky wrote:
> > This quirk reflects the fact that we currently treat MSR_IA32_TSC
> > and MSR_TSC_ADJUST access by the host (e.g qemu) in a way that is different
> > compared to an access from the guest.
> > 
> > For host's MSR_IA32_TSC read we currently always return L1 TSC value, and for
> > host's write we do the tsc synchronization.
> > 
> > For host's MSR_TSC_ADJUST write, we don't make the tsc 'jump' as we should
> > for this msr.
> > 
> > When the hypervisor uses the new TSC GET/SET state ioctls, all of this is no
> > longer needed, thus leave this enabled only with a quirk
> > which the hypervisor can disable.
> > 
> > Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
> > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> 
> This needs to be covered by a variant of the existing selftests testcase 
> (running the same guest code, but different host code of course).
Do you think that the test should go to the kernel's kvm unit tests,
or to kvm-unit-tests project?

Best regards,
	Maxim Levitsky

> 
> Paolo
> 
> > ---
> >   arch/x86/include/uapi/asm/kvm.h |  1 +
> >   arch/x86/kvm/x86.c              | 19 ++++++++++++++-----
> >   2 files changed, 15 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> > index 8e76d3701db3f..2a60fc6674164 100644
> > --- a/arch/x86/include/uapi/asm/kvm.h
> > +++ b/arch/x86/include/uapi/asm/kvm.h
> > @@ -404,6 +404,7 @@ struct kvm_sync_regs {
> >   #define KVM_X86_QUIRK_LAPIC_MMIO_HOLE	   (1 << 2)
> >   #define KVM_X86_QUIRK_OUT_7E_INC_RIP	   (1 << 3)
> >   #define KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT (1 << 4)
> > +#define KVM_X86_QUIRK_TSC_HOST_ACCESS      (1 << 5)
> >   
> >   #define KVM_STATE_NESTED_FORMAT_VMX	0
> >   #define KVM_STATE_NESTED_FORMAT_SVM	1
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 4f0ae9cb14b8a..46a2111d54840 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -3091,7 +3091,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> >   		break;
> >   	case MSR_IA32_TSC_ADJUST:
> >   		if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST)) {
> > -			if (!msr_info->host_initiated) {
> > +			if (!msr_info->host_initiated ||
> > +			    !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_TSC_HOST_ACCESS)) {
> >   				s64 adj = data - vcpu->arch.ia32_tsc_adjust_msr;
> >   				adjust_tsc_offset_guest(vcpu, adj);
> >   			}
> > @@ -3118,7 +3119,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> >   		vcpu->arch.msr_ia32_power_ctl = data;
> >   		break;
> >   	case MSR_IA32_TSC:
> > -		if (msr_info->host_initiated) {
> > +		if (msr_info->host_initiated &&
> > +		    kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_TSC_HOST_ACCESS)) {
> >   			kvm_synchronize_tsc(vcpu, data);
> >   		} else {
> >   			u64 adj = kvm_compute_tsc_offset(vcpu, data) - vcpu->arch.l1_tsc_offset;
> > @@ -3409,17 +3411,24 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> >   		msr_info->data = vcpu->arch.msr_ia32_power_ctl;
> >   		break;
> >   	case MSR_IA32_TSC: {
> > +		u64 tsc_offset;
> > +
> >   		/*
> >   		 * Intel SDM states that MSR_IA32_TSC read adds the TSC offset
> >   		 * even when not intercepted. AMD manual doesn't explicitly
> >   		 * state this but appears to behave the same.
> >   		 *
> > -		 * On userspace reads and writes, however, we unconditionally
> > +		 * On userspace reads and writes, when KVM_X86_QUIRK_SPECIAL_TSC_READ
> > +		 * is present, however, we unconditionally
> >   		 * return L1's TSC value to ensure backwards-compatible
> >   		 * behavior for migration.
> >   		 */
> > -		u64 tsc_offset = msr_info->host_initiated ? vcpu->arch.l1_tsc_offset :
> > -							    vcpu->arch.tsc_offset;
> > +
> > +		if (msr_info->host_initiated &&
> > +		    kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_TSC_HOST_ACCESS))
> > +			tsc_offset = vcpu->arch.l1_tsc_offset;
> > +		else
> > +			tsc_offset = vcpu->arch.tsc_offset;
> >   
> >   		msr_info->data = kvm_scale_tsc(vcpu, rdtsc()) + tsc_offset;
> >   		break;
> > 



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/2] KVM: x86: introduce KVM_X86_QUIRK_TSC_HOST_ACCESS
  2020-11-30 14:11     ` Maxim Levitsky
@ 2020-11-30 14:15       ` Paolo Bonzini
  2020-11-30 15:33         ` Maxim Levitsky
  0 siblings, 1 reply; 30+ messages in thread
From: Paolo Bonzini @ 2020-11-30 14:15 UTC (permalink / raw)
  To: Maxim Levitsky, kvm
  Cc: Oliver Upton, Ingo Molnar, Sean Christopherson, Thomas Gleixner,
	open list, Marcelo Tosatti, Jonathan Corbet, Wanpeng Li,
	Borislav Petkov, Jim Mattson, H. Peter Anvin,
	open list:DOCUMENTATION, Joerg Roedel,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Vitaly Kuznetsov

On 30/11/20 15:11, Maxim Levitsky wrote:
> On Mon, 2020-11-30 at 14:54 +0100, Paolo Bonzini wrote:
>> On 30/11/20 14:35, Maxim Levitsky wrote:
>>> This quirk reflects the fact that we currently treat MSR_IA32_TSC
>>> and MSR_TSC_ADJUST access by the host (e.g qemu) in a way that is different
>>> compared to an access from the guest.
>>>
>>> For host's MSR_IA32_TSC read we currently always return L1 TSC value, and for
>>> host's write we do the tsc synchronization.
>>>
>>> For host's MSR_TSC_ADJUST write, we don't make the tsc 'jump' as we should
>>> for this msr.
>>>
>>> When the hypervisor uses the new TSC GET/SET state ioctls, all of this is no
>>> longer needed, thus leave this enabled only with a quirk
>>> which the hypervisor can disable.
>>>
>>> Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
>>> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
>>
>> This needs to be covered by a variant of the existing selftests testcase
>> (running the same guest code, but different host code of course).
> Do you think that the test should go to the kernel's kvm unit tests,
> or to kvm-unit-tests project?

The latter already has x86_64/tsc_msrs_test.c (which I created in 
preparation for this exact change :)).

Paolo


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/2] KVM: x86: implement KVM_SET_TSC_PRECISE/KVM_GET_TSC_PRECISE
  2020-11-30 13:35 ` [PATCH 1/2] KVM: x86: implement KVM_SET_TSC_PRECISE/KVM_GET_TSC_PRECISE Maxim Levitsky
@ 2020-11-30 14:33   ` Paolo Bonzini
  2020-11-30 15:58     ` Maxim Levitsky
  2020-12-01 19:43   ` Thomas Gleixner
  1 sibling, 1 reply; 30+ messages in thread
From: Paolo Bonzini @ 2020-11-30 14:33 UTC (permalink / raw)
  To: Maxim Levitsky, kvm
  Cc: Oliver Upton, Ingo Molnar, Sean Christopherson, Thomas Gleixner,
	open list, Marcelo Tosatti, Jonathan Corbet, Wanpeng Li,
	Borislav Petkov, Jim Mattson, H. Peter Anvin,
	open list:DOCUMENTATION, Joerg Roedel,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Vitaly Kuznetsov

On 30/11/20 14:35, Maxim Levitsky wrote:
> +		if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST)) {
> +			tsc_state.tsc_adjust = vcpu->arch.ia32_tsc_adjust_msr;
> +			tsc_state.flags |= KVM_TSC_STATE_TSC_ADJUST_VALID;
> +		}

This is mostly useful for userspace that doesn't disable the quirk, right?

> +		kvm_get_walltime(&wall_nsec, &host_tsc);
> +		diff = wall_nsec - tsc_state.nsec;
> +
> +		if (diff < 0 || tsc_state.nsec == 0)
> +			diff = 0;
> +

diff < 0 should be okay.  Also why the nsec==0 special case?  What about 
using a flag instead?

Paolo


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/2] KVM: x86: introduce KVM_X86_QUIRK_TSC_HOST_ACCESS
  2020-11-30 14:15       ` Paolo Bonzini
@ 2020-11-30 15:33         ` Maxim Levitsky
  0 siblings, 0 replies; 30+ messages in thread
From: Maxim Levitsky @ 2020-11-30 15:33 UTC (permalink / raw)
  To: Paolo Bonzini, kvm
  Cc: Oliver Upton, Ingo Molnar, Sean Christopherson, Thomas Gleixner,
	open list, Marcelo Tosatti, Jonathan Corbet, Wanpeng Li,
	Borislav Petkov, Jim Mattson, H. Peter Anvin,
	open list:DOCUMENTATION, Joerg Roedel,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Vitaly Kuznetsov

On Mon, 2020-11-30 at 15:15 +0100, Paolo Bonzini wrote:
> On 30/11/20 15:11, Maxim Levitsky wrote:
> > On Mon, 2020-11-30 at 14:54 +0100, Paolo Bonzini wrote:
> > > On 30/11/20 14:35, Maxim Levitsky wrote:
> > > > This quirk reflects the fact that we currently treat MSR_IA32_TSC
> > > > and MSR_TSC_ADJUST access by the host (e.g qemu) in a way that is different
> > > > compared to an access from the guest.
> > > > 
> > > > For host's MSR_IA32_TSC read we currently always return L1 TSC value, and for
> > > > host's write we do the tsc synchronization.
> > > > 
> > > > For host's MSR_TSC_ADJUST write, we don't make the tsc 'jump' as we should
> > > > for this msr.
> > > > 
> > > > When the hypervisor uses the new TSC GET/SET state ioctls, all of this is no
> > > > longer needed, thus leave this enabled only with a quirk
> > > > which the hypervisor can disable.
> > > > 
> > > > Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
> > > > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> > > 
> > > This needs to be covered by a variant of the existing selftests testcase
> > > (running the same guest code, but different host code of course).
> > Do you think that the test should go to the kernel's kvm unit tests,
> > or to kvm-unit-tests project?
> 
> The latter already has x86_64/tsc_msrs_test.c (which I created in 
> preparation for this exact change :)).

I'll prepare a test then for it!

Best regards,
	Maxim Levitsky
> 
> Paolo
> 



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/2] KVM: x86: implement KVM_SET_TSC_PRECISE/KVM_GET_TSC_PRECISE
  2020-11-30 14:33   ` Paolo Bonzini
@ 2020-11-30 15:58     ` Maxim Levitsky
  2020-11-30 17:01       ` Paolo Bonzini
  0 siblings, 1 reply; 30+ messages in thread
From: Maxim Levitsky @ 2020-11-30 15:58 UTC (permalink / raw)
  To: Paolo Bonzini, kvm
  Cc: Oliver Upton, Ingo Molnar, Sean Christopherson, Thomas Gleixner,
	open list, Marcelo Tosatti, Jonathan Corbet, Wanpeng Li,
	Borislav Petkov, Jim Mattson, H. Peter Anvin,
	open list:DOCUMENTATION, Joerg Roedel,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Vitaly Kuznetsov

On Mon, 2020-11-30 at 15:33 +0100, Paolo Bonzini wrote:
> On 30/11/20 14:35, Maxim Levitsky wrote:
> > +		if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST)) {
> > +			tsc_state.tsc_adjust = vcpu->arch.ia32_tsc_adjust_msr;
> > +			tsc_state.flags |= KVM_TSC_STATE_TSC_ADJUST_VALID;
> > +		}
> 
> This is mostly useful for userspace that doesn't disable the quirk, right?

Isn't this the opposite? If I understand the original proposal correctly,
the reason that we include the TSC_ADJUST in the new ioctl, is that
we would like to disable the special kvm behavior (that is disable the quirk),
which would mean that tsc will jump on regular host initiated TSC_ADJUST write.

To avoid this, userspace would set TSC_ADJUST through this new interface.

Note that I haven't yet disabled the quirk in the patches I posted to the qemu,
because we need some infrastructure to manage which quirks we want to disable
in qemu
(That is, KVM_ENABLE_CAP is as I understand write only, so I can't just disable
KVM_X86_QUIRK_TSC_HOST_ACCESS, in the code that enables x-precise-tsc in qemu).

> 
> > +		kvm_get_walltime(&wall_nsec, &host_tsc);
> > +		diff = wall_nsec - tsc_state.nsec;
> > +
> > +		if (diff < 0 || tsc_state.nsec == 0)
> > +			diff = 0;
> > +
> 
> diff < 0 should be okay.  Also why the nsec==0 special case?  What about 
> using a flag instead?

In theory diff < 0 should indeed be okay (though this would mean that target,
has unsynchronized clock or time travel happened).

However for example nsec_to_cycles takes unsigned number, and then
pvclock_scale_delta also takes unsigned number, and so on,
so I was thinking why bother with this case.

There is still (mostly?) theoretical issue, if on some vcpus 'diff' is positive 
and on some is negative
(this can happen if the migration was really fast, and target has the clock
   A. that is only slightly ahead of the source).
Do you think that this is an issue? If so I can make the code work with
signed numbers.

About nsec == 0, this is to allow to use this API for VM initialization.
(That is to call KVM_SET_TSC_PRECISE prior to doing KVM_GET_TSC_PRECISE)

This simplifies qemu code, and I don't think 
that this makes the API much worse.

Best regards,
	Maxim Levitsky

> 
> Paolo
> 



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] RFC: Precise TSC migration
  2020-11-30 13:35 [PATCH 0/2] RFC: Precise TSC migration Maxim Levitsky
                   ` (2 preceding siblings ...)
  2020-11-30 13:38 ` [PATCH 0/2] RFC: Precise TSC migration (summary) Maxim Levitsky
@ 2020-11-30 16:54 ` Andy Lutomirski
  2020-11-30 16:59   ` Paolo Bonzini
  2020-11-30 19:16 ` Marcelo Tosatti
  2020-12-01 19:35 ` Thomas Gleixner
  5 siblings, 1 reply; 30+ messages in thread
From: Andy Lutomirski @ 2020-11-30 16:54 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: kvm list, Paolo Bonzini, Oliver Upton, Ingo Molnar,
	Sean Christopherson, Thomas Gleixner, open list, Marcelo Tosatti,
	Jonathan Corbet, Wanpeng Li, Borislav Petkov, Jim Mattson,
	H. Peter Anvin, open list:DOCUMENTATION, Joerg Roedel,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Vitaly Kuznetsov

On Mon, Nov 30, 2020 at 5:36 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
>
> Hi!
>
> This is the first version of the work to make TSC migration more accurate,
> as was defined by Paulo at:
> https://www.spinics.net/lists/kvm/msg225525.html
>
> I have a few thoughts about the kvm masterclock synchronization,
> which relate to the Paulo's proposal that I implemented.
>
> The idea of masterclock is that when the host TSC is synchronized
> (or as kernel call it, stable), and the guest TSC is synchronized as well,
> then we can base the kvmclock, on the same pair of
> (host time in nsec, host tsc value), for all vCPUs.
>
> This makes the random error in calculation of this value invariant
> across vCPUS, and allows the guest to do kvmclock calculation in userspace
> (vDSO) since kvmclock parameters are vCPU invariant.
>
> To ensure that the guest tsc is synchronized we currently track host/guest tsc
> writes, and enable the master clock only when roughly the same guest's TSC value
> was written across all vCPUs.
>
> Recently this was disabled by Paulo and I agree with this, because I think
> that we indeed should only make the guest TSC synchronized by default
> (including new hotplugged vCPUs) and not do any tsc synchronization beyond that.
> (Trying to guess when the guest syncs the TSC can cause more harm that good).
>
> Besides, Linux guests don't sync the TSC via IA32_TSC write,
> but rather use IA32_TSC_ADJUST which currently doesn't participate
> in the tsc sync heruistics.
> And as far as I know, Linux guest is the primary (only?) user of the kvmclock.
>
> I *do think* however that we should redefine KVM_CLOCK_TSC_STABLE
> in the documentation to state that it only guarantees invariance if the guest
> doesn't mess with its own TSC.
>
> Also I think we should consider enabling the X86_FEATURE_TSC_RELIABLE
> in the guest kernel, when kvm is detected to avoid the guest even from trying
> to sync TSC on newly hotplugged vCPUs.
>
> (The guest doesn't end up touching TSC_ADJUST usually, but it still might
> in some cases due to scheduling of guest vCPUs)
>
> (X86_FEATURE_TSC_RELIABLE short circuits tsc synchronization on CPU hotplug,
> and TSC clocksource watchdog, and the later we might want to keep).

If you're going to change the guest behavior to be more trusting of
the host, I think
the host should probably signal this to the guest using a new bit.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] RFC: Precise TSC migration
  2020-11-30 16:54 ` [PATCH 0/2] RFC: Precise TSC migration Andy Lutomirski
@ 2020-11-30 16:59   ` Paolo Bonzini
  0 siblings, 0 replies; 30+ messages in thread
From: Paolo Bonzini @ 2020-11-30 16:59 UTC (permalink / raw)
  To: Andy Lutomirski, Maxim Levitsky
  Cc: kvm list, Oliver Upton, Ingo Molnar, Sean Christopherson,
	Thomas Gleixner, open list, Marcelo Tosatti, Jonathan Corbet,
	Wanpeng Li, Borislav Petkov, Jim Mattson, H. Peter Anvin,
	open list:DOCUMENTATION, Joerg Roedel,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Vitaly Kuznetsov

On 30/11/20 17:54, Andy Lutomirski wrote:
>> I*do think*  however that we should redefine KVM_CLOCK_TSC_STABLE
>> in the documentation to state that it only guarantees invariance if the guest
>> doesn't mess with its own TSC.
>>
>> Also I think we should consider enabling the X86_FEATURE_TSC_RELIABLE
>> in the guest kernel, when kvm is detected to avoid the guest even from trying
>> to sync TSC on newly hotplugged vCPUs.
>>
>> (The guest doesn't end up touching TSC_ADJUST usually, but it still might
>> in some cases due to scheduling of guest vCPUs)
>>
>> (X86_FEATURE_TSC_RELIABLE short circuits tsc synchronization on CPU hotplug,
>> and TSC clocksource watchdog, and the later we might want to keep).
> If you're going to change the guest behavior to be more trusting of
> the host, I think
> the host should probably signal this to the guest using a new bit.
> 

Yes, a new CPUID bit takes longer to propagate to existing setups, but 
it is more future proof.

Paolo


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/2] KVM: x86: implement KVM_SET_TSC_PRECISE/KVM_GET_TSC_PRECISE
  2020-11-30 15:58     ` Maxim Levitsky
@ 2020-11-30 17:01       ` Paolo Bonzini
  0 siblings, 0 replies; 30+ messages in thread
From: Paolo Bonzini @ 2020-11-30 17:01 UTC (permalink / raw)
  To: Maxim Levitsky, kvm
  Cc: Oliver Upton, Ingo Molnar, Sean Christopherson, Thomas Gleixner,
	open list, Marcelo Tosatti, Jonathan Corbet, Wanpeng Li,
	Borislav Petkov, Jim Mattson, H. Peter Anvin,
	open list:DOCUMENTATION, Joerg Roedel,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Vitaly Kuznetsov

On 30/11/20 16:58, Maxim Levitsky wrote:
>> This is mostly useful for userspace that doesn't disable the quirk, right?
> Isn't this the opposite? If I understand the original proposal correctly,
> the reason that we include the TSC_ADJUST in the new ioctl, is that
> we would like to disable the special kvm behavior (that is disable the quirk),
> which would mean that tsc will jump on regular host initiated TSC_ADJUST write.
> 
> To avoid this, userspace would set TSC_ADJUST through this new interface.

Yeah, that makes sense.  It removes the need to think "I have to set TSC 
adjust before TSC".

> Do you think that this is an issue? If so I can make the code work with
> signed numbers.

Not sure if it's an issue, but I prefer to make the API "less 
surprising" for userspace.  Who knows how it will be used.

> About nsec == 0, this is to allow to use this API for VM initialization.
> (That is to call KVM_SET_TSC_PRECISE prior to doing KVM_GET_TSC_PRECISE)

I prefer using flags for that purpose.

Paolo


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] RFC: Precise TSC migration
  2020-11-30 13:35 [PATCH 0/2] RFC: Precise TSC migration Maxim Levitsky
                   ` (3 preceding siblings ...)
  2020-11-30 16:54 ` [PATCH 0/2] RFC: Precise TSC migration Andy Lutomirski
@ 2020-11-30 19:16 ` Marcelo Tosatti
  2020-12-01 12:30   ` Maxim Levitsky
                     ` (2 more replies)
  2020-12-01 19:35 ` Thomas Gleixner
  5 siblings, 3 replies; 30+ messages in thread
From: Marcelo Tosatti @ 2020-11-30 19:16 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: kvm, Paolo Bonzini, Oliver Upton, Ingo Molnar,
	Sean Christopherson, Thomas Gleixner, open list, Jonathan Corbet,
	Wanpeng Li, Borislav Petkov, Jim Mattson, H. Peter Anvin,
	open list:DOCUMENTATION, Joerg Roedel,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Vitaly Kuznetsov

Hi Maxim,

On Mon, Nov 30, 2020 at 03:35:57PM +0200, Maxim Levitsky wrote:
> Hi!
> 
> This is the first version of the work to make TSC migration more accurate,
> as was defined by Paulo at:
> https://www.spinics.net/lists/kvm/msg225525.html

Description from Oliver's patch:

"To date, VMMs have typically restored the guest's TSCs by value using
the KVM_SET_MSRS ioctl for each vCPU. However, restoring the TSCs by
value introduces some challenges with synchronization as the TSCs
continue to tick throughout the restoration process. As such, KVM has
some heuristics around TSC writes to infer whether or not the guest or
host is attempting to synchronize the TSCs."

Not really. The synchronization logic tries to sync TSCs during
BIOS boot (and CPU hotplug), because the TSC values are loaded
sequentially, say:

CPU		realtime	TSC val
vcpu0		0 usec		0
vcpu1		100 usec	0
vcpu2		200 usec	0
...

And we'd like to see all vcpus to read the same value at all times.

Other than that, comment makes sense. The problem with live migration
is as follows:

We'd like the TSC value to be written, ideally, just before the first
VM-entry a vCPU (because at the moment the TSC_OFFSET has been written, 
the vcpus tsc is ticking, which will cause a visible forward jump
in vcpus tsc time).

Before the first VM-entry is the farthest point in time before guest
entry that one could do that.

The window (or forward jump) between KVM_SET_TSC and VM-entry was about
100ms last time i checked (which results in a 100ms time jump forward), 
See QEMU's 6053a86fe7bd3d5b07b49dae6c05f2cd0d44e687.

Have we measured any improvement with this patchset?

Then Paolo mentions (with >), i am replying as usual.

> Ok, after looking more at the code with Maxim I can confidently say that
> it's a total mess.  And a lot of the synchronization code is dead
> because 1) as far as we could see no guest synchronizes the TSC using
> MSR_IA32_TSC; 

Well, recent BIOS'es take care of synchronizing the TSC. So when Linux
boots, it does not have to synchronize TSC in software. 

However, upon migration (and initialization), the KVM_SET_TSC's do 
not happen at exactly the same time (the MSRs for each vCPU are loaded
in sequence). The synchronization code in kvm_set_tsc() is for those cases.

> and 2) writing to MSR_IA32_TSC_ADJUST does not trigger the
> synchronization code in kvm_write_tsc.

Not familiar how guests are using MSR_IA32_TSC_ADJUST (or Linux)...
Lets see:


/*
 * Freshly booted CPUs call into this:
 */
void check_tsc_sync_target(void)
{
        struct tsc_adjust *cur = this_cpu_ptr(&tsc_adjust);
        unsigned int cpu = smp_processor_id();
        cycles_t cur_max_warp, gbl_max_warp;
        int cpus = 2;

        /* Also aborts if there is no TSC. */
        if (unsynchronized_tsc())
                return;

        /*
         * Store, verify and sanitize the TSC adjust register. If
         * successful skip the test.
         *
         * The test is also skipped when the TSC is marked reliable. This
         * is true for SoCs which have no fallback clocksource. On these
         * SoCs the TSC is frequency synchronized, but still the TSC ADJUST
         * register might have been wreckaged by the BIOS..
         */
        if (tsc_store_and_check_tsc_adjust(false) || tsc_clocksource_reliable) {
                atomic_inc(&skip_test);
                return;
        }

retry:

I'd force that synchronization path to be taken as a test-case.


> I have a few thoughts about the kvm masterclock synchronization,
> which relate to the Paulo's proposal that I implemented.
> 
> The idea of masterclock is that when the host TSC is synchronized
> (or as kernel call it, stable), and the guest TSC is synchronized as well,
> then we can base the kvmclock, on the same pair of
> (host time in nsec, host tsc value), for all vCPUs.

We _have_ to base. See the comment which starts with

"Assuming a stable TSC across physical CPUS, and a stable TSC"

at x86.c.

> 
> This makes the random error in calculation of this value invariant
> across vCPUS, and allows the guest to do kvmclock calculation in userspace
> (vDSO) since kvmclock parameters are vCPU invariant.

Actually, without synchronized host TSCs (and the masterclock scheme,
with a single base read from a vCPU), kvmclock in kernel is buggy as
well:

u64 pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
{
        unsigned version;
        u64 ret;
        u64 last;
        u8 flags;

        do {
                version = pvclock_read_begin(src);
                ret = __pvclock_read_cycles(src, rdtsc_ordered());
                flags = src->flags;
        } while (pvclock_read_retry(src, version));

        if (unlikely((flags & PVCLOCK_GUEST_STOPPED) != 0)) {
                src->flags &= ~PVCLOCK_GUEST_STOPPED;
                pvclock_touch_watchdogs();
        }

        if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) &&
                (flags & PVCLOCK_TSC_STABLE_BIT))
                return ret;

The code that follows this (including cmpxchg) is a workaround for that 
bug.

Workaround would require each vCPU to write to a "global clock", on
every clock read.

> To ensure that the guest tsc is synchronized we currently track host/guest tsc
> writes, and enable the master clock only when roughly the same guest's TSC value
> was written across all vCPUs.

Yes, because then you can do:

vcpu0				vcpu1

A = read TSC
		... elapsed time ...

				B = read TSC

				delta = B - A

> Recently this was disabled by Paulo

What was disabled exactly?

> and I agree with this, because I think
> that we indeed should only make the guest TSC synchronized by default
> (including new hotplugged vCPUs) and not do any tsc synchronization beyond that.
> (Trying to guess when the guest syncs the TSC can cause more harm that good).
> 
> Besides, Linux guests don't sync the TSC via IA32_TSC write,
> but rather use IA32_TSC_ADJUST which currently doesn't participate
> in the tsc sync heruistics.

Linux should not try to sync the TSC with IA32_TSC_ADJUST. It expects
the BIOS to boot with synced TSCs.

So i wonder what is making it attempt TSC sync in the first place?

(one might also want to have Linux's synchronization via IA32_TSC_ADJUST 
working, but it should not need to happen in the first place, as long as 
QEMU and KVM are behaving properly).

> And as far as I know, Linux guest is the primary (only?) user of the kvmclock.

Only AFAIK.

> I *do think* however that we should redefine KVM_CLOCK_TSC_STABLE
> in the documentation to state that it only guarantees invariance if the guest
> doesn't mess with its own TSC.
> 
> Also I think we should consider enabling the X86_FEATURE_TSC_RELIABLE
> in the guest kernel, when kvm is detected to avoid the guest even from trying
> to sync TSC on newly hotplugged vCPUs.

See 7539b174aef405d9d57db48c58390ba360c91312. 

Was hoping to make that (-cpu xxx,+invtsc) the default in QEMU once invariant TSC code
becomes stable. Should be tested enough by now?

> (The guest doesn't end up touching TSC_ADJUST usually, but it still might
> in some cases due to scheduling of guest vCPUs)
> 
> (X86_FEATURE_TSC_RELIABLE short circuits tsc synchronization on CPU hotplug,
> and TSC clocksource watchdog, and the later we might want to keep).

The latter we want to keep.

> For host TSC writes, just as Paulo proposed we can still do the tsc sync,
> unless the new code that I implemented is in use.

So Paolo's proposal is to

"- for live migration, userspace is expected to use the new
KVM_GET/SET_TSC_PRECISE (or whatever the name will be) to get/set a
(nanosecond, TSC, TSC_ADJUST) tuple."

Makes sense, so that no time between KVM_SET_TSC and
MSR_WRITE(TSC_ADJUST) elapses, which would cause the TSC to go out
of what is desired by the user.

Since you are proposing this new ioctl, perhaps its useful to also
reduce the 100ms jump? 

"- for live migration, userspace is expected to use the new
KVM_GET/SET_TSC_PRECISE (or whatever the name will be) to get/set a
(nanosecond, TSC, TSC_ADJUST) tuple. This value will be written
to the guest before the first VM-entry"

Sounds like a good idea (to integrate the values in a tuple).

> Few more random notes:
> 
> I have a weird feeling about using 'nsec since 1 January 1970'.
> Common sense is telling me that a 64 bit value can hold about 580 years,
> but still I see that it is more common to use timespec which is a (sec,nsec) pair.

           struct timespec {
               time_t   tv_sec;        /* seconds */
               long     tv_nsec;       /* nanoseconds */
           };

> I feel that 'kvm_get_walltime' that I added is a bit of a hack.
> Some refactoring might improve things here.

Haven't read the patchset yet...

> For example making kvm_get_walltime_and_clockread work in non tsc case as well
> might make the code cleaner.
> 
> Patches to enable this feature in qemu are in process of being sent to
> qemu-devel mailing list.
> 
> Best regards,
>        Maxim Levitsky
> 
> Maxim Levitsky (2):
>   KVM: x86: implement KVM_SET_TSC_PRECISE/KVM_GET_TSC_PRECISE
>   KVM: x86: introduce KVM_X86_QUIRK_TSC_HOST_ACCESS
> 
>  Documentation/virt/kvm/api.rst  | 56 +++++++++++++++++++++
>  arch/x86/include/uapi/asm/kvm.h |  1 +
>  arch/x86/kvm/x86.c              | 88 +++++++++++++++++++++++++++++++--
>  include/uapi/linux/kvm.h        | 14 ++++++
>  4 files changed, 154 insertions(+), 5 deletions(-)
> 
> -- 
> 2.26.2
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] RFC: Precise TSC migration
  2020-11-30 19:16 ` Marcelo Tosatti
@ 2020-12-01 12:30   ` Maxim Levitsky
  2020-12-01 19:48     ` Marcelo Tosatti
  2020-12-01 13:48   ` Thomas Gleixner
  2020-12-01 14:01   ` Thomas Gleixner
  2 siblings, 1 reply; 30+ messages in thread
From: Maxim Levitsky @ 2020-12-01 12:30 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: kvm, Paolo Bonzini, Oliver Upton, Ingo Molnar,
	Sean Christopherson, Thomas Gleixner, open list, Jonathan Corbet,
	Wanpeng Li, Borislav Petkov, Jim Mattson, H. Peter Anvin,
	open list:DOCUMENTATION, Joerg Roedel,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Vitaly Kuznetsov

On Mon, 2020-11-30 at 16:16 -0300, Marcelo Tosatti wrote:
> Hi Maxim,
> 
> On Mon, Nov 30, 2020 at 03:35:57PM +0200, Maxim Levitsky wrote:
> > Hi!
> > 
> > This is the first version of the work to make TSC migration more accurate,
> > as was defined by Paulo at:
> > https://www.spinics.net/lists/kvm/msg225525.html
> 
> Description from Oliver's patch:
> 
> "To date, VMMs have typically restored the guest's TSCs by value using
> the KVM_SET_MSRS ioctl for each vCPU. However, restoring the TSCs by
> value introduces some challenges with synchronization as the TSCs
> continue to tick throughout the restoration process. As such, KVM has
> some heuristics around TSC writes to infer whether or not the guest or
> host is attempting to synchronize the TSCs."
> 
> Not really. The synchronization logic tries to sync TSCs during
> BIOS boot (and CPU hotplug), because the TSC values are loaded
> sequentially, say:
> 
> CPU		realtime	TSC val
> vcpu0		0 usec		0
> vcpu1		100 usec	0
> vcpu2		200 usec	0
> ...
> 
> And we'd like to see all vcpus to read the same value at all times.
> 
> Other than that, comment makes sense. The problem with live migration
> is as follows:
> 
> We'd like the TSC value to be written, ideally, just before the first
> VM-entry a vCPU (because at the moment the TSC_OFFSET has been written, 
> the vcpus tsc is ticking, which will cause a visible forward jump
> in vcpus tsc time).
> 
> Before the first VM-entry is the farthest point in time before guest
> entry that one could do that.
> 
> The window (or forward jump) between KVM_SET_TSC and VM-entry was about
> 100ms last time i checked (which results in a 100ms time jump forward), 
> See QEMU's 6053a86fe7bd3d5b07b49dae6c05f2cd0d44e687.
> 
> Have we measured any improvement with this patchset?

Its not about this window. 
It is about time that passes between the point that we read the 
TSC on source system (and we do it in qemu each time the VM is paused) 
and the moment that we set the same TSC value on the target. 
That time is unbounded.

Also this patchset should decrease TSC skew that happens
between restoring it on multiple vCPUs as well, since 
KVM_SET_TSC_STATE doesn't have to happen at the same time,
as it accounts for time passed on each vCPU.


Speaking of kvmclock, somewhat offtopic since this is a different issue,
I found out that qemu reads the kvmclock value on each pause, 
and then 'restores' on unpause, using
KVM_SET_CLOCK (this modifies the global kvmclock offset)

This means (and I tested it) that if guest uses kvmclock
for time reference, it will not account for time passed in
the paused state.

> 
> Then Paolo mentions (with >), i am replying as usual.
> 
> > Ok, after looking more at the code with Maxim I can confidently say that
> > it's a total mess.  And a lot of the synchronization code is dead
> > because 1) as far as we could see no guest synchronizes the TSC using
> > MSR_IA32_TSC; 
> 
> Well, recent BIOS'es take care of synchronizing the TSC. So when Linux
> boots, it does not have to synchronize TSC in software. 

Do you have an example of such BIOS? I tested OVMF which I compiled
from git master a few weeks ago, and I also tested this with seabios 
from qemu repo, and I have never seen writes to either TSC, or TSC_ADJUST
from BIOS.

Or do you refer to the native BIOS on the host doing TSC synchronization?

> 
> However, upon migration (and initialization), the KVM_SET_TSC's do 
> not happen at exactly the same time (the MSRs for each vCPU are loaded
> in sequence). The synchronization code in kvm_set_tsc() is for those cases.

I agree with that, and this is one of the issues that KVM_SET_TSC_STATE
is going to fix, since it accounts for it.


> 
> > and 2) writing to MSR_IA32_TSC_ADJUST does not trigger the
> > synchronization code in kvm_write_tsc.
> 
> Not familiar how guests are using MSR_IA32_TSC_ADJUST (or Linux)...
> Lets see:
> 
> 
> /*
>  * Freshly booted CPUs call into this:
>  */
> void check_tsc_sync_target(void)
> {
>         struct tsc_adjust *cur = this_cpu_ptr(&tsc_adjust);
>         unsigned int cpu = smp_processor_id();
>         cycles_t cur_max_warp, gbl_max_warp;
>         int cpus = 2;
> 
>         /* Also aborts if there is no TSC. */
>         if (unsynchronized_tsc())
>                 return;
> 
>         /*
>          * Store, verify and sanitize the TSC adjust register. If
>          * successful skip the test.
>          *
>          * The test is also skipped when the TSC is marked reliable. This
>          * is true for SoCs which have no fallback clocksource. On these
>          * SoCs the TSC is frequency synchronized, but still the TSC ADJUST
>          * register might have been wreckaged by the BIOS..
>          */
>         if (tsc_store_and_check_tsc_adjust(false) || tsc_clocksource_reliable) {
>                 atomic_inc(&skip_test);
>                 return;
>         }
> 
> retry:
> 
> I'd force that synchronization path to be taken as a test-case.

Or even better as I suggested, we might tell the guest kernel
to avoid this synchronization path when KVM is detected
(regardless of invtsc flag)

> 
> 
> > I have a few thoughts about the kvm masterclock synchronization,
> > which relate to the Paulo's proposal that I implemented.
> > 
> > The idea of masterclock is that when the host TSC is synchronized
> > (or as kernel call it, stable), and the guest TSC is synchronized as well,
> > then we can base the kvmclock, on the same pair of
> > (host time in nsec, host tsc value), for all vCPUs.
> 
> We _have_ to base. See the comment which starts with
> 
> "Assuming a stable TSC across physical CPUS, and a stable TSC"
> 
> at x86.c.
> 
> > This makes the random error in calculation of this value invariant
> > across vCPUS, and allows the guest to do kvmclock calculation in userspace
> > (vDSO) since kvmclock parameters are vCPU invariant.
> 
> Actually, without synchronized host TSCs (and the masterclock scheme,
> with a single base read from a vCPU), kvmclock in kernel is buggy as
> well:
> 
> u64 pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
> {
>         unsigned version;
>         u64 ret;
>         u64 last;
>         u8 flags;
> 
>         do {
>                 version = pvclock_read_begin(src);
>                 ret = __pvclock_read_cycles(src, rdtsc_ordered());
>                 flags = src->flags;
>         } while (pvclock_read_retry(src, version));
> 
>         if (unlikely((flags & PVCLOCK_GUEST_STOPPED) != 0)) {
>                 src->flags &= ~PVCLOCK_GUEST_STOPPED;
>                 pvclock_touch_watchdogs();
>         }
> 
>         if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) &&
>                 (flags & PVCLOCK_TSC_STABLE_BIT))
>                 return ret;
> 
> The code that follows this (including cmpxchg) is a workaround for that 
> bug.

I understand that. I am not arguing that we shoudn't use the masterclock!
I am just saying the facts about the condition when it works.

> 
> Workaround would require each vCPU to write to a "global clock", on
> every clock read.
> 
> > To ensure that the guest tsc is synchronized we currently track host/guest tsc
> > writes, and enable the master clock only when roughly the same guest's TSC value
> > was written across all vCPUs.
> 
> Yes, because then you can do:
> 
> vcpu0				vcpu1
> 
> A = read TSC
> 		... elapsed time ...
> 
> 				B = read TSC
> 
> 				delta = B - A
> 
> > Recently this was disabled by Paulo
> 
> What was disabled exactly?

The running of tsc synchronization code when the _guest_ writes the TSC.

Which changes two things:
   1. If the guest de-synchronizes its TSC, we won't disable master clock.
   2. If the guest writes similar TSC values on each vCPU we won't detect
      this as synchronization attempt, replace this with exactly the same
      value and finally re-enable the master clock.

I argue that this change is OK, because Linux guests don't write to TSC at all,
the virtual BIOSes seems not to write there either, and the only case in which
the Linux guest tries to change its TSC is on CPU hotplug as you mention and 
it uses TSC_ADJUST, that currently doesn't trigger TSC synchronization code in
KVM anyway, so it is broken already.

However I also argue that we should mention this in documentation just in case,
and we might also want (also just in case) to make Linux guests avoid even trying to
touch TSC_ADJUST register when running under KVM.

To rehash my own words, the KVM_CLOCK_TSC_STABLE should be defined as:
'kvmclock is vCPU invariant, as long as the guest doesn't mess with its TSC'.

Having said all that, now that I know tsc sync code, and the
reasons why it is there, I wouldn't be arguing about putting it back either.

> 
> > and I agree with this, because I think
> > that we indeed should only make the guest TSC synchronized by default
> > (including new hotplugged vCPUs) and not do any tsc synchronization beyond that.
> > (Trying to guess when the guest syncs the TSC can cause more harm that good).
> > 
> > Besides, Linux guests don't sync the TSC via IA32_TSC write,
> > but rather use IA32_TSC_ADJUST which currently doesn't participate
> > in the tsc sync heruistics.
> 
> Linux should not try to sync the TSC with IA32_TSC_ADJUST. It expects
> the BIOS to boot with synced TSCs.
> 
> So i wonder what is making it attempt TSC sync in the first place?

CPU hotplug. And the guest doesn't really write to TSC_ADJUST 
since it's measurement code doesn't detect any tsc warps. 
 
I was just thinking that in theory since, this is a VM, and it can be 
interrupted at any point, the measurement code should sometimes fall,
and cause trouble.
I didn't do much homework on this so I might be overreacting.
 
As far as I see X86_FEATURE_TSC_RELIABLE was done mostly to support
running under Hyper-V and VMWARE, and these should be prone to similar
issues, supporting my theory.

> 
> (one might also want to have Linux's synchronization via IA32_TSC_ADJUST 
> working, but it should not need to happen in the first place, as long as 
> QEMU and KVM are behaving properly).
> 
> > And as far as I know, Linux guest is the primary (only?) user of the kvmclock.
> 
> Only AFAIK.
> 
> > I *do think* however that we should redefine KVM_CLOCK_TSC_STABLE
> > in the documentation to state that it only guarantees invariance if the guest
> > doesn't mess with its own TSC.
> > 
> > Also I think we should consider enabling the X86_FEATURE_TSC_RELIABLE
> > in the guest kernel, when kvm is detected to avoid the guest even from trying
> > to sync TSC on newly hotplugged vCPUs.
> 
> See 7539b174aef405d9d57db48c58390ba360c91312.

I know about this, and I personally always use invtsc
with my VMs.
>  
> 
> Was hoping to make that (-cpu xxx,+invtsc) the default in QEMU once invariant TSC code
> becomes stable. Should be tested enough by now?

The issue is that Qemu blocks migration when invtsc is set, based on the
fact that the target machine might have different TSC frequency and no
support for TSC scaling.
There was a long debate on this long ago.

It is possible though to override this by specifying the exact frequency
you want the guest TSC to run at, by using something like
(tsc-frequency=3500000000)
I haven't checked if libvirt does this or not.

I do think that as long as the user uses modern CPUs (which have stable TSC
and support TSC scaling), there is no reason to disable invtsc, and
therefore no reason to use kvmclock.

> 
> > (The guest doesn't end up touching TSC_ADJUST usually, but it still might
> > in some cases due to scheduling of guest vCPUs)
> > 
> > (X86_FEATURE_TSC_RELIABLE short circuits tsc synchronization on CPU hotplug,
> > and TSC clocksource watchdog, and the later we might want to keep).
> 
> The latter we want to keep.
> 
> > For host TSC writes, just as Paulo proposed we can still do the tsc sync,
> > unless the new code that I implemented is in use.
> 
> So Paolo's proposal is to
> 
> "- for live migration, userspace is expected to use the new
> KVM_GET/SET_TSC_PRECISE (or whatever the name will be) to get/set a
> (nanosecond, TSC, TSC_ADJUST) tuple."
> 
> Makes sense, so that no time between KVM_SET_TSC and
> MSR_WRITE(TSC_ADJUST) elapses, which would cause the TSC to go out
> of what is desired by the user.
> 
> Since you are proposing this new ioctl, perhaps its useful to also
> reduce the 100ms jump? 

Yep. As long as target and destantion clocks are synchronized,
it should make it better.

> 
> "- for live migration, userspace is expected to use the new
> KVM_GET/SET_TSC_PRECISE (or whatever the name will be) to get/set a
> (nanosecond, TSC, TSC_ADJUST) tuple. This value will be written
> to the guest before the first VM-entry"
> 
> Sounds like a good idea (to integrate the values in a tuple).
> 
> > Few more random notes:
> > 
> > I have a weird feeling about using 'nsec since 1 January 1970'.
> > Common sense is telling me that a 64 bit value can hold about 580 years,
> > but still I see that it is more common to use timespec which is a (sec,nsec) pair.
> 
>            struct timespec {
>                time_t   tv_sec;        /* seconds */
>                long     tv_nsec;       /* nanoseconds */
>            };
> 
> > I feel that 'kvm_get_walltime' that I added is a bit of a hack.
> > Some refactoring might improve things here.
> 
> Haven't read the patchset yet...
> 
> > For example making kvm_get_walltime_and_clockread work in non tsc case as well
> > might make the code cleaner.
> > 
> > Patches to enable this feature in qemu are in process of being sent to
> > qemu-devel mailing list.
> > 
> > Best regards,
> >        Maxim Levitsky
> > 
> > Maxim Levitsky (2):
> >   KVM: x86: implement KVM_SET_TSC_PRECISE/KVM_GET_TSC_PRECISE
> >   KVM: x86: introduce KVM_X86_QUIRK_TSC_HOST_ACCESS
> > 
> >  Documentation/virt/kvm/api.rst  | 56 +++++++++++++++++++++
> >  arch/x86/include/uapi/asm/kvm.h |  1 +
> >  arch/x86/kvm/x86.c              | 88 +++++++++++++++++++++++++++++++--
> >  include/uapi/linux/kvm.h        | 14 ++++++
> >  4 files changed, 154 insertions(+), 5 deletions(-)
> > 
> > -- 
> > 2.26.2
> > 

Best regards,
	Maxim Levitsky



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] RFC: Precise TSC migration
  2020-11-30 19:16 ` Marcelo Tosatti
  2020-12-01 12:30   ` Maxim Levitsky
@ 2020-12-01 13:48   ` Thomas Gleixner
  2020-12-01 15:02     ` Marcelo Tosatti
  2020-12-01 14:01   ` Thomas Gleixner
  2 siblings, 1 reply; 30+ messages in thread
From: Thomas Gleixner @ 2020-12-01 13:48 UTC (permalink / raw)
  To: Marcelo Tosatti, Maxim Levitsky
  Cc: kvm, Paolo Bonzini, Oliver Upton, Ingo Molnar,
	Sean Christopherson, open list, Jonathan Corbet, Wanpeng Li,
	Borislav Petkov, Jim Mattson, H. Peter Anvin,
	open list:DOCUMENTATION, Joerg Roedel,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Vitaly Kuznetsov

On Mon, Nov 30 2020 at 16:16, Marcelo Tosatti wrote:
>> Besides, Linux guests don't sync the TSC via IA32_TSC write,
>> but rather use IA32_TSC_ADJUST which currently doesn't participate
>> in the tsc sync heruistics.
>
> Linux should not try to sync the TSC with IA32_TSC_ADJUST. It expects
> the BIOS to boot with synced TSCs.

That's wishful thinking.

Reality is that BIOS tinkerers fail to get it right. TSC_ADJUST allows
us to undo the wreckage they create.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] RFC: Precise TSC migration
  2020-11-30 19:16 ` Marcelo Tosatti
  2020-12-01 12:30   ` Maxim Levitsky
  2020-12-01 13:48   ` Thomas Gleixner
@ 2020-12-01 14:01   ` Thomas Gleixner
  2020-12-01 16:19     ` Andy Lutomirski
  2 siblings, 1 reply; 30+ messages in thread
From: Thomas Gleixner @ 2020-12-01 14:01 UTC (permalink / raw)
  To: Marcelo Tosatti, Maxim Levitsky
  Cc: kvm, Paolo Bonzini, Oliver Upton, Ingo Molnar,
	Sean Christopherson, open list, Jonathan Corbet, Wanpeng Li,
	Borislav Petkov, Jim Mattson, H. Peter Anvin,
	open list:DOCUMENTATION, Joerg Roedel,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Vitaly Kuznetsov

On Mon, Nov 30 2020 at 16:16, Marcelo Tosatti wrote:
> Not really. The synchronization logic tries to sync TSCs during
> BIOS boot (and CPU hotplug), because the TSC values are loaded
> sequentially, say:
>
> CPU		realtime	TSC val
> vcpu0		0 usec		0
> vcpu1		100 usec	0
> vcpu2		200 usec	0

That's nonsense, really.

> And we'd like to see all vcpus to read the same value at all times.

Providing guests with a synchronized and stable TSC on a host with a
synchronized and stable TSC is trivial.

Write the _same_ TSC offset to _all_ vcpu control structs and be done
with it. It's not rocket science.

The guest TSC read is:

    hostTSC + vcpu_offset

So if the host TSC is synchronized then the guest TSCs are synchronized
as well.

If the host TSC is not synchronized, then don't even try.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] RFC: Precise TSC migration
  2020-12-01 13:48   ` Thomas Gleixner
@ 2020-12-01 15:02     ` Marcelo Tosatti
  2020-12-03 11:51       ` Maxim Levitsky
  0 siblings, 1 reply; 30+ messages in thread
From: Marcelo Tosatti @ 2020-12-01 15:02 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Maxim Levitsky, kvm, Paolo Bonzini, Oliver Upton, Ingo Molnar,
	Sean Christopherson, open list, Jonathan Corbet, Wanpeng Li,
	Borislav Petkov, Jim Mattson, H. Peter Anvin,
	open list:DOCUMENTATION, Joerg Roedel,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Vitaly Kuznetsov

On Tue, Dec 01, 2020 at 02:48:11PM +0100, Thomas Gleixner wrote:
> On Mon, Nov 30 2020 at 16:16, Marcelo Tosatti wrote:
> >> Besides, Linux guests don't sync the TSC via IA32_TSC write,
> >> but rather use IA32_TSC_ADJUST which currently doesn't participate
> >> in the tsc sync heruistics.
> >
> > Linux should not try to sync the TSC with IA32_TSC_ADJUST. It expects
> > the BIOS to boot with synced TSCs.
> 
> That's wishful thinking.
> 
> Reality is that BIOS tinkerers fail to get it right. TSC_ADJUST allows
> us to undo the wreckage they create.
> 
> Thanks,
> 
>         tglx

Have not seen any multicore Dell/HP systems require that.

Anyway, for QEMU/KVM it should be synced (unless there is a bug
in the sync logic in the first place).


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] RFC: Precise TSC migration
  2020-12-01 14:01   ` Thomas Gleixner
@ 2020-12-01 16:19     ` Andy Lutomirski
  2020-12-03 11:57       ` Maxim Levitsky
  0 siblings, 1 reply; 30+ messages in thread
From: Andy Lutomirski @ 2020-12-01 16:19 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Marcelo Tosatti, Maxim Levitsky, kvm, Paolo Bonzini,
	Oliver Upton, Ingo Molnar, Sean Christopherson, open list,
	Jonathan Corbet, Wanpeng Li, Borislav Petkov, Jim Mattson,
	H. Peter Anvin, open list:DOCUMENTATION, Joerg Roedel,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Vitaly Kuznetsov



> On Dec 1, 2020, at 6:01 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Mon, Nov 30 2020 at 16:16, Marcelo Tosatti wrote:
>> Not really. The synchronization logic tries to sync TSCs during
>> BIOS boot (and CPU hotplug), because the TSC values are loaded
>> sequentially, say:
>> 
>> CPU        realtime    TSC val
>> vcpu0        0 usec        0
>> vcpu1        100 usec    0
>> vcpu2        200 usec    0
> 
> That's nonsense, really.
> 
>> And we'd like to see all vcpus to read the same value at all times.
> 
> Providing guests with a synchronized and stable TSC on a host with a
> synchronized and stable TSC is trivial.
> 
> Write the _same_ TSC offset to _all_ vcpu control structs and be done
> with it. It's not rocket science.
> 
> The guest TSC read is:
> 
>    hostTSC + vcpu_offset
> 
> So if the host TSC is synchronized then the guest TSCs are synchronized
> as well.
> 
> If the host TSC is not synchronized, then don't even try.

This reminds me: if you’re adding a new kvm feature that tells the guest that the TSC works well, could you perhaps only have one structure for all vCPUs in the same guest?

> 
> Thanks,
> 
>        tglx

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] RFC: Precise TSC migration
  2020-11-30 13:35 [PATCH 0/2] RFC: Precise TSC migration Maxim Levitsky
                   ` (4 preceding siblings ...)
  2020-11-30 19:16 ` Marcelo Tosatti
@ 2020-12-01 19:35 ` Thomas Gleixner
  2020-12-03 11:41   ` Paolo Bonzini
  2020-12-03 12:47   ` Maxim Levitsky
  5 siblings, 2 replies; 30+ messages in thread
From: Thomas Gleixner @ 2020-12-01 19:35 UTC (permalink / raw)
  To: Maxim Levitsky, kvm
  Cc: Paolo Bonzini, Oliver Upton, Ingo Molnar, Sean Christopherson,
	open list, Marcelo Tosatti, Jonathan Corbet, Wanpeng Li,
	Borislav Petkov, Jim Mattson, H. Peter Anvin,
	open list:DOCUMENTATION, Joerg Roedel,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Vitaly Kuznetsov, Maxim Levitsky

On Mon, Nov 30 2020 at 15:35, Maxim Levitsky wrote:
> The idea of masterclock is that when the host TSC is synchronized
> (or as kernel call it, stable), and the guest TSC is synchronized as well,
> then we can base the kvmclock, on the same pair of
> (host time in nsec, host tsc value), for all vCPUs.

That's correct. All guest CPUs should see exactly the same TSC value,
i.e.

        hostTSC + vcpu_offset

> This makes the random error in calculation of this value invariant
> across vCPUS, and allows the guest to do kvmclock calculation in userspace
> (vDSO) since kvmclock parameters are vCPU invariant.

That's not the case today? OMG!

> To ensure that the guest tsc is synchronized we currently track host/guest tsc
> writes, and enable the master clock only when roughly the same guest's TSC value
> was written across all vCPUs.

The Linux kernel never writes the TSC. We've tried that ~15 years ago
and it was a total disaster.

> Recently this was disabled by Paulo and I agree with this, because I think
> that we indeed should only make the guest TSC synchronized by default
> (including new hotplugged vCPUs) and not do any tsc synchronization beyond that.
> (Trying to guess when the guest syncs the TSC can cause more harm that good).
>
> Besides, Linux guests don't sync the TSC via IA32_TSC write,
> but rather use IA32_TSC_ADJUST which currently doesn't participate
> in the tsc sync heruistics.

The kernel only writes TSC_ADJUST when it is advertised in CPUID and:

    1) when the boot CPU detects a non-zero TSC_ADJUST value it writes
       it to 0, except when running on SGI-UV

    2) when a starting CPU has a different TSC_ADJUST value than the
       first CPU which came up on the same socket.

    3) When the first CPU of a different socket is starting and the TSC
       synchronization check fails against a CPU on an already checked
       socket then the kernel tries to adjust TSC_ADJUST to the point
       that the synchronization check does not fail anymore.

> And as far as I know, Linux guest is the primary (only?) user of the kvmclock.
>
> I *do think* however that we should redefine KVM_CLOCK_TSC_STABLE
> in the documentation to state that it only guarantees invariance if the guest
> doesn't mess with its own TSC.
>
> Also I think we should consider enabling the X86_FEATURE_TSC_RELIABLE
> in the guest kernel, when kvm is detected to avoid the guest even from trying
> to sync TSC on newly hotplugged vCPUs.
>
> (The guest doesn't end up touching TSC_ADJUST usually, but it still might
> in some cases due to scheduling of guest vCPUs)

The only cases it would try to write are #3 above or because the
hypervisor or BIOS messed it up (#1, #2).

> (X86_FEATURE_TSC_RELIABLE short circuits tsc synchronization on CPU hotplug,
> and TSC clocksource watchdog, and the later we might want to keep).

Depends. If the host TSC is stable and synchronized, then you don't need
the TSC watchdog. We are slowly starting to trust the TSC to some extent
and phase out the watchdog for newer parts (hopefully).

If the host TSC really falls apart then it still can invalidate
KVM_CLOCK and force the guest to reevaluate the situation.

> Few more random notes:
>
> I have a weird feeling about using 'nsec since 1 January 1970'.
> Common sense is telling me that a 64 bit value can hold about 580
> years,

which is plenty.

> but still I see that it is more common to use timespec which is a
> (sec,nsec) pair.

timespecs are horrible.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/2] KVM: x86: implement KVM_SET_TSC_PRECISE/KVM_GET_TSC_PRECISE
  2020-11-30 13:35 ` [PATCH 1/2] KVM: x86: implement KVM_SET_TSC_PRECISE/KVM_GET_TSC_PRECISE Maxim Levitsky
  2020-11-30 14:33   ` Paolo Bonzini
@ 2020-12-01 19:43   ` Thomas Gleixner
  2020-12-03 11:11     ` Maxim Levitsky
  1 sibling, 1 reply; 30+ messages in thread
From: Thomas Gleixner @ 2020-12-01 19:43 UTC (permalink / raw)
  To: Maxim Levitsky, kvm
  Cc: Paolo Bonzini, Oliver Upton, Ingo Molnar, Sean Christopherson,
	open list, Marcelo Tosatti, Jonathan Corbet, Wanpeng Li,
	Borislav Petkov, Jim Mattson, H. Peter Anvin,
	open list:DOCUMENTATION, Joerg Roedel,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Vitaly Kuznetsov, Maxim Levitsky

On Mon, Nov 30 2020 at 15:35, Maxim Levitsky wrote:
> +  struct kvm_tsc_info {
> +	__u32 flags;
> +	__u64 nsec;
> +	__u64 tsc;
> +	__u64 tsc_adjust;
> +  };
> +
> +flags values for ``struct kvm_tsc_info``:
> +
> +``KVM_TSC_INFO_TSC_ADJUST_VALID``
> +
> +  ``tsc_adjust`` contains valid IA32_TSC_ADJUST value

Why exposing TSC_ADJUST at all? Just because?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] RFC: Precise TSC migration
  2020-12-01 12:30   ` Maxim Levitsky
@ 2020-12-01 19:48     ` Marcelo Tosatti
  2020-12-03 11:39       ` Maxim Levitsky
  0 siblings, 1 reply; 30+ messages in thread
From: Marcelo Tosatti @ 2020-12-01 19:48 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: kvm, Paolo Bonzini, Oliver Upton, Ingo Molnar,
	Sean Christopherson, Thomas Gleixner, open list, Jonathan Corbet,
	Wanpeng Li, Borislav Petkov, Jim Mattson, H. Peter Anvin,
	open list:DOCUMENTATION, Joerg Roedel,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Vitaly Kuznetsov

On Tue, Dec 01, 2020 at 02:30:39PM +0200, Maxim Levitsky wrote:
> On Mon, 2020-11-30 at 16:16 -0300, Marcelo Tosatti wrote:
> > Hi Maxim,
> > 
> > On Mon, Nov 30, 2020 at 03:35:57PM +0200, Maxim Levitsky wrote:
> > > Hi!
> > > 
> > > This is the first version of the work to make TSC migration more accurate,
> > > as was defined by Paulo at:
> > > https://www.spinics.net/lists/kvm/msg225525.html
> > 
> > Description from Oliver's patch:
> > 
> > "To date, VMMs have typically restored the guest's TSCs by value using
> > the KVM_SET_MSRS ioctl for each vCPU. However, restoring the TSCs by
> > value introduces some challenges with synchronization as the TSCs
> > continue to tick throughout the restoration process. As such, KVM has
> > some heuristics around TSC writes to infer whether or not the guest or
> > host is attempting to synchronize the TSCs."
> > 
> > Not really. The synchronization logic tries to sync TSCs during
> > BIOS boot (and CPU hotplug), because the TSC values are loaded
> > sequentially, say:
> > 
> > CPU		realtime	TSC val
> > vcpu0		0 usec		0
> > vcpu1		100 usec	0
> > vcpu2		200 usec	0
> > ...
> > 
> > And we'd like to see all vcpus to read the same value at all times.
> > 
> > Other than that, comment makes sense. The problem with live migration
> > is as follows:
> > 
> > We'd like the TSC value to be written, ideally, just before the first
> > VM-entry a vCPU (because at the moment the TSC_OFFSET has been written, 
> > the vcpus tsc is ticking, which will cause a visible forward jump
> > in vcpus tsc time).
> > 
> > Before the first VM-entry is the farthest point in time before guest
> > entry that one could do that.
> > 
> > The window (or forward jump) between KVM_SET_TSC and VM-entry was about
> > 100ms last time i checked (which results in a 100ms time jump forward), 
> > See QEMU's 6053a86fe7bd3d5b07b49dae6c05f2cd0d44e687.
> > 
> > Have we measured any improvement with this patchset?
> 
> Its not about this window. 
> It is about time that passes between the point that we read the 
> TSC on source system (and we do it in qemu each time the VM is paused) 
> and the moment that we set the same TSC value on the target. 
> That time is unbounded.

OK. Well, its the same problem: ideally you'd want to do that just
before VCPU-entry.

> Also this patchset should decrease TSC skew that happens
> between restoring it on multiple vCPUs as well, since 
> KVM_SET_TSC_STATE doesn't have to happen at the same time,
> as it accounts for time passed on each vCPU.
> 
> 
> Speaking of kvmclock, somewhat offtopic since this is a different issue,
> I found out that qemu reads the kvmclock value on each pause, 
> and then 'restores' on unpause, using
> KVM_SET_CLOCK (this modifies the global kvmclock offset)
> 
> This means (and I tested it) that if guest uses kvmclock
> for time reference, it will not account for time passed in
> the paused state.

Yes, this is necessary because otherwise there might be an overflow
in the kernel time accounting code (if the clock delta is too large).

> 
> > 
> > Then Paolo mentions (with >), i am replying as usual.
> > 
> > > Ok, after looking more at the code with Maxim I can confidently say that
> > > it's a total mess.  And a lot of the synchronization code is dead
> > > because 1) as far as we could see no guest synchronizes the TSC using
> > > MSR_IA32_TSC; 
> > 
> > Well, recent BIOS'es take care of synchronizing the TSC. So when Linux
> > boots, it does not have to synchronize TSC in software. 
> 
> Do you have an example of such BIOS? I tested OVMF which I compiled
> from git master a few weeks ago, and I also tested this with seabios 
> from qemu repo, and I have never seen writes to either TSC, or TSC_ADJUST
> from BIOS.

Oh, well, QEMU then.

> Or do you refer to the native BIOS on the host doing TSC synchronization?

No, virt.

> > However, upon migration (and initialization), the KVM_SET_TSC's do 
> > not happen at exactly the same time (the MSRs for each vCPU are loaded
> > in sequence). The synchronization code in kvm_set_tsc() is for those cases.
> 
> I agree with that, and this is one of the issues that KVM_SET_TSC_STATE
> is going to fix, since it accounts for it.
> 
> 
> > 
> > > and 2) writing to MSR_IA32_TSC_ADJUST does not trigger the
> > > synchronization code in kvm_write_tsc.
> > 
> > Not familiar how guests are using MSR_IA32_TSC_ADJUST (or Linux)...
> > Lets see:
> > 
> > 
> > /*
> >  * Freshly booted CPUs call into this:
> >  */
> > void check_tsc_sync_target(void)
> > {
> >         struct tsc_adjust *cur = this_cpu_ptr(&tsc_adjust);
> >         unsigned int cpu = smp_processor_id();
> >         cycles_t cur_max_warp, gbl_max_warp;
> >         int cpus = 2;
> > 
> >         /* Also aborts if there is no TSC. */
> >         if (unsynchronized_tsc())
> >                 return;
> > 
> >         /*
> >          * Store, verify and sanitize the TSC adjust register. If
> >          * successful skip the test.
> >          *
> >          * The test is also skipped when the TSC is marked reliable. This
> >          * is true for SoCs which have no fallback clocksource. On these
> >          * SoCs the TSC is frequency synchronized, but still the TSC ADJUST
> >          * register might have been wreckaged by the BIOS..
> >          */
> >         if (tsc_store_and_check_tsc_adjust(false) || tsc_clocksource_reliable) {
> >                 atomic_inc(&skip_test);
> >                 return;
> >         }
> > 
> > retry:
> > 
> > I'd force that synchronization path to be taken as a test-case.
> 
> Or even better as I suggested, we might tell the guest kernel
> to avoid this synchronization path when KVM is detected
> (regardless of invtsc flag)
> 
> > 
> > 
> > > I have a few thoughts about the kvm masterclock synchronization,
> > > which relate to the Paulo's proposal that I implemented.
> > > 
> > > The idea of masterclock is that when the host TSC is synchronized
> > > (or as kernel call it, stable), and the guest TSC is synchronized as well,
> > > then we can base the kvmclock, on the same pair of
> > > (host time in nsec, host tsc value), for all vCPUs.
> > 
> > We _have_ to base. See the comment which starts with
> > 
> > "Assuming a stable TSC across physical CPUS, and a stable TSC"
> > 
> > at x86.c.
> > 
> > > This makes the random error in calculation of this value invariant
> > > across vCPUS, and allows the guest to do kvmclock calculation in userspace
> > > (vDSO) since kvmclock parameters are vCPU invariant.
> > 
> > Actually, without synchronized host TSCs (and the masterclock scheme,
> > with a single base read from a vCPU), kvmclock in kernel is buggy as
> > well:
> > 
> > u64 pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
> > {
> >         unsigned version;
> >         u64 ret;
> >         u64 last;
> >         u8 flags;
> > 
> >         do {
> >                 version = pvclock_read_begin(src);
> >                 ret = __pvclock_read_cycles(src, rdtsc_ordered());
> >                 flags = src->flags;
> >         } while (pvclock_read_retry(src, version));
> > 
> >         if (unlikely((flags & PVCLOCK_GUEST_STOPPED) != 0)) {
> >                 src->flags &= ~PVCLOCK_GUEST_STOPPED;
> >                 pvclock_touch_watchdogs();
> >         }
> > 
> >         if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) &&
> >                 (flags & PVCLOCK_TSC_STABLE_BIT))
> >                 return ret;
> > 
> > The code that follows this (including cmpxchg) is a workaround for that 
> > bug.
> 
> I understand that. I am not arguing that we shoudn't use the masterclock!
> I am just saying the facts about the condition when it works.

Sure.

> > Workaround would require each vCPU to write to a "global clock", on
> > every clock read.
> > 
> > > To ensure that the guest tsc is synchronized we currently track host/guest tsc
> > > writes, and enable the master clock only when roughly the same guest's TSC value
> > > was written across all vCPUs.
> > 
> > Yes, because then you can do:
> > 
> > vcpu0				vcpu1
> > 
> > A = read TSC
> > 		... elapsed time ...
> > 
> > 				B = read TSC
> > 
> > 				delta = B - A
> > 
> > > Recently this was disabled by Paulo
> > 
> > What was disabled exactly?
> 
> The running of tsc synchronization code when the _guest_ writes the TSC.
> 
> Which changes two things:
>    1. If the guest de-synchronizes its TSC, we won't disable master clock.
>    2. If the guest writes similar TSC values on each vCPU we won't detect
>       this as synchronization attempt, replace this with exactly the same
>       value and finally re-enable the master clock.
> 
> I argue that this change is OK, because Linux guests don't write to TSC at all,
> the virtual BIOSes seems not to write there either, and the only case in which
> the Linux guest tries to change its TSC is on CPU hotplug as you mention and 
> it uses TSC_ADJUST, that currently doesn't trigger TSC synchronization code in
> KVM anyway, so it is broken already.
> 
> However I also argue that we should mention this in documentation just in case,
> and we might also want (also just in case) to make Linux guests avoid even trying to
> touch TSC_ADJUST register when running under KVM.
> 
> To rehash my own words, the KVM_CLOCK_TSC_STABLE should be defined as:
> 'kvmclock is vCPU invariant, as long as the guest doesn't mess with its TSC'.
> 
> Having said all that, now that I know tsc sync code, and the
> reasons why it is there, I wouldn't be arguing about putting it back either.

Agree.

> > > and I agree with this, because I think
> > > that we indeed should only make the guest TSC synchronized by default
> > > (including new hotplugged vCPUs) and not do any tsc synchronization beyond that.
> > > (Trying to guess when the guest syncs the TSC can cause more harm that good).
> > > 
> > > Besides, Linux guests don't sync the TSC via IA32_TSC write,
> > > but rather use IA32_TSC_ADJUST which currently doesn't participate
> > > in the tsc sync heruistics.
> > 
> > Linux should not try to sync the TSC with IA32_TSC_ADJUST. It expects
> > the BIOS to boot with synced TSCs.
> > 
> > So i wonder what is making it attempt TSC sync in the first place?
> 
> CPU hotplug. And the guest doesn't really write to TSC_ADJUST 
> since it's measurement code doesn't detect any tsc warps. 
>  
> I was just thinking that in theory since, this is a VM, and it can be 
> interrupted at any point, the measurement code should sometimes fall,
> and cause trouble.
> I didn't do much homework on this so I might be overreacting.

That is true (and you can see it with a CPU starved guest).

> As far as I see X86_FEATURE_TSC_RELIABLE was done mostly to support
> running under Hyper-V and VMWARE, and these should be prone to similar
> issues, supporting my theory.
> 
> > 
> > (one might also want to have Linux's synchronization via IA32_TSC_ADJUST 
> > working, but it should not need to happen in the first place, as long as 
> > QEMU and KVM are behaving properly).
> > 
> > > And as far as I know, Linux guest is the primary (only?) user of the kvmclock.
> > 
> > Only AFAIK.
> > 
> > > I *do think* however that we should redefine KVM_CLOCK_TSC_STABLE
> > > in the documentation to state that it only guarantees invariance if the guest
> > > doesn't mess with its own TSC.
> > > 
> > > Also I think we should consider enabling the X86_FEATURE_TSC_RELIABLE
> > > in the guest kernel, when kvm is detected to avoid the guest even from trying
> > > to sync TSC on newly hotplugged vCPUs.
> > 
> > See 7539b174aef405d9d57db48c58390ba360c91312.
> 
> I know about this, and I personally always use invtsc
> with my VMs.

Well, we can't make it (-cpu xxx,+invtsc) the default if vm-stop/vm-cont are unstable
with TSC!

> > Was hoping to make that (-cpu xxx,+invtsc) the default in QEMU once invariant TSC code
> > becomes stable. Should be tested enough by now?
> 
> The issue is that Qemu blocks migration when invtsc is set, based on the
> fact that the target machine might have different TSC frequency and no
> support for TSC scaling.
> There was a long debate on this long ago.

Oh right.

> It is possible though to override this by specifying the exact frequency
> you want the guest TSC to run at, by using something like
> (tsc-frequency=3500000000)
> I haven't checked if libvirt does this or not.

It does.

> I do think that as long as the user uses modern CPUs (which have stable TSC
> and support TSC scaling), there is no reason to disable invtsc, and
> therefore no reason to use kvmclock.

Yep. TSC is faster.

> > > (The guest doesn't end up touching TSC_ADJUST usually, but it still might
> > > in some cases due to scheduling of guest vCPUs)
> > > 
> > > (X86_FEATURE_TSC_RELIABLE short circuits tsc synchronization on CPU hotplug,
> > > and TSC clocksource watchdog, and the later we might want to keep).
> > 
> > The latter we want to keep.
> > 
> > > For host TSC writes, just as Paulo proposed we can still do the tsc sync,
> > > unless the new code that I implemented is in use.
> > 
> > So Paolo's proposal is to
> > 
> > "- for live migration, userspace is expected to use the new
> > KVM_GET/SET_TSC_PRECISE (or whatever the name will be) to get/set a
> > (nanosecond, TSC, TSC_ADJUST) tuple."
> > 
> > Makes sense, so that no time between KVM_SET_TSC and
> > MSR_WRITE(TSC_ADJUST) elapses, which would cause the TSC to go out
> > of what is desired by the user.
> > 
> > Since you are proposing this new ioctl, perhaps its useful to also
> > reduce the 100ms jump? 
> 
> Yep. As long as target and destantion clocks are synchronized,
> it should make it better.
> 
> > 
> > "- for live migration, userspace is expected to use the new
> > KVM_GET/SET_TSC_PRECISE (or whatever the name will be) to get/set a
> > (nanosecond, TSC, TSC_ADJUST) tuple. This value will be written
> > to the guest before the first VM-entry"
> > 
> > Sounds like a good idea (to integrate the values in a tuple).
> > 
> > > Few more random notes:
> > > 
> > > I have a weird feeling about using 'nsec since 1 January 1970'.
> > > Common sense is telling me that a 64 bit value can hold about 580 years,
> > > but still I see that it is more common to use timespec which is a (sec,nsec) pair.
> > 
> >            struct timespec {
> >                time_t   tv_sec;        /* seconds */
> >                long     tv_nsec;       /* nanoseconds */
> >            };
> > 
> > > I feel that 'kvm_get_walltime' that I added is a bit of a hack.
> > > Some refactoring might improve things here.
> > 
> > Haven't read the patchset yet...
> > 
> > > For example making kvm_get_walltime_and_clockread work in non tsc case as well
> > > might make the code cleaner.
> > > 
> > > Patches to enable this feature in qemu are in process of being sent to
> > > qemu-devel mailing list.
> > > 
> > > Best regards,
> > >        Maxim Levitsky
> > > 
> > > Maxim Levitsky (2):
> > >   KVM: x86: implement KVM_SET_TSC_PRECISE/KVM_GET_TSC_PRECISE
> > >   KVM: x86: introduce KVM_X86_QUIRK_TSC_HOST_ACCESS
> > > 
> > >  Documentation/virt/kvm/api.rst  | 56 +++++++++++++++++++++
> > >  arch/x86/include/uapi/asm/kvm.h |  1 +
> > >  arch/x86/kvm/x86.c              | 88 +++++++++++++++++++++++++++++++--
> > >  include/uapi/linux/kvm.h        | 14 ++++++
> > >  4 files changed, 154 insertions(+), 5 deletions(-)
> > > 
> > > -- 
> > > 2.26.2
> > > 
> 
> Best regards,
> 	Maxim Levitsky
> 


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/2] KVM: x86: implement KVM_SET_TSC_PRECISE/KVM_GET_TSC_PRECISE
  2020-12-01 19:43   ` Thomas Gleixner
@ 2020-12-03 11:11     ` Maxim Levitsky
  0 siblings, 0 replies; 30+ messages in thread
From: Maxim Levitsky @ 2020-12-03 11:11 UTC (permalink / raw)
  To: Thomas Gleixner, kvm
  Cc: Paolo Bonzini, Oliver Upton, Ingo Molnar, Sean Christopherson,
	open list, Marcelo Tosatti, Jonathan Corbet, Wanpeng Li,
	Borislav Petkov, Jim Mattson, H. Peter Anvin,
	open list:DOCUMENTATION, Joerg Roedel,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Vitaly Kuznetsov

On Tue, 2020-12-01 at 20:43 +0100, Thomas Gleixner wrote:
> On Mon, Nov 30 2020 at 15:35, Maxim Levitsky wrote:
> > +  struct kvm_tsc_info {
> > +	__u32 flags;
> > +	__u64 nsec;
> > +	__u64 tsc;
> > +	__u64 tsc_adjust;
> > +  };
> > +
> > +flags values for ``struct kvm_tsc_info``:
> > +
> > +``KVM_TSC_INFO_TSC_ADJUST_VALID``
> > +
> > +  ``tsc_adjust`` contains valid IA32_TSC_ADJUST value
> 
> Why exposing TSC_ADJUST at all? Just because?

It's because we want to reduce the number of cases where
KVM's msr/read write behavior differs between guest and host 
(e.g qemu) writes.
 
TSC and TSC_ADJUST are tied on architectural level, such as
chang
ing one, changes the other.
 
However for the migration to work we must be able 
to set each one separately.

Currently, KVM does this by turning the host write to 
TSC_ADJUST into a special case that bypasses
the actual TSC adjustment, and just sets this MSR.
 
The next patch in this series, will allow to disable
this special behavior, making host TSC_ADJUST write
work the same way as in guest.

Therefore to still allow to set TSC_ADJUST and TSC independently
after migration this ioctl will be used instead.

Best regards,
	Maxim Levitsky

> 
> Thanks,
> 
>         tglx
> 



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] RFC: Precise TSC migration
  2020-12-01 19:48     ` Marcelo Tosatti
@ 2020-12-03 11:39       ` Maxim Levitsky
  2020-12-03 20:18         ` Marcelo Tosatti
  0 siblings, 1 reply; 30+ messages in thread
From: Maxim Levitsky @ 2020-12-03 11:39 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: kvm, Paolo Bonzini, Oliver Upton, Ingo Molnar,
	Sean Christopherson, Thomas Gleixner, open list, Jonathan Corbet,
	Wanpeng Li, Borislav Petkov, Jim Mattson, H. Peter Anvin,
	open list:DOCUMENTATION, Joerg Roedel,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Vitaly Kuznetsov

On Tue, 2020-12-01 at 16:48 -0300, Marcelo Tosatti wrote:
> On Tue, Dec 01, 2020 at 02:30:39PM +0200, Maxim Levitsky wrote:
> > On Mon, 2020-11-30 at 16:16 -0300, Marcelo Tosatti wrote:
> > > Hi Maxim,
> > > 
> > > On Mon, Nov 30, 2020 at 03:35:57PM +0200, Maxim Levitsky wrote:
> > > > Hi!
> > > > 
> > > > This is the first version of the work to make TSC migration more accurate,
> > > > as was defined by Paulo at:
> > > > https://www.spinics.net/lists/kvm/msg225525.html
> > > 
> > > Description from Oliver's patch:
> > > 
> > > "To date, VMMs have typically restored the guest's TSCs by value using
> > > the KVM_SET_MSRS ioctl for each vCPU. However, restoring the TSCs by
> > > value introduces some challenges with synchronization as the TSCs
> > > continue to tick throughout the restoration process. As such, KVM has
> > > some heuristics around TSC writes to infer whether or not the guest or
> > > host is attempting to synchronize the TSCs."
> > > 
> > > Not really. The synchronization logic tries to sync TSCs during
> > > BIOS boot (and CPU hotplug), because the TSC values are loaded
> > > sequentially, say:
> > > 
> > > CPU		realtime	TSC val
> > > vcpu0		0 usec		0
> > > vcpu1		100 usec	0
> > > vcpu2		200 usec	0
> > > ...
> > > 
> > > And we'd like to see all vcpus to read the same value at all times.
> > > 
> > > Other than that, comment makes sense. The problem with live migration
> > > is as follows:
> > > 
> > > We'd like the TSC value to be written, ideally, just before the first
> > > VM-entry a vCPU (because at the moment the TSC_OFFSET has been written, 
> > > the vcpus tsc is ticking, which will cause a visible forward jump
> > > in vcpus tsc time).
> > > 
> > > Before the first VM-entry is the farthest point in time before guest
> > > entry that one could do that.
> > > 
> > > The window (or forward jump) between KVM_SET_TSC and VM-entry was about
> > > 100ms last time i checked (which results in a 100ms time jump forward), 
> > > See QEMU's 6053a86fe7bd3d5b07b49dae6c05f2cd0d44e687.
> > > 
> > > Have we measured any improvement with this patchset?
> > 
> > Its not about this window. 
> > It is about time that passes between the point that we read the 
> > TSC on source system (and we do it in qemu each time the VM is paused) 
> > and the moment that we set the same TSC value on the target. 
> > That time is unbounded.
> 
> OK. Well, its the same problem: ideally you'd want to do that just
> before VCPU-entry.
> 
> > Also this patchset should decrease TSC skew that happens
> > between restoring it on multiple vCPUs as well, since 
> > KVM_SET_TSC_STATE doesn't have to happen at the same time,
> > as it accounts for time passed on each vCPU.
> > 
> > 
> > Speaking of kvmclock, somewhat offtopic since this is a different issue,
> > I found out that qemu reads the kvmclock value on each pause, 
> > and then 'restores' on unpause, using
> > KVM_SET_CLOCK (this modifies the global kvmclock offset)
> > 
> > This means (and I tested it) that if guest uses kvmclock
> > for time reference, it will not account for time passed in
> > the paused state.
> 
> Yes, this is necessary because otherwise there might be an overflow
> in the kernel time accounting code (if the clock delta is too large).

Could you elaborate on this? Do you mean that guest kernel can crash,
when the time 'jumps' too far forward in one go?

If so this will happen with kernel using TSC as well, 
since we do let the virtual TSC to 'keep running' while VM is suspended, 
and the goal of this patchset is to let it 'run' even while
the VM is migrating.

And if there is an issue, we really should try to fix it in
the guest kernel IMHO.

> 
> > > Then Paolo mentions (with >), i am replying as usual.
> > > 
> > > > Ok, after looking more at the code with Maxim I can confidently say that
> > > > it's a total mess.  And a lot of the synchronization code is dead
> > > > because 1) as far as we could see no guest synchronizes the TSC using
> > > > MSR_IA32_TSC; 
> > > 
> > > Well, recent BIOS'es take care of synchronizing the TSC. So when Linux
> > > boots, it does not have to synchronize TSC in software. 
> > 
> > Do you have an example of such BIOS? I tested OVMF which I compiled
> > from git master a few weeks ago, and I also tested this with seabios 
> > from qemu repo, and I have never seen writes to either TSC, or TSC_ADJUST
> > from BIOS.
> 
> Oh, well, QEMU then.
> 
> > Or do you refer to the native BIOS on the host doing TSC synchronization?
> 
> No, virt.

I also (lightly) tested win10 guest, and win10 guest with Hyper-V enabled,
and in both cases I haven't observed TSC/TSC_ADJUST writes.

> 
> > > However, upon migration (and initialization), the KVM_SET_TSC's do 
> > > not happen at exactly the same time (the MSRs for each vCPU are loaded
> > > in sequence). The synchronization code in kvm_set_tsc() is for those cases.
> > 
> > I agree with that, and this is one of the issues that KVM_SET_TSC_STATE
> > is going to fix, since it accounts for it.
> > 
> > 
> > > > and 2) writing to MSR_IA32_TSC_ADJUST does not trigger the
> > > > synchronization code in kvm_write_tsc.
> > > 
> > > Not familiar how guests are using MSR_IA32_TSC_ADJUST (or Linux)...
> > > Lets see:
> > > 
> > > 
> > > /*
> > >  * Freshly booted CPUs call into this:
> > >  */
> > > void check_tsc_sync_target(void)
> > > {
> > >         struct tsc_adjust *cur = this_cpu_ptr(&tsc_adjust);
> > >         unsigned int cpu = smp_processor_id();
> > >         cycles_t cur_max_warp, gbl_max_warp;
> > >         int cpus = 2;
> > > 
> > >         /* Also aborts if there is no TSC. */
> > >         if (unsynchronized_tsc())
> > >                 return;
> > > 
> > >         /*
> > >          * Store, verify and sanitize the TSC adjust register. If
> > >          * successful skip the test.
> > >          *
> > >          * The test is also skipped when the TSC is marked reliable. This
> > >          * is true for SoCs which have no fallback clocksource. On these
> > >          * SoCs the TSC is frequency synchronized, but still the TSC ADJUST
> > >          * register might have been wreckaged by the BIOS..
> > >          */
> > >         if (tsc_store_and_check_tsc_adjust(false) || tsc_clocksource_reliable) {
> > >                 atomic_inc(&skip_test);
> > >                 return;
> > >         }
> > > 
> > > retry:
> > > 
> > > I'd force that synchronization path to be taken as a test-case.
> > 
> > Or even better as I suggested, we might tell the guest kernel
> > to avoid this synchronization path when KVM is detected
> > (regardless of invtsc flag)
> > 
> > > 
> > > > I have a few thoughts about the kvm masterclock synchronization,
> > > > which relate to the Paulo's proposal that I implemented.
> > > > 
> > > > The idea of masterclock is that when the host TSC is synchronized
> > > > (or as kernel call it, stable), and the guest TSC is synchronized as well,
> > > > then we can base the kvmclock, on the same pair of
> > > > (host time in nsec, host tsc value), for all vCPUs.
> > > 
> > > We _have_ to base. See the comment which starts with
> > > 
> > > "Assuming a stable TSC across physical CPUS, and a stable TSC"
> > > 
> > > at x86.c.
> > > 
> > > > This makes the random error in calculation of this value invariant
> > > > across vCPUS, and allows the guest to do kvmclock calculation in userspace
> > > > (vDSO) since kvmclock parameters are vCPU invariant.
> > > 
> > > Actually, without synchronized host TSCs (and the masterclock scheme,
> > > with a single base read from a vCPU), kvmclock in kernel is buggy as
> > > well:
> > > 
> > > u64 pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
> > > {
> > >         unsigned version;
> > >         u64 ret;
> > >         u64 last;
> > >         u8 flags;
> > > 
> > >         do {
> > >                 version = pvclock_read_begin(src);
> > >                 ret = __pvclock_read_cycles(src, rdtsc_ordered());
> > >                 flags = src->flags;
> > >         } while (pvclock_read_retry(src, version));
> > > 
> > >         if (unlikely((flags & PVCLOCK_GUEST_STOPPED) != 0)) {
> > >                 src->flags &= ~PVCLOCK_GUEST_STOPPED;
> > >                 pvclock_touch_watchdogs();
> > >         }
> > > 
> > >         if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) &&
> > >                 (flags & PVCLOCK_TSC_STABLE_BIT))
> > >                 return ret;
> > > 
> > > The code that follows this (including cmpxchg) is a workaround for that 
> > > bug.
> > 
> > I understand that. I am not arguing that we shoudn't use the masterclock!
> > I am just saying the facts about the condition when it works.
> 
> Sure.
> 
> > > Workaround would require each vCPU to write to a "global clock", on
> > > every clock read.
> > > 
> > > > To ensure that the guest tsc is synchronized we currently track host/guest tsc
> > > > writes, and enable the master clock only when roughly the same guest's TSC value
> > > > was written across all vCPUs.
> > > 
> > > Yes, because then you can do:
> > > 
> > > vcpu0				vcpu1
> > > 
> > > A = read TSC
> > > 		... elapsed time ...
> > > 
> > > 				B = read TSC
> > > 
> > > 				delta = B - A
> > > 
> > > > Recently this was disabled by Paulo
> > > 
> > > What was disabled exactly?
> > 
> > The running of tsc synchronization code when the _guest_ writes the TSC.
> > 
> > Which changes two things:
> >    1. If the guest de-synchronizes its TSC, we won't disable master clock.
> >    2. If the guest writes similar TSC values on each vCPU we won't detect
> >       this as synchronization attempt, replace this with exactly the same
> >       value and finally re-enable the master clock.
> > 
> > I argue that this change is OK, because Linux guests don't write to TSC at all,
> > the virtual BIOSes seems not to write there either, and the only case in which
> > the Linux guest tries to change its TSC is on CPU hotplug as you mention and 
> > it uses TSC_ADJUST, that currently doesn't trigger TSC synchronization code in
> > KVM anyway, so it is broken already.
> > 
> > However I also argue that we should mention this in documentation just in case,
> > and we might also want (also just in case) to make Linux guests avoid even trying to
> > touch TSC_ADJUST register when running under KVM.
> > 
> > To rehash my own words, the KVM_CLOCK_TSC_STABLE should be defined as:
> > 'kvmclock is vCPU invariant, as long as the guest doesn't mess with its TSC'.
> > 
> > Having said all that, now that I know tsc sync code, and the
> > reasons why it is there, I wouldn't be arguing about putting it back either.
> 
> Agree.
> 
> > > > and I agree with this, because I think
> > > > that we indeed should only make the guest TSC synchronized by default
> > > > (including new hotplugged vCPUs) and not do any tsc synchronization beyond that.
> > > > (Trying to guess when the guest syncs the TSC can cause more harm that good).
> > > > 
> > > > Besides, Linux guests don't sync the TSC via IA32_TSC write,
> > > > but rather use IA32_TSC_ADJUST which currently doesn't participate
> > > > in the tsc sync heruistics.
> > > 
> > > Linux should not try to sync the TSC with IA32_TSC_ADJUST. It expects
> > > the BIOS to boot with synced TSCs.
> > > 
> > > So i wonder what is making it attempt TSC sync in the first place?
> > 
> > CPU hotplug. And the guest doesn't really write to TSC_ADJUST 
> > since it's measurement code doesn't detect any tsc warps. 
> >  
> > I was just thinking that in theory since, this is a VM, and it can be 
> > interrupted at any point, the measurement code should sometimes fall,
> > and cause trouble.
> > I didn't do much homework on this so I might be overreacting.
> 
> That is true (and you can see it with a CPU starved guest).
> 
> > As far as I see X86_FEATURE_TSC_RELIABLE was done mostly to support
> > running under Hyper-V and VMWARE, and these should be prone to similar
> > issues, supporting my theory.
> > 
> > > (one might also want to have Linux's synchronization via IA32_TSC_ADJUST 
> > > working, but it should not need to happen in the first place, as long as 
> > > QEMU and KVM are behaving properly).
> > > 
> > > > And as far as I know, Linux guest is the primary (only?) user of the kvmclock.
> > > 
> > > Only AFAIK.
> > > 
> > > > I *do think* however that we should redefine KVM_CLOCK_TSC_STABLE
> > > > in the documentation to state that it only guarantees invariance if the guest
> > > > doesn't mess with its own TSC.
> > > > 
> > > > Also I think we should consider enabling the X86_FEATURE_TSC_RELIABLE
> > > > in the guest kernel, when kvm is detected to avoid the guest even from trying
> > > > to sync TSC on newly hotplugged vCPUs.
> > > 
> > > See 7539b174aef405d9d57db48c58390ba360c91312.
> > 
> > I know about this, and I personally always use invtsc
> > with my VMs.
> 
> Well, we can't make it (-cpu xxx,+invtsc) the default if vm-stop/vm-cont are unstable
> with TSC!

Could you elaborate on this too? Are you referring to the same issue you 
had mentioned about the overflow in the kernel time accounting?

> 
> > > Was hoping to make that (-cpu xxx,+invtsc) the default in QEMU once invariant TSC code
> > > becomes stable. Should be tested enough by now?
> > 
> > The issue is that Qemu blocks migration when invtsc is set, based on the
> > fact that the target machine might have different TSC frequency and no
> > support for TSC scaling.
> > There was a long debate on this long ago.
> 
> Oh right.
> 
> > It is possible though to override this by specifying the exact frequency
> > you want the guest TSC to run at, by using something like
> > (tsc-frequency=3500000000)
> > I haven't checked if libvirt does this or not.
> 
> It does.
Cool.
> 
> > I do think that as long as the user uses modern CPUs (which have stable TSC
> > and support TSC scaling), there is no reason to disable invtsc, and
> > therefore no reason to use kvmclock.
> 
> Yep. TSC is faster.

Also this bit is sometimes used by userspace tools.

Some time ago I found out that fio uses it to decide whether 
to use TSC for measurements.

I didn't know this and was running fio in a guest without 'invtsc'.
Fio switched to plain gettimeofday behind my back
and totally screwed up the results.

> 
> > > > (The guest doesn't end up touching TSC_ADJUST usually, but it still might
> > > > in some cases due to scheduling of guest vCPUs)
> > > > 
> > > > (X86_FEATURE_TSC_RELIABLE short circuits tsc synchronization on CPU hotplug,
> > > > and TSC clocksource watchdog, and the later we might want to keep).
> > > 
> > > The latter we want to keep.
> > > 
> > > > For host TSC writes, just as Paulo proposed we can still do the tsc sync,
> > > > unless the new code that I implemented is in use.
> > > 
> > > So Paolo's proposal is to
> > > 
> > > "- for live migration, userspace is expected to use the new
> > > KVM_GET/SET_TSC_PRECISE (or whatever the name will be) to get/set a
> > > (nanosecond, TSC, TSC_ADJUST) tuple."
> > > 
> > > Makes sense, so that no time between KVM_SET_TSC and
> > > MSR_WRITE(TSC_ADJUST) elapses, which would cause the TSC to go out
> > > of what is desired by the user.
> > > 
> > > Since you are proposing this new ioctl, perhaps its useful to also
> > > reduce the 100ms jump? 
> > 
> > Yep. As long as target and destantion clocks are synchronized,
> > it should make it better.
> > 
> > > "- for live migration, userspace is expected to use the new
> > > KVM_GET/SET_TSC_PRECISE (or whatever the name will be) to get/set a
> > > (nanosecond, TSC, TSC_ADJUST) tuple. This value will be written
> > > to the guest before the first VM-entry"
> > > 
> > > Sounds like a good idea (to integrate the values in a tuple).
> > > 
> > > > Few more random notes:
> > > > 
> > > > I have a weird feeling about using 'nsec since 1 January 1970'.
> > > > Common sense is telling me that a 64 bit value can hold about 580 years,
> > > > but still I see that it is more common to use timespec which is a (sec,nsec) pair.
> > > 
> > >            struct timespec {
> > >                time_t   tv_sec;        /* seconds */
> > >                long     tv_nsec;       /* nanoseconds */
> > >            };
> > > 
> > > > I feel that 'kvm_get_walltime' that I added is a bit of a hack.
> > > > Some refactoring might improve things here.
> > > 
> > > Haven't read the patchset yet...
> > > 
> > > > For example making kvm_get_walltime_and_clockread work in non tsc case as well
> > > > might make the code cleaner.
> > > > 
> > > > Patches to enable this feature in qemu are in process of being sent to
> > > > qemu-devel mailing list.
> > > > 
> > > > Best regards,
> > > >        Maxim Levitsky
> > > > 
> > > > Maxim Levitsky (2):
> > > >   KVM: x86: implement KVM_SET_TSC_PRECISE/KVM_GET_TSC_PRECISE
> > > >   KVM: x86: introduce KVM_X86_QUIRK_TSC_HOST_ACCESS
> > > > 
> > > >  Documentation/virt/kvm/api.rst  | 56 +++++++++++++++++++++
> > > >  arch/x86/include/uapi/asm/kvm.h |  1 +
> > > >  arch/x86/kvm/x86.c              | 88 +++++++++++++++++++++++++++++++--
> > > >  include/uapi/linux/kvm.h        | 14 ++++++
> > > >  4 files changed, 154 insertions(+), 5 deletions(-)
> > > > 
> > > > -- 
> > > > 2.26.2
> > > > 
> > 
> > Best regards,
> > 	Maxim Levitsky
> > 


Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] RFC: Precise TSC migration
  2020-12-01 19:35 ` Thomas Gleixner
@ 2020-12-03 11:41   ` Paolo Bonzini
  2020-12-03 12:47   ` Maxim Levitsky
  1 sibling, 0 replies; 30+ messages in thread
From: Paolo Bonzini @ 2020-12-03 11:41 UTC (permalink / raw)
  To: Thomas Gleixner, Maxim Levitsky, kvm
  Cc: Oliver Upton, Ingo Molnar, Sean Christopherson, open list,
	Marcelo Tosatti, Jonathan Corbet, Wanpeng Li, Borislav Petkov,
	Jim Mattson, H. Peter Anvin, open list:DOCUMENTATION,
	Joerg Roedel, maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Vitaly Kuznetsov

On 01/12/20 20:35, Thomas Gleixner wrote:
>> This makes the random error in calculation of this value invariant
>> across vCPUS, and allows the guest to do kvmclock calculation in userspace
>> (vDSO) since kvmclock parameters are vCPU invariant.
>
> That's not the case today? OMG!

It's optional. If the host tells the guest that the host TSC is messed 
up, kvmclock disables its vDSO implementation and falls back to the syscall.

Paolo


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] RFC: Precise TSC migration
  2020-12-01 15:02     ` Marcelo Tosatti
@ 2020-12-03 11:51       ` Maxim Levitsky
  0 siblings, 0 replies; 30+ messages in thread
From: Maxim Levitsky @ 2020-12-03 11:51 UTC (permalink / raw)
  To: Marcelo Tosatti, Thomas Gleixner
  Cc: kvm, Paolo Bonzini, Oliver Upton, Ingo Molnar,
	Sean Christopherson, open list, Jonathan Corbet, Wanpeng Li,
	Borislav Petkov, Jim Mattson, H. Peter Anvin,
	open list:DOCUMENTATION, Joerg Roedel,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Vitaly Kuznetsov

On Tue, 2020-12-01 at 12:02 -0300, Marcelo Tosatti wrote:
> On Tue, Dec 01, 2020 at 02:48:11PM +0100, Thomas Gleixner wrote:
> > On Mon, Nov 30 2020 at 16:16, Marcelo Tosatti wrote:
> > > > Besides, Linux guests don't sync the TSC via IA32_TSC write,
> > > > but rather use IA32_TSC_ADJUST which currently doesn't participate
> > > > in the tsc sync heruistics.
> > > 
> > > Linux should not try to sync the TSC with IA32_TSC_ADJUST. It expects
> > > the BIOS to boot with synced TSCs.
> > 
> > That's wishful thinking.
> > 
> > Reality is that BIOS tinkerers fail to get it right. TSC_ADJUST allows
> > us to undo the wreckage they create.
> > 
> > Thanks,
> > 
> >         tglx
> 
> Have not seen any multicore Dell/HP systems require that.
> 
> Anyway, for QEMU/KVM it should be synced (unless there is a bug
> in the sync logic in the first place).
> 

I agree with that, and that is why I suggested to make the guest
avoid TSC syncing when KVM is detected.
 
I don't mind how to implement this.
 
It can be either done with new CPUID bit, 
or always when KVM
is detected, 
(or even when *any* hypervisor is detected)
 
I also don't mind if we only disable tsc sync logic or
set X86_FEATURE_TSC_RELIABLE which will disable it
and the clocksource watchdog.


Best regards,
	Maxim Levitsky




^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] RFC: Precise TSC migration
  2020-12-01 16:19     ` Andy Lutomirski
@ 2020-12-03 11:57       ` Maxim Levitsky
  0 siblings, 0 replies; 30+ messages in thread
From: Maxim Levitsky @ 2020-12-03 11:57 UTC (permalink / raw)
  To: Andy Lutomirski, Thomas Gleixner
  Cc: Marcelo Tosatti, kvm, Paolo Bonzini, Oliver Upton, Ingo Molnar,
	Sean Christopherson, open list, Jonathan Corbet, Wanpeng Li,
	Borislav Petkov, Jim Mattson, H. Peter Anvin,
	open list:DOCUMENTATION, Joerg Roedel,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Vitaly Kuznetsov

On Tue, 2020-12-01 at 08:19 -0800, Andy Lutomirski wrote:
> > On Dec 1, 2020, at 6:01 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > 
> > On Mon, Nov 30 2020 at 16:16, Marcelo Tosatti wrote:
> > > Not really. The synchronization logic tries to sync TSCs during
> > > BIOS boot (and CPU hotplug), because the TSC values are loaded
> > > sequentially, say:
> > > 
> > > CPU        realtime    TSC val
> > > vcpu0        0 usec        0
> > > vcpu1        100 usec    0
> > > vcpu2        200 usec    0
> > 
> > That's nonsense, really.
> > 
> > > And we'd like to see all vcpus to read the same value at all times.
> > 
> > Providing guests with a synchronized and stable TSC on a host with a
> > synchronized and stable TSC is trivial.
> > 
> > Write the _same_ TSC offset to _all_ vcpu control structs and be done
> > with it. It's not rocket science.
> > 
> > The guest TSC read is:
> > 
> >    hostTSC + vcpu_offset
> > 
> > So if the host TSC is synchronized then the guest TSCs are synchronized
> > as well.
> > 
> > If the host TSC is not synchronized, then don't even try.
> 
> This reminds me: if you’re adding a new kvm feature that tells the guest that the TSC works well, could you perhaps only have one structure for all vCPUs in the same guest?

I won't mind doing this, but this might be too much work for
too little gain.

IMHO, modern hosts don't need the kvmclock in the first place,
and should just expose the TSC to the guest 
together with the invtsc bit.

Best regards,
	Maxim Levitsky

> 
> > Thanks,
> > 
> >        tglx



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] RFC: Precise TSC migration
  2020-12-01 19:35 ` Thomas Gleixner
  2020-12-03 11:41   ` Paolo Bonzini
@ 2020-12-03 12:47   ` Maxim Levitsky
  1 sibling, 0 replies; 30+ messages in thread
From: Maxim Levitsky @ 2020-12-03 12:47 UTC (permalink / raw)
  To: Thomas Gleixner, kvm
  Cc: Paolo Bonzini, Oliver Upton, Ingo Molnar, Sean Christopherson,
	open list, Marcelo Tosatti, Jonathan Corbet, Wanpeng Li,
	Borislav Petkov, Jim Mattson, H. Peter Anvin,
	open list:DOCUMENTATION, Joerg Roedel,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Vitaly Kuznetsov

On Tue, 2020-12-01 at 20:35 +0100, Thomas Gleixner wrote:
> On Mon, Nov 30 2020 at 15:35, Maxim Levitsky wrote:
> > The idea of masterclock is that when the host TSC is synchronized
> > (or as kernel call it, stable), and the guest TSC is synchronized as well,
> > then we can base the kvmclock, on the same pair of
> > (host time in nsec, host tsc value), for all vCPUs.
> 
> That's correct. All guest CPUs should see exactly the same TSC value,
> i.e.
> 
>         hostTSC + vcpu_offset

This is roughly the case today, unless the guest messes up
its own TSC, which it is allowed to do so.
 
And that brings me an idea: Why do we allow the guest
to mess up with its TSC in the first place?

Can't we just stop exposing the TSC_ADJUST to the guest,
since it is an optional x86 feature?
 
In addition to that I also had read somewhere 
that TSC writes aren't even guaranteed to work on some x86 systems
(e.g that msr can be readonly, and that's why TSC_ADJUST was created)
so I'll say we can go even further and just ignore
writes to TSC from the guest as well.

Or if this is too much, we can just ignore this 
since Linux doesn't touch the TSC anyway.
 
Best feature is no feature, you know.


> 
> > This makes the random error in calculation of this value invariant
> > across vCPUS, and allows the guest to do kvmclock calculation in userspace
> > (vDSO) since kvmclock parameters are vCPU invariant.
> 
> That's not the case today? OMG!

We have the masterclock to avoid this, whose existence we trickle
down to the guest via KVM_CLOCK_TSC_STABLE bit.
 
When master clock is not enabled (due to one of the reasons
I mentioned in the RFC mail), this bit is not set,
and the guest fails back to do system calls.

> 
> > To ensure that the guest tsc is synchronized we currently track host/guest tsc
> > writes, and enable the master clock only when roughly the same guest's TSC value
> > was written across all vCPUs.
> 
> The Linux kernel never writes the TSC. We've tried that ~15 years ago
> and it was a total disaster.
I can imagine this.

> 
> > Recently this was disabled by Paulo and I agree with this, because I think
> > that we indeed should only make the guest TSC synchronized by default
> > (including new hotplugged vCPUs) and not do any tsc synchronization beyond that.
> > (Trying to guess when the guest syncs the TSC can cause more harm that good).
> > 
> > Besides, Linux guests don't sync the TSC via IA32_TSC write,
> > but rather use IA32_TSC_ADJUST which currently doesn't participate
> > in the tsc sync heruistics.
> 
> The kernel only writes TSC_ADJUST when it is advertised in CPUID and:
> 
>     1) when the boot CPU detects a non-zero TSC_ADJUST value it writes
>        it to 0, except when running on SGI-UV
> 
>     2) when a starting CPU has a different TSC_ADJUST value than the
>        first CPU which came up on the same socket.
> 
>     3) When the first CPU of a different socket is starting and the TSC
>        synchronization check fails against a CPU on an already checked
>        socket then the kernel tries to adjust TSC_ADJUST to the point
>        that the synchronization check does not fail anymore.
Since we do allow multi socket guests, I guess (3) can still fail
due to scheduling noise and make the guest write TSC_ADJUST.

> 
> > And as far as I know, Linux guest is the primary (only?) user of the kvmclock.
> > 
> > I *do think* however that we should redefine KVM_CLOCK_TSC_STABLE
> > in the documentation to state that it only guarantees invariance if the guest
> > doesn't mess with its own TSC.
> > 
> > Also I think we should consider enabling the X86_FEATURE_TSC_RELIABLE
> > in the guest kernel, when kvm is detected to avoid the guest even from trying
> > to sync TSC on newly hotplugged vCPUs.
> > 
> > (The guest doesn't end up touching TSC_ADJUST usually, but it still might
> > in some cases due to scheduling of guest vCPUs)
> 
> The only cases it would try to write are #3 above or because the
> hypervisor or BIOS messed it up (#1, #2).
> 
> > (X86_FEATURE_TSC_RELIABLE short circuits tsc synchronization on CPU hotplug,
> > and TSC clocksource watchdog, and the later we might want to keep).
> 
> Depends. If the host TSC is stable and synchronized, then you don't need
> the TSC watchdog. We are slowly starting to trust the TSC to some extent
> and phase out the watchdog for newer parts (hopefully).
> 
> If the host TSC really falls apart then it still can invalidate
> KVM_CLOCK and force the guest to reevaluate the situation.
> 
> > Few more random notes:
> > 
> > I have a weird feeling about using 'nsec since 1 January 1970'.
> > Common sense is telling me that a 64 bit value can hold about 580
> > years,
> 
> which is plenty.
I also think so, just wanted to hear your opinion on that.
> 
> > but still I see that it is more common to use timespec which is a
> > (sec,nsec) pair.
> 
> timespecs are horrible.

Best regards,
	Maxim Levitsky

> 
> Thanks,
> 
>         tglx
> 



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] RFC: Precise TSC migration
  2020-12-03 11:39       ` Maxim Levitsky
@ 2020-12-03 20:18         ` Marcelo Tosatti
  2020-12-07 13:00           ` Maxim Levitsky
  0 siblings, 1 reply; 30+ messages in thread
From: Marcelo Tosatti @ 2020-12-03 20:18 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: kvm, Paolo Bonzini, Oliver Upton, Ingo Molnar,
	Sean Christopherson, Thomas Gleixner, open list, Jonathan Corbet,
	Wanpeng Li, Borislav Petkov, Jim Mattson, H. Peter Anvin,
	open list:DOCUMENTATION, Joerg Roedel,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Vitaly Kuznetsov


On Thu, Dec 03, 2020 at 01:39:42PM +0200, Maxim Levitsky wrote:
> On Tue, 2020-12-01 at 16:48 -0300, Marcelo Tosatti wrote:
> > On Tue, Dec 01, 2020 at 02:30:39PM +0200, Maxim Levitsky wrote:
> > > On Mon, 2020-11-30 at 16:16 -0300, Marcelo Tosatti wrote:
> > > > Hi Maxim,
> > > > 
> > > > On Mon, Nov 30, 2020 at 03:35:57PM +0200, Maxim Levitsky wrote:
> > > > > Hi!
> > > > > 
> > > > > This is the first version of the work to make TSC migration more accurate,
> > > > > as was defined by Paulo at:
> > > > > https://www.spinics.net/lists/kvm/msg225525.html
> > > > 
> > > > Description from Oliver's patch:
> > > > 
> > > > "To date, VMMs have typically restored the guest's TSCs by value using
> > > > the KVM_SET_MSRS ioctl for each vCPU. However, restoring the TSCs by
> > > > value introduces some challenges with synchronization as the TSCs
> > > > continue to tick throughout the restoration process. As such, KVM has
> > > > some heuristics around TSC writes to infer whether or not the guest or
> > > > host is attempting to synchronize the TSCs."
> > > > 
> > > > Not really. The synchronization logic tries to sync TSCs during
> > > > BIOS boot (and CPU hotplug), because the TSC values are loaded
> > > > sequentially, say:
> > > > 
> > > > CPU		realtime	TSC val
> > > > vcpu0		0 usec		0
> > > > vcpu1		100 usec	0
> > > > vcpu2		200 usec	0
> > > > ...
> > > > 
> > > > And we'd like to see all vcpus to read the same value at all times.
> > > > 
> > > > Other than that, comment makes sense. The problem with live migration
> > > > is as follows:
> > > > 
> > > > We'd like the TSC value to be written, ideally, just before the first
> > > > VM-entry a vCPU (because at the moment the TSC_OFFSET has been written, 
> > > > the vcpus tsc is ticking, which will cause a visible forward jump
> > > > in vcpus tsc time).
> > > > 
> > > > Before the first VM-entry is the farthest point in time before guest
> > > > entry that one could do that.
> > > > 
> > > > The window (or forward jump) between KVM_SET_TSC and VM-entry was about
> > > > 100ms last time i checked (which results in a 100ms time jump forward), 
> > > > See QEMU's 6053a86fe7bd3d5b07b49dae6c05f2cd0d44e687.
> > > > 
> > > > Have we measured any improvement with this patchset?
> > > 
> > > Its not about this window. 
> > > It is about time that passes between the point that we read the 
> > > TSC on source system (and we do it in qemu each time the VM is paused) 
> > > and the moment that we set the same TSC value on the target. 
> > > That time is unbounded.
> > 
> > OK. Well, its the same problem: ideally you'd want to do that just
> > before VCPU-entry.
> > 
> > > Also this patchset should decrease TSC skew that happens
> > > between restoring it on multiple vCPUs as well, since 
> > > KVM_SET_TSC_STATE doesn't have to happen at the same time,
> > > as it accounts for time passed on each vCPU.
> > > 
> > > 
> > > Speaking of kvmclock, somewhat offtopic since this is a different issue,
> > > I found out that qemu reads the kvmclock value on each pause, 
> > > and then 'restores' on unpause, using
> > > KVM_SET_CLOCK (this modifies the global kvmclock offset)
> > > 
> > > This means (and I tested it) that if guest uses kvmclock
> > > for time reference, it will not account for time passed in
> > > the paused state.
> > 
> > Yes, this is necessary because otherwise there might be an overflow
> > in the kernel time accounting code (if the clock delta is too large).
> 
> Could you elaborate on this? Do you mean that guest kernel can crash,
> when the time 'jumps' too far forward in one go?

It can crash (there will a overflow and time will jump backwards).

> If so this will happen with kernel using TSC as well, 
> since we do let the virtual TSC to 'keep running' while VM is suspended, 
> and the goal of this patchset is to let it 'run' even while
> the VM is migrating.

True. For the overflow one, perhaps TSC can be stopped (and restored) similarly to
KVMCLOCK.

See QEMU's commit 00f4d64ee76e873be881a82d893a591487aa7950.

> And if there is an issue, we really should try to fix it in
> the guest kernel IMHO.
> 
> > 
> > > > Then Paolo mentions (with >), i am replying as usual.
> > > > 
> > > > > Ok, after looking more at the code with Maxim I can confidently say that
> > > > > it's a total mess.  And a lot of the synchronization code is dead
> > > > > because 1) as far as we could see no guest synchronizes the TSC using
> > > > > MSR_IA32_TSC; 
> > > > 
> > > > Well, recent BIOS'es take care of synchronizing the TSC. So when Linux
> > > > boots, it does not have to synchronize TSC in software. 
> > > 
> > > Do you have an example of such BIOS? I tested OVMF which I compiled
> > > from git master a few weeks ago, and I also tested this with seabios 
> > > from qemu repo, and I have never seen writes to either TSC, or TSC_ADJUST
> > > from BIOS.
> > 
> > Oh, well, QEMU then.
> > 
> > > Or do you refer to the native BIOS on the host doing TSC synchronization?
> > 
> > No, virt.
> 
> I also (lightly) tested win10 guest, and win10 guest with Hyper-V enabled,
> and in both cases I haven't observed TSC/TSC_ADJUST writes.
> 
> > 
> > > > However, upon migration (and initialization), the KVM_SET_TSC's do 
> > > > not happen at exactly the same time (the MSRs for each vCPU are loaded
> > > > in sequence). The synchronization code in kvm_set_tsc() is for those cases.
> > > 
> > > I agree with that, and this is one of the issues that KVM_SET_TSC_STATE
> > > is going to fix, since it accounts for it.
> > > 
> > > 
> > > > > and 2) writing to MSR_IA32_TSC_ADJUST does not trigger the
> > > > > synchronization code in kvm_write_tsc.
> > > > 
> > > > Not familiar how guests are using MSR_IA32_TSC_ADJUST (or Linux)...
> > > > Lets see:
> > > > 
> > > > 
> > > > /*
> > > >  * Freshly booted CPUs call into this:
> > > >  */
> > > > void check_tsc_sync_target(void)
> > > > {
> > > >         struct tsc_adjust *cur = this_cpu_ptr(&tsc_adjust);
> > > >         unsigned int cpu = smp_processor_id();
> > > >         cycles_t cur_max_warp, gbl_max_warp;
> > > >         int cpus = 2;
> > > > 
> > > >         /* Also aborts if there is no TSC. */
> > > >         if (unsynchronized_tsc())
> > > >                 return;
> > > > 
> > > >         /*
> > > >          * Store, verify and sanitize the TSC adjust register. If
> > > >          * successful skip the test.
> > > >          *
> > > >          * The test is also skipped when the TSC is marked reliable. This
> > > >          * is true for SoCs which have no fallback clocksource. On these
> > > >          * SoCs the TSC is frequency synchronized, but still the TSC ADJUST
> > > >          * register might have been wreckaged by the BIOS..
> > > >          */
> > > >         if (tsc_store_and_check_tsc_adjust(false) || tsc_clocksource_reliable) {
> > > >                 atomic_inc(&skip_test);
> > > >                 return;
> > > >         }
> > > > 
> > > > retry:
> > > > 
> > > > I'd force that synchronization path to be taken as a test-case.
> > > 
> > > Or even better as I suggested, we might tell the guest kernel
> > > to avoid this synchronization path when KVM is detected
> > > (regardless of invtsc flag)
> > > 
> > > > 
> > > > > I have a few thoughts about the kvm masterclock synchronization,
> > > > > which relate to the Paulo's proposal that I implemented.
> > > > > 
> > > > > The idea of masterclock is that when the host TSC is synchronized
> > > > > (or as kernel call it, stable), and the guest TSC is synchronized as well,
> > > > > then we can base the kvmclock, on the same pair of
> > > > > (host time in nsec, host tsc value), for all vCPUs.
> > > > 
> > > > We _have_ to base. See the comment which starts with
> > > > 
> > > > "Assuming a stable TSC across physical CPUS, and a stable TSC"
> > > > 
> > > > at x86.c.
> > > > 
> > > > > This makes the random error in calculation of this value invariant
> > > > > across vCPUS, and allows the guest to do kvmclock calculation in userspace
> > > > > (vDSO) since kvmclock parameters are vCPU invariant.
> > > > 
> > > > Actually, without synchronized host TSCs (and the masterclock scheme,
> > > > with a single base read from a vCPU), kvmclock in kernel is buggy as
> > > > well:
> > > > 
> > > > u64 pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
> > > > {
> > > >         unsigned version;
> > > >         u64 ret;
> > > >         u64 last;
> > > >         u8 flags;
> > > > 
> > > >         do {
> > > >                 version = pvclock_read_begin(src);
> > > >                 ret = __pvclock_read_cycles(src, rdtsc_ordered());
> > > >                 flags = src->flags;
> > > >         } while (pvclock_read_retry(src, version));
> > > > 
> > > >         if (unlikely((flags & PVCLOCK_GUEST_STOPPED) != 0)) {
> > > >                 src->flags &= ~PVCLOCK_GUEST_STOPPED;
> > > >                 pvclock_touch_watchdogs();
> > > >         }
> > > > 
> > > >         if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) &&
> > > >                 (flags & PVCLOCK_TSC_STABLE_BIT))
> > > >                 return ret;
> > > > 
> > > > The code that follows this (including cmpxchg) is a workaround for that 
> > > > bug.
> > > 
> > > I understand that. I am not arguing that we shoudn't use the masterclock!
> > > I am just saying the facts about the condition when it works.
> > 
> > Sure.
> > 
> > > > Workaround would require each vCPU to write to a "global clock", on
> > > > every clock read.
> > > > 
> > > > > To ensure that the guest tsc is synchronized we currently track host/guest tsc
> > > > > writes, and enable the master clock only when roughly the same guest's TSC value
> > > > > was written across all vCPUs.
> > > > 
> > > > Yes, because then you can do:
> > > > 
> > > > vcpu0				vcpu1
> > > > 
> > > > A = read TSC
> > > > 		... elapsed time ...
> > > > 
> > > > 				B = read TSC
> > > > 
> > > > 				delta = B - A
> > > > 
> > > > > Recently this was disabled by Paulo
> > > > 
> > > > What was disabled exactly?
> > > 
> > > The running of tsc synchronization code when the _guest_ writes the TSC.
> > > 
> > > Which changes two things:
> > >    1. If the guest de-synchronizes its TSC, we won't disable master clock.
> > >    2. If the guest writes similar TSC values on each vCPU we won't detect
> > >       this as synchronization attempt, replace this with exactly the same
> > >       value and finally re-enable the master clock.
> > > 
> > > I argue that this change is OK, because Linux guests don't write to TSC at all,
> > > the virtual BIOSes seems not to write there either, and the only case in which
> > > the Linux guest tries to change its TSC is on CPU hotplug as you mention and 
> > > it uses TSC_ADJUST, that currently doesn't trigger TSC synchronization code in
> > > KVM anyway, so it is broken already.
> > > 
> > > However I also argue that we should mention this in documentation just in case,
> > > and we might also want (also just in case) to make Linux guests avoid even trying to
> > > touch TSC_ADJUST register when running under KVM.
> > > 
> > > To rehash my own words, the KVM_CLOCK_TSC_STABLE should be defined as:
> > > 'kvmclock is vCPU invariant, as long as the guest doesn't mess with its TSC'.
> > > 
> > > Having said all that, now that I know tsc sync code, and the
> > > reasons why it is there, I wouldn't be arguing about putting it back either.
> > 
> > Agree.
> > 
> > > > > and I agree with this, because I think
> > > > > that we indeed should only make the guest TSC synchronized by default
> > > > > (including new hotplugged vCPUs) and not do any tsc synchronization beyond that.
> > > > > (Trying to guess when the guest syncs the TSC can cause more harm that good).
> > > > > 
> > > > > Besides, Linux guests don't sync the TSC via IA32_TSC write,
> > > > > but rather use IA32_TSC_ADJUST which currently doesn't participate
> > > > > in the tsc sync heruistics.
> > > > 
> > > > Linux should not try to sync the TSC with IA32_TSC_ADJUST. It expects
> > > > the BIOS to boot with synced TSCs.
> > > > 
> > > > So i wonder what is making it attempt TSC sync in the first place?
> > > 
> > > CPU hotplug. And the guest doesn't really write to TSC_ADJUST 
> > > since it's measurement code doesn't detect any tsc warps. 
> > >  
> > > I was just thinking that in theory since, this is a VM, and it can be 
> > > interrupted at any point, the measurement code should sometimes fall,
> > > and cause trouble.
> > > I didn't do much homework on this so I might be overreacting.
> > 
> > That is true (and you can see it with a CPU starved guest).
> > 
> > > As far as I see X86_FEATURE_TSC_RELIABLE was done mostly to support
> > > running under Hyper-V and VMWARE, and these should be prone to similar
> > > issues, supporting my theory.
> > > 
> > > > (one might also want to have Linux's synchronization via IA32_TSC_ADJUST 
> > > > working, but it should not need to happen in the first place, as long as 
> > > > QEMU and KVM are behaving properly).
> > > > 
> > > > > And as far as I know, Linux guest is the primary (only?) user of the kvmclock.
> > > > 
> > > > Only AFAIK.
> > > > 
> > > > > I *do think* however that we should redefine KVM_CLOCK_TSC_STABLE
> > > > > in the documentation to state that it only guarantees invariance if the guest
> > > > > doesn't mess with its own TSC.
> > > > > 
> > > > > Also I think we should consider enabling the X86_FEATURE_TSC_RELIABLE
> > > > > in the guest kernel, when kvm is detected to avoid the guest even from trying
> > > > > to sync TSC on newly hotplugged vCPUs.
> > > > 
> > > > See 7539b174aef405d9d57db48c58390ba360c91312.
> > > 
> > > I know about this, and I personally always use invtsc
> > > with my VMs.
> > 
> > Well, we can't make it (-cpu xxx,+invtsc) the default if vm-stop/vm-cont are unstable
> > with TSC!
> 
> Could you elaborate on this too? Are you referring to the same issue you 
> had mentioned about the overflow in the kernel time accounting?

Well, any issue that could show up.

> > > > Was hoping to make that (-cpu xxx,+invtsc) the default in QEMU once invariant TSC code
> > > > becomes stable. Should be tested enough by now?
> > > 
> > > The issue is that Qemu blocks migration when invtsc is set, based on the
> > > fact that the target machine might have different TSC frequency and no
> > > support for TSC scaling.
> > > There was a long debate on this long ago.
> > 
> > Oh right.
> > 
> > > It is possible though to override this by specifying the exact frequency
> > > you want the guest TSC to run at, by using something like
> > > (tsc-frequency=3500000000)
> > > I haven't checked if libvirt does this or not.
> > 
> > It does.
> Cool.
> > 
> > > I do think that as long as the user uses modern CPUs (which have stable TSC
> > > and support TSC scaling), there is no reason to disable invtsc, and
> > > therefore no reason to use kvmclock.
> > 
> > Yep. TSC is faster.
> 
> Also this bit is sometimes used by userspace tools.

Yep! SAP HANA as well.

> Some time ago I found out that fio uses it to decide whether 
> to use TSC for measurements.
> 
> I didn't know this and was running fio in a guest without 'invtsc'.
> Fio switched to plain gettimeofday behind my back
> and totally screwed up the results.
> 
> > 
> > > > > (The guest doesn't end up touching TSC_ADJUST usually, but it still might
> > > > > in some cases due to scheduling of guest vCPUs)
> > > > > 
> > > > > (X86_FEATURE_TSC_RELIABLE short circuits tsc synchronization on CPU hotplug,
> > > > > and TSC clocksource watchdog, and the later we might want to keep).
> > > > 
> > > > The latter we want to keep.
> > > > 
> > > > > For host TSC writes, just as Paulo proposed we can still do the tsc sync,
> > > > > unless the new code that I implemented is in use.
> > > > 
> > > > So Paolo's proposal is to
> > > > 
> > > > "- for live migration, userspace is expected to use the new
> > > > KVM_GET/SET_TSC_PRECISE (or whatever the name will be) to get/set a
> > > > (nanosecond, TSC, TSC_ADJUST) tuple."
> > > > 
> > > > Makes sense, so that no time between KVM_SET_TSC and
> > > > MSR_WRITE(TSC_ADJUST) elapses, which would cause the TSC to go out
> > > > of what is desired by the user.
> > > > 
> > > > Since you are proposing this new ioctl, perhaps its useful to also
> > > > reduce the 100ms jump? 
> > > 
> > > Yep. As long as target and destantion clocks are synchronized,
> > > it should make it better.
> > > 
> > > > "- for live migration, userspace is expected to use the new
> > > > KVM_GET/SET_TSC_PRECISE (or whatever the name will be) to get/set a
> > > > (nanosecond, TSC, TSC_ADJUST) tuple. This value will be written
> > > > to the guest before the first VM-entry"
> > > > 
> > > > Sounds like a good idea (to integrate the values in a tuple).
> > > > 
> > > > > Few more random notes:
> > > > > 
> > > > > I have a weird feeling about using 'nsec since 1 January 1970'.
> > > > > Common sense is telling me that a 64 bit value can hold about 580 years,
> > > > > but still I see that it is more common to use timespec which is a (sec,nsec) pair.
> > > > 
> > > >            struct timespec {
> > > >                time_t   tv_sec;        /* seconds */
> > > >                long     tv_nsec;       /* nanoseconds */
> > > >            };
> > > > 
> > > > > I feel that 'kvm_get_walltime' that I added is a bit of a hack.
> > > > > Some refactoring might improve things here.
> > > > 
> > > > Haven't read the patchset yet...
> > > > 
> > > > > For example making kvm_get_walltime_and_clockread work in non tsc case as well
> > > > > might make the code cleaner.
> > > > > 
> > > > > Patches to enable this feature in qemu are in process of being sent to
> > > > > qemu-devel mailing list.
> > > > > 
> > > > > Best regards,
> > > > >        Maxim Levitsky
> > > > > 
> > > > > Maxim Levitsky (2):
> > > > >   KVM: x86: implement KVM_SET_TSC_PRECISE/KVM_GET_TSC_PRECISE
> > > > >   KVM: x86: introduce KVM_X86_QUIRK_TSC_HOST_ACCESS
> > > > > 
> > > > >  Documentation/virt/kvm/api.rst  | 56 +++++++++++++++++++++
> > > > >  arch/x86/include/uapi/asm/kvm.h |  1 +
> > > > >  arch/x86/kvm/x86.c              | 88 +++++++++++++++++++++++++++++++--
> > > > >  include/uapi/linux/kvm.h        | 14 ++++++
> > > > >  4 files changed, 154 insertions(+), 5 deletions(-)
> > > > > 
> > > > > -- 
> > > > > 2.26.2
> > > > > 
> > > 
> > > Best regards,
> > > 	Maxim Levitsky
> > > 
> 
> 
> Best regards,
> 	Maxim Levitsky


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/2] RFC: Precise TSC migration
  2020-12-03 20:18         ` Marcelo Tosatti
@ 2020-12-07 13:00           ` Maxim Levitsky
  0 siblings, 0 replies; 30+ messages in thread
From: Maxim Levitsky @ 2020-12-07 13:00 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: kvm, Paolo Bonzini, Oliver Upton, Ingo Molnar,
	Sean Christopherson, Thomas Gleixner, open list, Jonathan Corbet,
	Wanpeng Li, Borislav Petkov, Jim Mattson, H. Peter Anvin,
	open list:DOCUMENTATION, Joerg Roedel,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Vitaly Kuznetsov

On Thu, 2020-12-03 at 17:18 -0300, Marcelo Tosatti wrote:
> On Thu, Dec 03, 2020 at 01:39:42PM +0200, Maxim Levitsky wrote:
> > On Tue, 2020-12-01 at 16:48 -0300, Marcelo Tosatti wrote:
> > > On Tue, Dec 01, 2020 at 02:30:39PM +0200, Maxim Levitsky wrote:
> > > > On Mon, 2020-11-30 at 16:16 -0300, Marcelo Tosatti wrote:
> > > > > Hi Maxim,
> > > > > 
> > > > > On Mon, Nov 30, 2020 at 03:35:57PM +0200, Maxim Levitsky wrote:
> > > > > > Hi!
> > > > > > 
> > > > > > This is the first version of the work to make TSC migration more accurate,
> > > > > > as was defined by Paulo at:
> > > > > > https://www.spinics.net/lists/kvm/msg225525.html
> > > > > 
> > > > > Description from Oliver's patch:
> > > > > 
> > > > > "To date, VMMs have typically restored the guest's TSCs by value using
> > > > > the KVM_SET_MSRS ioctl for each vCPU. However, restoring the TSCs by
> > > > > value introduces some challenges with synchronization as the TSCs
> > > > > continue to tick throughout the restoration process. As such, KVM has
> > > > > some heuristics around TSC writes to infer whether or not the guest or
> > > > > host is attempting to synchronize the TSCs."
> > > > > 
> > > > > Not really. The synchronization logic tries to sync TSCs during
> > > > > BIOS boot (and CPU hotplug), because the TSC values are loaded
> > > > > sequentially, say:
> > > > > 
> > > > > CPU		realtime	TSC val
> > > > > vcpu0		0 usec		0
> > > > > vcpu1		100 usec	0
> > > > > vcpu2		200 usec	0
> > > > > ...
> > > > > 
> > > > > And we'd like to see all vcpus to read the same value at all times.
> > > > > 
> > > > > Other than that, comment makes sense. The problem with live migration
> > > > > is as follows:
> > > > > 
> > > > > We'd like the TSC value to be written, ideally, just before the first
> > > > > VM-entry a vCPU (because at the moment the TSC_OFFSET has been written, 
> > > > > the vcpus tsc is ticking, which will cause a visible forward jump
> > > > > in vcpus tsc time).
> > > > > 
> > > > > Before the first VM-entry is the farthest point in time before guest
> > > > > entry that one could do that.
> > > > > 
> > > > > The window (or forward jump) between KVM_SET_TSC and VM-entry was about
> > > > > 100ms last time i checked (which results in a 100ms time jump forward), 
> > > > > See QEMU's 6053a86fe7bd3d5b07b49dae6c05f2cd0d44e687.
> > > > > 
> > > > > Have we measured any improvement with this patchset?
> > > > 
> > > > Its not about this window. 
> > > > It is about time that passes between the point that we read the 
> > > > TSC on source system (and we do it in qemu each time the VM is paused) 
> > > > and the moment that we set the same TSC value on the target. 
> > > > That time is unbounded.
> > > 
> > > OK. Well, its the same problem: ideally you'd want to do that just
> > > before VCPU-entry.
> > > 
> > > > Also this patchset should decrease TSC skew that happens
> > > > between restoring it on multiple vCPUs as well, since 
> > > > KVM_SET_TSC_STATE doesn't have to happen at the same time,
> > > > as it accounts for time passed on each vCPU.
> > > > 
> > > > 
> > > > Speaking of kvmclock, somewhat offtopic since this is a different issue,
> > > > I found out that qemu reads the kvmclock value on each pause, 
> > > > and then 'restores' on unpause, using
> > > > KVM_SET_CLOCK (this modifies the global kvmclock offset)
> > > > 
> > > > This means (and I tested it) that if guest uses kvmclock
> > > > for time reference, it will not account for time passed in
> > > > the paused state.
> > > 
> > > Yes, this is necessary because otherwise there might be an overflow
> > > in the kernel time accounting code (if the clock delta is too large).
> > 
> > Could you elaborate on this? Do you mean that guest kernel can crash,
> > when the time 'jumps' too far forward in one go?
> 
> It can crash (there will a overflow and time will jump backwards).
> 
> > If so this will happen with kernel using TSC as well, 
> > since we do let the virtual TSC to 'keep running' while VM is suspended, 
> > and the goal of this patchset is to let it 'run' even while
> > the VM is migrating.
> 
> True. For the overflow one, perhaps TSC can be stopped (and restored) similarly to
> KVMCLOCK.
> 
> See QEMU's commit 00f4d64ee76e873be881a82d893a591487aa7950.

This is exactly what we want to avoid, and it is the main purpose of this patch set. 
Can this issue be fixed in the guest kernel instead?
 
I also don't completely agree with 'CLOCK_MONOTONIC' remark in the above qemu commit.
 
From clock_gettime man page
"CLOCK_MONOTONIC
   Clock that cannot be set and represents monotonic time since—as described by POSIX—"some unspecified point in the past".  
   On Linux, that point corresponds to the number of seconds that the system has been running since it was booted.
"
 
'system has been running' IMHO isn't fully defined by the above statement.

In the same man page we also have CLOCK_BOOTTIME which was introduced to fix the
above 'bug' and it states that:
 
"CLOCK_BOOTTIME (since Linux 2.6.39; Linux-specific)
      Identical to CLOCK_MONOTONIC, except it also includes any time that the system is suspended.  
      This allows applications to get a suspend-aware monotonic
      clock without having to deal with the complications of CLOCK_REALTIME, 
      which may have discontinuities if the time is changed using settimeofday(2) or similar.
"
 
This relates to OS initiated suspend (e.g S3), which suggests that 'system has been running'
is defined as 'system is in S0 state'.
 
Also, by stopping the kvmclock, even if we define the hypervisor initiated suspend 
as the 'os suspended state', we at least break CLOCK_BOOTTIME since the 
latter is also based on kvmclock.

AFAIK (I might be wrong here), the guest kernel only updates its 
CLOCK_BOOTTIME offset when it enters/exits the S3/S4 state.
 
(Look at 'tk_update_sleep_time', which updates 'offs_boot' 
which is what defines CLOCK_BOOTTIME AFAIK)
 
 
Another thing I noticed is that the kvmclock documentation (msr.rst)
specifies that the time reference in the kvmclock PV structure, 
should include sleep time, thus the qemu patch you referenced 
might break its promise:
 
"
system_time:
	a host notion of monotonic time, including sleep
	time at the time this structure was last updated. Unit is
	nanoseconds.
"

Best regards,
	Maxim Levitsky

> 
> > And if there is an issue, we really should try to fix it in
> > the guest kernel IMHO.
> > 
> > > > > Then Paolo mentions (with >), i am replying as usual.
> > > > > 
> > > > > > Ok, after looking more at the code with Maxim I can confidently say that
> > > > > > it's a total mess.  And a lot of the synchronization code is dead
> > > > > > because 1) as far as we could see no guest synchronizes the TSC using
> > > > > > MSR_IA32_TSC; 
> > > > > 
> > > > > Well, recent BIOS'es take care of synchronizing the TSC. So when Linux
> > > > > boots, it does not have to synchronize TSC in software. 
> > > > 
> > > > Do you have an example of such BIOS? I tested OVMF which I compiled
> > > > from git master a few weeks ago, and I also tested this with seabios 
> > > > from qemu repo, and I have never seen writes to either TSC, or TSC_ADJUST
> > > > from BIOS.
> > > 
> > > Oh, well, QEMU then.
> > > 
> > > > Or do you refer to the native BIOS on the host doing TSC synchronization?
> > > 
> > > No, virt.
> > 
> > I also (lightly) tested win10 guest, and win10 guest with Hyper-V enabled,
> > and in both cases I haven't observed TSC/TSC_ADJUST writes.
> > 
> > > > > However, upon migration (and initialization), the KVM_SET_TSC's do 
> > > > > not happen at exactly the same time (the MSRs for each vCPU are loaded
> > > > > in sequence). The synchronization code in kvm_set_tsc() is for those cases.
> > > > 
> > > > I agree with that, and this is one of the issues that KVM_SET_TSC_STATE
> > > > is going to fix, since it accounts for it.
> > > > 
> > > > 
> > > > > > and 2) writing to MSR_IA32_TSC_ADJUST does not trigger the
> > > > > > synchronization code in kvm_write_tsc.
> > > > > 
> > > > > Not familiar how guests are using MSR_IA32_TSC_ADJUST (or Linux)...
> > > > > Lets see:
> > > > > 
> > > > > 
> > > > > /*
> > > > >  * Freshly booted CPUs call into this:
> > > > >  */
> > > > > void check_tsc_sync_target(void)
> > > > > {
> > > > >         struct tsc_adjust *cur = this_cpu_ptr(&tsc_adjust);
> > > > >         unsigned int cpu = smp_processor_id();
> > > > >         cycles_t cur_max_warp, gbl_max_warp;
> > > > >         int cpus = 2;
> > > > > 
> > > > >         /* Also aborts if there is no TSC. */
> > > > >         if (unsynchronized_tsc())
> > > > >                 return;
> > > > > 
> > > > >         /*
> > > > >          * Store, verify and sanitize the TSC adjust register. If
> > > > >          * successful skip the test.
> > > > >          *
> > > > >          * The test is also skipped when the TSC is marked reliable. This
> > > > >          * is true for SoCs which have no fallback clocksource. On these
> > > > >          * SoCs the TSC is frequency synchronized, but still the TSC ADJUST
> > > > >          * register might have been wreckaged by the BIOS..
> > > > >          */
> > > > >         if (tsc_store_and_check_tsc_adjust(false) || tsc_clocksource_reliable) {
> > > > >                 atomic_inc(&skip_test);
> > > > >                 return;
> > > > >         }
> > > > > 
> > > > > retry:
> > > > > 
> > > > > I'd force that synchronization path to be taken as a test-case.
> > > > 
> > > > Or even better as I suggested, we might tell the guest kernel
> > > > to avoid this synchronization path when KVM is detected
> > > > (regardless of invtsc flag)
> > > > 
> > > > > > I have a few thoughts about the kvm masterclock synchronization,
> > > > > > which relate to the Paulo's proposal that I implemented.
> > > > > > 
> > > > > > The idea of masterclock is that when the host TSC is synchronized
> > > > > > (or as kernel call it, stable), and the guest TSC is synchronized as well,
> > > > > > then we can base the kvmclock, on the same pair of
> > > > > > (host time in nsec, host tsc value), for all vCPUs.
> > > > > 
> > > > > We _have_ to base. See the comment which starts with
> > > > > 
> > > > > "Assuming a stable TSC across physical CPUS, and a stable TSC"
> > > > > 
> > > > > at x86.c.
> > > > > 
> > > > > > This makes the random error in calculation of this value invariant
> > > > > > across vCPUS, and allows the guest to do kvmclock calculation in userspace
> > > > > > (vDSO) since kvmclock parameters are vCPU invariant.
> > > > > 
> > > > > Actually, without synchronized host TSCs (and the masterclock scheme,
> > > > > with a single base read from a vCPU), kvmclock in kernel is buggy as
> > > > > well:
> > > > > 
> > > > > u64 pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
> > > > > {
> > > > >         unsigned version;
> > > > >         u64 ret;
> > > > >         u64 last;
> > > > >         u8 flags;
> > > > > 
> > > > >         do {
> > > > >                 version = pvclock_read_begin(src);
> > > > >                 ret = __pvclock_read_cycles(src, rdtsc_ordered());
> > > > >                 flags = src->flags;
> > > > >         } while (pvclock_read_retry(src, version));
> > > > > 
> > > > >         if (unlikely((flags & PVCLOCK_GUEST_STOPPED) != 0)) {
> > > > >                 src->flags &= ~PVCLOCK_GUEST_STOPPED;
> > > > >                 pvclock_touch_watchdogs();
> > > > >         }
> > > > > 
> > > > >         if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) &&
> > > > >                 (flags & PVCLOCK_TSC_STABLE_BIT))
> > > > >                 return ret;
> > > > > 
> > > > > The code that follows this (including cmpxchg) is a workaround for that 
> > > > > bug.
> > > > 
> > > > I understand that. I am not arguing that we shoudn't use the masterclock!
> > > > I am just saying the facts about the condition when it works.
> > > 
> > > Sure.
> > > 
> > > > > Workaround would require each vCPU to write to a "global clock", on
> > > > > every clock read.
> > > > > 
> > > > > > To ensure that the guest tsc is synchronized we currently track host/guest tsc
> > > > > > writes, and enable the master clock only when roughly the same guest's TSC value
> > > > > > was written across all vCPUs.
> > > > > 
> > > > > Yes, because then you can do:
> > > > > 
> > > > > vcpu0				vcpu1
> > > > > 
> > > > > A = read TSC
> > > > > 		... elapsed time ...
> > > > > 
> > > > > 				B = read TSC
> > > > > 
> > > > > 				delta = B - A
> > > > > 
> > > > > > Recently this was disabled by Paulo
> > > > > 
> > > > > What was disabled exactly?
> > > > 
> > > > The running of tsc synchronization code when the _guest_ writes the TSC.
> > > > 
> > > > Which changes two things:
> > > >    1. If the guest de-synchronizes its TSC, we won't disable master clock.
> > > >    2. If the guest writes similar TSC values on each vCPU we won't detect
> > > >       this as synchronization attempt, replace this with exactly the same
> > > >       value and finally re-enable the master clock.
> > > > 
> > > > I argue that this change is OK, because Linux guests don't write to TSC at all,
> > > > the virtual BIOSes seems not to write there either, and the only case in which
> > > > the Linux guest tries to change its TSC is on CPU hotplug as you mention and 
> > > > it uses TSC_ADJUST, that currently doesn't trigger TSC synchronization code in
> > > > KVM anyway, so it is broken already.
> > > > 
> > > > However I also argue that we should mention this in documentation just in case,
> > > > and we might also want (also just in case) to make Linux guests avoid even trying to
> > > > touch TSC_ADJUST register when running under KVM.
> > > > 
> > > > To rehash my own words, the KVM_CLOCK_TSC_STABLE should be defined as:
> > > > 'kvmclock is vCPU invariant, as long as the guest doesn't mess with its TSC'.
> > > > 
> > > > Having said all that, now that I know tsc sync code, and the
> > > > reasons why it is there, I wouldn't be arguing about putting it back either.
> > > 
> > > Agree.
> > > 
> > > > > > and I agree with this, because I think
> > > > > > that we indeed should only make the guest TSC synchronized by default
> > > > > > (including new hotplugged vCPUs) and not do any tsc synchronization beyond that.
> > > > > > (Trying to guess when the guest syncs the TSC can cause more harm that good).
> > > > > > 
> > > > > > Besides, Linux guests don't sync the TSC via IA32_TSC write,
> > > > > > but rather use IA32_TSC_ADJUST which currently doesn't participate
> > > > > > in the tsc sync heruistics.
> > > > > 
> > > > > Linux should not try to sync the TSC with IA32_TSC_ADJUST. It expects
> > > > > the BIOS to boot with synced TSCs.
> > > > > 
> > > > > So i wonder what is making it attempt TSC sync in the first place?
> > > > 
> > > > CPU hotplug. And the guest doesn't really write to TSC_ADJUST 
> > > > since it's measurement code doesn't detect any tsc warps. 
> > > >  
> > > > I was just thinking that in theory since, this is a VM, and it can be 
> > > > interrupted at any point, the measurement code should sometimes fall,
> > > > and cause trouble.
> > > > I didn't do much homework on this so I might be overreacting.
> > > 
> > > That is true (and you can see it with a CPU starved guest).
> > > 
> > > > As far as I see X86_FEATURE_TSC_RELIABLE was done mostly to support
> > > > running under Hyper-V and VMWARE, and these should be prone to similar
> > > > issues, supporting my theory.
> > > > 
> > > > > (one might also want to have Linux's synchronization via IA32_TSC_ADJUST 
> > > > > working, but it should not need to happen in the first place, as long as 
> > > > > QEMU and KVM are behaving properly).
> > > > > 
> > > > > > And as far as I know, Linux guest is the primary (only?) user of the kvmclock.
> > > > > 
> > > > > Only AFAIK.
> > > > > 
> > > > > > I *do think* however that we should redefine KVM_CLOCK_TSC_STABLE
> > > > > > in the documentation to state that it only guarantees invariance if the guest
> > > > > > doesn't mess with its own TSC.
> > > > > > 
> > > > > > Also I think we should consider enabling the X86_FEATURE_TSC_RELIABLE
> > > > > > in the guest kernel, when kvm is detected to avoid the guest even from trying
> > > > > > to sync TSC on newly hotplugged vCPUs.
> > > > > 
> > > > > See 7539b174aef405d9d57db48c58390ba360c91312.
> > > > 
> > > > I know about this, and I personally always use invtsc
> > > > with my VMs.
> > > 
> > > Well, we can't make it (-cpu xxx,+invtsc) the default if vm-stop/vm-cont are unstable
> > > with TSC!
> > 
> > Could you elaborate on this too? Are you referring to the same issue you 
> > had mentioned about the overflow in the kernel time accounting?
> 
> Well, any issue that could show up.
> 
> > > > > Was hoping to make that (-cpu xxx,+invtsc) the default in QEMU once invariant TSC code
> > > > > becomes stable. Should be tested enough by now?
> > > > 
> > > > The issue is that Qemu blocks migration when invtsc is set, based on the
> > > > fact that the target machine might have different TSC frequency and no
> > > > support for TSC scaling.
> > > > There was a long debate on this long ago.
> > > 
> > > Oh right.
> > > 
> > > > It is possible though to override this by specifying the exact frequency
> > > > you want the guest TSC to run at, by using something like
> > > > (tsc-frequency=3500000000)
> > > > I haven't checked if libvirt does this or not.
> > > 
> > > It does.
> > Cool.
> > > > I do think that as long as the user uses modern CPUs (which have stable TSC
> > > > and support TSC scaling), there is no reason to disable invtsc, and
> > > > therefore no reason to use kvmclock.
> > > 
> > > Yep. TSC is faster.
> > 
> > Also this bit is sometimes used by userspace tools.
> 
> Yep! SAP HANA as well.
> 
> > Some time ago I found out that fio uses it to decide whether 
> > to use TSC for measurements.
> > 
> > I didn't know this and was running fio in a guest without 'invtsc'.
> > Fio switched to plain gettimeofday behind my back
> > and totally screwed up the results.
> > 
> > > > > > (The guest doesn't end up touching TSC_ADJUST usually, but it still might
> > > > > > in some cases due to scheduling of guest vCPUs)
> > > > > > 
> > > > > > (X86_FEATURE_TSC_RELIABLE short circuits tsc synchronization on CPU hotplug,
> > > > > > and TSC clocksource watchdog, and the later we might want to keep).
> > > > > 
> > > > > The latter we want to keep.
> > > > > 
> > > > > > For host TSC writes, just as Paulo proposed we can still do the tsc sync,
> > > > > > unless the new code that I implemented is in use.
> > > > > 
> > > > > So Paolo's proposal is to
> > > > > 
> > > > > "- for live migration, userspace is expected to use the new
> > > > > KVM_GET/SET_TSC_PRECISE (or whatever the name will be) to get/set a
> > > > > (nanosecond, TSC, TSC_ADJUST) tuple."
> > > > > 
> > > > > Makes sense, so that no time between KVM_SET_TSC and
> > > > > MSR_WRITE(TSC_ADJUST) elapses, which would cause the TSC to go out
> > > > > of what is desired by the user.
> > > > > 
> > > > > Since you are proposing this new ioctl, perhaps its useful to also
> > > > > reduce the 100ms jump? 
> > > > 
> > > > Yep. As long as target and destantion clocks are synchronized,
> > > > it should make it better.
> > > > 
> > > > > "- for live migration, userspace is expected to use the new
> > > > > KVM_GET/SET_TSC_PRECISE (or whatever the name will be) to get/set a
> > > > > (nanosecond, TSC, TSC_ADJUST) tuple. This value will be written
> > > > > to the guest before the first VM-entry"
> > > > > 
> > > > > Sounds like a good idea (to integrate the values in a tuple).
> > > > > 
> > > > > > Few more random notes:
> > > > > > 
> > > > > > I have a weird feeling about using 'nsec since 1 January 1970'.
> > > > > > Common sense is telling me that a 64 bit value can hold about 580 years,
> > > > > > but still I see that it is more common to use timespec which is a (sec,nsec) pair.
> > > > > 
> > > > >            struct timespec {
> > > > >                time_t   tv_sec;        /* seconds */
> > > > >                long     tv_nsec;       /* nanoseconds */
> > > > >            };
> > > > > 
> > > > > > I feel that 'kvm_get_walltime' that I added is a bit of a hack.
> > > > > > Some refactoring might improve things here.
> > > > > 
> > > > > Haven't read the patchset yet...
> > > > > 
> > > > > > For example making kvm_get_walltime_and_clockread work in non tsc case as well
> > > > > > might make the code cleaner.
> > > > > > 
> > > > > > Patches to enable this feature in qemu are in process of being sent to
> > > > > > qemu-devel mailing list.
> > > > > > 
> > > > > > Best regards,
> > > > > >        Maxim Levitsky
> > > > > > 
> > > > > > Maxim Levitsky (2):
> > > > > >   KVM: x86: implement KVM_SET_TSC_PRECISE/KVM_GET_TSC_PRECISE
> > > > > >   KVM: x86: introduce KVM_X86_QUIRK_TSC_HOST_ACCESS
> > > > > > 
> > > > > >  Documentation/virt/kvm/api.rst  | 56 +++++++++++++++++++++
> > > > > >  arch/x86/include/uapi/asm/kvm.h |  1 +
> > > > > >  arch/x86/kvm/x86.c              | 88 +++++++++++++++++++++++++++++++--
> > > > > >  include/uapi/linux/kvm.h        | 14 ++++++
> > > > > >  4 files changed, 154 insertions(+), 5 deletions(-)
> > > > > > 
> > > > > > -- 
> > > > > > 2.26.2
> > > > > > 
> > > > 
> > > > Best regards,
> > > > 	Maxim Levitsky
> > > > 
> > 
> > Best regards,
> > 	Maxim Levitsky



^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2020-12-07 13:02 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-30 13:35 [PATCH 0/2] RFC: Precise TSC migration Maxim Levitsky
2020-11-30 13:35 ` [PATCH 1/2] KVM: x86: implement KVM_SET_TSC_PRECISE/KVM_GET_TSC_PRECISE Maxim Levitsky
2020-11-30 14:33   ` Paolo Bonzini
2020-11-30 15:58     ` Maxim Levitsky
2020-11-30 17:01       ` Paolo Bonzini
2020-12-01 19:43   ` Thomas Gleixner
2020-12-03 11:11     ` Maxim Levitsky
2020-11-30 13:35 ` [PATCH 2/2] KVM: x86: introduce KVM_X86_QUIRK_TSC_HOST_ACCESS Maxim Levitsky
2020-11-30 13:54   ` Paolo Bonzini
2020-11-30 14:11     ` Maxim Levitsky
2020-11-30 14:15       ` Paolo Bonzini
2020-11-30 15:33         ` Maxim Levitsky
2020-11-30 13:38 ` [PATCH 0/2] RFC: Precise TSC migration (summary) Maxim Levitsky
2020-11-30 16:54 ` [PATCH 0/2] RFC: Precise TSC migration Andy Lutomirski
2020-11-30 16:59   ` Paolo Bonzini
2020-11-30 19:16 ` Marcelo Tosatti
2020-12-01 12:30   ` Maxim Levitsky
2020-12-01 19:48     ` Marcelo Tosatti
2020-12-03 11:39       ` Maxim Levitsky
2020-12-03 20:18         ` Marcelo Tosatti
2020-12-07 13:00           ` Maxim Levitsky
2020-12-01 13:48   ` Thomas Gleixner
2020-12-01 15:02     ` Marcelo Tosatti
2020-12-03 11:51       ` Maxim Levitsky
2020-12-01 14:01   ` Thomas Gleixner
2020-12-01 16:19     ` Andy Lutomirski
2020-12-03 11:57       ` Maxim Levitsky
2020-12-01 19:35 ` Thomas Gleixner
2020-12-03 11:41   ` Paolo Bonzini
2020-12-03 12:47   ` Maxim Levitsky

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).