* [PATCH v8 0/7] KVM: x86: Add idempotent controls for migrating system counter state
From: Oliver Upton @ 2021-09-16 18:15 UTC (permalink / raw)
  To: kvm, kvmarm
  Cc: Catalin Marinas, Will Deacon, Marc Zyngier, Peter Shier,
	Sean Christopherson, David Matlack, Paolo Bonzini,
	linux-arm-kernel, Jim Mattson

KVM's current means of saving/restoring system counters is plagued with
temporal issues. On x86, we migrate the guest's system counter by value
through the guest's IA32_TSC MSR. Restoring system counters by value is
brittle because the state is not idempotent: the host system counter
keeps advancing between the attempted save and restore. Furthermore,
VMMs may wish to transparently live migrate guest VMs, meaning that the
time elapsed during the live migration blackout must be reflected in the
guest's view of the system counter. The VMM thread could be preempted
for any number of reasons (scheduler, L0 hypervisor when running nested)
between the time it calculates the desired guest counter value and the
time KVM actually sets that counter state.
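
To make the failure mode concrete, a value-based restore on the VMM side
looks roughly like the sketch below (illustrative pseudocode, not code
from this series; the helpers are hypothetical, and the final write
would typically go through KVM_SET_MSRS on MSR_IA32_TSC):

  /*
   * Compute the counter value the guest should observe, then ask KVM
   * to write it. Any delay between the two steps silently ages the
   * value, because the host counter keeps ticking.
   */
  guest_tsc = saved_guest_tsc + elapsed_guest_cycles(blackout_ns);

  /* <-- a preemption here (scheduler, L0 exit, ...) skews the result */

  kvm_vcpu_write_tsc(vcpu_fd, guest_tsc);  /* hypothetical wrapper */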

Despite the value-based interface that we present to userspace, KVM
actually has idempotent guest controls by way of the TSC offset.
We can avoid all of the issues associated with a value-based interface
by exposing these offset controls through a new device attribute. This
series introduces new vCPU device attributes to provide userspace access
to the vCPU's system counter offset.

Patches 1-2 are Paolo's refactorings around locking and the
KVM_{GET,SET}_CLOCK ioctls.

Patch 3 cures a race where use_master_clock is read outside of the
pvclock lock in the KVM_GET_CLOCK ioctl.

Patch 4 adopts Paolo's suggestion, augmenting the KVM_{GET,SET}_CLOCK
ioctls to provide userspace with a (host_tsc, realtime) instant. This is
essential for a VMM to perform precise migration of the guest's system
counters.
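
Roughly, a VMM could consume the new (host_tsc, realtime) pair as in the
sketch below (error handling and the transfer between hosts omitted; an
intended-usage sketch, not code from this series):

  struct kvm_clock_data data;

  /* Source: snapshot kvmclock together with the host realtime instant. */
  ioctl(vm_fd_src, KVM_GET_CLOCK, &data);

  /*
   * Destination: hand the pair back. Only KVM_CLOCK_REALTIME is valid
   * on the set side; when present, KVM advances the kvmclock by the
   * real time that elapsed since the snapshot was taken.
   */
  data.flags &= KVM_CLOCK_REALTIME;
  ioctl(vm_fd_dst, KVM_SET_CLOCK, &data);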

Patch 5 does away with the pvclock spin lock in favor of a sequence
lock based on the tsc_write_lock. The original patch is from Paolo; I
touched it up a bit to fix a deadlock and remove some unused variables
that tripped -Werror.
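
For reference, readers of the masterclock data under such a scheme
follow the usual seqcount pattern, roughly as below (illustrative only;
the seqcount field name is an assumption, not taken from patch 5):

  unsigned int seq;
  u64 ns;

  do {
          seq = read_seqcount_begin(&ka->pvclock_sc);
          ns = ka->master_kernel_ns + ka->kvmclock_offset;
  } while (read_seqcount_retry(&ka->pvclock_sc, seq));

Writers serialize on tsc_write_lock and bump the sequence count around
updates, so readers no longer need to take a spin lock.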

Patch 6 extracts the TSC synchronization tracking code such that it can
be used for both offset-based and value-based TSC synchronization
schemes.

Finally, patch 7 implements a vCPU device attribute which allows VMMs to
access the TSC offset of a vCPU.
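
As an example, a VMM could read the offset along the following lines
(the group/attribute constants are the ones this series adds, to the
best of my reading; treat the snippet as a sketch):

  uint64_t offset;
  struct kvm_device_attr attr = {
          .group = KVM_VCPU_TSC_CTRL,
          .attr  = KVM_VCPU_TSC_OFFSET,
          .addr  = (uint64_t)&offset,
  };

  /*
   * KVM_{HAS,GET,SET}_DEVICE_ATTR are available on vCPU fds once
   * KVM_CAP_VCPU_ATTRIBUTES is advertised.
   */
  ioctl(vcpu_fd, KVM_GET_DEVICE_ATTR, &attr);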

This series was tested with the new KVM selftests for the KVM clock and
system counter offset controls on Haswell hardware. The kernel was built
with CONFIG_LOCKDEP enabled, given the new locking changes and lockdep
assertions in this series.

Note that these tests are mailed as a separate series due to the
dependencies in both x86 and arm64.

Applies cleanly to 5.15-rc1

v8: http://lore.kernel.org/r/20210816001130.3059564-1-oupton@google.com

v7 -> v8:
 - Rebased to 5.15-rc1
 - Picked up Paolo's version of the series, which includes locking
   changes
 - Make KVM advertise KVM_CAP_VCPU_ATTRIBUTES

Oliver Upton (4):
  KVM: x86: Fix potential race in KVM_GET_CLOCK
  KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK
  KVM: x86: Refactor tsc synchronization code
  KVM: x86: Expose TSC offset controls to userspace

Paolo Bonzini (3):
  kvm: x86: abstract locking around pvclock_update_vm_gtod_copy
  KVM: x86: extract KVM_GET_CLOCK/KVM_SET_CLOCK to separate functions
  kvm: x86: protect masterclock with a seqcount

 Documentation/virt/kvm/api.rst          |  42 ++-
 Documentation/virt/kvm/devices/vcpu.rst |  57 +++
 arch/x86/include/asm/kvm_host.h         |  12 +-
 arch/x86/include/uapi/asm/kvm.h         |   4 +
 arch/x86/kvm/x86.c                      | 458 ++++++++++++++++--------
 include/uapi/linux/kvm.h                |   7 +-
 6 files changed, 419 insertions(+), 161 deletions(-)

-- 
2.33.0.309.g3052b89438-goog

* [PATCH v8 1/7] kvm: x86: abstract locking around pvclock_update_vm_gtod_copy
From: Oliver Upton @ 2021-09-16 18:15 UTC (permalink / raw)
  To: kvm, kvmarm
  Cc: Catalin Marinas, Will Deacon, Marc Zyngier, Peter Shier,
	Sean Christopherson, David Matlack, Paolo Bonzini,
	linux-arm-kernel, Jim Mattson

From: Paolo Bonzini <pbonzini@redhat.com>

Updates to the kvmclock parameters need to do a complicated dance of
KVM_REQ_MCLOCK_INPROGRESS and KVM_REQ_CLOCK_UPDATE in addition to taking
pvclock_gtod_sync_lock.  Place that in two functions that can be called
on all of master clock update, KVM_SET_CLOCK, and Hyper-V reenlightenment.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Oliver Upton <oupton@google.com>
---
 arch/x86/include/asm/kvm_host.h |  1 -
 arch/x86/kvm/x86.c              | 62 +++++++++++++++------------------
 2 files changed, 29 insertions(+), 34 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f8f48a7ec577..be6805fc0260 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1866,7 +1866,6 @@ u64 kvm_calc_nested_tsc_multiplier(u64 l1_multiplier, u64 l2_multiplier);
 unsigned long kvm_get_linear_rip(struct kvm_vcpu *vcpu);
 bool kvm_is_linear_rip(struct kvm_vcpu *vcpu, unsigned long linear_rip);
 
-void kvm_make_mclock_inprogress_request(struct kvm *kvm);
 void kvm_make_scan_ioapic_request(struct kvm *kvm);
 void kvm_make_scan_ioapic_request_mask(struct kvm *kvm,
 				       unsigned long *vcpu_bitmap);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 28ef14155726..1082b48418c3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2755,35 +2755,42 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
 #endif
 }
 
-void kvm_make_mclock_inprogress_request(struct kvm *kvm)
+static void kvm_make_mclock_inprogress_request(struct kvm *kvm)
 {
 	kvm_make_all_cpus_request(kvm, KVM_REQ_MCLOCK_INPROGRESS);
 }
 
-static void kvm_gen_update_masterclock(struct kvm *kvm)
+static void kvm_start_pvclock_update(struct kvm *kvm)
 {
-#ifdef CONFIG_X86_64
-	int i;
-	struct kvm_vcpu *vcpu;
 	struct kvm_arch *ka = &kvm->arch;
-	unsigned long flags;
-
-	kvm_hv_invalidate_tsc_page(kvm);
 
 	kvm_make_mclock_inprogress_request(kvm);
 
 	/* no guest entries from this point */
-	spin_lock_irqsave(&ka->pvclock_gtod_sync_lock, flags);
-	pvclock_update_vm_gtod_copy(kvm);
-	spin_unlock_irqrestore(&ka->pvclock_gtod_sync_lock, flags);
+	spin_lock_irq(&ka->pvclock_gtod_sync_lock);
+}
 
+static void kvm_end_pvclock_update(struct kvm *kvm)
+{
+	struct kvm_arch *ka = &kvm->arch;
+	struct kvm_vcpu *vcpu;
+	int i;
+
+	spin_unlock_irq(&ka->pvclock_gtod_sync_lock);
 	kvm_for_each_vcpu(i, vcpu, kvm)
 		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
 
 	/* guest entries allowed */
 	kvm_for_each_vcpu(i, vcpu, kvm)
 		kvm_clear_request(KVM_REQ_MCLOCK_INPROGRESS, vcpu);
-#endif
+}
+
+static void kvm_update_masterclock(struct kvm *kvm)
+{
+	kvm_hv_invalidate_tsc_page(kvm);
+	kvm_start_pvclock_update(kvm);
+	pvclock_update_vm_gtod_copy(kvm);
+	kvm_end_pvclock_update(kvm);
 }
 
 u64 get_kvmclock_ns(struct kvm *kvm)
@@ -6079,12 +6086,10 @@ long kvm_arch_vm_ioctl(struct file *filp,
 			goto out;
 
 		r = 0;
-		/*
-		 * TODO: userspace has to take care of races with VCPU_RUN, so
-		 * kvm_gen_update_masterclock() can be cut down to locked
-		 * pvclock_update_vm_gtod_copy().
-		 */
-		kvm_gen_update_masterclock(kvm);
+
+		kvm_hv_invalidate_tsc_page(kvm);
+		kvm_start_pvclock_update(kvm);
+		pvclock_update_vm_gtod_copy(kvm);
 
 		/*
 		 * This pairs with kvm_guest_time_update(): when masterclock is
@@ -6093,15 +6098,12 @@ long kvm_arch_vm_ioctl(struct file *filp,
 		 * is slightly ahead) here we risk going negative on unsigned
 		 * 'system_time' when 'user_ns.clock' is very small.
 		 */
-		spin_lock_irq(&ka->pvclock_gtod_sync_lock);
 		if (kvm->arch.use_master_clock)
 			now_ns = ka->master_kernel_ns;
 		else
 			now_ns = get_kvmclock_base_ns();
 		ka->kvmclock_offset = user_ns.clock - now_ns;
-		spin_unlock_irq(&ka->pvclock_gtod_sync_lock);
-
-		kvm_make_all_cpus_request(kvm, KVM_REQ_CLOCK_UPDATE);
+		kvm_end_pvclock_update(kvm);
 		break;
 	}
 	case KVM_GET_CLOCK: {
@@ -8107,14 +8109,13 @@ static void tsc_khz_changed(void *data)
 static void kvm_hyperv_tsc_notifier(void)
 {
 	struct kvm *kvm;
-	struct kvm_vcpu *vcpu;
 	int cpu;
-	unsigned long flags;
 
 	mutex_lock(&kvm_lock);
 	list_for_each_entry(kvm, &vm_list, vm_list)
 		kvm_make_mclock_inprogress_request(kvm);
 
+	/* no guest entries from this point */
 	hyperv_stop_tsc_emulation();
 
 	/* TSC frequency always matches when on Hyper-V */
@@ -8125,16 +8126,11 @@ static void kvm_hyperv_tsc_notifier(void)
 	list_for_each_entry(kvm, &vm_list, vm_list) {
 		struct kvm_arch *ka = &kvm->arch;
 
-		spin_lock_irqsave(&ka->pvclock_gtod_sync_lock, flags);
+		spin_lock_irq(&ka->pvclock_gtod_sync_lock);
 		pvclock_update_vm_gtod_copy(kvm);
-		spin_unlock_irqrestore(&ka->pvclock_gtod_sync_lock, flags);
-
-		kvm_for_each_vcpu(cpu, vcpu, kvm)
-			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
-
-		kvm_for_each_vcpu(cpu, vcpu, kvm)
-			kvm_clear_request(KVM_REQ_MCLOCK_INPROGRESS, vcpu);
+		kvm_end_pvclock_update(kvm);
 	}
+
 	mutex_unlock(&kvm_lock);
 }
 #endif
@@ -9418,7 +9414,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		if (kvm_check_request(KVM_REQ_MIGRATE_TIMER, vcpu))
 			__kvm_migrate_timers(vcpu);
 		if (kvm_check_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu))
-			kvm_gen_update_masterclock(vcpu->kvm);
+			kvm_update_masterclock(vcpu->kvm);
 		if (kvm_check_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu))
 			kvm_gen_kvmclock_update(vcpu);
 		if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, vcpu)) {
-- 
2.33.0.309.g3052b89438-goog

* [PATCH v8 2/7] KVM: x86: extract KVM_GET_CLOCK/KVM_SET_CLOCK to separate functions
From: Oliver Upton @ 2021-09-16 18:15 UTC (permalink / raw)
  To: kvm, kvmarm
  Cc: Catalin Marinas, Will Deacon, Marc Zyngier, Peter Shier,
	Sean Christopherson, David Matlack, Paolo Bonzini,
	linux-arm-kernel, Jim Mattson

From: Paolo Bonzini <pbonzini@redhat.com>

No functional change intended.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Oliver Upton <oupton@google.com>
---
 arch/x86/kvm/x86.c | 99 ++++++++++++++++++++++++----------------------
 1 file changed, 52 insertions(+), 47 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1082b48418c3..c910cf31958f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5829,6 +5829,54 @@ int kvm_arch_pm_notifier(struct kvm *kvm, unsigned long state)
 }
 #endif /* CONFIG_HAVE_KVM_PM_NOTIFIER */
 
+static int kvm_vm_ioctl_get_clock(struct kvm *kvm, void __user *argp)
+{
+	struct kvm_clock_data data;
+	u64 now_ns;
+
+	now_ns = get_kvmclock_ns(kvm);
+	data.clock = now_ns;
+	data.flags = kvm->arch.use_master_clock ? KVM_CLOCK_TSC_STABLE : 0;
+	memset(&data.pad, 0, sizeof(data.pad));
+
+	if (copy_to_user(argp, &data, sizeof(data)))
+		return -EFAULT;
+
+	return 0;
+}
+
+static int kvm_vm_ioctl_set_clock(struct kvm *kvm, void __user *argp)
+{
+	struct kvm_arch *ka = &kvm->arch;
+	struct kvm_clock_data data;
+	u64 now_ns;
+
+	if (copy_from_user(&data, argp, sizeof(data)))
+		return -EFAULT;
+
+	if (data.flags)
+		return -EINVAL;
+
+	kvm_hv_invalidate_tsc_page(kvm);
+	kvm_start_pvclock_update(kvm);
+	pvclock_update_vm_gtod_copy(kvm);
+
+	/*
+	 * This pairs with kvm_guest_time_update(): when masterclock is
+	 * in use, we use master_kernel_ns + kvmclock_offset to set
+	 * unsigned 'system_time' so if we use get_kvmclock_ns() (which
+	 * is slightly ahead) here we risk going negative on unsigned
+	 * 'system_time' when 'data.clock' is very small.
+	 */
+	if (kvm->arch.use_master_clock)
+		now_ns = ka->master_kernel_ns;
+	else
+		now_ns = get_kvmclock_base_ns();
+	ka->kvmclock_offset = data.clock - now_ns;
+	kvm_end_pvclock_update(kvm);
+	return 0;
+}
+
 long kvm_arch_vm_ioctl(struct file *filp,
 		       unsigned int ioctl, unsigned long arg)
 {
@@ -6072,55 +6120,12 @@ long kvm_arch_vm_ioctl(struct file *filp,
 		break;
 	}
 #endif
-	case KVM_SET_CLOCK: {
-		struct kvm_arch *ka = &kvm->arch;
-		struct kvm_clock_data user_ns;
-		u64 now_ns;
-
-		r = -EFAULT;
-		if (copy_from_user(&user_ns, argp, sizeof(user_ns)))
-			goto out;
-
-		r = -EINVAL;
-		if (user_ns.flags)
-			goto out;
-
-		r = 0;
-
-		kvm_hv_invalidate_tsc_page(kvm);
-		kvm_start_pvclock_update(kvm);
-		pvclock_update_vm_gtod_copy(kvm);
-
-		/*
-		 * This pairs with kvm_guest_time_update(): when masterclock is
-		 * in use, we use master_kernel_ns + kvmclock_offset to set
-		 * unsigned 'system_time' so if we use get_kvmclock_ns() (which
-		 * is slightly ahead) here we risk going negative on unsigned
-		 * 'system_time' when 'user_ns.clock' is very small.
-		 */
-		if (kvm->arch.use_master_clock)
-			now_ns = ka->master_kernel_ns;
-		else
-			now_ns = get_kvmclock_base_ns();
-		ka->kvmclock_offset = user_ns.clock - now_ns;
-		kvm_end_pvclock_update(kvm);
+	case KVM_SET_CLOCK:
+		r = kvm_vm_ioctl_set_clock(kvm, argp);
 		break;
-	}
-	case KVM_GET_CLOCK: {
-		struct kvm_clock_data user_ns;
-		u64 now_ns;
-
-		now_ns = get_kvmclock_ns(kvm);
-		user_ns.clock = now_ns;
-		user_ns.flags = kvm->arch.use_master_clock ? KVM_CLOCK_TSC_STABLE : 0;
-		memset(&user_ns.pad, 0, sizeof(user_ns.pad));
-
-		r = -EFAULT;
-		if (copy_to_user(argp, &user_ns, sizeof(user_ns)))
-			goto out;
-		r = 0;
+	case KVM_GET_CLOCK:
+		r = kvm_vm_ioctl_get_clock(kvm, argp);
 		break;
-	}
 	case KVM_MEMORY_ENCRYPT_OP: {
 		r = -ENOTTY;
 		if (kvm_x86_ops.mem_enc_op)
-- 
2.33.0.309.g3052b89438-goog

* [PATCH v8 3/7] KVM: x86: Fix potential race in KVM_GET_CLOCK
From: Oliver Upton @ 2021-09-16 18:15 UTC (permalink / raw)
  To: kvm, kvmarm
  Cc: Catalin Marinas, Will Deacon, Marc Zyngier, Peter Shier,
	Sean Christopherson, David Matlack, Paolo Bonzini,
	linux-arm-kernel, Jim Mattson

Sean noticed that KVM_GET_CLOCK was checking kvm_arch.use_master_clock
outside of the pvclock sync lock. This is problematic, as the clock
value written to the user may or may not actually correspond to a stable
TSC.

Fix the race by populating the entire kvm_clock_data structure behind
the pvclock_gtod_sync_lock.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Oliver Upton <oupton@google.com>
---
 arch/x86/kvm/x86.c | 36 +++++++++++++++++++++++-------------
 1 file changed, 23 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c910cf31958f..523c4e5c109f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2793,19 +2793,20 @@ static void kvm_update_masterclock(struct kvm *kvm)
 	kvm_end_pvclock_update(kvm);
 }
 
-u64 get_kvmclock_ns(struct kvm *kvm)
+static void get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data)
 {
 	struct kvm_arch *ka = &kvm->arch;
 	struct pvclock_vcpu_time_info hv_clock;
 	unsigned long flags;
-	u64 ret;
 
 	spin_lock_irqsave(&ka->pvclock_gtod_sync_lock, flags);
 	if (!ka->use_master_clock) {
 		spin_unlock_irqrestore(&ka->pvclock_gtod_sync_lock, flags);
-		return get_kvmclock_base_ns() + ka->kvmclock_offset;
+		data->clock = get_kvmclock_base_ns() + ka->kvmclock_offset;
+		return;
 	}
 
+	data->flags |= KVM_CLOCK_TSC_STABLE;
 	hv_clock.tsc_timestamp = ka->master_cycle_now;
 	hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
 	spin_unlock_irqrestore(&ka->pvclock_gtod_sync_lock, flags);
@@ -2817,13 +2818,26 @@ u64 get_kvmclock_ns(struct kvm *kvm)
 		kvm_get_time_scale(NSEC_PER_SEC, __this_cpu_read(cpu_tsc_khz) * 1000LL,
 				   &hv_clock.tsc_shift,
 				   &hv_clock.tsc_to_system_mul);
-		ret = __pvclock_read_cycles(&hv_clock, rdtsc());
-	} else
-		ret = get_kvmclock_base_ns() + ka->kvmclock_offset;
+		data->clock = __pvclock_read_cycles(&hv_clock, rdtsc());
+	} else {
+		data->clock = get_kvmclock_base_ns() + ka->kvmclock_offset;
+	}
 
 	put_cpu();
+}
 
-	return ret;
+u64 get_kvmclock_ns(struct kvm *kvm)
+{
+	struct kvm_clock_data data;
+
+	/*
+	 * Zero flags as it's accessed RMW, leave everything else uninitialized
+	 * as clock is always written and no other fields are consumed.
+	 */
+	data.flags = 0;
+
+	get_kvmclock(kvm, &data);
+	return data.clock;
 }
 
 static void kvm_setup_pvclock_page(struct kvm_vcpu *v,
@@ -5832,13 +5846,9 @@ int kvm_arch_pm_notifier(struct kvm *kvm, unsigned long state)
 static int kvm_vm_ioctl_get_clock(struct kvm *kvm, void __user *argp)
 {
 	struct kvm_clock_data data;
-	u64 now_ns;
-
-	now_ns = get_kvmclock_ns(kvm);
-	data.clock = now_ns;
-	data.flags = kvm->arch.use_master_clock ? KVM_CLOCK_TSC_STABLE : 0;
-	memset(&data.pad, 0, sizeof(data.pad));
 
+	memset(&data, 0, sizeof(data));
+	get_kvmclock(kvm, &data);
 	if (copy_to_user(argp, &data, sizeof(data)))
 		return -EFAULT;
 
-- 
2.33.0.309.g3052b89438-goog

* [PATCH v8 4/7] KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK
From: Oliver Upton @ 2021-09-16 18:15 UTC (permalink / raw)
  To: kvm, kvmarm
  Cc: Catalin Marinas, Will Deacon, Marc Zyngier, Peter Shier,
	Sean Christopherson, David Matlack, Paolo Bonzini,
	linux-arm-kernel, Jim Mattson

Handling the migration of TSCs correctly is difficult, in part because
Linux does not provide userspace with the ability to retrieve a (TSC,
realtime) clock pair for a single instant in time. In lieu of a more
convenient facility, KVM can report similar information in the kvm_clock
structure.

Provide userspace with a host TSC & realtime pair iff the realtime clock
is based on the TSC. If userspace provides KVM_SET_CLOCK with a valid
realtime value, advance the KVM clock by the amount of elapsed time. Do
not step the KVM clock backwards, though, as the guest expects it to
be monotonic.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Oliver Upton <oupton@google.com>
---
 Documentation/virt/kvm/api.rst  | 42 ++++++++++++++++++++++++++-------
 arch/x86/include/asm/kvm_host.h |  3 +++
 arch/x86/kvm/x86.c              | 36 +++++++++++++++++++++-------
 include/uapi/linux/kvm.h        |  7 +++++-
 4 files changed, 70 insertions(+), 18 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index a6729c8cf063..d0b9c986cf6c 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -993,20 +993,34 @@ such as migration.
 When KVM_CAP_ADJUST_CLOCK is passed to KVM_CHECK_EXTENSION, it returns the
 set of bits that KVM can return in struct kvm_clock_data's flag member.
 
-The only flag defined now is KVM_CLOCK_TSC_STABLE.  If set, the returned
-value is the exact kvmclock value seen by all VCPUs at the instant
-when KVM_GET_CLOCK was called.  If clear, the returned value is simply
-CLOCK_MONOTONIC plus a constant offset; the offset can be modified
-with KVM_SET_CLOCK.  KVM will try to make all VCPUs follow this clock,
-but the exact value read by each VCPU could differ, because the host
-TSC is not stable.
+FLAGS:
+
+KVM_CLOCK_TSC_STABLE.  If set, the returned value is the exact kvmclock
+value seen by all VCPUs at the instant when KVM_GET_CLOCK was called.
+If clear, the returned value is simply CLOCK_MONOTONIC plus a constant
+offset; the offset can be modified with KVM_SET_CLOCK.  KVM will try
+to make all VCPUs follow this clock, but the exact value read by each
+VCPU could differ, because the host TSC is not stable.
+
+KVM_CLOCK_REALTIME.  If set, the `realtime` field in the kvm_clock_data
+structure is populated with the value of the host's real time
+clocksource at the instant when KVM_GET_CLOCK was called. If clear,
+the `realtime` field does not contain a value.
+
+KVM_CLOCK_HOST_TSC.  If set, the `host_tsc` field in the kvm_clock_data
+structure is populated with the value of the host's timestamp counter (TSC)
+at the instant when KVM_GET_CLOCK was called. If clear, the `host_tsc` field
+does not contain a value.
 
 ::
 
   struct kvm_clock_data {
 	__u64 clock;  /* kvmclock current value */
 	__u32 flags;
-	__u32 pad[9];
+	__u32 pad0;
+	__u64 realtime;
+	__u64 host_tsc;
+	__u32 pad[4];
   };
 
 
@@ -1023,12 +1037,22 @@ Sets the current timestamp of kvmclock to the value specified in its parameter.
 In conjunction with KVM_GET_CLOCK, it is used to ensure monotonicity on scenarios
 such as migration.
 
+FLAGS:
+
+KVM_CLOCK_REALTIME.  If set, KVM will compare the value of the `realtime` field
+with the value of the host's real time clocksource at the instant when
+KVM_SET_CLOCK was called. The difference in elapsed time is added to the final
+kvmclock value that will be provided to guests.
+
 ::
 
   struct kvm_clock_data {
 	__u64 clock;  /* kvmclock current value */
 	__u32 flags;
-	__u32 pad[9];
+	__u32 pad0;
+	__u64 realtime;
+	__u64 host_tsc;
+	__u32 pad[4];
   };
 
 
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index be6805fc0260..9c34b5b63e39 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1936,4 +1936,7 @@ int kvm_cpu_dirty_log_size(void);
 
 int alloc_all_memslots_rmaps(struct kvm *kvm);
 
+#define KVM_CLOCK_VALID_FLAGS						\
+	(KVM_CLOCK_TSC_STABLE | KVM_CLOCK_REALTIME | KVM_CLOCK_HOST_TSC)
+
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 523c4e5c109f..cb5d5cad5124 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2815,10 +2815,20 @@ static void get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data)
 	get_cpu();
 
 	if (__this_cpu_read(cpu_tsc_khz)) {
+#ifdef CONFIG_X86_64
+		struct timespec64 ts;
+
+		if (kvm_get_walltime_and_clockread(&ts, &data->host_tsc)) {
+			data->realtime = ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec;
+			data->flags |= KVM_CLOCK_REALTIME | KVM_CLOCK_HOST_TSC;
+		} else
+#endif
+		data->host_tsc = rdtsc();
+
 		kvm_get_time_scale(NSEC_PER_SEC, __this_cpu_read(cpu_tsc_khz) * 1000LL,
 				   &hv_clock.tsc_shift,
 				   &hv_clock.tsc_to_system_mul);
-		data->clock = __pvclock_read_cycles(&hv_clock, rdtsc());
+		data->clock = __pvclock_read_cycles(&hv_clock, data->host_tsc);
 	} else {
 		data->clock = get_kvmclock_base_ns() + ka->kvmclock_offset;
 	}
@@ -4062,7 +4072,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		r = KVM_SYNC_X86_VALID_FIELDS;
 		break;
 	case KVM_CAP_ADJUST_CLOCK:
-		r = KVM_CLOCK_TSC_STABLE;
+		r = KVM_CLOCK_VALID_FLAGS;
 		break;
 	case KVM_CAP_X86_DISABLE_EXITS:
 		r |=  KVM_X86_DISABLE_EXITS_HLT | KVM_X86_DISABLE_EXITS_PAUSE |
@@ -5859,12 +5869,12 @@ static int kvm_vm_ioctl_set_clock(struct kvm *kvm, void __user *argp)
 {
 	struct kvm_arch *ka = &kvm->arch;
 	struct kvm_clock_data data;
-	u64 now_ns;
+	u64 now_raw_ns;
 
 	if (copy_from_user(&data, argp, sizeof(data)))
 		return -EFAULT;
 
-	if (data.flags)
+	if (data.flags & ~KVM_CLOCK_REALTIME)
 		return -EINVAL;
 
 	kvm_hv_invalidate_tsc_page(kvm);
@@ -5878,11 +5888,21 @@ static int kvm_vm_ioctl_set_clock(struct kvm *kvm, void __user *argp)
 	 * is slightly ahead) here we risk going negative on unsigned
 	 * 'system_time' when 'data.clock' is very small.
 	 */
-	if (kvm->arch.use_master_clock)
-		now_ns = ka->master_kernel_ns;
+	if (data.flags & KVM_CLOCK_REALTIME) {
+		u64 now_real_ns = ktime_get_real_ns();
+
+		/*
+		 * Avoid stepping the kvmclock backwards.
+		 */
+		if (now_real_ns > data.realtime)
+			data.clock += now_real_ns - data.realtime;
+	}
+
+	if (ka->use_master_clock)
+		now_raw_ns = ka->master_kernel_ns;
 	else
-		now_ns = get_kvmclock_base_ns();
-	ka->kvmclock_offset = data.clock - now_ns;
+		now_raw_ns = get_kvmclock_base_ns();
+	ka->kvmclock_offset = data.clock - now_raw_ns;
 	kvm_end_pvclock_update(kvm);
 	return 0;
 }
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index a067410ebea5..d228bf394465 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1223,11 +1223,16 @@ struct kvm_irqfd {
 
 /* Do not use 1, KVM_CHECK_EXTENSION returned it before we had flags.  */
 #define KVM_CLOCK_TSC_STABLE		2
+#define KVM_CLOCK_REALTIME		(1 << 2)
+#define KVM_CLOCK_HOST_TSC		(1 << 3)
 
 struct kvm_clock_data {
 	__u64 clock;
 	__u32 flags;
-	__u32 pad[9];
+	__u32 pad0;
+	__u64 realtime;
+	__u64 host_tsc;
+	__u32 pad[4];
 };
 
 /* For KVM_CAP_SW_TLB */
-- 
2.33.0.309.g3052b89438-goog


* [PATCH v8 5/7] kvm: x86: protect masterclock with a seqcount
  2021-09-16 18:15 ` Oliver Upton
@ 2021-09-16 18:15   ` Oliver Upton
  -1 siblings, 0 replies; 113+ messages in thread
From: Oliver Upton @ 2021-09-16 18:15 UTC (permalink / raw)
  To: kvm, kvmarm
  Cc: Catalin Marinas, Will Deacon, Marc Zyngier, Peter Shier,
	Sean Christopherson, David Matlack, Paolo Bonzini,
	linux-arm-kernel, Jim Mattson

From: Paolo Bonzini <pbonzini@redhat.com>

Protect the reference point for kvmclock with a seqcount, so that
kvmclock updates for all vCPUs can proceed in parallel.  Xen runstate
updates will also run in parallel and not bounce the kvmclock cacheline.

nr_vcpus_matched_tsc is updated outside pvclock_update_vm_gtod_copy
though, so a spinlock must be kept for that one.
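
The resulting locking shape, as a self-contained sketch (illustrative only;
the struct and function names below are stand-ins for tsc_write_lock and
pvclock_sc, not the actual fields added to struct kvm_arch):

#include <linux/types.h>
#include <linux/spinlock.h>
#include <linux/seqlock.h>

struct clock_ref {
	raw_spinlock_t lock;         /* plays the role of tsc_write_lock */
	seqcount_raw_spinlock_t sc;  /* plays the role of pvclock_sc */
	u64 master_kernel_ns;
	u64 master_cycle_now;
};

static void clock_ref_init(struct clock_ref *ref)
{
	raw_spin_lock_init(&ref->lock);
	seqcount_raw_spinlock_init(&ref->sc, &ref->lock);
}

/* Writer: still serialized by the raw spinlock, bumps the seqcount. */
static void clock_ref_update(struct clock_ref *ref, u64 ns, u64 cycles)
{
	raw_spin_lock_irq(&ref->lock);
	write_seqcount_begin(&ref->sc);
	ref->master_kernel_ns = ns;
	ref->master_cycle_now = cycles;
	write_seqcount_end(&ref->sc);
	raw_spin_unlock_irq(&ref->lock);
}

/* Reader: lock-free snapshot, retried only if a writer raced with it. */
static void clock_ref_read(struct clock_ref *ref, u64 *ns, u64 *cycles)
{
	unsigned int seq;

	do {
		seq = read_seqcount_begin(&ref->sc);
		*ns = ref->master_kernel_ns;
		*cycles = ref->master_cycle_now;
	} while (read_seqcount_retry(&ref->sc, seq));
}

Readers take no lock and do not dirty the shared cacheline on the fast path;
they only retry if a writer was updating the reference point concurrently.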

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[Oliver - drop unused locals, don't double acquire tsc_write_lock]
Signed-off-by: Oliver Upton <oupton@google.com>
---
 arch/x86/include/asm/kvm_host.h |  7 ++-
 arch/x86/kvm/x86.c              | 83 +++++++++++++++++----------------
 2 files changed, 49 insertions(+), 41 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9c34b5b63e39..5accfe7246ce 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1087,6 +1087,11 @@ struct kvm_arch {
 
 	unsigned long irq_sources_bitmap;
 	s64 kvmclock_offset;
+
+	/*
+	 * This also protects nr_vcpus_matched_tsc which is read from a
+	 * preemption-disabled region, so it must be a raw spinlock.
+	 */
 	raw_spinlock_t tsc_write_lock;
 	u64 last_tsc_nsec;
 	u64 last_tsc_write;
@@ -1097,7 +1102,7 @@ struct kvm_arch {
 	u64 cur_tsc_generation;
 	int nr_vcpus_matched_tsc;
 
-	spinlock_t pvclock_gtod_sync_lock;
+	seqcount_raw_spinlock_t pvclock_sc;
 	bool use_master_clock;
 	u64 master_kernel_ns;
 	u64 master_cycle_now;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cb5d5cad5124..29156c49cd11 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2533,9 +2533,7 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
 	vcpu->arch.this_tsc_write = kvm->arch.cur_tsc_write;
 
 	kvm_vcpu_write_tsc_offset(vcpu, offset);
-	raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
 
-	spin_lock_irqsave(&kvm->arch.pvclock_gtod_sync_lock, flags);
 	if (!matched) {
 		kvm->arch.nr_vcpus_matched_tsc = 0;
 	} else if (!already_matched) {
@@ -2543,7 +2541,7 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
 	}
 
 	kvm_track_tsc_matching(vcpu);
-	spin_unlock_irqrestore(&kvm->arch.pvclock_gtod_sync_lock, flags);
+	raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
 }
 
 static inline void adjust_tsc_offset_guest(struct kvm_vcpu *vcpu,
@@ -2731,9 +2729,6 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
 	int vclock_mode;
 	bool host_tsc_clocksource, vcpus_matched;
 
-	vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 ==
-			atomic_read(&kvm->online_vcpus));
-
 	/*
 	 * If the host uses TSC clock, then passthrough TSC as stable
 	 * to the guest.
@@ -2742,6 +2737,10 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
 					&ka->master_kernel_ns,
 					&ka->master_cycle_now);
 
+	lockdep_assert_held(&kvm->arch.tsc_write_lock);
+	vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 ==
+			atomic_read(&kvm->online_vcpus));
+
 	ka->use_master_clock = host_tsc_clocksource && vcpus_matched
 				&& !ka->backwards_tsc_observed
 				&& !ka->boot_vcpu_runs_old_kvmclock;
@@ -2760,14 +2759,18 @@ static void kvm_make_mclock_inprogress_request(struct kvm *kvm)
 	kvm_make_all_cpus_request(kvm, KVM_REQ_MCLOCK_INPROGRESS);
 }
 
-static void kvm_start_pvclock_update(struct kvm *kvm)
+static void __kvm_start_pvclock_update(struct kvm *kvm)
 {
-	struct kvm_arch *ka = &kvm->arch;
+	raw_spin_lock_irq(&kvm->arch.tsc_write_lock);
+	write_seqcount_begin(&kvm->arch.pvclock_sc);
+}
 
+static void kvm_start_pvclock_update(struct kvm *kvm)
+{
 	kvm_make_mclock_inprogress_request(kvm);
 
 	/* no guest entries from this point */
-	spin_lock_irq(&ka->pvclock_gtod_sync_lock);
+	__kvm_start_pvclock_update(kvm);
 }
 
 static void kvm_end_pvclock_update(struct kvm *kvm)
@@ -2776,7 +2779,8 @@ static void kvm_end_pvclock_update(struct kvm *kvm)
 	struct kvm_vcpu *vcpu;
 	int i;
 
-	spin_unlock_irq(&ka->pvclock_gtod_sync_lock);
+	write_seqcount_end(&ka->pvclock_sc);
+	raw_spin_unlock_irq(&ka->tsc_write_lock);
 	kvm_for_each_vcpu(i, vcpu, kvm)
 		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
 
@@ -2797,20 +2801,12 @@ static void get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data)
 {
 	struct kvm_arch *ka = &kvm->arch;
 	struct pvclock_vcpu_time_info hv_clock;
-	unsigned long flags;
 
-	spin_lock_irqsave(&ka->pvclock_gtod_sync_lock, flags);
 	if (!ka->use_master_clock) {
-		spin_unlock_irqrestore(&ka->pvclock_gtod_sync_lock, flags);
 		data->clock = get_kvmclock_base_ns() + ka->kvmclock_offset;
 		return;
 	}
 
-	data->flags |= KVM_CLOCK_TSC_STABLE;
-	hv_clock.tsc_timestamp = ka->master_cycle_now;
-	hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
-	spin_unlock_irqrestore(&ka->pvclock_gtod_sync_lock, flags);
-
 	/* both __this_cpu_read() and rdtsc() should be on the same cpu */
 	get_cpu();
 
@@ -2825,6 +2821,9 @@ static void get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data)
 #endif
 		data->host_tsc = rdtsc();
 
+		data->flags |= KVM_CLOCK_TSC_STABLE;
+		hv_clock.tsc_timestamp = ka->master_cycle_now;
+		hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
 		kvm_get_time_scale(NSEC_PER_SEC, __this_cpu_read(cpu_tsc_khz) * 1000LL,
 				   &hv_clock.tsc_shift,
 				   &hv_clock.tsc_to_system_mul);
@@ -2839,14 +2838,14 @@ static void get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data)
 u64 get_kvmclock_ns(struct kvm *kvm)
 {
 	struct kvm_clock_data data;
+	struct kvm_arch *ka = &kvm->arch;
+	unsigned seq;
 
-	/*
-	 * Zero flags as it's accessed RMW, leave everything else uninitialized
-	 * as clock is always written and no other fields are consumed.
-	 */
-	data.flags = 0;
-
-	get_kvmclock(kvm, &data);
+	do {
+		seq = read_seqcount_begin(&ka->pvclock_sc);
+		data.flags = 0;
+		get_kvmclock(kvm, &data);
+	} while (read_seqcount_retry(&ka->pvclock_sc, seq));
 	return data.clock;
 }
 
@@ -2912,6 +2911,7 @@ static void kvm_setup_pvclock_page(struct kvm_vcpu *v,
 static int kvm_guest_time_update(struct kvm_vcpu *v)
 {
 	unsigned long flags, tgt_tsc_khz;
+	unsigned seq;
 	struct kvm_vcpu_arch *vcpu = &v->arch;
 	struct kvm_arch *ka = &v->kvm->arch;
 	s64 kernel_ns;
@@ -2926,13 +2926,14 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 	 * If the host uses TSC clock, then passthrough TSC as stable
 	 * to the guest.
 	 */
-	spin_lock_irqsave(&ka->pvclock_gtod_sync_lock, flags);
-	use_master_clock = ka->use_master_clock;
-	if (use_master_clock) {
-		host_tsc = ka->master_cycle_now;
-		kernel_ns = ka->master_kernel_ns;
-	}
-	spin_unlock_irqrestore(&ka->pvclock_gtod_sync_lock, flags);
+	seq = read_seqcount_begin(&ka->pvclock_sc);
+	do {
+		use_master_clock = ka->use_master_clock;
+		if (use_master_clock) {
+			host_tsc = ka->master_cycle_now;
+			kernel_ns = ka->master_kernel_ns;
+		}
+	} while (read_seqcount_retry(&ka->pvclock_sc, seq));
 
 	/* Keep irq disabled to prevent changes to the clock */
 	local_irq_save(flags);
@@ -5855,10 +5856,15 @@ int kvm_arch_pm_notifier(struct kvm *kvm, unsigned long state)
 
 static int kvm_vm_ioctl_get_clock(struct kvm *kvm, void __user *argp)
 {
-	struct kvm_clock_data data;
+	struct kvm_clock_data data = { 0 };
+	unsigned seq;
+
+	do {
+		seq = read_seqcount_begin(&kvm->arch.pvclock_sc);
+		data.flags = 0;
+		get_kvmclock(kvm, &data);
+	} while (read_seqcount_retry(&kvm->arch.pvclock_sc, seq));
 
-	memset(&data, 0, sizeof(data));
-	get_kvmclock(kvm, &data);
 	if (copy_to_user(argp, &data, sizeof(data)))
 		return -EFAULT;
 
@@ -8159,9 +8165,7 @@ static void kvm_hyperv_tsc_notifier(void)
 	kvm_max_guest_tsc_khz = tsc_khz;
 
 	list_for_each_entry(kvm, &vm_list, vm_list) {
-		struct kvm_arch *ka = &kvm->arch;
-
-		spin_lock_irq(&ka->pvclock_gtod_sync_lock);
+		__kvm_start_pvclock_update(kvm);
 		pvclock_update_vm_gtod_copy(kvm);
 		kvm_end_pvclock_update(kvm);
 	}
@@ -11188,8 +11192,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 
 	raw_spin_lock_init(&kvm->arch.tsc_write_lock);
 	mutex_init(&kvm->arch.apic_map_lock);
-	spin_lock_init(&kvm->arch.pvclock_gtod_sync_lock);
-
+	seqcount_raw_spinlock_init(&kvm->arch.pvclock_sc, &kvm->arch.tsc_write_lock);
 	kvm->arch.kvmclock_offset = -get_kvmclock_base_ns();
 	pvclock_update_vm_gtod_copy(kvm);
 
-- 
2.33.0.309.g3052b89438-goog


* [PATCH v8 6/7] KVM: x86: Refactor tsc synchronization code
  2021-09-16 18:15 ` Oliver Upton
@ 2021-09-16 18:15   ` Oliver Upton
  -1 siblings, 0 replies; 113+ messages in thread
From: Oliver Upton @ 2021-09-16 18:15 UTC (permalink / raw)
  To: kvm, kvmarm
  Cc: Catalin Marinas, Will Deacon, Marc Zyngier, Peter Shier,
	Sean Christopherson, David Matlack, Paolo Bonzini,
	linux-arm-kernel, Jim Mattson

Refactor kvm_synchronize_tsc to make a new function that allows callers
to specify TSC parameters (offset, value, nanoseconds, etc.) explicitly
for the sake of participating in TSC synchronization.
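
For a sense of the intended caller shape, a hypothetical sketch of an
offset-based user of the new helper (roughly what the vCPU TSC offset
attribute later in this series needs; kvm_scale_tsc() and
get_kvmclock_base_ns() are existing x86.c helpers, while the function name
and the 'matched' heuristic below are assumptions, not something this patch
defines):

static void example_set_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
{
	struct kvm *kvm = vcpu->kvm;
	unsigned long flags;
	u64 tsc, ns;
	bool matched;

	raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);

	/* Guest TSC value that this offset corresponds to right now. */
	tsc = kvm_scale_tsc(vcpu, rdtsc(), vcpu->arch.l1_tsc_scaling_ratio) + offset;
	ns = get_kvmclock_base_ns();

	/* Assumed heuristic: an unchanged offset at the same frequency matches. */
	matched = (vcpu->arch.virtual_tsc_khz &&
		   kvm->arch.last_tsc_khz == vcpu->arch.virtual_tsc_khz &&
		   kvm->arch.cur_tsc_offset == offset);

	__kvm_synchronize_tsc(vcpu, offset, tsc, ns, matched);
	raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
}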

Signed-off-by: Oliver Upton <oupton@google.com>
---
 arch/x86/kvm/x86.c | 100 ++++++++++++++++++++++++++-------------------
 1 file changed, 58 insertions(+), 42 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 29156c49cd11..1ea65bb2e74d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2447,13 +2447,68 @@ static inline bool kvm_check_tsc_unstable(void)
 	return check_tsc_unstable();
 }
 
+/*
+ * Infers attempts to synchronize the guest's tsc from host writes. Sets the
+ * offset for the vcpu and tracks the TSC matching generation that the vcpu
+ * participates in.
+ */
+static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
+				  u64 ns, bool matched)
+{
+	struct kvm *kvm = vcpu->kvm;
+	bool already_matched;
+
+	lockdep_assert_held(&kvm->arch.tsc_write_lock);
+
+	already_matched =
+	       (vcpu->arch.this_tsc_generation == kvm->arch.cur_tsc_generation);
+
+	/*
+	 * We also track the most recent recorded KHZ, write and time to
+	 * allow the matching interval to be extended at each write.
+	 */
+	kvm->arch.last_tsc_nsec = ns;
+	kvm->arch.last_tsc_write = tsc;
+	kvm->arch.last_tsc_khz = vcpu->arch.virtual_tsc_khz;
+
+	vcpu->arch.last_guest_tsc = tsc;
+
+	/* Keep track of which generation this VCPU has synchronized to */
+	vcpu->arch.this_tsc_generation = kvm->arch.cur_tsc_generation;
+	vcpu->arch.this_tsc_nsec = kvm->arch.cur_tsc_nsec;
+	vcpu->arch.this_tsc_write = kvm->arch.cur_tsc_write;
+
+	kvm_vcpu_write_tsc_offset(vcpu, offset);
+
+	if (!matched) {
+		/*
+		 * We split periods of matched TSC writes into generations.
+		 * For each generation, we track the original measured
+		 * nanosecond time, offset, and write, so if TSCs are in
+		 * sync, we can match exact offset, and if not, we can match
+		 * exact software computation in compute_guest_tsc()
+		 *
+		 * These values are tracked in kvm->arch.cur_xxx variables.
+		 */
+		kvm->arch.cur_tsc_generation++;
+		kvm->arch.cur_tsc_nsec = ns;
+		kvm->arch.cur_tsc_write = tsc;
+		kvm->arch.cur_tsc_offset = offset;
+
+		kvm->arch.nr_vcpus_matched_tsc = 0;
+	} else if (!already_matched) {
+		kvm->arch.nr_vcpus_matched_tsc++;
+	}
+
+	kvm_track_tsc_matching(vcpu);
+}
+
 static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
 {
 	struct kvm *kvm = vcpu->kvm;
 	u64 offset, ns, elapsed;
 	unsigned long flags;
-	bool matched;
-	bool already_matched;
+	bool matched = false;
 	bool synchronizing = false;
 
 	raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
@@ -2499,48 +2554,9 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
 			offset = kvm_compute_l1_tsc_offset(vcpu, data);
 		}
 		matched = true;
-		already_matched = (vcpu->arch.this_tsc_generation == kvm->arch.cur_tsc_generation);
-	} else {
-		/*
-		 * We split periods of matched TSC writes into generations.
-		 * For each generation, we track the original measured
-		 * nanosecond time, offset, and write, so if TSCs are in
-		 * sync, we can match exact offset, and if not, we can match
-		 * exact software computation in compute_guest_tsc()
-		 *
-		 * These values are tracked in kvm->arch.cur_xxx variables.
-		 */
-		kvm->arch.cur_tsc_generation++;
-		kvm->arch.cur_tsc_nsec = ns;
-		kvm->arch.cur_tsc_write = data;
-		kvm->arch.cur_tsc_offset = offset;
-		matched = false;
 	}
 
-	/*
-	 * We also track th most recent recorded KHZ, write and time to
-	 * allow the matching interval to be extended at each write.
-	 */
-	kvm->arch.last_tsc_nsec = ns;
-	kvm->arch.last_tsc_write = data;
-	kvm->arch.last_tsc_khz = vcpu->arch.virtual_tsc_khz;
-
-	vcpu->arch.last_guest_tsc = data;
-
-	/* Keep track of which generation this VCPU has synchronized to */
-	vcpu->arch.this_tsc_generation = kvm->arch.cur_tsc_generation;
-	vcpu->arch.this_tsc_nsec = kvm->arch.cur_tsc_nsec;
-	vcpu->arch.this_tsc_write = kvm->arch.cur_tsc_write;
-
-	kvm_vcpu_write_tsc_offset(vcpu, offset);
-
-	if (!matched) {
-		kvm->arch.nr_vcpus_matched_tsc = 0;
-	} else if (!already_matched) {
-		kvm->arch.nr_vcpus_matched_tsc++;
-	}
-
-	kvm_track_tsc_matching(vcpu);
+	__kvm_synchronize_tsc(vcpu, offset, data, ns, matched);
 	raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
 }
 
-- 
2.33.0.309.g3052b89438-goog


* [PATCH v8 6/7] KVM: x86: Refactor tsc synchronization code
@ 2021-09-16 18:15   ` Oliver Upton
  0 siblings, 0 replies; 113+ messages in thread
From: Oliver Upton @ 2021-09-16 18:15 UTC (permalink / raw)
  To: kvm, kvmarm
  Cc: Paolo Bonzini, Sean Christopherson, Marc Zyngier, Peter Shier,
	Jim Mattson, David Matlack, Ricardo Koller, Jing Zhang,
	Raghavendra Rao Anata, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Andrew Jones, Will Deacon,
	Catalin Marinas, Oliver Upton

Refactor kvm_synchronize_tsc to make a new function that allows callers
to specify TSC parameters (offset, value, nanoseconds, etc.) explicitly
for the sake of participating in TSC synchronization.

Signed-off-by: Oliver Upton <oupton@google.com>
---
 arch/x86/kvm/x86.c | 100 ++++++++++++++++++++++++++-------------------
 1 file changed, 58 insertions(+), 42 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 29156c49cd11..1ea65bb2e74d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2447,13 +2447,68 @@ static inline bool kvm_check_tsc_unstable(void)
 	return check_tsc_unstable();
 }
 
+/*
+ * Infers attempts to synchronize the guest's tsc from host writes. Sets the
+ * offset for the vcpu and tracks the TSC matching generation that the vcpu
+ * participates in.
+ */
+static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
+				  u64 ns, bool matched)
+{
+	struct kvm *kvm = vcpu->kvm;
+	bool already_matched;
+
+	lockdep_assert_held(&kvm->arch.tsc_write_lock);
+
+	already_matched =
+	       (vcpu->arch.this_tsc_generation == kvm->arch.cur_tsc_generation);
+
+	/*
+	 * We also track th most recent recorded KHZ, write and time to
+	 * allow the matching interval to be extended at each write.
+	 */
+	kvm->arch.last_tsc_nsec = ns;
+	kvm->arch.last_tsc_write = tsc;
+	kvm->arch.last_tsc_khz = vcpu->arch.virtual_tsc_khz;
+
+	vcpu->arch.last_guest_tsc = tsc;
+
+	/* Keep track of which generation this VCPU has synchronized to */
+	vcpu->arch.this_tsc_generation = kvm->arch.cur_tsc_generation;
+	vcpu->arch.this_tsc_nsec = kvm->arch.cur_tsc_nsec;
+	vcpu->arch.this_tsc_write = kvm->arch.cur_tsc_write;
+
+	kvm_vcpu_write_tsc_offset(vcpu, offset);
+
+	if (!matched) {
+		/*
+		 * We split periods of matched TSC writes into generations.
+		 * For each generation, we track the original measured
+		 * nanosecond time, offset, and write, so if TSCs are in
+		 * sync, we can match exact offset, and if not, we can match
+		 * exact software computation in compute_guest_tsc()
+		 *
+		 * These values are tracked in kvm->arch.cur_xxx variables.
+		 */
+		kvm->arch.cur_tsc_generation++;
+		kvm->arch.cur_tsc_nsec = ns;
+		kvm->arch.cur_tsc_write = tsc;
+		kvm->arch.cur_tsc_offset = offset;
+
+		kvm->arch.nr_vcpus_matched_tsc = 0;
+	} else if (!already_matched) {
+		kvm->arch.nr_vcpus_matched_tsc++;
+	}
+
+	kvm_track_tsc_matching(vcpu);
+}
+
 static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
 {
 	struct kvm *kvm = vcpu->kvm;
 	u64 offset, ns, elapsed;
 	unsigned long flags;
-	bool matched;
-	bool already_matched;
+	bool matched = false;
 	bool synchronizing = false;
 
 	raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
@@ -2499,48 +2554,9 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
 			offset = kvm_compute_l1_tsc_offset(vcpu, data);
 		}
 		matched = true;
-		already_matched = (vcpu->arch.this_tsc_generation == kvm->arch.cur_tsc_generation);
-	} else {
-		/*
-		 * We split periods of matched TSC writes into generations.
-		 * For each generation, we track the original measured
-		 * nanosecond time, offset, and write, so if TSCs are in
-		 * sync, we can match exact offset, and if not, we can match
-		 * exact software computation in compute_guest_tsc()
-		 *
-		 * These values are tracked in kvm->arch.cur_xxx variables.
-		 */
-		kvm->arch.cur_tsc_generation++;
-		kvm->arch.cur_tsc_nsec = ns;
-		kvm->arch.cur_tsc_write = data;
-		kvm->arch.cur_tsc_offset = offset;
-		matched = false;
 	}
 
-	/*
-	 * We also track th most recent recorded KHZ, write and time to
-	 * allow the matching interval to be extended at each write.
-	 */
-	kvm->arch.last_tsc_nsec = ns;
-	kvm->arch.last_tsc_write = data;
-	kvm->arch.last_tsc_khz = vcpu->arch.virtual_tsc_khz;
-
-	vcpu->arch.last_guest_tsc = data;
-
-	/* Keep track of which generation this VCPU has synchronized to */
-	vcpu->arch.this_tsc_generation = kvm->arch.cur_tsc_generation;
-	vcpu->arch.this_tsc_nsec = kvm->arch.cur_tsc_nsec;
-	vcpu->arch.this_tsc_write = kvm->arch.cur_tsc_write;
-
-	kvm_vcpu_write_tsc_offset(vcpu, offset);
-
-	if (!matched) {
-		kvm->arch.nr_vcpus_matched_tsc = 0;
-	} else if (!already_matched) {
-		kvm->arch.nr_vcpus_matched_tsc++;
-	}
-
-	kvm_track_tsc_matching(vcpu);
+	__kvm_synchronize_tsc(vcpu, offset, data, ns, matched);
 	raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
 }
 
-- 
2.33.0.309.g3052b89438-goog
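
In outline, the calling convention this refactor introduces looks as
follows -- a sketch only, with placeholder values that a real caller
(such as the attribute handler added in the next patch) computes before
taking the lock:

	u64 offset, tsc, ns;
	unsigned long flags;
	bool matched;

	/* ... caller derives offset, tsc, ns and matched ... */

	raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
	/* __kvm_synchronize_tsc() lockdep-asserts that tsc_write_lock is held */
	__kvm_synchronize_tsc(vcpu, offset, tsc, ns, matched);
	raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);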



^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [PATCH v8 6/7] KVM: x86: Refactor tsc synchronization code
@ 2021-09-16 18:15   ` Oliver Upton
  0 siblings, 0 replies; 113+ messages in thread
From: Oliver Upton @ 2021-09-16 18:15 UTC (permalink / raw)
  To: kvm, kvmarm
  Cc: Paolo Bonzini, Sean Christopherson, Marc Zyngier, Peter Shier,
	Jim Mattson, David Matlack, Ricardo Koller, Jing Zhang,
	Raghavendra Rao Anata, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Andrew Jones, Will Deacon,
	Catalin Marinas, Oliver Upton

Refactor kvm_synchronize_tsc, extracting a new function that allows
callers to specify the TSC parameters (offset, value, nanoseconds, etc.)
explicitly when participating in TSC synchronization.

Signed-off-by: Oliver Upton <oupton@google.com>
---
 arch/x86/kvm/x86.c | 100 ++++++++++++++++++++++++++-------------------
 1 file changed, 58 insertions(+), 42 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 29156c49cd11..1ea65bb2e74d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2447,13 +2447,68 @@ static inline bool kvm_check_tsc_unstable(void)
 	return check_tsc_unstable();
 }
 
+/*
+ * Infers attempts to synchronize the guest's tsc from host writes. Sets the
+ * offset for the vcpu and tracks the TSC matching generation that the vcpu
+ * participates in.
+ */
+static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
+				  u64 ns, bool matched)
+{
+	struct kvm *kvm = vcpu->kvm;
+	bool already_matched;
+
+	lockdep_assert_held(&kvm->arch.tsc_write_lock);
+
+	already_matched =
+	       (vcpu->arch.this_tsc_generation == kvm->arch.cur_tsc_generation);
+
+	/*
+	 * We also track the most recent recorded KHZ, write and time to
+	 * allow the matching interval to be extended at each write.
+	 */
+	kvm->arch.last_tsc_nsec = ns;
+	kvm->arch.last_tsc_write = tsc;
+	kvm->arch.last_tsc_khz = vcpu->arch.virtual_tsc_khz;
+
+	vcpu->arch.last_guest_tsc = tsc;
+
+	/* Keep track of which generation this VCPU has synchronized to */
+	vcpu->arch.this_tsc_generation = kvm->arch.cur_tsc_generation;
+	vcpu->arch.this_tsc_nsec = kvm->arch.cur_tsc_nsec;
+	vcpu->arch.this_tsc_write = kvm->arch.cur_tsc_write;
+
+	kvm_vcpu_write_tsc_offset(vcpu, offset);
+
+	if (!matched) {
+		/*
+		 * We split periods of matched TSC writes into generations.
+		 * For each generation, we track the original measured
+		 * nanosecond time, offset, and write, so if TSCs are in
+		 * sync, we can match exact offset, and if not, we can match
+		 * exact software computation in compute_guest_tsc()
+		 *
+		 * These values are tracked in kvm->arch.cur_xxx variables.
+		 */
+		kvm->arch.cur_tsc_generation++;
+		kvm->arch.cur_tsc_nsec = ns;
+		kvm->arch.cur_tsc_write = tsc;
+		kvm->arch.cur_tsc_offset = offset;
+
+		kvm->arch.nr_vcpus_matched_tsc = 0;
+	} else if (!already_matched) {
+		kvm->arch.nr_vcpus_matched_tsc++;
+	}
+
+	kvm_track_tsc_matching(vcpu);
+}
+
 static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
 {
 	struct kvm *kvm = vcpu->kvm;
 	u64 offset, ns, elapsed;
 	unsigned long flags;
-	bool matched;
-	bool already_matched;
+	bool matched = false;
 	bool synchronizing = false;
 
 	raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
@@ -2499,48 +2554,9 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
 			offset = kvm_compute_l1_tsc_offset(vcpu, data);
 		}
 		matched = true;
-		already_matched = (vcpu->arch.this_tsc_generation == kvm->arch.cur_tsc_generation);
-	} else {
-		/*
-		 * We split periods of matched TSC writes into generations.
-		 * For each generation, we track the original measured
-		 * nanosecond time, offset, and write, so if TSCs are in
-		 * sync, we can match exact offset, and if not, we can match
-		 * exact software computation in compute_guest_tsc()
-		 *
-		 * These values are tracked in kvm->arch.cur_xxx variables.
-		 */
-		kvm->arch.cur_tsc_generation++;
-		kvm->arch.cur_tsc_nsec = ns;
-		kvm->arch.cur_tsc_write = data;
-		kvm->arch.cur_tsc_offset = offset;
-		matched = false;
 	}
 
-	/*
-	 * We also track th most recent recorded KHZ, write and time to
-	 * allow the matching interval to be extended at each write.
-	 */
-	kvm->arch.last_tsc_nsec = ns;
-	kvm->arch.last_tsc_write = data;
-	kvm->arch.last_tsc_khz = vcpu->arch.virtual_tsc_khz;
-
-	vcpu->arch.last_guest_tsc = data;
-
-	/* Keep track of which generation this VCPU has synchronized to */
-	vcpu->arch.this_tsc_generation = kvm->arch.cur_tsc_generation;
-	vcpu->arch.this_tsc_nsec = kvm->arch.cur_tsc_nsec;
-	vcpu->arch.this_tsc_write = kvm->arch.cur_tsc_write;
-
-	kvm_vcpu_write_tsc_offset(vcpu, offset);
-
-	if (!matched) {
-		kvm->arch.nr_vcpus_matched_tsc = 0;
-	} else if (!already_matched) {
-		kvm->arch.nr_vcpus_matched_tsc++;
-	}
-
-	kvm_track_tsc_matching(vcpu);
+	__kvm_synchronize_tsc(vcpu, offset, data, ns, matched);
 	raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
 }
 
-- 
2.33.0.309.g3052b89438-goog


^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [PATCH v8 7/7] KVM: x86: Expose TSC offset controls to userspace
  2021-09-16 18:15 ` Oliver Upton
@ 2021-09-16 18:15   ` Oliver Upton
  -1 siblings, 0 replies; 113+ messages in thread
From: Oliver Upton @ 2021-09-16 18:15 UTC (permalink / raw)
  To: kvm, kvmarm
  Cc: Catalin Marinas, Will Deacon, Marc Zyngier, Peter Shier,
	Sean Christopherson, David Matlack, Paolo Bonzini,
	linux-arm-kernel, Jim Mattson

To date, VMM-directed TSC synchronization and migration have been a bit
messy. KVM has some baked-in heuristics around TSC writes to infer if
the VMM is attempting to synchronize. This is problematic, as it depends
on host userspace writing to the guest's TSC within 1 second of the last
write.

A much cleaner approach to configuring the guest's views of the TSC is to
simply migrate the TSC offset for every vCPU. Offsets are idempotent,
and thus not subject to change depending on when the VMM actually
reads/writes values from/to KVM. The VMM can then read the TSC once with
KVM_GET_CLOCK to capture a (realtime, host_tsc) pair at the instant when
the guest is paused.

Cc: David Matlack <dmatlack@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Oliver Upton <oupton@google.com>
---
 Documentation/virt/kvm/devices/vcpu.rst |  57 ++++++++++++
 arch/x86/include/asm/kvm_host.h         |   1 +
 arch/x86/include/uapi/asm/kvm.h         |   4 +
 arch/x86/kvm/x86.c                      | 110 ++++++++++++++++++++++++
 4 files changed, 172 insertions(+)

diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
index 2acec3b9ef65..3b399d727c11 100644
--- a/Documentation/virt/kvm/devices/vcpu.rst
+++ b/Documentation/virt/kvm/devices/vcpu.rst
@@ -161,3 +161,60 @@ Specifies the base address of the stolen time structure for this VCPU. The
 base address must be 64 byte aligned and exist within a valid guest memory
 region. See Documentation/virt/kvm/arm/pvtime.rst for more information
 including the layout of the stolen time structure.
+
+4. GROUP: KVM_VCPU_TSC_CTRL
+===========================
+
+:Architectures: x86
+
+4.1 ATTRIBUTE: KVM_VCPU_TSC_OFFSET
+
+:Parameters: 64-bit unsigned TSC offset
+
+Returns:
+
+	 ======= ======================================
+	 -EFAULT Error reading/writing the provided
+		 parameter address.
+	 -ENXIO  Attribute not supported
+	 ======= ======================================
+
+Specifies the guest's TSC offset relative to the host's TSC. The guest's
+TSC is then derived by the following equation:
+
+  guest_tsc = host_tsc + KVM_VCPU_TSC_OFFSET
+
+This attribute is useful for the precise migration of a guest's TSC. The
+following describes a possible algorithm to use for the migration of a
+guest's TSC:
+
+From the source VMM process:
+
+1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (t_0),
+   kvmclock nanoseconds (k_0), and realtime nanoseconds (r_0).
+
+2. Read the KVM_VCPU_TSC_OFFSET attribute for every vCPU to record the
+   guest TSC offset (off_n).
+
+3. Invoke the KVM_GET_TSC_KHZ ioctl to record the frequency of the
+   guest's TSC (freq).
+
+From the destination VMM process:
+
+4. Invoke the KVM_SET_CLOCK ioctl, providing the kvmclock nanoseconds
+   (k_0) and realtime nanoseconds (r_0) in their respective fields.
+   Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
+   structure. KVM will advance the VM's kvmclock to account for elapsed
+   time since recording the clock values.
+
+5. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (t_1) and
+   kvmclock nanoseconds (k_1).
+
+6. Adjust the guest TSC offsets for every vCPU to account for (1) time
+   elapsed since recording state and (2) difference in TSCs between the
+   source and destination machine:
+
+   new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1
+
+7. Write the KVM_VCPU_TSC_OFFSET attribute for every vCPU with the
+   respective value derived in the previous step.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5accfe7246ce..09c678f2e616 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1096,6 +1096,7 @@ struct kvm_arch {
 	u64 last_tsc_nsec;
 	u64 last_tsc_write;
 	u32 last_tsc_khz;
+	u64 last_tsc_offset;
 	u64 cur_tsc_nsec;
 	u64 cur_tsc_write;
 	u64 cur_tsc_offset;
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 2ef1f6513c68..5a776a08f78c 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -504,4 +504,8 @@ struct kvm_pmu_event_filter {
 #define KVM_PMU_EVENT_ALLOW 0
 #define KVM_PMU_EVENT_DENY 1
 
+/* for KVM_{GET,SET,HAS}_DEVICE_ATTR */
+#define KVM_VCPU_TSC_CTRL 0 /* control group for the timestamp counter (TSC) */
+#define   KVM_VCPU_TSC_OFFSET 0 /* attribute for the TSC offset */
+
 #endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1ea65bb2e74d..1177604c805a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2470,6 +2470,7 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
 	kvm->arch.last_tsc_nsec = ns;
 	kvm->arch.last_tsc_write = tsc;
 	kvm->arch.last_tsc_khz = vcpu->arch.virtual_tsc_khz;
+	kvm->arch.last_tsc_offset = offset;
 
 	vcpu->arch.last_guest_tsc = tsc;
 
@@ -4069,6 +4070,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_VM_COPY_ENC_CONTEXT_FROM:
 	case KVM_CAP_SREGS2:
 	case KVM_CAP_EXIT_ON_EMULATION_FAILURE:
+	case KVM_CAP_VCPU_ATTRIBUTES:
 		r = 1;
 		break;
 	case KVM_CAP_EXIT_HYPERCALL:
@@ -4933,6 +4935,109 @@ static int kvm_set_guest_paused(struct kvm_vcpu *vcpu)
 	return 0;
 }
 
+static int kvm_arch_tsc_has_attr(struct kvm_vcpu *vcpu,
+				 struct kvm_device_attr *attr)
+{
+	int r;
+
+	switch (attr->attr) {
+	case KVM_VCPU_TSC_OFFSET:
+		r = 0;
+		break;
+	default:
+		r = -ENXIO;
+	}
+
+	return r;
+}
+
+static int kvm_arch_tsc_get_attr(struct kvm_vcpu *vcpu,
+				 struct kvm_device_attr *attr)
+{
+	u64 __user *uaddr = (u64 __user *)attr->addr;
+	int r;
+
+	switch (attr->attr) {
+	case KVM_VCPU_TSC_OFFSET:
+		r = -EFAULT;
+		if (put_user(vcpu->arch.l1_tsc_offset, uaddr))
+			break;
+		r = 0;
+		break;
+	default:
+		r = -ENXIO;
+	}
+
+	return r;
+}
+
+static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu,
+				 struct kvm_device_attr *attr)
+{
+	u64 __user *uaddr = (u64 __user *)attr->addr;
+	struct kvm *kvm = vcpu->kvm;
+	int r;
+
+	switch (attr->attr) {
+	case KVM_VCPU_TSC_OFFSET: {
+		u64 offset, tsc, ns;
+		unsigned long flags;
+		bool matched;
+
+		r = -EFAULT;
+		if (get_user(offset, uaddr))
+			break;
+
+		raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
+
+		matched = (vcpu->arch.virtual_tsc_khz &&
+			   kvm->arch.last_tsc_khz == vcpu->arch.virtual_tsc_khz &&
+			   kvm->arch.last_tsc_offset == offset);
+
+		tsc = kvm_scale_tsc(vcpu, rdtsc(), vcpu->arch.l1_tsc_scaling_ratio) + offset;
+		ns = get_kvmclock_base_ns();
+
+		__kvm_synchronize_tsc(vcpu, offset, tsc, ns, matched);
+		raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
+
+		r = 0;
+		break;
+	}
+	default:
+		r = -ENXIO;
+	}
+
+	return r;
+}
+
+static int kvm_vcpu_ioctl_device_attr(struct kvm_vcpu *vcpu,
+				      unsigned int ioctl,
+				      void __user *argp)
+{
+	struct kvm_device_attr attr;
+	int r;
+
+	if (copy_from_user(&attr, argp, sizeof(attr)))
+		return -EFAULT;
+
+	if (attr.group != KVM_VCPU_TSC_CTRL)
+		return -ENXIO;
+
+	switch (ioctl) {
+	case KVM_HAS_DEVICE_ATTR:
+		r = kvm_arch_tsc_has_attr(vcpu, &attr);
+		break;
+	case KVM_GET_DEVICE_ATTR:
+		r = kvm_arch_tsc_get_attr(vcpu, &attr);
+		break;
+	case KVM_SET_DEVICE_ATTR:
+		r = kvm_arch_tsc_set_attr(vcpu, &attr);
+		break;
+	}
+
+	return r;
+}
+
 static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
 				     struct kvm_enable_cap *cap)
 {
@@ -5387,6 +5492,11 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
 		r = __set_sregs2(vcpu, u.sregs2);
 		break;
 	}
+	case KVM_HAS_DEVICE_ATTR:
+	case KVM_GET_DEVICE_ATTR:
+	case KVM_SET_DEVICE_ATTR:
+		r = kvm_vcpu_ioctl_device_attr(vcpu, ioctl, argp);
+		break;
 	default:
 		r = -EINVAL;
 	}
-- 
2.33.0.309.g3052b89438-goog
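
As a usage illustration only (not part of the patch), a VMM would drive
the new attribute through the generic device-attr ioctls on the vCPU fd.
A minimal sketch, assuming uapi headers from a kernel with this series
applied:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Read the current L1 TSC offset of a vCPU. */
static int vcpu_get_tsc_offset(int vcpu_fd, uint64_t *offset)
{
	struct kvm_device_attr attr = {
		.group = KVM_VCPU_TSC_CTRL,
		.attr  = KVM_VCPU_TSC_OFFSET,
		.addr  = (uint64_t)(unsigned long)offset,
	};

	return ioctl(vcpu_fd, KVM_GET_DEVICE_ATTR, &attr);
}

/* Write a new offset, e.g. the new_off_n value from the algorithm above. */
static int vcpu_set_tsc_offset(int vcpu_fd, uint64_t offset)
{
	struct kvm_device_attr attr = {
		.group = KVM_VCPU_TSC_CTRL,
		.attr  = KVM_VCPU_TSC_OFFSET,
		.addr  = (uint64_t)(unsigned long)&offset,
	};

	return ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr);
}

Both helpers return 0 on success and -1 with errno set otherwise, per the
usual ioctl() convention.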


^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [PATCH v8 7/7] KVM: x86: Expose TSC offset controls to userspace
@ 2021-09-16 18:15   ` Oliver Upton
  0 siblings, 0 replies; 113+ messages in thread
From: Oliver Upton @ 2021-09-16 18:15 UTC (permalink / raw)
  To: kvm, kvmarm
  Cc: Paolo Bonzini, Sean Christopherson, Marc Zyngier, Peter Shier,
	Jim Mattson, David Matlack, Ricardo Koller, Jing Zhang,
	Raghavendra Rao Anata, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Andrew Jones, Will Deacon,
	Catalin Marinas, Oliver Upton

To date, VMM-directed TSC synchronization and migration have been a bit
messy. KVM has some baked-in heuristics around TSC writes to infer if
the VMM is attempting to synchronize. This is problematic, as it depends
on host userspace writing to the guest's TSC within 1 second of the last
write.

A much cleaner approach to configuring the guest's views of the TSC is to
simply migrate the TSC offset for every vCPU. Offsets are idempotent,
and thus not subject to change depending on when the VMM actually
reads/writes values from/to KVM. The VMM can then read the TSC once with
KVM_GET_CLOCK to capture a (realtime, host_tsc) pair at the instant when
the guest is paused.

Cc: David Matlack <dmatlack@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Oliver Upton <oupton@google.com>
---
 Documentation/virt/kvm/devices/vcpu.rst |  57 ++++++++++++
 arch/x86/include/asm/kvm_host.h         |   1 +
 arch/x86/include/uapi/asm/kvm.h         |   4 +
 arch/x86/kvm/x86.c                      | 110 ++++++++++++++++++++++++
 4 files changed, 172 insertions(+)

diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
index 2acec3b9ef65..3b399d727c11 100644
--- a/Documentation/virt/kvm/devices/vcpu.rst
+++ b/Documentation/virt/kvm/devices/vcpu.rst
@@ -161,3 +161,60 @@ Specifies the base address of the stolen time structure for this VCPU. The
 base address must be 64 byte aligned and exist within a valid guest memory
 region. See Documentation/virt/kvm/arm/pvtime.rst for more information
 including the layout of the stolen time structure.
+
+4. GROUP: KVM_VCPU_TSC_CTRL
+===========================
+
+:Architectures: x86
+
+4.1 ATTRIBUTE: KVM_VCPU_TSC_OFFSET
+
+:Parameters: 64-bit unsigned TSC offset
+
+Returns:
+
+	 ======= ======================================
+	 -EFAULT Error reading/writing the provided
+		 parameter address.
+	 -ENXIO  Attribute not supported
+	 ======= ======================================
+
+Specifies the guest's TSC offset relative to the host's TSC. The guest's
+TSC is then derived by the following equation:
+
+  guest_tsc = host_tsc + KVM_VCPU_TSC_OFFSET
+
+This attribute is useful for the precise migration of a guest's TSC. The
+following describes a possible algorithm to use for the migration of a
+guest's TSC:
+
+From the source VMM process:
+
+1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (t_0),
+   kvmclock nanoseconds (k_0), and realtime nanoseconds (r_0).
+
+2. Read the KVM_VCPU_TSC_OFFSET attribute for every vCPU to record the
+   guest TSC offset (off_n).
+
+3. Invoke the KVM_GET_TSC_KHZ ioctl to record the frequency of the
+   guest's TSC (freq).
+
+From the destination VMM process:
+
+4. Invoke the KVM_SET_CLOCK ioctl, providing the kvmclock nanoseconds
+   (k_0) and realtime nanoseconds (r_0) in their respective fields.
+   Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
+   structure. KVM will advance the VM's kvmclock to account for elapsed
+   time since recording the clock values.
+
+5. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (t_1) and
+   kvmclock nanoseconds (k_1).
+
+6. Adjust the guest TSC offsets for every vCPU to account for (1) time
+   elapsed since recording state and (2) difference in TSCs between the
+   source and destination machine:
+
+   new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1
+
+7. Write the KVM_VCPU_TSC_OFFSET attribute for every vCPU with the
+   respective value derived in the previous step.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5accfe7246ce..09c678f2e616 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1096,6 +1096,7 @@ struct kvm_arch {
 	u64 last_tsc_nsec;
 	u64 last_tsc_write;
 	u32 last_tsc_khz;
+	u64 last_tsc_offset;
 	u64 cur_tsc_nsec;
 	u64 cur_tsc_write;
 	u64 cur_tsc_offset;
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 2ef1f6513c68..5a776a08f78c 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -504,4 +504,8 @@ struct kvm_pmu_event_filter {
 #define KVM_PMU_EVENT_ALLOW 0
 #define KVM_PMU_EVENT_DENY 1
 
+/* for KVM_{GET,SET,HAS}_DEVICE_ATTR */
+#define KVM_VCPU_TSC_CTRL 0 /* control group for the timestamp counter (TSC) */
+#define   KVM_VCPU_TSC_OFFSET 0 /* attribute for the TSC offset */
+
 #endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1ea65bb2e74d..1177604c805a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2470,6 +2470,7 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
 	kvm->arch.last_tsc_nsec = ns;
 	kvm->arch.last_tsc_write = tsc;
 	kvm->arch.last_tsc_khz = vcpu->arch.virtual_tsc_khz;
+	kvm->arch.last_tsc_offset = offset;
 
 	vcpu->arch.last_guest_tsc = tsc;
 
@@ -4069,6 +4070,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_VM_COPY_ENC_CONTEXT_FROM:
 	case KVM_CAP_SREGS2:
 	case KVM_CAP_EXIT_ON_EMULATION_FAILURE:
+	case KVM_CAP_VCPU_ATTRIBUTES:
 		r = 1;
 		break;
 	case KVM_CAP_EXIT_HYPERCALL:
@@ -4933,6 +4935,109 @@ static int kvm_set_guest_paused(struct kvm_vcpu *vcpu)
 	return 0;
 }
 
+static int kvm_arch_tsc_has_attr(struct kvm_vcpu *vcpu,
+				 struct kvm_device_attr *attr)
+{
+	int r;
+
+	switch (attr->attr) {
+	case KVM_VCPU_TSC_OFFSET:
+		r = 0;
+		break;
+	default:
+		r = -ENXIO;
+	}
+
+	return r;
+}
+
+static int kvm_arch_tsc_get_attr(struct kvm_vcpu *vcpu,
+				 struct kvm_device_attr *attr)
+{
+	u64 __user *uaddr = (u64 __user *)attr->addr;
+	int r;
+
+	switch (attr->attr) {
+	case KVM_VCPU_TSC_OFFSET:
+		r = -EFAULT;
+		if (put_user(vcpu->arch.l1_tsc_offset, uaddr))
+			break;
+		r = 0;
+		break;
+	default:
+		r = -ENXIO;
+	}
+
+	return r;
+}
+
+static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu,
+				 struct kvm_device_attr *attr)
+{
+	u64 __user *uaddr = (u64 __user *)attr->addr;
+	struct kvm *kvm = vcpu->kvm;
+	int r;
+
+	switch (attr->attr) {
+	case KVM_VCPU_TSC_OFFSET: {
+		u64 offset, tsc, ns;
+		unsigned long flags;
+		bool matched;
+
+		r = -EFAULT;
+		if (get_user(offset, uaddr))
+			break;
+
+		raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
+
+		matched = (vcpu->arch.virtual_tsc_khz &&
+			   kvm->arch.last_tsc_khz == vcpu->arch.virtual_tsc_khz &&
+			   kvm->arch.last_tsc_offset == offset);
+
+		tsc = kvm_scale_tsc(vcpu, rdtsc(), vcpu->arch.l1_tsc_scaling_ratio) + offset;
+		ns = get_kvmclock_base_ns();
+
+		__kvm_synchronize_tsc(vcpu, offset, tsc, ns, matched);
+		raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
+
+		r = 0;
+		break;
+	}
+	default:
+		r = -ENXIO;
+	}
+
+	return r;
+}
+
+static int kvm_vcpu_ioctl_device_attr(struct kvm_vcpu *vcpu,
+				      unsigned int ioctl,
+				      void __user *argp)
+{
+	struct kvm_device_attr attr;
+	int r;
+
+	if (copy_from_user(&attr, argp, sizeof(attr)))
+		return -EFAULT;
+
+	if (attr.group != KVM_VCPU_TSC_CTRL)
+		return -ENXIO;
+
+	switch (ioctl) {
+	case KVM_HAS_DEVICE_ATTR:
+		r = kvm_arch_tsc_has_attr(vcpu, &attr);
+		break;
+	case KVM_GET_DEVICE_ATTR:
+		r = kvm_arch_tsc_get_attr(vcpu, &attr);
+		break;
+	case KVM_SET_DEVICE_ATTR:
+		r = kvm_arch_tsc_set_attr(vcpu, &attr);
+		break;
+	}
+
+	return r;
+}
+
 static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
 				     struct kvm_enable_cap *cap)
 {
@@ -5387,6 +5492,11 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
 		r = __set_sregs2(vcpu, u.sregs2);
 		break;
 	}
+	case KVM_HAS_DEVICE_ATTR:
+	case KVM_GET_DEVICE_ATTR:
+	case KVM_SET_DEVICE_ATTR:
+		r = kvm_vcpu_ioctl_device_attr(vcpu, ioctl, argp);
+		break;
 	default:
 		r = -EINVAL;
 	}
-- 
2.33.0.309.g3052b89438-goog
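
Purely as an illustration of the arithmetic in step 6 of the
documentation above (not part of the patch), the destination-side
computation could be written as below. With freq taken from
KVM_GET_TSC_KHZ (kHz) and the kvmclock values in nanoseconds, the
elapsed-time term needs a 10^6 scale factor; overflow handling is
omitted for brevity:

#include <stdint.h>

/* new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1, with explicit units */
static uint64_t new_tsc_offset(uint64_t t_0, uint64_t off_n, uint64_t t_1,
			       uint64_t k_0, uint64_t k_1, uint64_t freq_khz)
{
	/* elapsed kvmclock nanoseconds converted to guest TSC cycles */
	uint64_t elapsed_cycles = (k_1 - k_0) * freq_khz / 1000000ULL;

	return t_0 + off_n + elapsed_cycles - t_1;
}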



^ permalink raw reply related	[flat|nested] 113+ messages in thread

* [PATCH v8 7/7] KVM: x86: Expose TSC offset controls to userspace
@ 2021-09-16 18:15   ` Oliver Upton
  0 siblings, 0 replies; 113+ messages in thread
From: Oliver Upton @ 2021-09-16 18:15 UTC (permalink / raw)
  To: kvm, kvmarm
  Cc: Paolo Bonzini, Sean Christopherson, Marc Zyngier, Peter Shier,
	Jim Mattson, David Matlack, Ricardo Koller, Jing Zhang,
	Raghavendra Rao Anata, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Andrew Jones, Will Deacon,
	Catalin Marinas, Oliver Upton

To date, VMM-directed TSC synchronization and migration have been a bit
messy. KVM has some baked-in heuristics around TSC writes to infer if
the VMM is attempting to synchronize. This is problematic, as it depends
on host userspace writing to the guest's TSC within 1 second of the last
write.

A much cleaner approach to configuring the guest's views of the TSC is to
simply migrate the TSC offset for every vCPU. Offsets are idempotent,
and thus not subject to change depending on when the VMM actually
reads/writes values from/to KVM. The VMM can then read the TSC once with
KVM_GET_CLOCK to capture a (realtime, host_tsc) pair at the instant when
the guest is paused.

Cc: David Matlack <dmatlack@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Oliver Upton <oupton@google.com>
---
 Documentation/virt/kvm/devices/vcpu.rst |  57 ++++++++++++
 arch/x86/include/asm/kvm_host.h         |   1 +
 arch/x86/include/uapi/asm/kvm.h         |   4 +
 arch/x86/kvm/x86.c                      | 110 ++++++++++++++++++++++++
 4 files changed, 172 insertions(+)

diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
index 2acec3b9ef65..3b399d727c11 100644
--- a/Documentation/virt/kvm/devices/vcpu.rst
+++ b/Documentation/virt/kvm/devices/vcpu.rst
@@ -161,3 +161,60 @@ Specifies the base address of the stolen time structure for this VCPU. The
 base address must be 64 byte aligned and exist within a valid guest memory
 region. See Documentation/virt/kvm/arm/pvtime.rst for more information
 including the layout of the stolen time structure.
+
+4. GROUP: KVM_VCPU_TSC_CTRL
+===========================
+
+:Architectures: x86
+
+4.1 ATTRIBUTE: KVM_VCPU_TSC_OFFSET
+
+:Parameters: 64-bit unsigned TSC offset
+
+Returns:
+
+	 ======= ======================================
+	 -EFAULT Error reading/writing the provided
+		 parameter address.
+	 -ENXIO  Attribute not supported
+	 ======= ======================================
+
+Specifies the guest's TSC offset relative to the host's TSC. The guest's
+TSC is then derived by the following equation:
+
+  guest_tsc = host_tsc + KVM_VCPU_TSC_OFFSET
+
+This attribute is useful for the precise migration of a guest's TSC. The
+following describes a possible algorithm to use for the migration of a
+guest's TSC:
+
+From the source VMM process:
+
+1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (t_0),
+   kvmclock nanoseconds (k_0), and realtime nanoseconds (r_0).
+
+2. Read the KVM_VCPU_TSC_OFFSET attribute for every vCPU to record the
+   guest TSC offset (off_n).
+
+3. Invoke the KVM_GET_TSC_KHZ ioctl to record the frequency of the
+   guest's TSC (freq).
+
+From the destination VMM process:
+
+4. Invoke the KVM_SET_CLOCK ioctl, providing the kvmclock nanoseconds
+   (k_0) and realtime nanoseconds (r_0) in their respective fields.
+   Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
+   structure. KVM will advance the VM's kvmclock to account for elapsed
+   time since recording the clock values.
+
+5. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (t_1) and
+   kvmclock nanoseconds (k_1).
+
+6. Adjust the guest TSC offsets for every vCPU to account for (1) time
+   elapsed since recording state and (2) difference in TSCs between the
+   source and destination machine:
+
+   new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1
+
+7. Write the KVM_VCPU_TSC_OFFSET attribute for every vCPU with the
+   respective value derived in the previous step.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5accfe7246ce..09c678f2e616 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1096,6 +1096,7 @@ struct kvm_arch {
 	u64 last_tsc_nsec;
 	u64 last_tsc_write;
 	u32 last_tsc_khz;
+	u64 last_tsc_offset;
 	u64 cur_tsc_nsec;
 	u64 cur_tsc_write;
 	u64 cur_tsc_offset;
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 2ef1f6513c68..5a776a08f78c 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -504,4 +504,8 @@ struct kvm_pmu_event_filter {
 #define KVM_PMU_EVENT_ALLOW 0
 #define KVM_PMU_EVENT_DENY 1
 
+/* for KVM_{GET,SET,HAS}_DEVICE_ATTR */
+#define KVM_VCPU_TSC_CTRL 0 /* control group for the timestamp counter (TSC) */
+#define   KVM_VCPU_TSC_OFFSET 0 /* attribute for the TSC offset */
+
 #endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1ea65bb2e74d..1177604c805a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2470,6 +2470,7 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
 	kvm->arch.last_tsc_nsec = ns;
 	kvm->arch.last_tsc_write = tsc;
 	kvm->arch.last_tsc_khz = vcpu->arch.virtual_tsc_khz;
+	kvm->arch.last_tsc_offset = offset;
 
 	vcpu->arch.last_guest_tsc = tsc;
 
@@ -4069,6 +4070,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_VM_COPY_ENC_CONTEXT_FROM:
 	case KVM_CAP_SREGS2:
 	case KVM_CAP_EXIT_ON_EMULATION_FAILURE:
+	case KVM_CAP_VCPU_ATTRIBUTES:
 		r = 1;
 		break;
 	case KVM_CAP_EXIT_HYPERCALL:
@@ -4933,6 +4935,109 @@ static int kvm_set_guest_paused(struct kvm_vcpu *vcpu)
 	return 0;
 }
 
+static int kvm_arch_tsc_has_attr(struct kvm_vcpu *vcpu,
+				 struct kvm_device_attr *attr)
+{
+	int r;
+
+	switch (attr->attr) {
+	case KVM_VCPU_TSC_OFFSET:
+		r = 0;
+		break;
+	default:
+		r = -ENXIO;
+	}
+
+	return r;
+}
+
+static int kvm_arch_tsc_get_attr(struct kvm_vcpu *vcpu,
+				 struct kvm_device_attr *attr)
+{
+	u64 __user *uaddr = (u64 __user *)attr->addr;
+	int r;
+
+	switch (attr->attr) {
+	case KVM_VCPU_TSC_OFFSET:
+		r = -EFAULT;
+		if (put_user(vcpu->arch.l1_tsc_offset, uaddr))
+			break;
+		r = 0;
+		break;
+	default:
+		r = -ENXIO;
+	}
+
+	return r;
+}
+
+static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu,
+				 struct kvm_device_attr *attr)
+{
+	u64 __user *uaddr = (u64 __user *)attr->addr;
+	struct kvm *kvm = vcpu->kvm;
+	int r;
+
+	switch (attr->attr) {
+	case KVM_VCPU_TSC_OFFSET: {
+		u64 offset, tsc, ns;
+		unsigned long flags;
+		bool matched;
+
+		r = -EFAULT;
+		if (get_user(offset, uaddr))
+			break;
+
+		raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
+
+		matched = (vcpu->arch.virtual_tsc_khz &&
+			   kvm->arch.last_tsc_khz == vcpu->arch.virtual_tsc_khz &&
+			   kvm->arch.last_tsc_offset == offset);
+
+		tsc = kvm_scale_tsc(vcpu, rdtsc(), vcpu->arch.l1_tsc_scaling_ratio) + offset;
+		ns = get_kvmclock_base_ns();
+
+		__kvm_synchronize_tsc(vcpu, offset, tsc, ns, matched);
+		raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
+
+		r = 0;
+		break;
+	}
+	default:
+		r = -ENXIO;
+	}
+
+	return r;
+}
+
+static int kvm_vcpu_ioctl_device_attr(struct kvm_vcpu *vcpu,
+				      unsigned int ioctl,
+				      void __user *argp)
+{
+	struct kvm_device_attr attr;
+	int r;
+
+	if (copy_from_user(&attr, argp, sizeof(attr)))
+		return -EFAULT;
+
+	if (attr.group != KVM_VCPU_TSC_CTRL)
+		return -ENXIO;
+
+	switch (ioctl) {
+	case KVM_HAS_DEVICE_ATTR:
+		r = kvm_arch_tsc_has_attr(vcpu, &attr);
+		break;
+	case KVM_GET_DEVICE_ATTR:
+		r = kvm_arch_tsc_get_attr(vcpu, &attr);
+		break;
+	case KVM_SET_DEVICE_ATTR:
+		r = kvm_arch_tsc_set_attr(vcpu, &attr);
+		break;
+	}
+
+	return r;
+}
+
 static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
 				     struct kvm_enable_cap *cap)
 {
@@ -5387,6 +5492,11 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
 		r = __set_sregs2(vcpu, u.sregs2);
 		break;
 	}
+	case KVM_HAS_DEVICE_ATTR:
+	case KVM_GET_DEVICE_ATTR:
+	case KVM_SET_DEVICE_ATTR:
+		r = kvm_vcpu_ioctl_device_attr(vcpu, ioctl, argp);
+		break;
 	default:
 		r = -EINVAL;
 	}
-- 
2.33.0.309.g3052b89438-goog
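
A hypothetical probing helper (illustrative only) shows how userspace
would discover the new control before depending on it, using the
KVM_CAP_VCPU_ATTRIBUTES capability advertised by this series:

#include <stdbool.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static bool vcpu_has_tsc_offset_attr(int vm_fd, int vcpu_fd)
{
	struct kvm_device_attr attr = {
		.group = KVM_VCPU_TSC_CTRL,
		.attr  = KVM_VCPU_TSC_OFFSET,
	};

	/* KVM_CHECK_EXTENSION works on the VM fd as well as /dev/kvm */
	if (ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_VCPU_ATTRIBUTES) <= 0)
		return false;

	return ioctl(vcpu_fd, KVM_HAS_DEVICE_ATTR, &attr) == 0;
}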


^ permalink raw reply related	[flat|nested] 113+ messages in thread

* Re: [PATCH v8 5/7] kvm: x86: protect masterclock with a seqcount
  2021-09-16 18:15   ` Oliver Upton
@ 2021-09-24 16:42     ` Paolo Bonzini
  -1 siblings, 0 replies; 113+ messages in thread
From: Paolo Bonzini @ 2021-09-24 16:42 UTC (permalink / raw)
  To: Oliver Upton, kvm, kvmarm
  Cc: Sean Christopherson, Marc Zyngier, Peter Shier, Jim Mattson,
	David Matlack, Ricardo Koller, Jing Zhang, Raghavendra Rao Anata,
	James Morse, Alexandru Elisei, Suzuki K Poulose,
	linux-arm-kernel, Andrew Jones, Will Deacon, Catalin Marinas

On 16/09/21 20:15, Oliver Upton wrote:
> From: Paolo Bonzini<pbonzini@redhat.com>
> 
> Protect the reference point for kvmclock with a seqcount, so that
> kvmclock updates for all vCPUs can proceed in parallel.  Xen runstate
> updates will also run in parallel and not bounce the kvmclock cacheline.
> 
> nr_vcpus_matched_tsc is updated outside pvclock_update_vm_gtod_copy
> though, so a spinlock must be kept for that one.
> 
> Signed-off-by: Paolo Bonzini<pbonzini@redhat.com>
> [Oliver - drop unused locals, don't double acquire tsc_write_lock]
> Signed-off-by: Oliver Upton<oupton@google.com>
> ---

This needs a small adjustment:

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 07d00e711043..b0c21d42f453 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11289,6 +11289,7 @@ void kvm_arch_free_vm(struct kvm *kvm)
  int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
  {
  	int ret;
+	unsigned long flags;
  
  	if (type)
  		return -EINVAL;
@@ -11314,7 +11315,10 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
  	mutex_init(&kvm->arch.apic_map_lock);
  	seqcount_raw_spinlock_init(&kvm->arch.pvclock_sc, &kvm->arch.tsc_write_lock);
  	kvm->arch.kvmclock_offset = -get_kvmclock_base_ns();
+
+	raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
  	pvclock_update_vm_gtod_copy(kvm);
+	raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
  
  	kvm->arch.guest_can_read_msr_platform_info = true;
  

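For completeness, the read side that this seqcount enables follows the
standard pattern below -- an illustrative fragment only, built on the
pvclock_sc field initialized in the hunk above:

	unsigned int seq;

	do {
		seq = read_seqcount_begin(&kvm->arch.pvclock_sc);
		/* snapshot the masterclock reference values here */
	} while (read_seqcount_retry(&kvm->arch.pvclock_sc, seq));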

^ permalink raw reply related	[flat|nested] 113+ messages in thread

* Re: [PATCH v8 5/7] kvm: x86: protect masterclock with a seqcount
@ 2021-09-24 16:42     ` Paolo Bonzini
  0 siblings, 0 replies; 113+ messages in thread
From: Paolo Bonzini @ 2021-09-24 16:42 UTC (permalink / raw)
  To: Oliver Upton, kvm, kvmarm
  Cc: Catalin Marinas, Will Deacon, Peter Shier, Marc Zyngier,
	David Matlack, linux-arm-kernel, Jim Mattson

On 16/09/21 20:15, Oliver Upton wrote:
> From: Paolo Bonzini<pbonzini@redhat.com>
> 
> Protect the reference point for kvmclock with a seqcount, so that
> kvmclock updates for all vCPUs can proceed in parallel.  Xen runstate
> updates will also run in parallel and not bounce the kvmclock cacheline.
> 
> nr_vcpus_matched_tsc is updated outside pvclock_update_vm_gtod_copy
> though, so a spinlock must be kept for that one.
> 
> Signed-off-by: Paolo Bonzini<pbonzini@redhat.com>
> [Oliver - drop unused locals, don't double acquire tsc_write_lock]
> Signed-off-by: Oliver Upton<oupton@google.com>
> ---

This needs a small adjustment:

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 07d00e711043..b0c21d42f453 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11289,6 +11289,7 @@ void kvm_arch_free_vm(struct kvm *kvm)
  int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
  {
  	int ret;
+	unsigned long flags;
  
  	if (type)
  		return -EINVAL;
@@ -11314,7 +11315,10 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
  	mutex_init(&kvm->arch.apic_map_lock);
  	seqcount_raw_spinlock_init(&kvm->arch.pvclock_sc, &kvm->arch.tsc_write_lock);
  	kvm->arch.kvmclock_offset = -get_kvmclock_base_ns();
+
+	raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
  	pvclock_update_vm_gtod_copy(kvm);
+	raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
  
  	kvm->arch.guest_can_read_msr_platform_info = true;
  


^ permalink raw reply related	[flat|nested] 113+ messages in thread

* Re: [PATCH v8 5/7] kvm: x86: protect masterclock with a seqcount
@ 2021-09-24 16:42     ` Paolo Bonzini
  0 siblings, 0 replies; 113+ messages in thread
From: Paolo Bonzini @ 2021-09-24 16:42 UTC (permalink / raw)
  To: Oliver Upton, kvm, kvmarm
  Cc: Sean Christopherson, Marc Zyngier, Peter Shier, Jim Mattson,
	David Matlack, Ricardo Koller, Jing Zhang, Raghavendra Rao Anata,
	James Morse, Alexandru Elisei, Suzuki K Poulose,
	linux-arm-kernel, Andrew Jones, Will Deacon, Catalin Marinas

On 16/09/21 20:15, Oliver Upton wrote:
> From: Paolo Bonzini<pbonzini@redhat.com>
> 
> Protect the reference point for kvmclock with a seqcount, so that
> kvmclock updates for all vCPUs can proceed in parallel.  Xen runstate
> updates will also run in parallel and not bounce the kvmclock cacheline.
> 
> nr_vcpus_matched_tsc is updated outside pvclock_update_vm_gtod_copy
> though, so a spinlock must be kept for that one.
> 
> Signed-off-by: Paolo Bonzini<pbonzini@redhat.com>
> [Oliver - drop unused locals, don't double acquire tsc_write_lock]
> Signed-off-by: Oliver Upton<oupton@google.com>
> ---

This needs a small adjustment:

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 07d00e711043..b0c21d42f453 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11289,6 +11289,7 @@ void kvm_arch_free_vm(struct kvm *kvm)
  int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
  {
  	int ret;
+	unsigned long flags;
  
  	if (type)
  		return -EINVAL;
@@ -11314,7 +11315,10 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
  	mutex_init(&kvm->arch.apic_map_lock);
  	seqcount_raw_spinlock_init(&kvm->arch.pvclock_sc, &kvm->arch.tsc_write_lock);
  	kvm->arch.kvmclock_offset = -get_kvmclock_base_ns();
+
+	raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
  	pvclock_update_vm_gtod_copy(kvm);
+	raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
  
  	kvm->arch.guest_can_read_msr_platform_info = true;
  



^ permalink raw reply related	[flat|nested] 113+ messages in thread

* Re: [PATCH v8 0/7] KVM: x86: Add idempotent controls for migrating system counter state
  2021-09-16 18:15 ` Oliver Upton
@ 2021-09-24 16:43   ` Paolo Bonzini
  -1 siblings, 0 replies; 113+ messages in thread
From: Paolo Bonzini @ 2021-09-24 16:43 UTC (permalink / raw)
  To: Oliver Upton, kvm, kvmarm
  Cc: Sean Christopherson, Marc Zyngier, Peter Shier, Jim Mattson,
	David Matlack, Ricardo Koller, Jing Zhang, Raghavendra Rao Anata,
	James Morse, Alexandru Elisei, Suzuki K Poulose,
	linux-arm-kernel, Andrew Jones, Will Deacon, Catalin Marinas

On 16/09/21 20:15, Oliver Upton wrote:
> KVM's current means of saving/restoring system counters is plagued with
> temporal issues. On x86, we migrate the guest's system counter by-value
> through the respective guest's IA32_TSC value. Restoring system counters
> by-value is brittle as the state is not idempotent: the host system
> counter is still oscillating between the attempted save and restore.
> Furthermore, VMMs may wish to transparently live migrate guest VMs,
> meaning that they include the elapsed time due to live migration blackout
> in the guest system counter view. The VMM thread could be preempted for
> any number of reasons (scheduler, L0 hypervisor under nested) between the
> time that it calculates the desired guest counter value and when
> KVM actually sets this counter state.
> 
> Despite the value-based interface that we present to userspace, KVM
> actually has idempotent guest controls by way of the TSC offset.
> We can avoid all of the issues associated with a value-based interface
> by abstracting these offset controls in a new device attribute. This
> series introduces new vCPU device attributes to provide userspace access
> to the vCPU's system counter offset.
> 
> Patches 1-2 are Paolo's refactorings around locking and the
> KVM_{GET,SET}_CLOCK ioctls.
> 
> Patch 3 cures a race where use_master_clock is read outside of the
> pvclock lock in the KVM_GET_CLOCK ioctl.
> 
> Patch 4 adopts Paolo's suggestion, augmenting the KVM_{GET,SET}_CLOCK
> ioctls to provide userspace with a (host_tsc, realtime) instant. This is
> essential for a VMM to perform precise migration of the guest's system
> counters.
> 
> Patch 5 does away with the pvclock spin lock in favor of a sequence
> lock based on the tsc_write_lock. The original patch is from Paolo, I
> touched it up a bit to fix a deadlock and some unused variables that
> caused -Werror to scream.
> 
> Patch 6 extracts the TSC synchronization tracking code in a way that it
> can be used for both offset-based and value-based TSC synchronization
> schemes.
> 
> Finally, patch 7 implements a vCPU device attribute which allows VMMs to
> get at the TSC offset of a vCPU.
> 
> This series was tested with the new KVM selftests for the KVM clock and
> system counter offset controls on Haswell hardware. Kernel was built
> with CONFIG_LOCKDEP given the new locking changes/lockdep assertions
> here.
> 
> Note that these tests are mailed as a separate series due to the
> dependencies in both x86 and arm64.
> 
> Applies cleanly to 5.15-rc1
> 
> v8: http://lore.kernel.org/r/20210816001130.3059564-1-oupton@google.com
> 
> v7 -> v8:
>   - Rebased to 5.15-rc1
>   - Picked up Paolo's version of the series, which includes locking
>     changes
>   - Make KVM advertise KVM_CAP_VCPU_ATTRIBUTES
> 
> Oliver Upton (4):
>    KVM: x86: Fix potential race in KVM_GET_CLOCK
>    KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK
>    KVM: x86: Refactor tsc synchronization code
>    KVM: x86: Expose TSC offset controls to userspace
> 
> Paolo Bonzini (3):
>    kvm: x86: abstract locking around pvclock_update_vm_gtod_copy
>    KVM: x86: extract KVM_GET_CLOCK/KVM_SET_CLOCK to separate functions
>    kvm: x86: protect masterclock with a seqcount
> 
>   Documentation/virt/kvm/api.rst          |  42 ++-
>   Documentation/virt/kvm/devices/vcpu.rst |  57 +++
>   arch/x86/include/asm/kvm_host.h         |  12 +-
>   arch/x86/include/uapi/asm/kvm.h         |   4 +
>   arch/x86/kvm/x86.c                      | 458 ++++++++++++++++--------
>   include/uapi/linux/kvm.h                |   7 +-
>   6 files changed, 419 insertions(+), 161 deletions(-)
> 

Queued, thanks.

Paolo


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v8 0/7] KVM: x86: Add idempotent controls for migrating system counter state
@ 2021-09-24 16:43   ` Paolo Bonzini
  0 siblings, 0 replies; 113+ messages in thread
From: Paolo Bonzini @ 2021-09-24 16:43 UTC (permalink / raw)
  To: Oliver Upton, kvm, kvmarm
  Cc: Catalin Marinas, Will Deacon, Peter Shier, Marc Zyngier,
	David Matlack, linux-arm-kernel, Jim Mattson

On 16/09/21 20:15, Oliver Upton wrote:
> KVM's current means of saving/restoring system counters is plagued with
> temporal issues. On x86, we migrate the guest's system counter by-value
> through the respective guest's IA32_TSC value. Restoring system counters
> by-value is brittle as the state is not idempotent: the host system
> counter is still oscillating between the attempted save and restore.
> Furthermore, VMMs may wish to transparently live migrate guest VMs,
> meaning that they include the elapsed time due to live migration blackout
> in the guest system counter view. The VMM thread could be preempted for
> any number of reasons (scheduler, L0 hypervisor under nested) between the
> time that it calculates the desired guest counter value and when
> KVM actually sets this counter state.
> 
> Despite the value-based interface that we present to userspace, KVM
> actually has idempotent guest controls by way of the TSC offset.
> We can avoid all of the issues associated with a value-based interface
> by abstracting these offset controls in a new device attribute. This
> series introduces new vCPU device attributes to provide userspace access
> to the vCPU's system counter offset.
> 
> Patches 1-2 are Paolo's refactorings around locking and the
> KVM_{GET,SET}_CLOCK ioctls.
> 
> Patch 3 cures a race where use_master_clock is read outside of the
> pvclock lock in the KVM_GET_CLOCK ioctl.
> 
> Patch 4 adopts Paolo's suggestion, augmenting the KVM_{GET,SET}_CLOCK
> ioctls to provide userspace with a (host_tsc, realtime) instant. This is
> essential for a VMM to perform precise migration of the guest's system
> counters.
> 
> Patch 5 does away with the pvclock spin lock in favor of a sequence
> lock based on the tsc_write_lock. The original patch is from Paolo, I
> touched it up a bit to fix a deadlock and some unused variables that
> caused -Werror to scream.
> 
> Patch 6 extracts the TSC synchronization tracking code in a way that it
> can be used for both offset-based and value-based TSC synchronization
> schemes.
> 
> Finally, patch 7 implements a vCPU device attribute which allows VMMs to
> get at the TSC offset of a vCPU.
> 
> This series was tested with the new KVM selftests for the KVM clock and
> system counter offset controls on Haswell hardware. Kernel was built
> with CONFIG_LOCKDEP given the new locking changes/lockdep assertions
> here.
> 
> Note that these tests are mailed as a separate series due to the
> dependencies in both x86 and arm64.
> 
> Applies cleanly to 5.15-rc1
> 
> v8: http://lore.kernel.org/r/20210816001130.3059564-1-oupton@google.com
> 
> v7 -> v8:
>   - Rebased to 5.15-rc1
>   - Picked up Paolo's version of the series, which includes locking
>     changes
>   - Make KVM advertise KVM_CAP_VCPU_ATTRIBUTES
> 
> Oliver Upton (4):
>    KVM: x86: Fix potential race in KVM_GET_CLOCK
>    KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK
>    KVM: x86: Refactor tsc synchronization code
>    KVM: x86: Expose TSC offset controls to userspace
> 
> Paolo Bonzini (3):
>    kvm: x86: abstract locking around pvclock_update_vm_gtod_copy
>    KVM: x86: extract KVM_GET_CLOCK/KVM_SET_CLOCK to separate functions
>    kvm: x86: protect masterclock with a seqcount
> 
>   Documentation/virt/kvm/api.rst          |  42 ++-
>   Documentation/virt/kvm/devices/vcpu.rst |  57 +++
>   arch/x86/include/asm/kvm_host.h         |  12 +-
>   arch/x86/include/uapi/asm/kvm.h         |   4 +
>   arch/x86/kvm/x86.c                      | 458 ++++++++++++++++--------
>   include/uapi/linux/kvm.h                |   7 +-
>   6 files changed, 419 insertions(+), 161 deletions(-)
> 

Queued, thanks.

Paolo


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v8 0/7] KVM: x86: Add idempotent controls for migrating system counter state
@ 2021-09-24 16:43   ` Paolo Bonzini
  0 siblings, 0 replies; 113+ messages in thread
From: Paolo Bonzini @ 2021-09-24 16:43 UTC (permalink / raw)
  To: Oliver Upton, kvm, kvmarm
  Cc: Sean Christopherson, Marc Zyngier, Peter Shier, Jim Mattson,
	David Matlack, Ricardo Koller, Jing Zhang, Raghavendra Rao Anata,
	James Morse, Alexandru Elisei, Suzuki K Poulose,
	linux-arm-kernel, Andrew Jones, Will Deacon, Catalin Marinas

On 16/09/21 20:15, Oliver Upton wrote:
> KVM's current means of saving/restoring system counters is plagued with
> temporal issues. On x86, we migrate the guest's system counter by-value
> through the respective guest's IA32_TSC value. Restoring system counters
> by-value is brittle as the state is not idempotent: the host system
> counter is still oscillating between the attempted save and restore.
> Furthermore, VMMs may wish to transparently live migrate guest VMs,
> meaning that they include the elapsed time due to live migration blackout
> in the guest system counter view. The VMM thread could be preempted for
> any number of reasons (scheduler, L0 hypervisor under nested) between the
> time that it calculates the desired guest counter value and when
> KVM actually sets this counter state.
> 
> Despite the value-based interface that we present to userspace, KVM
> actually has idempotent guest controls by way of the TSC offset.
> We can avoid all of the issues associated with a value-based interface
> by abstracting these offset controls in a new device attribute. This
> series introduces new vCPU device attributes to provide userspace access
> to the vCPU's system counter offset.
> 
> Patches 1-2 are Paolo's refactorings around locking and the
> KVM_{GET,SET}_CLOCK ioctls.
> 
> Patch 3 cures a race where use_master_clock is read outside of the
> pvclock lock in the KVM_GET_CLOCK ioctl.
> 
> Patch 4 adopts Paolo's suggestion, augmenting the KVM_{GET,SET}_CLOCK
> ioctls to provide userspace with a (host_tsc, realtime) instant. This is
> essential for a VMM to perform precise migration of the guest's system
> counters.
> 
> Patch 5 does away with the pvclock spin lock in favor of a sequence
> lock based on the tsc_write_lock. The original patch is from Paolo, I
> touched it up a bit to fix a deadlock and some unused variables that
> caused -Werror to scream.
> 
> Patch 6 extracts the TSC synchronization tracking code in a way that it
> can be used for both offset-based and value-based TSC synchronization
> schemes.
> 
> Finally, patch 7 implements a vCPU device attribute which allows VMMs to
> get at the TSC offset of a vCPU.
> 
> This series was tested with the new KVM selftests for the KVM clock and
> system counter offset controls on Haswell hardware. Kernel was built
> with CONFIG_LOCKDEP given the new locking changes/lockdep assertions
> here.
> 
> Note that these tests are mailed as a separate series due to the
> dependencies in both x86 and arm64.
> 
> Applies cleanly to 5.15-rc1
> 
> v8: http://lore.kernel.org/r/20210816001130.3059564-1-oupton@google.com
> 
> v7 -> v8:
>   - Rebased to 5.15-rc1
>   - Picked up Paolo's version of the series, which includes locking
>     changes
>   - Make KVM advertise KVM_CAP_VCPU_ATTRIBUTES
> 
> Oliver Upton (4):
>    KVM: x86: Fix potential race in KVM_GET_CLOCK
>    KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK
>    KVM: x86: Refactor tsc synchronization code
>    KVM: x86: Expose TSC offset controls to userspace
> 
> Paolo Bonzini (3):
>    kvm: x86: abstract locking around pvclock_update_vm_gtod_copy
>    KVM: x86: extract KVM_GET_CLOCK/KVM_SET_CLOCK to separate functions
>    kvm: x86: protect masterclock with a seqcount
> 
>   Documentation/virt/kvm/api.rst          |  42 ++-
>   Documentation/virt/kvm/devices/vcpu.rst |  57 +++
>   arch/x86/include/asm/kvm_host.h         |  12 +-
>   arch/x86/include/uapi/asm/kvm.h         |   4 +
>   arch/x86/kvm/x86.c                      | 458 ++++++++++++++++--------
>   include/uapi/linux/kvm.h                |   7 +-
>   6 files changed, 419 insertions(+), 161 deletions(-)
> 

Queued, thanks.

Paolo



^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v8 4/7] KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK
  2021-09-16 18:15   ` Oliver Upton
@ 2021-09-28 18:53     ` Marcelo Tosatti
  -1 siblings, 0 replies; 113+ messages in thread
From: Marcelo Tosatti @ 2021-09-28 18:53 UTC (permalink / raw)
  To: Oliver Upton
  Cc: kvm, kvmarm, Paolo Bonzini, Sean Christopherson, Marc Zyngier,
	Peter Shier, Jim Mattson, David Matlack, Ricardo Koller,
	Jing Zhang, Raghavendra Rao Anata, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Andrew Jones, Will Deacon,
	Catalin Marinas

On Thu, Sep 16, 2021 at 06:15:35PM +0000, Oliver Upton wrote:
> Handling the migration of TSCs correctly is difficult, in part because
> Linux does not provide userspace with the ability to retrieve a (TSC,
> realtime) clock pair for a single instant in time. In lieu of a more
> convenient facility, KVM can report similar information in the kvm_clock
> structure.
> 
> Provide userspace with a host TSC & realtime pair iff the realtime clock
> is based on the TSC. If userspace provides KVM_SET_CLOCK with a valid
> realtime value, advance the KVM clock by the amount of elapsed time. Do
> not step the KVM clock backwards, though, as it is a monotonic
> oscillator.
> 
> Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Oliver Upton <oupton@google.com>
> ---
>  Documentation/virt/kvm/api.rst  | 42 ++++++++++++++++++++++++++-------
>  arch/x86/include/asm/kvm_host.h |  3 +++
>  arch/x86/kvm/x86.c              | 36 +++++++++++++++++++++-------
>  include/uapi/linux/kvm.h        |  7 +++++-
>  4 files changed, 70 insertions(+), 18 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index a6729c8cf063..d0b9c986cf6c 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -993,20 +993,34 @@ such as migration.
>  When KVM_CAP_ADJUST_CLOCK is passed to KVM_CHECK_EXTENSION, it returns the
>  set of bits that KVM can return in struct kvm_clock_data's flag member.
>  
> -The only flag defined now is KVM_CLOCK_TSC_STABLE.  If set, the returned
> -value is the exact kvmclock value seen by all VCPUs at the instant
> -when KVM_GET_CLOCK was called.  If clear, the returned value is simply
> -CLOCK_MONOTONIC plus a constant offset; the offset can be modified
> -with KVM_SET_CLOCK.  KVM will try to make all VCPUs follow this clock,
> -but the exact value read by each VCPU could differ, because the host
> -TSC is not stable.
> +FLAGS:
> +
> +KVM_CLOCK_TSC_STABLE.  If set, the returned value is the exact kvmclock
> +value seen by all VCPUs at the instant when KVM_GET_CLOCK was called.
> +If clear, the returned value is simply CLOCK_MONOTONIC plus a constant
> +offset; the offset can be modified with KVM_SET_CLOCK.  KVM will try
> +to make all VCPUs follow this clock, but the exact value read by each
> +VCPU could differ, because the host TSC is not stable.
> +
> +KVM_CLOCK_REALTIME.  If set, the `realtime` field in the kvm_clock_data
> +structure is populated with the value of the host's real time
> +clocksource at the instant when KVM_GET_CLOCK was called. If clear,
> +the `realtime` field does not contain a value.
> +
> +KVM_CLOCK_HOST_TSC.  If set, the `host_tsc` field in the kvm_clock_data
> +structure is populated with the value of the host's timestamp counter (TSC)
> +at the instant when KVM_GET_CLOCK was called. If clear, the `host_tsc` field
> +does not contain a value.

If the host TSCs are not stable, then the KVM_CLOCK_HOST_TSC bit (and
host_tsc field) is ambiguous. Shouldn't exposing them be conditional on
a stable TSC for the host?
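
A minimal sketch of how userspace is expected to consume the flags
quoted above (field names as proposed in patch 4 of this series;
illustrative only):

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static void dump_kvm_clock(int vm_fd)
{
	struct kvm_clock_data data = { 0 };

	if (ioctl(vm_fd, KVM_GET_CLOCK, &data))
		return;

	/* Only trust host_tsc/realtime when the corresponding flag is set. */
	if (data.flags & KVM_CLOCK_HOST_TSC)
		printf("host_tsc: %llu\n", (unsigned long long)data.host_tsc);

	if (data.flags & KVM_CLOCK_REALTIME)
		printf("realtime: %llu ns\n", (unsigned long long)data.realtime);
}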


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v8 4/7] KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK
  2021-09-28 18:53     ` Marcelo Tosatti
  (?)
@ 2021-09-29 11:20       ` Paolo Bonzini
  -1 siblings, 0 replies; 113+ messages in thread
From: Paolo Bonzini @ 2021-09-29 11:20 UTC (permalink / raw)
  To: Marcelo Tosatti, Oliver Upton
  Cc: kvm, kvmarm, Sean Christopherson, Marc Zyngier, Peter Shier,
	Jim Mattson, David Matlack, Ricardo Koller, Jing Zhang,
	Raghavendra Rao Anata, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Andrew Jones, Will Deacon,
	Catalin Marinas

On 28/09/21 20:53, Marcelo Tosatti wrote:
>> +KVM_CLOCK_HOST_TSC.  If set, the `host_tsc` field in the kvm_clock_data
>> +structure is populated with the value of the host's timestamp counter (TSC)
>> +at the instant when KVM_GET_CLOCK was called. If clear, the `host_tsc` field
>> +does not contain a value.
> If the host TSCs are not stable, then KVM_CLOCK_HOST_TSC bit (and
> host_tsc field) are ambiguous. Shouldnt exposing them be conditional on
> stable TSC for the host ?
> 

Yes, good point.

Paolo


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v8 3/7] KVM: x86: Fix potential race in KVM_GET_CLOCK
  2021-09-16 18:15   ` Oliver Upton
  (?)
@ 2021-09-29 13:33     ` Marcelo Tosatti
  -1 siblings, 0 replies; 113+ messages in thread
From: Marcelo Tosatti @ 2021-09-29 13:33 UTC (permalink / raw)
  To: Oliver Upton
  Cc: kvm, kvmarm, Paolo Bonzini, Sean Christopherson, Marc Zyngier,
	Peter Shier, Jim Mattson, David Matlack, Ricardo Koller,
	Jing Zhang, Raghavendra Rao Anata, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Andrew Jones, Will Deacon,
	Catalin Marinas

On Thu, Sep 16, 2021 at 06:15:34PM +0000, Oliver Upton wrote:
> Sean noticed that KVM_GET_CLOCK was checking kvm_arch.use_master_clock
> outside of the pvclock sync lock. This is problematic, as the clock
> value written to the user may or may not actually correspond to a stable
> TSC.
> 
> Fix the race by populating the entire kvm_clock_data structure behind
> the pvclock_gtod_sync_lock.
> 
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Oliver Upton <oupton@google.com>

ACK patches 1-3, still reviewing the remaining ones...
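
The bug pattern described above is the classic torn snapshot: a guard flag
read outside the lock and the data it guards read inside it, so the two can
describe different states. A minimal userspace analogy (pthreads, not the KVM
code itself):

  #include <pthread.h>
  #include <stdint.h>

  struct clock_state {
  	pthread_mutex_t lock;
  	int use_master_clock;	/* analogue of ka->use_master_clock */
  	uint64_t master_ns;	/* analogue of ka->master_kernel_ns */
  };

  /* Racy: the flag can change between the unlocked check and the locked
   * read, so flag and value may not belong to the same state. */
  static uint64_t read_racy(struct clock_state *s)
  {
  	int stable = s->use_master_clock;	/* unlocked read */
  	uint64_t ns;

  	pthread_mutex_lock(&s->lock);
  	ns = s->master_ns;
  	pthread_mutex_unlock(&s->lock);
  	return stable ? ns : 0;
  }

  /* Fixed: snapshot everything inside one critical section, which is what
   * the patch does for the whole kvm_clock_data structure. */
  static uint64_t read_consistent(struct clock_state *s)
  {
  	uint64_t ns = 0;

  	pthread_mutex_lock(&s->lock);
  	if (s->use_master_clock)
  		ns = s->master_ns;
  	pthread_mutex_unlock(&s->lock);
  	return ns;
  }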


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v8 4/7] KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK
  2021-09-16 18:15   ` Oliver Upton
  (?)
@ 2021-09-29 18:56     ` Marcelo Tosatti
  -1 siblings, 0 replies; 113+ messages in thread
From: Marcelo Tosatti @ 2021-09-29 18:56 UTC (permalink / raw)
  To: Oliver Upton
  Cc: kvm, kvmarm, Paolo Bonzini, Sean Christopherson, Marc Zyngier,
	Peter Shier, Jim Mattson, David Matlack, Ricardo Koller,
	Jing Zhang, Raghavendra Rao Anata, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Andrew Jones, Will Deacon,
	Catalin Marinas, Thomas Gleixner

Oliver,

Do you have any numbers for the improvement in guests' CLOCK_REALTIME
accuracy across migration, when this is in place?

On Thu, Sep 16, 2021 at 06:15:35PM +0000, Oliver Upton wrote:
> Handling the migration of TSCs correctly is difficult, in part because
> Linux does not provide userspace with the ability to retrieve a (TSC,
> realtime) clock pair for a single instant in time. In lieu of a more
> convenient facility, KVM can report similar information in the kvm_clock
> structure.
> 
> Provide userspace with a host TSC & realtime pair iff the realtime clock
> is based on the TSC. If userspace provides KVM_SET_CLOCK with a valid
> realtime value, advance the KVM clock by the amount of elapsed time. Do
> not step the KVM clock backwards, though, as it is a monotonic
> oscillator.
> 
> Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Oliver Upton <oupton@google.com>
> ---
>  Documentation/virt/kvm/api.rst  | 42 ++++++++++++++++++++++++++-------
>  arch/x86/include/asm/kvm_host.h |  3 +++
>  arch/x86/kvm/x86.c              | 36 +++++++++++++++++++++-------
>  include/uapi/linux/kvm.h        |  7 +++++-
>  4 files changed, 70 insertions(+), 18 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index a6729c8cf063..d0b9c986cf6c 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -993,20 +993,34 @@ such as migration.
>  When KVM_CAP_ADJUST_CLOCK is passed to KVM_CHECK_EXTENSION, it returns the
>  set of bits that KVM can return in struct kvm_clock_data's flag member.
>  
> -The only flag defined now is KVM_CLOCK_TSC_STABLE.  If set, the returned
> -value is the exact kvmclock value seen by all VCPUs at the instant
> -when KVM_GET_CLOCK was called.  If clear, the returned value is simply
> -CLOCK_MONOTONIC plus a constant offset; the offset can be modified
> -with KVM_SET_CLOCK.  KVM will try to make all VCPUs follow this clock,
> -but the exact value read by each VCPU could differ, because the host
> -TSC is not stable.
> +FLAGS:
> +
> +KVM_CLOCK_TSC_STABLE.  If set, the returned value is the exact kvmclock
> +value seen by all VCPUs at the instant when KVM_GET_CLOCK was called.
> +If clear, the returned value is simply CLOCK_MONOTONIC plus a constant
> +offset; the offset can be modified with KVM_SET_CLOCK.  KVM will try
> +to make all VCPUs follow this clock, but the exact value read by each
> +VCPU could differ, because the host TSC is not stable.
> +
> +KVM_CLOCK_REALTIME.  If set, the `realtime` field in the kvm_clock_data
> +structure is populated with the value of the host's real time
> +clocksource at the instant when KVM_GET_CLOCK was called. If clear,
> +the `realtime` field does not contain a value.
> +
> +KVM_CLOCK_HOST_TSC.  If set, the `host_tsc` field in the kvm_clock_data
> +structure is populated with the value of the host's timestamp counter (TSC)
> +at the instant when KVM_GET_CLOCK was called. If clear, the `host_tsc` field
> +does not contain a value.
>  
>  ::
>  
>    struct kvm_clock_data {
>  	__u64 clock;  /* kvmclock current value */
>  	__u32 flags;
> -	__u32 pad[9];
> +	__u32 pad0;
> +	__u64 realtime;
> +	__u64 host_tsc;
> +	__u32 pad[4];
>    };
>  
>  
> @@ -1023,12 +1037,22 @@ Sets the current timestamp of kvmclock to the value specified in its parameter.
>  In conjunction with KVM_GET_CLOCK, it is used to ensure monotonicity on scenarios
>  such as migration.
>  
> +FLAGS:
> +
> +KVM_CLOCK_REALTIME.  If set, KVM will compare the value of the `realtime` field
> +with the value of the host's real time clocksource at the instant when
> +KVM_SET_CLOCK was called. The difference in elapsed time is added to the final
> +kvmclock value that will be provided to guests.
> +
>  ::
>  
>    struct kvm_clock_data {
>  	__u64 clock;  /* kvmclock current value */
>  	__u32 flags;
> -	__u32 pad[9];
> +	__u32 pad0;
> +	__u64 realtime;
> +	__u64 host_tsc;
> +	__u32 pad[4];
>    };
>  
>  
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index be6805fc0260..9c34b5b63e39 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1936,4 +1936,7 @@ int kvm_cpu_dirty_log_size(void);
>  
>  int alloc_all_memslots_rmaps(struct kvm *kvm);
>  
> +#define KVM_CLOCK_VALID_FLAGS						\
> +	(KVM_CLOCK_TSC_STABLE | KVM_CLOCK_REALTIME | KVM_CLOCK_HOST_TSC)
> +
>  #endif /* _ASM_X86_KVM_HOST_H */
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 523c4e5c109f..cb5d5cad5124 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2815,10 +2815,20 @@ static void get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data)
>  	get_cpu();
>  
>  	if (__this_cpu_read(cpu_tsc_khz)) {
> +#ifdef CONFIG_X86_64
> +		struct timespec64 ts;
> +
> +		if (kvm_get_walltime_and_clockread(&ts, &data->host_tsc)) {
> +			data->realtime = ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec;
> +			data->flags |= KVM_CLOCK_REALTIME | KVM_CLOCK_HOST_TSC;
> +		} else
> +#endif
> +		data->host_tsc = rdtsc();
> +
>  		kvm_get_time_scale(NSEC_PER_SEC, __this_cpu_read(cpu_tsc_khz) * 1000LL,
>  				   &hv_clock.tsc_shift,
>  				   &hv_clock.tsc_to_system_mul);
> -		data->clock = __pvclock_read_cycles(&hv_clock, rdtsc());
> +		data->clock = __pvclock_read_cycles(&hv_clock, data->host_tsc);
>  	} else {
>  		data->clock = get_kvmclock_base_ns() + ka->kvmclock_offset;
>  	}
> @@ -4062,7 +4072,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  		r = KVM_SYNC_X86_VALID_FIELDS;
>  		break;
>  	case KVM_CAP_ADJUST_CLOCK:
> -		r = KVM_CLOCK_TSC_STABLE;
> +		r = KVM_CLOCK_VALID_FLAGS;
>  		break;
>  	case KVM_CAP_X86_DISABLE_EXITS:
>  		r |=  KVM_X86_DISABLE_EXITS_HLT | KVM_X86_DISABLE_EXITS_PAUSE |
> @@ -5859,12 +5869,12 @@ static int kvm_vm_ioctl_set_clock(struct kvm *kvm, void __user *argp)
>  {
>  	struct kvm_arch *ka = &kvm->arch;
>  	struct kvm_clock_data data;
> -	u64 now_ns;
> +	u64 now_raw_ns;
>  
>  	if (copy_from_user(&data, argp, sizeof(data)))
>  		return -EFAULT;
>  
> -	if (data.flags)
> +	if (data.flags & ~KVM_CLOCK_REALTIME)
>  		return -EINVAL;
>  
>  	kvm_hv_invalidate_tsc_page(kvm);
> @@ -5878,11 +5888,21 @@ static int kvm_vm_ioctl_set_clock(struct kvm *kvm, void __user *argp)
>  	 * is slightly ahead) here we risk going negative on unsigned
>  	 * 'system_time' when 'data.clock' is very small.
>  	 */
> -	if (kvm->arch.use_master_clock)
> -		now_ns = ka->master_kernel_ns;
> +	if (data.flags & KVM_CLOCK_REALTIME) {
> +		u64 now_real_ns = ktime_get_real_ns();
> +
> +		/*
> +		 * Avoid stepping the kvmclock backwards.
> +		 */
> +		if (now_real_ns > data.realtime)
> +			data.clock += now_real_ns - data.realtime;
> +	}

Forward jumps can also cause problems, for example:

* Kernel watchdogs

* https://patchwork.ozlabs.org/project/qemu-devel/patch/20130618233825.GA19042@amt.cnet/

So perhaps limiting the amount of forward jump that is allowed
would be a good thing? (Such a jump can happen if the two hosts'
realtime clocks are off.)

Now by how much, I am not sure.

Or, as mentioned earlier, only enable KVM_CLOCK_REALTIME if userspace
KVM code checks clock synchronization.

Thomas, CC'ed, has a deeper understanding of problems with
forward time jumps than I do. Thomas, any comments?

As a note: this makes it not OK to use the KVM_CLOCK_REALTIME flag
for either vm pause / vm resume (well, if paused for long periods of time)
or savevm / restorevm.
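
To make the capping idea concrete, one possible shape is sketched below;
clamp_clock_advance() and max_advance_ns are made up for illustration and
are not part of the series:

  /* Hypothetical: bound the forward adjustment applied at KVM_SET_CLOCK. */
  static u64 clamp_clock_advance(u64 now_real_ns, u64 src_realtime_ns,
  			       u64 max_advance_ns)
  {
  	u64 delta;

  	if (now_real_ns <= src_realtime_ns)
  		return 0;	/* never step kvmclock backwards */

  	delta = now_real_ns - src_realtime_ns;
  	return delta < max_advance_ns ? delta : max_advance_ns;
  }

  	...
  	if (data.flags & KVM_CLOCK_REALTIME)
  		data.clock += clamp_clock_advance(ktime_get_real_ns(),
  						  data.realtime,
  						  max_advance_ns);

where max_advance_ns would come from whatever policy is agreed on (module
parameter, capability, hardcoded bound, ...).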

> +	if (ka->use_master_clock)
> +		now_raw_ns = ka->master_kernel_ns;
>  	else
> -		now_ns = get_kvmclock_base_ns();
> -	ka->kvmclock_offset = data.clock - now_ns;
> +		now_raw_ns = get_kvmclock_base_ns();
> +	ka->kvmclock_offset = data.clock - now_raw_ns;
>  	kvm_end_pvclock_update(kvm);
>  	return 0;
>  }
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index a067410ebea5..d228bf394465 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1223,11 +1223,16 @@ struct kvm_irqfd {
>  
>  /* Do not use 1, KVM_CHECK_EXTENSION returned it before we had flags.  */
>  #define KVM_CLOCK_TSC_STABLE		2
> +#define KVM_CLOCK_REALTIME		(1 << 2)
> +#define KVM_CLOCK_HOST_TSC		(1 << 3)
>  
>  struct kvm_clock_data {
>  	__u64 clock;
>  	__u32 flags;
> -	__u32 pad[9];
> +	__u32 pad0;
> +	__u64 realtime;
> +	__u64 host_tsc;
> +	__u32 pad[4];
>  };
>  
>  /* For KVM_CAP_SW_TLB */
> -- 
> 2.33.0.309.g3052b89438-goog
> 
> 
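
For completeness, the destination-side counterpart of the flow above might
look like the sketch below (same headers as the GET-side sketch earlier; the
restore_clock() name is made up, and `saved` is the kvm_clock_data captured
with KVM_GET_CLOCK on the source):

  static int restore_clock(int vm_fd, const struct kvm_clock_data *saved)
  {
  	struct kvm_clock_data data = *saved;	/* clock + realtime from the source */

  	/* Only KVM_CLOCK_REALTIME is accepted by KVM_SET_CLOCK; when set,
  	 * KVM adds the wall-clock time elapsed since the save, without
  	 * ever stepping kvmclock backwards. */
  	data.flags = KVM_CLOCK_REALTIME;
  	return ioctl(vm_fd, KVM_SET_CLOCK, &data);
  }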


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v8 5/7] kvm: x86: protect masterclock with a seqcount
  2021-09-16 18:15   ` Oliver Upton
  (?)
@ 2021-09-30 17:51     ` Marcelo Tosatti
  -1 siblings, 0 replies; 113+ messages in thread
From: Marcelo Tosatti @ 2021-09-30 17:51 UTC (permalink / raw)
  To: Oliver Upton
  Cc: kvm, kvmarm, Paolo Bonzini, Sean Christopherson, Marc Zyngier,
	Peter Shier, Jim Mattson, David Matlack, Ricardo Koller,
	Jing Zhang, Raghavendra Rao Anata, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Andrew Jones, Will Deacon,
	Catalin Marinas

On Thu, Sep 16, 2021 at 06:15:36PM +0000, Oliver Upton wrote:
> From: Paolo Bonzini <pbonzini@redhat.com>
> 
> Protect the reference point for kvmclock with a seqcount, so that
> kvmclock updates for all vCPUs can proceed in parallel.  Xen runstate
> updates will also run in parallel and not bounce the kvmclock cacheline.
> 
> nr_vcpus_matched_tsc is updated outside pvclock_update_vm_gtod_copy
> though, so a spinlock must be kept for that one.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> [Oliver - drop unused locals, don't double acquire tsc_write_lock]
> Signed-off-by: Oliver Upton <oupton@google.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  7 ++-
>  arch/x86/kvm/x86.c              | 83 +++++++++++++++++----------------
>  2 files changed, 49 insertions(+), 41 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 9c34b5b63e39..5accfe7246ce 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1087,6 +1087,11 @@ struct kvm_arch {
>  
>  	unsigned long irq_sources_bitmap;
>  	s64 kvmclock_offset;
> +
> +	/*
> +	 * This also protects nr_vcpus_matched_tsc which is read from a
> +	 * preemption-disabled region, so it must be a raw spinlock.
> +	 */
>  	raw_spinlock_t tsc_write_lock;
>  	u64 last_tsc_nsec;
>  	u64 last_tsc_write;
> @@ -1097,7 +1102,7 @@ struct kvm_arch {
>  	u64 cur_tsc_generation;
>  	int nr_vcpus_matched_tsc;
>  
> -	spinlock_t pvclock_gtod_sync_lock;
> +	seqcount_raw_spinlock_t pvclock_sc;
>  	bool use_master_clock;
>  	u64 master_kernel_ns;
>  	u64 master_cycle_now;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index cb5d5cad5124..29156c49cd11 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2533,9 +2533,7 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
>  	vcpu->arch.this_tsc_write = kvm->arch.cur_tsc_write;
>  
>  	kvm_vcpu_write_tsc_offset(vcpu, offset);
> -	raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
>  
> -	spin_lock_irqsave(&kvm->arch.pvclock_gtod_sync_lock, flags);
>  	if (!matched) {
>  		kvm->arch.nr_vcpus_matched_tsc = 0;
>  	} else if (!already_matched) {
> @@ -2543,7 +2541,7 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
>  	}
>  
>  	kvm_track_tsc_matching(vcpu);
> -	spin_unlock_irqrestore(&kvm->arch.pvclock_gtod_sync_lock, flags);
> +	raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
>  }
>  
>  static inline void adjust_tsc_offset_guest(struct kvm_vcpu *vcpu,
> @@ -2731,9 +2729,6 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
>  	int vclock_mode;
>  	bool host_tsc_clocksource, vcpus_matched;
>  
> -	vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 ==
> -			atomic_read(&kvm->online_vcpus));
> -
>  	/*
>  	 * If the host uses TSC clock, then passthrough TSC as stable
>  	 * to the guest.
> @@ -2742,6 +2737,10 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
>  					&ka->master_kernel_ns,
>  					&ka->master_cycle_now);
>  
> +	lockdep_assert_held(&kvm->arch.tsc_write_lock);
> +	vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 ==
> +			atomic_read(&kvm->online_vcpus));
> +
>  	ka->use_master_clock = host_tsc_clocksource && vcpus_matched
>  				&& !ka->backwards_tsc_observed
>  				&& !ka->boot_vcpu_runs_old_kvmclock;
> @@ -2760,14 +2759,18 @@ static void kvm_make_mclock_inprogress_request(struct kvm *kvm)
>  	kvm_make_all_cpus_request(kvm, KVM_REQ_MCLOCK_INPROGRESS);
>  }
>  
> -static void kvm_start_pvclock_update(struct kvm *kvm)
> +static void __kvm_start_pvclock_update(struct kvm *kvm)
>  {
> -	struct kvm_arch *ka = &kvm->arch;
> +	raw_spin_lock_irq(&kvm->arch.tsc_write_lock);
> +	write_seqcount_begin(&kvm->arch.pvclock_sc);
> +}
>  
> +static void kvm_start_pvclock_update(struct kvm *kvm)
> +{
>  	kvm_make_mclock_inprogress_request(kvm);
>  
>  	/* no guest entries from this point */
> -	spin_lock_irq(&ka->pvclock_gtod_sync_lock);
> +	__kvm_start_pvclock_update(kvm);
>  }
>  
>  static void kvm_end_pvclock_update(struct kvm *kvm)
> @@ -2776,7 +2779,8 @@ static void kvm_end_pvclock_update(struct kvm *kvm)
>  	struct kvm_vcpu *vcpu;
>  	int i;
>  
> -	spin_unlock_irq(&ka->pvclock_gtod_sync_lock);
> +	write_seqcount_end(&ka->pvclock_sc);
> +	raw_spin_unlock_irq(&ka->tsc_write_lock);
>  	kvm_for_each_vcpu(i, vcpu, kvm)
>  		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
>  
> @@ -2797,20 +2801,12 @@ static void get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data)
>  {
>  	struct kvm_arch *ka = &kvm->arch;
>  	struct pvclock_vcpu_time_info hv_clock;
> -	unsigned long flags;
>  
> -	spin_lock_irqsave(&ka->pvclock_gtod_sync_lock, flags);
>  	if (!ka->use_master_clock) {
> -		spin_unlock_irqrestore(&ka->pvclock_gtod_sync_lock, flags);
>  		data->clock = get_kvmclock_base_ns() + ka->kvmclock_offset;
>  		return;
>  	}
>  
> -	data->flags |= KVM_CLOCK_TSC_STABLE;
> -	hv_clock.tsc_timestamp = ka->master_cycle_now;
> -	hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
> -	spin_unlock_irqrestore(&ka->pvclock_gtod_sync_lock, flags);
> -
>  	/* both __this_cpu_read() and rdtsc() should be on the same cpu */
>  	get_cpu();
>  
> @@ -2825,6 +2821,9 @@ static void get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data)
>  #endif
>  		data->host_tsc = rdtsc();
>  
> +		data->flags |= KVM_CLOCK_TSC_STABLE;
> +		hv_clock.tsc_timestamp = ka->master_cycle_now;
> +		hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
>  		kvm_get_time_scale(NSEC_PER_SEC, __this_cpu_read(cpu_tsc_khz) * 1000LL,
>  				   &hv_clock.tsc_shift,
>  				   &hv_clock.tsc_to_system_mul);
> @@ -2839,14 +2838,14 @@ static void get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data)
>  u64 get_kvmclock_ns(struct kvm *kvm)
>  {
>  	struct kvm_clock_data data;
> +	struct kvm_arch *ka = &kvm->arch;
> +	unsigned seq;
>  
> -	/*
> -	 * Zero flags as it's accessed RMW, leave everything else uninitialized
> -	 * as clock is always written and no other fields are consumed.
> -	 */
> -	data.flags = 0;
> -
> -	get_kvmclock(kvm, &data);
> +	do {
> +		seq = read_seqcount_begin(&ka->pvclock_sc);
> +		data.flags = 0;
> +		get_kvmclock(kvm, &data);
> +	} while (read_seqcount_retry(&ka->pvclock_sc, seq));
>  	return data.clock;
>  }
>  
> @@ -2912,6 +2911,7 @@ static void kvm_setup_pvclock_page(struct kvm_vcpu *v,
>  static int kvm_guest_time_update(struct kvm_vcpu *v)
>  {
>  	unsigned long flags, tgt_tsc_khz;
> +	unsigned seq;
>  	struct kvm_vcpu_arch *vcpu = &v->arch;
>  	struct kvm_arch *ka = &v->kvm->arch;
>  	s64 kernel_ns;
> @@ -2926,13 +2926,14 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
>  	 * If the host uses TSC clock, then passthrough TSC as stable
>  	 * to the guest.
>  	 */
> -	spin_lock_irqsave(&ka->pvclock_gtod_sync_lock, flags);
> -	use_master_clock = ka->use_master_clock;
> -	if (use_master_clock) {
> -		host_tsc = ka->master_cycle_now;
> -		kernel_ns = ka->master_kernel_ns;
> -	}
> -	spin_unlock_irqrestore(&ka->pvclock_gtod_sync_lock, flags);
> +	seq = read_seqcount_begin(&ka->pvclock_sc);
> +	do {
> +		use_master_clock = ka->use_master_clock;
> +		if (use_master_clock) {
> +			host_tsc = ka->master_cycle_now;
> +			kernel_ns = ka->master_kernel_ns;
> +		}
> +	} while (read_seqcount_retry(&ka->pvclock_sc, seq));
>  
>  	/* Keep irq disabled to prevent changes to the clock */
>  	local_irq_save(flags);
> @@ -5855,10 +5856,15 @@ int kvm_arch_pm_notifier(struct kvm *kvm, unsigned long state)
>  
>  static int kvm_vm_ioctl_get_clock(struct kvm *kvm, void __user *argp)
>  {
> -	struct kvm_clock_data data;
> +	struct kvm_clock_data data = { 0 };
> +	unsigned seq;
> +
> +	do {
> +		seq = read_seqcount_begin(&kvm->arch.pvclock_sc);
> +		data.flags = 0;
> +		get_kvmclock(kvm, &data);
> +	} while (read_seqcount_retry(&kvm->arch.pvclock_sc, seq));
>  
> -	memset(&data, 0, sizeof(data));
> -	get_kvmclock(kvm, &data);
>  	if (copy_to_user(argp, &data, sizeof(data)))
>  		return -EFAULT;
>  
> @@ -8159,9 +8165,7 @@ static void kvm_hyperv_tsc_notifier(void)
>  	kvm_max_guest_tsc_khz = tsc_khz;
>  
>  	list_for_each_entry(kvm, &vm_list, vm_list) {
> -		struct kvm_arch *ka = &kvm->arch;
> -
> -		spin_lock_irq(&ka->pvclock_gtod_sync_lock);
> +		__kvm_start_pvclock_update(kvm);
>  		pvclock_update_vm_gtod_copy(kvm);
>  		kvm_end_pvclock_update(kvm);
>  	}
> @@ -11188,8 +11192,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>  
>  	raw_spin_lock_init(&kvm->arch.tsc_write_lock);
>  	mutex_init(&kvm->arch.apic_map_lock);
> -	spin_lock_init(&kvm->arch.pvclock_gtod_sync_lock);
> -
> +	seqcount_raw_spinlock_init(&kvm->arch.pvclock_sc, &kvm->arch.tsc_write_lock);
>  	kvm->arch.kvmclock_offset = -get_kvmclock_base_ns();
>  	pvclock_update_vm_gtod_copy(kvm);
>  
> -- 
> 2.33.0.309.g3052b89438-goog
> 
> 

ACK
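
As a quick reference for the locking scheme adopted here, this is the general
shape of a seqcount_raw_spinlock_t user, written as a self-contained sketch
(the demo_* names are made up; the real code is in the hunks above). Writers
serialize on the raw spinlock and bump the sequence around their updates;
readers take lockless snapshots and retry if a writer ran concurrently:

  #include <linux/seqlock.h>
  #include <linux/spinlock.h>
  #include <linux/types.h>

  static DEFINE_RAW_SPINLOCK(demo_lock);
  static seqcount_raw_spinlock_t demo_sc;
  static u64 demo_a, demo_b;

  static void demo_init(void)
  {
  	seqcount_raw_spinlock_init(&demo_sc, &demo_lock);
  }

  static void demo_write(u64 a, u64 b)
  {
  	raw_spin_lock_irq(&demo_lock);
  	write_seqcount_begin(&demo_sc);
  	demo_a = a;
  	demo_b = b;
  	write_seqcount_end(&demo_sc);
  	raw_spin_unlock_irq(&demo_lock);
  }

  static u64 demo_read(void)
  {
  	unsigned int seq;
  	u64 sum;

  	do {
  		seq = read_seqcount_begin(&demo_sc);
  		sum = demo_a + demo_b;
  	} while (read_seqcount_retry(&demo_sc, seq));

  	return sum;
  }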


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v8 5/7] kvm: x86: protect masterclock with a seqcount
@ 2021-09-30 17:51     ` Marcelo Tosatti
  0 siblings, 0 replies; 113+ messages in thread
From: Marcelo Tosatti @ 2021-09-30 17:51 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Catalin Marinas, kvm, Will Deacon, Marc Zyngier, Peter Shier,
	David Matlack, Paolo Bonzini, kvmarm, linux-arm-kernel,
	Jim Mattson

On Thu, Sep 16, 2021 at 06:15:36PM +0000, Oliver Upton wrote:
> From: Paolo Bonzini <pbonzini@redhat.com>
> 
> Protect the reference point for kvmclock with a seqcount, so that
> kvmclock updates for all vCPUs can proceed in parallel.  Xen runstate
> updates will also run in parallel and not bounce the kvmclock cacheline.
> 
> nr_vcpus_matched_tsc is updated outside pvclock_update_vm_gtod_copy
> though, so a spinlock must be kept for that one.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> [Oliver - drop unused locals, don't double acquire tsc_write_lock]
> Signed-off-by: Oliver Upton <oupton@google.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  7 ++-
>  arch/x86/kvm/x86.c              | 83 +++++++++++++++++----------------
>  2 files changed, 49 insertions(+), 41 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 9c34b5b63e39..5accfe7246ce 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1087,6 +1087,11 @@ struct kvm_arch {
>  
>  	unsigned long irq_sources_bitmap;
>  	s64 kvmclock_offset;
> +
> +	/*
> +	 * This also protects nr_vcpus_matched_tsc which is read from a
> +	 * preemption-disabled region, so it must be a raw spinlock.
> +	 */
>  	raw_spinlock_t tsc_write_lock;
>  	u64 last_tsc_nsec;
>  	u64 last_tsc_write;
> @@ -1097,7 +1102,7 @@ struct kvm_arch {
>  	u64 cur_tsc_generation;
>  	int nr_vcpus_matched_tsc;
>  
> -	spinlock_t pvclock_gtod_sync_lock;
> +	seqcount_raw_spinlock_t pvclock_sc;
>  	bool use_master_clock;
>  	u64 master_kernel_ns;
>  	u64 master_cycle_now;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index cb5d5cad5124..29156c49cd11 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2533,9 +2533,7 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
>  	vcpu->arch.this_tsc_write = kvm->arch.cur_tsc_write;
>  
>  	kvm_vcpu_write_tsc_offset(vcpu, offset);
> -	raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
>  
> -	spin_lock_irqsave(&kvm->arch.pvclock_gtod_sync_lock, flags);
>  	if (!matched) {
>  		kvm->arch.nr_vcpus_matched_tsc = 0;
>  	} else if (!already_matched) {
> @@ -2543,7 +2541,7 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
>  	}
>  
>  	kvm_track_tsc_matching(vcpu);
> -	spin_unlock_irqrestore(&kvm->arch.pvclock_gtod_sync_lock, flags);
> +	raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);
>  }
>  
>  static inline void adjust_tsc_offset_guest(struct kvm_vcpu *vcpu,
> @@ -2731,9 +2729,6 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
>  	int vclock_mode;
>  	bool host_tsc_clocksource, vcpus_matched;
>  
> -	vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 ==
> -			atomic_read(&kvm->online_vcpus));
> -
>  	/*
>  	 * If the host uses TSC clock, then passthrough TSC as stable
>  	 * to the guest.
> @@ -2742,6 +2737,10 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
>  					&ka->master_kernel_ns,
>  					&ka->master_cycle_now);
>  
> +	lockdep_assert_held(&kvm->arch.tsc_write_lock);
> +	vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 ==
> +			atomic_read(&kvm->online_vcpus));
> +
>  	ka->use_master_clock = host_tsc_clocksource && vcpus_matched
>  				&& !ka->backwards_tsc_observed
>  				&& !ka->boot_vcpu_runs_old_kvmclock;
> @@ -2760,14 +2759,18 @@ static void kvm_make_mclock_inprogress_request(struct kvm *kvm)
>  	kvm_make_all_cpus_request(kvm, KVM_REQ_MCLOCK_INPROGRESS);
>  }
>  
> -static void kvm_start_pvclock_update(struct kvm *kvm)
> +static void __kvm_start_pvclock_update(struct kvm *kvm)
>  {
> -	struct kvm_arch *ka = &kvm->arch;
> +	raw_spin_lock_irq(&kvm->arch.tsc_write_lock);
> +	write_seqcount_begin(&kvm->arch.pvclock_sc);
> +}
>  
> +static void kvm_start_pvclock_update(struct kvm *kvm)
> +{
>  	kvm_make_mclock_inprogress_request(kvm);
>  
>  	/* no guest entries from this point */
> -	spin_lock_irq(&ka->pvclock_gtod_sync_lock);
> +	__kvm_start_pvclock_update(kvm);
>  }
>  
>  static void kvm_end_pvclock_update(struct kvm *kvm)
> @@ -2776,7 +2779,8 @@ static void kvm_end_pvclock_update(struct kvm *kvm)
>  	struct kvm_vcpu *vcpu;
>  	int i;
>  
> -	spin_unlock_irq(&ka->pvclock_gtod_sync_lock);
> +	write_seqcount_end(&ka->pvclock_sc);
> +	raw_spin_unlock_irq(&ka->tsc_write_lock);
>  	kvm_for_each_vcpu(i, vcpu, kvm)
>  		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
>  
> @@ -2797,20 +2801,12 @@ static void get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data)
>  {
>  	struct kvm_arch *ka = &kvm->arch;
>  	struct pvclock_vcpu_time_info hv_clock;
> -	unsigned long flags;
>  
> -	spin_lock_irqsave(&ka->pvclock_gtod_sync_lock, flags);
>  	if (!ka->use_master_clock) {
> -		spin_unlock_irqrestore(&ka->pvclock_gtod_sync_lock, flags);
>  		data->clock = get_kvmclock_base_ns() + ka->kvmclock_offset;
>  		return;
>  	}
>  
> -	data->flags |= KVM_CLOCK_TSC_STABLE;
> -	hv_clock.tsc_timestamp = ka->master_cycle_now;
> -	hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
> -	spin_unlock_irqrestore(&ka->pvclock_gtod_sync_lock, flags);
> -
>  	/* both __this_cpu_read() and rdtsc() should be on the same cpu */
>  	get_cpu();
>  
> @@ -2825,6 +2821,9 @@ static void get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data)
>  #endif
>  		data->host_tsc = rdtsc();
>  
> +		data->flags |= KVM_CLOCK_TSC_STABLE;
> +		hv_clock.tsc_timestamp = ka->master_cycle_now;
> +		hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
>  		kvm_get_time_scale(NSEC_PER_SEC, __this_cpu_read(cpu_tsc_khz) * 1000LL,
>  				   &hv_clock.tsc_shift,
>  				   &hv_clock.tsc_to_system_mul);
> @@ -2839,14 +2838,14 @@ static void get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data)
>  u64 get_kvmclock_ns(struct kvm *kvm)
>  {
>  	struct kvm_clock_data data;
> +	struct kvm_arch *ka = &kvm->arch;
> +	unsigned seq;
>  
> -	/*
> -	 * Zero flags as it's accessed RMW, leave everything else uninitialized
> -	 * as clock is always written and no other fields are consumed.
> -	 */
> -	data.flags = 0;
> -
> -	get_kvmclock(kvm, &data);
> +	do {
> +		seq = read_seqcount_begin(&ka->pvclock_sc);
> +		data.flags = 0;
> +		get_kvmclock(kvm, &data);
> +	} while (read_seqcount_retry(&ka->pvclock_sc, seq));
>  	return data.clock;
>  }
>  
> @@ -2912,6 +2911,7 @@ static void kvm_setup_pvclock_page(struct kvm_vcpu *v,
>  static int kvm_guest_time_update(struct kvm_vcpu *v)
>  {
>  	unsigned long flags, tgt_tsc_khz;
> +	unsigned seq;
>  	struct kvm_vcpu_arch *vcpu = &v->arch;
>  	struct kvm_arch *ka = &v->kvm->arch;
>  	s64 kernel_ns;
> @@ -2926,13 +2926,14 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
>  	 * If the host uses TSC clock, then passthrough TSC as stable
>  	 * to the guest.
>  	 */
> -	spin_lock_irqsave(&ka->pvclock_gtod_sync_lock, flags);
> -	use_master_clock = ka->use_master_clock;
> -	if (use_master_clock) {
> -		host_tsc = ka->master_cycle_now;
> -		kernel_ns = ka->master_kernel_ns;
> -	}
> -	spin_unlock_irqrestore(&ka->pvclock_gtod_sync_lock, flags);
> +	seq = read_seqcount_begin(&ka->pvclock_sc);
> +	do {
> +		use_master_clock = ka->use_master_clock;
> +		if (use_master_clock) {
> +			host_tsc = ka->master_cycle_now;
> +			kernel_ns = ka->master_kernel_ns;
> +		}
> +	} while (read_seqcount_retry(&ka->pvclock_sc, seq));
>  
>  	/* Keep irq disabled to prevent changes to the clock */
>  	local_irq_save(flags);
> @@ -5855,10 +5856,15 @@ int kvm_arch_pm_notifier(struct kvm *kvm, unsigned long state)
>  
>  static int kvm_vm_ioctl_get_clock(struct kvm *kvm, void __user *argp)
>  {
> -	struct kvm_clock_data data;
> +	struct kvm_clock_data data = { 0 };
> +	unsigned seq;
> +
> +	do {
> +		seq = read_seqcount_begin(&kvm->arch.pvclock_sc);
> +		data.flags = 0;
> +		get_kvmclock(kvm, &data);
> +	} while (read_seqcount_retry(&kvm->arch.pvclock_sc, seq));
>  
> -	memset(&data, 0, sizeof(data));
> -	get_kvmclock(kvm, &data);
>  	if (copy_to_user(argp, &data, sizeof(data)))
>  		return -EFAULT;
>  
> @@ -8159,9 +8165,7 @@ static void kvm_hyperv_tsc_notifier(void)
>  	kvm_max_guest_tsc_khz = tsc_khz;
>  
>  	list_for_each_entry(kvm, &vm_list, vm_list) {
> -		struct kvm_arch *ka = &kvm->arch;
> -
> -		spin_lock_irq(&ka->pvclock_gtod_sync_lock);
> +		__kvm_start_pvclock_update(kvm);
>  		pvclock_update_vm_gtod_copy(kvm);
>  		kvm_end_pvclock_update(kvm);
>  	}
> @@ -11188,8 +11192,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>  
>  	raw_spin_lock_init(&kvm->arch.tsc_write_lock);
>  	mutex_init(&kvm->arch.apic_map_lock);
> -	spin_lock_init(&kvm->arch.pvclock_gtod_sync_lock);
> -
> +	seqcount_raw_spinlock_init(&kvm->arch.pvclock_sc, &kvm->arch.tsc_write_lock);
>  	kvm->arch.kvmclock_offset = -get_kvmclock_base_ns();
>  	pvclock_update_vm_gtod_copy(kvm);
>  
> -- 
> 2.33.0.309.g3052b89438-goog
> 
> 

ACK
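
For illustration, a distilled kernel-context sketch of the locking pattern the
quoted patch adopts: writers hold the raw spinlock around a seqcount write
section, readers run locklessly and retry if they raced with a writer. The
struct and function names below are hypothetical, not the actual KVM symbols.

#include <linux/seqlock.h>
#include <linux/spinlock.h>
#include <linux/types.h>

struct pvclock_state {
	raw_spinlock_t		lock;
	seqcount_raw_spinlock_t	seq;	/* associated with 'lock' */
	u64			master_kernel_ns;
	u64			master_cycle_now;
};

static void pvclock_state_init(struct pvclock_state *s)
{
	raw_spin_lock_init(&s->lock);
	seqcount_raw_spinlock_init(&s->seq, &s->lock);
}

/* Writer: serialize on the lock, bump the seqcount around the update. */
static void pvclock_state_update(struct pvclock_state *s, u64 ns, u64 cyc)
{
	raw_spin_lock_irq(&s->lock);
	write_seqcount_begin(&s->seq);
	s->master_kernel_ns = ns;
	s->master_cycle_now = cyc;
	write_seqcount_end(&s->seq);
	raw_spin_unlock_irq(&s->lock);
}

/* Reader: lockless; retry if a write section was open or ran concurrently. */
static void pvclock_state_read(struct pvclock_state *s, u64 *ns, u64 *cyc)
{
	unsigned int seq;

	do {
		seq = read_seqcount_begin(&s->seq);
		*ns = s->master_kernel_ns;
		*cyc = s->master_cycle_now;
	} while (read_seqcount_retry(&s->seq, seq));
}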

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v8 7/7] KVM: x86: Expose TSC offset controls to userspace
  2021-09-16 18:15   ` Oliver Upton
  (?)
@ 2021-09-30 19:14     ` Marcelo Tosatti
  -1 siblings, 0 replies; 113+ messages in thread
From: Marcelo Tosatti @ 2021-09-30 19:14 UTC (permalink / raw)
  To: Oliver Upton
  Cc: kvm, kvmarm, Paolo Bonzini, Sean Christopherson, Marc Zyngier,
	Peter Shier, Jim Mattson, David Matlack, Ricardo Koller,
	Jing Zhang, Raghavendra Rao Anata, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Andrew Jones, Will Deacon,
	Catalin Marinas

On Thu, Sep 16, 2021 at 06:15:38PM +0000, Oliver Upton wrote:
> To date, VMM-directed TSC synchronization and migration has been a bit
> messy. KVM has some baked-in heuristics around TSC writes to infer if
> the VMM is attempting to synchronize. This is problematic, as it depends
> on host userspace writing to the guest's TSC within 1 second of the last
> write.
> 
> A much cleaner approach to configuring the guest's views of the TSC is to
> simply migrate the TSC offset for every vCPU. Offsets are idempotent,
> and thus not subject to change depending on when the VMM actually
> reads/writes values from/to KVM. The VMM can then read the TSC once with
> KVM_GET_CLOCK to capture a (realtime, host_tsc) pair at the instant when
> the guest is paused.
> 
> Cc: David Matlack <dmatlack@google.com>
> Cc: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Oliver Upton <oupton@google.com>


> ---
>  Documentation/virt/kvm/devices/vcpu.rst |  57 ++++++++++++
>  arch/x86/include/asm/kvm_host.h         |   1 +
>  arch/x86/include/uapi/asm/kvm.h         |   4 +
>  arch/x86/kvm/x86.c                      | 110 ++++++++++++++++++++++++
>  4 files changed, 172 insertions(+)
> 
> diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
> index 2acec3b9ef65..3b399d727c11 100644
> --- a/Documentation/virt/kvm/devices/vcpu.rst
> +++ b/Documentation/virt/kvm/devices/vcpu.rst
> @@ -161,3 +161,60 @@ Specifies the base address of the stolen time structure for this VCPU. The
>  base address must be 64 byte aligned and exist within a valid guest memory
>  region. See Documentation/virt/kvm/arm/pvtime.rst for more information
>  including the layout of the stolen time structure.
> +
> +4. GROUP: KVM_VCPU_TSC_CTRL
> +===========================
> +
> +:Architectures: x86
> +
> +4.1 ATTRIBUTE: KVM_VCPU_TSC_OFFSET
> +
> +:Parameters: 64-bit unsigned TSC offset
> +
> +Returns:
> +
> +	 ======= ======================================
> +	 -EFAULT Error reading/writing the provided
> +		 parameter address.
> +	 -ENXIO  Attribute not supported
> +	 ======= ======================================
> +
> +Specifies the guest's TSC offset relative to the host's TSC. The guest's
> +TSC is then derived by the following equation:
> +
> +  guest_tsc = host_tsc + KVM_VCPU_TSC_OFFSET
> +
> +This attribute is useful for the precise migration of a guest's TSC. The
> +following describes a possible algorithm to use for the migration of a
> +guest's TSC:
> +
> +From the source VMM process:
> +
> +1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (t_0),
> +   kvmclock nanoseconds (k_0), and realtime nanoseconds (r_0).
> +
> +2. Read the KVM_VCPU_TSC_OFFSET attribute for every vCPU to record the
> +   guest TSC offset (off_n).
> +
> +3. Invoke the KVM_GET_TSC_KHZ ioctl to record the frequency of the
> +   guest's TSC (freq).
> +
> +From the destination VMM process:
> +
> +4. Invoke the KVM_SET_CLOCK ioctl, providing the kvmclock nanoseconds
> +   (k_0) and realtime nanoseconds (r_0) in their respective fields.
> +   Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
> +   structure. KVM will advance the VM's kvmclock to account for elapsed
> +   time since recording the clock values.
> +
> +5. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (t_1) and
> +   kvmclock nanoseconds (k_1).
> +
> +6. Adjust the guest TSC offsets for every vCPU to account for (1) time
> +   elapsed since recording state and (2) difference in TSCs between the
> +   source and destination machine:
> +
> +   new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1

Hi Oliver,

This won't advance the TSC values themselves, right?
This (advancing the TSC values by the realtime elapsed time) would be
awesome, because the TSC-based clock_gettime() vDSO path is faster and
some applications prefer to read the TSC directly.
See "x86: kvmguest: use TSC clocksource if invariant TSC is exposed".

The advancement with this patchset only applies to kvmclock.
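
For illustration, a minimal destination-side sketch of steps 4-6 from the
quoted documentation. It assumes uapi headers carrying this series' additions
(the realtime/host_tsc fields and KVM_CLOCK_* flags in struct kvm_clock_data,
plus KVM_VCPU_TSC_CTRL/KVM_VCPU_TSC_OFFSET), and that freq_khz was captured
with KVM_GET_TSC_KHZ; the helper name and the kHz-to-cycles conversion of the
"(k_1 - k_0) * freq" term are assumptions, and error handling is minimal.

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* t0, off_n, freq_khz, k0, r0 were captured on the source (steps 1-3). */
static int restore_vcpu_tsc(int vm_fd, int vcpu_fd, uint64_t t0, uint64_t off_n,
			    uint64_t freq_khz, uint64_t k0, uint64_t r0)
{
	struct kvm_clock_data clock = {
		.clock = k0, .realtime = r0, .flags = KVM_CLOCK_REALTIME,
	};
	struct kvm_device_attr attr = {
		.group = KVM_VCPU_TSC_CTRL, .attr = KVM_VCPU_TSC_OFFSET,
	};
	uint64_t t1, k1, new_off;

	/* Step 4: restore kvmclock; KVM advances it by the elapsed realtime. */
	if (ioctl(vm_fd, KVM_SET_CLOCK, &clock))
		return -1;

	/* Step 5: read host TSC and kvmclock for a single instant. */
	if (ioctl(vm_fd, KVM_GET_CLOCK, &clock) ||
	    !(clock.flags & KVM_CLOCK_HOST_TSC))
		return -1;
	t1 = clock.host_tsc;
	k1 = clock.clock;

	/*
	 * Step 6: new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1.
	 * (k_1 - k_0) is in nanoseconds, so convert it to guest TSC cycles
	 * with freq in kHz: ns * khz / 10^6.
	 */
	new_off = t0 + off_n + (k1 - k0) * freq_khz / 1000000 - t1;

	attr.addr = (uint64_t)(uintptr_t)&new_off;
	return ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr);
}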


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v8 4/7] KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK
  2021-09-29 18:56     ` Marcelo Tosatti
  (?)
@ 2021-09-30 19:21       ` Marcelo Tosatti
  -1 siblings, 0 replies; 113+ messages in thread
From: Marcelo Tosatti @ 2021-09-30 19:21 UTC (permalink / raw)
  To: Oliver Upton, Thomas Gleixner
  Cc: kvm, kvmarm, Paolo Bonzini, Sean Christopherson, Marc Zyngier,
	Peter Shier, Jim Mattson, David Matlack, Ricardo Koller,
	Jing Zhang, Raghavendra Rao Anata, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Andrew Jones, Will Deacon,
	Catalin Marinas, Thomas Gleixner

On Wed, Sep 29, 2021 at 03:56:29PM -0300, Marcelo Tosatti wrote:
> Oliver,
> 
> Do you have any numbers for the improvement in guests CLOCK_REALTIME
> accuracy across migration, when this is in place?
> 
> On Thu, Sep 16, 2021 at 06:15:35PM +0000, Oliver Upton wrote:
> > Handling the migration of TSCs correctly is difficult, in part because
> > Linux does not provide userspace with the ability to retrieve a (TSC,
> > realtime) clock pair for a single instant in time. In lieu of a more
> > convenient facility, KVM can report similar information in the kvm_clock
> > structure.
> > 
> > Provide userspace with a host TSC & realtime pair iff the realtime clock
> > is based on the TSC. If userspace provides KVM_SET_CLOCK with a valid
> > realtime value, advance the KVM clock by the amount of elapsed time. Do
> > not step the KVM clock backwards, though, as it is a monotonic
> > oscillator.
> > 
> > Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
> > Signed-off-by: Oliver Upton <oupton@google.com>
> > ---
> >  Documentation/virt/kvm/api.rst  | 42 ++++++++++++++++++++++++++-------
> >  arch/x86/include/asm/kvm_host.h |  3 +++
> >  arch/x86/kvm/x86.c              | 36 +++++++++++++++++++++-------
> >  include/uapi/linux/kvm.h        |  7 +++++-
> >  4 files changed, 70 insertions(+), 18 deletions(-)
> > 
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index a6729c8cf063..d0b9c986cf6c 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -993,20 +993,34 @@ such as migration.
> >  When KVM_CAP_ADJUST_CLOCK is passed to KVM_CHECK_EXTENSION, it returns the
> >  set of bits that KVM can return in struct kvm_clock_data's flag member.
> >  
> > -The only flag defined now is KVM_CLOCK_TSC_STABLE.  If set, the returned
> > -value is the exact kvmclock value seen by all VCPUs at the instant
> > -when KVM_GET_CLOCK was called.  If clear, the returned value is simply
> > -CLOCK_MONOTONIC plus a constant offset; the offset can be modified
> > -with KVM_SET_CLOCK.  KVM will try to make all VCPUs follow this clock,
> > -but the exact value read by each VCPU could differ, because the host
> > -TSC is not stable.
> > +FLAGS:
> > +
> > +KVM_CLOCK_TSC_STABLE.  If set, the returned value is the exact kvmclock
> > +value seen by all VCPUs at the instant when KVM_GET_CLOCK was called.
> > +If clear, the returned value is simply CLOCK_MONOTONIC plus a constant
> > +offset; the offset can be modified with KVM_SET_CLOCK.  KVM will try
> > +to make all VCPUs follow this clock, but the exact value read by each
> > +VCPU could differ, because the host TSC is not stable.
> > +
> > +KVM_CLOCK_REALTIME.  If set, the `realtime` field in the kvm_clock_data
> > +structure is populated with the value of the host's real time
> > +clocksource at the instant when KVM_GET_CLOCK was called. If clear,
> > +the `realtime` field does not contain a value.
> > +
> > +KVM_CLOCK_HOST_TSC.  If set, the `host_tsc` field in the kvm_clock_data
> > +structure is populated with the value of the host's timestamp counter (TSC)
> > +at the instant when KVM_GET_CLOCK was called. If clear, the `host_tsc` field
> > +does not contain a value.
> >  
> >  ::
> >  
> >    struct kvm_clock_data {
> >  	__u64 clock;  /* kvmclock current value */
> >  	__u32 flags;
> > -	__u32 pad[9];
> > +	__u32 pad0;
> > +	__u64 realtime;
> > +	__u64 host_tsc;
> > +	__u32 pad[4];
> >    };
> >  
> >  
> > @@ -1023,12 +1037,22 @@ Sets the current timestamp of kvmclock to the value specified in its parameter.
> >  In conjunction with KVM_GET_CLOCK, it is used to ensure monotonicity on scenarios
> >  such as migration.
> >  
> > +FLAGS:
> > +
> > +KVM_CLOCK_REALTIME.  If set, KVM will compare the value of the `realtime` field
> > +with the value of the host's real time clocksource at the instant when
> > +KVM_SET_CLOCK was called. The difference in elapsed time is added to the final
> > +kvmclock value that will be provided to guests.
> > +
> >  ::
> >  
> >    struct kvm_clock_data {
> >  	__u64 clock;  /* kvmclock current value */
> >  	__u32 flags;
> > -	__u32 pad[9];
> > +	__u32 pad0;
> > +	__u64 realtime;
> > +	__u64 host_tsc;
> > +	__u32 pad[4];
> >    };
> >  
> >  
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index be6805fc0260..9c34b5b63e39 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1936,4 +1936,7 @@ int kvm_cpu_dirty_log_size(void);
> >  
> >  int alloc_all_memslots_rmaps(struct kvm *kvm);
> >  
> > +#define KVM_CLOCK_VALID_FLAGS						\
> > +	(KVM_CLOCK_TSC_STABLE | KVM_CLOCK_REALTIME | KVM_CLOCK_HOST_TSC)
> > +
> >  #endif /* _ASM_X86_KVM_HOST_H */
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 523c4e5c109f..cb5d5cad5124 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -2815,10 +2815,20 @@ static void get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data)
> >  	get_cpu();
> >  
> >  	if (__this_cpu_read(cpu_tsc_khz)) {
> > +#ifdef CONFIG_X86_64
> > +		struct timespec64 ts;
> > +
> > +		if (kvm_get_walltime_and_clockread(&ts, &data->host_tsc)) {
> > +			data->realtime = ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec;
> > +			data->flags |= KVM_CLOCK_REALTIME | KVM_CLOCK_HOST_TSC;
> > +		} else
> > +#endif
> > +		data->host_tsc = rdtsc();
> > +
> >  		kvm_get_time_scale(NSEC_PER_SEC, __this_cpu_read(cpu_tsc_khz) * 1000LL,
> >  				   &hv_clock.tsc_shift,
> >  				   &hv_clock.tsc_to_system_mul);
> > -		data->clock = __pvclock_read_cycles(&hv_clock, rdtsc());
> > +		data->clock = __pvclock_read_cycles(&hv_clock, data->host_tsc);
> >  	} else {
> >  		data->clock = get_kvmclock_base_ns() + ka->kvmclock_offset;
> >  	}
> > @@ -4062,7 +4072,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> >  		r = KVM_SYNC_X86_VALID_FIELDS;
> >  		break;
> >  	case KVM_CAP_ADJUST_CLOCK:
> > -		r = KVM_CLOCK_TSC_STABLE;
> > +		r = KVM_CLOCK_VALID_FLAGS;
> >  		break;
> >  	case KVM_CAP_X86_DISABLE_EXITS:
> >  		r |=  KVM_X86_DISABLE_EXITS_HLT | KVM_X86_DISABLE_EXITS_PAUSE |
> > @@ -5859,12 +5869,12 @@ static int kvm_vm_ioctl_set_clock(struct kvm *kvm, void __user *argp)
> >  {
> >  	struct kvm_arch *ka = &kvm->arch;
> >  	struct kvm_clock_data data;
> > -	u64 now_ns;
> > +	u64 now_raw_ns;
> >  
> >  	if (copy_from_user(&data, argp, sizeof(data)))
> >  		return -EFAULT;
> >  
> > -	if (data.flags)
> > +	if (data.flags & ~KVM_CLOCK_REALTIME)
> >  		return -EINVAL;
> >  
> >  	kvm_hv_invalidate_tsc_page(kvm);
> > @@ -5878,11 +5888,21 @@ static int kvm_vm_ioctl_set_clock(struct kvm *kvm, void __user *argp)
> >  	 * is slightly ahead) here we risk going negative on unsigned
> >  	 * 'system_time' when 'data.clock' is very small.
> >  	 */
> > -	if (kvm->arch.use_master_clock)
> > -		now_ns = ka->master_kernel_ns;
> > +	if (data.flags & KVM_CLOCK_REALTIME) {
> > +		u64 now_real_ns = ktime_get_real_ns();
> > +
> > +		/*
> > +		 * Avoid stepping the kvmclock backwards.
> > +		 */
> > +		if (now_real_ns > data.realtime)
> > +			data.clock += now_real_ns - data.realtime;
> > +	}
> 
> Forward jumps can also cause problems, for example:
> 
> * Kernel watchdogs
> 
> * https://patchwork.ozlabs.org/project/qemu-devel/patch/20130618233825.GA19042@amt.cnet/
> 
> So perhaps limiting the amount of forward jump that is allowed 
> would be a good thing? (which can happen if the two hosts realtime
> clocks are off).
> 
> Now by how much, i am not sure.
> 
> Or, as mentioned earlier, only enable KVM_CLOCK_REALTIME if userspace
> KVM code checks clock synchronization.
> 
> Thomas, CC'ed, has deeper understanding of problems with 
> forward time jumps than I do. Thomas, any comments?

Thomas,

Based on the earlier discussion about the problems of synchronizing
the guest's clock via a notification to the NTP/Chrony daemon
(where there is a window in which applications can read the stale
value of the clock), a possible solution would be to trigger
an NMI on the destination (so that it runs ASAP, with higher
priority than application/kernel).

What would this NMI do, exactly?

> As a note: this makes it not OK to use KVM_CLOCK_REALTIME flag 
> for either vm pause / vm resume (well, if paused for long periods of time) 
> or savevm / restorevm.

Maybe with the NMI above, it would be possible to use
the realtime clock as a way to know the time elapsed between
events and advance the guest clock without the current
problematic window.
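
For illustration, a minimal source-side sketch of how a VMM might capture the
(kvmclock, realtime, host_tsc) triple described in the quoted documentation.
It assumes uapi headers with this series' flags; the helper name is
hypothetical and error handling is reduced to a single return code.

#include <sys/ioctl.h>
#include <linux/kvm.h>

static int capture_clock(int kvm_fd, int vm_fd, struct kvm_clock_data *out)
{
	/* KVM_CHECK_EXTENSION returns the set of flags KVM may report. */
	int caps = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_ADJUST_CLOCK);

	if (caps < 0 ||
	    (caps & (KVM_CLOCK_REALTIME | KVM_CLOCK_HOST_TSC)) !=
	    (KVM_CLOCK_REALTIME | KVM_CLOCK_HOST_TSC))
		return -1;

	if (ioctl(vm_fd, KVM_GET_CLOCK, out))
		return -1;

	/*
	 * Even on a capable kernel the flags are only set when the host
	 * clocksource was TSC-based at the instant of the call, so the
	 * caller must still check them before trusting realtime/host_tsc.
	 */
	return (out->flags & (KVM_CLOCK_REALTIME | KVM_CLOCK_HOST_TSC)) ==
	       (KVM_CLOCK_REALTIME | KVM_CLOCK_HOST_TSC) ? 0 : -1;
}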


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v8 4/7] KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK
  2021-09-30 19:21       ` Marcelo Tosatti
  (?)
@ 2021-09-30 23:02         ` Thomas Gleixner
  -1 siblings, 0 replies; 113+ messages in thread
From: Thomas Gleixner @ 2021-09-30 23:02 UTC (permalink / raw)
  To: Marcelo Tosatti, Oliver Upton
  Cc: kvm, kvmarm, Paolo Bonzini, Sean Christopherson, Marc Zyngier,
	Peter Shier, Jim Mattson, David Matlack, Ricardo Koller,
	Jing Zhang, Raghavendra Rao Anata, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Andrew Jones, Will Deacon,
	Catalin Marinas

Marcelo,

On Thu, Sep 30 2021 at 16:21, Marcelo Tosatti wrote:
> On Wed, Sep 29, 2021 at 03:56:29PM -0300, Marcelo Tosatti wrote:
>> On Thu, Sep 16, 2021 at 06:15:35PM +0000, Oliver Upton wrote:
>> 
>> Thomas, CC'ed, has deeper understanding of problems with 
>> forward time jumps than I do. Thomas, any comments?
>
> Based on the earlier discussion about the problems of synchronizing
> the guests clock via a notification to the NTP/Chrony daemon 
> (where there is a window where applications can read the stale
> value of the clock), a possible solution would be triggering
> an NMI on the destination (so that it runs ASAP, with higher
> priority than application/kernel).
>
> What would this NMI do, exactly?

Nothing. You cannot do anything time related in an NMI.

You might queue irq work which handles that, but that would still not
prevent user space or kernel space from observing the stale time stamp
depending on the execution state from where it resumes.

>> As a note: this makes it not OK to use KVM_CLOCK_REALTIME flag 
>> for either vm pause / vm resume (well, if paused for long periods of time) 
>> or savevm / restorevm.
>
> Maybe with the NMI above, it would be possible to use
> the realtime clock as a way to know time elapsed between
> events and advance guest clock without the current 
> problematic window.

However much duct tape you throw at the problem, it can never be solved
because it's fundamentally broken. All you can do is make the
observation windows smaller, but that's just curing the symptom.

The problem is that the guest is paused/resumed without getting any
information about that and the execution of the guest is stopped at an
arbitrary instruction boundary from which it resumes after migration or
restore. So there is no way to guarantee that after resume all vCPUs are
executing in a state which can handle that.

But even if that were the case, what prevents the stale time
stamps from being visible? Nothing:

T0:    t = now();
         -> pause
         -> resume
         -> magic "fixup"
T1:    dostuff(t);

But that's not a fundamental problem, because any preemptible or
interruptible code path has the same issue:

T0:    t = now();
         -> preemption or interrupt
T1:    dostuff(t);

Which is usually not a problem, but it becomes a problem when T1 - T0 is
greater than the usual expectations, which can obviously be trivially
achieved by guest migration or a savevm/restorevm cycle.

But let's go a step back and look at the clocks and their expectations:

CLOCK_MONOTONIC:

  Monotonically increasing clock which counts unless the system
  is in suspend. On resume it continues counting without jumping
  forward.

  That's the reference clock for everything else and therefore it
  is important that it does _not_ jump around.

  The reason why CLOCK_MONOTONIC stops during suspend is
  historical, and any attempt to change that breaks the world and
  some more because making it jump forward will trigger all sorts
  of timeouts, watchdogs and whatever. The last attempt to make
  CLOCK_MONOTONIC behave like CLOCK_BOOTTIME was reverted within 3
  weeks. It's not going to be attempted again. See a3ed0e4393d6
  ("Revert: Unify CLOCK_MONOTONIC and CLOCK_BOOTTIME") for
  details.

  Now the proposed change is creating exactly the same problem:

  >> > +	if (data.flags & KVM_CLOCK_REALTIME) {
  >> > +		u64 now_real_ns = ktime_get_real_ns();
  >> > +
  >> > +		/*
  >> > +		 * Avoid stepping the kvmclock backwards.
  >> > +		 */
  >> > +		if (now_real_ns > data.realtime)
  >> > +			data.clock += now_real_ns - data.realtime;
  >> > +	}

  IOW, it takes the time between pause and resume into account and
  forwards the underlying base clock which makes CLOCK_MONOTONIC
  jump forward by exactly that amount of time.

  So depending on the size of the delta you are running into exactly the
  same problem as the final failed attempt to unify CLOCK_MONOTONIC and
  CLOCK_BOOTTIME which btw. would have been a magic cure for virt.

  Too bad, not going to happen ever unless you fix all affected user
  space and kernel side issues.


CLOCK_BOOTTIME:

  CLOCK_MONOTONIC + time spent in suspend


CLOCK_REALTIME/TAI:

  CLOCK_MONOTONIC + offset

  The offset is established by reading RTC at boot time and can be
  changed by clock_settime(2) and adjtimex(2). The latter is used
  by NTP/PTP.

  Any user of CLOCK_REALTIME has to be prepared for it to jump in
  both directions, but of course NTP/PTP daemons have expectations
  vs. such time jumps.

  They rightfully assume on a properly configured and administered
  system that there are only two things which can make CLOCK_REALTIME
  jump:

  1) NTP/PTP daemon controlled
  2) Suspend/resume related updates by the kernel
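
  For illustration only, the "CLOCK_MONOTONIC + offset" relationship above
  can be observed from a trivial userspace sketch (nothing here is specific
  to virt):

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
            struct timespec mono, real;

            clock_gettime(CLOCK_MONOTONIC, &mono);
            clock_gettime(CLOCK_REALTIME, &real);

            /* The difference only changes via clock_settime(2)/adjtimex(2)
             * or the kernel's suspend/resume injection. */
            printf("REALTIME - MONOTONIC ~= %.6f s\n",
                   (real.tv_sec - mono.tv_sec) +
                   (real.tv_nsec - mono.tv_nsec) / 1e9);
            return 0;
    }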


Just for the record, these assumptions predate virtualization.

So now virt came along and created a hard to solve circular dependency
problem:

   - If CLOCK_MONOTONIC stops for too long then NTP/PTP gets out of
     sync, but everything else is happy.
     
   - If CLOCK_MONOTONIC jumps too far forward, then all hell breaks
     loose, but NTP/PTP is happy.

IOW, you are up a creek without a paddle and you have to choose one evil.

That's simply a design fail because there has been no design for this
from day one. But I'm not surprised at all by that simply because
virtualization followed the hardware design fails vs. time and
timekeeping which have kept us entertained for the past 20 years on various
architectures but most prominently on X86 which is the uncontended
master of disaster in that regard.

Of course virt follows the same approach as hardware by ignoring the
problem and coming up with more duct tape and the assumption that the lack
of design can be "fixed in software". Just the timeframe is slightly
different: we've only been discussing this for about 10 years now.

Seriously? All you folks can come up with in 10 years is piling duct
tape on duct tape instead of sitting down and fixing the underlying root
cause once and forever?

I'm aware that especially chrony has tried to deal with this nonsense
more gracefully, but that still does not make it great and it still gets
upset.

The reason why suspend/resume works perfectly fine is that it's fully
coordinated and NTP state is cleared on resume, which makes it easy for
the daemon to accommodate.

So again, and I have been saying this for a decade now:

 1) Stop pretending that you can fix the lack of design with duct tape
    engineering

 2) Accept the fundamental properties of Linux time keeping as they are
    not going to change as explained above

 3) Either accept that CLOCK_REALTIME is off and jumping around which
    confuses NTP/PTP or get your act together and design and implement a
    proper synchronization mechanism for this:

    - Notify the guest about the intended "pause" or "savevm" event

    - Let the guest go into a lightweight freeze similar to S2IDLE

    - Save the VM for later resume or migrate the saved state

    - Watch everything working as expected on resume

    - Have the benefit that pause/resume and savevm/restorevm have
      exactly the same behaviour

That won't solve the problem for frankenkernels and !paravirt setups,
but that's unsolvable and you can keep the pieces by choosing one of two
evils. While I do not care at all, I still recommend choosing
CLOCK_MONOTONIC correctness for obvious reasons.

The frankenkernel/legacy problem aside, I know you are going to tell me
that this is too much overhead and has VM customer visible impact. It's
your choice, really:

  Either you choose correctness or you decide to ignore correctness for
  whatever reason.

  There is no middle ground simply because you _cannot_ guarantee that
  your migration time is within the acceptable limits of the
  CLOCK_MONOTONIC or the CLOCK_REALTIME expectations.

  You can limit the damage somehow by making some arbitrary cutoff of
  how much you forward CLOCK_MONOTONIC, but don't ask me about the right
  value.

If you decide that correctness is overrated, then please document it
clearly instead of trying to pretend to be correct.

I'm curious whether the hardware people or the virt folks come to senses
first, but honestly I'm not expecting this to happen before I retire.

Thanks,

        tglx



^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v8 7/7] KVM: x86: Expose TSC offset controls to userspace
  2021-09-30 19:14     ` Marcelo Tosatti
  (?)
@ 2021-10-01  9:17       ` Paolo Bonzini
  -1 siblings, 0 replies; 113+ messages in thread
From: Paolo Bonzini @ 2021-10-01  9:17 UTC (permalink / raw)
  To: Marcelo Tosatti, Oliver Upton
  Cc: kvm, kvmarm, Sean Christopherson, Marc Zyngier, Peter Shier,
	Jim Mattson, David Matlack, Ricardo Koller, Jing Zhang,
	Raghavendra Rao Anata, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Andrew Jones, Will Deacon,
	Catalin Marinas

On 30/09/21 21:14, Marcelo Tosatti wrote:
>> +   new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1
> Hi Oliver,
> 
> This won't advance the TSC values themselves, right?

Why not?  It affects the TSC offset in the vmcs, so the TSC in the VM is 
advanced too.
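
To spell out the arithmetic (a sketch; I'm reading t_0/t_1 as the host TSC
at save/restore time, k_0/k_1 as the kvmclock readings at save/restore,
off_n as the old per-vCPU TSC offset and freq as the guest TSC frequency,
per the series documentation), the guest TSC is host TSC + offset, so after
restore:

    guest_tsc_1 = t_1 + new_off_n
                = t_1 + t_0 + off_n + (k_1 - k_0) * freq - t_1
                = (t_0 + off_n) + (k_1 - k_0) * freq

i.e. the guest TSC resumes from its value at save time, advanced by the
kvmclock time that elapsed in between, converted to TSC cycles.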

Paolo

> This (advancing the TSC values by the realtime elapsed time) would be
> awesome because TSC clock_gettime() vdso is faster, and some
> applications prefer to just read from TSC directly.
> See "x86: kvmguest: use TSC clocksource if invariant TSC is exposed".
> 
> The advancement with this patchset only applies to kvmclock.
> 


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v8 7/7] KVM: x86: Expose TSC offset controls to userspace
  2021-10-01  9:17       ` Paolo Bonzini
  (?)
@ 2021-10-01 10:32         ` Marcelo Tosatti
  -1 siblings, 0 replies; 113+ messages in thread
From: Marcelo Tosatti @ 2021-10-01 10:32 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Oliver Upton, kvm, kvmarm, Sean Christopherson, Marc Zyngier,
	Peter Shier, Jim Mattson, David Matlack, Ricardo Koller,
	Jing Zhang, Raghavendra Rao Anata, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Andrew Jones, Will Deacon,
	Catalin Marinas

On Fri, Oct 01, 2021 at 11:17:33AM +0200, Paolo Bonzini wrote:
> On 30/09/21 21:14, Marcelo Tosatti wrote:
> > > +   new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1
> > Hi Oliver,
> > 
> > This won't advance the TSC values themselves, right?
> 
> Why not?  It affects the TSC offset in the vmcs, so the TSC in the VM is
> advanced too.
> 
> Paolo

+4. Invoke the KVM_SET_CLOCK ioctl, providing the kvmclock nanoseconds
+   (k_0) and realtime nanoseconds (r_0) in their respective fields.
+   Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
+   structure. KVM will advance the VM's kvmclock to account for elapsed
+   time since recording the clock values.

You can't advance both kvmclock (kvmclock_offset variable) and the TSCs,
which would be double counting.

So you have to add the elapsed realtime (1) between KVM_GET_CLOCK and
KVM_SET_CLOCK either to kvmclock (which this patch is doing), or to the
TSCs. If you do both, there is double counting. Am I missing something?

To make it clearer: the TSC clocksource is faster than the kvmclock
clocksource, so we'd rather use it when possible, which is achievable
with TSC scaling support in HW.

1: As mentioned earlier, just using the realtime clock delta between
hosts can introduce problems. So we need a scheme to:

	- Find the offset between host clocks, with upper and lower
	  bounds on error.
	- Take appropriate actions based on that (for example,
	  do not use KVM_CLOCK_REALTIME flag on KVM_SET_CLOCK
	  if the delta between hosts is large).

This can be done in userspace or kernel space... (hmm, but maybe
delegating this to userspace will introduce different solutions
to the same problem?).
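
For concreteness, the save/restore flow being discussed maps to roughly the
following on the VMM side (a sketch only; it assumes the kvm_clock_data
'realtime' field and the KVM_CLOCK_REALTIME flag added by this series, and
omits all error handling):

    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    /* Source: record kvmclock (k_0) and realtime (r_0) when the guest is paused. */
    static struct kvm_clock_data save_clock(int vm_fd)
    {
            struct kvm_clock_data data;

            memset(&data, 0, sizeof(data));
            ioctl(vm_fd, KVM_GET_CLOCK, &data);
            return data;
    }

    /* Destination: restore kvmclock; with KVM_CLOCK_REALTIME set, KVM adds
     * the realtime elapsed since r_0 to data.clock (kvmclock only, not the
     * guest TSCs). */
    static void restore_clock(int vm_fd, const struct kvm_clock_data *saved)
    {
            struct kvm_clock_data data;

            memset(&data, 0, sizeof(data));
            data.clock = saved->clock;          /* k_0 */
            data.realtime = saved->realtime;    /* r_0 */
            data.flags = KVM_CLOCK_REALTIME;
            ioctl(vm_fd, KVM_SET_CLOCK, &data);
    }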

> > This (advancing the TSC values by the realtime elapsed time) would be
> > awesome because TSC clock_gettime() vdso is faster, and some
> > applications prefer to just read from TSC directly.
> > See "x86: kvmguest: use TSC clocksource if invariant TSC is exposed".
> > 
> > The advancement with this patchset only applies to kvmclock.
> > 
> 
> 


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v8 4/7] KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK
  2021-09-30 23:02         ` Thomas Gleixner
  (?)
@ 2021-10-01 12:05           ` Marcelo Tosatti
  -1 siblings, 0 replies; 113+ messages in thread
From: Marcelo Tosatti @ 2021-10-01 12:05 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Oliver Upton, kvm, kvmarm, Paolo Bonzini, Sean Christopherson,
	Marc Zyngier, Peter Shier, Jim Mattson, David Matlack,
	Ricardo Koller, Jing Zhang, Raghavendra Rao Anata, James Morse,
	Alexandru Elisei, Suzuki K Poulose, linux-arm-kernel,
	Andrew Jones, Will Deacon, Catalin Marinas

On Fri, Oct 01, 2021 at 01:02:23AM +0200, Thomas Gleixner wrote:
> Marcelo,
> 
> On Thu, Sep 30 2021 at 16:21, Marcelo Tosatti wrote:
> > On Wed, Sep 29, 2021 at 03:56:29PM -0300, Marcelo Tosatti wrote:
> >> On Thu, Sep 16, 2021 at 06:15:35PM +0000, Oliver Upton wrote:
> >> 
> >> Thomas, CC'ed, has deeper understanding of problems with 
> >> forward time jumps than I do. Thomas, any comments?
> >
> > Based on the earlier discussion about the problems of synchronizing
> > the guests clock via a notification to the NTP/Chrony daemon 
> > (where there is a window where applications can read the stale
> > value of the clock), a possible solution would be triggering
> > an NMI on the destination (so that it runs ASAP, with higher
> > priority than application/kernel).
> >
> > What would this NMI do, exactly?
> 
> Nothing. You cannot do anything time related in an NMI.
> 
> You might queue irq work which handles that, but that would still not
> prevent user space or kernel space from observing the stale time stamp
> depending on the execution state from where it resumes.

Yes.

> >> As a note: this makes it not OK to use KVM_CLOCK_REALTIME flag 
> >> for either vm pause / vm resume (well, if paused for long periods of time) 
> >> or savevm / restorevm.
> >
> > Maybe with the NMI above, it would be possible to use
> > the realtime clock as a way to know time elapsed between
> > events and advance guest clock without the current 
> > problematic window.
> 
> As much duct tape you throw at the problem, it cannot be solved ever
> because it's fundamentally broken. All you can do is to make the
> observation windows smaller, but that's just curing the symptom.

Yes.

> The problem is that the guest is paused/resumed without getting any
> information about that and the execution of the guest is stopped at an
> arbitrary instruction boundary from which it resumes after migration or
> restore. So there is no way to guarantee that after resume all vCPUs are
> executing in a state which can handle that.
> 
> But even if that would be the case, then what prevents the stale time
> stamps to be visible? Nothing:
> 
> T0:    t = now();
>          -> pause
>          -> resume
>          -> magic "fixup"
> T1:    dostuff(t);

Yes.

BTW, you could have a userspace notification (then applications 
could handle this if desired).

> But that's not a fundamental problem because every preemptible or
> interruptible code has the same issue:
> 
> T0:    t = now();
>          -> preemption or interrupt
> T1:    dostuff(t);
> 
> Which is usually not a problem, but It becomes a problem when T1 - T0 is
> greater than the usual expectations which can obviously be trivially
> achieved by guest migration or a savevm, restorevm cycle.
> 
> But let's go a step back and look at the clocks and their expectations:
> 
> CLOCK_MONOTONIC:
> 
>   Monotonically increasing clock which counts unless the system
>   is in suspend. On resume it continues counting without jumping
>   forward.
> 
>   That's the reference clock for everything else and therefore it
>   is important that it does _not_ jump around.
> 
>   The reasons why CLOCK_MONOTONIC stops during suspend is
>   historical and any attempt to change that breaks the world and
>   some more because making it jump forward will trigger all sorts
>   of timeouts, watchdogs and whatever. The last attempt to make
>   CLOCK_MONOTONIC behave like CLOCK_BOOTTIME was reverted within 3
>   weeks. It's not going to be attempted again. See a3ed0e4393d6
>   ("Revert: Unify CLOCK_MONOTONIC and CLOCK_BOOTTIME") for
>   details.
> 
>   Now the proposed change is creating exactly the same problem:
> 
>   >> > +	if (data.flags & KVM_CLOCK_REALTIME) {
>   >> > +		u64 now_real_ns = ktime_get_real_ns();
>   >> > +
>   >> > +		/*
>   >> > +		 * Avoid stepping the kvmclock backwards.
>   >> > +		 */
>   >> > +		if (now_real_ns > data.realtime)
>   >> > +			data.clock += now_real_ns - data.realtime;
>   >> > +	}
> 
>   IOW, it takes the time between pause and resume into account and
>   forwards the underlying base clock which makes CLOCK_MONOTONIC
>   jump forward by exactly that amount of time.

Well, it assumes that the

 T0:    t = now();
 T1:    pause vm()
 T2:	finish vm migration()
 T3:    dostuff(t);

interval between T1 and T2 is small (and that the guest
clocks are synchronized up to a given boundary).

But I suppose adding a limit to the forward clock advance
(in the migration case) is useful:

	1) If migration (well actually, only the final steps
	   to finish migration, the time between when the guest is paused
	   on the source and resumed on the destination) takes too long,
	   then too bad: fix it to be shorter if you want the clocks
	   to have close to zero change relative to realtime on migration.

	2) Avoid the other bugs in case of large forward advance.

Maybe having it configurable, with, say, a 1 minute maximum by default,
is a good choice?
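
For illustration, a minimal variation of the snippet quoted above with such
a cutoff (the knob name and the one minute default below are made up, just
to sketch the idea):

	/* Hypothetical, for illustration only: cap the forward jump. */
	#define KVM_CLOCK_MAX_ADVANCE_NS	(60ULL * NSEC_PER_SEC)

	if (data.flags & KVM_CLOCK_REALTIME) {
		u64 now_real_ns = ktime_get_real_ns();

		/* Avoid stepping the kvmclock backwards. */
		if (now_real_ns > data.realtime) {
			u64 advance_ns = now_real_ns - data.realtime;

			/* Refuse overly large forward advances. */
			if (advance_ns <= KVM_CLOCK_MAX_ADVANCE_NS)
				data.clock += advance_ns;
		}
	}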

An alternative would be to advance only the guest's REALTIME clock, from
data about how long steps T1-T2 took.

>   So depending on the size of the delta you are running into exactly the
>   same problem as the final failed attempt to unify CLOCK_MONOTONIC and
>   CLOCK_BOOTTIME which btw. would have been a magic cure for virt.
> 
>   Too bad, not going to happen ever unless you fix all affected user
>   space and kernel side issues.
> 
> 
> CLOCK_BOOTTIME:
> 
>   CLOCK_MONOTONIC + time spent in suspend
> 
> 
> CLOCK_REALTIME/TAI:
> 
>   CLOCK_MONOTONIC + offset
> 
>   The offset is established by reading RTC at boot time and can be
>   changed by clock_settime(2) and adjtimex(2). The latter is used
>   by NTP/PTP.
> 
>   Any user of CLOCK_REALTIME has to be prepared for it to jump in
>   both directions, but of course NTP/PTP daemons have expectations
>   vs. such time jumps.
> 
>   They rightfully assume on a properly configured and administrated
>   system that there are only two things which can make CLOCK_REALTIME
>   jump:
> 
>   1) NTP/PTP daemon controlled
>   2) Suspend/resume related updates by the kernel
> 
> 
> Just for the record, these assumptions predate virtualization.
> 
> So now virt came along and created a hard to solve circular dependency
> problem:
> 
>    - If CLOCK_MONOTONIC stops for too long then NTP/PTP gets out of
>      sync, but everything else is happy.
>      
>    - If CLOCK_MONOTONIC jumps too far forward, then all hell breaks
>      lose, but NTP/PTP is happy.

One must handle the

 T0:    t = now();
          -> pause
          -> resume
          -> magic "fixup"
 T1:    dostuff(t);

fact if one is going to use savevm/restorevm anyway, so...
(it is kind of unfixable, unless you modify your application
to accept notifications to redo any computation based on t, isn't it?).

> IOW, you are up a creek without a paddle and you have to chose one evil.
> 
> That's simply a design fail because there has been no design for this
> from day one. But I'm not surprised at all by that simply because
> virtualization followed the hardware design fails vs. time and
> timekeeping which keep us entertained for the past 20 years on various
> architectures but most prominently on X86 which is the uncontended
> master of disaster in that regard.
> 
> Of course virt follows the same approach of hardware by ignoring the
> problem and coming up with more duct tape and the assumption that lack
> of design can be "fixed in software". Just the timeframe is slightly
> different: We're discussing this only for about 10 years now.
> 
> Seriously? All you folks can come up with in 10 years is piling duct
> tape on duct tape instead of sitting down and fixing the underlying root
> cause once and forever?

Been fixing bugs that have been reported over the past 10+ years, yeah.

Hopefully this thread is the "sitting down and fixing the underlying root
cause" :-)

> I'm aware that especially chrony has tried to deal with this nonsense
> more gracefully, but that still does not make it great and it still gets
> upset.
> 
> The reason why suspend/resume works perfectly fine is that it's fully
> coordinated and NTP state is cleared on resume which makes it easy for
> the deamon to accomodate.

This is what is in place now (which is executed on the destination):

    /* Now, if user has passed a time to set and the system time is set, we
     * just need to synchronize the hardware clock. However, if no time was
     * passed, user is requesting the opposite: set the system time from the
     * hardware clock (RTC). */
    pid = fork();
    if (pid == 0) {
        setsid();
        reopen_fd_to_null(0);
        reopen_fd_to_null(1);
        reopen_fd_to_null(2);

        /* Use '/sbin/hwclock -w' to set RTC from the system time,
         * or '/sbin/hwclock -s' to set the system time from RTC. */
        execle(hwclock_path, "hwclock", has_time ? "-w" : "-s",
               NULL, environ);
        _exit(EXIT_FAILURE);
    } else if (pid < 0) {
        error_setg_errno(errp, errno, "failed to create child process");
        return;
    }

> 
> So again and I'm telling this for a decade now:
> 
>  1) Stop pretending that you can fix the lack of design with duct tape
>     engineering
> 
>  2) Accept the fundamental properties of Linux time keeping as they are
>     not going to change as explained above
> 
>  3) Either accept that CLOCK_REALTIME is off and jumping around which
>     confuses NTP/PTP or get your act together and design and implement a
>     proper synchronization mechanism for this:
> 
>     - Notify the guest about the intended "pause" or "savevm" event
> 
>     - Let the guest go into a lightweight freeze similar to S2IDLE
> 
>     - Save the VM for later resume or migrate the saved state
> 
>     - Watch everything working as expected on resume
> 
>     - Have the benefit that pause/resume and savevm/restorevm have
>       exactly the same behaviour

OK!

> That won't solve the problem for frankenkernels and !paravirt setups,
> but that's unsolvable and you can keep the pieces by chosing one of two
> evils. While I do not care at all, I still recommend to chose
> CLOCK_MONOTONIC correctness for obvious reasons.
> 
> The frankenkernel/legacy problem aside, I know you are going to tell me
> that this is too much overhead and has VM customer visible impact. 

No, I think it boils down to someone implementing it.

> It's
> your choice, really:
> 
>   Either you chose correctness or you decide to ignore correctness for
>   whatever reason.
> 
>   There is no middle ground simply because you _cannot_ guarantee that
>   your migration time is within the acceptable limits of the
>   CLOCK_MONOTONIC or the CLOCK_REALTIME expectations.
> 
>   You can limit the damage somehow by making some arbitrary cutoff of
>   how much you forward CLOCK_MONOTONIC, but don't ask me about the right
>   value.

> If you decide that correctness is overrated, then please document it
> clearly instead of trying to pretend being correct.

Based on the above, advancing only CLOCK_REALTIME (and not CLOCK_MONOTONIC)
would be correct, right? And it's probably not very hard to do.

> I'm curious whether the hardware people or the virt folks come to senses
> first, but honestly I'm not expecting this to happen before I retire.
> 
> Thanks,
> 
>         tglx

Thanks very much for the detailed information! It's a good basis
for the document you asked for.



^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v8 4/7] KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK
@ 2021-10-01 12:05           ` Marcelo Tosatti
  0 siblings, 0 replies; 113+ messages in thread
From: Marcelo Tosatti @ 2021-10-01 12:05 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Catalin Marinas, kvm, Marc Zyngier, Peter Shier, David Matlack,
	Paolo Bonzini, Will Deacon, kvmarm, linux-arm-kernel,
	Jim Mattson

On Fri, Oct 01, 2021 at 01:02:23AM +0200, Thomas Gleixner wrote:
> Marcelo,
> 
> On Thu, Sep 30 2021 at 16:21, Marcelo Tosatti wrote:
> > On Wed, Sep 29, 2021 at 03:56:29PM -0300, Marcelo Tosatti wrote:
> >> On Thu, Sep 16, 2021 at 06:15:35PM +0000, Oliver Upton wrote:
> >> 
> >> Thomas, CC'ed, has deeper understanding of problems with 
> >> forward time jumps than I do. Thomas, any comments?
> >
> > Based on the earlier discussion about the problems of synchronizing
> > the guests clock via a notification to the NTP/Chrony daemon 
> > (where there is a window where applications can read the stale
> > value of the clock), a possible solution would be triggering
> > an NMI on the destination (so that it runs ASAP, with higher
> > priority than application/kernel).
> >
> > What would this NMI do, exactly?
> 
> Nothing. You cannot do anything time related in an NMI.
> 
> You might queue irq work which handles that, but that would still not
> prevent user space or kernel space from observing the stale time stamp
> depending on the execution state from where it resumes.

Yes.

> >> As a note: this makes it not OK to use KVM_CLOCK_REALTIME flag 
> >> for either vm pause / vm resume (well, if paused for long periods of time) 
> >> or savevm / restorevm.
> >
> > Maybe with the NMI above, it would be possible to use
> > the realtime clock as a way to know time elapsed between
> > events and advance guest clock without the current 
> > problematic window.
> 
> As much duct tape you throw at the problem, it cannot be solved ever
> because it's fundamentally broken. All you can do is to make the
> observation windows smaller, but that's just curing the symptom.

Yes.

> The problem is that the guest is paused/resumed without getting any
> information about that and the execution of the guest is stopped at an
> arbitrary instruction boundary from which it resumes after migration or
> restore. So there is no way to guarantee that after resume all vCPUs are
> executing in a state which can handle that.
> 
> But even if that would be the case, then what prevents the stale time
> stamps to be visible? Nothing:
> 
> T0:    t = now();
>          -> pause
>          -> resume
>          -> magic "fixup"
> T1:    dostuff(t);

Yes.

BTW, you could have a userspace notification (then applications 
could handle this if desired).

> But that's not a fundamental problem because every preemptible or
> interruptible code has the same issue:
> 
> T0:    t = now();
>          -> preemption or interrupt
> T1:    dostuff(t);
> 
> Which is usually not a problem, but It becomes a problem when T1 - T0 is
> greater than the usual expectations which can obviously be trivially
> achieved by guest migration or a savevm, restorevm cycle.
> 
> But let's go a step back and look at the clocks and their expectations:
> 
> CLOCK_MONOTONIC:
> 
>   Monotonically increasing clock which counts unless the system
>   is in suspend. On resume it continues counting without jumping
>   forward.
> 
>   That's the reference clock for everything else and therefore it
>   is important that it does _not_ jump around.
> 
>   The reasons why CLOCK_MONOTONIC stops during suspend is
>   historical and any attempt to change that breaks the world and
>   some more because making it jump forward will trigger all sorts
>   of timeouts, watchdogs and whatever. The last attempt to make
>   CLOCK_MONOTONIC behave like CLOCK_BOOTTIME was reverted within 3
>   weeks. It's not going to be attempted again. See a3ed0e4393d6
>   ("Revert: Unify CLOCK_MONOTONIC and CLOCK_BOOTTIME") for
>   details.
> 
>   Now the proposed change is creating exactly the same problem:
> 
>   >> > +	if (data.flags & KVM_CLOCK_REALTIME) {
>   >> > +		u64 now_real_ns = ktime_get_real_ns();
>   >> > +
>   >> > +		/*
>   >> > +		 * Avoid stepping the kvmclock backwards.
>   >> > +		 */
>   >> > +		if (now_real_ns > data.realtime)
>   >> > +			data.clock += now_real_ns - data.realtime;
>   >> > +	}
> 
>   IOW, it takes the time between pause and resume into account and
>   forwards the underlying base clock which makes CLOCK_MONOTONIC
>   jump forward by exactly that amount of time.

Well, it is assuming that the

 T0:    t = now();
 T1:    pause vm()
 T2:	finish vm migration()
 T3:    dostuff(t);

Interval between T1 and T2 is small (and that the guest
clocks are synchronized up to a given boundary).

But i suppose adding a limit to the forward clock advance 
(in the migration case) is useful:

	1) If migration (well actually, only the final steps
	   to finish migration, the time between when guest is paused
	   on source and is resumed on destination) takes too long,
	   then too bad: fix it to be shorter if you want the clocks
	   to have close to zero change to realtime on migration.

	2) Avoid the other bugs in case of large forward advance.

Maybe having it configurable, with a say, 1 minute maximum by default
is a good choice?

An alternative would be to advance only the guests REALTIME clock, from 
data about how long steps T1-T2 took.

>   So depending on the size of the delta you are running into exactly the
>   same problem as the final failed attempt to unify CLOCK_MONOTONIC and
>   CLOCK_BOOTTIME which btw. would have been a magic cure for virt.
> 
>   Too bad, not going to happen ever unless you fix all affected user
>   space and kernel side issues.
> 
> 
> CLOCK_BOOTTIME:
> 
>   CLOCK_MONOTONIC + time spent in suspend
> 
> 
> CLOCK_REALTIME/TAI:
> 
>   CLOCK_MONOTONIC + offset
> 
>   The offset is established by reading RTC at boot time and can be
>   changed by clock_settime(2) and adjtimex(2). The latter is used
>   by NTP/PTP.
> 
>   Any user of CLOCK_REALTIME has to be prepared for it to jump in
>   both directions, but of course NTP/PTP daemons have expectations
>   vs. such time jumps.
> 
>   They rightfully assume on a properly configured and administrated
>   system that there are only two things which can make CLOCK_REALTIME
>   jump:
> 
>   1) NTP/PTP daemon controlled
>   2) Suspend/resume related updates by the kernel
> 
> 
> Just for the record, these assumptions predate virtualization.
> 
> So now virt came along and created a hard to solve circular dependency
> problem:
> 
>    - If CLOCK_MONOTONIC stops for too long then NTP/PTP gets out of
>      sync, but everything else is happy.
>      
>    - If CLOCK_MONOTONIC jumps too far forward, then all hell breaks
>      lose, but NTP/PTP is happy.

One must handle the

 T0:    t = now();
          -> pause
          -> resume
          -> magic "fixup"
 T1:    dostuff(t);

fact if one is going to use savevm/restorevm anyway, so...
(it is kind of unfixable, unless you modify your application
to accept notifications to redo any computation based on t, isnt it?).

> IOW, you are up a creek without a paddle and you have to chose one evil.
> 
> That's simply a design fail because there has been no design for this
> from day one. But I'm not surprised at all by that simply because
> virtualization followed the hardware design fails vs. time and
> timekeeping which keep us entertained for the past 20 years on various
> architectures but most prominently on X86 which is the uncontended
> master of disaster in that regard.
> 
> Of course virt follows the same approach of hardware by ignoring the
> problem and coming up with more duct tape and the assumption that lack
> of design can be "fixed in software". Just the timeframe is slightly
> different: We're discussing this only for about 10 years now.
> 
> Seriously? All you folks can come up with in 10 years is piling duct
> tape on duct tape instead of sitting down and fixing the underlying root
> cause once and forever?

Been fixing bugs that are reported over 10+ years, yeah.

Hopefully this thread is the "sitting down and fixing the underyling root
cause" :-)

> I'm aware that especially chrony has tried to deal with this nonsense
> more gracefully, but that still does not make it great and it still gets
> upset.
> 
> The reason why suspend/resume works perfectly fine is that it's fully
> coordinated and NTP state is cleared on resume which makes it easy for
> the deamon to accomodate.

This is what is in place now (which is executed on the destination):

    /* Now, if user has passed a time to set and the system time is set, we
     * just need to synchronize the hardware clock. However, if no time was
     * passed, user is requesting the opposite: set the system time from the
     * hardware clock (RTC). */
    pid = fork();
    if (pid == 0) {
        setsid();
        reopen_fd_to_null(0);
        reopen_fd_to_null(1);
        reopen_fd_to_null(2);

        /* Use '/sbin/hwclock -w' to set RTC from the system time,
         * or '/sbin/hwclock -s' to set the system time from RTC. */
        execle(hwclock_path, "hwclock", has_time ? "-w" : "-s",
               NULL, environ);
        _exit(EXIT_FAILURE);
    } else if (pid < 0) {
        error_setg_errno(errp, errno, "failed to create child process");
        return;
    }

> 
> So again and I'm telling this for a decade now:
> 
>  1) Stop pretending that you can fix the lack of design with duct tape
>     engineering
> 
>  2) Accept the fundamental properties of Linux time keeping as they are
>     not going to change as explained above
> 
>  3) Either accept that CLOCK_REALTIME is off and jumping around which
>     confuses NTP/PTP or get your act together and design and implement a
>     proper synchronization mechanism for this:
> 
>     - Notify the guest about the intended "pause" or "savevm" event
> 
>     - Let the guest go into a lightweight freeze similar to S2IDLE
> 
>     - Save the VM for later resume or migrate the saved state
> 
>     - Watch everything working as expected on resume
> 
>     - Have the benefit that pause/resume and savevm/restorevm have
>       exactly the same behaviour

OK!

> That won't solve the problem for frankenkernels and !paravirt setups,
> but that's unsolvable and you can keep the pieces by chosing one of two
> evils. While I do not care at all, I still recommend to chose
> CLOCK_MONOTONIC correctness for obvious reasons.
> 
> The frankenkernel/legacy problem aside, I know you are going to tell me
> that this is too much overhead and has VM customer visible impact. 

No, i think it boils down to someone implementing it.

> It's
> your choice, really:
> 
>   Either you chose correctness or you decide to ignore correctness for
>   whatever reason.
> 
>   There is no middle ground simply because you _cannot_ guarantee that
>   your migration time is within the acceptable limits of the
>   CLOCK_MONOTONIC or the CLOCK_REALTIME expectations.
> 
>   You can limit the damage somehow by making some arbitrary cutoff of
>   how much you forward CLOCK_MONOTONIC, but don't ask me about the right
>   value.

> If you decide that correctness is overrated, then please document it
> clearly instead of trying to pretend being correct.

Based on the above, advancing only CLOCK_REALTIME (and not CLOCK_MONOTONIC)
would be correct, right? And it's probably not very hard to do.
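
A guest-side sketch of that effect — stepping CLOCK_REALTIME forward by the
blackout time while leaving CLOCK_MONOTONIC untouched. How "elapsed_ns" would
reach the guest (guest agent, paravirt notification, ...) is exactly the open
question here, and the caller needs CAP_SYS_TIME; this is illustrative only:

    #include <stdint.h>
    #include <time.h>

    /* Step only CLOCK_REALTIME forward by elapsed_ns (assumed >= 0). */
    static int step_realtime_by(int64_t elapsed_ns)
    {
        struct timespec ts;

        if (clock_gettime(CLOCK_REALTIME, &ts))
            return -1;
        ts.tv_sec  += elapsed_ns / 1000000000LL;
        ts.tv_nsec += elapsed_ns % 1000000000LL;
        if (ts.tv_nsec >= 1000000000L) {
            ts.tv_sec++;
            ts.tv_nsec -= 1000000000L;
        }
        return clock_settime(CLOCK_REALTIME, &ts);
    }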

> I'm curious whether the hardware people or the virt folks come to senses
> first, but honestly I'm not expecting this to happen before I retire.
> 
> Thanks,
> 
>         tglx

Thanks very much for the detailed information! It's a good basis
for the document you asked for.



* Re: [PATCH v8 4/7] KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK
  2021-10-01 12:05           ` Marcelo Tosatti
@ 2021-10-01 12:10             ` Marcelo Tosatti
  -1 siblings, 0 replies; 113+ messages in thread
From: Marcelo Tosatti @ 2021-10-01 12:10 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Oliver Upton, kvm, kvmarm, Paolo Bonzini, Sean Christopherson,
	Marc Zyngier, Peter Shier, Jim Mattson, David Matlack,
	Ricardo Koller, Jing Zhang, Raghavendra Rao Anata, James Morse,
	Alexandru Elisei, Suzuki K Poulose, linux-arm-kernel,
	Andrew Jones, Will Deacon, Catalin Marinas

On Fri, Oct 01, 2021 at 09:05:27AM -0300, Marcelo Tosatti wrote:
> On Fri, Oct 01, 2021 at 01:02:23AM +0200, Thomas Gleixner wrote:
> > Marcelo,
> > 
> > On Thu, Sep 30 2021 at 16:21, Marcelo Tosatti wrote:
> > > On Wed, Sep 29, 2021 at 03:56:29PM -0300, Marcelo Tosatti wrote:
> > >> On Thu, Sep 16, 2021 at 06:15:35PM +0000, Oliver Upton wrote:
> > >> 
> > >> Thomas, CC'ed, has deeper understanding of problems with 
> > >> forward time jumps than I do. Thomas, any comments?
> > >
> > > Based on the earlier discussion about the problems of synchronizing
> > > the guest's clock via a notification to the NTP/Chrony daemon 
> > > (where there is a window where applications can read the stale
> > > value of the clock), a possible solution would be triggering
> > > an NMI on the destination (so that it runs ASAP, with higher
> > > priority than application/kernel).
> > >
> > > What would this NMI do, exactly?
> > 
> > Nothing. You cannot do anything time related in an NMI.
> > 
> > You might queue irq work which handles that, but that would still not
> > prevent user space or kernel space from observing the stale time stamp
> > depending on the execution state from where it resumes.
> 
> Yes.
> 
> > >> As a note: this makes it not OK to use KVM_CLOCK_REALTIME flag 
> > >> for either vm pause / vm resume (well, if paused for long periods of time) 
> > >> or savevm / restorevm.
> > >
> > > Maybe with the NMI above, it would be possible to use
> > > the realtime clock as a way to know time elapsed between
> > > events and advance guest clock without the current 
> > > problematic window.
> > 
> > As much duct tape you throw at the problem, it cannot be solved ever
> > because it's fundamentally broken. All you can do is to make the
> > observation windows smaller, but that's just curing the symptom.
> 
> Yes.
> 
> > The problem is that the guest is paused/resumed without getting any
> > information about that and the execution of the guest is stopped at an
> > arbitrary instruction boundary from which it resumes after migration or
> > restore. So there is no way to guarantee that after resume all vCPUs are
> > executing in a state which can handle that.
> > 
> > But even if that would be the case, then what prevents the stale time
> > stamps to be visible? Nothing:
> > 
> > T0:    t = now();
> >          -> pause
> >          -> resume
> >          -> magic "fixup"
> > T1:    dostuff(t);
> 
> Yes.
> 
> BTW, you could have a userspace notification (then applications 
> could handle this if desired).
> 
> > But that's not a fundamental problem because every preemptible or
> > interruptible code has the same issue:
> > 
> > T0:    t = now();
> >          -> preemption or interrupt
> > T1:    dostuff(t);
> > 
> > Which is usually not a problem, but It becomes a problem when T1 - T0 is
> > greater than the usual expectations which can obviously be trivially
> > achieved by guest migration or a savevm, restorevm cycle.
> > 
> > But let's go a step back and look at the clocks and their expectations:
> > 
> > CLOCK_MONOTONIC:
> > 
> >   Monotonically increasing clock which counts unless the system
> >   is in suspend. On resume it continues counting without jumping
> >   forward.
> > 
> >   That's the reference clock for everything else and therefore it
> >   is important that it does _not_ jump around.
> > 
> >   The reasons why CLOCK_MONOTONIC stops during suspend is
> >   historical and any attempt to change that breaks the world and
> >   some more because making it jump forward will trigger all sorts
> >   of timeouts, watchdogs and whatever. The last attempt to make
> >   CLOCK_MONOTONIC behave like CLOCK_BOOTTIME was reverted within 3
> >   weeks. It's not going to be attempted again. See a3ed0e4393d6
> >   ("Revert: Unify CLOCK_MONOTONIC and CLOCK_BOOTTIME") for
> >   details.
> > 
> >   Now the proposed change is creating exactly the same problem:
> > 
> >   >> > +	if (data.flags & KVM_CLOCK_REALTIME) {
> >   >> > +		u64 now_real_ns = ktime_get_real_ns();
> >   >> > +
> >   >> > +		/*
> >   >> > +		 * Avoid stepping the kvmclock backwards.
> >   >> > +		 */
> >   >> > +		if (now_real_ns > data.realtime)
> >   >> > +			data.clock += now_real_ns - data.realtime;
> >   >> > +	}
> > 
> >   IOW, it takes the time between pause and resume into account and
> >   forwards the underlying base clock which makes CLOCK_MONOTONIC
> >   jump forward by exactly that amount of time.
> 
> Well, it is assuming that the
> 
>  T0:    t = now();
>  T1:    pause vm()
>  T2:	finish vm migration()
>  T3:    dostuff(t);
> 
> Interval between T1 and T2 is small (and that the guest
> clocks are synchronized up to a given boundary).
> 
> But I suppose adding a limit to the forward clock advance 
> (in the migration case) is useful:
> 
> 	1) If migration (well actually, only the final steps
> 	   to finish migration, the time between when guest is paused
> 	   on source and is resumed on destination) takes too long,
> 	   then too bad: fix it to be shorter if you want the clocks
> 	   to have close to zero change to realtime on migration.
> 
> 	2) Avoid the other bugs in case of large forward advance.
> 
> Maybe having it configurable, with, say, a 1 minute maximum by default,
> is a good choice?
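
A minimal sketch of such a clamp, on top of the hunk quoted further up;
"max_clock_advance_ns" is a hypothetical tunable, not something in this
series:

	if (data.flags & KVM_CLOCK_REALTIME) {
		u64 now_real_ns = ktime_get_real_ns();

		/* clamp the forward step to the configured maximum */
		if (now_real_ns > data.realtime)
			data.clock += min(now_real_ns - data.realtime,
					  max_clock_advance_ns);
	}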
> 
> An alternative would be to advance only the guest's REALTIME clock, from 
> data about how long steps T1-T2 took.
> 
> >   So depending on the size of the delta you are running into exactly the
> >   same problem as the final failed attempt to unify CLOCK_MONOTONIC and
> >   CLOCK_BOOTTIME which btw. would have been a magic cure for virt.
> > 
> >   Too bad, not going to happen ever unless you fix all affected user
> >   space and kernel side issues.
> > 
> > 
> > CLOCK_BOOTTIME:
> > 
> >   CLOCK_MONOTONIC + time spent in suspend
> > 
> > 
> > CLOCK_REALTIME/TAI:
> > 
> >   CLOCK_MONOTONIC + offset
> > 
> >   The offset is established by reading RTC at boot time and can be
> >   changed by clock_settime(2) and adjtimex(2). The latter is used
> >   by NTP/PTP.
> > 
> >   Any user of CLOCK_REALTIME has to be prepared for it to jump in
> >   both directions, but of course NTP/PTP daemons have expectations
> >   vs. such time jumps.
> > 
> >   They rightfully assume on a properly configured and administrated
> >   system that there are only two things which can make CLOCK_REALTIME
> >   jump:
> > 
> >   1) NTP/PTP daemon controlled
> >   2) Suspend/resume related updates by the kernel
> > 
> > 
> > Just for the record, these assumptions predate virtualization.
> > 
> > So now virt came along and created a hard to solve circular dependency
> > problem:
> > 
> >    - If CLOCK_MONOTONIC stops for too long then NTP/PTP gets out of
> >      sync, but everything else is happy.
> >      
> >    - If CLOCK_MONOTONIC jumps too far forward, then all hell breaks
> >      loose, but NTP/PTP is happy.
> 
> One must handle the
> 
>  T0:    t = now();
>           -> pause
>           -> resume
>           -> magic "fixup"
>  T1:    dostuff(t);
> 
> fact if one is going to use savevm/restorevm anyway, so...
> (it is kind of unfixable, unless you modify your application
> to accept notifications to redo any computation based on t, isn't it?).

https://lore.kernel.org/lkml/1289503802-22444-2-git-send-email-virtuoso@slind.org/


* Re: [PATCH v8 4/7] KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK
  2021-09-30 23:02         ` Thomas Gleixner
@ 2021-10-01 14:17           ` Paolo Bonzini
  -1 siblings, 0 replies; 113+ messages in thread
From: Paolo Bonzini @ 2021-10-01 14:17 UTC (permalink / raw)
  To: Thomas Gleixner, Marcelo Tosatti, Oliver Upton
  Cc: kvm, kvmarm, Sean Christopherson, Marc Zyngier, Peter Shier,
	Jim Mattson, David Matlack, Ricardo Koller, Jing Zhang,
	Raghavendra Rao Anata, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Andrew Jones, Will Deacon,
	Catalin Marinas

On 01/10/21 01:02, Thomas Gleixner wrote:
> 
>   Now the proposed change is creating exactly the same problem:
> 
>   +	if (data.flags & KVM_CLOCK_REALTIME) {
>   +		u64 now_real_ns = ktime_get_real_ns();
>   +
>   +		/*
>   +		 * Avoid stepping the kvmclock backwards.
>   +		 */
>   +		if (now_real_ns > data.realtime)
>   +			data.clock += now_real_ns - data.realtime;
>   +	}

Indeed, though it's opt-in (you can always not pass KVM_CLOCK_REALTIME 
and then the kernel will not muck with the value you gave it).

> virt came along and created a hard to solve circular dependency
> problem:
> 
>    - If CLOCK_MONOTONIC stops for too long then NTP/PTP gets out of
>      sync, but everything else is happy.
>      
>    - If CLOCK_MONOTONIC jumps too far forward, then all hell breaks
>      loose, but NTP/PTP is happy.

Yes, I agree that this sums it up.

For example QEMU (meaning: Marcelo :)) has gone for the former, "hoping"
that NTP/PTP sorts it out sooner or later.  The clock in 
nanoseconds is sent out to the destination and restored.

Google's userspace instead went for the latter.  The reason is that 
they've always started running on the destination before finishing the 
memory copy[1], therefore it's much easier to bound the CLOCK_MONOTONIC 
jump.

I do like very much the cooperative S2IDLE or even S3 way to handle the 
brownout during live migration.  However if your stopping time is 
bounded, these patches are nice because, on current processors that have 
TSC scaling, they make it possible to keep the illusion of the TSC 
running.  Of course that's a big "if"; however, you can always bound the 
stopping time by aborting the restart on the destination machine once 
you get close enough to the limit.
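
(For reference, the scaling relation being relied on is roughly guest_tsc =
host_tsc * ratio + offset; as long as the destination programs a ratio
matching the guest's TSC frequency and a suitable offset, the guest TSC
appears to have kept ticking across the blackout.)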

Paolo

[1] see https://dl.acm.org/doi/pdf/10.1145/3296975.3186415, figure 3


* Re: [PATCH v8 4/7] KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK
  2021-09-16 18:15   ` Oliver Upton
@ 2021-10-01 14:39     ` Paolo Bonzini
  -1 siblings, 0 replies; 113+ messages in thread
From: Paolo Bonzini @ 2021-10-01 14:39 UTC (permalink / raw)
  To: Oliver Upton, kvm, kvmarm
  Cc: Sean Christopherson, Marc Zyngier, Peter Shier, Jim Mattson,
	David Matlack, Ricardo Koller, Jing Zhang, Raghavendra Rao Anata,
	James Morse, Alexandru Elisei, Suzuki K Poulose,
	linux-arm-kernel, Andrew Jones, Will Deacon, Catalin Marinas

On 16/09/21 20:15, Oliver Upton wrote:
> +	if (data.flags & ~KVM_CLOCK_REALTIME)
>   		return -EINVAL;

Let's accept KVM_CLOCK_HOST_TSC here even though it's not used; there 
may be programs that expect to send back to KVM_SET_CLOCK whatever they 
got from KVM_GET_CLOCK.
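
Concretely, the suggested relaxation would presumably just widen the mask in
the check quoted above, along the lines of:

	if (data.flags & ~(KVM_CLOCK_REALTIME | KVM_CLOCK_HOST_TSC))
		return -EINVAL;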


* Re: [PATCH v8 4/7] KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK
  2021-10-01 14:39     ` Paolo Bonzini
@ 2021-10-01 14:41       ` Paolo Bonzini
  -1 siblings, 0 replies; 113+ messages in thread
From: Paolo Bonzini @ 2021-10-01 14:41 UTC (permalink / raw)
  To: Oliver Upton, kvm, kvmarm
  Cc: Sean Christopherson, Marc Zyngier, Peter Shier, Jim Mattson,
	David Matlack, Ricardo Koller, Jing Zhang, Raghavendra Rao Anata,
	James Morse, Alexandru Elisei, Suzuki K Poulose,
	linux-arm-kernel, Andrew Jones, Will Deacon, Catalin Marinas

On 01/10/21 16:39, Paolo Bonzini wrote:
> On 16/09/21 20:15, Oliver Upton wrote:
>> +    if (data.flags & ~KVM_CLOCK_REALTIME)
>>           return -EINVAL;
> 
> Let's accept KVM_CLOCK_HOST_TSC here even though it's not used; there 
> may be programs that expect to send back to KVM_SET_CLOCK whatever they 
> got from KVM_GET_CLOCK.

Nevermind, KVM_SET_CLOCK is already rejecting KVM_CLOCK_TSC_STABLE so no 
need to do that!

Paolo


* Re: [PATCH v8 7/7] KVM: x86: Expose TSC offset controls to userspace
  2021-10-01 10:32         ` Marcelo Tosatti
@ 2021-10-01 15:12           ` Paolo Bonzini
  -1 siblings, 0 replies; 113+ messages in thread
From: Paolo Bonzini @ 2021-10-01 15:12 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Oliver Upton, kvm, kvmarm, Sean Christopherson, Marc Zyngier,
	Peter Shier, Jim Mattson, David Matlack, Ricardo Koller,
	Jing Zhang, Raghavendra Rao Anata, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Andrew Jones, Will Deacon,
	Catalin Marinas

On 01/10/21 12:32, Marcelo Tosatti wrote:
>> +1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (t_0),
>> +   kvmclock nanoseconds (k_0), and realtime nanoseconds (r_0).
>>
>> [...]
>>
>> +4. Invoke the KVM_SET_CLOCK ioctl, providing the kvmclock nanoseconds
>> +   (k_0) and realtime nanoseconds (r_0) in their respective fields.
>> +   Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
>> +   structure. KVM will advance the VM's kvmclock to account for elapsed
>> +   time since recording the clock values.
> 
> You can't advance both kvmclock (kvmclock_offset variable) and the
> TSCs, which would be double counting.
> 
> So you have to either add the elapsed realtime (1) between
> KVM_GET_CLOCK to kvmclock (which this patch is doing), or to the
> TSCs. If you do both, there is double counting. Am I missing
> something?

Probably one of these two (but it's worth pointing out both of them):

1) the attribute that's introduced here *replaces*
KVM_SET_MSR(MSR_IA32_TSC), so the TSC is not added.

2) the adjustment formula later in the algorithm does not care about how
much time passed between step 1 and step 4.  It just takes two well-known
(TSC, kvmclock) pairs, and uses them to ensure the guest TSC is the same on
the destination as if the guest were still running on the source.  It is
irrelevant that one of them is before migration and one is after; all that
matters is that one is on the source and one is on the destination.

Perhaps we can add to step 6 something like:

> +6. Adjust the guest TSC offsets for every vCPU to account for (1) time
> +   elapsed since recording state and (2) difference in TSCs between the
> +   source and destination machine:
> +
> +   new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1
> +

"off + t - k * freq" is the guest TSC value corresponding to a time of 0
in kvmclock.  The above formula ensures that it is the same on the
destination as it was on the source.
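
Spelling that out: with guest TSC = host TSC + per-vCPU offset, and kvmclock
counting guest TSC ticks scaled by 1/freq, the guest TSC at kvmclock time 0
is

    t_0 + off_n - k_0 * freq         on the source, and
    t_1 + new_off_n - k_1 * freq     on the destination.

Equating the two and solving for new_off_n gives

    new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1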

Also, the names are a bit hard to follow.  Perhaps

	t_0		tsc_src
	t_1		tsc_dest
	k_0		guest_src
	k_1		guest_dest
	r_0		host_src
	off_n		ofs_src[i]
	new_off_n	ofs_dest[i]

Paolo


* Re: [PATCH v8 4/7] KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK
  2021-10-01 14:41       ` Paolo Bonzini
@ 2021-10-01 15:39         ` Oliver Upton
  -1 siblings, 0 replies; 113+ messages in thread
From: Oliver Upton @ 2021-10-01 15:39 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, kvmarm, Sean Christopherson, Marc Zyngier, Peter Shier,
	Jim Mattson, David Matlack, Ricardo Koller, Jing Zhang,
	Raghavendra Rao Anata, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Andrew Jones, Will Deacon,
	Catalin Marinas

On Fri, Oct 1, 2021 at 7:41 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 01/10/21 16:39, Paolo Bonzini wrote:
> > On 16/09/21 20:15, Oliver Upton wrote:
> >> +    if (data.flags & ~KVM_CLOCK_REALTIME)
> >>           return -EINVAL;
> >
> > Let's accept KVM_CLOCK_HOST_TSC here even though it's not used; there
> > may be programs that expect to send back to KVM_SET_CLOCK whatever they
> > got from KVM_GET_CLOCK.
>
> Nevermind, KVM_SET_CLOCK is already rejecting KVM_CLOCK_TSC_STABLE so no
> need to do that!

Yeah, I don't know the story on the interface but it is really odd
that userspace needs to blow away flags to successfully write the
clock structure.

--
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v8 4/7] KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK
  2021-10-01 15:39         ` Oliver Upton
@ 2021-10-01 16:42           ` Paolo Bonzini
  -1 siblings, 0 replies; 113+ messages in thread
From: Paolo Bonzini @ 2021-10-01 16:42 UTC (permalink / raw)
  To: Oliver Upton
  Cc: kvm, kvmarm, Sean Christopherson, Marc Zyngier, Peter Shier,
	Jim Mattson, David Matlack, Ricardo Koller, Jing Zhang,
	Raghavendra Rao Anata, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Andrew Jones, Will Deacon,
	Catalin Marinas

On 01/10/21 17:39, Oliver Upton wrote:
>> Nevermind, KVM_SET_CLOCK is already rejecting KVM_CLOCK_TSC_STABLE so no
>> need to do that!
>
> Yeah, I don't know the story on the interface but it is really odd
> that userspace needs to blow away flags to successfully write the
> clock structure.

Yeah, let's fix it now and accept all three flags.  I would like that,
even though it cannot be fixed in existing kernels.
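
Concretely, "accept all three flags" would make the KVM_SET_CLOCK check
look roughly like this (only a sketch using the flag names from this
thread, not necessarily the final code):

	if (data.flags & ~(KVM_CLOCK_REALTIME | KVM_CLOCK_HOST_TSC |
			   KVM_CLOCK_TSC_STABLE))
		return -EINVAL;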

Paolo


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v8 5/7] kvm: x86: protect masterclock with a seqcount
  2021-09-16 18:15   ` Oliver Upton
@ 2021-10-01 16:48     ` Paolo Bonzini
  -1 siblings, 0 replies; 113+ messages in thread
From: Paolo Bonzini @ 2021-10-01 16:48 UTC (permalink / raw)
  To: Oliver Upton, kvm, kvmarm
  Cc: Sean Christopherson, Marc Zyngier, Peter Shier, Jim Mattson,
	David Matlack, Ricardo Koller, Jing Zhang, Raghavendra Rao Anata,
	James Morse, Alexandru Elisei, Suzuki K Poulose,
	linux-arm-kernel, Andrew Jones, Will Deacon, Catalin Marinas

On 16/09/21 20:15, Oliver Upton wrote:
> +	seq = read_seqcount_begin(&ka->pvclock_sc);
> +	do {
> +		use_master_clock = ka->use_master_clock;

Oops, the "seq" assignment should be inside the "do".  With that fixed, 
my tests seem to work.  I will shortly push the result to kvm/queue.
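
For clarity, a minimal sketch of the corrected read side (field names
taken from the quoted hunk; not the exact kvm/queue code):

	unsigned int seq;
	bool use_master_clock;

	do {
		/* read_seqcount_begin() must be re-evaluated on every retry. */
		seq = read_seqcount_begin(&ka->pvclock_sc);
		use_master_clock = ka->use_master_clock;
		/* ... snapshot the rest of the pvclock state ... */
	} while (read_seqcount_retry(&ka->pvclock_sc, seq));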

Paolo


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v8 7/7] KVM: x86: Expose TSC offset controls to userspace
  2021-10-01 15:12           ` Paolo Bonzini
@ 2021-10-01 19:11             ` Marcelo Tosatti
  -1 siblings, 0 replies; 113+ messages in thread
From: Marcelo Tosatti @ 2021-10-01 19:11 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Catalin Marinas, kvm, Peter Shier, Marc Zyngier, David Matlack,
	Will Deacon, kvmarm, linux-arm-kernel, Jim Mattson

On Fri, Oct 01, 2021 at 05:12:20PM +0200, Paolo Bonzini wrote:
> On 01/10/21 12:32, Marcelo Tosatti wrote:
> > > +1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (t_0), +
> > > kvmclock nanoseconds (k_0), and realtime nanoseconds (r_0). + [...]
> > >  +4. Invoke the KVM_SET_CLOCK ioctl, providing the kvmclock
> > > nanoseconds +   (k_0) and realtime nanoseconds (r_0) in their
> > > respective fields. +   Ensure that the KVM_CLOCK_REALTIME flag is
> > > set in the provided +   structure. KVM will advance the VM's
> > > kvmclock to account for elapsed +   time since recording the clock
> > > values.
> > 
> > You can't advance both kvmclock (kvmclock_offset variable) and the
> > TSCs, which would be double counting.
> > 
> > So you have to either add the elapsed realtime (1) between
> > KVM_GET_CLOCK to kvmclock (which this patch is doing), or to the
> > TSCs. If you do both, there is double counting. Am i missing
> > something?
> 
> Probably one of these two (but it's worth pointing out both of them):
> 
> 1) the attribute that's introduced here *replaces*
> KVM_SET_MSR(MSR_IA32_TSC), so the TSC is not added.
> 
> 2) the adjustment formula later in the algorithm does not care about how
> much time passed between step 1 and step 4.  It just takes two well
> known (TSC, kvmclock) pairs, and uses them to ensure the guest TSC is
> the same on the destination as if the guest was still running on the
> source.  It is irrelevant that one of them is before migration and one
> is after, all it matters is that one is on the source and one is on the
> destination.

OK, so it still relies on the NTP daemon to fix the CLOCK_REALTIME delay
introduced during migration (which is what I would guess is the
lower-hanging fruit) (for guests using TSC).

My point was that, by advancing the _TSC value_ by:

T0. stop guest vcpus	(source)
T1. KVM_GET_CLOCK	(source)
T2. KVM_SET_CLOCK	(destination)
T3. Write guest TSCs	(destination)
T4. resume guest	(destination)

new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1

t_0:    host TSC at KVM_GET_CLOCK time.
off_n:  TSC offset at vcpu-n (as long as no guest TSC writes are performed,
TSC offset is fixed).
...

+4. Invoke the KVM_SET_CLOCK ioctl, providing the kvmclock nanoseconds
+   (k_0) and realtime nanoseconds (r_0) in their respective fields.
+   Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
+   structure. KVM will advance the VM's kvmclock to account for elapsed
+   time since recording the clock values.

Only kvmclock is advanced (by passing r_0). But a guest might not use kvmclock
(hopefully modern guests on modern hosts will use the TSC clocksource,
whose clock_gettime is faster... some people are using that already).

At some point QEMU should enable invariant TSC flag by default?

That said, the point is: why not advance the _TSC_ values
(instead of kvmclock nanoseconds), as doing so would reduce
"the CLOCK_REALTIME delay which is introduced during migration"
for both kvmclock users and modern TSC clocksource users.

So yes, I also like this patchset, but would like it even more
if it fixed the case above as well (and I am not sure whether adding
the migration delta to kvmclock makes it harder to fix the TSC case
later).

> Perhaps we can add to step 6 something like:
> 
> > +6. Adjust the guest TSC offsets for every vCPU to account for (1)
> > time +   elapsed since recording state and (2) difference in TSCs
> > between the +   source and destination machine: + +   new_off_n = t_0
> > + off_n + (k_1 - k_0) * freq - t_1 +
> 
> "off + t - k * freq" is the guest TSC value corresponding to a time of 0
> in kvmclock.  The above formula ensures that it is the same on the
> destination as it was on the source.
> 
> Also, the names are a bit hard to follow.  Perhaps
> 
> 	t_0		tsc_src
> 	t_1		tsc_dest
> 	k_0		guest_src
> 	k_1		guest_dest
> 	r_0		host_src
> 	off_n		ofs_src[i]
> 	new_off_n	ofs_dest[i]
> 
> Paolo
> 
> 

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v8 7/7] KVM: x86: Expose TSC offset controls to userspace
  2021-10-01 19:11             ` Marcelo Tosatti
@ 2021-10-01 19:33               ` Oliver Upton
  -1 siblings, 0 replies; 113+ messages in thread
From: Oliver Upton @ 2021-10-01 19:33 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Paolo Bonzini, kvm, kvmarm, Sean Christopherson, Marc Zyngier,
	Peter Shier, Jim Mattson, David Matlack, Ricardo Koller,
	Jing Zhang, Raghavendra Rao Anata, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Andrew Jones, Will Deacon,
	Catalin Marinas

Marcelo,

On Fri, Oct 1, 2021 at 12:11 PM Marcelo Tosatti <mtosatti@redhat.com> wrote:
>
> On Fri, Oct 01, 2021 at 05:12:20PM +0200, Paolo Bonzini wrote:
> > On 01/10/21 12:32, Marcelo Tosatti wrote:
> > > > +1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (t_0), +
> > > > kvmclock nanoseconds (k_0), and realtime nanoseconds (r_0). + [...]
> > > >  +4. Invoke the KVM_SET_CLOCK ioctl, providing the kvmclock
> > > > nanoseconds +   (k_0) and realtime nanoseconds (r_0) in their
> > > > respective fields. +   Ensure that the KVM_CLOCK_REALTIME flag is
> > > > set in the provided +   structure. KVM will advance the VM's
> > > > kvmclock to account for elapsed +   time since recording the clock
> > > > values.
> > >
> > > You can't advance both kvmclock (kvmclock_offset variable) and the
> > > TSCs, which would be double counting.
> > >
> > > So you have to either add the elapsed realtime (1) between
> > > KVM_GET_CLOCK to kvmclock (which this patch is doing), or to the
> > > TSCs. If you do both, there is double counting. Am i missing
> > > something?
> >
> > Probably one of these two (but it's worth pointing out both of them):
> >
> > 1) the attribute that's introduced here *replaces*
> > KVM_SET_MSR(MSR_IA32_TSC), so the TSC is not added.
> >
> > 2) the adjustment formula later in the algorithm does not care about how
> > much time passed between step 1 and step 4.  It just takes two well
> > known (TSC, kvmclock) pairs, and uses them to ensure the guest TSC is
> > the same on the destination as if the guest was still running on the
> > source.  It is irrelevant that one of them is before migration and one
> > is after, all it matters is that one is on the source and one is on the
> > destination.
>
> OK, so it still relies on NTPd daemon to fix the CLOCK_REALTIME delay
> which is introduced during migration (which is what i would guess is
> the lower hanging fruit) (for guests using TSC).

The series gives userspace the ability to modify the guest's
perception of the TSC in whatever way it sees fit. The algorithm in
the documentation provides a suggestion to userspace on how to do
exactly that. I kept that advancement logic out of the kernel because
IMO it is an implementation detail: users have differing opinions on
how clocks should behave across a migration and KVM shouldn't have any
baked-in rules around it.

At the same time, userspace can choose to _not_ jump the TSC and use
the available interfaces to just migrate the existing state of the
TSCs.

When I had initially proposed this series upstream, Paolo astutely
pointed out that there was no good way to get a (CLOCK_REALTIME, TSC)
pairing, which is critical for the TSC advancement algorithm in the
documentation. Google's best way to get (CLOCK_REALTIME, TSC) exists
in userspace [1], hence the missing kvm clock changes. So, in all, the
spirit of the KVM clock changes is to provide missing UAPI around the
clock/TSC, with the side effect of changing the guest-visible value.

[1] https://cloud.google.com/spanner/docs/true-time-external-consistency
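
To make that concrete, a minimal userspace sketch of the documented flow
(assumptions: the kvm_clock_data layout with realtime/host_tsc fields from
this series, and that the per-vCPU offset from patch 7 is exposed as a
KVM_VCPU_TSC_CTRL / KVM_VCPU_TSC_OFFSET attribute set with
KVM_SET_DEVICE_ATTR; helper names are made up and error handling is
omitted):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Source: record (t_0, k_0, r_0) in a single KVM_GET_CLOCK call. */
static void get_clock_snapshot(int vm_fd, struct kvm_clock_data *snap)
{
	snap->flags = 0;
	ioctl(vm_fd, KVM_GET_CLOCK, snap);	/* fills clock, realtime, host_tsc */
}

/*
 * Destination: advance kvmclock via realtime, then recompute every vCPU's
 * TSC offset so that "off + t - k * freq" is unchanged.  ofs_src[i] was
 * read on the source via KVM_GET_DEVICE_ATTR on each vCPU, and both hosts
 * are assumed to report KVM_CLOCK_HOST_TSC.
 */
static void set_clock_and_offsets(int vm_fd, const int *vcpu_fd,
				  const int64_t *ofs_src, int nr_vcpus,
				  const struct kvm_clock_data *src,
				  uint64_t tsc_khz)
{
	struct kvm_clock_data dst = {
		.clock    = src->clock,		/* k_0 */
		.realtime = src->realtime,	/* r_0 */
		.flags    = KVM_CLOCK_REALTIME,
	};

	ioctl(vm_fd, KVM_SET_CLOCK, &dst);	/* KVM adds the elapsed realtime */

	dst.flags = 0;
	ioctl(vm_fd, KVM_GET_CLOCK, &dst);	/* record (t_1, k_1) */

	for (int i = 0; i < nr_vcpus; i++) {
		/* new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1,
		 * with freq (ticks per kvmclock ns) derived from tsc_khz. */
		int64_t ofs_dst = (int64_t)(src->host_tsc - dst.host_tsc) +
			ofs_src[i] +
			(int64_t)(dst.clock - src->clock) * (int64_t)tsc_khz / 1000000;
		struct kvm_device_attr attr = {
			.group = KVM_VCPU_TSC_CTRL,
			.attr  = KVM_VCPU_TSC_OFFSET,
			.addr  = (uint64_t)&ofs_dst,
		};

		ioctl(vcpu_fd[i], KVM_SET_DEVICE_ATTR, &attr);
	}
}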

> My point was that, by advancing the _TSC value_ by:
>
> T0. stop guest vcpus    (source)
> T1. KVM_GET_CLOCK       (source)
> T2. KVM_SET_CLOCK       (destination)
> T3. Write guest TSCs    (destination)
> T4. resume guest        (destination)
>
> new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1
>
> t_0:    host TSC at KVM_GET_CLOCK time.
> off_n:  TSC offset at vcpu-n (as long as no guest TSC writes are performed,
> TSC offset is fixed).
> ...
>
> +4. Invoke the KVM_SET_CLOCK ioctl, providing the kvmclock nanoseconds
> +   (k_0) and realtime nanoseconds (r_0) in their respective fields.
> +   Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
> +   structure. KVM will advance the VM's kvmclock to account for elapsed
> +   time since recording the clock values.
>
> Only kvmclock is advanced (by passing r_0). But a guest might not use kvmclock
> (hopefully modern guests on modern hosts will use TSC clocksource,
> whose clock_gettime is faster... some people are using that already).
>

Hopefully the above explanation made it clearer how the TSCs are
supposed to get advanced, and why it isn't done in the kernel.

> At some point QEMU should enable invariant TSC flag by default?
>
> That said, the point is: why not advance the _TSC_ values
> (instead of kvmclock nanoseconds), as doing so would reduce
> the "the CLOCK_REALTIME delay which is introduced during migration"
> for both kvmclock users and modern tsc clocksource users.
>
> So yes, i also like this patchset, but would like it even more
> if it fixed the case above as well (and not sure whether adding
> the migration delta to KVMCLOCK makes it harder to fix TSC case
> later).
>
> > Perhaps we can add to step 6 something like:
> >
> > > +6. Adjust the guest TSC offsets for every vCPU to account for (1)
> > > time +   elapsed since recording state and (2) difference in TSCs
> > > between the +   source and destination machine: + +   new_off_n = t_0
> > > + off_n + (k_1 - k_0) * freq - t_1 +
> >
> > "off + t - k * freq" is the guest TSC value corresponding to a time of 0
> > in kvmclock.  The above formula ensures that it is the same on the
> > destination as it was on the source.
> >
> > Also, the names are a bit hard to follow.  Perhaps
> >
> >       t_0             tsc_src
> >       t_1             tsc_dest
> >       k_0             guest_src
> >       k_1             guest_dest
> >       r_0             host_src
> >       off_n           ofs_src[i]
> >       new_off_n       ofs_dest[i]
> >
> > Paolo
> >

Yeah, sounds good to me. Shall I respin the whole series from what you
have in kvm/queue, or just send you the bits and pieces that ought to
be applied?

--
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v8 4/7] KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK
  2021-10-01 19:59             ` Thomas Gleixner
@ 2021-10-01 21:03               ` Oliver Upton
  -1 siblings, 0 replies; 113+ messages in thread
From: Oliver Upton @ 2021-10-01 21:03 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Marcelo Tosatti, kvm, kvmarm, Paolo Bonzini, Sean Christopherson,
	Marc Zyngier, Peter Shier, Jim Mattson, David Matlack,
	Ricardo Koller, Jing Zhang, Raghavendra Rao Anata, James Morse,
	Alexandru Elisei, Suzuki K Poulose, linux-arm-kernel,
	Andrew Jones, Will Deacon, Catalin Marinas

Hi Thomas,

On Fri, Oct 1, 2021 at 12:59 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Marcelo,
>
> On Fri, Oct 01 2021 at 09:05, Marcelo Tosatti wrote:
> > On Fri, Oct 01, 2021 at 01:02:23AM +0200, Thomas Gleixner wrote:
> >> But even if that would be the case, then what prevents the stale time
> >> stamps to be visible? Nothing:
> >>
> >> T0:    t = now();
> >>          -> pause
> >>          -> resume
> >>          -> magic "fixup"
> >> T1:    dostuff(t);
> >
> > Yes.
> >
> > BTW, you could have a userspace notification (then applications
> > could handle this if desired).
>
> Well, we have that via timerfd with TFD_TIMER_CANCEL_ON_SET for
> CLOCK_REALTIME. That's what applications which are sensitive to clock
> REALTIME jumps use today.
>
> >>   Now the proposed change is creating exactly the same problem:
> >>
> >>   >> > +     if (data.flags & KVM_CLOCK_REALTIME) {
> >>   >> > +             u64 now_real_ns = ktime_get_real_ns();
> >>   >> > +
> >>   >> > +             /*
> >>   >> > +              * Avoid stepping the kvmclock backwards.
> >>   >> > +              */
> >>   >> > +             if (now_real_ns > data.realtime)
> >>   >> > +                     data.clock += now_real_ns - data.realtime;
> >>   >> > +     }
> >>
> >>   IOW, it takes the time between pause and resume into account and
> >>   forwards the underlying base clock which makes CLOCK_MONOTONIC
> >>   jump forward by exactly that amount of time.
> >
> > Well, it is assuming that the
> >
> >  T0:    t = now();
> >  T1:    pause vm()
> >  T2:  finish vm migration()
> >  T3:    dostuff(t);
> >
> > Interval between T1 and T2 is small (and that the guest
> > clocks are synchronized up to a given boundary).
>
> Yes, I understand that, but it's an assumption and there is no boundary
> for the time jump in the proposed patches, which rings my alarm bells :)
>
> > But i suppose adding a limit to the forward clock advance
> > (in the migration case) is useful:
> >
> >       1) If migration (well actually, only the final steps
> >          to finish migration, the time between when guest is paused
> >          on source and is resumed on destination) takes too long,
> >          then too bad: fix it to be shorter if you want the clocks
> >          to have close to zero change to realtime on migration.
> >
> >       2) Avoid the other bugs in case of large forward advance.
> >
> > Maybe having it configurable, with a say, 1 minute maximum by default
> > is a good choice?
>
> Don't know what 1 minute does in terms of applications etc. You have to
> do some experiments on that.

I debated quite a bit on what the absolute limit should be for
advancing the KVM clock, and settled on doing no checks in the kernel
besides the monotonicity invariant. At the end of the day, userspace can
ignore all of the rules that KVM tries to enforce on the kvm clock/TSC
and jump them as it sees fit (both are already directly writable). But I
agree that there has to be some reasoning about what is acceptable. We
have an absolute limit on how far forward we will yank the KVM clock and
TSC in our userspace, but of course it has a TOCTOU problem for whatever
madness can come in between userspace and the time the kernel actually
services the ioctl.
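
FWIW, the clamp on our end is conceptually something like this (a rough
sketch: the limit, the names and the lack of error handling are made up,
and it assumes the kvm_clock_data layout from this series):

#include <linux/kvm.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <time.h>

/* Made-up policy: refuse to advance the VM's clocks by more than a minute. */
#define MAX_ADVANCE_NS	(60ULL * 1000 * 1000 * 1000)

static uint64_t realtime_now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_REALTIME, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

/* data was filled by KVM_GET_CLOCK on the source before blackout. */
static int restore_kvmclock(int vm_fd, struct kvm_clock_data *data)
{
	uint64_t elapsed = realtime_now_ns() - data->realtime;

	/* Policy check only; still racy until the ioctl is serviced (TOCTOU). */
	if (elapsed > MAX_ADVANCE_NS)
		return -1;

	/* KVM itself adds (current realtime - data->realtime) to data->clock. */
	data->flags = KVM_CLOCK_REALTIME;
	return ioctl(vm_fd, KVM_SET_CLOCK, data);
}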

--
Thanks,
Oliver


> > An alternative would be to advance only the guests REALTIME clock, from
> > data about how long steps T1-T2 took.
>
> Yes, that's what would happen in the cooperative S2IDLE or S3 case when
> the guest resumes.
>
> >> So now virt came along and created a hard to solve circular dependency
> >> problem:
> >>
> >>    - If CLOCK_MONOTONIC stops for too long then NTP/PTP gets out of
> >>      sync, but everything else is happy.
> >>
> >>    - If CLOCK_MONOTONIC jumps too far forward, then all hell breaks
> >>      loose, but NTP/PTP is happy.
> >
> > One must handle the
> >
> >  T0:    t = now();
> >           -> pause
> >           -> resume
> >           -> magic "fixup"
> >  T1:    dostuff(t);
> >
> > fact if one is going to use savevm/restorevm anyway, so...
> > (it is kind of unfixable, unless you modify your application
> > to accept notifications to redo any computation based on t, isn't it?).
>
> Well yes, but what applications can deal with is CLOCK_REALTIME jumping
> because that's a property of it. Not so much the CLOCK_MONOTONIC part.
>
> >> If you decide that correctness is overrated, then please document it
> >> clearly instead of trying to pretend being correct.
> >
> > Based on the above, advancing only CLOCK_REALTIME (and not CLOCK_MONOTONIC)
> > would be correct, right? And it's probably not very hard to do.
>
> Time _is_ hard to get right.
>
> So you might experiment with something like this as a stop gap:
>
>   Provide the guest something like this:
>
>           u64              migration_seq;
>           u64              realtime_delta_ns;
>
>   in the shared clock page
>
>   Do not forward jump clock MONOTONIC.
>
>   On resume kick an IPI where the guest handler does:
>
>          if (clock_data->migration_seq == migration_seq)
>                 return;
>
>          migration_seq = clock_data->migration_seq;
>
>          ts64 = { 0, 0 };
>          timespec64_add_ns(&ts64, clock_data->realtime_delta_ns);
>          timekeeping_inject_sleeptime64(&ts64);
>
>   Make sure that the IPI completes before you migrate the guest another
>   time or implement it slightly smarter, but you get the idea :)
>
> That's what we use for suspend time injection, but it should just work
> without frozen tasks as well. It will forward clock REALTIME by the
> amount of time spent during migration. It'll also modify the BOOTTIME
> offset by the same amount, but that's not a tragedy.
>
> The function will also reset NTP state so the NTP/PTP daemon knows that
> there was a kernel initiated time jump and it can work out easily what
> to do like it does on resume from an actual suspend. It will also
> invoke clock_was_set() which makes all the other time related updates
> trigger and wakeup tasks which have a timerfd with
> TFD_TIMER_CANCEL_ON_SET armed.
>
> This will obviously not work when the guest is in S2IDLE or S3, but for
> initial experimentation you can ignore that and just avoid to do that in
> the guest. :)
>
> That still is worse than a cooperative S2IDLE/S3, but it's way more
> sensible than the other two evils you have today.
>
> > Thanks very much for the detailed information! It's a good basis
> > for the document you asked for.
>
> I volunteer to review that documentation once it materializes :)
>
> Thanks,
>
>         tglx

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v8 7/7] KVM: x86: Expose TSC offset controls to userspace
  2021-10-01 19:11             ` Marcelo Tosatti
@ 2021-10-04 11:44               ` Paolo Bonzini
  -1 siblings, 0 replies; 113+ messages in thread
From: Paolo Bonzini @ 2021-10-04 11:44 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Oliver Upton, kvm, kvmarm, Sean Christopherson, Marc Zyngier,
	Peter Shier, Jim Mattson, David Matlack, Ricardo Koller,
	Jing Zhang, Raghavendra Rao Anata, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Andrew Jones, Will Deacon,
	Catalin Marinas

On 01/10/21 21:11, Marcelo Tosatti wrote:
> That said, the point is: why not advance the _TSC_ values
> (instead of kvmclock nanoseconds), as doing so would reduce
> "the CLOCK_REALTIME delay which is introduced during migration"
> for both kvmclock users and modern tsc clocksource users.

It already does, that's the cool part.  Take again the formula here:

    guest_off_1 = t_0 + guest_off_0 + (k_1 - k_0) * freq - t_1

and set:

	t_1 = t_0 + host_off_0_1 + (k_1 - k_0) * freq

i.e. t_0 and t_1 are different because 1) the machines were booted at 
different times, which is host_off_0_1, and 2) t_1 includes the migration 
downtime between k_0 and k_1.

Now you have:

    guest_off_1 = t_0 + guest_off_0 + (k_1 - k_0) * freq
	       - t_0 - host_off_0_1 - (k_1 - k_0) * freq

    guest_off_1 = guest_off_0 - host_off_0_1

That is, the TSC is exactly the same as it was on the source, just 
adjusted because the two machines were booted at different times.

The need to have precise (ns, cycle) pairings is exactly because it 
ensures that everything cancels in the formula, and all that is left is 
the differences in the TSC of the two hosts.
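
Plugging made-up numbers into the formulas shows the cancellation
(throwaway userspace check, nothing KVM-specific about it):

#include <assert.h>
#include <stdint.h>

int main(void)
{
	uint64_t freq = 3;		/* guest TSC increments per kvmclock ns */
	uint64_t k_0 = 1000, k_1 = 1500;/* kvmclock at KVM_GET/SET_CLOCK */
	uint64_t t_0 = 900000;		/* source host TSC at KVM_GET_CLOCK */
	uint64_t host_off_0_1 = 50000;	/* dest host TSC leads source by this */
	uint64_t guest_off_0 = 77777;	/* guest TSC offset on the source */

	/* Destination host TSC at KVM_SET_CLOCK, per the definition above. */
	uint64_t t_1 = t_0 + host_off_0_1 + (k_1 - k_0) * freq;

	/* The offset the algorithm computes for the destination. */
	uint64_t guest_off_1 = t_0 + guest_off_0 + (k_1 - k_0) * freq - t_1;

	/* Everything cancels except the host-to-host difference... */
	assert(guest_off_1 == guest_off_0 - host_off_0_1);

	/*
	 * ...so the guest TSC on the destination reads exactly what it would
	 * have read on the source at kvmclock time k_1, downtime included.
	 */
	assert(t_1 + guest_off_1 == t_0 + guest_off_0 + (k_1 - k_0) * freq);
	return 0;
}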

Paolo

> So yes, I also like this patchset, but would like it even more
> if it fixed the case above as well (and am not sure whether adding
> the migration delta to KVMCLOCK makes it harder to fix the TSC case
> later).


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v8 7/7] KVM: x86: Expose TSC offset controls to userspace
  2021-10-01 19:33               ` Oliver Upton
@ 2021-10-04 14:30                 ` Marcelo Tosatti
  -1 siblings, 0 replies; 113+ messages in thread
From: Marcelo Tosatti @ 2021-10-04 14:30 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Paolo Bonzini, kvm, kvmarm, Sean Christopherson, Marc Zyngier,
	Peter Shier, Jim Mattson, David Matlack, Ricardo Koller,
	Jing Zhang, Raghavendra Rao Anata, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Andrew Jones, Will Deacon,
	Catalin Marinas

On Fri, Oct 01, 2021 at 12:33:28PM -0700, Oliver Upton wrote:
> Marcelo,
> 
> On Fri, Oct 1, 2021 at 12:11 PM Marcelo Tosatti <mtosatti@redhat.com> wrote:
> >
> > On Fri, Oct 01, 2021 at 05:12:20PM +0200, Paolo Bonzini wrote:
> > > On 01/10/21 12:32, Marcelo Tosatti wrote:
> > > > > +1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (t_0),
> > > > > +   kvmclock nanoseconds (k_0), and realtime nanoseconds (r_0).
> > > > > + [...]
> > > > > +4. Invoke the KVM_SET_CLOCK ioctl, providing the kvmclock nanoseconds
> > > > > +   (k_0) and realtime nanoseconds (r_0) in their respective fields.
> > > > > +   Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
> > > > > +   structure. KVM will advance the VM's kvmclock to account for elapsed
> > > > > +   time since recording the clock values.
> > > >
> > > > You can't advance both kvmclock (kvmclock_offset variable) and the
> > > > TSCs, which would be double counting.
> > > >
> > > > So you have to either add the elapsed realtime (1) between
> > > > KVM_GET_CLOCK to kvmclock (which this patch is doing), or to the
> > > > TSCs. If you do both, there is double counting. Am i missing
> > > > something?
> > >
> > > Probably one of these two (but it's worth pointing out both of them):
> > >
> > > 1) the attribute that's introduced here *replaces*
> > > KVM_SET_MSR(MSR_IA32_TSC), so the TSC is not added.
> > >
> > > 2) the adjustment formula later in the algorithm does not care about how
> > > much time passed between step 1 and step 4.  It just takes two well
> > > known (TSC, kvmclock) pairs, and uses them to ensure the guest TSC is
> > > the same on the destination as if the guest was still running on the
> > > source.  It is irrelevant that one of them is before migration and one
> > > is after, all it matters is that one is on the source and one is on the
> > > destination.
> >
> > OK, so it still relies on NTPd daemon to fix the CLOCK_REALTIME delay
> > which is introduced during migration (which is what i would guess is
> > the lower hanging fruit) (for guests using TSC).
> 
> The series gives userspace the ability to modify the guest's
> perception of the TSC in whatever way it sees fit. The algorithm in
> the documentation provides a suggestion to userspace on how to do
> exactly that. I kept that advancement logic out of the kernel because
> IMO it is an implementation detail: users have differing opinions on
> how clocks should behave across a migration and KVM shouldn't have any
> baked-in rules around it.

OK, I was just trying to visualize how this would work with QEMU Linux guests.
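
Something like the below, I suppose? (rough sketch only; MAX_VCPUS, the
attribute group name, the ns-to-cycles conversion and the missing error
handling are all hand-waving on my part):

#include <linux/kvm.h>
#include <stdint.h>
#include <sys/ioctl.h>

#define MAX_VCPUS	64	/* arbitrary */

/* What the VMM carries from source to destination. */
struct clock_migration_state {
	struct kvm_clock_data	clk;		/* t_0 = host_tsc, k_0 = clock, r_0 = realtime */
	uint64_t		off[MAX_VCPUS];	/* off_n, one per vCPU */
};

static void tsc_offset_attr(struct kvm_device_attr *attr, uint64_t *val)
{
	attr->group = KVM_VCPU_TSC_CTRL;	/* guessing the group name */
	attr->attr  = KVM_VCPU_TSC_OFFSET;
	attr->addr  = (uintptr_t)val;
}

/* Source: T0 stop vcpus, T1 record clocks and per-vCPU TSC offsets. */
static void save_clocks(int vm_fd, int *vcpu_fd, int nr_vcpus,
			struct clock_migration_state *s)
{
	struct kvm_device_attr attr;

	ioctl(vm_fd, KVM_GET_CLOCK, &s->clk);
	for (int i = 0; i < nr_vcpus; i++) {
		tsc_offset_attr(&attr, &s->off[i]);
		ioctl(vcpu_fd[i], KVM_GET_DEVICE_ATTR, &attr);
	}
}

/* Destination: T2 restore kvmclock, T3 write guest TSC offsets, T4 resume. */
static void restore_clocks(int vm_fd, int *vcpu_fd, int nr_vcpus,
			   struct clock_migration_state *s, uint64_t tsc_khz)
{
	struct kvm_device_attr attr;
	struct kvm_clock_data now;

	s->clk.flags = KVM_CLOCK_REALTIME;	/* KVM accounts for elapsed realtime */
	ioctl(vm_fd, KVM_SET_CLOCK, &s->clk);

	ioctl(vm_fd, KVM_GET_CLOCK, &now);	/* t_1 = host_tsc, k_1 = clock */
	for (int i = 0; i < nr_vcpus; i++) {
		/* new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1 (overflow ignored) */
		uint64_t new_off = s->clk.host_tsc + s->off[i] +
				   (now.clock - s->clk.clock) * tsc_khz / 1000000 -
				   now.host_tsc;

		tsc_offset_attr(&attr, &new_off);
		ioctl(vcpu_fd[i], KVM_SET_DEVICE_ATTR, &attr);
	}
}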

> 
> At the same time, userspace can choose to _not_ jump the TSC and use
> the available interfaces to just migrate the existing state of the
> TSCs.
> 
> When I had initially proposed this series upstream, Paolo astutely
> pointed out that there was no good way to get a (CLOCK_REALTIME, TSC)
> pairing, which is critical for the TSC advancement algorithm in the
> documentation. Google's best way to get (CLOCK_REALTIME, TSC) exists
> in userspace [1], hence the missing kvm clock changes. So, in all, the
> spirit of the KVM clock changes is to provide missing UAPI around the
> clock/TSC, with the side effect of changing the guest-visible value.
> 
> [1] https://cloud.google.com/spanner/docs/true-time-external-consistency
> 
> > My point was that, by advancing the _TSC value_ by:
> >
> > T0. stop guest vcpus    (source)
> > T1. KVM_GET_CLOCK       (source)
> > T2. KVM_SET_CLOCK       (destination)
> > T3. Write guest TSCs    (destination)
> > T4. resume guest        (destination)
> >
> > new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1
> >
> > t_0:    host TSC at KVM_GET_CLOCK time.
> > off_n:  TSC offset at vcpu-n (as long as no guest TSC writes are performed,
> > TSC offset is fixed).
> > ...
> >
> > +4. Invoke the KVM_SET_CLOCK ioctl, providing the kvmclock nanoseconds
> > +   (k_0) and realtime nanoseconds (r_0) in their respective fields.
> > +   Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
> > +   structure. KVM will advance the VM's kvmclock to account for elapsed
> > +   time since recording the clock values.
> >
> > Only kvmclock is advanced (by passing r_0). But a guest might not use kvmclock
> > (hopefully modern guests on modern hosts will use TSC clocksource,
> > whose clock_gettime is faster... some people are using that already).
> >
> 
> Hopefully the above explanation made it clearer how the TSCs are
> supposed to get advanced, and why it isn't done in the kernel.
> 
> > At some point QEMU should enable invariant TSC flag by default?
> >
> > That said, the point is: why not advance the _TSC_ values
> > (instead of kvmclock nanoseconds), as doing so would reduce
> > the "the CLOCK_REALTIME delay which is introduced during migration"
> > for both kvmclock users and modern tsc clocksource users.
> >
> > So yes, i also like this patchset, but would like it even more
> > if it fixed the case above as well (and not sure whether adding
> > the migration delta to KVMCLOCK makes it harder to fix TSC case
> > later).
> >
> > > Perhaps we can add to step 6 something like:
> > >
> > > > +6. Adjust the guest TSC offsets for every vCPU to account for (1) time
> > > > +   elapsed since recording state and (2) difference in TSCs between the
> > > > +   source and destination machine:
> > > > +
> > > > +   new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1
> > > > +
> > >
> > > "off + t - k * freq" is the guest TSC value corresponding to a time of 0
> > > in kvmclock.  The above formula ensures that it is the same on the
> > > destination as it was on the source.
> > >
> > > Also, the names are a bit hard to follow.  Perhaps
> > >
> > >       t_0             tsc_src
> > >       t_1             tsc_dest
> > >       k_0             guest_src
> > >       k_1             guest_dest
> > >       r_0             host_src
> > >       off_n           ofs_src[i]
> > >       new_off_n       ofs_dest[i]
> > >
> > > Paolo
> > >
> 
> Yeah, sounds good to me. Shall I respin the whole series from what you
> have in kvm/queue, or just send you the bits and pieces that ought to
> be applied?
> 
> --
> Thanks,
> Oliver
> 
> 


^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v8 7/7] KVM: x86: Expose TSC offset controls to userspace
  2021-09-16 18:15   ` Oliver Upton
@ 2021-10-05 15:22     ` Sean Christopherson
  -1 siblings, 0 replies; 113+ messages in thread
From: Sean Christopherson @ 2021-10-05 15:22 UTC (permalink / raw)
  To: Oliver Upton
  Cc: kvm, kvmarm, Paolo Bonzini, Marc Zyngier, Peter Shier,
	Jim Mattson, David Matlack, Ricardo Koller, Jing Zhang,
	Raghavendra Rao Anata, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Andrew Jones, Will Deacon,
	Catalin Marinas

On Thu, Sep 16, 2021, Oliver Upton wrote:
> +static int kvm_arch_tsc_get_attr(struct kvm_vcpu *vcpu,
> +				 struct kvm_device_attr *attr)
> +{
> +	u64 __user *uaddr = (u64 __user *)attr->addr;

...

> +static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu,
> +				 struct kvm_device_attr *attr)
> +{
> +	u64 __user *uaddr = (u64 __user *)attr->addr;

These casts break 32-bit builds because they truncate attr->addr from a 64-bit
integer to a 32-bit pointer.  The address should also be checked to verify that
bits 63:32 are not set on 32-bit kernels.

arch/x86/kvm/x86.c: In function ‘kvm_arch_tsc_get_attr’:
arch/x86/kvm/x86.c:4947:22: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
 4947 |  u64 __user *uaddr = (u64 __user *)attr->addr;
      |                      ^
arch/x86/kvm/x86.c: In function ‘kvm_arch_tsc_set_attr’:
arch/x86/kvm/x86.c:4967:22: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
 4967 |  u64 __user *uaddr = (u64 __user *)attr->addr;
      |                      ^


Not sure if there's a more elegant approach than casts galore?

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8e5e462ffd65..3930e5dcdf0e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4944,9 +4944,12 @@ static int kvm_arch_tsc_has_attr(struct kvm_vcpu *vcpu,
 static int kvm_arch_tsc_get_attr(struct kvm_vcpu *vcpu,
                                 struct kvm_device_attr *attr)
 {
-       u64 __user *uaddr = (u64 __user *)attr->addr;
+       u64 __user *uaddr = (u64 __user *)(unsigned long)attr->addr;
        int r;

+       if ((u64)(unsigned long)uaddr != attr->addr)
+               return -EFAULT;
+
        switch (attr->attr) {
        case KVM_VCPU_TSC_OFFSET:
                r = -EFAULT;
@@ -4964,10 +4967,13 @@ static int kvm_arch_tsc_get_attr(struct kvm_vcpu *vcpu,
 static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu,
                                 struct kvm_device_attr *attr)
 {
-       u64 __user *uaddr = (u64 __user *)attr->addr;
+       u64 __user *uaddr = (u64 __user *)(unsigned long)attr->addr;
        struct kvm *kvm = vcpu->kvm;
        int r;

+       if ((u64)(unsigned long)uaddr != attr->addr)
+               return -EFAULT;
+
        switch (attr->attr) {
        case KVM_VCPU_TSC_OFFSET: {
                u64 offset, tsc, ns;

^ permalink raw reply related	[flat|nested] 113+ messages in thread
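
One possible simplification, sketched below: the kernel's existing
u64_to_user_ptr() helper hides the int-to-pointer double cast. Note
that the helper only performs the cast (with a type check), so the
explicit rejection of addresses with bits 63:32 set is still needed on
32-bit kernels. The body is restructured for brevity and assumes the
series exposes vcpu->arch.l1_tsc_offset through this attribute; this is
a sketch, not necessarily the shape of the final fix.

static int kvm_arch_tsc_get_attr(struct kvm_vcpu *vcpu,
				 struct kvm_device_attr *attr)
{
	u64 __user *uaddr = u64_to_user_ptr(attr->addr);

	/* The helper only casts; still refuse addresses above 4GiB here. */
	if ((u64)(unsigned long)attr->addr != attr->addr)
		return -EFAULT;

	if (attr->attr != KVM_VCPU_TSC_OFFSET)
		return -ENXIO;

	return put_user(vcpu->arch.l1_tsc_offset, uaddr) ? -EFAULT : 0;
}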

* Re: [PATCH v8 7/7] KVM: x86: Expose TSC offset controls to userspace
  2021-09-16 18:15   ` Oliver Upton
@ 2022-02-23 10:02     ` David Woodhouse
  -1 siblings, 0 replies; 113+ messages in thread
From: David Woodhouse @ 2022-02-23 10:02 UTC (permalink / raw)
  To: Oliver Upton, kvm, kvmarm
  Cc: Paolo Bonzini, Sean Christopherson, Marc Zyngier, Peter Shier,
	Jim Mattson, David Matlack, Ricardo Koller, Jing Zhang,
	Raghavendra Rao Anata, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Andrew Jones, Will Deacon,
	Catalin Marinas


On Thu, 2021-09-16 at 18:15 +0000, Oliver Upton wrote:
> To date, VMM-directed TSC synchronization and migration has been a bit
> messy. KVM has some baked-in heuristics around TSC writes to infer if
> the VMM is attempting to synchronize. This is problematic, as it depends
> on host userspace writing to the guest's TSC within 1 second of the last
> write.
> 
> A much cleaner approach to configuring the guest's views of the TSC is to
> simply migrate the TSC offset for every vCPU. Offsets are idempotent,
> and thus not subject to change depending on when the VMM actually
> reads/writes values from/to KVM. The VMM can then read the TSC once with
> KVM_GET_CLOCK to capture a (realtime, host_tsc) pair at the instant when
> the guest is paused.
> 
> Cc: David Matlack <dmatlack@google.com>
> Cc: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Oliver Upton <oupton@google.com>
> ---
>  Documentation/virt/kvm/devices/vcpu.rst |  57 ++++++++++++
>  arch/x86/include/asm/kvm_host.h         |   1 +
>  arch/x86/include/uapi/asm/kvm.h         |   4 +
>  arch/x86/kvm/x86.c                      | 110 ++++++++++++++++++++++++
>  4 files changed, 172 insertions(+)
> 
> diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
> index 2acec3b9ef65..3b399d727c11 100644
> --- a/Documentation/virt/kvm/devices/vcpu.rst
> +++ b/Documentation/virt/kvm/devices/vcpu.rst
> @@ -161,3 +161,60 @@ Specifies the base address of the stolen time structure for this VCPU. The
>  base address must be 64 byte aligned and exist within a valid guest memory
>  region. See Documentation/virt/kvm/arm/pvtime.rst for more information
>  including the layout of the stolen time structure.
> +
> +4. GROUP: KVM_VCPU_TSC_CTRL
> +===========================
> +
> +:Architectures: x86
> +
> +4.1 ATTRIBUTE: KVM_VCPU_TSC_OFFSET
> +
> +:Parameters: 64-bit unsigned TSC offset
> +
> +Returns:
> +
> +	 ======= ======================================
> +	 -EFAULT Error reading/writing the provided
> +		 parameter address.
> +	 -ENXIO  Attribute not supported
> +	 ======= ======================================
> +
> +Specifies the guest's TSC offset relative to the host's TSC. The guest's
> +TSC is then derived by the following equation:
> +
> +  guest_tsc = host_tsc + KVM_VCPU_TSC_OFFSET

This isn't true. The guest TSC also depends on the *scaling* factor.




^ permalink raw reply	[flat|nested] 113+ messages in thread
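
For reference, when hardware TSC scaling is in use KVM's model (see
kvm_read_l1_tsc()) derives the guest value by scaling the host TSC
first and then adding the offset; the ratio is a fixed-point multiplier
whose fractional width is vendor specific (48 bits on Intel, 32 on
AMD). A sketch of the relation, using the existing mul_u64_u64_shr()
helper:

#include <linux/math64.h>

/* guest_tsc = ((host_tsc * ratio) >> frac_bits) + KVM_VCPU_TSC_OFFSET */
static u64 guest_tsc_from_host(u64 host_tsc, u64 ratio,
			       unsigned int frac_bits, u64 offset)
{
	return mul_u64_u64_shr(host_tsc, ratio, frac_bits) + offset;
}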

* Re: [PATCH v8 4/7] KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK
  2021-09-16 18:15   ` Oliver Upton
@ 2024-01-17 14:28     ` David Woodhouse
  -1 siblings, 0 replies; 113+ messages in thread
From: David Woodhouse @ 2024-01-17 14:28 UTC (permalink / raw)
  To: Oliver Upton, kvm, kvmarm
  Cc: Paolo Bonzini, Sean Christopherson, Marc Zyngier, Peter Shier,
	Jim Mattson, David Matlack, Ricardo Koller, Jing Zhang,
	Raghavendra Rao Anata, James Morse, Alexandru Elisei,
	Suzuki K Poulose, linux-arm-kernel, Andrew Jones, Will Deacon,
	Catalin Marinas


On Thu, 2021-09-16 at 18:15 +0000, Oliver Upton wrote:
> 
> @@ -5878,11 +5888,21 @@ static int kvm_vm_ioctl_set_clock(struct kvm *kvm, void __user *argp)
>          * is slightly ahead) here we risk going negative on unsigned
>          * 'system_time' when 'data.clock' is very small.
>          */
> -       if (kvm->arch.use_master_clock)
> -               now_ns = ka->master_kernel_ns;
> +       if (data.flags & KVM_CLOCK_REALTIME) {
> +               u64 now_real_ns = ktime_get_real_ns();
> +
> +               /*
> +                * Avoid stepping the kvmclock backwards.
> +                */
> +               if (now_real_ns > data.realtime)
> +                       data.clock += now_real_ns - data.realtime;
> +       }
> +
> +       if (ka->use_master_clock)
> +               now_raw_ns = ka->master_kernel_ns;

This looks wrong to me.

>         else
> -               now_ns = get_kvmclock_base_ns();
> -       ka->kvmclock_offset = data.clock - now_ns;
> +               now_raw_ns = get_kvmclock_base_ns();
> +       ka->kvmclock_offset = data.clock - now_raw_ns;
>         kvm_end_pvclock_update(kvm);
>         return 0;
>  }

We use the host CLOCK_MONOTONIC_RAW plus the boot offset as a
'kvmclock base clock', and get_kvmclock_base_ns() returns that. The KVM
clock for each VM is based on this 'kvmclock base clock', offset by
that VM's ka->kvmclock_offset, which represents the time at which the
VM was started, so each VM's clock starts from zero.

The values of ka->master_kernel_ns and ka->master_cycle_now represent a
single point in time, the former being the value of
get_kvmclock_base_ns() at that moment and the latter being the host TSC
value. In pvclock_update_vm_gtod_copy(), kvm_get_time_and_clockread()
is used to return both values at precisely the same moment, from the
*same* rdtsc().

This allows the current 'kvmclock base clock' to be calculated at any
moment by reading the TSC and calculating the delta from
ka->master_cycle_now to that reading, which tells us how much time has
elapsed since ka->master_kernel_ns. We can then add ka->kvmclock_offset
to get the kvmclock for this particular VM.

Now, looking at the code quoted above: it's given a kvm_clock_data
struct containing the KVM clock value that is to be set as the time
"now", and all it does is adjust ka->kvmclock_offset accordingly,
which is really simple:

		now_raw_ns = get_kvmclock_base_ns();
	ka->kvmclock_offset = data.clock - now_raw_ns;

Et voilà, now get_kvmclock_base_ns() + ka->kvmclock_offset at any given
moment in time will result in a kvmclock value according to what was
just set. Yay!

Except... in the case where the TSC is constant, we actually set
'now_raw_ns' to a value that doesn't represent *now*. Instead, we set
it to ka->master_kernel_ns which represents some point in the *past*.
We should add the number of TSC ticks since ka->master_cycle_now if
we're going to use that, surely?




^ permalink raw reply	[flat|nested] 113+ messages in thread
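
A sketch of the adjustment being suggested here, i.e. advancing the
masterclock snapshot by the TSC ticks elapsed since it was taken rather
than using it verbatim in kvm_vm_ioctl_set_clock(). The tsc_to_ns()
conversion is hypothetical shorthand for scaling the cycle delta with
the mul/shift pair that kvm_get_time_scale() produces:

	if (ka->use_master_clock) {
		u64 delta = rdtsc() - ka->master_cycle_now;

		/* hypothetical helper: cycles -> ns at the host TSC rate */
		now_raw_ns = ka->master_kernel_ns + tsc_to_ns(kvm, delta);
	} else {
		now_raw_ns = get_kvmclock_base_ns();
	}
	ka->kvmclock_offset = data.clock - now_raw_ns;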

end of thread, other threads:[~2024-01-17 14:29 UTC | newest]

Thread overview: 113+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-16 18:15 [PATCH v8 0/7] KVM: x86: Add idempotent controls for migrating system counter state Oliver Upton
2021-09-16 18:15 ` Oliver Upton
2021-09-16 18:15 ` Oliver Upton
2021-09-16 18:15 ` [PATCH v8 1/7] kvm: x86: abstract locking around pvclock_update_vm_gtod_copy Oliver Upton
2021-09-16 18:15   ` Oliver Upton
2021-09-16 18:15   ` Oliver Upton
2021-09-16 18:15 ` [PATCH v8 2/7] KVM: x86: extract KVM_GET_CLOCK/KVM_SET_CLOCK to separate functions Oliver Upton
2021-09-16 18:15   ` Oliver Upton
2021-09-16 18:15   ` Oliver Upton
2021-09-16 18:15 ` [PATCH v8 3/7] KVM: x86: Fix potential race in KVM_GET_CLOCK Oliver Upton
2021-09-16 18:15   ` Oliver Upton
2021-09-16 18:15   ` Oliver Upton
2021-09-29 13:33   ` Marcelo Tosatti
2021-09-29 13:33     ` Marcelo Tosatti
2021-09-29 13:33     ` Marcelo Tosatti
2021-09-16 18:15 ` [PATCH v8 4/7] KVM: x86: Report host tsc and realtime values " Oliver Upton
2021-09-16 18:15   ` Oliver Upton
2021-09-16 18:15   ` Oliver Upton
2021-09-28 18:53   ` Marcelo Tosatti
2021-09-28 18:53     ` Marcelo Tosatti
2021-09-28 18:53     ` Marcelo Tosatti
2021-09-29 11:20     ` Paolo Bonzini
2021-09-29 11:20       ` Paolo Bonzini
2021-09-29 11:20       ` Paolo Bonzini
2021-09-29 18:56   ` Marcelo Tosatti
2021-09-29 18:56     ` Marcelo Tosatti
2021-09-29 18:56     ` Marcelo Tosatti
2021-09-30 19:21     ` Marcelo Tosatti
2021-09-30 19:21       ` Marcelo Tosatti
2021-09-30 19:21       ` Marcelo Tosatti
2021-09-30 23:02       ` Thomas Gleixner
2021-09-30 23:02         ` Thomas Gleixner
2021-09-30 23:02         ` Thomas Gleixner
2021-10-01 12:05         ` Marcelo Tosatti
2021-10-01 12:05           ` Marcelo Tosatti
2021-10-01 12:05           ` Marcelo Tosatti
2021-10-01 12:10           ` Marcelo Tosatti
2021-10-01 12:10             ` Marcelo Tosatti
2021-10-01 12:10             ` Marcelo Tosatti
2021-10-01 19:59           ` Thomas Gleixner
2021-10-01 19:59             ` Thomas Gleixner
2021-10-01 19:59             ` Thomas Gleixner
2021-10-01 21:03             ` Oliver Upton
2021-10-01 21:03               ` Oliver Upton
2021-10-01 21:03               ` Oliver Upton
2021-10-01 14:17         ` Paolo Bonzini
2021-10-01 14:17           ` Paolo Bonzini
2021-10-01 14:17           ` Paolo Bonzini
2021-10-01 14:39   ` Paolo Bonzini
2021-10-01 14:39     ` Paolo Bonzini
2021-10-01 14:39     ` Paolo Bonzini
2021-10-01 14:41     ` Paolo Bonzini
2021-10-01 14:41       ` Paolo Bonzini
2021-10-01 14:41       ` Paolo Bonzini
2021-10-01 15:39       ` Oliver Upton
2021-10-01 15:39         ` Oliver Upton
2021-10-01 15:39         ` Oliver Upton
2021-10-01 16:42         ` Paolo Bonzini
2021-10-01 16:42           ` Paolo Bonzini
2021-10-01 16:42           ` Paolo Bonzini
2024-01-17 14:28   ` David Woodhouse
2024-01-17 14:28     ` David Woodhouse
2021-09-16 18:15 ` [PATCH v8 5/7] kvm: x86: protect masterclock with a seqcount Oliver Upton
2021-09-16 18:15   ` Oliver Upton
2021-09-16 18:15   ` Oliver Upton
2021-09-24 16:42   ` Paolo Bonzini
2021-09-24 16:42     ` Paolo Bonzini
2021-09-24 16:42     ` Paolo Bonzini
2021-09-30 17:51   ` Marcelo Tosatti
2021-09-30 17:51     ` Marcelo Tosatti
2021-09-30 17:51     ` Marcelo Tosatti
2021-10-01 16:48   ` Paolo Bonzini
2021-10-01 16:48     ` Paolo Bonzini
2021-10-01 16:48     ` Paolo Bonzini
2021-09-16 18:15 ` [PATCH v8 6/7] KVM: x86: Refactor tsc synchronization code Oliver Upton
2021-09-16 18:15   ` Oliver Upton
2021-09-16 18:15   ` Oliver Upton
2021-09-16 18:15 ` [PATCH v8 7/7] KVM: x86: Expose TSC offset controls to userspace Oliver Upton
2021-09-16 18:15   ` Oliver Upton
2021-09-16 18:15   ` Oliver Upton
2021-09-30 19:14   ` Marcelo Tosatti
2021-09-30 19:14     ` Marcelo Tosatti
2021-09-30 19:14     ` Marcelo Tosatti
2021-10-01  9:17     ` Paolo Bonzini
2021-10-01  9:17       ` Paolo Bonzini
2021-10-01  9:17       ` Paolo Bonzini
2021-10-01 10:32       ` Marcelo Tosatti
2021-10-01 10:32         ` Marcelo Tosatti
2021-10-01 10:32         ` Marcelo Tosatti
2021-10-01 15:12         ` Paolo Bonzini
2021-10-01 15:12           ` Paolo Bonzini
2021-10-01 15:12           ` Paolo Bonzini
2021-10-01 19:11           ` Marcelo Tosatti
2021-10-01 19:11             ` Marcelo Tosatti
2021-10-01 19:11             ` Marcelo Tosatti
2021-10-01 19:33             ` Oliver Upton
2021-10-01 19:33               ` Oliver Upton
2021-10-01 19:33               ` Oliver Upton
2021-10-04 14:30               ` Marcelo Tosatti
2021-10-04 14:30                 ` Marcelo Tosatti
2021-10-04 14:30                 ` Marcelo Tosatti
2021-10-04 11:44             ` Paolo Bonzini
2021-10-04 11:44               ` Paolo Bonzini
2021-10-04 11:44               ` Paolo Bonzini
2021-10-05 15:22   ` Sean Christopherson
2021-10-05 15:22     ` Sean Christopherson
2021-10-05 15:22     ` Sean Christopherson
2022-02-23 10:02   ` David Woodhouse
2022-02-23 10:02     ` David Woodhouse
2022-02-23 10:02     ` David Woodhouse
2021-09-24 16:43 ` [PATCH v8 0/7] KVM: x86: Add idempotent controls for migrating system counter state Paolo Bonzini
2021-09-24 16:43   ` Paolo Bonzini
2021-09-24 16:43   ` Paolo Bonzini
