linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH v3 0/5] x86/kvm: Virtual suspend time injection support
@ 2021-10-20 12:04 Hikaru Nishida
  2021-10-20 12:04 ` [RFC PATCH v3 1/5] timekeeping: Expose tk->offs_boot via ktime_get_offs_boot_ns Hikaru Nishida
                   ` (5 more replies)
  0 siblings, 6 replies; 10+ messages in thread
From: Hikaru Nishida @ 2021-10-20 12:04 UTC (permalink / raw)
  To: linux-kernel, dme, tglx, mlevitsk, linux, pbonzini, vkuznets, maz, will
  Cc: suleiman, senozhatsky, kvmarm, linux-arm-kernel, Hikaru Nishida,
	Andy Lutomirski, Arnd Bergmann, Borislav Petkov, Dave Hansen,
	Geert Uytterhoeven, H. Peter Anvin, Ingo Molnar, Ingo Molnar,
	Jim Mattson, Joerg Roedel, John Stultz, Jonathan Corbet,
	Juergen Gross, Kees Cook, Lai Jiangshan, Linus Walleij,
	Peter Zijlstra, Sean Christopherson, Stephen Boyd, Wanpeng Li,
	kvm, linux-doc, x86


Hi,

This patch series adds virtual suspend time injection support to KVM.
It is an updated version of the following series:
v2:
https://lore.kernel.org/kvm/20210806100710.2425336-1-hikalium@chromium.org/
v1:
https://lore.kernel.org/kvm/20210426090644.2218834-1-hikalium@chromium.org/

Please take a look again.

To kvm/arm64 folks:
I'm going to implement this mechanism for ARM64 as well, but I'm not
sure which function should be used to raise an IRQ (like kvm_apic_set_irq
on x86), or whether it is okay to use kvm_gfn_to_hva_cache /
kvm_write_guest_cached for sharing the suspend duration.
Please let me know if there is a more suitable way, or if you have any
suggestions.

Thanks,

Hikaru Nishida


Changes in v3:
- Used a PM notifier instead of modifying timekeeping_resume()
  - This avoids holding kvm_lock in an interrupt-disabled context.
- Used KVM_REQ_* to make requests to vcpus.
- Reused HYPERVISOR_CALLBACK_VECTOR IRQ instead of adding a new one.
- Extracted arch-independent parts.
- Fixed other reviewed points.

Hikaru Nishida (5):
  timekeeping: Expose tk->offs_boot via ktime_get_offs_boot_ns
  kvm/x86: Include asm/pvclock.h in asm/kvmclock.h
  kvm/x86: virtual suspend time injection: Add common definitions
  kvm/x86: virtual suspend time injection: Implement host side
  kvm/x86: virtual suspend time injection: Implement guest side

 Documentation/virt/kvm/cpuid.rst     |   3 +
 Documentation/virt/kvm/msr.rst       |  30 ++++++++
 arch/x86/Kconfig                     |  13 ++++
 arch/x86/include/asm/idtentry.h      |   2 +-
 arch/x86/include/asm/kvm_host.h      |   2 +
 arch/x86/include/asm/kvmclock.h      |  11 +++
 arch/x86/include/uapi/asm/kvm_para.h |   6 ++
 arch/x86/kernel/kvm.c                |  14 +++-
 arch/x86/kernel/kvmclock.c           |  26 +++++++
 arch/x86/kvm/Kconfig                 |  13 ++++
 arch/x86/kvm/cpuid.c                 |   4 +
 arch/x86/kvm/x86.c                   | 109 +++++++++++++++++++++++++++
 arch/x86/mm/fault.c                  |   2 +-
 include/linux/kvm_host.h             |  48 ++++++++++++
 include/linux/timekeeper_internal.h  |   5 ++
 include/linux/timekeeping.h          |   6 ++
 kernel/time/timekeeping.c            |  56 ++++++++++++++
 virt/kvm/kvm_main.c                  |  88 +++++++++++++++++++++
 18 files changed, 432 insertions(+), 6 deletions(-)

-- 
2.33.0.1079.g6e70778dc9-goog



* [RFC PATCH v3 1/5] timekeeping: Expose tk->offs_boot via ktime_get_offs_boot_ns
  2021-10-20 12:04 [RFC PATCH v3 0/5] x86/kvm: Virtual suspend time injection support Hikaru Nishida
@ 2021-10-20 12:04 ` Hikaru Nishida
  2021-10-20 12:04 ` [RFC PATCH v3 2/5] kvm/x86: Include asm/pvclock.h in asm/kvmclock.h Hikaru Nishida
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 10+ messages in thread
From: Hikaru Nishida @ 2021-10-20 12:04 UTC (permalink / raw)
  To: linux-kernel, dme, tglx, mlevitsk, linux, pbonzini, vkuznets, maz, will
  Cc: suleiman, senozhatsky, kvmarm, linux-arm-kernel, Hikaru Nishida,
	Arnd Bergmann, Geert Uytterhoeven, Ingo Molnar, John Stultz,
	Linus Walleij, Stephen Boyd

Expose tk->offs_boot so that it can be used by KVM virtual suspend
time injection.

Signed-off-by: Hikaru Nishida <hikalium@chromium.org>
---

Changes in v3:
- Added this patch.

 include/linux/timekeeping.h |  2 ++
 kernel/time/timekeeping.c   | 10 ++++++++++
 2 files changed, 12 insertions(+)

diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
index 78a98bdff76d..f7be69c81dab 100644
--- a/include/linux/timekeeping.h
+++ b/include/linux/timekeeping.h
@@ -179,6 +179,8 @@ extern u64 ktime_get_raw_fast_ns(void);
 extern u64 ktime_get_boot_fast_ns(void);
 extern u64 ktime_get_real_fast_ns(void);
 
+extern u64 ktime_get_offs_boot_ns(void);
+
 /*
  * timespec64/time64_t interfaces utilizing the ktime based ones
  * for API completeness, these could be implemented more efficiently
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index b348749a9fc6..e77580d9f8c1 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -565,6 +565,16 @@ u64 ktime_get_real_fast_ns(void)
 }
 EXPORT_SYMBOL_GPL(ktime_get_real_fast_ns);
 
+/**
+ * ktime_get_offs_boot_ns - Get the offset of CLOCK_BOOTTIME from CLOCK_MONOTONIC.
+ * Return: boottime offset in nanoseconds.
+ */
+u64 ktime_get_offs_boot_ns(void)
+{
+	return ktime_to_ns(tk_core.timekeeper.offs_boot);
+}
+EXPORT_SYMBOL_GPL(ktime_get_offs_boot_ns);
+
 /**
  * ktime_get_fast_timestamps: - NMI safe timestamps
  * @snapshot:	Pointer to timestamp storage
-- 
2.33.0.1079.g6e70778dc9-goog



* [RFC PATCH v3 2/5] kvm/x86: Include asm/pvclock.h in asm/kvmclock.h
  2021-10-20 12:04 [RFC PATCH v3 0/5] x86/kvm: Virtual suspend time injection support Hikaru Nishida
  2021-10-20 12:04 ` [RFC PATCH v3 1/5] timekeeping: Expose tk->offs_boot via ktime_get_offs_boot_ns Hikaru Nishida
@ 2021-10-20 12:04 ` Hikaru Nishida
  2021-10-20 12:04 ` [RFC PATCH v3 3/5] kvm/x86: virtual suspend time injection: Add common definitions Hikaru Nishida
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 10+ messages in thread
From: Hikaru Nishida @ 2021-10-20 12:04 UTC (permalink / raw)
  To: linux-kernel, dme, tglx, mlevitsk, linux, pbonzini, vkuznets, maz, will
  Cc: suleiman, senozhatsky, kvmarm, linux-arm-kernel, Hikaru Nishida,
	Borislav Petkov, H. Peter Anvin, Ingo Molnar, Jim Mattson,
	Joerg Roedel, Sean Christopherson, Wanpeng Li, kvm, x86

Include asm/pvclock.h in asm/kvmclock.h to make
struct pvclock_vsyscall_time_info visible, since kvmclock.h defines
this_cpu_pvti(), which needs the definition of that struct.

Signed-off-by: Hikaru Nishida <hikalium@chromium.org>
---

Changes in v3:
- Added this patch.

 arch/x86/include/asm/kvmclock.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/include/asm/kvmclock.h b/arch/x86/include/asm/kvmclock.h
index 6c5765192102..9add14edc24d 100644
--- a/arch/x86/include/asm/kvmclock.h
+++ b/arch/x86/include/asm/kvmclock.h
@@ -4,6 +4,8 @@
 
 #include <linux/percpu.h>
 
+#include <asm/pvclock.h>
+
 extern struct clocksource kvm_clock;
 
 DECLARE_PER_CPU(struct pvclock_vsyscall_time_info *, hv_clock_per_cpu);
-- 
2.33.0.1079.g6e70778dc9-goog



* [RFC PATCH v3 3/5] kvm/x86: virtual suspend time injection: Add common definitions
  2021-10-20 12:04 [RFC PATCH v3 0/5] x86/kvm: Virtual suspend time injection support Hikaru Nishida
  2021-10-20 12:04 ` [RFC PATCH v3 1/5] timekeeping: Expose tk->offs_boot via ktime_get_offs_boot_ns Hikaru Nishida
  2021-10-20 12:04 ` [RFC PATCH v3 2/5] kvm/x86: Include asm/pvclock.h in asm/kvmclock.h Hikaru Nishida
@ 2021-10-20 12:04 ` Hikaru Nishida
  2021-10-20 12:04 ` [RFC PATCH v3 4/5] kvm/x86: virtual suspend time injection: Implement host side Hikaru Nishida
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 10+ messages in thread
From: Hikaru Nishida @ 2021-10-20 12:04 UTC (permalink / raw)
  To: linux-kernel, dme, tglx, mlevitsk, linux, pbonzini, vkuznets, maz, will
  Cc: suleiman, senozhatsky, kvmarm, linux-arm-kernel, Hikaru Nishida,
	Borislav Petkov, H. Peter Anvin, Ingo Molnar, Jim Mattson,
	Joerg Roedel, Jonathan Corbet, Sean Christopherson, Wanpeng Li,
	kvm, linux-doc, x86

Add the definitions of the MSR, the KVM_FEATURE bit and a structure
called kvm_suspend_time that are used by later patches to support the
virtual suspend time injection mechanism.

Also add documentation for them.

Signed-off-by: Hikaru Nishida <hikalium@chromium.org>
---

Changes in v3:
- Moved the definition of struct kvm_suspend_time into this patch.

 Documentation/virt/kvm/cpuid.rst     |  3 +++
 Documentation/virt/kvm/msr.rst       | 30 ++++++++++++++++++++++++++++
 arch/x86/include/uapi/asm/kvm_para.h |  6 ++++++
 3 files changed, 39 insertions(+)

diff --git a/Documentation/virt/kvm/cpuid.rst b/Documentation/virt/kvm/cpuid.rst
index bda3e3e737d7..f17b95b0d943 100644
--- a/Documentation/virt/kvm/cpuid.rst
+++ b/Documentation/virt/kvm/cpuid.rst
@@ -103,6 +103,9 @@ KVM_FEATURE_HC_MAP_GPA_RANGE       16          guest checks this feature bit bef
 KVM_FEATURE_MIGRATION_CONTROL      17          guest checks this feature bit before
                                                using MSR_KVM_MIGRATION_CONTROL
 
+KVM_FEATURE_HOST_SUSPEND_TIME      18          host suspend time information
+                                               is available at msr 0x4b564d09.
+
 KVM_FEATURE_CLOCKSOURCE_STABLE_BIT 24          host will warn if no guest-side
                                                per-cpu warps are expected in
                                                kvmclock
diff --git a/Documentation/virt/kvm/msr.rst b/Documentation/virt/kvm/msr.rst
index 9315fc385fb0..40ec0fd263ac 100644
--- a/Documentation/virt/kvm/msr.rst
+++ b/Documentation/virt/kvm/msr.rst
@@ -389,3 +389,33 @@ data:
         guest is communicating page encryption status to the host using the
         ``KVM_HC_MAP_GPA_RANGE`` hypercall, it can set bit 0 in this MSR to
         allow live migration of the guest.
+
+MSR_KVM_HOST_SUSPEND_TIME:
+	0x4b564d09
+
+data:
+	8-byte aligned physical address of a memory area, which must be
+	in guest RAM, plus an enable bit in bit 0. This memory is expected to
+	hold a copy of the following structure::
+
+	 struct kvm_suspend_time {
+		__u64   suspend_time_ns;
+	 };
+
+	whose data will be filled in by the hypervisor.
+	If the guest registers this structure through an MSR write, the host
+	will stop all the clocks visible to the guest (including TSCs) during
+	the host's suspend and report the duration of the suspend through
+	this structure. The update will be signalled through the
+	HYPERVISOR_CALLBACK_VECTOR IRQ. Fields have the following meanings:
+
+	suspend_time_ns:
+		Total number of nanoseconds passed during the host's suspend
+		while the VM is running. This value is cumulative and
+		increases monotonically.
+
+	Note that although MSRs are per-CPU entities, the effect of this
+	particular MSR is global.
+
+	Availability of this MSR must be checked via bit 18 in CPUID leaf
+	0x40000001 prior to usage.
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 5146bbab84d4..ccea4e344f46 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -35,6 +35,7 @@
 #define KVM_FEATURE_MSI_EXT_DEST_ID	15
 #define KVM_FEATURE_HC_MAP_GPA_RANGE	16
 #define KVM_FEATURE_MIGRATION_CONTROL	17
+#define KVM_FEATURE_HOST_SUSPEND_TIME	18
 
 #define KVM_HINTS_REALTIME      0
 
@@ -57,6 +58,7 @@
 #define MSR_KVM_ASYNC_PF_INT	0x4b564d06
 #define MSR_KVM_ASYNC_PF_ACK	0x4b564d07
 #define MSR_KVM_MIGRATION_CONTROL	0x4b564d08
+#define MSR_KVM_HOST_SUSPEND_TIME      0x4b564d09
 
 struct kvm_steal_time {
 	__u64 steal;
@@ -79,6 +81,10 @@ struct kvm_clock_pairing {
 	__u32 pad[9];
 };
 
+struct kvm_suspend_time {
+	__u64   suspend_time_ns;
+};
+
 #define KVM_STEAL_ALIGNMENT_BITS 5
 #define KVM_STEAL_VALID_BITS ((-1ULL << (KVM_STEAL_ALIGNMENT_BITS + 1)))
 #define KVM_STEAL_RESERVED_MASK (((1 << KVM_STEAL_ALIGNMENT_BITS) - 1 ) << 1)
-- 
2.33.0.1079.g6e70778dc9-goog



* [RFC PATCH v3 4/5] kvm/x86: virtual suspend time injection: Implement host side
  2021-10-20 12:04 [RFC PATCH v3 0/5] x86/kvm: Virtual suspend time injection support Hikaru Nishida
                   ` (2 preceding siblings ...)
  2021-10-20 12:04 ` [RFC PATCH v3 3/5] kvm/x86: virtual suspend time injection: Add common definitions Hikaru Nishida
@ 2021-10-20 12:04 ` Hikaru Nishida
  2021-10-20 12:04 ` [RFC PATCH v3 5/5] kvm/x86: virtual suspend time injection: Implement guest side Hikaru Nishida
  2021-10-20 13:52 ` [RFC PATCH v3 0/5] x86/kvm: Virtual suspend time injection support Marc Zyngier
  5 siblings, 0 replies; 10+ messages in thread
From: Hikaru Nishida @ 2021-10-20 12:04 UTC (permalink / raw)
  To: linux-kernel, dme, tglx, mlevitsk, linux, pbonzini, vkuznets, maz, will
  Cc: suleiman, senozhatsky, kvmarm, linux-arm-kernel, Hikaru Nishida,
	Borislav Petkov, H. Peter Anvin, Ingo Molnar, Jim Mattson,
	Joerg Roedel, Sean Christopherson, Wanpeng Li, kvm, x86

Add the main logic that adjusts the guest's clocks and notifies the
guest about the suspension.

Adjustment flow:
- Before going into suspend, KVM_REQ_SUSPEND_TIME_ADJ will be
  requested for each vcpu through the PM notifier if suspend time
  injection is enabled for the kvm.
- Before the first vmenter after the resume, each vcpu will check
  the request and do two kinds of adjustments.
  - One is a kvm-wide adjustment: kvm-clock will be adjusted to the value
    before the suspend.
  - The other is a per-vcpu adjustment: the TSC will be adjusted to the
    value before the suspend.
  - Those adjustments happen before the vcpu runs, so the guest will not
    observe the "rewinding" of the clocks.
- After the adjustment is made, the guest will be notified about the
  adjustment through the HYPERVISOR_CALLBACK_VECTOR IRQ.
    - It is the guest's responsibility to adjust its CLOCK_BOOTTIME and
      the wall clock to reflect the suspend.
      This is done in a later patch.

Signed-off-by: Hikaru Nishida <hikalium@chromium.org>
---

Changes in v3:
- Used a PM notifier instead of modifying timekeeping_resume()
  - This avoids holding kvm_lock in an interrupt-disabled context.
- Used KVM_REQ_* to make requests to vcpus.
- Reused HYPERVISOR_CALLBACK_VECTOR IRQ instead of adding a new one.
- Extracted arch-independent parts.

 arch/x86/include/asm/kvm_host.h |   2 +
 arch/x86/kvm/Kconfig            |  13 ++++
 arch/x86/kvm/cpuid.c            |   4 ++
 arch/x86/kvm/x86.c              | 109 ++++++++++++++++++++++++++++++++
 include/linux/kvm_host.h        |  48 ++++++++++++++
 virt/kvm/kvm_main.c             |  88 ++++++++++++++++++++++++++
 6 files changed, 264 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f8f48a7ec577..bdff8f777632 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1085,6 +1085,8 @@ struct kvm_arch {
 	bool pause_in_guest;
 	bool cstate_in_guest;
 
+	u64 msr_suspend_time;
+
 	unsigned long irq_sources_bitmap;
 	s64 kvmclock_offset;
 	raw_spinlock_t tsc_write_lock;
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index ac69894eab88..6d68a4d6be87 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -129,4 +129,17 @@ config KVM_MMU_AUDIT
 	 This option adds a R/W kVM module parameter 'mmu_audit', which allows
 	 auditing of KVM MMU events at runtime.
 
+config KVM_VIRT_SUSPEND_TIMING
+	bool "Host support for virtual suspend time injection"
+	depends on KVM=y && HAVE_KVM_PM_NOTIFIER
+	default n
+	help
+	 This option makes the host's suspend time visible to the guest's
+	 clocks: the guest's CLOCK_MONOTONIC stops while its CLOCK_BOOTTIME
+	 keeps running across the host's suspend.
+	 This feature is only effective when both the guest and the host
+	 support it. For the guest side, see KVM_VIRT_SUSPEND_TIMING_GUEST.
+
+	 If unsure, say N.
+
 endif # VIRTUALIZATION
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 751aa85a3001..34a2fe147503 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -886,6 +886,10 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
 			     (1 << KVM_FEATURE_PV_SCHED_YIELD) |
 			     (1 << KVM_FEATURE_ASYNC_PF_INT);
 
+#ifdef CONFIG_KVM_VIRT_SUSPEND_TIMING
+		entry->eax |= (1 << KVM_FEATURE_HOST_SUSPEND_TIME);
+#endif
+
 		if (sched_info_on())
 			entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index aabd3a2ec1bc..b6d0d7f73196 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1367,6 +1367,7 @@ static const u32 emulated_msrs_all[] = {
 
 	MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
 	MSR_KVM_PV_EOI_EN, MSR_KVM_ASYNC_PF_INT, MSR_KVM_ASYNC_PF_ACK,
+	MSR_KVM_HOST_SUSPEND_TIME,
 
 	MSR_IA32_TSC_ADJUST,
 	MSR_IA32_TSC_DEADLINE,
@@ -3467,6 +3468,19 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		vcpu->arch.msr_kvm_poll_control = data;
 		break;
 
+	case MSR_KVM_HOST_SUSPEND_TIME:
+		if (!guest_pv_has(vcpu, KVM_FEATURE_HOST_SUSPEND_TIME))
+			return 1;
+
+		if (!(data & KVM_MSR_ENABLED))
+			break;
+
+		if (kvm_init_suspend_time_ghc(vcpu->kvm, data & ~1ULL))
+			return 1;
+
+		vcpu->kvm->arch.msr_suspend_time = data;
+		break;
+
 	case MSR_IA32_MCG_CTL:
 	case MSR_IA32_MCG_STATUS:
 	case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_CTL(KVM_MAX_MCE_BANKS) - 1:
@@ -3785,6 +3799,12 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 
 		msr_info->data = vcpu->arch.msr_kvm_poll_control;
 		break;
+	case MSR_KVM_HOST_SUSPEND_TIME:
+		if (!guest_pv_has(vcpu, KVM_FEATURE_HOST_SUSPEND_TIME))
+			return 1;
+
+		msr_info->data = vcpu->kvm->arch.msr_suspend_time;
+		break;
 	case MSR_IA32_P5_MC_ADDR:
 	case MSR_IA32_P5_MC_TYPE:
 	case MSR_IA32_MCG_CAP:
@@ -9392,6 +9412,93 @@ void __kvm_request_immediate_exit(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(__kvm_request_immediate_exit);
 
+#ifdef CONFIG_KVM_VIRT_SUSPEND_TIMING
+bool virt_suspend_time_enabled(struct kvm *kvm)
+{
+	return kvm->arch.msr_suspend_time & KVM_MSR_ENABLED;
+}
+
+/*
+ * Do per-vcpu suspend time adjustment (tsc) and
+ * make an interrupt to notify it.
+ */
+static void vcpu_do_suspend_time_adjustment(struct kvm_vcpu *vcpu,
+					    u64 total_ns)
+{
+	struct kvm_lapic_irq irq = {
+		.delivery_mode = APIC_DM_FIXED,
+		.vector = HYPERVISOR_CALLBACK_VECTOR
+	};
+	u64 last_suspend_duration = 0;
+	s64 adj;
+
+	spin_lock(&vcpu->suspend_time_ns_lock);
+	if (total_ns > vcpu->suspend_time_ns) {
+		last_suspend_duration = total_ns - vcpu->suspend_time_ns;
+		vcpu->suspend_time_ns = total_ns;
+	}
+	spin_unlock(&vcpu->suspend_time_ns_lock);
+
+	if (!last_suspend_duration) {
+		/* It looks like the suspend has not happened yet. Retry. */
+		kvm_make_request(KVM_REQ_SUSPEND_TIME_ADJ, vcpu);
+		return;
+	}
+
+	adj = __this_cpu_read(cpu_tsc_khz) *
+		(last_suspend_duration / 1000000);
+	adjust_tsc_offset_host(vcpu, -adj);
+	/*
+	 * This request should be processed before
+	 * the first vmenter after resume to avoid
+	 * an unadjusted TSC value being observed.
+	 */
+	kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
+	kvm_write_suspend_time(vcpu->kvm);
+	if (!kvm_apic_set_irq(vcpu, &irq, NULL))
+		pr_err("kvm: failed to set suspend time irq\n");
+}
+
+/*
+ * Do kvm-wide suspend time adjustment (kvm-clock).
+ */
+static void kvm_do_suspend_time_adjustment(struct kvm *kvm, u64 total_ns)
+{
+	spin_lock(&kvm->suspend_time_ns_lock);
+	if (total_ns > kvm->suspend_time_ns) {
+		u64 last_suspend_duration = total_ns - kvm->suspend_time_ns;
+		/*
+		 * Move the offset of kvm_clock here as if it is stopped
+		 * during the suspension.
+		 */
+		kvm->arch.kvmclock_offset -= last_suspend_duration;
+
+		/* suspend_time is accumulated per VM. */
+		kvm->suspend_time_ns += last_suspend_duration;
+		/*
+		 * This adjustment will be reflected to the struct provided
+		 * from the guest via MSR_KVM_HOST_SUSPEND_TIME before
+		 * the notification interrupt is injected.
+		 */
+		kvm_make_all_cpus_request(kvm, KVM_REQ_CLOCK_UPDATE);
+	}
+	spin_unlock(&kvm->suspend_time_ns_lock);
+}
+
+static void kvm_adjust_suspend_time(struct kvm_vcpu *vcpu)
+{
+	u64 total_ns = kvm_total_suspend_time(vcpu->kvm);
+	/* Do kvm-wide adjustment (kvm-clock) */
+	kvm_do_suspend_time_adjustment(vcpu->kvm, total_ns);
+	/* Do per-vcpu adjustment (tsc) */
+	vcpu_do_suspend_time_adjustment(vcpu, total_ns);
+}
+#else
+static void kvm_adjust_suspend_time(struct kvm_vcpu *vcpu)
+{
+}
+#endif
+
 /*
  * Returns 1 to let vcpu_run() continue the guest execution loop without
  * exiting to the userspace.  Otherwise, the value will be returned to the
@@ -9421,6 +9528,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 			r = -EIO;
 			goto out;
 		}
+		if (kvm_check_request(KVM_REQ_SUSPEND_TIME_ADJ, vcpu))
+			kvm_adjust_suspend_time(vcpu);
 		if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
 			if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {
 				r = 0;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 0f18df7fe874..ef93c067ceba 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -151,6 +151,7 @@ static inline bool is_error_page(struct page *page)
 #define KVM_REQ_UNBLOCK           2
 #define KVM_REQ_UNHALT            3
 #define KVM_REQ_VM_BUGGED         (4 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
+#define KVM_REQ_SUSPEND_TIME_ADJ  5
 #define KVM_REQUEST_ARCH_BASE     8
 
 #define KVM_ARCH_REQ_FLAGS(nr, flags) ({ \
@@ -336,6 +337,11 @@ struct kvm_vcpu {
 	} async_pf;
 #endif
 
+#ifdef CONFIG_KVM_VIRT_SUSPEND_TIMING
+	u64 suspend_time_ns;
+	spinlock_t suspend_time_ns_lock;
+#endif
+
 #ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
 	/*
 	 * Cpu relax intercept or pause loop exit optimization
@@ -623,6 +629,12 @@ struct kvm {
 	struct notifier_block pm_notifier;
 #endif
 	char stats_id[KVM_STATS_NAME_SIZE];
+#ifdef CONFIG_KVM_VIRT_SUSPEND_TIMING
+	u64 suspend_time_ns;
+	spinlock_t suspend_time_ns_lock;
+	u64 base_offs_boot_ns;
+	struct gfn_to_hva_cache suspend_time_ghc;
+#endif
 };
 
 #define kvm_err(fmt, ...) \
@@ -1829,6 +1841,42 @@ static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu)
 }
 #endif /* CONFIG_KVM_XFER_TO_GUEST_WORK */
 
+#ifdef CONFIG_KVM_VIRT_SUSPEND_TIMING
+bool virt_suspend_time_enabled(struct kvm *kvm);
+void kvm_write_suspend_time(struct kvm *kvm);
+int kvm_init_suspend_time_ghc(struct kvm *kvm, gpa_t gpa);
+static inline u64 kvm_total_suspend_time(struct kvm *kvm)
+{
+	return ktime_get_offs_boot_ns() - kvm->base_offs_boot_ns;
+}
+
+static inline u64 vcpu_suspend_time_injected(struct kvm_vcpu *vcpu)
+{
+	return vcpu->suspend_time_ns;
+}
+#else
+static inline bool virt_suspend_time_enabled(struct kvm *kvm)
+{
+	return 0;
+}
+static inline void kvm_write_suspend_time(struct kvm *kvm)
+{
+}
+static inline int kvm_init_suspend_time_ghc(struct kvm *kvm, gpa_t gpa)
+{
+	return 1;
+}
+static inline u64 kvm_total_suspend_time(struct kvm *kvm)
+{
+	return 0;
+}
+
+static inline u64 vcpu_suspend_time_injected(struct kvm_vcpu *vcpu)
+{
+	return 0;
+}
+#endif /* CONFIG_KVM_VIRT_SUSPEND_TIMING */
+
 /*
  * This defines how many reserved entries we want to keep before we
  * kick the vcpu to the userspace to avoid dirty ring full.  This
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7851f3a1b5f7..a4fedd2455d4 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -425,6 +425,11 @@ static void kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 	vcpu->ready = false;
 	preempt_notifier_init(&vcpu->preempt_notifier, &kvm_preempt_ops);
 	vcpu->last_used_slot = 0;
+
+#ifdef CONFIG_KVM_VIRT_SUSPEND_TIMING
+	vcpu->suspend_time_ns = kvm->suspend_time_ns;
+	spin_lock_init(&vcpu->suspend_time_ns_lock);
+#endif
 }
 
 void kvm_vcpu_destroy(struct kvm_vcpu *vcpu)
@@ -812,12 +817,70 @@ static int kvm_init_mmu_notifier(struct kvm *kvm)
 #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */
 
 #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER
+static int kvm_suspend_notifier(struct kvm *kvm)
+{
+	struct kvm_vcpu *vcpu;
+	int i;
+
+	if (!virt_suspend_time_enabled(kvm))
+		return NOTIFY_DONE;
+
+	mutex_lock(&kvm->lock);
+	kvm_for_each_vcpu(i, vcpu, kvm)
+		kvm_make_request(KVM_REQ_SUSPEND_TIME_ADJ, vcpu);
+	mutex_unlock(&kvm->lock);
+
+	return NOTIFY_DONE;
+}
+
+static int kvm_resume_notifier(struct kvm *kvm)
+{
+	struct kvm_vcpu *vcpu;
+	int i;
+
+	if (!virt_suspend_time_enabled(kvm))
+		return NOTIFY_DONE;
+
+	mutex_lock(&kvm->lock);
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		/*
+		 * Clear KVM_REQ_SUSPEND_TIME_ADJ if the suspend injection is
+		 * not needed (e.g. suspend failure).
+		 * The following condition is also true when the adjustment is
+		 * already done and it is safe to clear the request again here.
+		 */
+		if (kvm_total_suspend_time(kvm) ==
+		    vcpu_suspend_time_injected(vcpu))
+			kvm_clear_request(KVM_REQ_SUSPEND_TIME_ADJ, vcpu);
+	}
+	mutex_unlock(&kvm->lock);
+
+	return NOTIFY_DONE;
+}
+
+static int kvm_pm_notifier(struct kvm *kvm, unsigned long state)
+{
+	switch (state) {
+	case PM_HIBERNATION_PREPARE:
+	case PM_SUSPEND_PREPARE:
+		return kvm_suspend_notifier(kvm);
+	case PM_POST_HIBERNATION:
+	case PM_POST_SUSPEND:
+		return kvm_resume_notifier(kvm);
+	}
+
+	return NOTIFY_DONE;
+}
+
 static int kvm_pm_notifier_call(struct notifier_block *bl,
 				unsigned long state,
 				void *unused)
 {
 	struct kvm *kvm = container_of(bl, struct kvm, pm_notifier);
 
+	if (kvm_pm_notifier(kvm, state) != NOTIFY_DONE)
+		return NOTIFY_BAD;
+
 	return kvm_arch_pm_notifier(kvm, state);
 }
 
@@ -843,6 +906,26 @@ static void kvm_destroy_pm_notifier(struct kvm *kvm)
 }
 #endif /* CONFIG_HAVE_KVM_PM_NOTIFIER */
 
+#ifdef CONFIG_KVM_VIRT_SUSPEND_TIMING
+void kvm_write_suspend_time(struct kvm *kvm)
+{
+	struct kvm_suspend_time st;
+
+	st.suspend_time_ns = kvm->suspend_time_ns;
+	kvm_write_guest_cached(kvm, &kvm->suspend_time_ghc, &st, sizeof(st));
+}
+
+int kvm_init_suspend_time_ghc(struct kvm *kvm, gpa_t gpa)
+{
+	if (kvm_gfn_to_hva_cache_init(kvm, &kvm->suspend_time_ghc, gpa,
+				      sizeof(struct kvm_suspend_time)))
+		return 1;
+
+	kvm_write_suspend_time(kvm);
+	return 0;
+}
+#endif
+
 static struct kvm_memslots *kvm_alloc_memslots(void)
 {
 	int i;
@@ -1080,6 +1163,11 @@ static struct kvm *kvm_create_vm(unsigned long type)
 	if (r)
 		goto out_err_no_disable;
 
+#ifdef CONFIG_KVM_VIRT_SUSPEND_TIMING
+	spin_lock_init(&kvm->suspend_time_ns_lock);
+	kvm->base_offs_boot_ns = ktime_get_offs_boot_ns();
+#endif
+
 #ifdef CONFIG_HAVE_KVM_IRQFD
 	INIT_HLIST_HEAD(&kvm->irq_ack_notifier_list);
 #endif
-- 
2.33.0.1079.g6e70778dc9-goog



* [RFC PATCH v3 5/5] kvm/x86: virtual suspend time injection: Implement guest side
  2021-10-20 12:04 [RFC PATCH v3 0/5] x86/kvm: Virtual suspend time injection support Hikaru Nishida
                   ` (3 preceding siblings ...)
  2021-10-20 12:04 ` [RFC PATCH v3 4/5] kvm/x86: virtual suspend time injection: Implement host side Hikaru Nishida
@ 2021-10-20 12:04 ` Hikaru Nishida
  2021-10-20 13:52 ` [RFC PATCH v3 0/5] x86/kvm: Virtual suspend time injection support Marc Zyngier
  5 siblings, 0 replies; 10+ messages in thread
From: Hikaru Nishida @ 2021-10-20 12:04 UTC (permalink / raw)
  To: linux-kernel, dme, tglx, mlevitsk, linux, pbonzini, vkuznets, maz, will
  Cc: suleiman, senozhatsky, kvmarm, linux-arm-kernel, Hikaru Nishida,
	Andy Lutomirski, Arnd Bergmann, Borislav Petkov, Dave Hansen,
	Geert Uytterhoeven, H. Peter Anvin, Ingo Molnar, Jim Mattson,
	Joerg Roedel, John Stultz, Juergen Gross, Kees Cook,
	Lai Jiangshan, Linus Walleij, Peter Zijlstra,
	Sean Christopherson, Stephen Boyd, Wanpeng Li, kvm, x86

Add the guest-side implementation of KVM virtual suspend time injection.

How it works from the guest's view:
- The guest will be paused without going through the suspend/resume
  path in the guest kernel.
- Before resuming the execution of the guest's vcpus, the host will
  adjust the hardware clock (and kvm-clock) to the time before the
  suspend.
  - With this, the guest's CLOCK_MONOTONIC behaves as expected (it
    stops during the host's suspend).
- The guest will receive an IRQ from the host that notifies it about
  the suspend, which was otherwise invisible to the guest. In the
  handler, the guest can adjust its CLOCK_BOOTTIME to reflect the
  suspend.
  - Now CLOCK_BOOTTIME includes the time passed during the host's
    suspend.

Signed-off-by: Hikaru Nishida <hikalium@chromium.org>
---

Changes in v3:
- Reused HYPERVISOR_CALLBACK_VECTOR IRQ instead of adding a new one.
- Extracted arch-independent parts.

 arch/x86/Kconfig                    | 13 ++++++++
 arch/x86/include/asm/idtentry.h     |  2 +-
 arch/x86/include/asm/kvmclock.h     |  9 ++++++
 arch/x86/kernel/kvm.c               | 14 ++++++---
 arch/x86/kernel/kvmclock.c          | 26 ++++++++++++++++
 arch/x86/mm/fault.c                 |  2 +-
 include/linux/timekeeper_internal.h |  5 ++++
 include/linux/timekeeping.h         |  4 +++
 kernel/time/timekeeping.c           | 46 +++++++++++++++++++++++++++++
 9 files changed, 115 insertions(+), 6 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d9830e7e1060..1d4a529d1577 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -824,6 +824,19 @@ config KVM_GUEST
 	  underlying device model, the host provides the guest with
 	  timing infrastructure such as time of day, and system time
 
+config KVM_VIRT_SUSPEND_TIMING_GUEST
+	bool "Guest support for virtual suspend time injection"
+	depends on KVM_GUEST
+	default n
+	help
+	 This option makes the host's suspend time visible to the guest's
+	 clocks: the guest's CLOCK_MONOTONIC stops while its CLOCK_BOOTTIME
+	 keeps running across the host's suspend.
+	 This feature is only effective when both the guest and the host
+	 support it. For the host side, see KVM_VIRT_SUSPEND_TIMING.
+
+	 If unsure, say N.
+
 config ARCH_CPUIDLE_HALTPOLL
 	def_bool n
 	prompt "Disable host haltpoll when loading haltpoll driver"
diff --git a/arch/x86/include/asm/idtentry.h b/arch/x86/include/asm/idtentry.h
index 1345088e9902..5e30f84ea07e 100644
--- a/arch/x86/include/asm/idtentry.h
+++ b/arch/x86/include/asm/idtentry.h
@@ -686,7 +686,7 @@ DECLARE_IDTENTRY_SYSVEC(HYPERVISOR_CALLBACK_VECTOR,	sysvec_xen_hvm_callback);
 #endif
 
 #ifdef CONFIG_KVM_GUEST
-DECLARE_IDTENTRY_SYSVEC(HYPERVISOR_CALLBACK_VECTOR,	sysvec_kvm_asyncpf_interrupt);
+DECLARE_IDTENTRY_SYSVEC(HYPERVISOR_CALLBACK_VECTOR,	sysvec_kvm_hv_callback);
 #endif
 
 #undef X86_TRAP_OTHER
diff --git a/arch/x86/include/asm/kvmclock.h b/arch/x86/include/asm/kvmclock.h
index 9add14edc24d..2bf1a5c92319 100644
--- a/arch/x86/include/asm/kvmclock.h
+++ b/arch/x86/include/asm/kvmclock.h
@@ -20,4 +20,13 @@ static inline struct pvclock_vsyscall_time_info *this_cpu_hvclock(void)
 	return this_cpu_read(hv_clock_per_cpu);
 }
 
+#ifdef CONFIG_KVM_VIRT_SUSPEND_TIMING_GUEST
+u64 kvm_get_suspend_time(void);
+#else
+static inline u64 kvm_get_suspend_time(void)
+{
+	return 0;
+}
+#endif
+
 #endif /* _ASM_X86_KVM_CLOCK_H */
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index b656456c3a94..3d84ef6d9df2 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -40,6 +40,7 @@
 #include <asm/ptrace.h>
 #include <asm/reboot.h>
 #include <asm/svm.h>
+#include <asm/kvmclock.h>
 
 DEFINE_STATIC_KEY_FALSE(kvm_async_pf_enabled);
 
@@ -270,7 +271,7 @@ noinstr bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token)
 	return true;
 }
 
-DEFINE_IDTENTRY_SYSVEC(sysvec_kvm_asyncpf_interrupt)
+DEFINE_IDTENTRY_SYSVEC(sysvec_kvm_hv_callback)
 {
 	struct pt_regs *old_regs = set_irq_regs(regs);
 	u32 token;
@@ -286,6 +287,8 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_kvm_asyncpf_interrupt)
 		wrmsrl(MSR_KVM_ASYNC_PF_ACK, 1);
 	}
 
+	timekeeping_inject_virtual_suspend_time(kvm_get_suspend_time());
+
 	set_irq_regs(old_regs);
 }
 
@@ -710,10 +713,13 @@ static void __init kvm_guest_init(void)
 	if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
 		apic_set_eoi_write(kvm_guest_apic_eoi_write);
 
-	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF_INT) && kvmapf) {
+	if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF_INT) && kvmapf)
 		static_branch_enable(&kvm_async_pf_enabled);
-		alloc_intr_gate(HYPERVISOR_CALLBACK_VECTOR, asm_sysvec_kvm_asyncpf_interrupt);
-	}
+
+	if ((kvm_para_has_feature(KVM_FEATURE_ASYNC_PF_INT) && kvmapf) ||
+	    kvm_para_has_feature(KVM_FEATURE_HOST_SUSPEND_TIME))
+		alloc_intr_gate(HYPERVISOR_CALLBACK_VECTOR,
+				asm_sysvec_kvm_hv_callback);
 
 #ifdef CONFIG_SMP
 	if (pv_tlb_flush_supported()) {
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 73c74b961d0f..3e16d0ab79f3 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -16,11 +16,15 @@
 #include <linux/mm.h>
 #include <linux/slab.h>
 #include <linux/set_memory.h>
+#include <linux/interrupt.h>
+#include <linux/irq.h>
 
 #include <asm/hypervisor.h>
 #include <asm/mem_encrypt.h>
 #include <asm/x86_init.h>
 #include <asm/kvmclock.h>
+#include <asm/desc.h>
+#include <asm/idtentry.h>
 
 static int kvmclock __initdata = 1;
 static int kvmclock_vsyscall __initdata = 1;
@@ -48,6 +52,9 @@ early_param("no-kvmclock-vsyscall", parse_no_kvmclock_vsyscall);
 
 static struct pvclock_vsyscall_time_info
 			hv_clock_boot[HVC_BOOT_ARRAY_SIZE] __bss_decrypted __aligned(PAGE_SIZE);
+#ifdef CONFIG_KVM_VIRT_SUSPEND_TIMING_GUEST
+static struct kvm_suspend_time suspend_time __bss_decrypted;
+#endif
 static struct pvclock_wall_clock wall_clock __bss_decrypted;
 static struct pvclock_vsyscall_time_info *hvclock_mem;
 DEFINE_PER_CPU(struct pvclock_vsyscall_time_info *, hv_clock_per_cpu);
@@ -281,6 +288,17 @@ static int kvmclock_setup_percpu(unsigned int cpu)
 	return p ? 0 : -ENOMEM;
 }
 
+#ifdef CONFIG_KVM_VIRT_SUSPEND_TIMING_GUEST
+/**
+ * kvm_get_suspend_time - duration of host suspend.
+ * Return: Cumulative duration of host suspend in nanoseconds.
+ */
+u64 kvm_get_suspend_time(void)
+{
+	return suspend_time.suspend_time_ns;
+}
+#endif
+
 void __init kvmclock_init(void)
 {
 	u8 flags;
@@ -295,6 +313,14 @@ void __init kvmclock_init(void)
 		return;
 	}
 
+#ifdef CONFIG_KVM_VIRT_SUSPEND_TIMING_GUEST
+	if (kvm_para_has_feature(KVM_FEATURE_HOST_SUSPEND_TIME)) {
+		/* Register the suspend time structure */
+		wrmsrl(MSR_KVM_HOST_SUSPEND_TIME,
+		       slow_virt_to_phys(&suspend_time) | KVM_MSR_ENABLED);
+	}
+#endif
+
 	if (cpuhp_setup_state(CPUHP_BP_PREPARE_DYN, "kvmclock:setup_percpu",
 			      kvmclock_setup_percpu, NULL) < 0) {
 		return;
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 84a2c8c4af73..f36f49585d5d 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1509,7 +1509,7 @@ DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
 	 * memory is swapped out). Note, the corresponding "page ready" event
 	 * which is injected when the memory becomes available, is delivered via
 	 * an interrupt mechanism and not a #PF exception
-	 * (see arch/x86/kernel/kvm.c: sysvec_kvm_asyncpf_interrupt()).
+	 * (see arch/x86/kernel/kvm.c: sysvec_kvm_hv_callback()).
 	 *
 	 * We are relying on the interrupted context being sane (valid RSP,
 	 * relevant locks not held, etc.), which is fine as long as the
diff --git a/include/linux/timekeeper_internal.h b/include/linux/timekeeper_internal.h
index 84ff2844df2a..0d5b29122d40 100644
--- a/include/linux/timekeeper_internal.h
+++ b/include/linux/timekeeper_internal.h
@@ -68,6 +68,8 @@ struct tk_read_base {
  *			shifted nano seconds.
  * @ntp_error_shift:	Shift conversion between clock shifted nano seconds and
  *			ntp shifted nano seconds.
+ * @kvm_suspend_time:	The cumulative duration of suspend injected through KVM
+ *			in nano seconds.
  * @last_warning:	Warning ratelimiter (DEBUG_TIMEKEEPING)
  * @underflow_seen:	Underflow warning flag (DEBUG_TIMEKEEPING)
  * @overflow_seen:	Overflow warning flag (DEBUG_TIMEKEEPING)
@@ -124,6 +126,9 @@ struct timekeeper {
 	u32			ntp_err_mult;
 	/* Flag used to avoid updating NTP twice with same second */
 	u32			skip_second_overflow;
+#ifdef CONFIG_KVM_VIRT_SUSPEND_TIMING_GUEST
+	u64			kvm_suspend_time;
+#endif
 #ifdef CONFIG_DEBUG_TIMEKEEPING
 	long			last_warning;
 	/*
diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
index f7be69c81dab..a2228300c3f9 100644
--- a/include/linux/timekeeping.h
+++ b/include/linux/timekeeping.h
@@ -310,4 +310,8 @@ void read_persistent_wall_and_boot_offset(struct timespec64 *wall_clock,
 extern int update_persistent_clock64(struct timespec64 now);
 #endif
 
+#ifdef CONFIG_KVM_VIRT_SUSPEND_TIMING_GUEST
+void timekeeping_inject_virtual_suspend_time(u64 total_duration_ns);
+#endif
+
 #endif
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index e77580d9f8c1..5f474cde0bae 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -2133,6 +2133,52 @@ static u64 logarithmic_accumulation(struct timekeeper *tk, u64 offset,
 	return offset;
 }
 
+#ifdef CONFIG_KVM_VIRT_SUSPEND_TIMING_GUEST
+/**
+ * timekeeping_inject_virtual_suspend_time - Inject virtual suspend time
+ * when requested by the kvm host.
+ * @total_duration_ns:	Total suspend time to be injected in nanoseconds.
+ */
+void timekeeping_inject_virtual_suspend_time(u64 total_duration_ns)
+{
+	struct timekeeper *tk = &tk_core.timekeeper;
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&timekeeper_lock, flags);
+	if (total_duration_ns > tk->kvm_suspend_time) {
+		/*
+		 * Only inject time that has not been injected yet.
+		 * total_duration_ns and tk->kvm_suspend_time are both
+		 * cumulative, so the delta between them is the amount to
+		 * inject. For example, if the host suspends twice while
+		 * the guest is running and each suspend lasts 5 seconds,
+		 * total_duration_ns will be 5 seconds at the first injection
+		 * and tk->kvm_suspend_time is initialized to zero, so the
+		 * adjustment injected here will be 5 - 0 = 5 seconds and
+		 * tk->kvm_suspend_time will be updated to 5 seconds.
+		 * On the second injection after the second resume,
+		 * total_duration_ns will be 10 seconds and
+		 * tk->kvm_suspend_time will be 5 seconds so 10 - 5 = 5 seconds
+		 * of the suspend time will be injected again.
+		 */
+		struct timespec64 delta =
+			ns_to_timespec64(total_duration_ns -
+					 tk->kvm_suspend_time);
+		tk->kvm_suspend_time = total_duration_ns;
+
+		write_seqcount_begin(&tk_core.seq);
+		timekeeping_forward_now(tk);
+		__timekeeping_inject_sleeptime(tk, &delta);
+		timekeeping_update(tk, TK_CLEAR_NTP | TK_MIRROR | TK_CLOCK_WAS_SET);
+		write_seqcount_end(&tk_core.seq);
+
+		/* signal hrtimers about time change */
+		clock_was_set_delayed();
+	}
+	raw_spin_unlock_irqrestore(&timekeeper_lock, flags);
+}
+#endif
+
 /*
  * timekeeping_advance - Updates the timekeeper to the current time and
  * current NTP tick length
-- 
2.33.0.1079.g6e70778dc9-goog


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH v3 0/5] x86/kvm: Virtual suspend time injection support
  2021-10-20 12:04 [RFC PATCH v3 0/5] x86/kvm: Virtual suspend time injection support Hikaru Nishida
                   ` (4 preceding siblings ...)
  2021-10-20 12:04 ` [RFC PATCH v3 5/5] kvm/x86: virtual suspend time injection: Implement guest side Hikaru Nishida
@ 2021-10-20 13:52 ` Marc Zyngier
  2021-11-04  9:10   ` Hikaru Nishida
  5 siblings, 1 reply; 10+ messages in thread
From: Marc Zyngier @ 2021-10-20 13:52 UTC (permalink / raw)
  To: Hikaru Nishida
  Cc: linux-kernel, dme, tglx, mlevitsk, linux, pbonzini, vkuznets,
	will, suleiman, senozhatsky, kvmarm, linux-arm-kernel,
	Andy Lutomirski, Arnd Bergmann, Borislav Petkov, Dave Hansen,
	Geert Uytterhoeven, H. Peter Anvin, Ingo Molnar, Ingo Molnar,
	Jim Mattson, Joerg Roedel, John Stultz, Jonathan Corbet,
	Juergen Gross, Kees Cook, Lai Jiangshan, Linus Walleij,
	Peter Zijlstra, Sean Christopherson, Stephen Boyd, Wanpeng Li,
	kvm, linux-doc, x86

Hi Hikaru,

On Wed, 20 Oct 2021 13:04:25 +0100,
Hikaru Nishida <hikalium@chromium.org> wrote:
> 
> 
> Hi,
> 
> This patch series adds virtual suspend time injection support to KVM.
> It is an updated version of the following series:
> v2:
> https://lore.kernel.org/kvm/20210806100710.2425336-1-hikalium@chromium.org/
> v1:
> https://lore.kernel.org/kvm/20210426090644.2218834-1-hikalium@chromium.org/
> 
> Please take a look again.
> 
> To kvm/arm64 folks:
> I'm going to implement this mechanism to ARM64 as well but not
> sure which function should be used to make an IRQ (like kvm_apic_set_irq
> in x86) and if it is okay to use kvm_gfn_to_hva_cache /
> kvm_write_guest_cached for sharing the suspend duration.

Before we discuss interrupt injection, I want to understand what this
is doing, and how this is doing it. And more precisely, I want to find
out how you solve the various problems described by Thomas here [1].

Assuming you solve these, you should model the guest memory access
similarly to what we do for stolen time. As for injecting an
interrupt, why can't this be a userspace thing?

Thanks,

	M.

[1] https://lore.kernel.org/all/871r557jls.ffs@tglx


-- 
Without deviation from the norm, progress is not possible.


* Re: [RFC PATCH v3 0/5] x86/kvm: Virtual suspend time injection support
  2021-10-20 13:52 ` [RFC PATCH v3 0/5] x86/kvm: Virtual suspend time injection support Marc Zyngier
@ 2021-11-04  9:10   ` Hikaru Nishida
  2021-12-04 17:30     ` Marc Zyngier
  0 siblings, 1 reply; 10+ messages in thread
From: Hikaru Nishida @ 2021-11-04  9:10 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: linux-kernel, dme, tglx, mlevitsk, linux, pbonzini, vkuznets,
	will, suleiman, senozhatsky, kvmarm, linux-arm-kernel,
	Andy Lutomirski, Arnd Bergmann, Borislav Petkov, Dave Hansen,
	Geert Uytterhoeven, H. Peter Anvin, Ingo Molnar, Ingo Molnar,
	Jim Mattson, Joerg Roedel, John Stultz, Jonathan Corbet,
	Juergen Gross, Kees Cook, Lai Jiangshan, Linus Walleij,
	Peter Zijlstra, Sean Christopherson, Stephen Boyd, Wanpeng Li,
	kvm, linux-doc, x86

Hi Marc,

Thanks for the comments! (Sorry for the late reply)

On Wed, Oct 20, 2021 at 10:52 PM Marc Zyngier <maz@kernel.org> wrote:
>
> Hi Hikaru,
>
> On Wed, 20 Oct 2021 13:04:25 +0100,
> Hikaru Nishida <hikalium@chromium.org> wrote:
> >
> >
> > Hi,
> >
> > This patch series adds virtual suspend time injection support to KVM.
> > It is an updated version of the following series:
> > v2:
> > https://lore.kernel.org/kvm/20210806100710.2425336-1-hikalium@chromium.org/
> > v1:
> > https://lore.kernel.org/kvm/20210426090644.2218834-1-hikalium@chromium.org/
> >
> > Please take a look again.
> >
> > To kvm/arm64 folks:
> > I'm going to implement this mechanism to ARM64 as well but not
> > sure which function should be used to make an IRQ (like kvm_apic_set_irq
> > in x86) and if it is okay to use kvm_gfn_to_hva_cache /
> > kvm_write_guest_cached for sharing the suspend duration.
>
> Before we discuss interrupt injection, I want to understand what this
> is doing, and how this is doing it. And more precisely, I want to find
> out how you solve the various problems described by Thomas here [1].

The problems described by Thomas in the thread were:
- User space or kernel space can observe the stale timestamp before
the adjustment
  - Moving CLOCK_MONOTONIC forward will trigger all sorts of timeouts,
watchdogs, etc...
- The last attempt to make CLOCK_MONOTONIC behave like CLOCK_BOOTTIME
was reverted within 3 weeks. a3ed0e4393d6 ("Revert: Unify
CLOCK_MONOTONIC and CLOCK_BOOTTIME")
  - CLOCK_MONOTONIC correctness (stops during the suspend) should be maintained.

I agree with the points above. However, the current CLOCK_MONOTONIC
behavior in a KVM guest does not follow them: it advances during the
host's suspension.
This causes the problems described above (watchdog timeouts, etc.),
so my patches fix it in roughly two steps:
1. Stop the guest's clocks during the host's suspension.
2. Adjust CLOCK_BOOTTIME afterwards.
This makes the guest's clocks behave as the host's do; it does not
make CLOCK_MONOTONIC behave like CLOCK_BOOTTIME.

The first step is a bit tricky since the guest can use a per-CPU
timestamp counter (the TSC on x86), and we need to adjust it without
the guest kernel observing stale values, to prevent CLOCK_MONOTONIC
from rewinding (which is our top priority for keeping the kernel
happy). To achieve this, my patch adjusts the TSCs (and the
kvm-clock) before the first run of each vcpu after the resume.

The second step is relatively safe: CLOCK_BOOTTIME can already jump
forward on suspend/resume even without my patches, and that does not
break the monotonicity of the clocks, so we can do it through an IRQ.

[1] shows the flow of the adjustment logic, and [2] shows how the
clocks behave in the guest and the host before/after my patches.
The numbers on the steps in [1] correspond to the timings shown in
[2]. The left side of [2] shows the behavior of the clocks before
the patches, and the right side after the patches. The upper charts
show the guest clocks, and the bottom charts the host clocks.

Before the patches (left side), CLOCK_MONOTONIC appears to jump from
the guest's perspective after the host's suspension. As Thomas says,
large jumps of CLOCK_MONOTONIC may lead to watchdog timeouts and other
bad things that we want to avoid.
With the patches (right side), both clocks are adjusted (t=4,5) as
if they had stopped during the suspension. This adjustment is done on
the host side and is invisible to the guest, since it happens before
the first vcpu run after the resume. After that, CLOCK_BOOTTIME is
adjusted from the guest side, triggered by the IRQ sent from the host.

[1]: https://hikalium.com/files/kvm_virt_suspend_time_seq.png
[2]: https://hikalium.com/files/kvm_virt_suspend_time_clocks.png


>
> Assuming you solve these, you should model the guest memory access
> similarly to what we do for stolen time. As for injecting an
> interrupt, why can't this be a userspace thing?

Since CLOCK_BOOTTIME is calculated by adding a gap
(tk->monotonic_to_boot) to CLOCK_MONOTONIC, and there is no way to
change that value from outside the guest kernel, we have to
implement some mechanism in the guest kernel to adjust it.
(I actually tried adding a sysfs interface to modify the gap [3],
but learned that that is not a good idea...)

[3]: https://lore.kernel.org/lkml/87eehoax14.fsf@nanos.tec.linutronix.de/

Thank you,

Hikaru Nishida

>
> Thanks,
>
>         M.
>
> [1] https://lore.kernel.org/all/871r557jls.ffs@tglx
>
>
> --
> Without deviation from the norm, progress is not possible.


* Re: [RFC PATCH v3 0/5] x86/kvm: Virtual suspend time injection support
  2021-11-04  9:10   ` Hikaru Nishida
@ 2021-12-04 17:30     ` Marc Zyngier
  2022-04-06  3:48       ` Hikaru Nishida
  0 siblings, 1 reply; 10+ messages in thread
From: Marc Zyngier @ 2021-12-04 17:30 UTC (permalink / raw)
  To: Hikaru Nishida
  Cc: linux-kernel, dme, tglx, mlevitsk, linux, pbonzini, vkuznets,
	will, suleiman, senozhatsky, kvmarm, linux-arm-kernel,
	Andy Lutomirski, Arnd Bergmann, Borislav Petkov, Dave Hansen,
	Geert Uytterhoeven, H. Peter Anvin, Ingo Molnar, Ingo Molnar,
	Jim Mattson, Joerg Roedel, John Stultz, Jonathan Corbet,
	Juergen Gross, Kees Cook, Lai Jiangshan, Linus Walleij,
	Peter Zijlstra, Sean Christopherson, Stephen Boyd, Wanpeng Li,
	kvm, linux-doc, x86

Hi Hikaru,

Apologies for the much delayed reply.

> The problems described by Thomas in the thread was:
> - User space or kernel space can observe the stale timestamp before
> the adjustment
>   - Moving CLOCK_MONOTONIC forward will trigger all sorts of timeouts,
> watchdogs, etc...
> - The last attempt to make CLOCK_MONOTONIC behave like CLOCK_BOOTTIME
> was reverted within 3 weeks. a3ed0e4393d6 ("Revert: Unify
> CLOCK_MONOTONIC and CLOCK_BOOTTIME")
>   - CLOCK_MONOTONIC correctness (stops during the suspend) should be maintained.
> 
> I agree with the points above. And, the current CLOCK_MONOTONIC
> behavior in the KVM guest is not aligned with the statements above.
> (it advances during the host's suspension.)
> This causes the problems described above (triggering watchdog
> timeouts, etc...) so my patches are going to fix this by 2 steps
> roughly:
> 1. Stopping the guest's clocks during the host's suspension
> 2. Adjusting CLOCK_BOOTTIME later
> This will make the clocks behave like the host does, not making
> CLOCK_MONOTONIC behave like CLOCK_BOOTTIME.
> 
> First one is a bit tricky since the guest can use a timestamp counter
> in each CPUs (TSC in x86) and we need to adjust it without stale
> values are observed by the guest kernel to prevent rewinding of
> CLOCK_MONOTONIC (which is our top priority to make the kernel happy).
> To achieve this, my patch adjusts TSCs (and a kvm-clock) before the
> first vcpu runs of each vcpus after the resume.
> 
> Second one is relatively safe: since jumping CLOCK_BOOTTIME forward
> can happen even before my patches when suspend/resume happens, and
> that will not break the monotonicity of the clocks, we can do that
> through IRQ.
> 
> [1] shows the flow of the adjustment logic, and [2] shows how the
> clocks behave in the guest and the host before/after my patches.
> The numbers on each step in [1] corresponds to the timing shown in [2].
> The left side of [2] is showing the behavior of the clocks before the
> patches, and the right side shows after the patches. Also, upper
> charts show the guest clocks, and bottom charts are host clocks.
> 
> Before the patches(left side), CLOCK_MONOTONIC seems to be jumped from
> the guest's perspective after the host's suspension. As Thomas says,
> large jumps of CLOCK_MONOTONIC may lead to watchdog timeouts and other
> bad things that we want to avoid.
> With the patches(right side), both clocks will be adjusted (t=4,5) as
> if they are stopped during the suspension. This adjustment is done by
> the host side and invisible to the guest since it is done before the
> first vcpu run after the resume. After that, CLOCK_BOOTTIME will be
> adjusted from the guest side, triggered by the IRQ sent from the host.
> 
> [1]: https://hikalium.com/files/kvm_virt_suspend_time_seq.png
> [2]: https://hikalium.com/files/kvm_virt_suspend_time_clocks.png

Thanks for the very detailed explanation. You have obviously thought
about this, and it makes sense.

My worry is that this looks to be designed for the needs of Linux on
x86, and does not match the reality of KVM on arm64, where there is no
KVM clock (there is no need for it, and I have no plan to support it),
and there is more than a single counter visible to the guest (at least
two, and up to four with NV, all with various offsets). This also
deals with concepts that are Linux-specific. How would it work for
another (arbitrary) guest operating system?

Can we please take a step back and look at what we want to expose from
a hypervisor PoV? It seems to me that all we want is:

(1) tell the guest that time has moved forward
(2) tell the guest by how much time has moved forward

In a way, this isn't different from stolen time, only that it affects
the whole VM and not just a single CPU (and for a much longer quantum
of time).

How the event is handled by the guest (what it means for its clocks
and all that) is a guest problem. Why should KVM itself adjust the
counters? This goes against what the architecture specifies (the
counter is in an always-on domain that keeps counting when suspended),
and KVM must preserve the architectural guarantees instead of
deviating from them.

> > Assuming you solve these, you should model the guest memory access
> > similarly to what we do for stolen time. As for injecting an
> > interrupt, why can't this be a userspace thing?
> 
> Since CLOCK_BOOTTIME is calculated by adding a gap
> (tk->monotonic_to_boot) to CLOCK_MONOTONIC, and there are no way to
> change the value from the outside of the guest kernel, we should
> implement some mechanism in the kernel to adjust it.
> (Actually, I tried to add a sysfs interface to modify the gap [3], but
> I learned that that is not a good idea...)

It is not what I was suggesting.

My suggestion was to have shared memory between the VMM and the
guest (again, similar to the way we handle stolen time), let the VMM
expose the drift in this shared memory, and inject an interrupt from
userspace to signify a wake-up. All this requires is that on suspend,
you force the vcpus to exit. On resume, the VMM updates the
guest-visible drift, posts an interrupt, and lets things rip.

This requires very minimal KVM support, and squarely places the logic
in the guest. Why doesn't this work?

Another question is maybe even more naive: on bare metal, we don't
need any of this. The system suspends, resumes, and recovers well
enough. Nobody hides anything, and yet everything works just fine.
That's because the kernel knows it is being suspended, and it acts
accordingly. It looks to me that there is some value in following the
same principles, even if this means that the host suspend has to
synchronise with the guest being suspended.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.


* Re: [RFC PATCH v3 0/5] x86/kvm: Virtual suspend time injection support
  2021-12-04 17:30     ` Marc Zyngier
@ 2022-04-06  3:48       ` Hikaru Nishida
  0 siblings, 0 replies; 10+ messages in thread
From: Hikaru Nishida @ 2022-04-06  3:48 UTC (permalink / raw)
  To: Marc Zyngier
  Cc: Hikaru Nishida, linux-kernel, dme, Thomas Gleixner, mlevitsk,
	linux, pbonzini, vkuznets, will, Suleiman Souhlal,
	Sergey Senozhatsky, kvmarm, linux-arm-kernel, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Dave Hansen, Geert Uytterhoeven,
	H. Peter Anvin, Ingo Molnar, Ingo Molnar, Jim Mattson,
	Joerg Roedel, John Stultz, Jonathan Corbet, Juergen Gross,
	Kees Cook, Lai Jiangshan, Linus Walleij, Peter Zijlstra,
	Sean Christopherson, Stephen Boyd, Wanpeng Li, KVM list,
	linux-doc, x86, Sangwhan Moon

Hi Marc,

Thanks for the reply! (And sorry for the late reply...)

On Sun, Dec 5, 2021 at 2:30 AM Marc Zyngier <maz@kernel.org> wrote:
>
> Hi Hikaru,
>
> Apologies for the much delayed reply.
>
> > The problems described by Thomas in the thread was:
> > - User space or kernel space can observe the stale timestamp before
> > the adjustment
> >   - Moving CLOCK_MONOTONIC forward will trigger all sorts of timeouts,
> > watchdogs, etc...
> > - The last attempt to make CLOCK_MONOTONIC behave like CLOCK_BOOTTIME
> > was reverted within 3 weeks. a3ed0e4393d6 ("Revert: Unify
> > CLOCK_MONOTONIC and CLOCK_BOOTTIME")
> >   - CLOCK_MONOTONIC correctness (stops during the suspend) should be maintained.
> >
> > I agree with the points above. And, the current CLOCK_MONOTONIC
> > behavior in the KVM guest is not aligned with the statements above.
> > (it advances during the host's suspension.)
> > This causes the problems described above (triggering watchdog
> > timeouts, etc...) so my patches are going to fix this by 2 steps
> > roughly:
> > 1. Stopping the guest's clocks during the host's suspension
> > 2. Adjusting CLOCK_BOOTTIME later
> > This will make the clocks behave like the host does, not making
> > CLOCK_MONOTONIC behave like CLOCK_BOOTTIME.
> >
> > First one is a bit tricky since the guest can use a timestamp counter
> > in each CPUs (TSC in x86) and we need to adjust it without stale
> > values are observed by the guest kernel to prevent rewinding of
> > CLOCK_MONOTONIC (which is our top priority to make the kernel happy).
> > To achieve this, my patch adjusts TSCs (and a kvm-clock) before the
> > first vcpu runs of each vcpus after the resume.
> >
> > Second one is relatively safe: since jumping CLOCK_BOOTTIME forward
> > can happen even before my patches when suspend/resume happens, and
> > that will not break the monotonicity of the clocks, we can do that
> > through IRQ.
> >
> > [1] shows the flow of the adjustment logic, and [2] shows how the
> > clocks behave in the guest and the host before/after my patches.
> > The numbers on each step in [1] corresponds to the timing shown in [2].
> > The left side of [2] is showing the behavior of the clocks before the
> > patches, and the right side shows after the patches. Also, upper
> > charts show the guest clocks, and bottom charts are host clocks.
> >
> > Before the patches(left side), CLOCK_MONOTONIC seems to be jumped from
> > the guest's perspective after the host's suspension. As Thomas says,
> > large jumps of CLOCK_MONOTONIC may lead to watchdog timeouts and other
> > bad things that we want to avoid.
> > With the patches(right side), both clocks will be adjusted (t=4,5) as
> > if they are stopped during the suspension. This adjustment is done by
> > the host side and invisible to the guest since it is done before the
> > first vcpu run after the resume. After that, CLOCK_BOOTTIME will be
> > adjusted from the guest side, triggered by the IRQ sent from the host.
> >
> > [1]: https://hikalium.com/files/kvm_virt_suspend_time_seq.png
> > [2]: https://hikalium.com/files/kvm_virt_suspend_time_clocks.png
>
> Thanks for the very detailed explanation. You obviously have though
> about this, and it makes sense.
>
> My worry is that this looks to be designed for the needs of Linux on
> x86, and does not match the reality of KVM on arm64, where there is no
> KVM clock (there is no need for it, and I have no plan to support it),
> and there is more than a single counter visible to the guest (at least
> two, and up to four with NV, all with various offsets). This also
> deals with concepts that are Linux-specific. How would it work for
> another (arbitrary) guest operating system?

Whether the architecture has a kvm-clock is not a problem.
The high-level requirements for implementing this feature are:
- the host hypervisor can adjust the offsets of the virtualized
hardware clocks (TSC and kvm-clock on x86, the generic timer on
aarch64)
- the host hypervisor can notify the guest kernel about the host's
suspension (via interrupts, etc.)
- the host hypervisor can share the duration to be injected with the
guest (via shared memory, registers, etc.)
so I believe it can be implemented on aarch64 as well.

This logic is designed only for Linux guests, since the problem we
want to solve is Linux-specific
(CLOCK_MONOTONIC vs CLOCK_BOOTTIME).

>
> Can we please take a step back and look at what we want to expose from
> a hypervisor PoV? It seems to me that all we want is:
>
> (1) tell the guest that time has moved forward
> (2) tell the guest by how much time has moved forward
>
> In a way, this isn't different from stolen time, only that it affects
> the whole VM and not just a single CPU (and for a much longer quantum
> of time).
>

(1) and (2) above may be implemented outside of the host kernel,
but there is another, most important, step (0):

(0) adjust the clocks to compensate for how much they have advanced
over the period of suspension, *before* any vcpu resumes.

(0) cannot be done from outside the host kernel (as far as I have
tried), since there is no way for userland to guarantee that the
adjustment happens before the first vcpu run after the host's
resume.
Suspending the VM from the VMM before the host's suspend will not
always work, since we can't guarantee that the VMM has stopped the
VM before the host kernel enters suspend.
That's why we implemented this feature with a modification on the host side.

As described above, just telling the guest that time has moved
forward is not enough to solve the problem: CLOCK_MONOTONIC would
still jump forward after the host's suspension, which can cause bad
things like watchdog timeouts, etc.

> How the event is handled by the guest (what it means for its clocks
> and all that) is a guest problem. Why should KVM itself adjust the
> counters? This goes against what the architecture specifies (the
> counter is in an always-on domain that keeps counting when suspended),
> and KVM must preserve the architectural guarantees instead of
> deviating from them.
>

The counters need to be adjusted by KVM because:
1. We cannot adjust the guest's CLOCK_MONOTONIC from inside the
guest, since that breaks its monotonicity.
2. The counters are read directly by guest userland through the fast
clock interface (the vDSO).
So the only way to adjust the counters without breaking their
monotonicity is to do the adjustment outside of the guest.

Let's think about this in a different way... For VM migration, the
TSC offset can already be modified, because it needs adjustment
there. My patches do a similar thing; you could say they perform a
"VM migration on the timeline". From this perspective, the
architectural guarantee is not broken from the guest's point of
view, since we can consider the VM to have been time-traveled forward.

> > > Assuming you solve these, you should model the guest memory access
> > > similarly to what we do for stolen time. As for injecting an
> > > interrupt, why can't this be a userspace thing?
> >
> > Since CLOCK_BOOTTIME is calculated by adding a gap
> > (tk->monotonic_to_boot) to CLOCK_MONOTONIC, and there are no way to
> > change the value from the outside of the guest kernel, we should
> > implement some mechanism in the kernel to adjust it.
> > (Actually, I tried to add a sysfs interface to modify the gap [3], but
> > I learned that that is not a good idea...)
>
> It is not what I was suggesting.
>
> My suggestion was to have a shared memory between the VMM and the
> guest again, similar to the way we handle stolen time), let the VMM
> expose the drift in this shared memory, and inject an interrupt from
> userspace to signify a wake-up. All this requires is that on suspend,
> you force the vcpus to exit. On resume, the VMM update the guest
> visible drift, posts an interrupt, and let things rip.
>
> This requires very minimal KVM support, and squarely places the logic
> in the guest. Why doesn't this work?

Is there a way for a host userland application to know that the host
is about to suspend?
If so, it may be possible to implement this feature outside of KVM,
but my knowledge of the Linux kernel was not enough to find out how.
I would really appreciate it if you could tell me how to do that.

>
> Another question is maybe even more naive: on bare metal, we don't
> need any of this. The system suspends, resumes, and recovers well
> enough. Nobody hides anything, and yet everything works just fine.
> That's because the kernel knows it is being suspended, and it acts
> accordingly. It looks to me that there is some value in following the
> same principles, even if this means that the host suspend has to
> synchronise with the guest being suspended.
>
> Thanks,
>
>         M.
>
> --
> Without deviation from the norm, progress is not possible.

Thank you,
--
Hikaru Nishida


end of thread, other threads:[~2022-04-06 13:13 UTC | newest]

Thread overview: 10+ messages
2021-10-20 12:04 [RFC PATCH v3 0/5] x86/kvm: Virtual suspend time injection support Hikaru Nishida
2021-10-20 12:04 ` [RFC PATCH v3 1/5] timekeeping: Expose tk->offs_boot via ktime_get_offs_boot_ns Hikaru Nishida
2021-10-20 12:04 ` [RFC PATCH v3 2/5] kvm/x86: Include asm/pvclock.h in asm/kvmclock.h Hikaru Nishida
2021-10-20 12:04 ` [RFC PATCH v3 3/5] kvm/x86: virtual suspend time injection: Add common definitions Hikaru Nishida
2021-10-20 12:04 ` [RFC PATCH v3 4/5] kvm/x86: virtual suspend time injection: Implement host side Hikaru Nishida
2021-10-20 12:04 ` [RFC PATCH v3 5/5] kvm/x86: virtual suspend time injection: Implement guest side Hikaru Nishida
2021-10-20 13:52 ` [RFC PATCH v3 0/5] x86/kvm: Virtual suspend time injection support Marc Zyngier
2021-11-04  9:10   ` Hikaru Nishida
2021-12-04 17:30     ` Marc Zyngier
2022-04-06  3:48       ` Hikaru Nishida
