All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/6] x86/kvm/hyperv: stable clocksorce for L2 guests when running nested KVM on Hyper-V
@ 2017-12-08 10:49 ` Vitaly Kuznetsov
  0 siblings, 0 replies; 24+ messages in thread
From: Vitaly Kuznetsov @ 2017-12-08 10:49 UTC (permalink / raw)
  To: kvm, x86
  Cc: Paolo Bonzini, Radim Krčmář,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Michael Kelley (EOSG),
	Andy Lutomirski, Mohammed Gamal, Cathy Avery, linux-kernel,
	devel

Changes since RFC:
- Handle disabled TSC page case [Radim Krčmář]
- Function/parameter renames [Radim Krčmář]
- Protect against simultaneous CPU shutdown (future-proof) [Radim Krčmář]
- BUG() in !CONFIG_HYPERV_TSCPAGE case [Stephen Hemminger, Paolo Bonzini]
- Fix build issues (hopefully) [kbuild test robot]
- Fix typos [Andrew Jones]
- Switch to delayed work to workaround a host issue (reported to Microsoft)
Changes are described in the individual patches too.

Original description:

Currently, KVM passes PVCLOCK_TSC_STABLE_BIT to its guests when running in
so called 'masterclock' mode and this is only possible when the clocksource
on the host is TSC. When running nested on Hyper-V we're using a different
clocksource in L1 (Hyper-V TSC Page) which can actually be used for
masterclock. This series brings the required support.

Making KVM work with TSC page clocksource is relatively easy, it is done in
PATCH 5 of the series. All the rest is required to support L1 migration
when TSC frequency changes, we use a special feature from Hyper-V to do
the job.

Vitaly Kuznetsov (6):
  x86/hyper-v: check for required priviliges in hyperv_init()
  x86/hyper-v: add a function to read both TSC and TSC page value
    simulateneously
  x86/hyper-v: reenlightenment notifications support
  x86/hyper-v: redirect reenlightment notifications on CPU offlining
  x86/kvm: pass stable clocksource to guests when running nested on
    Hyper-V
  x86/kvm: support Hyper-V reenlightenment

 arch/x86/entry/entry_64.S          |   4 ++
 arch/x86/hyperv/hv_init.c          | 113 +++++++++++++++++++++++++++++-
 arch/x86/include/asm/entry_arch.h  |   4 ++
 arch/x86/include/asm/hw_irq.h      |   1 +
 arch/x86/include/asm/irq_vectors.h |   7 +-
 arch/x86/include/asm/mshyperv.h    |  31 +++++++--
 arch/x86/include/uapi/asm/hyperv.h |  27 ++++++++
 arch/x86/kernel/idt.c              |   3 +
 arch/x86/kvm/x86.c                 | 138 ++++++++++++++++++++++++++++++-------
 9 files changed, 296 insertions(+), 32 deletions(-)

-- 
2.14.3

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 0/6] x86/kvm/hyperv: stable clocksorce for L2 guests when running nested KVM on Hyper-V
@ 2017-12-08 10:49 ` Vitaly Kuznetsov
  0 siblings, 0 replies; 24+ messages in thread
From: Vitaly Kuznetsov @ 2017-12-08 10:49 UTC (permalink / raw)
  To: kvm, x86
  Cc: Stephen Hemminger, Radim Krčmář,
	Haiyang Zhang, linux-kernel, devel, Michael Kelley (EOSG),
	Ingo Molnar, Andy Lutomirski, H. Peter Anvin, Paolo Bonzini,
	Thomas Gleixner, Mohammed Gamal

Changes since RFC:
- Handle disabled TSC page case [Radim Krčmář]
- Function/parameter renames [Radim Krčmář]
- Protect against simultaneous CPU shutdown (future-proof) [Radim Krčmář]
- BUG() in !CONFIG_HYPERV_TSCPAGE case [Stephen Hemminger, Paolo Bonzini]
- Fix build issues (hopefully) [kbuild test robot]
- Fix typos [Andrew Jones]
- Switch to delayed work to workaround a host issue (reported to Microsoft)
Changes are described in the individual patches too.

Original description:

Currently, KVM passes PVCLOCK_TSC_STABLE_BIT to its guests when running in
so called 'masterclock' mode and this is only possible when the clocksource
on the host is TSC. When running nested on Hyper-V we're using a different
clocksource in L1 (Hyper-V TSC Page) which can actually be used for
masterclock. This series brings the required support.

Making KVM work with TSC page clocksource is relatively easy, it is done in
PATCH 5 of the series. All the rest is required to support L1 migration
when TSC frequency changes, we use a special feature from Hyper-V to do
the job.

Vitaly Kuznetsov (6):
  x86/hyper-v: check for required priviliges in hyperv_init()
  x86/hyper-v: add a function to read both TSC and TSC page value
    simulateneously
  x86/hyper-v: reenlightenment notifications support
  x86/hyper-v: redirect reenlightment notifications on CPU offlining
  x86/kvm: pass stable clocksource to guests when running nested on
    Hyper-V
  x86/kvm: support Hyper-V reenlightenment

 arch/x86/entry/entry_64.S          |   4 ++
 arch/x86/hyperv/hv_init.c          | 113 +++++++++++++++++++++++++++++-
 arch/x86/include/asm/entry_arch.h  |   4 ++
 arch/x86/include/asm/hw_irq.h      |   1 +
 arch/x86/include/asm/irq_vectors.h |   7 +-
 arch/x86/include/asm/mshyperv.h    |  31 +++++++--
 arch/x86/include/uapi/asm/hyperv.h |  27 ++++++++
 arch/x86/kernel/idt.c              |   3 +
 arch/x86/kvm/x86.c                 | 138 ++++++++++++++++++++++++++++++-------
 9 files changed, 296 insertions(+), 32 deletions(-)

-- 
2.14.3

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/6] x86/hyper-v: check for required priviliges in hyperv_init()
  2017-12-08 10:49 ` Vitaly Kuznetsov
  (?)
@ 2017-12-08 10:49 ` Vitaly Kuznetsov
  -1 siblings, 0 replies; 24+ messages in thread
From: Vitaly Kuznetsov @ 2017-12-08 10:49 UTC (permalink / raw)
  To: kvm, x86
  Cc: Paolo Bonzini, Radim Krčmář,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Michael Kelley (EOSG),
	Andy Lutomirski, Mohammed Gamal, Cathy Avery, linux-kernel,
	devel

In hyperv_init() we presume we always have access to VP index and hypercall
MSRs while according to the specification we should check if we're allowed
to access the corresponding MSRs before accessing them.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
RFC -> v1:
- 'requied_msrs' -> 'required_msrs' [Andrew Jones]
---
 arch/x86/hyperv/hv_init.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 189a398290db..21f9d53d9f00 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -110,12 +110,19 @@ static int hv_cpu_init(unsigned int cpu)
  */
 void hyperv_init(void)
 {
-	u64 guest_id;
+	u64 guest_id, required_msrs;
 	union hv_x64_msr_hypercall_contents hypercall_msr;
 
 	if (x86_hyper_type != X86_HYPER_MS_HYPERV)
 		return;
 
+	/* Absolutely required MSRs */
+	required_msrs = HV_X64_MSR_HYPERCALL_AVAILABLE |
+		HV_X64_MSR_VP_INDEX_AVAILABLE;
+
+	if ((ms_hyperv.features & required_msrs) != required_msrs)
+		return;
+
 	/* Allocate percpu VP index */
 	hv_vp_index = kmalloc_array(num_possible_cpus(), sizeof(*hv_vp_index),
 				    GFP_KERNEL);
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 2/6] x86/hyper-v: add a function to read both TSC and TSC page value simulateneously
  2017-12-08 10:49 ` Vitaly Kuznetsov
@ 2017-12-08 10:49   ` Vitaly Kuznetsov
  -1 siblings, 0 replies; 24+ messages in thread
From: Vitaly Kuznetsov @ 2017-12-08 10:49 UTC (permalink / raw)
  To: kvm, x86
  Cc: Paolo Bonzini, Radim Krčmář,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Michael Kelley (EOSG),
	Andy Lutomirski, Mohammed Gamal, Cathy Avery, linux-kernel,
	devel

This is going to be used from KVM code where we need to get both
TSC and TSC page value.

When Hyper-V code is compiled out just return rdtsc(), this will allow us
to avoid ugly ifdefs in non-Hyper-V code.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
RFC -> v1:
- EXPORT_SYMBOL_GPL(hv_get_tsc_page) [kbuild test robot]
- BUG() in !CONFIG_HYPERV_TSCPAGE case [Stephen Hemminger, Paolo Bonzini]
- 'were' -> 'where' [Andrew Jones]
---
 arch/x86/hyperv/hv_init.c       |  1 +
 arch/x86/include/asm/mshyperv.h | 23 +++++++++++++++++++----
 2 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 21f9d53d9f00..1a6c63f721bc 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -37,6 +37,7 @@ struct ms_hyperv_tsc_page *hv_get_tsc_page(void)
 {
 	return tsc_pg;
 }
+EXPORT_SYMBOL_GPL(hv_get_tsc_page);
 
 static u64 read_hv_clock_tsc(struct clocksource *arg)
 {
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index 5400add2885b..a0b34994f453 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -323,9 +323,10 @@ static inline void hyperv_setup_mmu_ops(void) {}
 
 #ifdef CONFIG_HYPERV_TSCPAGE
 struct ms_hyperv_tsc_page *hv_get_tsc_page(void);
-static inline u64 hv_read_tsc_page(const struct ms_hyperv_tsc_page *tsc_pg)
+static inline u64 hv_read_tsc_page_tsc(const struct ms_hyperv_tsc_page *tsc_pg,
+				       u64 *cur_tsc)
 {
-	u64 scale, offset, cur_tsc;
+	u64 scale, offset;
 	u32 sequence;
 
 	/*
@@ -356,7 +357,7 @@ static inline u64 hv_read_tsc_page(const struct ms_hyperv_tsc_page *tsc_pg)
 
 		scale = READ_ONCE(tsc_pg->tsc_scale);
 		offset = READ_ONCE(tsc_pg->tsc_offset);
-		cur_tsc = rdtsc_ordered();
+		*cur_tsc = rdtsc_ordered();
 
 		/*
 		 * Make sure we read sequence after we read all other values
@@ -366,7 +367,14 @@ static inline u64 hv_read_tsc_page(const struct ms_hyperv_tsc_page *tsc_pg)
 
 	} while (READ_ONCE(tsc_pg->tsc_sequence) != sequence);
 
-	return mul_u64_u64_shr(cur_tsc, scale, 64) + offset;
+	return mul_u64_u64_shr(*cur_tsc, scale, 64) + offset;
+}
+
+static inline u64 hv_read_tsc_page(const struct ms_hyperv_tsc_page *tsc_pg)
+{
+	u64 cur_tsc;
+
+	return hv_read_tsc_page_tsc(tsc_pg, &cur_tsc);
 }
 
 #else
@@ -374,5 +382,12 @@ static inline struct ms_hyperv_tsc_page *hv_get_tsc_page(void)
 {
 	return NULL;
 }
+
+static inline u64 hv_read_tsc_page_tsc(const struct ms_hyperv_tsc_page *tsc_pg,
+				       u64 *cur_tsc)
+{
+	BUG();
+	return U64_MAX;
+}
 #endif
 #endif
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 2/6] x86/hyper-v: add a function to read both TSC and TSC page value simulateneously
@ 2017-12-08 10:49   ` Vitaly Kuznetsov
  0 siblings, 0 replies; 24+ messages in thread
From: Vitaly Kuznetsov @ 2017-12-08 10:49 UTC (permalink / raw)
  To: kvm, x86
  Cc: Stephen Hemminger, Radim Krčmář,
	Haiyang Zhang, linux-kernel, devel, Michael Kelley (EOSG),
	Ingo Molnar, Andy Lutomirski, H. Peter Anvin, Paolo Bonzini,
	Thomas Gleixner, Mohammed Gamal

This is going to be used from KVM code where we need to get both
TSC and TSC page value.

When Hyper-V code is compiled out just return rdtsc(), this will allow us
to avoid ugly ifdefs in non-Hyper-V code.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
RFC -> v1:
- EXPORT_SYMBOL_GPL(hv_get_tsc_page) [kbuild test robot]
- BUG() in !CONFIG_HYPERV_TSCPAGE case [Stephen Hemminger, Paolo Bonzini]
- 'were' -> 'where' [Andrew Jones]
---
 arch/x86/hyperv/hv_init.c       |  1 +
 arch/x86/include/asm/mshyperv.h | 23 +++++++++++++++++++----
 2 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 21f9d53d9f00..1a6c63f721bc 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -37,6 +37,7 @@ struct ms_hyperv_tsc_page *hv_get_tsc_page(void)
 {
 	return tsc_pg;
 }
+EXPORT_SYMBOL_GPL(hv_get_tsc_page);
 
 static u64 read_hv_clock_tsc(struct clocksource *arg)
 {
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index 5400add2885b..a0b34994f453 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -323,9 +323,10 @@ static inline void hyperv_setup_mmu_ops(void) {}
 
 #ifdef CONFIG_HYPERV_TSCPAGE
 struct ms_hyperv_tsc_page *hv_get_tsc_page(void);
-static inline u64 hv_read_tsc_page(const struct ms_hyperv_tsc_page *tsc_pg)
+static inline u64 hv_read_tsc_page_tsc(const struct ms_hyperv_tsc_page *tsc_pg,
+				       u64 *cur_tsc)
 {
-	u64 scale, offset, cur_tsc;
+	u64 scale, offset;
 	u32 sequence;
 
 	/*
@@ -356,7 +357,7 @@ static inline u64 hv_read_tsc_page(const struct ms_hyperv_tsc_page *tsc_pg)
 
 		scale = READ_ONCE(tsc_pg->tsc_scale);
 		offset = READ_ONCE(tsc_pg->tsc_offset);
-		cur_tsc = rdtsc_ordered();
+		*cur_tsc = rdtsc_ordered();
 
 		/*
 		 * Make sure we read sequence after we read all other values
@@ -366,7 +367,14 @@ static inline u64 hv_read_tsc_page(const struct ms_hyperv_tsc_page *tsc_pg)
 
 	} while (READ_ONCE(tsc_pg->tsc_sequence) != sequence);
 
-	return mul_u64_u64_shr(cur_tsc, scale, 64) + offset;
+	return mul_u64_u64_shr(*cur_tsc, scale, 64) + offset;
+}
+
+static inline u64 hv_read_tsc_page(const struct ms_hyperv_tsc_page *tsc_pg)
+{
+	u64 cur_tsc;
+
+	return hv_read_tsc_page_tsc(tsc_pg, &cur_tsc);
 }
 
 #else
@@ -374,5 +382,12 @@ static inline struct ms_hyperv_tsc_page *hv_get_tsc_page(void)
 {
 	return NULL;
 }
+
+static inline u64 hv_read_tsc_page_tsc(const struct ms_hyperv_tsc_page *tsc_pg,
+				       u64 *cur_tsc)
+{
+	BUG();
+	return U64_MAX;
+}
 #endif
 #endif
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 3/6] x86/hyper-v: reenlightenment notifications support
  2017-12-08 10:49 ` Vitaly Kuznetsov
                   ` (2 preceding siblings ...)
  (?)
@ 2017-12-08 10:49 ` Vitaly Kuznetsov
  2017-12-08 18:10   ` Roman Kagan
  2017-12-13  0:50   ` Michael Kelley (EOSG)
  -1 siblings, 2 replies; 24+ messages in thread
From: Vitaly Kuznetsov @ 2017-12-08 10:49 UTC (permalink / raw)
  To: kvm, x86
  Cc: Paolo Bonzini, Radim Krčmář,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Michael Kelley (EOSG),
	Andy Lutomirski, Mohammed Gamal, Cathy Avery, linux-kernel,
	devel

Hyper-V supports Live Migration notification. This is supposed to be used
in conjunction with TSC emulation: when we are migrated to a host with
different TSC frequency for some short period host emulates our accesses
to TSC and sends us an interrupt to notify about the event. When we're
done updating everything we can disable TSC emulation and everything will
start working fast again.

We didn't need these notifications before as Hyper-V guests are not
supposed to use TSC as a clocksource: in Linux we even mark it as unstable
on boot. Guests normally use 'tsc page' clocksouce and host updates its
values on migrations automatically.

Things change when we want to run nested virtualization: even when we pass
through PV clocksources (kvm-clock or tsc page) to our guests we need to
know TSC frequency and when it changes.

Hyper-V Top Level Functional Specification (as of v5.0b) wrongly specifies
EAX:BIT(12) of CPUID:0x40000009 as the feature identification bit. The
right one to check is EAX:BIT(13) of CPUID:0x40000003. I was assured that
the fix in on the way.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
RFC -> v1:
- #include <asm/apic.h> [kbuild test robot]
- use div64_u64() [kbuild test robot]
- DECLARE_WORK -> DECLARE_DELAYED_WORK as testing showed there's some bug
  in Hyper-V hypervisor and disabling emulation after receiving interrupt
  may screw TSC counters.
---
 arch/x86/entry/entry_64.S          |  4 +++
 arch/x86/hyperv/hv_init.c          | 71 ++++++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/entry_arch.h  |  4 +++
 arch/x86/include/asm/hw_irq.h      |  1 +
 arch/x86/include/asm/irq_vectors.h |  7 +++-
 arch/x86/include/asm/mshyperv.h    |  8 +++++
 arch/x86/include/uapi/asm/hyperv.h | 27 +++++++++++++++
 arch/x86/kernel/idt.c              |  3 ++
 8 files changed, 124 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index f81d50d7ceac..a32730b260bc 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -826,6 +826,10 @@ apicinterrupt SPURIOUS_APIC_VECTOR		spurious_interrupt		smp_spurious_interrupt
 apicinterrupt IRQ_WORK_VECTOR			irq_work_interrupt		smp_irq_work_interrupt
 #endif
 
+#if IS_ENABLED(CONFIG_HYPERV)
+apicinterrupt HYPERV_REENLIGHTENMENT_VECTOR	hyperv_reenlightenment_intr	smp_hyperv_reenlightenment_intr
+#endif
+
 /*
  * Exception entry points.
  */
diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 1a6c63f721bc..bb62839ede81 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -18,6 +18,7 @@
  */
 
 #include <linux/types.h>
+#include <asm/apic.h>
 #include <asm/hypervisor.h>
 #include <asm/hyperv.h>
 #include <asm/mshyperv.h>
@@ -102,6 +103,76 @@ static int hv_cpu_init(unsigned int cpu)
 	return 0;
 }
 
+static void (*hv_reenlightenment_cb)(void);
+
+static void hv_reenlightenment_notify(struct work_struct *dummy)
+{
+	if (hv_reenlightenment_cb)
+		hv_reenlightenment_cb();
+}
+static DECLARE_DELAYED_WORK(hv_reenlightenment_work, hv_reenlightenment_notify);
+
+void hyperv_stop_tsc_emulation(void)
+{
+	u64 freq;
+	struct hv_tsc_emulation_status emu_status;
+
+	rdmsrl(HV_X64_MSR_TSC_EMULATION_STATUS, *(u64 *)&emu_status);
+	emu_status.inprogress = 0;
+	wrmsrl(HV_X64_MSR_TSC_EMULATION_STATUS, *(u64 *)&emu_status);
+
+	rdmsrl(HV_X64_MSR_TSC_FREQUENCY, freq);
+	tsc_khz = div64_u64(freq, 1000);
+}
+EXPORT_SYMBOL_GPL(hyperv_stop_tsc_emulation);
+
+void register_hv_tsc_update(void (*cb)(void))
+{
+	struct hv_reenlightenment_control re_ctrl = {
+		.vector = HYPERV_REENLIGHTENMENT_VECTOR,
+		.enabled = 1,
+		.target_vp = hv_vp_index[smp_processor_id()]
+	};
+	struct hv_tsc_emulation_control emu_ctrl = {.enabled = 1};
+
+	if (!(ms_hyperv.features & HV_X64_ACCESS_REENLIGHTENMENT))
+		return;
+
+	hv_reenlightenment_cb = cb;
+
+	/* Make sure callback is registered before we write to MSRs */
+	wmb();
+
+	wrmsrl(HV_X64_MSR_REENLIGHTENMENT_CONTROL, *((u64 *)&re_ctrl));
+	wrmsrl(HV_X64_MSR_TSC_EMULATION_CONTROL, *((u64 *)&emu_ctrl));
+}
+EXPORT_SYMBOL_GPL(register_hv_tsc_update);
+
+void unregister_hv_tsc_update(void)
+{
+	struct hv_reenlightenment_control re_ctrl;
+
+	if (!(ms_hyperv.features & HV_X64_ACCESS_REENLIGHTENMENT))
+		return;
+
+	rdmsrl(HV_X64_MSR_REENLIGHTENMENT_CONTROL, *(u64 *)&re_ctrl);
+	re_ctrl.enabled = 0;
+	wrmsrl(HV_X64_MSR_REENLIGHTENMENT_CONTROL, *(u64 *)&re_ctrl);
+
+	hv_reenlightenment_cb = NULL;
+}
+EXPORT_SYMBOL_GPL(unregister_hv_tsc_update);
+
+asmlinkage __visible void
+__irq_entry smp_hyperv_reenlightenment_intr(struct pt_regs *regs)
+{
+	entering_ack_irq();
+
+	schedule_delayed_work(&hv_reenlightenment_work, HZ/10);
+
+	exiting_irq();
+}
+
 /*
  * This function is to be invoked early in the boot sequence after the
  * hypervisor has been detected.
diff --git a/arch/x86/include/asm/entry_arch.h b/arch/x86/include/asm/entry_arch.h
index 416422762845..eb936cc49b62 100644
--- a/arch/x86/include/asm/entry_arch.h
+++ b/arch/x86/include/asm/entry_arch.h
@@ -54,3 +54,7 @@ BUILD_INTERRUPT(threshold_interrupt,THRESHOLD_APIC_VECTOR)
 BUILD_INTERRUPT(deferred_error_interrupt, DEFERRED_ERROR_VECTOR)
 #endif
 #endif
+
+#if IS_ENABLED(CONFIG_HYPERV)
+BUILD_INTERRUPT(hyperv_reenlightenment_intr, HYPERV_REENLIGHTENMENT_VECTOR)
+#endif
diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
index 2851077b6051..c65193dac7d9 100644
--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -36,6 +36,7 @@ extern asmlinkage void kvm_posted_intr_wakeup_ipi(void);
 extern asmlinkage void kvm_posted_intr_nested_ipi(void);
 extern asmlinkage void error_interrupt(void);
 extern asmlinkage void irq_work_interrupt(void);
+extern asmlinkage void hyperv_reenlightenment_intr(void);
 
 extern asmlinkage void spurious_interrupt(void);
 extern asmlinkage void thermal_interrupt(void);
diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index 67421f649cfa..e71c1120426b 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -103,7 +103,12 @@
 #endif
 
 #define MANAGED_IRQ_SHUTDOWN_VECTOR	0xef
-#define LOCAL_TIMER_VECTOR		0xee
+
+#if IS_ENABLED(CONFIG_HYPERV)
+#define HYPERV_REENLIGHTENMENT_VECTOR	0xee
+#endif
+
+#define LOCAL_TIMER_VECTOR		0xed
 
 #define NR_VECTORS			 256
 
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index a0b34994f453..43164b097585 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -314,11 +314,19 @@ void hyper_alloc_mmu(void);
 void hyperv_report_panic(struct pt_regs *regs, long err);
 bool hv_is_hypercall_page_setup(void);
 void hyperv_cleanup(void);
+
+asmlinkage void smp_hyperv_reenlightenment_intr(struct pt_regs *regs);
+void register_hv_tsc_update(void (*cb)(void));
+void unregister_hv_tsc_update(void);
+void hyperv_stop_tsc_emulation(void);
 #else /* CONFIG_HYPERV */
 static inline void hyperv_init(void) {}
 static inline bool hv_is_hypercall_page_setup(void) { return false; }
 static inline void hyperv_cleanup(void) {}
 static inline void hyperv_setup_mmu_ops(void) {}
+static inline void register_hv_tsc_update(void (*cb)(void)) {}
+static inline void unregister_hv_tsc_update(void) {}
+static inline void hyperv_stop_tsc_emulation(void) {};
 #endif /* CONFIG_HYPERV */
 
 #ifdef CONFIG_HYPERV_TSCPAGE
diff --git a/arch/x86/include/uapi/asm/hyperv.h b/arch/x86/include/uapi/asm/hyperv.h
index 1a5bfead93b4..197c2e6c7376 100644
--- a/arch/x86/include/uapi/asm/hyperv.h
+++ b/arch/x86/include/uapi/asm/hyperv.h
@@ -40,6 +40,9 @@
  */
 #define HV_X64_ACCESS_FREQUENCY_MSRS		(1 << 11)
 
+/* AccessReenlightenmentControls privilege */
+#define HV_X64_ACCESS_REENLIGHTENMENT		BIT(13)
+
 /*
  * Basic SynIC MSRs (HV_X64_MSR_SCONTROL through HV_X64_MSR_EOM
  * and HV_X64_MSR_SINT0 through HV_X64_MSR_SINT15) available
@@ -234,6 +237,30 @@
 #define HV_X64_MSR_CRASH_PARAMS		\
 		(1 + (HV_X64_MSR_CRASH_P4 - HV_X64_MSR_CRASH_P0))
 
+/* TSC emulation after migration */
+#define HV_X64_MSR_REENLIGHTENMENT_CONTROL	0x40000106
+
+struct hv_reenlightenment_control {
+	u64 vector:8;
+	u64 reserved1:8;
+	u64 enabled:1;
+	u64 reserved2:15;
+	u64 target_vp:32;
+};
+
+#define HV_X64_MSR_TSC_EMULATION_CONTROL	0x40000107
+#define HV_X64_MSR_TSC_EMULATION_STATUS		0x40000108
+
+struct hv_tsc_emulation_control {
+	u64 enabled:1;
+	u64 reserved:63;
+};
+
+struct hv_tsc_emulation_status {
+	u64 inprogress:1;
+	u64 reserved:63;
+};
+
 #define HV_X64_MSR_HYPERCALL_ENABLE		0x00000001
 #define HV_X64_MSR_HYPERCALL_PAGE_ADDRESS_SHIFT	12
 #define HV_X64_MSR_HYPERCALL_PAGE_ADDRESS_MASK	\
diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
index d985cef3984f..5b8512d48aa3 100644
--- a/arch/x86/kernel/idt.c
+++ b/arch/x86/kernel/idt.c
@@ -140,6 +140,9 @@ static const __initdata struct idt_data apic_idts[] = {
 # ifdef CONFIG_IRQ_WORK
 	INTG(IRQ_WORK_VECTOR,		irq_work_interrupt),
 # endif
+#if IS_ENABLED(CONFIG_HYPERV)
+	INTG(HYPERV_REENLIGHTENMENT_VECTOR, hyperv_reenlightenment_intr),
+#endif
 	INTG(SPURIOUS_APIC_VECTOR,	spurious_interrupt),
 	INTG(ERROR_APIC_VECTOR,		error_interrupt),
 #endif
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 4/6] x86/hyper-v: redirect reenlightment notifications on CPU offlining
  2017-12-08 10:49 ` Vitaly Kuznetsov
@ 2017-12-08 10:49   ` Vitaly Kuznetsov
  -1 siblings, 0 replies; 24+ messages in thread
From: Vitaly Kuznetsov @ 2017-12-08 10:49 UTC (permalink / raw)
  To: kvm, x86
  Cc: Paolo Bonzini, Radim Krčmář,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Michael Kelley (EOSG),
	Andy Lutomirski, Mohammed Gamal, Cathy Avery, linux-kernel,
	devel

It is very unlikely for CPUs to get offlined when we run on Hyper-V as
we have a protection in vmbus module which prevents it when we have any
VMBus devices assigned. This, however,  may change in future if an option
to reassign an already active channel will be added. It is also possible
to run without any Hyper-V devices of have a CPU with no assigned channels.

Reassign reenlightenment notifications to some other active CPU when
the CPU which is assigned to get them goes offline.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
RFC -> v1:
- protect hv_cpu_die() from reentrance (when multiple CPUs go offline
  in parallel. I don't think it happens nowdays though but things tend
  to change).
---
 arch/x86/hyperv/hv_init.c | 32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index bb62839ede81..fbf8e372e261 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -173,6 +173,36 @@ __irq_entry smp_hyperv_reenlightenment_intr(struct pt_regs *regs)
 	exiting_irq();
 }
 
+static int hv_cpu_die(unsigned int cpu)
+{
+	struct hv_reenlightenment_control re_ctrl;
+	int i;
+	static DEFINE_SPINLOCK(lock);
+
+	if (hv_reenlightenment_cb == NULL)
+		return 0;
+
+	/* Make sure the CPU we migrate to is not going away too */
+	spin_lock(&lock);
+	rdmsrl(HV_X64_MSR_REENLIGHTENMENT_CONTROL, *((u64 *)&re_ctrl));
+	if (re_ctrl.target_vp == hv_vp_index[cpu]) {
+		/* Find some other online CPU */
+		for_each_online_cpu(i) {
+			if (i == cpu)
+				continue;
+
+			re_ctrl.target_vp = hv_vp_index[i];
+			wrmsrl(HV_X64_MSR_REENLIGHTENMENT_CONTROL,
+			       *((u64 *)&re_ctrl));
+
+			break;
+		}
+	}
+	spin_unlock(&lock);
+
+	return 0;
+}
+
 /*
  * This function is to be invoked early in the boot sequence after the
  * hypervisor has been detected.
@@ -202,7 +232,7 @@ void hyperv_init(void)
 		return;
 
 	if (cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "x86/hyperv_init:online",
-			      hv_cpu_init, NULL) < 0)
+			      hv_cpu_init, hv_cpu_die) < 0)
 		goto free_vp_index;
 
 	/*
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 4/6] x86/hyper-v: redirect reenlightment notifications on CPU offlining
@ 2017-12-08 10:49   ` Vitaly Kuznetsov
  0 siblings, 0 replies; 24+ messages in thread
From: Vitaly Kuznetsov @ 2017-12-08 10:49 UTC (permalink / raw)
  To: kvm, x86
  Cc: Stephen Hemminger, Radim Krčmář,
	Haiyang Zhang, linux-kernel, devel, Michael Kelley (EOSG),
	Ingo Molnar, Andy Lutomirski, H. Peter Anvin, Paolo Bonzini,
	Thomas Gleixner, Mohammed Gamal

It is very unlikely for CPUs to get offlined when we run on Hyper-V as
we have a protection in vmbus module which prevents it when we have any
VMBus devices assigned. This, however,  may change in future if an option
to reassign an already active channel will be added. It is also possible
to run without any Hyper-V devices of have a CPU with no assigned channels.

Reassign reenlightenment notifications to some other active CPU when
the CPU which is assigned to get them goes offline.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
RFC -> v1:
- protect hv_cpu_die() from reentrance (when multiple CPUs go offline
  in parallel. I don't think it happens nowdays though but things tend
  to change).
---
 arch/x86/hyperv/hv_init.c | 32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index bb62839ede81..fbf8e372e261 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -173,6 +173,36 @@ __irq_entry smp_hyperv_reenlightenment_intr(struct pt_regs *regs)
 	exiting_irq();
 }
 
+static int hv_cpu_die(unsigned int cpu)
+{
+	struct hv_reenlightenment_control re_ctrl;
+	int i;
+	static DEFINE_SPINLOCK(lock);
+
+	if (hv_reenlightenment_cb == NULL)
+		return 0;
+
+	/* Make sure the CPU we migrate to is not going away too */
+	spin_lock(&lock);
+	rdmsrl(HV_X64_MSR_REENLIGHTENMENT_CONTROL, *((u64 *)&re_ctrl));
+	if (re_ctrl.target_vp == hv_vp_index[cpu]) {
+		/* Find some other online CPU */
+		for_each_online_cpu(i) {
+			if (i == cpu)
+				continue;
+
+			re_ctrl.target_vp = hv_vp_index[i];
+			wrmsrl(HV_X64_MSR_REENLIGHTENMENT_CONTROL,
+			       *((u64 *)&re_ctrl));
+
+			break;
+		}
+	}
+	spin_unlock(&lock);
+
+	return 0;
+}
+
 /*
  * This function is to be invoked early in the boot sequence after the
  * hypervisor has been detected.
@@ -202,7 +232,7 @@ void hyperv_init(void)
 		return;
 
 	if (cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "x86/hyperv_init:online",
-			      hv_cpu_init, NULL) < 0)
+			      hv_cpu_init, hv_cpu_die) < 0)
 		goto free_vp_index;
 
 	/*
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 5/6] x86/kvm: pass stable clocksource to guests when running nested on Hyper-V
  2017-12-08 10:49 ` Vitaly Kuznetsov
@ 2017-12-08 10:49   ` Vitaly Kuznetsov
  -1 siblings, 0 replies; 24+ messages in thread
From: Vitaly Kuznetsov @ 2017-12-08 10:49 UTC (permalink / raw)
  To: kvm, x86
  Cc: Paolo Bonzini, Radim Krčmář,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Michael Kelley (EOSG),
	Andy Lutomirski, Mohammed Gamal, Cathy Avery, linux-kernel,
	devel

Currently, KVM is able to work in 'masterclock' mode passing
PVCLOCK_TSC_STABLE_BIT to guests when the clocksource we use on the host
is TSC. When running nested on Hyper-V we normally use a different one:
TSC page which is resistant to TSC frequency changes on event like L1
migration. Add support for it in KVM.

The only non-trivial change in the patch is in vgettsc(): when updating
our gtod copy we now need to get both the clockread and tsc value.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
RFC -> v1:
- 'gtod_cs_mode_good' -> 'gtod_is_based_on_tsc' [Radim Krčmář]
- rewrite vgettsc() to handle -1 from hv_read_tsc_page_tsc() [Radim Krčmář]
- cycle_now -> tsc_timestamp [Radim Krčmář]
---
 arch/x86/kvm/x86.c | 93 +++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 68 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index eee8e7faf1af..96e04a0cb921 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -67,6 +67,7 @@
 #include <asm/pvclock.h>
 #include <asm/div64.h>
 #include <asm/irq_remapping.h>
+#include <asm/mshyperv.h>
 
 #define CREATE_TRACE_POINTS
 #include "trace.h"
@@ -1377,6 +1378,11 @@ static u64 compute_guest_tsc(struct kvm_vcpu *vcpu, s64 kernel_ns)
 	return tsc;
 }
 
+static inline int gtod_is_based_on_tsc(int mode)
+{
+	return mode == VCLOCK_TSC || mode == VCLOCK_HVCLOCK;
+}
+
 static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu)
 {
 #ifdef CONFIG_X86_64
@@ -1396,7 +1402,7 @@ static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu)
 	 * perform request to enable masterclock.
 	 */
 	if (ka->use_master_clock ||
-	    (gtod->clock.vclock_mode == VCLOCK_TSC && vcpus_matched))
+	    (gtod_is_based_on_tsc(gtod->clock.vclock_mode) && vcpus_matched))
 		kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
 
 	trace_kvm_track_tsc(vcpu->vcpu_id, ka->nr_vcpus_matched_tsc,
@@ -1459,6 +1465,19 @@ static void kvm_vcpu_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
 	vcpu->arch.tsc_offset = offset;
 }
 
+static inline bool kvm_check_tsc_unstable(void)
+{
+#ifdef CONFIG_X86_64
+	/*
+	 * TSC is marked unstable when we're running on Hyper-V,
+	 * 'TSC page' clocksource is good.
+	 */
+	if (pvclock_gtod_data.clock.vclock_mode == VCLOCK_HVCLOCK)
+		return false;
+#endif
+	return check_tsc_unstable();
+}
+
 void kvm_write_tsc(struct kvm_vcpu *vcpu, struct msr_data *msr)
 {
 	struct kvm *kvm = vcpu->kvm;
@@ -1504,7 +1523,7 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu, struct msr_data *msr)
          */
 	if (synchronizing &&
 	    vcpu->arch.virtual_tsc_khz == kvm->arch.last_tsc_khz) {
-		if (!check_tsc_unstable()) {
+		if (!kvm_check_tsc_unstable()) {
 			offset = kvm->arch.cur_tsc_offset;
 			pr_debug("kvm: matched tsc offset for %llu\n", data);
 		} else {
@@ -1604,18 +1623,43 @@ static u64 read_tsc(void)
 	return last;
 }
 
-static inline u64 vgettsc(u64 *cycle_now)
+static inline u64 vgettsc(u64 *tsc_timestamp, int *mode)
 {
 	long v;
 	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
+	u64 tsc_pg_val;
+
+	switch (gtod->clock.vclock_mode) {
+	case VCLOCK_HVCLOCK:
+		tsc_pg_val = hv_read_tsc_page_tsc(hv_get_tsc_page(),
+						  tsc_timestamp);
+		if (tsc_pg_val != U64_MAX) {
+			/* TSC page valid */
+			*mode = VCLOCK_HVCLOCK;
+			v = (tsc_pg_val - gtod->clock.cycle_last) &
+				gtod->clock.mask;
+		} else {
+			/* TSC page invalid */
+			*mode = VCLOCK_NONE;
+		}
+		break;
+	case VCLOCK_TSC:
+		*mode = VCLOCK_TSC;
+		*tsc_timestamp = read_tsc();
+		v = (*tsc_timestamp - gtod->clock.cycle_last) &
+			gtod->clock.mask;
+		break;
+	default:
+		*mode = VCLOCK_NONE;
+	}
 
-	*cycle_now = read_tsc();
+	if (*mode == VCLOCK_NONE)
+		*tsc_timestamp = v = 0;
 
-	v = (*cycle_now - gtod->clock.cycle_last) & gtod->clock.mask;
 	return v * gtod->clock.mult;
 }
 
-static int do_monotonic_boot(s64 *t, u64 *cycle_now)
+static int do_monotonic_boot(s64 *t, u64 *tsc_timestamp)
 {
 	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
 	unsigned long seq;
@@ -1624,9 +1668,8 @@ static int do_monotonic_boot(s64 *t, u64 *cycle_now)
 
 	do {
 		seq = read_seqcount_begin(&gtod->seq);
-		mode = gtod->clock.vclock_mode;
 		ns = gtod->nsec_base;
-		ns += vgettsc(cycle_now);
+		ns += vgettsc(tsc_timestamp, &mode);
 		ns >>= gtod->clock.shift;
 		ns += gtod->boot_ns;
 	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
@@ -1635,7 +1678,7 @@ static int do_monotonic_boot(s64 *t, u64 *cycle_now)
 	return mode;
 }
 
-static int do_realtime(struct timespec *ts, u64 *cycle_now)
+static int do_realtime(struct timespec *ts, u64 *tsc_timestamp)
 {
 	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
 	unsigned long seq;
@@ -1644,10 +1687,9 @@ static int do_realtime(struct timespec *ts, u64 *cycle_now)
 
 	do {
 		seq = read_seqcount_begin(&gtod->seq);
-		mode = gtod->clock.vclock_mode;
 		ts->tv_sec = gtod->wall_time_sec;
 		ns = gtod->nsec_base;
-		ns += vgettsc(cycle_now);
+		ns += vgettsc(tsc_timestamp, &mode);
 		ns >>= gtod->clock.shift;
 	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
 
@@ -1657,25 +1699,26 @@ static int do_realtime(struct timespec *ts, u64 *cycle_now)
 	return mode;
 }
 
-/* returns true if host is using tsc clocksource */
-static bool kvm_get_time_and_clockread(s64 *kernel_ns, u64 *cycle_now)
+/* returns true if host is using TSC based clocksource */
+static bool kvm_get_time_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp)
 {
 	/* checked again under seqlock below */
-	if (pvclock_gtod_data.clock.vclock_mode != VCLOCK_TSC)
+	if (!gtod_is_based_on_tsc(pvclock_gtod_data.clock.vclock_mode))
 		return false;
 
-	return do_monotonic_boot(kernel_ns, cycle_now) == VCLOCK_TSC;
+	return gtod_is_based_on_tsc(do_monotonic_boot(kernel_ns,
+						      tsc_timestamp));
 }
 
-/* returns true if host is using tsc clocksource */
+/* returns true if host is using TSC based clocksource */
 static bool kvm_get_walltime_and_clockread(struct timespec *ts,
-					   u64 *cycle_now)
+					   u64 *tsc_timestamp)
 {
 	/* checked again under seqlock below */
-	if (pvclock_gtod_data.clock.vclock_mode != VCLOCK_TSC)
+	if (!gtod_is_based_on_tsc(pvclock_gtod_data.clock.vclock_mode))
 		return false;
 
-	return do_realtime(ts, cycle_now) == VCLOCK_TSC;
+	return gtod_is_based_on_tsc(do_realtime(ts, tsc_timestamp));
 }
 #endif
 
@@ -2869,13 +2912,13 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
 	}
 
-	if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
+	if (unlikely(vcpu->cpu != cpu) || kvm_check_tsc_unstable()) {
 		s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
 				rdtsc() - vcpu->arch.last_host_tsc;
 		if (tsc_delta < 0)
 			mark_tsc_unstable("KVM discovered backwards TSC");
 
-		if (check_tsc_unstable()) {
+		if (kvm_check_tsc_unstable()) {
 			u64 offset = kvm_compute_tsc_offset(vcpu,
 						vcpu->arch.last_guest_tsc);
 			kvm_vcpu_write_tsc_offset(vcpu, offset);
@@ -6124,9 +6167,9 @@ static int pvclock_gtod_notify(struct notifier_block *nb, unsigned long unused,
 	update_pvclock_gtod(tk);
 
 	/* disable master clock if host does not trust, or does not
-	 * use, TSC clocksource
+	 * use, TSC based clocksource.
 	 */
-	if (gtod->clock.vclock_mode != VCLOCK_TSC &&
+	if (!gtod_is_based_on_tsc(gtod->clock.vclock_mode) &&
 	    atomic_read(&kvm_guest_has_master_clock) != 0)
 		queue_work(system_long_wq, &pvclock_gtod_work);
 
@@ -7749,7 +7792,7 @@ struct kvm_vcpu *kvm_arch_vcpu_create(struct kvm *kvm,
 {
 	struct kvm_vcpu *vcpu;
 
-	if (check_tsc_unstable() && atomic_read(&kvm->online_vcpus) != 0)
+	if (kvm_check_tsc_unstable() && atomic_read(&kvm->online_vcpus) != 0)
 		printk_once(KERN_WARNING
 		"kvm: SMP vm created on host with unstable TSC; "
 		"guest TSC will not be reliable\n");
@@ -7903,7 +7946,7 @@ int kvm_arch_hardware_enable(void)
 		return ret;
 
 	local_tsc = rdtsc();
-	stable = !check_tsc_unstable();
+	stable = !kvm_check_tsc_unstable();
 	list_for_each_entry(kvm, &vm_list, vm_list) {
 		kvm_for_each_vcpu(i, vcpu, kvm) {
 			if (!stable && vcpu->cpu == smp_processor_id())
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 5/6] x86/kvm: pass stable clocksource to guests when running nested on Hyper-V
@ 2017-12-08 10:49   ` Vitaly Kuznetsov
  0 siblings, 0 replies; 24+ messages in thread
From: Vitaly Kuznetsov @ 2017-12-08 10:49 UTC (permalink / raw)
  To: kvm, x86
  Cc: Stephen Hemminger, Radim Krčmář,
	Haiyang Zhang, linux-kernel, devel, Michael Kelley (EOSG),
	Ingo Molnar, Andy Lutomirski, H. Peter Anvin, Paolo Bonzini,
	Thomas Gleixner, Mohammed Gamal

Currently, KVM is able to work in 'masterclock' mode passing
PVCLOCK_TSC_STABLE_BIT to guests when the clocksource we use on the host
is TSC. When running nested on Hyper-V we normally use a different one:
TSC page which is resistant to TSC frequency changes on event like L1
migration. Add support for it in KVM.

The only non-trivial change in the patch is in vgettsc(): when updating
our gtod copy we now need to get both the clockread and tsc value.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
RFC -> v1:
- 'gtod_cs_mode_good' -> 'gtod_is_based_on_tsc' [Radim Krčmář]
- rewrite vgettsc() to handle -1 from hv_read_tsc_page_tsc() [Radim Krčmář]
- cycle_now -> tsc_timestamp [Radim Krčmář]
---
 arch/x86/kvm/x86.c | 93 +++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 68 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index eee8e7faf1af..96e04a0cb921 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -67,6 +67,7 @@
 #include <asm/pvclock.h>
 #include <asm/div64.h>
 #include <asm/irq_remapping.h>
+#include <asm/mshyperv.h>
 
 #define CREATE_TRACE_POINTS
 #include "trace.h"
@@ -1377,6 +1378,11 @@ static u64 compute_guest_tsc(struct kvm_vcpu *vcpu, s64 kernel_ns)
 	return tsc;
 }
 
+static inline int gtod_is_based_on_tsc(int mode)
+{
+	return mode == VCLOCK_TSC || mode == VCLOCK_HVCLOCK;
+}
+
 static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu)
 {
 #ifdef CONFIG_X86_64
@@ -1396,7 +1402,7 @@ static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu)
 	 * perform request to enable masterclock.
 	 */
 	if (ka->use_master_clock ||
-	    (gtod->clock.vclock_mode == VCLOCK_TSC && vcpus_matched))
+	    (gtod_is_based_on_tsc(gtod->clock.vclock_mode) && vcpus_matched))
 		kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
 
 	trace_kvm_track_tsc(vcpu->vcpu_id, ka->nr_vcpus_matched_tsc,
@@ -1459,6 +1465,19 @@ static void kvm_vcpu_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
 	vcpu->arch.tsc_offset = offset;
 }
 
+static inline bool kvm_check_tsc_unstable(void)
+{
+#ifdef CONFIG_X86_64
+	/*
+	 * TSC is marked unstable when we're running on Hyper-V,
+	 * 'TSC page' clocksource is good.
+	 */
+	if (pvclock_gtod_data.clock.vclock_mode == VCLOCK_HVCLOCK)
+		return false;
+#endif
+	return check_tsc_unstable();
+}
+
 void kvm_write_tsc(struct kvm_vcpu *vcpu, struct msr_data *msr)
 {
 	struct kvm *kvm = vcpu->kvm;
@@ -1504,7 +1523,7 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu, struct msr_data *msr)
          */
 	if (synchronizing &&
 	    vcpu->arch.virtual_tsc_khz == kvm->arch.last_tsc_khz) {
-		if (!check_tsc_unstable()) {
+		if (!kvm_check_tsc_unstable()) {
 			offset = kvm->arch.cur_tsc_offset;
 			pr_debug("kvm: matched tsc offset for %llu\n", data);
 		} else {
@@ -1604,18 +1623,43 @@ static u64 read_tsc(void)
 	return last;
 }
 
-static inline u64 vgettsc(u64 *cycle_now)
+static inline u64 vgettsc(u64 *tsc_timestamp, int *mode)
 {
 	long v;
 	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
+	u64 tsc_pg_val;
+
+	switch (gtod->clock.vclock_mode) {
+	case VCLOCK_HVCLOCK:
+		tsc_pg_val = hv_read_tsc_page_tsc(hv_get_tsc_page(),
+						  tsc_timestamp);
+		if (tsc_pg_val != U64_MAX) {
+			/* TSC page valid */
+			*mode = VCLOCK_HVCLOCK;
+			v = (tsc_pg_val - gtod->clock.cycle_last) &
+				gtod->clock.mask;
+		} else {
+			/* TSC page invalid */
+			*mode = VCLOCK_NONE;
+		}
+		break;
+	case VCLOCK_TSC:
+		*mode = VCLOCK_TSC;
+		*tsc_timestamp = read_tsc();
+		v = (*tsc_timestamp - gtod->clock.cycle_last) &
+			gtod->clock.mask;
+		break;
+	default:
+		*mode = VCLOCK_NONE;
+	}
 
-	*cycle_now = read_tsc();
+	if (*mode == VCLOCK_NONE)
+		*tsc_timestamp = v = 0;
 
-	v = (*cycle_now - gtod->clock.cycle_last) & gtod->clock.mask;
 	return v * gtod->clock.mult;
 }
 
-static int do_monotonic_boot(s64 *t, u64 *cycle_now)
+static int do_monotonic_boot(s64 *t, u64 *tsc_timestamp)
 {
 	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
 	unsigned long seq;
@@ -1624,9 +1668,8 @@ static int do_monotonic_boot(s64 *t, u64 *cycle_now)
 
 	do {
 		seq = read_seqcount_begin(&gtod->seq);
-		mode = gtod->clock.vclock_mode;
 		ns = gtod->nsec_base;
-		ns += vgettsc(cycle_now);
+		ns += vgettsc(tsc_timestamp, &mode);
 		ns >>= gtod->clock.shift;
 		ns += gtod->boot_ns;
 	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
@@ -1635,7 +1678,7 @@ static int do_monotonic_boot(s64 *t, u64 *cycle_now)
 	return mode;
 }
 
-static int do_realtime(struct timespec *ts, u64 *cycle_now)
+static int do_realtime(struct timespec *ts, u64 *tsc_timestamp)
 {
 	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
 	unsigned long seq;
@@ -1644,10 +1687,9 @@ static int do_realtime(struct timespec *ts, u64 *cycle_now)
 
 	do {
 		seq = read_seqcount_begin(&gtod->seq);
-		mode = gtod->clock.vclock_mode;
 		ts->tv_sec = gtod->wall_time_sec;
 		ns = gtod->nsec_base;
-		ns += vgettsc(cycle_now);
+		ns += vgettsc(tsc_timestamp, &mode);
 		ns >>= gtod->clock.shift;
 	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
 
@@ -1657,25 +1699,26 @@ static int do_realtime(struct timespec *ts, u64 *cycle_now)
 	return mode;
 }
 
-/* returns true if host is using tsc clocksource */
-static bool kvm_get_time_and_clockread(s64 *kernel_ns, u64 *cycle_now)
+/* returns true if host is using TSC based clocksource */
+static bool kvm_get_time_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp)
 {
 	/* checked again under seqlock below */
-	if (pvclock_gtod_data.clock.vclock_mode != VCLOCK_TSC)
+	if (!gtod_is_based_on_tsc(pvclock_gtod_data.clock.vclock_mode))
 		return false;
 
-	return do_monotonic_boot(kernel_ns, cycle_now) == VCLOCK_TSC;
+	return gtod_is_based_on_tsc(do_monotonic_boot(kernel_ns,
+						      tsc_timestamp));
 }
 
-/* returns true if host is using tsc clocksource */
+/* returns true if host is using TSC based clocksource */
 static bool kvm_get_walltime_and_clockread(struct timespec *ts,
-					   u64 *cycle_now)
+					   u64 *tsc_timestamp)
 {
 	/* checked again under seqlock below */
-	if (pvclock_gtod_data.clock.vclock_mode != VCLOCK_TSC)
+	if (!gtod_is_based_on_tsc(pvclock_gtod_data.clock.vclock_mode))
 		return false;
 
-	return do_realtime(ts, cycle_now) == VCLOCK_TSC;
+	return gtod_is_based_on_tsc(do_realtime(ts, tsc_timestamp));
 }
 #endif
 
@@ -2869,13 +2912,13 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
 	}
 
-	if (unlikely(vcpu->cpu != cpu) || check_tsc_unstable()) {
+	if (unlikely(vcpu->cpu != cpu) || kvm_check_tsc_unstable()) {
 		s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
 				rdtsc() - vcpu->arch.last_host_tsc;
 		if (tsc_delta < 0)
 			mark_tsc_unstable("KVM discovered backwards TSC");
 
-		if (check_tsc_unstable()) {
+		if (kvm_check_tsc_unstable()) {
 			u64 offset = kvm_compute_tsc_offset(vcpu,
 						vcpu->arch.last_guest_tsc);
 			kvm_vcpu_write_tsc_offset(vcpu, offset);
@@ -6124,9 +6167,9 @@ static int pvclock_gtod_notify(struct notifier_block *nb, unsigned long unused,
 	update_pvclock_gtod(tk);
 
 	/* disable master clock if host does not trust, or does not
-	 * use, TSC clocksource
+	 * use, TSC based clocksource.
 	 */
-	if (gtod->clock.vclock_mode != VCLOCK_TSC &&
+	if (!gtod_is_based_on_tsc(gtod->clock.vclock_mode) &&
 	    atomic_read(&kvm_guest_has_master_clock) != 0)
 		queue_work(system_long_wq, &pvclock_gtod_work);
 
@@ -7749,7 +7792,7 @@ struct kvm_vcpu *kvm_arch_vcpu_create(struct kvm *kvm,
 {
 	struct kvm_vcpu *vcpu;
 
-	if (check_tsc_unstable() && atomic_read(&kvm->online_vcpus) != 0)
+	if (kvm_check_tsc_unstable() && atomic_read(&kvm->online_vcpus) != 0)
 		printk_once(KERN_WARNING
 		"kvm: SMP vm created on host with unstable TSC; "
 		"guest TSC will not be reliable\n");
@@ -7903,7 +7946,7 @@ int kvm_arch_hardware_enable(void)
 		return ret;
 
 	local_tsc = rdtsc();
-	stable = !check_tsc_unstable();
+	stable = !kvm_check_tsc_unstable();
 	list_for_each_entry(kvm, &vm_list, vm_list) {
 		kvm_for_each_vcpu(i, vcpu, kvm) {
 			if (!stable && vcpu->cpu == smp_processor_id())
-- 
2.14.3

_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 6/6] x86/kvm: support Hyper-V reenlightenment
  2017-12-08 10:49 ` Vitaly Kuznetsov
                   ` (5 preceding siblings ...)
  (?)
@ 2017-12-08 10:50 ` Vitaly Kuznetsov
  2017-12-08 17:39   ` Roman Kagan
  -1 siblings, 1 reply; 24+ messages in thread
From: Vitaly Kuznetsov @ 2017-12-08 10:50 UTC (permalink / raw)
  To: kvm, x86
  Cc: Paolo Bonzini, Radim Krčmář,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Michael Kelley (EOSG),
	Andy Lutomirski, Mohammed Gamal, Cathy Avery, linux-kernel,
	devel

When we run nested KVM on Hyper-V guests we need to update masterclocks for
all guests when L1 migrates to a host with different TSC frequency.
Implement the procedure in the following way:
- Pause all guests.
- Tell our host (Hyper-V) to stop emulating TSC accesses.
- Update our gtod copy, recompute clocks.
- Unpause all guests.

This is somewhat similar to cpufreq but we have two important differences:
we can only disable TSC emulation globally (on all CPUs) and we don't know
the new TSC frequency until we turn the emulation off so we can't
'prepare' ourselves to the event.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 arch/x86/kvm/x86.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 96e04a0cb921..04d90712ffd2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -68,6 +68,7 @@
 #include <asm/div64.h>
 #include <asm/irq_remapping.h>
 #include <asm/mshyperv.h>
+#include <asm/hypervisor.h>
 
 #define CREATE_TRACE_POINTS
 #include "trace.h"
@@ -5946,6 +5947,43 @@ static void tsc_khz_changed(void *data)
 	__this_cpu_write(cpu_tsc_khz, khz);
 }
 
+void kvm_hyperv_tsc_notifier(void)
+{
+#ifdef CONFIG_X86_64
+	struct kvm *kvm;
+	struct kvm_vcpu *vcpu;
+	int cpu;
+
+	spin_lock(&kvm_lock);
+	list_for_each_entry(kvm, &vm_list, vm_list)
+		kvm_make_mclock_inprogress_request(kvm);
+
+	hyperv_stop_tsc_emulation();
+
+	/* TSC frequency always matches when on Hyper-V */
+	for_each_present_cpu(cpu)
+		per_cpu(cpu_tsc_khz, cpu) = tsc_khz;
+	kvm_max_guest_tsc_khz = tsc_khz;
+
+	list_for_each_entry(kvm, &vm_list, vm_list) {
+		struct kvm_arch *ka = &kvm->arch;
+
+		spin_lock(&ka->pvclock_gtod_sync_lock);
+
+		pvclock_update_vm_gtod_copy(kvm);
+
+		kvm_for_each_vcpu(cpu, vcpu, kvm)
+			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
+
+		kvm_for_each_vcpu(cpu, vcpu, kvm)
+			kvm_clear_request(KVM_REQ_MCLOCK_INPROGRESS, vcpu);
+
+		spin_unlock(&ka->pvclock_gtod_sync_lock);
+	}
+	spin_unlock(&kvm_lock);
+#endif
+}
+
 static int kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long val,
 				     void *data)
 {
@@ -6231,6 +6269,9 @@ int kvm_arch_init(void *opaque)
 	kvm_lapic_init();
 #ifdef CONFIG_X86_64
 	pvclock_gtod_register_notifier(&pvclock_gtod_notifier);
+
+	if (x86_hyper_type == X86_HYPER_MS_HYPERV)
+		register_hv_tsc_update(kvm_hyperv_tsc_notifier);
 #endif
 
 	return 0;
@@ -6243,6 +6284,10 @@ int kvm_arch_init(void *opaque)
 
 void kvm_arch_exit(void)
 {
+#ifdef CONFIG_X86_64
+	if (x86_hyper_type == X86_HYPER_MS_HYPERV)
+		unregister_hv_tsc_update();
+#endif
 	kvm_lapic_exit();
 	perf_unregister_guest_info_callbacks(&kvm_guest_cbs);
 
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 6/6] x86/kvm: support Hyper-V reenlightenment
  2017-12-08 10:50 ` [PATCH 6/6] x86/kvm: support Hyper-V reenlightenment Vitaly Kuznetsov
@ 2017-12-08 17:39   ` Roman Kagan
  2017-12-11  9:57       ` Vitaly Kuznetsov
  0 siblings, 1 reply; 24+ messages in thread
From: Roman Kagan @ 2017-12-08 17:39 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: kvm, x86, Paolo Bonzini, Radim Krčmář,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Michael Kelley (EOSG),
	Andy Lutomirski, Mohammed Gamal, Cathy Avery, linux-kernel,
	devel

On Fri, Dec 08, 2017 at 11:50:00AM +0100, Vitaly Kuznetsov wrote:
> When we run nested KVM on Hyper-V guests we need to update masterclocks for
> all guests when L1 migrates to a host with different TSC frequency.
> Implement the procedure in the following way:
> - Pause all guests.
> - Tell our host (Hyper-V) to stop emulating TSC accesses.
> - Update our gtod copy, recompute clocks.
> - Unpause all guests.
> 
> This is somewhat similar to cpufreq but we have two important differences:
> we can only disable TSC emulation globally (on all CPUs) and we don't know
> the new TSC frequency until we turn the emulation off so we can't
> 'prepare' ourselves to the event.
> 
> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> ---
>  arch/x86/kvm/x86.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 45 insertions(+)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 96e04a0cb921..04d90712ffd2 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -68,6 +68,7 @@
>  #include <asm/div64.h>
>  #include <asm/irq_remapping.h>
>  #include <asm/mshyperv.h>
> +#include <asm/hypervisor.h>
>  
>  #define CREATE_TRACE_POINTS
>  #include "trace.h"
> @@ -5946,6 +5947,43 @@ static void tsc_khz_changed(void *data)
>  	__this_cpu_write(cpu_tsc_khz, khz);
>  }
>  
> +void kvm_hyperv_tsc_notifier(void)
> +{
> +#ifdef CONFIG_X86_64
> +	struct kvm *kvm;
> +	struct kvm_vcpu *vcpu;
> +	int cpu;
> +
> +	spin_lock(&kvm_lock);
> +	list_for_each_entry(kvm, &vm_list, vm_list)
> +		kvm_make_mclock_inprogress_request(kvm);
> +
> +	hyperv_stop_tsc_emulation();
> +
> +	/* TSC frequency always matches when on Hyper-V */
> +	for_each_present_cpu(cpu)
> +		per_cpu(cpu_tsc_khz, cpu) = tsc_khz;
> +	kvm_max_guest_tsc_khz = tsc_khz;
> +
> +	list_for_each_entry(kvm, &vm_list, vm_list) {
> +		struct kvm_arch *ka = &kvm->arch;
> +
> +		spin_lock(&ka->pvclock_gtod_sync_lock);
> +
> +		pvclock_update_vm_gtod_copy(kvm);
> +
> +		kvm_for_each_vcpu(cpu, vcpu, kvm)
> +			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
> +
> +		kvm_for_each_vcpu(cpu, vcpu, kvm)
> +			kvm_clear_request(KVM_REQ_MCLOCK_INPROGRESS, vcpu);
> +
> +		spin_unlock(&ka->pvclock_gtod_sync_lock);
> +	}
> +	spin_unlock(&kvm_lock);

Can't you skip all this if the tsc frequency hasn't changed (which
should probably be the case when the CPU supports tsc frequency
scaling)?

Roman.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/6] x86/hyper-v: reenlightenment notifications support
  2017-12-08 10:49 ` [PATCH 3/6] x86/hyper-v: reenlightenment notifications support Vitaly Kuznetsov
@ 2017-12-08 18:10   ` Roman Kagan
  2017-12-11  9:56     ` Vitaly Kuznetsov
  2017-12-13  0:50   ` Michael Kelley (EOSG)
  1 sibling, 1 reply; 24+ messages in thread
From: Roman Kagan @ 2017-12-08 18:10 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: kvm, x86, Paolo Bonzini, Radim Krčmář,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Michael Kelley (EOSG),
	Andy Lutomirski, Mohammed Gamal, Cathy Avery, linux-kernel,
	devel

On Fri, Dec 08, 2017 at 11:49:57AM +0100, Vitaly Kuznetsov wrote:
> Hyper-V supports Live Migration notification. This is supposed to be used
> in conjunction with TSC emulation: when we are migrated to a host with
> different TSC frequency for some short period host emulates our accesses
> to TSC and sends us an interrupt to notify about the event. When we're
> done updating everything we can disable TSC emulation and everything will
> start working fast again.
> 
> We didn't need these notifications before as Hyper-V guests are not
> supposed to use TSC as a clocksource: in Linux we even mark it as unstable
> on boot. Guests normally use 'tsc page' clocksouce and host updates its
> values on migrations automatically.
> 
> Things change when we want to run nested virtualization: even when we pass
> through PV clocksources (kvm-clock or tsc page) to our guests we need to
> know TSC frequency and when it changes.
> 
> Hyper-V Top Level Functional Specification (as of v5.0b) wrongly specifies
> EAX:BIT(12) of CPUID:0x40000009 as the feature identification bit. The
> right one to check is EAX:BIT(13) of CPUID:0x40000003. I was assured that
> the fix in on the way.
> 
> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> ---
> RFC -> v1:
> - #include <asm/apic.h> [kbuild test robot]
> - use div64_u64() [kbuild test robot]
> - DECLARE_WORK -> DECLARE_DELAYED_WORK as testing showed there's some bug
>   in Hyper-V hypervisor and disabling emulation after receiving interrupt
>   may screw TSC counters.

Looks kinda fragile...

> ---
>  arch/x86/entry/entry_64.S          |  4 +++
>  arch/x86/hyperv/hv_init.c          | 71 ++++++++++++++++++++++++++++++++++++++
>  arch/x86/include/asm/entry_arch.h  |  4 +++
>  arch/x86/include/asm/hw_irq.h      |  1 +
>  arch/x86/include/asm/irq_vectors.h |  7 +++-
>  arch/x86/include/asm/mshyperv.h    |  8 +++++
>  arch/x86/include/uapi/asm/hyperv.h | 27 +++++++++++++++
>  arch/x86/kernel/idt.c              |  3 ++
>  8 files changed, 124 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index f81d50d7ceac..a32730b260bc 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -826,6 +826,10 @@ apicinterrupt SPURIOUS_APIC_VECTOR		spurious_interrupt		smp_spurious_interrupt
>  apicinterrupt IRQ_WORK_VECTOR			irq_work_interrupt		smp_irq_work_interrupt
>  #endif
>  
> +#if IS_ENABLED(CONFIG_HYPERV)
> +apicinterrupt HYPERV_REENLIGHTENMENT_VECTOR	hyperv_reenlightenment_intr	smp_hyperv_reenlightenment_intr
> +#endif
> +
>  /*
>   * Exception entry points.
>   */
> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index 1a6c63f721bc..bb62839ede81 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -18,6 +18,7 @@
>   */
>  
>  #include <linux/types.h>
> +#include <asm/apic.h>
>  #include <asm/hypervisor.h>
>  #include <asm/hyperv.h>
>  #include <asm/mshyperv.h>
> @@ -102,6 +103,76 @@ static int hv_cpu_init(unsigned int cpu)
>  	return 0;
>  }
>  
> +static void (*hv_reenlightenment_cb)(void);
> +
> +static void hv_reenlightenment_notify(struct work_struct *dummy)
> +{
> +	if (hv_reenlightenment_cb)
> +		hv_reenlightenment_cb();
> +}
> +static DECLARE_DELAYED_WORK(hv_reenlightenment_work, hv_reenlightenment_notify);
> +
> +void hyperv_stop_tsc_emulation(void)
> +{
> +	u64 freq;
> +	struct hv_tsc_emulation_status emu_status;
> +
> +	rdmsrl(HV_X64_MSR_TSC_EMULATION_STATUS, *(u64 *)&emu_status);
> +	emu_status.inprogress = 0;
> +	wrmsrl(HV_X64_MSR_TSC_EMULATION_STATUS, *(u64 *)&emu_status);
> +
> +	rdmsrl(HV_X64_MSR_TSC_FREQUENCY, freq);

IIRC the availability of this msr is not guaranteed (I guess in reality
it's present if the reenlightenment is supported, but I'd rather check).

> +	tsc_khz = div64_u64(freq, 1000);
> +}
> +EXPORT_SYMBOL_GPL(hyperv_stop_tsc_emulation);
> +
> +void register_hv_tsc_update(void (*cb)(void))
> +{

The function name seems unfortunate.  IMHO such a name suggests
registering a callback on a notifier chain (rather than unconditionally
replacing the old callback), and having no other side effects.

> +	struct hv_reenlightenment_control re_ctrl = {
> +		.vector = HYPERV_REENLIGHTENMENT_VECTOR,
> +		.enabled = 1,
> +		.target_vp = hv_vp_index[smp_processor_id()]
> +	};
> +	struct hv_tsc_emulation_control emu_ctrl = {.enabled = 1};
> +
> +	if (!(ms_hyperv.features & HV_X64_ACCESS_REENLIGHTENMENT))
> +		return;

What happens then?  L2 guests keep running with their clocks ticking at
a different speed?

> +
> +	hv_reenlightenment_cb = cb;
> +
> +	/* Make sure callback is registered before we write to MSRs */
> +	wmb();
> +
> +	wrmsrl(HV_X64_MSR_REENLIGHTENMENT_CONTROL, *((u64 *)&re_ctrl));
> +	wrmsrl(HV_X64_MSR_TSC_EMULATION_CONTROL, *((u64 *)&emu_ctrl));
> +}
> +EXPORT_SYMBOL_GPL(register_hv_tsc_update);
> +
> +void unregister_hv_tsc_update(void)
> +{
> +	struct hv_reenlightenment_control re_ctrl;
> +
> +	if (!(ms_hyperv.features & HV_X64_ACCESS_REENLIGHTENMENT))
> +		return;
> +
> +	rdmsrl(HV_X64_MSR_REENLIGHTENMENT_CONTROL, *(u64 *)&re_ctrl);
> +	re_ctrl.enabled = 0;
> +	wrmsrl(HV_X64_MSR_REENLIGHTENMENT_CONTROL, *(u64 *)&re_ctrl);
> +
> +	hv_reenlightenment_cb = NULL;
> +}
> +EXPORT_SYMBOL_GPL(unregister_hv_tsc_update);
> +
> +asmlinkage __visible void
> +__irq_entry smp_hyperv_reenlightenment_intr(struct pt_regs *regs)
> +{
> +	entering_ack_irq();
> +
> +	schedule_delayed_work(&hv_reenlightenment_work, HZ/10);
> +
> +	exiting_irq();
> +}
> +
>  /*
>   * This function is to be invoked early in the boot sequence after the
>   * hypervisor has been detected.
> diff --git a/arch/x86/include/asm/entry_arch.h b/arch/x86/include/asm/entry_arch.h
> index 416422762845..eb936cc49b62 100644
> --- a/arch/x86/include/asm/entry_arch.h
> +++ b/arch/x86/include/asm/entry_arch.h
> @@ -54,3 +54,7 @@ BUILD_INTERRUPT(threshold_interrupt,THRESHOLD_APIC_VECTOR)
>  BUILD_INTERRUPT(deferred_error_interrupt, DEFERRED_ERROR_VECTOR)
>  #endif
>  #endif
> +
> +#if IS_ENABLED(CONFIG_HYPERV)
> +BUILD_INTERRUPT(hyperv_reenlightenment_intr, HYPERV_REENLIGHTENMENT_VECTOR)
> +#endif
> diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
> index 2851077b6051..c65193dac7d9 100644
> --- a/arch/x86/include/asm/hw_irq.h
> +++ b/arch/x86/include/asm/hw_irq.h
> @@ -36,6 +36,7 @@ extern asmlinkage void kvm_posted_intr_wakeup_ipi(void);
>  extern asmlinkage void kvm_posted_intr_nested_ipi(void);
>  extern asmlinkage void error_interrupt(void);
>  extern asmlinkage void irq_work_interrupt(void);
> +extern asmlinkage void hyperv_reenlightenment_intr(void);
>  
>  extern asmlinkage void spurious_interrupt(void);
>  extern asmlinkage void thermal_interrupt(void);
> diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
> index 67421f649cfa..e71c1120426b 100644
> --- a/arch/x86/include/asm/irq_vectors.h
> +++ b/arch/x86/include/asm/irq_vectors.h
> @@ -103,7 +103,12 @@
>  #endif
>  
>  #define MANAGED_IRQ_SHUTDOWN_VECTOR	0xef
> -#define LOCAL_TIMER_VECTOR		0xee
> +
> +#if IS_ENABLED(CONFIG_HYPERV)
> +#define HYPERV_REENLIGHTENMENT_VECTOR	0xee
> +#endif
> +
> +#define LOCAL_TIMER_VECTOR		0xed
>  
>  #define NR_VECTORS			 256
>  
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index a0b34994f453..43164b097585 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -314,11 +314,19 @@ void hyper_alloc_mmu(void);
>  void hyperv_report_panic(struct pt_regs *regs, long err);
>  bool hv_is_hypercall_page_setup(void);
>  void hyperv_cleanup(void);
> +
> +asmlinkage void smp_hyperv_reenlightenment_intr(struct pt_regs *regs);
> +void register_hv_tsc_update(void (*cb)(void));
> +void unregister_hv_tsc_update(void);
> +void hyperv_stop_tsc_emulation(void);
>  #else /* CONFIG_HYPERV */
>  static inline void hyperv_init(void) {}
>  static inline bool hv_is_hypercall_page_setup(void) { return false; }
>  static inline void hyperv_cleanup(void) {}
>  static inline void hyperv_setup_mmu_ops(void) {}
> +static inline void register_hv_tsc_update(void (*cb)(void)) {}
> +static inline void unregister_hv_tsc_update(void) {}
> +static inline void hyperv_stop_tsc_emulation(void) {};
>  #endif /* CONFIG_HYPERV */
>  
>  #ifdef CONFIG_HYPERV_TSCPAGE
> diff --git a/arch/x86/include/uapi/asm/hyperv.h b/arch/x86/include/uapi/asm/hyperv.h
> index 1a5bfead93b4..197c2e6c7376 100644
> --- a/arch/x86/include/uapi/asm/hyperv.h
> +++ b/arch/x86/include/uapi/asm/hyperv.h
> @@ -40,6 +40,9 @@
>   */
>  #define HV_X64_ACCESS_FREQUENCY_MSRS		(1 << 11)
>  
> +/* AccessReenlightenmentControls privilege */
> +#define HV_X64_ACCESS_REENLIGHTENMENT		BIT(13)
> +
>  /*
>   * Basic SynIC MSRs (HV_X64_MSR_SCONTROL through HV_X64_MSR_EOM
>   * and HV_X64_MSR_SINT0 through HV_X64_MSR_SINT15) available
> @@ -234,6 +237,30 @@
>  #define HV_X64_MSR_CRASH_PARAMS		\
>  		(1 + (HV_X64_MSR_CRASH_P4 - HV_X64_MSR_CRASH_P0))
>  
> +/* TSC emulation after migration */
> +#define HV_X64_MSR_REENLIGHTENMENT_CONTROL	0x40000106
> +
> +struct hv_reenlightenment_control {
> +	u64 vector:8;
> +	u64 reserved1:8;
> +	u64 enabled:1;
> +	u64 reserved2:15;
> +	u64 target_vp:32;
> +};
> +
> +#define HV_X64_MSR_TSC_EMULATION_CONTROL	0x40000107
> +#define HV_X64_MSR_TSC_EMULATION_STATUS		0x40000108
> +
> +struct hv_tsc_emulation_control {
> +	u64 enabled:1;
> +	u64 reserved:63;
> +};
> +
> +struct hv_tsc_emulation_status {
> +	u64 inprogress:1;
> +	u64 reserved:63;
> +};
> +
>  #define HV_X64_MSR_HYPERCALL_ENABLE		0x00000001
>  #define HV_X64_MSR_HYPERCALL_PAGE_ADDRESS_SHIFT	12
>  #define HV_X64_MSR_HYPERCALL_PAGE_ADDRESS_MASK	\
> diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
> index d985cef3984f..5b8512d48aa3 100644
> --- a/arch/x86/kernel/idt.c
> +++ b/arch/x86/kernel/idt.c
> @@ -140,6 +140,9 @@ static const __initdata struct idt_data apic_idts[] = {
>  # ifdef CONFIG_IRQ_WORK
>  	INTG(IRQ_WORK_VECTOR,		irq_work_interrupt),
>  # endif
> +#if IS_ENABLED(CONFIG_HYPERV)
> +	INTG(HYPERV_REENLIGHTENMENT_VECTOR, hyperv_reenlightenment_intr),
> +#endif
>  	INTG(SPURIOUS_APIC_VECTOR,	spurious_interrupt),
>  	INTG(ERROR_APIC_VECTOR,		error_interrupt),
>  #endif


Roman.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/6] x86/hyper-v: reenlightenment notifications support
  2017-12-08 18:10   ` Roman Kagan
@ 2017-12-11  9:56     ` Vitaly Kuznetsov
  2017-12-11 17:10       ` Roman Kagan
  0 siblings, 1 reply; 24+ messages in thread
From: Vitaly Kuznetsov @ 2017-12-11  9:56 UTC (permalink / raw)
  To: Roman Kagan
  Cc: kvm, x86, Paolo Bonzini, Radim Krčmář,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Michael Kelley (EOSG),
	Andy Lutomirski, Mohammed Gamal, Cathy Avery, linux-kernel,
	devel

Roman Kagan <rkagan@virtuozzo.com> writes:

> On Fri, Dec 08, 2017 at 11:49:57AM +0100, Vitaly Kuznetsov wrote:
>> Hyper-V supports Live Migration notification. This is supposed to be used
>> in conjunction with TSC emulation: when we are migrated to a host with
>> different TSC frequency for some short period host emulates our accesses
>> to TSC and sends us an interrupt to notify about the event. When we're
>> done updating everything we can disable TSC emulation and everything will
>> start working fast again.
>> 
>> We didn't need these notifications before as Hyper-V guests are not
>> supposed to use TSC as a clocksource: in Linux we even mark it as unstable
>> on boot. Guests normally use 'tsc page' clocksouce and host updates its
>> values on migrations automatically.
>> 
>> Things change when we want to run nested virtualization: even when we pass
>> through PV clocksources (kvm-clock or tsc page) to our guests we need to
>> know TSC frequency and when it changes.
>> 
>> Hyper-V Top Level Functional Specification (as of v5.0b) wrongly specifies
>> EAX:BIT(12) of CPUID:0x40000009 as the feature identification bit. The
>> right one to check is EAX:BIT(13) of CPUID:0x40000003. I was assured that
>> the fix in on the way.
>> 
>> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
>> ---
>> RFC -> v1:
>> - #include <asm/apic.h> [kbuild test robot]
>> - use div64_u64() [kbuild test robot]
>> - DECLARE_WORK -> DECLARE_DELAYED_WORK as testing showed there's some bug
>>   in Hyper-V hypervisor and disabling emulation after receiving interrupt
>>   may screw TSC counters.
>
> Looks kinda fragile...

I believe this is temporary and Microsoft will fix things up host side.

>
>> ---
>>  arch/x86/entry/entry_64.S          |  4 +++
>>  arch/x86/hyperv/hv_init.c          | 71 ++++++++++++++++++++++++++++++++++++++
>>  arch/x86/include/asm/entry_arch.h  |  4 +++
>>  arch/x86/include/asm/hw_irq.h      |  1 +
>>  arch/x86/include/asm/irq_vectors.h |  7 +++-
>>  arch/x86/include/asm/mshyperv.h    |  8 +++++
>>  arch/x86/include/uapi/asm/hyperv.h | 27 +++++++++++++++
>>  arch/x86/kernel/idt.c              |  3 ++
>>  8 files changed, 124 insertions(+), 1 deletion(-)
>> 
>> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
>> index f81d50d7ceac..a32730b260bc 100644
>> --- a/arch/x86/entry/entry_64.S
>> +++ b/arch/x86/entry/entry_64.S
>> @@ -826,6 +826,10 @@ apicinterrupt SPURIOUS_APIC_VECTOR		spurious_interrupt		smp_spurious_interrupt
>>  apicinterrupt IRQ_WORK_VECTOR			irq_work_interrupt		smp_irq_work_interrupt
>>  #endif
>>  
>> +#if IS_ENABLED(CONFIG_HYPERV)
>> +apicinterrupt HYPERV_REENLIGHTENMENT_VECTOR	hyperv_reenlightenment_intr	smp_hyperv_reenlightenment_intr
>> +#endif
>> +
>>  /*
>>   * Exception entry points.
>>   */
>> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
>> index 1a6c63f721bc..bb62839ede81 100644
>> --- a/arch/x86/hyperv/hv_init.c
>> +++ b/arch/x86/hyperv/hv_init.c
>> @@ -18,6 +18,7 @@
>>   */
>>  
>>  #include <linux/types.h>
>> +#include <asm/apic.h>
>>  #include <asm/hypervisor.h>
>>  #include <asm/hyperv.h>
>>  #include <asm/mshyperv.h>
>> @@ -102,6 +103,76 @@ static int hv_cpu_init(unsigned int cpu)
>>  	return 0;
>>  }
>>  
>> +static void (*hv_reenlightenment_cb)(void);
>> +
>> +static void hv_reenlightenment_notify(struct work_struct *dummy)
>> +{
>> +	if (hv_reenlightenment_cb)
>> +		hv_reenlightenment_cb();
>> +}
>> +static DECLARE_DELAYED_WORK(hv_reenlightenment_work, hv_reenlightenment_notify);
>> +
>> +void hyperv_stop_tsc_emulation(void)
>> +{
>> +	u64 freq;
>> +	struct hv_tsc_emulation_status emu_status;
>> +
>> +	rdmsrl(HV_X64_MSR_TSC_EMULATION_STATUS, *(u64 *)&emu_status);
>> +	emu_status.inprogress = 0;
>> +	wrmsrl(HV_X64_MSR_TSC_EMULATION_STATUS, *(u64 *)&emu_status);
>> +
>> +	rdmsrl(HV_X64_MSR_TSC_FREQUENCY, freq);
>
> IIRC the availability of this msr is not guaranteed (I guess in reality
> it's present if the reenlightenment is supported, but I'd rather check).
>

Will do.

>> +	tsc_khz = div64_u64(freq, 1000);
>> +}
>> +EXPORT_SYMBOL_GPL(hyperv_stop_tsc_emulation);
>> +
>> +void register_hv_tsc_update(void (*cb)(void))
>> +{
>
> The function name seems unfortunate.  IMHO such a name suggests
> registering a callback on a notifier chain (rather than unconditionally
> replacing the old callback), and having no other side effects.

I see, any suggestion? register_hv_reenlightenment_cb? register_hv_tscchange_cb?

>
>> +	struct hv_reenlightenment_control re_ctrl = {
>> +		.vector = HYPERV_REENLIGHTENMENT_VECTOR,
>> +		.enabled = 1,
>> +		.target_vp = hv_vp_index[smp_processor_id()]
>> +	};
>> +	struct hv_tsc_emulation_control emu_ctrl = {.enabled = 1};
>> +
>> +	if (!(ms_hyperv.features & HV_X64_ACCESS_REENLIGHTENMENT))
>> +		return;
>
> What happens then?  L2 guests keep running with their clocks ticking at
> a different speed?
>

In reallity this never happens -- in case nested virtualization is
supported reenlightenment is also available. In theory, L0 can emulate
TSC acceess for forever after migration.

>> +
>> +	hv_reenlightenment_cb = cb;
>> +
>> +	/* Make sure callback is registered before we write to MSRs */
>> +	wmb();
>> +
>> +	wrmsrl(HV_X64_MSR_REENLIGHTENMENT_CONTROL, *((u64 *)&re_ctrl));
>> +	wrmsrl(HV_X64_MSR_TSC_EMULATION_CONTROL, *((u64 *)&emu_ctrl));
>> +}
>> +EXPORT_SYMBOL_GPL(register_hv_tsc_update);
>> +
>> +void unregister_hv_tsc_update(void)
>> +{
>> +	struct hv_reenlightenment_control re_ctrl;
>> +
>> +	if (!(ms_hyperv.features & HV_X64_ACCESS_REENLIGHTENMENT))
>> +		return;
>> +
>> +	rdmsrl(HV_X64_MSR_REENLIGHTENMENT_CONTROL, *(u64 *)&re_ctrl);
>> +	re_ctrl.enabled = 0;
>> +	wrmsrl(HV_X64_MSR_REENLIGHTENMENT_CONTROL, *(u64 *)&re_ctrl);
>> +
>> +	hv_reenlightenment_cb = NULL;
>> +}
>> +EXPORT_SYMBOL_GPL(unregister_hv_tsc_update);
>> +
>> +asmlinkage __visible void
>> +__irq_entry smp_hyperv_reenlightenment_intr(struct pt_regs *regs)
>> +{
>> +	entering_ack_irq();
>> +
>> +	schedule_delayed_work(&hv_reenlightenment_work, HZ/10);
>> +
>> +	exiting_irq();
>> +}
>> +
>>  /*
>>   * This function is to be invoked early in the boot sequence after the
>>   * hypervisor has been detected.
>> diff --git a/arch/x86/include/asm/entry_arch.h b/arch/x86/include/asm/entry_arch.h
>> index 416422762845..eb936cc49b62 100644
>> --- a/arch/x86/include/asm/entry_arch.h
>> +++ b/arch/x86/include/asm/entry_arch.h
>> @@ -54,3 +54,7 @@ BUILD_INTERRUPT(threshold_interrupt,THRESHOLD_APIC_VECTOR)
>>  BUILD_INTERRUPT(deferred_error_interrupt, DEFERRED_ERROR_VECTOR)
>>  #endif
>>  #endif
>> +
>> +#if IS_ENABLED(CONFIG_HYPERV)
>> +BUILD_INTERRUPT(hyperv_reenlightenment_intr, HYPERV_REENLIGHTENMENT_VECTOR)
>> +#endif
>> diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
>> index 2851077b6051..c65193dac7d9 100644
>> --- a/arch/x86/include/asm/hw_irq.h
>> +++ b/arch/x86/include/asm/hw_irq.h
>> @@ -36,6 +36,7 @@ extern asmlinkage void kvm_posted_intr_wakeup_ipi(void);
>>  extern asmlinkage void kvm_posted_intr_nested_ipi(void);
>>  extern asmlinkage void error_interrupt(void);
>>  extern asmlinkage void irq_work_interrupt(void);
>> +extern asmlinkage void hyperv_reenlightenment_intr(void);
>>  
>>  extern asmlinkage void spurious_interrupt(void);
>>  extern asmlinkage void thermal_interrupt(void);
>> diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
>> index 67421f649cfa..e71c1120426b 100644
>> --- a/arch/x86/include/asm/irq_vectors.h
>> +++ b/arch/x86/include/asm/irq_vectors.h
>> @@ -103,7 +103,12 @@
>>  #endif
>>  
>>  #define MANAGED_IRQ_SHUTDOWN_VECTOR	0xef
>> -#define LOCAL_TIMER_VECTOR		0xee
>> +
>> +#if IS_ENABLED(CONFIG_HYPERV)
>> +#define HYPERV_REENLIGHTENMENT_VECTOR	0xee
>> +#endif
>> +
>> +#define LOCAL_TIMER_VECTOR		0xed
>>  
>>  #define NR_VECTORS			 256
>>  
>> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
>> index a0b34994f453..43164b097585 100644
>> --- a/arch/x86/include/asm/mshyperv.h
>> +++ b/arch/x86/include/asm/mshyperv.h
>> @@ -314,11 +314,19 @@ void hyper_alloc_mmu(void);
>>  void hyperv_report_panic(struct pt_regs *regs, long err);
>>  bool hv_is_hypercall_page_setup(void);
>>  void hyperv_cleanup(void);
>> +
>> +asmlinkage void smp_hyperv_reenlightenment_intr(struct pt_regs *regs);
>> +void register_hv_tsc_update(void (*cb)(void));
>> +void unregister_hv_tsc_update(void);
>> +void hyperv_stop_tsc_emulation(void);
>>  #else /* CONFIG_HYPERV */
>>  static inline void hyperv_init(void) {}
>>  static inline bool hv_is_hypercall_page_setup(void) { return false; }
>>  static inline void hyperv_cleanup(void) {}
>>  static inline void hyperv_setup_mmu_ops(void) {}
>> +static inline void register_hv_tsc_update(void (*cb)(void)) {}
>> +static inline void unregister_hv_tsc_update(void) {}
>> +static inline void hyperv_stop_tsc_emulation(void) {};
>>  #endif /* CONFIG_HYPERV */
>>  
>>  #ifdef CONFIG_HYPERV_TSCPAGE
>> diff --git a/arch/x86/include/uapi/asm/hyperv.h b/arch/x86/include/uapi/asm/hyperv.h
>> index 1a5bfead93b4..197c2e6c7376 100644
>> --- a/arch/x86/include/uapi/asm/hyperv.h
>> +++ b/arch/x86/include/uapi/asm/hyperv.h
>> @@ -40,6 +40,9 @@
>>   */
>>  #define HV_X64_ACCESS_FREQUENCY_MSRS		(1 << 11)
>>  
>> +/* AccessReenlightenmentControls privilege */
>> +#define HV_X64_ACCESS_REENLIGHTENMENT		BIT(13)
>> +
>>  /*
>>   * Basic SynIC MSRs (HV_X64_MSR_SCONTROL through HV_X64_MSR_EOM
>>   * and HV_X64_MSR_SINT0 through HV_X64_MSR_SINT15) available
>> @@ -234,6 +237,30 @@
>>  #define HV_X64_MSR_CRASH_PARAMS		\
>>  		(1 + (HV_X64_MSR_CRASH_P4 - HV_X64_MSR_CRASH_P0))
>>  
>> +/* TSC emulation after migration */
>> +#define HV_X64_MSR_REENLIGHTENMENT_CONTROL	0x40000106
>> +
>> +struct hv_reenlightenment_control {
>> +	u64 vector:8;
>> +	u64 reserved1:8;
>> +	u64 enabled:1;
>> +	u64 reserved2:15;
>> +	u64 target_vp:32;
>> +};
>> +
>> +#define HV_X64_MSR_TSC_EMULATION_CONTROL	0x40000107
>> +#define HV_X64_MSR_TSC_EMULATION_STATUS		0x40000108
>> +
>> +struct hv_tsc_emulation_control {
>> +	u64 enabled:1;
>> +	u64 reserved:63;
>> +};
>> +
>> +struct hv_tsc_emulation_status {
>> +	u64 inprogress:1;
>> +	u64 reserved:63;
>> +};
>> +
>>  #define HV_X64_MSR_HYPERCALL_ENABLE		0x00000001
>>  #define HV_X64_MSR_HYPERCALL_PAGE_ADDRESS_SHIFT	12
>>  #define HV_X64_MSR_HYPERCALL_PAGE_ADDRESS_MASK	\
>> diff --git a/arch/x86/kernel/idt.c b/arch/x86/kernel/idt.c
>> index d985cef3984f..5b8512d48aa3 100644
>> --- a/arch/x86/kernel/idt.c
>> +++ b/arch/x86/kernel/idt.c
>> @@ -140,6 +140,9 @@ static const __initdata struct idt_data apic_idts[] = {
>>  # ifdef CONFIG_IRQ_WORK
>>  	INTG(IRQ_WORK_VECTOR,		irq_work_interrupt),
>>  # endif
>> +#if IS_ENABLED(CONFIG_HYPERV)
>> +	INTG(HYPERV_REENLIGHTENMENT_VECTOR, hyperv_reenlightenment_intr),
>> +#endif
>>  	INTG(SPURIOUS_APIC_VECTOR,	spurious_interrupt),
>>  	INTG(ERROR_APIC_VECTOR,		error_interrupt),
>>  #endif
>
> Roman.

-- 
  Vitaly

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 6/6] x86/kvm: support Hyper-V reenlightenment
  2017-12-08 17:39   ` Roman Kagan
@ 2017-12-11  9:57       ` Vitaly Kuznetsov
  0 siblings, 0 replies; 24+ messages in thread
From: Vitaly Kuznetsov @ 2017-12-11  9:57 UTC (permalink / raw)
  To: Roman Kagan
  Cc: kvm, x86, Paolo Bonzini, Radim Krčmář,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Michael Kelley (EOSG),
	Andy Lutomirski, Mohammed Gamal, Cathy Avery, linux-kernel,
	devel

Roman Kagan <rkagan@virtuozzo.com> writes:

> On Fri, Dec 08, 2017 at 11:50:00AM +0100, Vitaly Kuznetsov wrote:
>> When we run nested KVM on Hyper-V guests we need to update masterclocks for
>> all guests when L1 migrates to a host with different TSC frequency.
>> Implement the procedure in the following way:
>> - Pause all guests.
>> - Tell our host (Hyper-V) to stop emulating TSC accesses.
>> - Update our gtod copy, recompute clocks.
>> - Unpause all guests.
>> 
>> This is somewhat similar to cpufreq but we have two important differences:
>> we can only disable TSC emulation globally (on all CPUs) and we don't know
>> the new TSC frequency until we turn the emulation off so we can't
>> 'prepare' ourselves to the event.
>> 
>> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
>> ---
>>  arch/x86/kvm/x86.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 45 insertions(+)
>> 
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 96e04a0cb921..04d90712ffd2 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -68,6 +68,7 @@
>>  #include <asm/div64.h>
>>  #include <asm/irq_remapping.h>
>>  #include <asm/mshyperv.h>
>> +#include <asm/hypervisor.h>
>>  
>>  #define CREATE_TRACE_POINTS
>>  #include "trace.h"
>> @@ -5946,6 +5947,43 @@ static void tsc_khz_changed(void *data)
>>  	__this_cpu_write(cpu_tsc_khz, khz);
>>  }
>>  
>> +void kvm_hyperv_tsc_notifier(void)
>> +{
>> +#ifdef CONFIG_X86_64
>> +	struct kvm *kvm;
>> +	struct kvm_vcpu *vcpu;
>> +	int cpu;
>> +
>> +	spin_lock(&kvm_lock);
>> +	list_for_each_entry(kvm, &vm_list, vm_list)
>> +		kvm_make_mclock_inprogress_request(kvm);
>> +
>> +	hyperv_stop_tsc_emulation();
>> +
>> +	/* TSC frequency always matches when on Hyper-V */
>> +	for_each_present_cpu(cpu)
>> +		per_cpu(cpu_tsc_khz, cpu) = tsc_khz;
>> +	kvm_max_guest_tsc_khz = tsc_khz;
>> +
>> +	list_for_each_entry(kvm, &vm_list, vm_list) {
>> +		struct kvm_arch *ka = &kvm->arch;
>> +
>> +		spin_lock(&ka->pvclock_gtod_sync_lock);
>> +
>> +		pvclock_update_vm_gtod_copy(kvm);
>> +
>> +		kvm_for_each_vcpu(cpu, vcpu, kvm)
>> +			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
>> +
>> +		kvm_for_each_vcpu(cpu, vcpu, kvm)
>> +			kvm_clear_request(KVM_REQ_MCLOCK_INPROGRESS, vcpu);
>> +
>> +		spin_unlock(&ka->pvclock_gtod_sync_lock);
>> +	}
>> +	spin_unlock(&kvm_lock);
>
> Can't you skip all this if the tsc frequency hasn't changed (which
> should probably be the case when the CPU supports tsc frequency
> scaling)?
>

The thing is that we don't know if it changed or not: only after
disabling TSC emulation we'll be able to read the new one from the host
and we need to do this with all VMs paused.

-- 
  Vitaly

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 6/6] x86/kvm: support Hyper-V reenlightenment
@ 2017-12-11  9:57       ` Vitaly Kuznetsov
  0 siblings, 0 replies; 24+ messages in thread
From: Vitaly Kuznetsov @ 2017-12-11  9:57 UTC (permalink / raw)
  To: Roman Kagan
  Cc: Stephen Hemminger, kvm, Radim Krčmář,
	Haiyang Zhang, x86, linux-kernel, devel, Michael Kelley (EOSG),
	Ingo Molnar, Andy Lutomirski, H. Peter Anvin, Paolo Bonzini,
	Thomas Gleixner, Mohammed Gamal

Roman Kagan <rkagan@virtuozzo.com> writes:

> On Fri, Dec 08, 2017 at 11:50:00AM +0100, Vitaly Kuznetsov wrote:
>> When we run nested KVM on Hyper-V guests we need to update masterclocks for
>> all guests when L1 migrates to a host with different TSC frequency.
>> Implement the procedure in the following way:
>> - Pause all guests.
>> - Tell our host (Hyper-V) to stop emulating TSC accesses.
>> - Update our gtod copy, recompute clocks.
>> - Unpause all guests.
>> 
>> This is somewhat similar to cpufreq but we have two important differences:
>> we can only disable TSC emulation globally (on all CPUs) and we don't know
>> the new TSC frequency until we turn the emulation off so we can't
>> 'prepare' ourselves to the event.
>> 
>> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
>> ---
>>  arch/x86/kvm/x86.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 45 insertions(+)
>> 
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 96e04a0cb921..04d90712ffd2 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -68,6 +68,7 @@
>>  #include <asm/div64.h>
>>  #include <asm/irq_remapping.h>
>>  #include <asm/mshyperv.h>
>> +#include <asm/hypervisor.h>
>>  
>>  #define CREATE_TRACE_POINTS
>>  #include "trace.h"
>> @@ -5946,6 +5947,43 @@ static void tsc_khz_changed(void *data)
>>  	__this_cpu_write(cpu_tsc_khz, khz);
>>  }
>>  
>> +void kvm_hyperv_tsc_notifier(void)
>> +{
>> +#ifdef CONFIG_X86_64
>> +	struct kvm *kvm;
>> +	struct kvm_vcpu *vcpu;
>> +	int cpu;
>> +
>> +	spin_lock(&kvm_lock);
>> +	list_for_each_entry(kvm, &vm_list, vm_list)
>> +		kvm_make_mclock_inprogress_request(kvm);
>> +
>> +	hyperv_stop_tsc_emulation();
>> +
>> +	/* TSC frequency always matches when on Hyper-V */
>> +	for_each_present_cpu(cpu)
>> +		per_cpu(cpu_tsc_khz, cpu) = tsc_khz;
>> +	kvm_max_guest_tsc_khz = tsc_khz;
>> +
>> +	list_for_each_entry(kvm, &vm_list, vm_list) {
>> +		struct kvm_arch *ka = &kvm->arch;
>> +
>> +		spin_lock(&ka->pvclock_gtod_sync_lock);
>> +
>> +		pvclock_update_vm_gtod_copy(kvm);
>> +
>> +		kvm_for_each_vcpu(cpu, vcpu, kvm)
>> +			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
>> +
>> +		kvm_for_each_vcpu(cpu, vcpu, kvm)
>> +			kvm_clear_request(KVM_REQ_MCLOCK_INPROGRESS, vcpu);
>> +
>> +		spin_unlock(&ka->pvclock_gtod_sync_lock);
>> +	}
>> +	spin_unlock(&kvm_lock);
>
> Can't you skip all this if the tsc frequency hasn't changed (which
> should probably be the case when the CPU supports tsc frequency
> scaling)?
>

The thing is that we don't know if it changed or not: only after
disabling TSC emulation we'll be able to read the new one from the host
and we need to do this with all VMs paused.

-- 
  Vitaly

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/6] x86/hyper-v: reenlightenment notifications support
  2017-12-11  9:56     ` Vitaly Kuznetsov
@ 2017-12-11 17:10       ` Roman Kagan
  2017-12-12  8:04           ` Vitaly Kuznetsov
  0 siblings, 1 reply; 24+ messages in thread
From: Roman Kagan @ 2017-12-11 17:10 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: kvm, x86, Paolo Bonzini, Radim Krčmář,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Michael Kelley (EOSG),
	Andy Lutomirski, Mohammed Gamal, Cathy Avery, linux-kernel,
	devel

On Mon, Dec 11, 2017 at 10:56:33AM +0100, Vitaly Kuznetsov wrote:
> Roman Kagan <rkagan@virtuozzo.com> writes:
> > On Fri, Dec 08, 2017 at 11:49:57AM +0100, Vitaly Kuznetsov wrote:
> >> +void register_hv_tsc_update(void (*cb)(void))
> >> +{
> >
> > The function name seems unfortunate.  IMHO such a name suggests
> > registering a callback on a notifier chain (rather than unconditionally
> > replacing the old callback), and having no other side effects.
> 
> I see, any suggestion? register_hv_reenlightenment_cb? register_hv_tscchange_cb?

IMHO arm_hv_reenlightenment_cb or arm_hv_tscchange_cb would be better,
but I'm not very good at giving descriptive names.

> 
> >
> >> +	struct hv_reenlightenment_control re_ctrl = {
> >> +		.vector = HYPERV_REENLIGHTENMENT_VECTOR,
> >> +		.enabled = 1,
> >> +		.target_vp = hv_vp_index[smp_processor_id()]
> >> +	};
> >> +	struct hv_tsc_emulation_control emu_ctrl = {.enabled = 1};
> >> +
> >> +	if (!(ms_hyperv.features & HV_X64_ACCESS_REENLIGHTENMENT))
> >> +		return;
> >
> > What happens then?  L2 guests keep running with their clocks ticking at
> > a different speed?
> >
> 
> In reallity this never happens -- in case nested virtualization is
> supported reenlightenment is also available. In theory, L0 can emulate
> TSC acceess for forever after migration.

I would think that Hyper-V only started rdtsc emulation if
TSC_EMULATION_CONTROL was turned on, which wouldn't happen here.

But indeed, normally this shouldn't be a problem.  It may make sense
just to issue a warning if the feature is unsupported, though.

Roman.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/6] x86/hyper-v: reenlightenment notifications support
  2017-12-11 17:10       ` Roman Kagan
@ 2017-12-12  8:04           ` Vitaly Kuznetsov
  0 siblings, 0 replies; 24+ messages in thread
From: Vitaly Kuznetsov @ 2017-12-12  8:04 UTC (permalink / raw)
  To: Roman Kagan
  Cc: kvm, x86, Paolo Bonzini, Radim Krčmář,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Michael Kelley (EOSG),
	Andy Lutomirski, Mohammed Gamal, Cathy Avery, linux-kernel,
	devel

Roman Kagan <rkagan@virtuozzo.com> writes:

> On Mon, Dec 11, 2017 at 10:56:33AM +0100, Vitaly Kuznetsov wrote:
>> Roman Kagan <rkagan@virtuozzo.com> writes:
>> > On Fri, Dec 08, 2017 at 11:49:57AM +0100, Vitaly Kuznetsov wrote:
>> >> +void register_hv_tsc_update(void (*cb)(void))
>> >> +{
>> >
>> > The function name seems unfortunate.  IMHO such a name suggests
>> > registering a callback on a notifier chain (rather than unconditionally
>> > replacing the old callback), and having no other side effects.
>> 
>> I see, any suggestion? register_hv_reenlightenment_cb? register_hv_tscchange_cb?
>
> IMHO arm_hv_reenlightenment_cb or arm_hv_tscchange_cb would be better,
> but I'm not very good at giving descriptive names.
>

I would probably try to avoid using 'arm' word in x86 code to assist
poor git-greppers :-) And we actually need a pair of functions
(enable/disable). I will probably go with

set_hv_tscchange_cb()
clear_hv_tscchange_cb()

in v2 unless there's a better suggestion.

>> 
>> >
>> >> +	struct hv_reenlightenment_control re_ctrl = {
>> >> +		.vector = HYPERV_REENLIGHTENMENT_VECTOR,
>> >> +		.enabled = 1,
>> >> +		.target_vp = hv_vp_index[smp_processor_id()]
>> >> +	};
>> >> +	struct hv_tsc_emulation_control emu_ctrl = {.enabled = 1};
>> >> +
>> >> +	if (!(ms_hyperv.features & HV_X64_ACCESS_REENLIGHTENMENT))
>> >> +		return;
>> >
>> > What happens then?  L2 guests keep running with their clocks ticking at
>> > a different speed?
>> >
>> 
>> In reallity this never happens -- in case nested virtualization is
>> supported reenlightenment is also available. In theory, L0 can emulate
>> TSC acceess for forever after migration.
>
> I would think that Hyper-V only started rdtsc emulation if
> TSC_EMULATION_CONTROL was turned on, which wouldn't happen here.
>

Yes, this is the de-facto behavior I observe with WS2016.

> But indeed, normally this shouldn't be a problem.  It may make sense
> just to issue a warning if the feature is unsupported, though.

Will do in v2, thanks.

-- 
  Vitaly

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/6] x86/hyper-v: reenlightenment notifications support
@ 2017-12-12  8:04           ` Vitaly Kuznetsov
  0 siblings, 0 replies; 24+ messages in thread
From: Vitaly Kuznetsov @ 2017-12-12  8:04 UTC (permalink / raw)
  To: Roman Kagan
  Cc: Stephen Hemminger, kvm, Radim Krčmář,
	Haiyang Zhang, x86, linux-kernel, devel, Michael Kelley (EOSG),
	Ingo Molnar, Andy Lutomirski, H. Peter Anvin, Paolo Bonzini,
	Thomas Gleixner, Mohammed Gamal

Roman Kagan <rkagan@virtuozzo.com> writes:

> On Mon, Dec 11, 2017 at 10:56:33AM +0100, Vitaly Kuznetsov wrote:
>> Roman Kagan <rkagan@virtuozzo.com> writes:
>> > On Fri, Dec 08, 2017 at 11:49:57AM +0100, Vitaly Kuznetsov wrote:
>> >> +void register_hv_tsc_update(void (*cb)(void))
>> >> +{
>> >
>> > The function name seems unfortunate.  IMHO such a name suggests
>> > registering a callback on a notifier chain (rather than unconditionally
>> > replacing the old callback), and having no other side effects.
>> 
>> I see, any suggestion? register_hv_reenlightenment_cb? register_hv_tscchange_cb?
>
> IMHO arm_hv_reenlightenment_cb or arm_hv_tscchange_cb would be better,
> but I'm not very good at giving descriptive names.
>

I would probably try to avoid using 'arm' word in x86 code to assist
poor git-greppers :-) And we actually need a pair of functions
(enable/disable). I will probably go with

set_hv_tscchange_cb()
clear_hv_tscchange_cb()

in v2 unless there's a better suggestion.

>> 
>> >
>> >> +	struct hv_reenlightenment_control re_ctrl = {
>> >> +		.vector = HYPERV_REENLIGHTENMENT_VECTOR,
>> >> +		.enabled = 1,
>> >> +		.target_vp = hv_vp_index[smp_processor_id()]
>> >> +	};
>> >> +	struct hv_tsc_emulation_control emu_ctrl = {.enabled = 1};
>> >> +
>> >> +	if (!(ms_hyperv.features & HV_X64_ACCESS_REENLIGHTENMENT))
>> >> +		return;
>> >
>> > What happens then?  L2 guests keep running with their clocks ticking at
>> > a different speed?
>> >
>> 
>> In reallity this never happens -- in case nested virtualization is
>> supported reenlightenment is also available. In theory, L0 can emulate
>> TSC acceess for forever after migration.
>
> I would think that Hyper-V only started rdtsc emulation if
> TSC_EMULATION_CONTROL was turned on, which wouldn't happen here.
>

Yes, this is the de-facto behavior I observe with WS2016.

> But indeed, normally this shouldn't be a problem.  It may make sense
> just to issue a warning if the feature is unsupported, though.

Will do in v2, thanks.

-- 
  Vitaly

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 6/6] x86/kvm: support Hyper-V reenlightenment
  2017-12-11  9:57       ` Vitaly Kuznetsov
@ 2017-12-12  8:17         ` Vitaly Kuznetsov
  -1 siblings, 0 replies; 24+ messages in thread
From: Vitaly Kuznetsov @ 2017-12-12  8:17 UTC (permalink / raw)
  To: Roman Kagan
  Cc: kvm, x86, Paolo Bonzini, Radim Krčmář,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Michael Kelley (EOSG),
	Andy Lutomirski, Mohammed Gamal, Cathy Avery, linux-kernel,
	devel

Vitaly Kuznetsov <vkuznets@redhat.com> writes:

> Roman Kagan <rkagan@virtuozzo.com> writes:
>
>> On Fri, Dec 08, 2017 at 11:50:00AM +0100, Vitaly Kuznetsov wrote:
>>> When we run nested KVM on Hyper-V guests we need to update masterclocks for
>>> all guests when L1 migrates to a host with different TSC frequency.
>>> Implement the procedure in the following way:
>>> - Pause all guests.
>>> - Tell our host (Hyper-V) to stop emulating TSC accesses.
>>> - Update our gtod copy, recompute clocks.
>>> - Unpause all guests.
>>> 
>>> This is somewhat similar to cpufreq but we have two important differences:
>>> we can only disable TSC emulation globally (on all CPUs) and we don't know
>>> the new TSC frequency until we turn the emulation off so we can't
>>> 'prepare' ourselves to the event.
>>> 
>>> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
>>> ---
>>>  arch/x86/kvm/x86.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 45 insertions(+)
>>> 
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index 96e04a0cb921..04d90712ffd2 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -68,6 +68,7 @@
>>>  #include <asm/div64.h>
>>>  #include <asm/irq_remapping.h>
>>>  #include <asm/mshyperv.h>
>>> +#include <asm/hypervisor.h>
>>>  
>>>  #define CREATE_TRACE_POINTS
>>>  #include "trace.h"
>>> @@ -5946,6 +5947,43 @@ static void tsc_khz_changed(void *data)
>>>  	__this_cpu_write(cpu_tsc_khz, khz);
>>>  }
>>>  
>>> +void kvm_hyperv_tsc_notifier(void)
>>> +{
>>> +#ifdef CONFIG_X86_64
>>> +	struct kvm *kvm;
>>> +	struct kvm_vcpu *vcpu;
>>> +	int cpu;
>>> +
>>> +	spin_lock(&kvm_lock);
>>> +	list_for_each_entry(kvm, &vm_list, vm_list)
>>> +		kvm_make_mclock_inprogress_request(kvm);
>>> +
>>> +	hyperv_stop_tsc_emulation();
>>> +
>>> +	/* TSC frequency always matches when on Hyper-V */
>>> +	for_each_present_cpu(cpu)
>>> +		per_cpu(cpu_tsc_khz, cpu) = tsc_khz;
>>> +	kvm_max_guest_tsc_khz = tsc_khz;
>>> +
>>> +	list_for_each_entry(kvm, &vm_list, vm_list) {
>>> +		struct kvm_arch *ka = &kvm->arch;
>>> +
>>> +		spin_lock(&ka->pvclock_gtod_sync_lock);
>>> +
>>> +		pvclock_update_vm_gtod_copy(kvm);
>>> +
>>> +		kvm_for_each_vcpu(cpu, vcpu, kvm)
>>> +			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
>>> +
>>> +		kvm_for_each_vcpu(cpu, vcpu, kvm)
>>> +			kvm_clear_request(KVM_REQ_MCLOCK_INPROGRESS, vcpu);
>>> +
>>> +		spin_unlock(&ka->pvclock_gtod_sync_lock);
>>> +	}
>>> +	spin_unlock(&kvm_lock);
>>
>> Can't you skip all this if the tsc frequency hasn't changed (which
>> should probably be the case when the CPU supports tsc frequency
>> scaling)?
>>
>
> The thing is that we don't know if it changed or not: only after
> disabling TSC emulation we'll be able to read the new one from the host
> and we need to do this with all VMs paused.

(having second thoughts here)

While we don't know if TSC frequency has changed or not, we can check
the emulation status before calling the callback and if TSC accesses are
not emulated omit the call. However, it seems that Hyper-V host (as of
WS2016) turns on emulation regardless of the TSC scaling presence.

I'll add emulation status check before issuing the callback in v2. The
change will go to PATCH3.

-- 
  Vitaly

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 6/6] x86/kvm: support Hyper-V reenlightenment
@ 2017-12-12  8:17         ` Vitaly Kuznetsov
  0 siblings, 0 replies; 24+ messages in thread
From: Vitaly Kuznetsov @ 2017-12-12  8:17 UTC (permalink / raw)
  To: Roman Kagan
  Cc: Stephen Hemminger, kvm, Radim Krčmář,
	Haiyang Zhang, x86, linux-kernel, devel, Michael Kelley (EOSG),
	Ingo Molnar, Andy Lutomirski, H. Peter Anvin, Paolo Bonzini,
	Thomas Gleixner, Mohammed Gamal

Vitaly Kuznetsov <vkuznets@redhat.com> writes:

> Roman Kagan <rkagan@virtuozzo.com> writes:
>
>> On Fri, Dec 08, 2017 at 11:50:00AM +0100, Vitaly Kuznetsov wrote:
>>> When we run nested KVM on Hyper-V guests we need to update masterclocks for
>>> all guests when L1 migrates to a host with different TSC frequency.
>>> Implement the procedure in the following way:
>>> - Pause all guests.
>>> - Tell our host (Hyper-V) to stop emulating TSC accesses.
>>> - Update our gtod copy, recompute clocks.
>>> - Unpause all guests.
>>> 
>>> This is somewhat similar to cpufreq but we have two important differences:
>>> we can only disable TSC emulation globally (on all CPUs) and we don't know
>>> the new TSC frequency until we turn the emulation off so we can't
>>> 'prepare' ourselves to the event.
>>> 
>>> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
>>> ---
>>>  arch/x86/kvm/x86.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>>>  1 file changed, 45 insertions(+)
>>> 
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index 96e04a0cb921..04d90712ffd2 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -68,6 +68,7 @@
>>>  #include <asm/div64.h>
>>>  #include <asm/irq_remapping.h>
>>>  #include <asm/mshyperv.h>
>>> +#include <asm/hypervisor.h>
>>>  
>>>  #define CREATE_TRACE_POINTS
>>>  #include "trace.h"
>>> @@ -5946,6 +5947,43 @@ static void tsc_khz_changed(void *data)
>>>  	__this_cpu_write(cpu_tsc_khz, khz);
>>>  }
>>>  
>>> +void kvm_hyperv_tsc_notifier(void)
>>> +{
>>> +#ifdef CONFIG_X86_64
>>> +	struct kvm *kvm;
>>> +	struct kvm_vcpu *vcpu;
>>> +	int cpu;
>>> +
>>> +	spin_lock(&kvm_lock);
>>> +	list_for_each_entry(kvm, &vm_list, vm_list)
>>> +		kvm_make_mclock_inprogress_request(kvm);
>>> +
>>> +	hyperv_stop_tsc_emulation();
>>> +
>>> +	/* TSC frequency always matches when on Hyper-V */
>>> +	for_each_present_cpu(cpu)
>>> +		per_cpu(cpu_tsc_khz, cpu) = tsc_khz;
>>> +	kvm_max_guest_tsc_khz = tsc_khz;
>>> +
>>> +	list_for_each_entry(kvm, &vm_list, vm_list) {
>>> +		struct kvm_arch *ka = &kvm->arch;
>>> +
>>> +		spin_lock(&ka->pvclock_gtod_sync_lock);
>>> +
>>> +		pvclock_update_vm_gtod_copy(kvm);
>>> +
>>> +		kvm_for_each_vcpu(cpu, vcpu, kvm)
>>> +			kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
>>> +
>>> +		kvm_for_each_vcpu(cpu, vcpu, kvm)
>>> +			kvm_clear_request(KVM_REQ_MCLOCK_INPROGRESS, vcpu);
>>> +
>>> +		spin_unlock(&ka->pvclock_gtod_sync_lock);
>>> +	}
>>> +	spin_unlock(&kvm_lock);
>>
>> Can't you skip all this if the tsc frequency hasn't changed (which
>> should probably be the case when the CPU supports tsc frequency
>> scaling)?
>>
>
> The thing is that we don't know if it changed or not: only after
> disabling TSC emulation we'll be able to read the new one from the host
> and we need to do this with all VMs paused.

(having second thoughts here)

While we don't know if TSC frequency has changed or not, we can check
the emulation status before calling the callback and if TSC accesses are
not emulated omit the call. However, it seems that Hyper-V host (as of
WS2016) turns on emulation regardless of the TSC scaling presence.

I'll add emulation status check before issuing the callback in v2. The
change will go to PATCH3.

-- 
  Vitaly

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [PATCH 3/6] x86/hyper-v: reenlightenment notifications support
  2017-12-08 10:49 ` [PATCH 3/6] x86/hyper-v: reenlightenment notifications support Vitaly Kuznetsov
  2017-12-08 18:10   ` Roman Kagan
@ 2017-12-13  0:50   ` Michael Kelley (EOSG)
  2017-12-13 10:03       ` Vitaly Kuznetsov
  1 sibling, 1 reply; 24+ messages in thread
From: Michael Kelley (EOSG) @ 2017-12-13  0:50 UTC (permalink / raw)
  To: Vitaly Kuznetsov, kvm, x86
  Cc: Paolo Bonzini, Radim Krčmář,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, KY Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Andy Lutomirski,
	Mohammed Gamal, Cathy Avery, linux-kernel, devel

On Fri, Dec 08, 2017 at 11:49:57AM +0100, Vitaly Kuznetsov wrote:

> -----Original Message-----
> From: Vitaly Kuznetsov [mailto:vkuznets@redhat.com]
> Sent: Friday, December 8, 2017 2:50 AM
> To: kvm@vger.kernel.org; x86@kernel.org
> Cc: Paolo Bonzini <pbonzini@redhat.com>; Radim Krčmář <rkrcmar@redhat.com>; Thomas
> Gleixner <tglx@linutronix.de>; Ingo Molnar <mingo@redhat.com>; H. Peter Anvin
> <hpa@zytor.com>; KY Srinivasan <kys@microsoft.com>; Haiyang Zhang
> <haiyangz@microsoft.com>; Stephen Hemminger <sthemmin@microsoft.com>; Michael
> Kelley (EOSG) <Michael.H.Kelley@microsoft.com>; Andy Lutomirski <luto@kernel.org>;
> Mohammed Gamal <mmorsy@redhat.com>; Cathy Avery <cavery@redhat.com>; linux-
> kernel@vger.kernel.org; devel@linuxdriverproject.org
> Subject: [PATCH 3/6] x86/hyper-v: reenlightenment notifications support
> 
> Hyper-V supports Live Migration notification. This is supposed to be used in conjunction with
> TSC emulation: when we are migrated to a host with different TSC frequency for some short
> period host emulates our accesses to TSC and sends us an interrupt to notify about the event.
> When we're done updating everything we can disable TSC emulation and everything will start
> working fast again.
> 
> We didn't need these notifications before as Hyper-V guests are not supposed to use TSC as a
> clocksource: in Linux we even mark it as unstable on boot. Guests normally use 'tsc page'
> clocksouce and host updates its values on migrations automatically.
> 
> Things change when we want to run nested virtualization: even when we pass through PV
> clocksources (kvm-clock or tsc page) to our guests we need to know TSC frequency and when it
> changes.
> 
> Hyper-V Top Level Functional Specification (as of v5.0b) wrongly specifies
> EAX:BIT(12) of CPUID:0x40000009 as the feature identification bit. The right one to check is
> EAX:BIT(13) of CPUID:0x40000003. I was assured that the fix in on the way.
> 
> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>

[snip]

> diff --git a/arch/x86/include/asm/irq_vectors.h
> b/arch/x86/include/asm/irq_vectors.h
> index 67421f649cfa..e71c1120426b 100644
> --- a/arch/x86/include/asm/irq_vectors.h
> +++ b/arch/x86/include/asm/irq_vectors.h
> @@ -103,7 +103,12 @@
>  #endif
> 
>  #define MANAGED_IRQ_SHUTDOWN_VECTOR	0xef
> -#define LOCAL_TIMER_VECTOR		0xee
> +
> +#if IS_ENABLED(CONFIG_HYPERV)
> +#define HYPERV_REENLIGHTENMENT_VECTOR	0xee
> +#endif
> +
> +#define LOCAL_TIMER_VECTOR		0xed
> 
>  #define NR_VECTORS			 256

[snip]

Since you are pre-allocating a new vector, would you want to update the irq_cpustat_t
data structure and your interrupt handler to count the occurrences of these interrupts,
and update arch_show_interrupts() to show the count?  Then cat /proc/interrupts
will show the count.   The reenlightenment interrupts will presumably be rare, but so
are some of the others that are already counted and displayed, and it seems like
consistency should be maintained.

Michael

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/6] x86/hyper-v: reenlightenment notifications support
  2017-12-13  0:50   ` Michael Kelley (EOSG)
@ 2017-12-13 10:03       ` Vitaly Kuznetsov
  0 siblings, 0 replies; 24+ messages in thread
From: Vitaly Kuznetsov @ 2017-12-13 10:03 UTC (permalink / raw)
  To: Michael Kelley (EOSG)
  Cc: kvm, x86, Paolo Bonzini, Radim Krčmář,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, KY Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Andy Lutomirski,
	Mohammed Gamal, Cathy Avery, linux-kernel, devel

"Michael Kelley (EOSG)" <Michael.H.Kelley@microsoft.com> writes:

> On Fri, Dec 08, 2017 at 11:49:57AM +0100, Vitaly Kuznetsov wrote:
>
>> -----Original Message-----
>> From: Vitaly Kuznetsov [mailto:vkuznets@redhat.com]
>> Sent: Friday, December 8, 2017 2:50 AM
>> To: kvm@vger.kernel.org; x86@kernel.org
>> Cc: Paolo Bonzini <pbonzini@redhat.com>; Radim Krčmář <rkrcmar@redhat.com>; Thomas
>> Gleixner <tglx@linutronix.de>; Ingo Molnar <mingo@redhat.com>; H. Peter Anvin
>> <hpa@zytor.com>; KY Srinivasan <kys@microsoft.com>; Haiyang Zhang
>> <haiyangz@microsoft.com>; Stephen Hemminger <sthemmin@microsoft.com>; Michael
>> Kelley (EOSG) <Michael.H.Kelley@microsoft.com>; Andy Lutomirski <luto@kernel.org>;
>> Mohammed Gamal <mmorsy@redhat.com>; Cathy Avery <cavery@redhat.com>; linux-
>> kernel@vger.kernel.org; devel@linuxdriverproject.org
>> Subject: [PATCH 3/6] x86/hyper-v: reenlightenment notifications support
>> 
>> Hyper-V supports Live Migration notification. This is supposed to be used in conjunction with
>> TSC emulation: when we are migrated to a host with different TSC frequency for some short
>> period host emulates our accesses to TSC and sends us an interrupt to notify about the event.
>> When we're done updating everything we can disable TSC emulation and everything will start
>> working fast again.
>> 
>> We didn't need these notifications before as Hyper-V guests are not supposed to use TSC as a
>> clocksource: in Linux we even mark it as unstable on boot. Guests normally use 'tsc page'
>> clocksouce and host updates its values on migrations automatically.
>> 
>> Things change when we want to run nested virtualization: even when we pass through PV
>> clocksources (kvm-clock or tsc page) to our guests we need to know TSC frequency and when it
>> changes.
>> 
>> Hyper-V Top Level Functional Specification (as of v5.0b) wrongly specifies
>> EAX:BIT(12) of CPUID:0x40000009 as the feature identification bit. The right one to check is
>> EAX:BIT(13) of CPUID:0x40000003. I was assured that the fix in on the way.
>> 
>> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
>
> [snip]
>
>> diff --git a/arch/x86/include/asm/irq_vectors.h
>> b/arch/x86/include/asm/irq_vectors.h
>> index 67421f649cfa..e71c1120426b 100644
>> --- a/arch/x86/include/asm/irq_vectors.h
>> +++ b/arch/x86/include/asm/irq_vectors.h
>> @@ -103,7 +103,12 @@
>>  #endif
>> 
>>  #define MANAGED_IRQ_SHUTDOWN_VECTOR	0xef
>> -#define LOCAL_TIMER_VECTOR		0xee
>> +
>> +#if IS_ENABLED(CONFIG_HYPERV)
>> +#define HYPERV_REENLIGHTENMENT_VECTOR	0xee
>> +#endif
>> +
>> +#define LOCAL_TIMER_VECTOR		0xed
>> 
>>  #define NR_VECTORS			 256
>
> [snip]
>
> Since you are pre-allocating a new vector, would you want to update the irq_cpustat_t
> data structure and your interrupt handler to count the occurrences of these interrupts,
> and update arch_show_interrupts() to show the count?  Then cat /proc/interrupts
> will show the count.   The reenlightenment interrupts will presumably be rare, but so
> are some of the others that are already counted and displayed, and it seems like
> consistency should be maintained.

I could do that.

The problem with adding this entry to /proc/interrupts, as I see it,
is that everyone who has CONFIG_HYPERV enabled in their distro will see
one additional line which 99,9% of the time will be just '0'. I,
however, see some value in seeing that L1 migration occured.

I'll add a separate patch to the series and let x86 maintainers decide
if it's worthy. Thanks,

-- 
  Vitaly

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/6] x86/hyper-v: reenlightenment notifications support
@ 2017-12-13 10:03       ` Vitaly Kuznetsov
  0 siblings, 0 replies; 24+ messages in thread
From: Vitaly Kuznetsov @ 2017-12-13 10:03 UTC (permalink / raw)
  To: Michael Kelley (EOSG)
  Cc: Stephen Hemminger, kvm, Radim Krčmář,
	Haiyang Zhang, x86, linux-kernel, devel, Ingo Molnar,
	Andy Lutomirski, H. Peter Anvin, Paolo Bonzini, Thomas Gleixner,
	Mohammed Gamal

"Michael Kelley (EOSG)" <Michael.H.Kelley@microsoft.com> writes:

> On Fri, Dec 08, 2017 at 11:49:57AM +0100, Vitaly Kuznetsov wrote:
>
>> -----Original Message-----
>> From: Vitaly Kuznetsov [mailto:vkuznets@redhat.com]
>> Sent: Friday, December 8, 2017 2:50 AM
>> To: kvm@vger.kernel.org; x86@kernel.org
>> Cc: Paolo Bonzini <pbonzini@redhat.com>; Radim Krčmář <rkrcmar@redhat.com>; Thomas
>> Gleixner <tglx@linutronix.de>; Ingo Molnar <mingo@redhat.com>; H. Peter Anvin
>> <hpa@zytor.com>; KY Srinivasan <kys@microsoft.com>; Haiyang Zhang
>> <haiyangz@microsoft.com>; Stephen Hemminger <sthemmin@microsoft.com>; Michael
>> Kelley (EOSG) <Michael.H.Kelley@microsoft.com>; Andy Lutomirski <luto@kernel.org>;
>> Mohammed Gamal <mmorsy@redhat.com>; Cathy Avery <cavery@redhat.com>; linux-
>> kernel@vger.kernel.org; devel@linuxdriverproject.org
>> Subject: [PATCH 3/6] x86/hyper-v: reenlightenment notifications support
>> 
>> Hyper-V supports Live Migration notification. This is supposed to be used in conjunction with
>> TSC emulation: when we are migrated to a host with different TSC frequency for some short
>> period host emulates our accesses to TSC and sends us an interrupt to notify about the event.
>> When we're done updating everything we can disable TSC emulation and everything will start
>> working fast again.
>> 
>> We didn't need these notifications before as Hyper-V guests are not supposed to use TSC as a
>> clocksource: in Linux we even mark it as unstable on boot. Guests normally use 'tsc page'
>> clocksouce and host updates its values on migrations automatically.
>> 
>> Things change when we want to run nested virtualization: even when we pass through PV
>> clocksources (kvm-clock or tsc page) to our guests we need to know TSC frequency and when it
>> changes.
>> 
>> Hyper-V Top Level Functional Specification (as of v5.0b) wrongly specifies
>> EAX:BIT(12) of CPUID:0x40000009 as the feature identification bit. The right one to check is
>> EAX:BIT(13) of CPUID:0x40000003. I was assured that the fix in on the way.
>> 
>> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
>
> [snip]
>
>> diff --git a/arch/x86/include/asm/irq_vectors.h
>> b/arch/x86/include/asm/irq_vectors.h
>> index 67421f649cfa..e71c1120426b 100644
>> --- a/arch/x86/include/asm/irq_vectors.h
>> +++ b/arch/x86/include/asm/irq_vectors.h
>> @@ -103,7 +103,12 @@
>>  #endif
>> 
>>  #define MANAGED_IRQ_SHUTDOWN_VECTOR	0xef
>> -#define LOCAL_TIMER_VECTOR		0xee
>> +
>> +#if IS_ENABLED(CONFIG_HYPERV)
>> +#define HYPERV_REENLIGHTENMENT_VECTOR	0xee
>> +#endif
>> +
>> +#define LOCAL_TIMER_VECTOR		0xed
>> 
>>  #define NR_VECTORS			 256
>
> [snip]
>
> Since you are pre-allocating a new vector, would you want to update the irq_cpustat_t
> data structure and your interrupt handler to count the occurrences of these interrupts,
> and update arch_show_interrupts() to show the count?  Then cat /proc/interrupts
> will show the count.   The reenlightenment interrupts will presumably be rare, but so
> are some of the others that are already counted and displayed, and it seems like
> consistency should be maintained.

I could do that.

The problem with adding this entry to /proc/interrupts, as I see it,
is that everyone who has CONFIG_HYPERV enabled in their distro will see
one additional line which 99,9% of the time will be just '0'. I,
however, see some value in seeing that L1 migration occured.

I'll add a separate patch to the series and let x86 maintainers decide
if it's worthy. Thanks,

-- 
  Vitaly
_______________________________________________
devel mailing list
devel@linuxdriverproject.org
http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2017-12-13 10:03 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-08 10:49 [PATCH 0/6] x86/kvm/hyperv: stable clocksorce for L2 guests when running nested KVM on Hyper-V Vitaly Kuznetsov
2017-12-08 10:49 ` Vitaly Kuznetsov
2017-12-08 10:49 ` [PATCH 1/6] x86/hyper-v: check for required priviliges in hyperv_init() Vitaly Kuznetsov
2017-12-08 10:49 ` [PATCH 2/6] x86/hyper-v: add a function to read both TSC and TSC page value simulateneously Vitaly Kuznetsov
2017-12-08 10:49   ` Vitaly Kuznetsov
2017-12-08 10:49 ` [PATCH 3/6] x86/hyper-v: reenlightenment notifications support Vitaly Kuznetsov
2017-12-08 18:10   ` Roman Kagan
2017-12-11  9:56     ` Vitaly Kuznetsov
2017-12-11 17:10       ` Roman Kagan
2017-12-12  8:04         ` Vitaly Kuznetsov
2017-12-12  8:04           ` Vitaly Kuznetsov
2017-12-13  0:50   ` Michael Kelley (EOSG)
2017-12-13 10:03     ` Vitaly Kuznetsov
2017-12-13 10:03       ` Vitaly Kuznetsov
2017-12-08 10:49 ` [PATCH 4/6] x86/hyper-v: redirect reenlightment notifications on CPU offlining Vitaly Kuznetsov
2017-12-08 10:49   ` Vitaly Kuznetsov
2017-12-08 10:49 ` [PATCH 5/6] x86/kvm: pass stable clocksource to guests when running nested on Hyper-V Vitaly Kuznetsov
2017-12-08 10:49   ` Vitaly Kuznetsov
2017-12-08 10:50 ` [PATCH 6/6] x86/kvm: support Hyper-V reenlightenment Vitaly Kuznetsov
2017-12-08 17:39   ` Roman Kagan
2017-12-11  9:57     ` Vitaly Kuznetsov
2017-12-11  9:57       ` Vitaly Kuznetsov
2017-12-12  8:17       ` Vitaly Kuznetsov
2017-12-12  8:17         ` Vitaly Kuznetsov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.