* [PATCH v2 0/3] x86/vdso: Add Hyper-V TSC page clocksource support
@ 2017-02-14 12:44 ` Vitaly Kuznetsov
  0 siblings, 0 replies; 28+ messages in thread
From: Vitaly Kuznetsov @ 2017-02-14 12:44 UTC (permalink / raw)
  To: x86, Andy Lutomirski
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Dexuan Cui, linux-kernel,
	devel, virtualization

Hi,

while we're still waiting for a definitive ACK from Microsoft that the
algorithm is good for the SMP case (we can't prevent the code in the vDSO
from migrating between CPUs), I'd like to send v2 with some modifications
to keep the discussion going.

Changes since v1:
- Document the TSC page reading protocol [Thomas Gleixner].

- Separate the TSC page reading code from read_hv_clock_tsc() and put it
  into asm/mshyperv.h so it can be used from both hv_init.c and the vDSO.

- Add explicit barriers [Thomas Gleixner].

Original description:

The Hyper-V TSC page clocksource is suitable for the vDSO; however, the
protocol defined by the hypervisor differs from VCLOCK_PVCLOCK. Implement
the required support. A simple sysbench test shows the following results:

Before:
# time sysbench --test=memory --max-requests=500000 run
...
real    1m22.618s
user    0m50.193s
sys     0m32.268s

After:
# time sysbench --test=memory --max-requests=500000 run
...
real	0m47.241s
user	0m47.117s
sys	0m0.008s

Patches 1 and 2 are made on top of K. Y.'s code refactoring which moved the
TSC page clocksource to arch/x86/hyperv/hv_init.c; this refactoring is
currently present in Greg's char-misc-next tree.

Vitaly Kuznetsov (3):
  x86/hyperv: implement hv_get_tsc_page()
  x86/hyperv: move TSC reading method to asm/mshyperv.h
  x86/vdso: Add VCLOCK_HVCLOCK vDSO clock read method

 arch/x86/entry/vdso/vclock_gettime.c  | 24 +++++++++++++++
 arch/x86/entry/vdso/vdso-layout.lds.S |  3 +-
 arch/x86/entry/vdso/vdso2c.c          |  3 ++
 arch/x86/entry/vdso/vma.c             |  7 +++++
 arch/x86/hyperv/hv_init.c             | 48 +++++++++--------------------
 arch/x86/include/asm/clocksource.h    |  3 +-
 arch/x86/include/asm/mshyperv.h       | 58 +++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/vdso.h           |  1 +
 drivers/hv/Kconfig                    |  3 ++
 9 files changed, 114 insertions(+), 36 deletions(-)

-- 
2.9.3

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v2 1/3] x86/hyperv: implement hv_get_tsc_page()
  2017-02-14 12:44 ` Vitaly Kuznetsov
@ 2017-02-14 12:44   ` Vitaly Kuznetsov
  -1 siblings, 0 replies; 28+ messages in thread
From: Vitaly Kuznetsov @ 2017-02-14 12:44 UTC (permalink / raw)
  To: x86, Andy Lutomirski
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Dexuan Cui, linux-kernel,
	devel, virtualization

To use the Hyper-V TSC page clocksource from the vDSO we need to make
tsc_pg available. Implement hv_get_tsc_page() and add CONFIG_HYPERV_TSCPAGE
to keep the #ifdefs simple.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 arch/x86/hyperv/hv_init.c       | 9 +++++++--
 arch/x86/include/asm/mshyperv.h | 8 ++++++++
 drivers/hv/Kconfig              | 3 +++
 3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index b371d0e..0ce8485 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -27,10 +27,15 @@
 #include <linux/clockchips.h>
 
 
-#ifdef CONFIG_X86_64
+#ifdef CONFIG_HYPERV_TSCPAGE
 
 static struct ms_hyperv_tsc_page *tsc_pg;
 
+struct ms_hyperv_tsc_page *hv_get_tsc_page(void)
+{
+	return tsc_pg;
+}
+
 static u64 read_hv_clock_tsc(struct clocksource *arg)
 {
 	u64 current_tick;
@@ -136,7 +141,7 @@ void hyperv_init(void)
 	/*
 	 * Register Hyper-V specific clocksource.
 	 */
-#ifdef CONFIG_X86_64
+#ifdef CONFIG_HYPERV_TSCPAGE
 	if (ms_hyperv.features & HV_X64_MSR_REFERENCE_TSC_AVAILABLE) {
 		union hv_x64_msr_hypercall_contents tsc_msr;
 
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index f8dc370..14dd92c 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -173,4 +173,12 @@ void hyperv_report_panic(struct pt_regs *regs);
 bool hv_is_hypercall_page_setup(void);
 void hyperv_cleanup(void);
 #endif
+#ifdef CONFIG_HYPERV_TSCPAGE
+struct ms_hyperv_tsc_page *hv_get_tsc_page(void);
+#else
+static inline struct ms_hyperv_tsc_page *hv_get_tsc_page(void)
+{
+	return NULL;
+}
+#endif
 #endif
diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
index 0403b51..c29cd53 100644
--- a/drivers/hv/Kconfig
+++ b/drivers/hv/Kconfig
@@ -7,6 +7,9 @@ config HYPERV
 	  Select this option to run Linux as a Hyper-V client operating
 	  system.
 
+config HYPERV_TSCPAGE
+       def_bool HYPERV && X86_64
+
 config HYPERV_UTILS
 	tristate "Microsoft Hyper-V Utilities driver"
 	depends on HYPERV && CONNECTOR && NLS
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 2/3] x86/hyperv: move TSC reading method to asm/mshyperv.h
  2017-02-14 12:44 ` Vitaly Kuznetsov
@ 2017-02-14 12:44   ` Vitaly Kuznetsov
  -1 siblings, 0 replies; 28+ messages in thread
From: Vitaly Kuznetsov @ 2017-02-14 12:44 UTC (permalink / raw)
  To: x86, Andy Lutomirski
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Dexuan Cui, linux-kernel,
	devel, virtualization

As a preparation for making the Hyper-V TSC page suitable for the vDSO,
move the TSC page reading logic to asm/mshyperv.h. While at it, do the
following:
- Document the reading algorithm.
- Simplify the code a bit.
- Add explicit barriers to prevent re-ordering (we need to read the
  sequence strictly before and after the other reads).
- Use mul_u64_u64_shr() instead of inline assembly. I checked and on
  x86_64 gcc generates a single 'mul' instruction.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 arch/x86/hyperv/hv_init.c       | 36 ++++-------------------------
 arch/x86/include/asm/mshyperv.h | 50 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 54 insertions(+), 32 deletions(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 0ce8485..77fec0c 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -38,39 +38,11 @@ struct ms_hyperv_tsc_page *hv_get_tsc_page(void)
 
 static u64 read_hv_clock_tsc(struct clocksource *arg)
 {
-	u64 current_tick;
+	u64 current_tick = hv_read_tsc_page(tsc_pg);
+
+	if (current_tick == U64_MAX)
+		rdmsrl(HV_X64_MSR_TIME_REF_COUNT, current_tick);
 
-	if (tsc_pg->tsc_sequence != 0) {
-		/*
-		 * Use the tsc page to compute the value.
-		 */
-
-		while (1) {
-			u64 tmp;
-			u32 sequence = tsc_pg->tsc_sequence;
-			u64 cur_tsc;
-			u64 scale = tsc_pg->tsc_scale;
-			s64 offset = tsc_pg->tsc_offset;
-
-			rdtscll(cur_tsc);
-			/* current_tick = ((cur_tsc *scale) >> 64) + offset */
-			asm("mulq %3"
-				: "=d" (current_tick), "=a" (tmp)
-				: "a" (cur_tsc), "r" (scale));
-
-			current_tick += offset;
-			if (tsc_pg->tsc_sequence == sequence)
-				return current_tick;
-
-			if (tsc_pg->tsc_sequence != 0)
-				continue;
-			/*
-			 * Fallback using MSR method.
-			 */
-			break;
-		}
-	}
-	rdmsrl(HV_X64_MSR_TIME_REF_COUNT, current_tick);
 	return current_tick;
 }
 
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index 14dd92c..ddd071c 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -175,6 +175,56 @@ void hyperv_cleanup(void);
 #endif
 #ifdef CONFIG_HYPERV_TSCPAGE
 struct ms_hyperv_tsc_page *hv_get_tsc_page(void);
+static inline u64 hv_read_tsc_page(const struct ms_hyperv_tsc_page *tsc_pg)
+{
+	u64 scale, offset, current_tick, cur_tsc;
+	u32 sequence;
+
+	/*
+	 * The protocol for reading Hyper-V TSC page is specified in Hypervisor
+	 * Top-Level Functional Specification ver. 3.0 and above. To get the
+	 * reference time we must do the following:
+	 * - READ ReferenceTscSequence
+	 *   A special '0' value indicates the time source is unreliable and we
+	 *   need to use something else. The currently published specification
+	 *   versions (up to 4.0b) contain a mistake and wrongly claim '-1'
+	 *   instead of '0' as the special value, see commit c35b82ef0294.
+	 * - ReferenceTime =
+	 *        ((RDTSC() * ReferenceTscScale) >> 64) + ReferenceTscOffset
+	 * - READ ReferenceTscSequence again. In case its value has changed
+	 *   since our first reading we need to discard ReferenceTime and repeat
+	 *   the whole sequence as the hypervisor was updating the page in
+	 *   between.
+	 */
+	while (1) {
+		sequence = tsc_pg->tsc_sequence;
+		if (!sequence)
+			break;
+		/*
+		 * Make sure we read sequence before we read other values from
+		 * TSC page.
+		 */
+		virt_rmb();
+
+		scale = tsc_pg->tsc_scale;
+		offset = tsc_pg->tsc_offset;
+		rdtscll(cur_tsc);
+
+		current_tick = mul_u64_u64_shr(cur_tsc, scale, 64) + offset;
+
+		/*
+		 * Make sure we read sequence after we read all other values
+		 * from TSC page.
+		 */
+		virt_rmb();
+
+		if (tsc_pg->tsc_sequence == sequence)
+			return current_tick;
+	}
+
+	return U64_MAX;
+}
+
 #else
 static inline struct ms_hyperv_tsc_page *hv_get_tsc_page(void)
 {
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH v2 3/3] x86/vdso: Add VCLOCK_HVCLOCK vDSO clock read method
  2017-02-14 12:44 ` Vitaly Kuznetsov
@ 2017-02-14 12:44 ` Vitaly Kuznetsov
  -1 siblings, 0 replies; 28+ messages in thread
From: Vitaly Kuznetsov @ 2017-02-14 12:44 UTC (permalink / raw)
  To: x86, Andy Lutomirski
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, K. Y. Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Dexuan Cui, linux-kernel,
	devel, virtualization

The Hyper-V TSC page clocksource is suitable for the vDSO; however, the
protocol defined by the hypervisor differs from VCLOCK_PVCLOCK. Implement
the required support by adding the hvclock_page VVAR.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 arch/x86/entry/vdso/vclock_gettime.c  | 24 ++++++++++++++++++++++++
 arch/x86/entry/vdso/vdso-layout.lds.S |  3 ++-
 arch/x86/entry/vdso/vdso2c.c          |  3 +++
 arch/x86/entry/vdso/vma.c             |  7 +++++++
 arch/x86/hyperv/hv_init.c             |  3 +++
 arch/x86/include/asm/clocksource.h    |  3 ++-
 arch/x86/include/asm/vdso.h           |  1 +
 7 files changed, 42 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
index 9d4d6e1..fa8dbfc 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -17,6 +17,7 @@
 #include <asm/unistd.h>
 #include <asm/msr.h>
 #include <asm/pvclock.h>
+#include <asm/mshyperv.h>
 #include <linux/math64.h>
 #include <linux/time.h>
 #include <linux/kernel.h>
@@ -32,6 +33,11 @@ extern u8 pvclock_page
 	__attribute__((visibility("hidden")));
 #endif
 
+#ifdef CONFIG_HYPERV_TSCPAGE
+extern u8 hvclock_page
+	__attribute__((visibility("hidden")));
+#endif
+
 #ifndef BUILD_VDSO32
 
 notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
@@ -141,6 +147,20 @@ static notrace u64 vread_pvclock(int *mode)
 	return last;
 }
 #endif
+#ifdef CONFIG_HYPERV_TSCPAGE
+static notrace u64 vread_hvclock(int *mode)
+{
+	const struct ms_hyperv_tsc_page *tsc_pg =
+		(const struct ms_hyperv_tsc_page *)&hvclock_page;
+	u64 current_tick = hv_read_tsc_page(tsc_pg);
+
+	if (current_tick != U64_MAX)
+		return current_tick;
+
+	*mode = VCLOCK_NONE;
+	return 0;
+}
+#endif
 
 notrace static u64 vread_tsc(void)
 {
@@ -173,6 +193,10 @@ notrace static inline u64 vgetsns(int *mode)
 	else if (gtod->vclock_mode == VCLOCK_PVCLOCK)
 		cycles = vread_pvclock(mode);
 #endif
+#ifdef CONFIG_HYPERV_TSCPAGE
+	else if (gtod->vclock_mode == VCLOCK_HVCLOCK)
+		cycles = vread_hvclock(mode);
+#endif
 	else
 		return 0;
 	v = (cycles - gtod->cycle_last) & gtod->mask;
diff --git a/arch/x86/entry/vdso/vdso-layout.lds.S b/arch/x86/entry/vdso/vdso-layout.lds.S
index a708aa9..8ebb4b6 100644
--- a/arch/x86/entry/vdso/vdso-layout.lds.S
+++ b/arch/x86/entry/vdso/vdso-layout.lds.S
@@ -25,7 +25,7 @@ SECTIONS
 	 * segment.
 	 */
 
-	vvar_start = . - 2 * PAGE_SIZE;
+	vvar_start = . - 3 * PAGE_SIZE;
 	vvar_page = vvar_start;
 
 	/* Place all vvars at the offsets in asm/vvar.h. */
@@ -36,6 +36,7 @@ SECTIONS
 #undef EMIT_VVAR
 
 	pvclock_page = vvar_start + PAGE_SIZE;
+	hvclock_page = vvar_start + 2 * PAGE_SIZE;
 
 	. = SIZEOF_HEADERS;
 
diff --git a/arch/x86/entry/vdso/vdso2c.c b/arch/x86/entry/vdso/vdso2c.c
index 491020b..0780a44 100644
--- a/arch/x86/entry/vdso/vdso2c.c
+++ b/arch/x86/entry/vdso/vdso2c.c
@@ -74,6 +74,7 @@ enum {
 	sym_vvar_page,
 	sym_hpet_page,
 	sym_pvclock_page,
+	sym_hvclock_page,
 	sym_VDSO_FAKE_SECTION_TABLE_START,
 	sym_VDSO_FAKE_SECTION_TABLE_END,
 };
@@ -82,6 +83,7 @@ const int special_pages[] = {
 	sym_vvar_page,
 	sym_hpet_page,
 	sym_pvclock_page,
+	sym_hvclock_page,
 };
 
 struct vdso_sym {
@@ -94,6 +96,7 @@ struct vdso_sym required_syms[] = {
 	[sym_vvar_page] = {"vvar_page", true},
 	[sym_hpet_page] = {"hpet_page", true},
 	[sym_pvclock_page] = {"pvclock_page", true},
+	[sym_hvclock_page] = {"hvclock_page", true},
 	[sym_VDSO_FAKE_SECTION_TABLE_START] = {
 		"VDSO_FAKE_SECTION_TABLE_START", false
 	},
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index 10820f6..b256a3b 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -21,6 +21,7 @@
 #include <asm/page.h>
 #include <asm/desc.h>
 #include <asm/cpufeature.h>
+#include <asm/mshyperv.h>
 
 #if defined(CONFIG_X86_64)
 unsigned int __read_mostly vdso64_enabled = 1;
@@ -120,6 +121,12 @@ static int vvar_fault(const struct vm_special_mapping *sm,
 				vmf->address,
 				__pa(pvti) >> PAGE_SHIFT);
 		}
+	} else if (sym_offset == image->sym_hvclock_page) {
+		struct ms_hyperv_tsc_page *tsc_pg = hv_get_tsc_page();
+
+		if (tsc_pg && vclock_was_used(VCLOCK_HVCLOCK))
+			ret = vm_insert_pfn(vma, vmf->address,
+					    vmalloc_to_pfn(tsc_pg));
 	}
 
 	if (ret == 0 || ret == -EBUSY)
diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 77fec0c..870f0eb 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -129,6 +129,9 @@ void hyperv_init(void)
 		tsc_msr.guest_physical_address = vmalloc_to_pfn(tsc_pg);
 
 		wrmsrl(HV_X64_MSR_REFERENCE_TSC, tsc_msr.as_uint64);
+
+		hyperv_cs_tsc.archdata.vclock_mode = VCLOCK_HVCLOCK;
+
 		clocksource_register_hz(&hyperv_cs_tsc, NSEC_PER_SEC/100);
 		return;
 	}
diff --git a/arch/x86/include/asm/clocksource.h b/arch/x86/include/asm/clocksource.h
index eae33c7..47bea8c 100644
--- a/arch/x86/include/asm/clocksource.h
+++ b/arch/x86/include/asm/clocksource.h
@@ -6,7 +6,8 @@
 #define VCLOCK_NONE	0	/* No vDSO clock available.		*/
 #define VCLOCK_TSC	1	/* vDSO should use vread_tsc.		*/
 #define VCLOCK_PVCLOCK	2	/* vDSO should use vread_pvclock.	*/
-#define VCLOCK_MAX	2
+#define VCLOCK_HVCLOCK	3	/* vDSO should use vread_hvclock.	*/
+#define VCLOCK_MAX	3
 
 struct arch_clocksource_data {
 	int vclock_mode;
diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
index 2444189..bccdf49 100644
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -20,6 +20,7 @@ struct vdso_image {
 	long sym_vvar_page;
 	long sym_hpet_page;
 	long sym_pvclock_page;
+	long sym_hvclock_page;
 	long sym_VDSO32_NOTE_MASK;
 	long sym___kernel_sigreturn;
 	long sym___kernel_rt_sigreturn;
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* RE: [PATCH v2 0/3] x86/vdso: Add Hyper-V TSC page clocksource support
  2017-02-14 12:44 ` Vitaly Kuznetsov
@ 2017-02-14 14:46   ` KY Srinivasan via Virtualization
  -1 siblings, 0 replies; 28+ messages in thread
From: KY Srinivasan @ 2017-02-14 14:46 UTC (permalink / raw)
  To: Vitaly Kuznetsov, x86, Andy Lutomirski
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Haiyang Zhang,
	Stephen Hemminger, Dexuan Cui, linux-kernel, devel,
	virtualization



> -----Original Message-----
> From: Vitaly Kuznetsov [mailto:vkuznets@redhat.com]
> Sent: Tuesday, February 14, 2017 4:44 AM
> To: x86@kernel.org; Andy Lutomirski <luto@amacapital.net>
> Cc: Thomas Gleixner <tglx@linutronix.de>; Ingo Molnar <mingo@redhat.com>;
> H. Peter Anvin <hpa@zytor.com>; KY Srinivasan <kys@microsoft.com>;
> Haiyang Zhang <haiyangz@microsoft.com>; Stephen Hemminger
> <sthemmin@microsoft.com>; Dexuan Cui <decui@microsoft.com>; linux-
> kernel@vger.kernel.org; devel@linuxdriverproject.org;
> virtualization@lists.linux-foundation.org
> Subject: [PATCH v2 0/3] x86/vdso: Add Hyper-V TSC page clocksource support
> 
> Hi,
> 
> while we're still waiting for a definitive ACK from Microsoft that the algorithm
> is good for SMP case (as we can't prevent the code in vdso from migrating
> between CPUs) I'd like to send v2 with some modifications to keep the
> discussion going.

I checked with the folks on the Hyper-V side and they have confirmed that we need to
add memory barriers in the guest code to ensure the various reads from the TSC page are
correctly ordered - especially, the initial read of the sequence counter must have acquire
semantics. We should ensure that other reads from the TSC page are completed before the
second read of the sequence counter. I am working with the Windows team to correctly
reflect this algorithm in the Hyper-V specification.
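For reference, a sketch (not from the patches; the struct layout, names and the fallback sentinel are assumptions) of the reader loop these ordering rules imply, with GCC atomics standing in for the kernel's acquire barriers:

```c
#include <stdint.h>

/* Assumed layout of the Hyper-V reference TSC page (field names hypothetical). */
struct tsc_page {
	uint32_t tsc_sequence;
	uint32_t reserved;
	uint64_t tsc_scale;
	int64_t  tsc_offset;
};

/*
 * Reader loop implied by the ordering rules above: the initial read of the
 * sequence counter has acquire semantics, and all other TSC-page reads must
 * complete before the sequence counter is read the second time.
 */
static uint64_t tsc_page_read(const volatile struct tsc_page *pg,
			      uint64_t (*read_tsc)(void))
{
	uint32_t seq;
	uint64_t scale, tsc;
	int64_t offset;

	do {
		seq = __atomic_load_n(&pg->tsc_sequence, __ATOMIC_ACQUIRE);
		if (seq == 0)
			return UINT64_MAX;	/* page invalid: caller falls back */
		scale  = pg->tsc_scale;
		offset = pg->tsc_offset;
		tsc    = read_tsc();
		/* complete the reads above before re-checking the sequence */
		__atomic_thread_fence(__ATOMIC_ACQUIRE);
	} while (seq != pg->tsc_sequence);

	/* reference time = ((tsc * scale) >> 64) + offset */
	return (uint64_t)(((unsigned __int128)tsc * scale) >> 64) + offset;
}
```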

Regards,

K. Y
> 
> Changes since v1:
> - Document the TSC page reading protocol [Thomas Gleixner].
> 
> - Separate the TSC page reading code from read_hv_clock_tsc() and put it to
>   asm/mshyperv.h to use from both hv_init.c and vdso.
> 
> - Add explicit barriers [Thomas Gleixner]
> 
> Original description:
> 
> Hyper-V TSC page clocksource is suitable for vDSO, however, the protocol
> defined by the hypervisor is different from VCLOCK_PVCLOCK. Implemented
> the required support. Simple sysbench test shows the following results:
> 
> Before:
> # time sysbench --test=memory --max-requests=500000 run ...
> real    1m22.618s
> user    0m50.193s
> sys     0m32.268s
> 
> After:
> # time sysbench --test=memory --max-requests=500000 run ...
> real	0m47.241s
> user	0m47.117s
> sys	0m0.008s
> 
> Patches 1 and 2 are made on top of K. Y.'s code refactoring which moved tsc
> page clocksource to arch/x86/hyperv/hv_init.c, this is currently present in
> Greg's char-misc-next tree.
> 
> Vitaly Kuznetsov (3):
>   x86/hyperv: implement hv_get_tsc_page()
>   x86/hyperv: move TSC reading method to asm/mshyperv.h
>   x86/vdso: Add VCLOCK_HVCLOCK vDSO clock read method
> 
>  arch/x86/entry/vdso/vclock_gettime.c  | 24 +++++++++++++++
> arch/x86/entry/vdso/vdso-layout.lds.S |  3 +-
>  arch/x86/entry/vdso/vdso2c.c          |  3 ++
>  arch/x86/entry/vdso/vma.c             |  7 +++++
>  arch/x86/hyperv/hv_init.c             | 48 +++++++++--------------------
>  arch/x86/include/asm/clocksource.h    |  3 +-
>  arch/x86/include/asm/mshyperv.h       | 58
> +++++++++++++++++++++++++++++++++++
>  arch/x86/include/asm/vdso.h           |  1 +
>  drivers/hv/Kconfig                    |  3 ++
>  9 files changed, 114 insertions(+), 36 deletions(-)
> 
> --
> 2.9.3

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 0/3] x86/vdso: Add Hyper-V TSC page clocksource support
  2017-02-14 12:44 ` Vitaly Kuznetsov
@ 2017-02-14 14:56   ` Thomas Gleixner
  -1 siblings, 0 replies; 28+ messages in thread
From: Thomas Gleixner @ 2017-02-14 14:56 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: x86, Andy Lutomirski, Ingo Molnar, H. Peter Anvin,
	K. Y. Srinivasan, Haiyang Zhang, Stephen Hemminger, Dexuan Cui,
	linux-kernel, devel, virtualization

On Tue, 14 Feb 2017, Vitaly Kuznetsov wrote:

> Hi,
> 
> while we're still waiting for a definitive ACK from Microsoft that the
> algorithm is good for SMP case (as we can't prevent the code in vdso from
> migrating between CPUs) I'd like to send v2 with some modifications to keep
> the discussion going.

Migration is irrelevant. The TSC page is guest global so updates will
happen on some (random) host CPU and therefore you need the usual barriers
like we have them in our seqcounts unless an access to the sequence will
trap into the host, which would defeat the whole purpose of the TSC page.
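The seqcount pattern Thomas refers to, reduced to its reader half (a userspace sketch; GCC atomics stand in for the kernel's read_seqcount_begin()/read_seqcount_retry() and their smp_rmb()s):

```c
/*
 * Reader half of the kernel's seqcount pattern: spin while the writer is
 * mid-update (odd sequence), and retry the protected reads if the sequence
 * changed underneath us.
 */
static unsigned seq_read_begin(const volatile unsigned *seq)
{
	unsigned s;

	while ((s = __atomic_load_n(seq, __ATOMIC_ACQUIRE)) & 1)
		;	/* writer in progress */
	return s;
}

static int seq_read_retry(const volatile unsigned *seq, unsigned start)
{
	/* complete the protected reads before re-checking the sequence */
	__atomic_thread_fence(__ATOMIC_ACQUIRE);
	return *seq != start;
}
```

(Hyper-V's scheme differs in using 0 rather than odd values to mark an update in progress, but the retry structure is the same.)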

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 0/3] x86/vdso: Add Hyper-V TSC page clocksource support
  2017-02-14 14:56   ` Thomas Gleixner
@ 2017-02-14 15:50     ` Vitaly Kuznetsov
  -1 siblings, 0 replies; 28+ messages in thread
From: Vitaly Kuznetsov @ 2017-02-14 15:50 UTC (permalink / raw)
  To: Thomas Gleixner, K. Y. Srinivasan
  Cc: x86, Andy Lutomirski, Ingo Molnar, H. Peter Anvin, Haiyang Zhang,
	Stephen Hemminger, Dexuan Cui, linux-kernel, devel,
	virtualization

Thomas Gleixner <tglx@linutronix.de> writes:

> On Tue, 14 Feb 2017, Vitaly Kuznetsov wrote:
>
>> Hi,
>> 
>> while we're still waiting for a definitive ACK from Microsoft that the
>> algorithm is good for SMP case (as we can't prevent the code in vdso from
>> migrating between CPUs) I'd like to send v2 with some modifications to keep
>> the discussion going.
>
> Migration is irrelevant. The TSC page is guest global so updates will
> happen on some (random) host CPU and therefore you need the usual barriers
> like we have them in our seqcounts unless an access to the sequence will
> trap into the host, which would defeat the whole purpose of the TSC page.
>

KY Srinivasan <kys@microsoft.com> writes:

> I checked with the folks on the Hyper-V side and they have confirmed that we need to
> add memory barriers in the guest code to ensure the various reads from the TSC page are
> correctly ordered - especially, the initial read of the sequence counter must have acquire
> semantics. We should ensure that other reads from the TSC page are completed before the
> second read of the sequence counter. I am working with the Windows team to correctly
> reflect this algorithm in the Hyper-V specification.


Thank you,

do I get it right that combining the above I only need to replace
virt_rmb() barriers with plain rmb() to get 'lfence' in hv_read_tsc_page
(PATCH 2)? As members of struct ms_hyperv_tsc_page are volatile we don't
need READ_ONCE(), compilers are not allowed to merge accesses. The
resulting code looks good to me:

(gdb) disassemble read_hv_clock_tsc 
Dump of assembler code for function read_hv_clock_tsc:
   0xffffffff8102ca60 <+0>:	callq  0xffffffff816c7500 <__fentry__>
   0xffffffff8102ca65 <+5>:	mov    0xf67974(%rip),%rcx        # 0xffffffff81f943e0 <tsc_pg>
   0xffffffff8102ca6c <+12>:	jmp    0xffffffff8102ca87 <read_hv_clock_tsc+39>
   0xffffffff8102ca6e <+14>:	lfence 
   0xffffffff8102ca71 <+17>:	mov    0x8(%rcx),%r9
   0xffffffff8102ca75 <+21>:	mov    0x10(%rcx),%r8
   0xffffffff8102ca79 <+25>:	nop
   0xffffffff8102ca7a <+26>:	nop
   0xffffffff8102ca7b <+27>:	nop
   0xffffffff8102ca7c <+28>:	rdtsc  
   0xffffffff8102ca7e <+30>:	lfence 
   0xffffffff8102ca81 <+33>:	mov    (%rcx),%edi
   0xffffffff8102ca83 <+35>:	cmp    %edi,%esi
   0xffffffff8102ca85 <+37>:	je     0xffffffff8102caa3 <read_hv_clock_tsc+67>
   0xffffffff8102ca87 <+39>:	mov    (%rcx),%esi
   0xffffffff8102ca89 <+41>:	test   %esi,%esi
   0xffffffff8102ca8b <+43>:	jne    0xffffffff8102ca6e <read_hv_clock_tsc+14>
   0xffffffff8102ca8d <+45>:	push   %rbp
   0xffffffff8102ca8e <+46>:	mov    $0x40000020,%edi
   0xffffffff8102ca93 <+51>:	mov    %rsp,%rbp
   0xffffffff8102ca96 <+54>:	and    $0xfffffffffffffff0,%rsp
   0xffffffff8102ca9a <+58>:	callq  *0xffffffff81c36330
   0xffffffff8102caa1 <+65>:	leaveq 
   0xffffffff8102caa2 <+66>:	retq   
   0xffffffff8102caa3 <+67>:	shl    $0x20,%rdx
   0xffffffff8102caa7 <+71>:	or     %rdx,%rax
   0xffffffff8102caaa <+74>:	mul    %r9
   0xffffffff8102caad <+77>:	mov    %rdx,%rax
   0xffffffff8102cab0 <+80>:	add    %r8,%rax
   0xffffffff8102cab3 <+83>:	cmp    $0xffffffffffffffff,%rax
   0xffffffff8102cab7 <+87>:	je     0xffffffff8102ca8d <read_hv_clock_tsc+45>
   0xffffffff8102cab9 <+89>:	repz retq 
End of assembler dump.

-- 
  Vitaly

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 0/3] x86/vdso: Add Hyper-V TSC page clocksource support
  2017-02-14 15:50     ` Vitaly Kuznetsov
@ 2017-02-14 17:34       ` Andy Lutomirski
  -1 siblings, 0 replies; 28+ messages in thread
From: Andy Lutomirski @ 2017-02-14 17:34 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Thomas Gleixner, K. Y. Srinivasan, X86 ML, Ingo Molnar,
	H. Peter Anvin, Haiyang Zhang, Stephen Hemminger, Dexuan Cui,
	linux-kernel, devel, Linux Virtualization

On Tue, Feb 14, 2017 at 7:50 AM, Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
> Thomas Gleixner <tglx@linutronix.de> writes:
>
>> On Tue, 14 Feb 2017, Vitaly Kuznetsov wrote:
>>
>>> Hi,
>>>
>>> while we're still waiting for a definitive ACK from Microsoft that the
>>> algorithm is good for SMP case (as we can't prevent the code in vdso from
>>> migrating between CPUs) I'd like to send v2 with some modifications to keep
>>> the discussion going.
>>
>> Migration is irrelevant. The TSC page is guest global so updates will
>> happen on some (random) host CPU and therefore you need the usual barriers
>> like we have them in our seqcounts unless an access to the sequence will
>> trap into the host, which would defeat the whole purpose of the TSC page.
>>
>
> KY Srinivasan <kys@microsoft.com> writes:
>
>> I checked with the folks on the Hyper-V side and they have confirmed that we need to
>> add memory barriers in the guest code to ensure the various reads from the TSC page are
>> correctly ordered - especially, the initial read of the sequence counter must have acquire
>> semantics. We should ensure that other reads from the TSC page are completed before the
>> second read of the sequence counter. I am working with the Windows team to correctly
>> reflect this algorithm in the Hyper-V specification.
>
>
> Thank you,
>
> do I get it right that combining the above I only need to replace
> virt_rmb() barriers with plain rmb() to get 'lfence' in hv_read_tsc_page
> (PATCH 2)? As members of struct ms_hyperv_tsc_page are volatile we don't
> need READ_ONCE(), compilers are not allowed to merge accesses. The
> resulting code looks good to me:

No, on multiple counts, unfortunately.

1. LFENCE is basically useless except for IO and for (Intel only)
rdtsc_ordered().  AFAIK there is literally no scenario under which
LFENCE is useful for access to normal memory.

2. The problem here has little to do with barriers.  You're doing:

read seq;
read var1;
read var2;
read tsc;
read seq again;

If the hypervisor updates things between reading var1 and var2 or
between reading var2 and tsc, you aren't guaranteed to notice unless
something fancy is going on regardless of barriers.  Similarly, if the
hypervisor starts updating, then you start and finish the whole
function, and then the hypervisor finishes, what guarantees you
notice?

This really needs a spec from the hypervisor folks as to what's going
on.  KVM does this horrible thing in which it sometimes freezes all
vCPUs when updating, for better or for worse.  Mostly for worse.  If
MS does the same thing, they should document it.

3. You need rdtsc_ordered().  Plain RDTSC is not ordered wrt other
memory access, and it can observably screw up as a result.

Also, Linux tries to avoid volatile variables, so please use READ_ONCE().
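A sketch of the read loop restructured along these lines: READ_ONCE() at each access site instead of volatile struct members, with the TSC read taken through a callback where the kernel would use rdtsc_ordered(). Field names and the retry/fallback return codes are illustrative only:

```c
#include <stdint.h>

/* Kernel-style READ_ONCE(): a one-off volatile access, so the compiler can
 * neither merge nor re-read it, without making the struct members volatile. */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

struct tsc_page {		/* field names assumed */
	uint32_t tsc_sequence;
	uint64_t tsc_scale;
	int64_t  tsc_offset;
};

/*
 * One attempt at a sequence-checked read. The caller loops on a return of 1
 * and falls back to the hypercall/MSR path on -1. The ordered_tsc callback
 * stands in for rdtsc_ordered().
 */
static int tsc_page_try_read(const struct tsc_page *pg,
			     uint64_t (*ordered_tsc)(void), uint64_t *out)
{
	uint32_t seq = READ_ONCE(pg->tsc_sequence);
	uint64_t scale, tsc;
	int64_t offset;

	if (seq == 0)
		return -1;			/* TSC page disabled */
	scale  = READ_ONCE(pg->tsc_scale);
	offset = READ_ONCE(pg->tsc_offset);
	tsc    = ordered_tsc();
	if (READ_ONCE(pg->tsc_sequence) != seq)
		return 1;			/* concurrent update: retry */
	*out = (uint64_t)(((unsigned __int128)tsc * scale) >> 64) + offset;
	return 0;
}
```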

>
> (gdb) disassemble read_hv_clock_tsc
> Dump of assembler code for function read_hv_clock_tsc:
>    0xffffffff8102ca60 <+0>:     callq  0xffffffff816c7500 <__fentry__>
>    0xffffffff8102ca65 <+5>:     mov    0xf67974(%rip),%rcx        # 0xffffffff81f943e0 <tsc_pg>
>    0xffffffff8102ca6c <+12>:    jmp    0xffffffff8102ca87 <read_hv_clock_tsc+39>
>    0xffffffff8102ca6e <+14>:    lfence
>    0xffffffff8102ca71 <+17>:    mov    0x8(%rcx),%r9
>    0xffffffff8102ca75 <+21>:    mov    0x10(%rcx),%r8
>    0xffffffff8102ca79 <+25>:    nop
>    0xffffffff8102ca7a <+26>:    nop
>    0xffffffff8102ca7b <+27>:    nop
>    0xffffffff8102ca7c <+28>:    rdtsc
>    0xffffffff8102ca7e <+30>:    lfence
>    0xffffffff8102ca81 <+33>:    mov    (%rcx),%edi
>    0xffffffff8102ca83 <+35>:    cmp    %edi,%esi
>    0xffffffff8102ca85 <+37>:    je     0xffffffff8102caa3 <read_hv_clock_tsc+67>
>    0xffffffff8102ca87 <+39>:    mov    (%rcx),%esi
>    0xffffffff8102ca89 <+41>:    test   %esi,%esi
>    0xffffffff8102ca8b <+43>:    jne    0xffffffff8102ca6e <read_hv_clock_tsc+14>
>    0xffffffff8102ca8d <+45>:    push   %rbp
>    0xffffffff8102ca8e <+46>:    mov    $0x40000020,%edi
>    0xffffffff8102ca93 <+51>:    mov    %rsp,%rbp
>    0xffffffff8102ca96 <+54>:    and    $0xfffffffffffffff0,%rsp
>    0xffffffff8102ca9a <+58>:    callq  *0xffffffff81c36330
>    0xffffffff8102caa1 <+65>:    leaveq
>    0xffffffff8102caa2 <+66>:    retq
>    0xffffffff8102caa3 <+67>:    shl    $0x20,%rdx
>    0xffffffff8102caa7 <+71>:    or     %rdx,%rax
>    0xffffffff8102caaa <+74>:    mul    %r9
>    0xffffffff8102caad <+77>:    mov    %rdx,%rax
>    0xffffffff8102cab0 <+80>:    add    %r8,%rax
>    0xffffffff8102cab3 <+83>:    cmp    $0xffffffffffffffff,%rax
>    0xffffffff8102cab7 <+87>:    je     0xffffffff8102ca8d <read_hv_clock_tsc+45>
>    0xffffffff8102cab9 <+89>:    repz retq
> End of assembler dump.
>
> --
>   Vitaly



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 0/3] x86/vdso: Add Hyper-V TSC page clocksource support
  2017-02-14 17:34       ` Andy Lutomirski
@ 2017-02-15 14:01         ` Vitaly Kuznetsov
  -1 siblings, 0 replies; 28+ messages in thread
From: Vitaly Kuznetsov @ 2017-02-15 14:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, K. Y. Srinivasan, X86 ML, Ingo Molnar,
	H. Peter Anvin, Haiyang Zhang, Stephen Hemminger, Dexuan Cui,
	linux-kernel, devel, Linux Virtualization

Andy Lutomirski <luto@amacapital.net> writes:

> On Tue, Feb 14, 2017 at 7:50 AM, Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
>> Thomas Gleixner <tglx@linutronix.de> writes:
>>
>>> On Tue, 14 Feb 2017, Vitaly Kuznetsov wrote:
>>>
>>>> Hi,
>>>>
>>>> while we're still waiting for a definitive ACK from Microsoft that the
>>>> algorithm is good for SMP case (as we can't prevent the code in vdso from
>>>> migrating between CPUs) I'd like to send v2 with some modifications to keep
>>>> the discussion going.
>>>
>>> Migration is irrelevant. The TSC page is guest global so updates will
>>> happen on some (random) host CPU and therefore you need the usual barriers
>>> like we have them in our seqcounts unless an access to the sequence will
>>> trap into the host, which would defeat the whole purpose of the TSC page.
>>>
>>
>> KY Srinivasan <kys@microsoft.com> writes:
>>
>>> I checked with the folks on the Hyper-V side and they have confirmed that we need to
>>> add memory barriers in the guest code to ensure the various reads from the TSC page are
>>> correctly ordered - especially, the initial read of the sequence counter must have acquire
>>> semantics. We should ensure that other reads from the TSC page are completed before the
>>> second read of the sequence counter. I am working with the Windows team to correctly
>>> reflect this algorithm in the Hyper-V specification.
>>
>>
>> Thank you,
>>
>> do I get it right that combining the above I only need to replace
>> virt_rmb() barriers with plain rmb() to get 'lfence' in hv_read_tsc_page
>> (PATCH 2)? As members of struct ms_hyperv_tsc_page are volatile we don't
>> need READ_ONCE(), compilers are not allowed to merge accesses. The
>> resulting code looks good to me:
>
> No, on multiple counts, unfortunately.
>
> 1. LFENCE is basically useless except for IO and for (Intel only)
> rdtsc_ordered().  AFAIK there is literally no scenario under which
> LFENCE is useful for access to normal memory.
>

Interesting,

(For some reason I was under the impression that when I do

READ var1 -> reg1
READ var2 -> reg2

from normal memory, the reads can actually happen in any order and an LFENCE
in between gives us strict ordering.) But I completely agree it wouldn't
help in the situations you describe below:

> 2. The problem here has little to do with barriers.  You're doing:
>
> read seq;
> read var1;
> read var2;
> read tsc;
> read seq again;
>
> If the hypervisor updates things between reading var1 and var2 or
> between reading var2 and tsc, you aren't guaranteed to notice unless
> something fancy is going on regardless of barriers.  Similarly, if the
> hypervisor starts updating, then you start and finish the whole
> function, and then the hypervisor finishes, what guarantees you
> notice?
>
> This really needs a spec from the hypervisor folks as to what's going
> on.  KVM does this horrible thing in which it sometimes freezes all
> vCPUs when updating, for better or for worse.  Mostly for worse.  If
> MS does the same thing, they should document it.

... so I'll have to summon K. Y. again and ask him to use his magic
powers to extract some info from the Hyper-V team. As we have had the TSC
page clocksource for quite a while now and no bugs were reported, there
should be something.

Actually, we already have an implementation of TSC page update in KVM
(see arch/x86/kvm/hyperv.c, kvm_hv_setup_tsc_page()) and the update does
the following:

0) stash seq into seq_prev
1) seq = 0 making all reads from the page invalid
2) smp_wmb()
3) update tsc_scale, tsc_offset
4) smp_wmb()
5) set seq = seq_prev + 1

As far as I understand, this helps with the situations you described above,
as the guest will notice either the invalid value of 0 or a sequence change.
If the implementation in real Hyper-V is the same, we're safe with compiler
barriers only.
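Steps 0-5 map to roughly the following writer-side sketch (a userspace approximation of kvm_hv_setup_tsc_page(); release fences stand in for smp_wmb(), and the struct layout is assumed):

```c
#include <stdint.h>

struct tsc_page {
	uint32_t tsc_sequence;
	uint64_t tsc_scale;
	int64_t  tsc_offset;
};

/*
 * Writer side per steps 0-5 above: invalidate the page by zeroing the
 * sequence, publish the new scale/offset, then store the bumped previous
 * sequence. Readers observe either 0 (invalid) or a sequence change.
 * (A real implementation must also skip 0 on sequence wraparound.)
 */
static void tsc_page_update(struct tsc_page *pg,
			    uint64_t scale, int64_t offset)
{
	uint32_t seq_prev = pg->tsc_sequence;		/* step 0 */

	pg->tsc_sequence = 0;				/* step 1: invalidate */
	__atomic_thread_fence(__ATOMIC_RELEASE);	/* step 2: smp_wmb() */

	pg->tsc_scale  = scale;				/* step 3 */
	pg->tsc_offset = offset;

	__atomic_thread_fence(__ATOMIC_RELEASE);	/* step 4: smp_wmb() */
	pg->tsc_sequence = seq_prev + 1;		/* step 5 */
}
```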

>
> 3. You need rdtsc_ordered().  Plain RDTSC is not ordered wrt other
> memory access, and it can observably screw up as a result.

Another interesting thing is if we look at how this is implemented in
Windows (see linux.git commit c35b82ef0294) there are no barriers there
even for rdtsc...

>
> Also, Linux tries to avoid volatile variables, so please use READ_ONCE().
>

Will do both of the above in my next submission, thanks for the feedback!

[snip]

-- 
  Vitaly

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 0/3] x86/vdso: Add Hyper-V TSC page clocksource support
  2017-02-15 14:01         ` Vitaly Kuznetsov
@ 2017-02-16 17:50           ` Thomas Gleixner
  -1 siblings, 0 replies; 28+ messages in thread
From: Thomas Gleixner @ 2017-02-16 17:50 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Andy Lutomirski, K. Y. Srinivasan, X86 ML, Ingo Molnar,
	H. Peter Anvin, Haiyang Zhang, Stephen Hemminger, Dexuan Cui,
	linux-kernel, devel, Linux Virtualization

On Wed, 15 Feb 2017, Vitaly Kuznetsov wrote:
> Actually, we already have an implementation of TSC page update in KVM
> (see arch/x86/kvm/hyperv.c, kvm_hv_setup_tsc_page()) and the update does
> the following:
> 
> 0) stash seq into seq_prev
> 1) seq = 0 making all reads from the page invalid
> 2) smp_wmb()
> 3) update tsc_scale, tsc_offset
> 4) smp_wmb()
> 5) set seq = seq_prev + 1

I hope they handle the case where seq_prev overflows and becomes 0 :)
 
> As far as I understand this helps with situations you described above as
> guest will notice either invalid value of 0 or seq change. In case the
> implementation in real Hyper-V is the same we're safe with compile
> barriers only.

On x86 that's correct. smp_rmb() resolves to barrier(), but you certainly
need the smp_wmb() on the writer side.

Now looking at the above your reader side code is bogus:

+       while (1) {
+               sequence = tsc_pg->tsc_sequence;
+               if (!sequence)
+                       break;

Why would you break out of the loop when seq is 0? The 0 is just telling
you that there is an update in progress.

The Linux seqcount writer side is:

    seq++;
    smp_wmb();
    
    update...
    
    smp_wmb();
    seq++;

and it's defined that an odd sequence count, i.e. bit 0 set means update in
progress. Which is nice, because you don't have to treat 0 special on the
writer side and you don't need extra storage to stash seq away :)

So the reader side does:

    do {
        while (1) {
            s = READ_ONCE(seq);
            if (!(s & 0x01))
                break;
            cpu_relax();
        }
        smp_rmb();

        read data ...

        smp_rmb();
    } while (s != seq);

So for that hyperv thing you want:

    do {
        while (1) {
            s = READ_ONCE(seq);
            if (s)
                break;
            cpu_relax();
        }
        smp_rmb();

        read data ...

        smp_rmb();
    } while (s != seq);

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 0/3] x86/vdso: Add Hyper-V TSC page clocksource support
  2017-02-16 17:50           ` Thomas Gleixner
  (?)
@ 2017-02-17 10:14           ` Vitaly Kuznetsov
  2017-02-17 10:35               ` Thomas Gleixner
                               ` (2 more replies)
  -1 siblings, 3 replies; 28+ messages in thread
From: Vitaly Kuznetsov @ 2017-02-17 10:14 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, K. Y. Srinivasan, X86 ML, Ingo Molnar,
	H. Peter Anvin, Haiyang Zhang, Stephen Hemminger, Dexuan Cui,
	linux-kernel, devel, Linux Virtualization

Thomas Gleixner <tglx@linutronix.de> writes:

> On Wed, 15 Feb 2017, Vitaly Kuznetsov wrote:
>> Actually, we already have an implementation of TSC page update in KVM
>> (see arch/x86/kvm/hyperv.c, kvm_hv_setup_tsc_page()) and the update does
>> the following:
>> 
>> 0) stash seq into seq_prev
>> 1) seq = 0 making all reads from the page invalid
>> 2) smp_wmb()
>> 3) update tsc_scale, tsc_offset
>> 4) smp_wmb()
>> 5) set seq = seq_prev + 1
>
> I hope they handle the case where seq_prev overflows and becomes 0 :)
>
>> As far as I understand this helps with situations you described above as
>> guest will notice either invalid value of 0 or seq change. In case the
>> implementation in real Hyper-V is the same we're safe with compile
>> barriers only.
>
> On x86 that's correct. smp_rmb() resolves to barrier(), but you certainly
> need the smp_wmb() on the writer side.
>
> Now looking at the above your reader side code is bogus:
>
> +       while (1) {
> +               sequence = tsc_pg->tsc_sequence;
> +               if (!sequence)
> +                       break;
>
> Why would you break out of the loop when seq is 0? The 0 is just telling
> you that there is an update in progress.

Not only that. As far as I understand (and I *think* K. Y. pointed this
out), when a VM is migrating to another host the TSC page clocksource is
disabled for an extended period of time, so we're better off reading from
the MSR than looping here. With regard to the vDSO this means falling
back to doing a normal syscall.

>
> The Linux seqcount writer side is:
>
>     seq++;
>     smp_wmb();
>
>     update...
>
>     smp_wmb();
>     seq++;
>
> and it's defined that an odd sequence count, i.e. bit 0 set means update in
> progress. Which is nice, because you don't have to treat 0 special on the
> writer side and you don't need extra storage to stash seq away :)
>
> So the reader side does:
>
>     do {
>     	  while (1) {
>     	     s = READ_ONCE(seq);
> 	     if (!(s & 0x01))
> 	        break;
>      	     cpu_relax();
>          }
> 	 smp_rmb();
>
> 	 read data ...
>
> 	 smp_rmb();
>    } while (s != seq)
>
> So for that hyperv thing you want:
>
>     do {
>     	  while (1) {
>     	     s = READ_ONCE(seq);
> 	     if (s)
> 	        break;
>      	     cpu_relax();
>          }
> 	 smp_rmb();
>
> 	 read data ...
>
> 	 smp_rmb();
>    } while (s != seq)
>
> Thanks,
>
> 	tglx

-- 
  Vitaly

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 0/3] x86/vdso: Add Hyper-V TSC page clocksource support
  2017-02-17 10:14           ` Vitaly Kuznetsov
@ 2017-02-17 10:35               ` Thomas Gleixner
  2017-02-17 17:02             ` Andy Lutomirski
  2017-02-17 17:02             ` Andy Lutomirski
  2 siblings, 0 replies; 28+ messages in thread
From: Thomas Gleixner @ 2017-02-17 10:35 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Andy Lutomirski, K. Y. Srinivasan, X86 ML, Ingo Molnar,
	H. Peter Anvin, Haiyang Zhang, Stephen Hemminger, Dexuan Cui,
	linux-kernel, devel, Linux Virtualization

On Fri, 17 Feb 2017, Vitaly Kuznetsov wrote:
> Thomas Gleixner <tglx@linutronix.de> writes:
> > On Wed, 15 Feb 2017, Vitaly Kuznetsov wrote:
> >> Actually, we already have an implementation of TSC page update in KVM
> >> (see arch/x86/kvm/hyperv.c, kvm_hv_setup_tsc_page()) and the update does
> >> the following:
> >> 
> >> 0) stash seq into seq_prev
> >> 1) seq = 0 making all reads from the page invalid
> >> 2) smp_wmb()
> >> 3) update tsc_scale, tsc_offset
> >> 4) smp_wmb()
> >> 5) set seq = seq_prev + 1
> >
> > I hope they handle the case where seq_prev overflows and becomes 0 :)
> >
> >> As far as I understand this helps with situations you described above as
> >> guest will notice either invalid value of 0 or seq change. In case the
> >> implementation in real Hyper-V is the same we're safe with compile
> >> barriers only.
> >
> > On x86 that's correct. smp_rmb() resolves to barrier(), but you certainly
> > need the smp_wmb() on the writer side.
> >
> > Now looking at the above your reader side code is bogus:
> >
> > +       while (1) {
> > +               sequence = tsc_pg->tsc_sequence;
> > +               if (!sequence)
> > +                       break;
> >
> > Why would you break out of the loop when seq is 0? The 0 is just telling
> > you that there is an update in progress.
> 
> Not only. As far as I understand (and I *think* K. Y. pointed this out)
> when VM is migrating to another host TSC page clocksource is disabled for
> extended period of time so we're better off reading from MSR than
> looping here. With regards to VDSO this means reverting to doing normal
> syscall.

If you migrate to another host and the VM is using the TSC page, then the
TSC page on the new host _must_ be available and accessible _before_ the VM
resumes there. So that extended period of time does not make any sense at
all. Voodoo programming is the only explanation which comes to my mind.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 0/3] x86/vdso: Add Hyper-V TSC page clocksource support
  2017-02-17 10:14           ` Vitaly Kuznetsov
  2017-02-17 10:35               ` Thomas Gleixner
  2017-02-17 17:02             ` Andy Lutomirski
@ 2017-02-17 17:02             ` Andy Lutomirski
  2017-02-17 17:55               ` Thomas Gleixner
  2017-02-17 17:55               ` Thomas Gleixner
  2 siblings, 2 replies; 28+ messages in thread
From: Andy Lutomirski @ 2017-02-17 17:02 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Thomas Gleixner, K. Y. Srinivasan, X86 ML, Ingo Molnar,
	H. Peter Anvin, Haiyang Zhang, Stephen Hemminger, Dexuan Cui,
	linux-kernel, devel, Linux Virtualization

On Fri, Feb 17, 2017 at 2:14 AM, Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
> Thomas Gleixner <tglx@linutronix.de> writes:
>
>> On Wed, 15 Feb 2017, Vitaly Kuznetsov wrote:
>>> Actually, we already have an implementation of TSC page update in KVM
>>> (see arch/x86/kvm/hyperv.c, kvm_hv_setup_tsc_page()) and the update does
>>> the following:
>>>
>>> 0) stash seq into seq_prev
>>> 1) seq = 0 making all reads from the page invalid
>>> 2) smp_wmb()
>>> 3) update tsc_scale, tsc_offset
>>> 4) smp_wmb()
>>> 5) set seq = seq_prev + 1
>>
>> I hope they handle the case where seq_prev overflows and becomes 0 :)
>>
>>> As far as I understand this helps with situations you described above as
>>> guest will notice either invalid value of 0 or seq change. In case the
>>> implementation in real Hyper-V is the same we're safe with compile
>>> barriers only.
>>
>> On x86 that's correct. smp_rmb() resolves to barrier(), but you certainly
>> need the smp_wmb() on the writer side.
>>
>> Now looking at the above your reader side code is bogus:
>>
>> +       while (1) {
>> +               sequence = tsc_pg->tsc_sequence;
>> +               if (!sequence)
>> +                       break;
>>
>> Why would you break out of the loop when seq is 0? The 0 is just telling
>> you that there is an update in progress.
>
> Not only. As far as I understand (and I *think* K. Y. pointed this out)
> when VM is migrating to another host TSC page clocksource is disabled for
> extended period of time so we're better off reading from MSR than
> looping here. With regards to VDSO this means reverting to doing normal
> syscall.

That's a crappy design.  The guest really ought to be able to
distinguish "busy, try again" from "bail and use MSR".

Thomas, I can see valid reasons why the hypervisor might want to
temporarily disable shared page-based timing, but I think it's silly
that it's conflated with indicating "retry".

But if this is indeed the ABI, we're stuck with it.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 0/3] x86/vdso: Add Hyper-V TSC page clocksource support
  2017-02-17 17:02             ` Andy Lutomirski
  2017-02-17 17:55               ` Thomas Gleixner
@ 2017-02-17 17:55               ` Thomas Gleixner
  1 sibling, 0 replies; 28+ messages in thread
From: Thomas Gleixner @ 2017-02-17 17:55 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Vitaly Kuznetsov, K. Y. Srinivasan, X86 ML, Ingo Molnar,
	H. Peter Anvin, Haiyang Zhang, Stephen Hemminger, Dexuan Cui,
	linux-kernel, devel, Linux Virtualization

On Fri, 17 Feb 2017, Andy Lutomirski wrote:
> On Fri, Feb 17, 2017 at 2:14 AM, Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
> > Not only. As far as I understand (and I *think* K. Y. pointed this out)
> > when VM is migrating to another host TSC page clocksource is disabled for
> > extended period of time so we're better off reading from MSR than
> > looping here. With regards to VDSO this means reverting to doing normal
> > syscall.
> 
> That's a crappy design.  The guest really ought to be able to
> distinguish "busy, try again" from "bail and use MSR".
> 
> Thomas, I can see valid reasons why the hypervisor might want to
> temporarily disable shared page-based timing, but I think it's silly
> that it's conflated with indicating "retry".

It's beyond silly if that's the case.

> But if this is indeed the ABI, we're stuck with it.

No argument.

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2017-02-17 17:55 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-14 12:44 [PATCH v2 0/3] x86/vdso: Add Hyper-V TSC page clocksource support Vitaly Kuznetsov
2017-02-14 12:44 ` Vitaly Kuznetsov
2017-02-14 12:44 ` [PATCH v2 1/3] x86/hyperv: implement hv_get_tsc_page() Vitaly Kuznetsov
2017-02-14 12:44   ` Vitaly Kuznetsov
2017-02-14 12:44 ` [PATCH v2 2/3] x86/hyperv: move TSC reading method to asm/mshyperv.h Vitaly Kuznetsov
2017-02-14 12:44   ` Vitaly Kuznetsov
2017-02-14 12:44 ` [PATCH v2 3/3] x86/vdso: Add VCLOCK_HVCLOCK vDSO clock read method Vitaly Kuznetsov
2017-02-14 12:44 ` Vitaly Kuznetsov
2017-02-14 14:46 ` [PATCH v2 0/3] x86/vdso: Add Hyper-V TSC page clocksource support KY Srinivasan
2017-02-14 14:46   ` KY Srinivasan via Virtualization
2017-02-14 14:56 ` Thomas Gleixner
2017-02-14 14:56   ` Thomas Gleixner
2017-02-14 15:50   ` Vitaly Kuznetsov
2017-02-14 15:50     ` Vitaly Kuznetsov
2017-02-14 17:34     ` Andy Lutomirski
2017-02-14 17:34       ` Andy Lutomirski
2017-02-15 14:01       ` Vitaly Kuznetsov
2017-02-15 14:01         ` Vitaly Kuznetsov
2017-02-16 17:50         ` Thomas Gleixner
2017-02-16 17:50           ` Thomas Gleixner
2017-02-17 10:14           ` Vitaly Kuznetsov
2017-02-17 10:35             ` Thomas Gleixner
2017-02-17 10:35               ` Thomas Gleixner
2017-02-17 17:02             ` Andy Lutomirski
2017-02-17 17:02             ` Andy Lutomirski
2017-02-17 17:55               ` Thomas Gleixner
2017-02-17 17:55               ` Thomas Gleixner
2017-02-17 10:14           ` Vitaly Kuznetsov
