* [patch 00/11] x86/vdso: Cleanups, simplifications and CLOCK_TAI support
@ 2018-09-14 12:50 Thomas Gleixner
  2018-09-14 12:50 ` [patch 01/11] clocksource: Provide clocksource_arch_init() Thomas Gleixner
                   ` (14 more replies)
  0 siblings, 15 replies; 81+ messages in thread
From: Thomas Gleixner @ 2018-09-14 12:50 UTC (permalink / raw)
  To: LKML
  Cc: Andy Lutomirski, x86, Peter Zijlstra, Matt Rickard, Stephen Boyd,
	John Stultz, Florian Weimer, K. Y. Srinivasan, Vitaly Kuznetsov,
	devel, virtualization, Paolo Bonzini, Arnd Bergmann,
	Juergen Gross

Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime()
implementation, which extended the clockid switch case and added yet
another slightly different copy of the same code.

The extended switch case is especially problematic as the compiler tends to
generate a jump table, which then requires the use of retpolines. If jump
tables are disabled, it adds yet another conditional to the existing maze.

This series takes a different approach: the almost identical functions are
consolidated into one implementation for high resolution clocks and one for
the coarse grained clock ids, and the base data for each clock id is stored
in an array indexed by the clock id.

This completely eliminates the switch case and allows further
simplifications of the code base, which all together gain a few cycles of
performance or at least stay on par with today's code. The resulting
performance depends heavily on the micro architecture and the compiler.
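
As a rough standalone sketch of the core idea (hypothetical names, plain
userspace C for illustration; not the actual patch code):

#include <stdio.h>

typedef unsigned long long u64;

/* POSIX clock ids are fixed: CLOCK_REALTIME = 0 ... CLOCK_MONOTONIC_COARSE = 6 */
enum { CLK_REALTIME, CLK_MONOTONIC, CLK_REALTIME_COARSE = 5,
       CLK_MONOTONIC_COARSE = 6, CLK_BASES };

struct vgtod_ts {
	u64 sec;
	u64 nsec;
};

/* One base time per clock id, indexed directly by the id */
static struct vgtod_ts basetime[CLK_BASES];

/* The nearly identical per-clock readers collapse into one per class */
static void do_coarse(int clk, struct vgtod_ts *ts)
{
	*ts = basetime[clk];
}

int main(void)
{
	struct vgtod_ts ts;

	basetime[CLK_MONOTONIC_COARSE] = (struct vgtod_ts){ 42, 100 };
	do_coarse(CLK_MONOTONIC_COARSE, &ts);
	printf("%llu.%09llu\n", ts.sec, ts.nsec);
	return 0;
}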

Thanks,

	tglx

8<-------------------
 arch/x86/Kconfig                        |    1 
 arch/x86/entry/vdso/vclock_gettime.c    |  199 ++++++++------------------------
 arch/x86/entry/vsyscall/vsyscall_gtod.c |   55 ++++----
 arch/x86/include/asm/vgtod.h            |   46 ++++---
 arch/x86/kernel/time.c                  |   22 +++
 include/linux/clocksource.h             |    5 
 kernel/time/Kconfig                     |    4 
 kernel/time/clocksource.c               |    2 
 8 files changed, 144 insertions(+), 190 deletions(-)



* [patch 01/11] clocksource: Provide clocksource_arch_init()
  2018-09-14 12:50 [patch 00/11] x86/vdso: Cleanups, simplifications and CLOCK_TAI support Thomas Gleixner
@ 2018-09-14 12:50 ` Thomas Gleixner
  2018-09-14 12:50 ` [patch 02/11] x86/time: Implement clocksource_arch_init() Thomas Gleixner
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 81+ messages in thread
From: Thomas Gleixner @ 2018-09-14 12:50 UTC (permalink / raw)
  To: LKML
  Cc: Andy Lutomirski, x86, Peter Zijlstra, Matt Rickard, Stephen Boyd,
	John Stultz, Florian Weimer, K. Y. Srinivasan, Vitaly Kuznetsov,
	devel, virtualization, Paolo Bonzini, Arnd Bergmann,
	Juergen Gross

Architectures have extra archdata in the clocksource, e.g. for VDSO
support, but there are no sanity checks or general initializations
available for it. Add support for that.
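
For illustration, an architecture would opt in roughly like this (a sketch
with placeholder names; patch 02 adds the real x86 implementation):

  # arch/<arch>/Kconfig: opt in to the callback
  config <ARCH>
  	select ARCH_CLOCKSOURCE_INIT

  /* arch/<arch>/kernel/time.c */
  void clocksource_arch_init(struct clocksource *cs)
  {
  	/* Validate or initialize cs->archdata here. Called from
  	 * __clocksource_register_scale() before mult/shift are set up.
  	 */
  }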

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 include/linux/clocksource.h |    5 +++++
 kernel/time/Kconfig         |    4 ++++
 kernel/time/clocksource.c   |    2 ++
 3 files changed, 11 insertions(+)

--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -241,6 +241,11 @@ static inline void __clocksource_update_
 	__clocksource_update_freq_scale(cs, 1000, khz);
 }
 
+#ifdef CONFIG_ARCH_CLOCKSOURCE_INIT
+extern void clocksource_arch_init(struct clocksource *cs);
+#else
+static inline void clocksource_arch_init(struct clocksource *cs) { }
+#endif
 
 extern int timekeeping_notify(struct clocksource *clock);
 
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -12,6 +12,10 @@ config CLOCKSOURCE_WATCHDOG
 config ARCH_CLOCKSOURCE_DATA
 	bool
 
+# Architecture has extra clocksource init called from registration
+config ARCH_CLOCKSOURCE_INIT
+	bool
+
 # Clocksources require validation of the clocksource against the last
 # cycle update - x86/TSC misfeature
 config CLOCKSOURCE_VALIDATE_LAST_CYCLE
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -937,6 +937,8 @@ int __clocksource_register_scale(struct
 {
 	unsigned long flags;
 
+	clocksource_arch_init(cs);
+
 	/* Initialize mult/shift and max_idle_ns */
 	__clocksource_update_freq_scale(cs, scale, freq);
 




* [patch 02/11] x86/time: Implement clocksource_arch_init()
  2018-09-14 12:50 [patch 00/11] x86/vdso: Cleanups, simplifications and CLOCK_TAI support Thomas Gleixner
  2018-09-14 12:50 ` [patch 01/11] clocksource: Provide clocksource_arch_init() Thomas Gleixner
@ 2018-09-14 12:50 ` Thomas Gleixner
  2018-09-14 15:45   ` Vitaly Kuznetsov
  2018-09-14 12:50 ` [patch 03/11] x86/vdso: Enforce 64bit clocksource Thomas Gleixner
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 81+ messages in thread
From: Thomas Gleixner @ 2018-09-14 12:50 UTC (permalink / raw)
  To: LKML
  Cc: Andy Lutomirski, x86, Peter Zijlstra, Matt Rickard, Stephen Boyd,
	John Stultz, Florian Weimer, K. Y. Srinivasan, Vitaly Kuznetsov,
	devel, virtualization, Paolo Bonzini, Arnd Bergmann,
	Juergen Gross

Runtime validate the VCLOCK_MODE in clocksource::archdata and disable the
vclock if it is invalid, which disables the VDSO but keeps the system
running.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/Kconfig       |    1 +
 arch/x86/kernel/time.c |   16 ++++++++++++++++
 2 files changed, 17 insertions(+)

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -48,6 +48,7 @@ config X86
 	select ACPI_SYSTEM_POWER_STATES_SUPPORT	if ACPI
 	select ANON_INODES
 	select ARCH_CLOCKSOURCE_DATA
+	select ARCH_CLOCKSOURCE_INIT
 	select ARCH_DISCARD_MEMBLOCK
 	select ARCH_HAS_ACPI_TABLE_UPGRADE	if ACPI
 	select ARCH_HAS_DEBUG_VIRTUAL
--- a/arch/x86/kernel/time.c
+++ b/arch/x86/kernel/time.c
@@ -10,6 +10,7 @@
  *
  */
 
+#include <linux/clocksource.h>
 #include <linux/clockchips.h>
 #include <linux/interrupt.h>
 #include <linux/irq.h>
@@ -105,3 +106,18 @@ void __init time_init(void)
 {
 	late_time_init = x86_late_time_init;
 }
+
+/*
+ * Sanity check the vdso related archdata content.
+ */
+void clocksource_arch_init(struct clocksource *cs)
+{
+	if (cs->archdata.vclock_mode == VCLOCK_NONE)
+		return;
+
+	if (cs->archdata.vclock_mode >= VCLOCK_MAX) {
+		pr_warn("clocksource %s registered with invalid vclock_mode %d. Disabling vclock.\n",
+			cs->name, cs->archdata.vclock_mode);
+		cs->archdata.vclock_mode = VCLOCK_NONE;
+	}
+}




* [patch 03/11] x86/vdso: Enforce 64bit clocksource
  2018-09-14 12:50 [patch 00/11] x86/vdso: Cleanups, simplifications and CLOCK_TAI support Thomas Gleixner
  2018-09-14 12:50 ` [patch 01/11] clocksource: Provide clocksource_arch_init() Thomas Gleixner
  2018-09-14 12:50 ` [patch 02/11] x86/time: Implement clocksource_arch_init() Thomas Gleixner
@ 2018-09-14 12:50 ` Thomas Gleixner
  2018-09-14 12:50 ` [patch 04/11] x86/vdso: Use unsigned int consistently for vsyscall_gtod_data::seq Thomas Gleixner
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 81+ messages in thread
From: Thomas Gleixner @ 2018-09-14 12:50 UTC (permalink / raw)
  To: LKML
  Cc: Andy Lutomirski, x86, Peter Zijlstra, Matt Rickard, Stephen Boyd,
	John Stultz, Florian Weimer, K. Y. Srinivasan, Vitaly Kuznetsov,
	devel, virtualization, Paolo Bonzini, Arnd Bergmann,
	Juergen Gross

All VDSO clock sources are TSC based and use CLOCKSOURCE_MASK(64). There is
no point in masking with all bits set. Get rid of the masking and enforce
the 64bit mask in the sanity checker instead.
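
For reference, CLOCKSOURCE_MASK(64) has all 64 bits set, so the masking was
a no-op; a quick standalone check (not patch code):

#include <assert.h>
#include <stdint.h>

int main(void)
{
	uint64_t mask = ~0ULL;	/* value of CLOCKSOURCE_MASK(64) */
	uint64_t cycles = 0xdeadbeef, last = 0x1234;

	/* (cycles - last) & mask is identical to the plain subtraction */
	assert(((cycles - last) & mask) == (cycles - last));
	return 0;
}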

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/entry/vdso/vclock_gettime.c |    2 +-
 arch/x86/kernel/time.c               |    6 ++++++
 2 files changed, 7 insertions(+), 1 deletion(-)

--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -199,7 +199,7 @@ notrace static inline u64 vgetsns(int *m
 #endif
 	else
 		return 0;
-	v = (cycles - gtod->cycle_last) & gtod->mask;
+	v = cycles - gtod->cycle_last;
 	return v * gtod->mult;
 }
 
--- a/arch/x86/kernel/time.c
+++ b/arch/x86/kernel/time.c
@@ -120,4 +120,10 @@ void clocksource_arch_init(struct clocks
 			cs->name, cs->archdata.vclock_mode);
 		cs->archdata.vclock_mode = VCLOCK_NONE;
 	}
+
+	if (cs->mask != CLOCKSOURCE_MASK(64)) {
+		pr_warn("clocksource %s registered with invalid mask %016llx. Disabling vclock.\n",
+			cs->name, cs->mask);
+		cs->archdata.vclock_mode = VCLOCK_NONE;
+	}
 }




* [patch 04/11] x86/vdso: Use unsigned int consistently for vsyscall_gtod_data::seq
  2018-09-14 12:50 [patch 00/11] x86/vdso: Cleanups, simplifications and CLOCK_TAI support Thomas Gleixner
                   ` (2 preceding siblings ...)
  2018-09-14 12:50 ` [patch 03/11] x86/vdso: Enforce 64bit clocksource Thomas Gleixner
@ 2018-09-14 12:50 ` Thomas Gleixner
  2018-09-14 12:50 ` [patch 05/11] x86/vdso: Introduce and use vgtod_ts Thomas Gleixner
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 81+ messages in thread
From: Thomas Gleixner @ 2018-09-14 12:50 UTC (permalink / raw)
  To: LKML
  Cc: Andy Lutomirski, x86, Peter Zijlstra, Matt Rickard, Stephen Boyd,
	John Stultz, Florian Weimer, K. Y. Srinivasan, Vitaly Kuznetsov,
	devel, virtualization, Paolo Bonzini, Arnd Bergmann,
	Juergen Gross

The sequence count in vgtod_data is unsigned int, but the call sites use
unsigned long, which is a pointless exercise. Fix the call sites and
replace bare 'unsigned' with 'unsigned int' while at it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/entry/vdso/vclock_gettime.c |    8 ++++----
 arch/x86/include/asm/vgtod.h         |   10 +++++-----
 2 files changed, 9 insertions(+), 9 deletions(-)

--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -206,7 +206,7 @@ notrace static inline u64 vgetsns(int *m
 /* Code size doesn't matter (vdso is 4k anyway) and this is faster. */
 notrace static int __always_inline do_realtime(struct timespec *ts)
 {
-	unsigned long seq;
+	unsigned int seq;
 	u64 ns;
 	int mode;
 
@@ -227,7 +227,7 @@ notrace static int __always_inline do_re
 
 notrace static int __always_inline do_monotonic(struct timespec *ts)
 {
-	unsigned long seq;
+	unsigned int seq;
 	u64 ns;
 	int mode;
 
@@ -248,7 +248,7 @@ notrace static int __always_inline do_mo
 
 notrace static void do_realtime_coarse(struct timespec *ts)
 {
-	unsigned long seq;
+	unsigned int seq;
 	do {
 		seq = gtod_read_begin(gtod);
 		ts->tv_sec = gtod->wall_time_coarse_sec;
@@ -258,7 +258,7 @@ notrace static void do_realtime_coarse(s
 
 notrace static void do_monotonic_coarse(struct timespec *ts)
 {
-	unsigned long seq;
+	unsigned int seq;
 	do {
 		seq = gtod_read_begin(gtod);
 		ts->tv_sec = gtod->monotonic_time_coarse_sec;
--- a/arch/x86/include/asm/vgtod.h
+++ b/arch/x86/include/asm/vgtod.h
@@ -15,9 +15,9 @@ typedef unsigned long gtod_long_t;
  * so be carefull by modifying this structure.
  */
 struct vsyscall_gtod_data {
-	unsigned seq;
+	unsigned int seq;
 
-	int vclock_mode;
+	int	vclock_mode;
 	u64	cycle_last;
 	u64	mask;
 	u32	mult;
@@ -44,9 +44,9 @@ static inline bool vclock_was_used(int v
 	return READ_ONCE(vclocks_used) & (1 << vclock);
 }
 
-static inline unsigned gtod_read_begin(const struct vsyscall_gtod_data *s)
+static inline unsigned int gtod_read_begin(const struct vsyscall_gtod_data *s)
 {
-	unsigned ret;
+	unsigned int ret;
 
 repeat:
 	ret = READ_ONCE(s->seq);
@@ -59,7 +59,7 @@ static inline unsigned gtod_read_begin(c
 }
 
 static inline int gtod_read_retry(const struct vsyscall_gtod_data *s,
-					unsigned start)
+				  unsigned int start)
 {
 	smp_rmb();
 	return unlikely(s->seq != start);




* [patch 05/11] x86/vdso: Introduce and use vgtod_ts
  2018-09-14 12:50 [patch 00/11] x86/vdso: Cleanups, simplifications and CLOCK_TAI support Thomas Gleixner
                   ` (3 preceding siblings ...)
  2018-09-14 12:50 ` [patch 04/11] x86/vdso: Use unsigned int consistently for vsyscall_gtod_data::seq Thomas Gleixner
@ 2018-09-14 12:50 ` Thomas Gleixner
  2018-09-14 12:50 ` [patch 06/11] x86/vdso: Collapse high resolution functions Thomas Gleixner
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 81+ messages in thread
From: Thomas Gleixner @ 2018-09-14 12:50 UTC (permalink / raw)
  To: LKML
  Cc: Andy Lutomirski, x86, Peter Zijlstra, Matt Rickard, Stephen Boyd,
	John Stultz, Florian Weimer, K. Y. Srinivasan, Vitaly Kuznetsov,
	devel, virtualization, Paolo Bonzini, Arnd Bergmann,
	Juergen Gross

It's desired to support more clocks in the VDSO, e.g. CLOCK_TAI. This
results either in indirect calls due to the larger switch case, which then
requires retpolines, or, when the compiler is forced to avoid jump tables,
in even more conditionals.

To avoid both variants, which are bad for performance, the high resolution
functions and the coarse grained functions will be collapsed into one for
each. That requires storing the clock specific base time in an array.

Introduce struct vgtod_ts for storage and convert the data store, the
update function and the individual clock functions over to use it.

The new storage no longer uses gtod_long_t for seconds, which depends on a
32 or 64 bit compile, because the seconds need to be the full 64bit value
even for 32bit once a Y2038 safe function is added. There is no point in
keeping the distinction alive in the internal representation.
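
As a quick illustration of the Y2038 point: a signed 32bit seconds count
overflows at 2^31 - 1 seconds past the epoch, which a few lines of C can
demonstrate:

#include <stdio.h>
#include <time.h>

int main(void)
{
	/* 2^31 - 1 seconds after 1970-01-01 00:00:00 UTC */
	time_t t = 2147483647;

	/* Prints "Tue Jan 19 03:14:07 2038"; one second later a signed
	 * 32bit time_t would wrap back to 1901. */
	printf("%s", asctime(gmtime(&t)));
	return 0;
}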

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/entry/vdso/vclock_gettime.c    |   24 +++++++++------
 arch/x86/entry/vsyscall/vsyscall_gtod.c |   51 ++++++++++++++++----------------
 arch/x86/include/asm/vgtod.h            |   36 ++++++++++++----------
 3 files changed, 61 insertions(+), 50 deletions(-)

--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -206,6 +206,7 @@ notrace static inline u64 vgetsns(int *m
 /* Code size doesn't matter (vdso is 4k anyway) and this is faster. */
 notrace static int __always_inline do_realtime(struct timespec *ts)
 {
+	struct vgtod_ts *base = &gtod->basetime[CLOCK_REALTIME];
 	unsigned int seq;
 	u64 ns;
 	int mode;
@@ -213,8 +214,8 @@ notrace static int __always_inline do_re
 	do {
 		seq = gtod_read_begin(gtod);
 		mode = gtod->vclock_mode;
-		ts->tv_sec = gtod->wall_time_sec;
-		ns = gtod->wall_time_snsec;
+		ts->tv_sec = base->sec;
+		ns = base->nsec;
 		ns += vgetsns(&mode);
 		ns >>= gtod->shift;
 	} while (unlikely(gtod_read_retry(gtod, seq)));
@@ -227,6 +228,7 @@ notrace static int __always_inline do_re
 
 notrace static int __always_inline do_monotonic(struct timespec *ts)
 {
+	struct vgtod_ts *base = &gtod->basetime[CLOCK_MONOTONIC];
 	unsigned int seq;
 	u64 ns;
 	int mode;
@@ -234,8 +236,8 @@ notrace static int __always_inline do_mo
 	do {
 		seq = gtod_read_begin(gtod);
 		mode = gtod->vclock_mode;
-		ts->tv_sec = gtod->monotonic_time_sec;
-		ns = gtod->monotonic_time_snsec;
+		ts->tv_sec = base->sec;
+		ns = base->nsec;
 		ns += vgetsns(&mode);
 		ns >>= gtod->shift;
 	} while (unlikely(gtod_read_retry(gtod, seq)));
@@ -248,21 +250,25 @@ notrace static int __always_inline do_mo
 
 notrace static void do_realtime_coarse(struct timespec *ts)
 {
+	struct vgtod_ts *base = &gtod->basetime[CLOCK_REALTIME_COARSE];
 	unsigned int seq;
+
 	do {
 		seq = gtod_read_begin(gtod);
-		ts->tv_sec = gtod->wall_time_coarse_sec;
-		ts->tv_nsec = gtod->wall_time_coarse_nsec;
+		ts->tv_sec = base->sec;
+		ts->tv_nsec = base->nsec;
 	} while (unlikely(gtod_read_retry(gtod, seq)));
 }
 
 notrace static void do_monotonic_coarse(struct timespec *ts)
 {
+	struct vgtod_ts *base = &gtod->basetime[CLOCK_MONOTONIC_COARSE];
 	unsigned int seq;
+
 	do {
 		seq = gtod_read_begin(gtod);
-		ts->tv_sec = gtod->monotonic_time_coarse_sec;
-		ts->tv_nsec = gtod->monotonic_time_coarse_nsec;
+		ts->tv_sec = base->sec;
+		ts->tv_nsec = base->nsec;
 	} while (unlikely(gtod_read_retry(gtod, seq)));
 }
 
@@ -318,7 +324,7 @@ int gettimeofday(struct timeval *, struc
 notrace time_t __vdso_time(time_t *t)
 {
 	/* This is atomic on x86 so we don't need any locks. */
-	time_t result = READ_ONCE(gtod->wall_time_sec);
+	time_t result = READ_ONCE(gtod->basetime[CLOCK_REALTIME].sec);
 
 	if (t)
 		*t = result;
--- a/arch/x86/entry/vsyscall/vsyscall_gtod.c
+++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c
@@ -31,6 +31,8 @@ void update_vsyscall(struct timekeeper *
 {
 	int vclock_mode = tk->tkr_mono.clock->archdata.vclock_mode;
 	struct vsyscall_gtod_data *vdata = &vsyscall_gtod_data;
+	struct vgtod_ts *base;
+	u64 nsec;
 
 	/* Mark the new vclock used. */
 	BUILD_BUG_ON(VCLOCK_MAX >= 32);
@@ -45,34 +47,33 @@ void update_vsyscall(struct timekeeper *
 	vdata->mult		= tk->tkr_mono.mult;
 	vdata->shift		= tk->tkr_mono.shift;
 
-	vdata->wall_time_sec		= tk->xtime_sec;
-	vdata->wall_time_snsec		= tk->tkr_mono.xtime_nsec;
-
-	vdata->monotonic_time_sec	= tk->xtime_sec
-					+ tk->wall_to_monotonic.tv_sec;
-	vdata->monotonic_time_snsec	= tk->tkr_mono.xtime_nsec
-					+ ((u64)tk->wall_to_monotonic.tv_nsec
-						<< tk->tkr_mono.shift);
-	while (vdata->monotonic_time_snsec >=
-					(((u64)NSEC_PER_SEC) << tk->tkr_mono.shift)) {
-		vdata->monotonic_time_snsec -=
-					((u64)NSEC_PER_SEC) << tk->tkr_mono.shift;
-		vdata->monotonic_time_sec++;
+	base = &vdata->basetime[CLOCK_REALTIME];
+	base->sec = tk->xtime_sec;
+	base->nsec = tk->tkr_mono.xtime_nsec;
+
+	base = &vdata->basetime[CLOCK_MONOTONIC];
+	base->sec = tk->xtime_sec + tk->wall_to_monotonic.tv_sec;
+	nsec = tk->tkr_mono.xtime_nsec;
+	nsec +=	((u64)tk->wall_to_monotonic.tv_nsec << tk->tkr_mono.shift);
+	while (nsec >= (((u64)NSEC_PER_SEC) << tk->tkr_mono.shift)) {
+		nsec -= ((u64)NSEC_PER_SEC) << tk->tkr_mono.shift;
+		base->sec++;
 	}
+	base->nsec = nsec;
 
-	vdata->wall_time_coarse_sec	= tk->xtime_sec;
-	vdata->wall_time_coarse_nsec	= (long)(tk->tkr_mono.xtime_nsec >>
-						 tk->tkr_mono.shift);
-
-	vdata->monotonic_time_coarse_sec =
-		vdata->wall_time_coarse_sec + tk->wall_to_monotonic.tv_sec;
-	vdata->monotonic_time_coarse_nsec =
-		vdata->wall_time_coarse_nsec + tk->wall_to_monotonic.tv_nsec;
-
-	while (vdata->monotonic_time_coarse_nsec >= NSEC_PER_SEC) {
-		vdata->monotonic_time_coarse_nsec -= NSEC_PER_SEC;
-		vdata->monotonic_time_coarse_sec++;
+	base = &vdata->basetime[CLOCK_REALTIME_COARSE];
+	base->sec = tk->xtime_sec;
+	base->nsec = tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift;
+
+	base = &vdata->basetime[CLOCK_MONOTONIC_COARSE];
+	base->sec = tk->xtime_sec + tk->wall_to_monotonic.tv_sec;
+	nsec = tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift;
+	nsec += tk->wall_to_monotonic.tv_nsec;
+	while (nsec >= NSEC_PER_SEC) {
+		nsec -= NSEC_PER_SEC;
+		base->sec++;
 	}
+	base->nsec = nsec;
 
 	gtod_write_end(vdata);
 }
--- a/arch/x86/include/asm/vgtod.h
+++ b/arch/x86/include/asm/vgtod.h
@@ -5,33 +5,37 @@
 #include <linux/compiler.h>
 #include <linux/clocksource.h>
 
+#include <uapi/linux/time.h>
+
 #ifdef BUILD_VDSO32_64
 typedef u64 gtod_long_t;
 #else
 typedef unsigned long gtod_long_t;
 #endif
+
+struct vgtod_ts {
+	u64		sec;
+	u64		nsec;
+};
+
+#define VGTOD_BASES	(CLOCK_MONOTONIC_COARSE + 1)
+#define VGTOD_HRES	(BIT(CLOCK_REALTIME) | BIT(CLOCK_MONOTONIC))
+#define VGTOD_COARSE	(BIT(CLOCK_REALTIME_COARSE) | BIT(CLOCK_MONOTONIC_COARSE))
+
 /*
  * vsyscall_gtod_data will be accessed by 32 and 64 bit code at the same time
  * so be carefull by modifying this structure.
  */
 struct vsyscall_gtod_data {
-	unsigned int seq;
+	unsigned int	seq;
+
+	int		vclock_mode;
+	u64		cycle_last;
+	u64		mask;
+	u32		mult;
+	u32		shift;
 
-	int	vclock_mode;
-	u64	cycle_last;
-	u64	mask;
-	u32	mult;
-	u32	shift;
-
-	/* open coded 'struct timespec' */
-	u64		wall_time_snsec;
-	gtod_long_t	wall_time_sec;
-	gtod_long_t	monotonic_time_sec;
-	u64		monotonic_time_snsec;
-	gtod_long_t	wall_time_coarse_sec;
-	gtod_long_t	wall_time_coarse_nsec;
-	gtod_long_t	monotonic_time_coarse_sec;
-	gtod_long_t	monotonic_time_coarse_nsec;
+	struct vgtod_ts	basetime[VGTOD_BASES];
 
 	int		tz_minuteswest;
 	int		tz_dsttime;




* [patch 06/11] x86/vdso: Collapse high resolution functions
  2018-09-14 12:50 [patch 00/11] x86/vdso: Cleanups, simplifications and CLOCK_TAI support Thomas Gleixner
                   ` (4 preceding siblings ...)
  2018-09-14 12:50 ` [patch 05/11] x86/vdso: Introduce and use vgtod_ts Thomas Gleixner
@ 2018-09-14 12:50 ` Thomas Gleixner
  2018-09-14 12:50 ` [patch 07/11] x86/vdso: Collapse coarse functions Thomas Gleixner
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 81+ messages in thread
From: Thomas Gleixner @ 2018-09-14 12:50 UTC (permalink / raw)
  To: LKML
  Cc: Andy Lutomirski, x86, Peter Zijlstra, Matt Rickard, Stephen Boyd,
	John Stultz, Florian Weimer, K. Y. Srinivasan, Vitaly Kuznetsov,
	devel, virtualization, Paolo Bonzini, Arnd Bergmann,
	Juergen Gross

do_realtime() and do_monotonic() are now the same except for the storage
array index. Hand the index in as an argument and collapse the functions.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/entry/vdso/vclock_gettime.c |   35 +++++++----------------------------
 1 file changed, 7 insertions(+), 28 deletions(-)

--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -203,35 +203,12 @@ notrace static inline u64 vgetsns(int *m
 	return v * gtod->mult;
 }
 
-/* Code size doesn't matter (vdso is 4k anyway) and this is faster. */
-notrace static int __always_inline do_realtime(struct timespec *ts)
+notrace static int do_hres(clockid_t clk, struct timespec *ts)
 {
-	struct vgtod_ts *base = &gtod->basetime[CLOCK_REALTIME];
+	struct vgtod_ts *base = &gtod->basetime[clk];
 	unsigned int seq;
-	u64 ns;
 	int mode;
-
-	do {
-		seq = gtod_read_begin(gtod);
-		mode = gtod->vclock_mode;
-		ts->tv_sec = base->sec;
-		ns = base->nsec;
-		ns += vgetsns(&mode);
-		ns >>= gtod->shift;
-	} while (unlikely(gtod_read_retry(gtod, seq)));
-
-	ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
-	ts->tv_nsec = ns;
-
-	return mode;
-}
-
-notrace static int __always_inline do_monotonic(struct timespec *ts)
-{
-	struct vgtod_ts *base = &gtod->basetime[CLOCK_MONOTONIC];
-	unsigned int seq;
 	u64 ns;
-	int mode;
 
 	do {
 		seq = gtod_read_begin(gtod);
@@ -276,11 +253,11 @@ notrace int __vdso_clock_gettime(clockid
 {
 	switch (clock) {
 	case CLOCK_REALTIME:
-		if (do_realtime(ts) == VCLOCK_NONE)
+		if (do_hres(CLOCK_REALTIME, ts) == VCLOCK_NONE)
 			goto fallback;
 		break;
 	case CLOCK_MONOTONIC:
-		if (do_monotonic(ts) == VCLOCK_NONE)
+		if (do_hres(CLOCK_MONOTONIC, ts) == VCLOCK_NONE)
 			goto fallback;
 		break;
 	case CLOCK_REALTIME_COARSE:
@@ -303,7 +280,9 @@ int clock_gettime(clockid_t, struct time
 notrace int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz)
 {
 	if (likely(tv != NULL)) {
-		if (unlikely(do_realtime((struct timespec *)tv) == VCLOCK_NONE))
+		struct timespec *ts = (struct timespec *) tv;
+
+		if (unlikely(do_hres(CLOCK_REALTIME, ts) == VCLOCK_NONE))
 			return vdso_fallback_gtod(tv, tz);
 		tv->tv_usec /= 1000;
 	}




* [patch 07/11] x86/vdso: Collapse coarse functions
  2018-09-14 12:50 [patch 00/11] x86/vdso: Cleanups, simplifications and CLOCK_TAI support Thomas Gleixner
                   ` (5 preceding siblings ...)
  2018-09-14 12:50 ` [patch 06/11] x86/vdso: Collapse high resolution functions Thomas Gleixner
@ 2018-09-14 12:50 ` Thomas Gleixner
  2018-09-14 12:50 ` [patch 08/11] x86/vdso: Replace the clockid switch case Thomas Gleixner
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 81+ messages in thread
From: Thomas Gleixner @ 2018-09-14 12:50 UTC (permalink / raw)
  To: LKML
  Cc: Andy Lutomirski, x86, Peter Zijlstra, Matt Rickard, Stephen Boyd,
	John Stultz, Florian Weimer, K. Y. Srinivasan, Vitaly Kuznetsov,
	devel, virtualization, Paolo Bonzini, Arnd Bergmann,
	Juergen Gross

do_realtime_coarse() and do_monotonic_coarse() are now the same except for
the storage array index. Hand the index in as an argument and collapse the
functions.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/entry/vdso/vclock_gettime.c |   20 ++++----------------
 1 file changed, 4 insertions(+), 16 deletions(-)

--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -225,21 +225,9 @@ notrace static int do_hres(clockid_t clk
 	return mode;
 }
 
-notrace static void do_realtime_coarse(struct timespec *ts)
+notrace static void do_coarse(clockid_t clk, struct timespec *ts)
 {
-	struct vgtod_ts *base = &gtod->basetime[CLOCK_REALTIME_COARSE];
-	unsigned int seq;
-
-	do {
-		seq = gtod_read_begin(gtod);
-		ts->tv_sec = base->sec;
-		ts->tv_nsec = base->nsec;
-	} while (unlikely(gtod_read_retry(gtod, seq)));
-}
-
-notrace static void do_monotonic_coarse(struct timespec *ts)
-{
-	struct vgtod_ts *base = &gtod->basetime[CLOCK_MONOTONIC_COARSE];
+	struct vgtod_ts *base = &gtod->basetime[clk];
 	unsigned int seq;
 
 	do {
@@ -261,10 +249,10 @@ notrace int __vdso_clock_gettime(clockid
 			goto fallback;
 		break;
 	case CLOCK_REALTIME_COARSE:
-		do_realtime_coarse(ts);
+		do_coarse(CLOCK_REALTIME_COARSE, ts);
 		break;
 	case CLOCK_MONOTONIC_COARSE:
-		do_monotonic_coarse(ts);
+		do_coarse(CLOCK_MONOTONIC_COARSE, ts);
 		break;
 	default:
 		goto fallback;




* [patch 08/11] x86/vdso: Replace the clockid switch case
  2018-09-14 12:50 [patch 00/11] x86/vdso: Cleanups, simplifications and CLOCK_TAI support Thomas Gleixner
                   ` (6 preceding siblings ...)
  2018-09-14 12:50 ` [patch 07/11] x86/vdso: Collapse coarse functions Thomas Gleixner
@ 2018-09-14 12:50 ` Thomas Gleixner
  2018-09-14 12:50 ` [patch 09/11] x86/vdso: Simplify the invalid vclock case Thomas Gleixner
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 81+ messages in thread
From: Thomas Gleixner @ 2018-09-14 12:50 UTC (permalink / raw)
  To: LKML
  Cc: Andy Lutomirski, x86, Peter Zijlstra, Matt Rickard, Stephen Boyd,
	John Stultz, Florian Weimer, K. Y. Srinivasan, Vitaly Kuznetsov,
	devel, virtualization, Paolo Bonzini, Arnd Bergmann,
	Juergen Gross

Now that the time getter functions use the clockid as index into the
storage array for the base time access, the switch case can be replaced.

- Check for clockid >= MAX_CLOCKS and for negative clockid (CPU/FD) first
  and call the fallback function right away.

- After establishing that clockid is < MAX_CLOCKS, convert the clockid to a
  bitmask.

- Check for the supported high resolution and coarse functions by ANDing
  the clockid bitmask with the bitmask of supported clocks and checking
  whether a bit is set.

This completely avoids jump tables, reduces the number of conditionals and
makes the VDSO extensible for other clock ids.
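
A worked example of the bitmask check, using the fixed POSIX clock id
values (a standalone sketch, not the patch code):

#include <assert.h>

#define BIT(n)		(1U << (n))
/* Fixed ids: REALTIME = 0, MONOTONIC = 1, REALTIME_COARSE = 5, MONOTONIC_COARSE = 6 */
#define VGTOD_HRES	(BIT(0) | BIT(1))
#define VGTOD_COARSE	(BIT(5) | BIT(6))

int main(void)
{
	unsigned int msk;

	msk = 1U << 1;			/* CLOCK_MONOTONIC */
	assert(msk & VGTOD_HRES);	/* handled by do_hres() */

	msk = 1U << 6;			/* CLOCK_MONOTONIC_COARSE */
	assert(msk & VGTOD_COARSE);	/* handled by do_coarse() */

	msk = 1U << 2;			/* CLOCK_PROCESS_CPUTIME_ID */
	assert(!(msk & (VGTOD_HRES | VGTOD_COARSE)));	/* falls back */

	return 0;
}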

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/entry/vdso/vclock_gettime.c |   38 ++++++++++++++++-------------------
 1 file changed, 18 insertions(+), 20 deletions(-)

--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -239,29 +239,27 @@ notrace static void do_coarse(clockid_t
 
 notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
 {
-	switch (clock) {
-	case CLOCK_REALTIME:
-		if (do_hres(CLOCK_REALTIME, ts) == VCLOCK_NONE)
-			goto fallback;
-		break;
-	case CLOCK_MONOTONIC:
-		if (do_hres(CLOCK_MONOTONIC, ts) == VCLOCK_NONE)
-			goto fallback;
-		break;
-	case CLOCK_REALTIME_COARSE:
-		do_coarse(CLOCK_REALTIME_COARSE, ts);
-		break;
-	case CLOCK_MONOTONIC_COARSE:
-		do_coarse(CLOCK_MONOTONIC_COARSE, ts);
-		break;
-	default:
-		goto fallback;
-	}
+	unsigned int msk;
+
+	/* Sort out negative (CPU/FD) and invalid clocks */
+	if (unlikely((unsigned int) clock >= MAX_CLOCKS))
+		return vdso_fallback_gettime(clock, ts);
 
-	return 0;
-fallback:
+	/*
+	 * Convert the clockid to a bitmask and use it to check which
+	 * clocks are handled in the VDSO directly.
+	 */
+	msk = 1U << clock;
+	if (likely(msk & VGTOD_HRES)) {
+		if (do_hres(clock, ts) != VCLOCK_NONE)
+			return 0;
+	} else if (msk & VGTOD_COARSE) {
+		do_coarse(clock, ts);
+		return 0;
+	}
 	return vdso_fallback_gettime(clock, ts);
 }
+
 int clock_gettime(clockid_t, struct timespec *)
 	__attribute__((weak, alias("__vdso_clock_gettime")));
 




* [patch 09/11] x86/vdso: Simplify the invalid vclock case
  2018-09-14 12:50 [patch 00/11] x86/vdso: Cleanups, simplifications and CLOCK_TAI support Thomas Gleixner
                   ` (7 preceding siblings ...)
  2018-09-14 12:50 ` [patch 08/11] x86/vdso: Replace the clockid switch case Thomas Gleixner
@ 2018-09-14 12:50 ` Thomas Gleixner
  2018-09-17 19:25   ` Andy Lutomirski
  2018-09-14 12:50 ` [patch 10/11] x86/vdso: Move cycle_last handling into the caller Thomas Gleixner
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 81+ messages in thread
From: Thomas Gleixner @ 2018-09-14 12:50 UTC (permalink / raw)
  To: LKML
  Cc: Andy Lutomirski, x86, Peter Zijlstra, Matt Rickard, Stephen Boyd,
	John Stultz, Florian Weimer, K. Y. Srinivasan, Vitaly Kuznetsov,
	devel, virtualization, Paolo Bonzini, Arnd Bergmann,
	Juergen Gross

The code flow for the vclocks is convoluted: the vclocks, which can be
invalidated separately from the vsyscall_gtod_data sequence, have to store
that fact in a separate variable. That's inefficient.

Restructure the code so the vclock readout returns cycles and the
conversion to nanoseconds is handled at the call site.

If the clock gets invalidated or the vclock is already VCLOCK_NONE, return
U64_MAX as the cycle value, which is invalid for all clocks, and in that
case leave the sequence loop immediately by calling the fallback function
directly.

This allows removing the gettimeofday fallback as it now uses the
clock_gettime() fallback and does the nanoseconds to microseconds
conversion in the same way as when the vclock is functional. It does not
make a difference whether the division by 1000 happens in the kernel
fallback or in userspace.

This generates way better code and gains a few cycles back.
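
A standalone illustration of the invalid-cycles convention: U64_MAX
reinterpreted as s64 is -1, so a single signed comparison catches the
marker (sketch, not patch code):

#include <assert.h>
#include <stdint.h>

int main(void)
{
	uint64_t cycles = UINT64_MAX;	/* the 'invalid' marker */

	/* Valid TSC style readings have bit 63 clear for the foreseeable
	 * future, so only the marker makes this true. */
	assert((int64_t)cycles < 0);
	return 0;
}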

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/entry/vdso/vclock_gettime.c |   81 +++++++++--------------------------
 1 file changed, 21 insertions(+), 60 deletions(-)

--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -48,16 +48,6 @@ notrace static long vdso_fallback_gettim
 	return ret;
 }
 
-notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
-{
-	long ret;
-
-	asm("syscall" : "=a" (ret) :
-	    "0" (__NR_gettimeofday), "D" (tv), "S" (tz) : "memory");
-	return ret;
-}
-
-
 #else
 
 notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
@@ -75,21 +65,6 @@ notrace static long vdso_fallback_gettim
 	return ret;
 }
 
-notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
-{
-	long ret;
-
-	asm(
-		"mov %%ebx, %%edx \n"
-		"mov %2, %%ebx \n"
-		"call __kernel_vsyscall \n"
-		"mov %%edx, %%ebx \n"
-		: "=a" (ret)
-		: "0" (__NR_gettimeofday), "g" (tv), "c" (tz)
-		: "memory", "edx");
-	return ret;
-}
-
 #endif
 
 #ifdef CONFIG_PARAVIRT_CLOCK
@@ -98,7 +73,7 @@ static notrace const struct pvclock_vsys
 	return (const struct pvclock_vsyscall_time_info *)&pvclock_page;
 }
 
-static notrace u64 vread_pvclock(int *mode)
+static notrace u64 vread_pvclock(void)
 {
 	const struct pvclock_vcpu_time_info *pvti = &get_pvti0()->pvti;
 	u64 ret;
@@ -130,10 +105,8 @@ static notrace u64 vread_pvclock(int *mo
 	do {
 		version = pvclock_read_begin(pvti);
 
-		if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
-			*mode = VCLOCK_NONE;
-			return 0;
-		}
+		if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT)))
+			return U64_MAX;
 
 		ret = __pvclock_read_cycles(pvti, rdtsc_ordered());
 	} while (pvclock_read_retry(pvti, version));
@@ -148,17 +121,12 @@ static notrace u64 vread_pvclock(int *mo
 }
 #endif
 #ifdef CONFIG_HYPERV_TSCPAGE
-static notrace u64 vread_hvclock(int *mode)
+static notrace u64 vread_hvclock(void)
 {
 	const struct ms_hyperv_tsc_page *tsc_pg =
 		(const struct ms_hyperv_tsc_page *)&hvclock_page;
-	u64 current_tick = hv_read_tsc_page(tsc_pg);
-
-	if (current_tick != U64_MAX)
-		return current_tick;
 
-	*mode = VCLOCK_NONE;
-	return 0;
+	return hv_read_tsc_page(tsc_pg);
 }
 #endif
 
@@ -182,47 +150,42 @@ notrace static u64 vread_tsc(void)
 	return last;
 }
 
-notrace static inline u64 vgetsns(int *mode)
+notrace static inline u64 vgetcyc(int mode)
 {
-	u64 v;
-	cycles_t cycles;
-
-	if (gtod->vclock_mode == VCLOCK_TSC)
-		cycles = vread_tsc();
+	if (mode == VCLOCK_TSC)
+		return vread_tsc();
 #ifdef CONFIG_PARAVIRT_CLOCK
-	else if (gtod->vclock_mode == VCLOCK_PVCLOCK)
-		cycles = vread_pvclock(mode);
+	else if (mode == VCLOCK_PVCLOCK)
+		return vread_pvclock();
 #endif
 #ifdef CONFIG_HYPERV_TSCPAGE
-	else if (gtod->vclock_mode == VCLOCK_HVCLOCK)
-		cycles = vread_hvclock(mode);
+	else if (mode == VCLOCK_HVCLOCK)
+		return vread_hvclock();
 #endif
-	else
-		return 0;
-	v = cycles - gtod->cycle_last;
-	return v * gtod->mult;
+	return U64_MAX;
 }
 
 notrace static int do_hres(clockid_t clk, struct timespec *ts)
 {
 	struct vgtod_ts *base = &gtod->basetime[clk];
 	unsigned int seq;
-	int mode;
-	u64 ns;
+	u64 cycles, ns;
 
 	do {
 		seq = gtod_read_begin(gtod);
-		mode = gtod->vclock_mode;
 		ts->tv_sec = base->sec;
 		ns = base->nsec;
-		ns += vgetsns(&mode);
+		cycles = vgetcyc(gtod->vclock_mode);
+		if (unlikely((s64)cycles < 0))
+			return vdso_fallback_gettime(clk, ts);
+		ns += (cycles - gtod->cycle_last) * gtod->mult;
 		ns >>= gtod->shift;
 	} while (unlikely(gtod_read_retry(gtod, seq)));
 
 	ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
 	ts->tv_nsec = ns;
 
-	return mode;
+	return 0;
 }
 
 notrace static void do_coarse(clockid_t clk, struct timespec *ts)
@@ -251,8 +214,7 @@ notrace int __vdso_clock_gettime(clockid
 	 */
 	msk = 1U << clock;
 	if (likely(msk & VGTOD_HRES)) {
-		if (do_hres(clock, ts) != VCLOCK_NONE)
-			return 0;
+		return do_hres(clock, ts);
 	} else if (msk & VGTOD_COARSE) {
 		do_coarse(clock, ts);
 		return 0;
@@ -268,8 +230,7 @@ notrace int __vdso_gettimeofday(struct t
 	if (likely(tv != NULL)) {
 		struct timespec *ts = (struct timespec *) tv;
 
-		if (unlikely(do_hres(CLOCK_REALTIME, ts) == VCLOCK_NONE))
-			return vdso_fallback_gtod(tv, tz);
+		do_hres(CLOCK_REALTIME, ts);
 		tv->tv_usec /= 1000;
 	}
 	if (unlikely(tz != NULL)) {




* [patch 10/11] x86/vdso: Move cycle_last handling into the caller
  2018-09-14 12:50 [patch 00/11] x86/vdso: Cleanups, simplifications and CLOCK_TAI support Thomas Gleixner
                   ` (8 preceding siblings ...)
  2018-09-14 12:50 ` [patch 09/11] x86/vdso: Simplify the invalid vclock case Thomas Gleixner
@ 2018-09-14 12:50 ` Thomas Gleixner
  2018-09-14 15:26   ` Vitaly Kuznetsov
  2018-09-14 12:50 ` [patch 11/11] x86/vdso: Add CLOCK_TAI support Thomas Gleixner
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 81+ messages in thread
From: Thomas Gleixner @ 2018-09-14 12:50 UTC (permalink / raw)
  To: LKML
  Cc: Andy Lutomirski, x86, Peter Zijlstra, Matt Rickard, Stephen Boyd,
	John Stultz, Florian Weimer, K. Y. Srinivasan, Vitaly Kuznetsov,
	devel, virtualization, Paolo Bonzini, Arnd Bergmann,
	Juergen Gross

Dereferencing gtod->cycle_last all over the place and doing the cycles <
last comparison in the vclock read functions generates horrible code. Doing
it at the call site is much better and gains a few cycles both for TSC and
pvclock.

Caveat: This adds the comparison to the hyperv vclock as well, but I have
no way to test that.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/entry/vdso/vclock_gettime.c |   39 ++++++-----------------------------
 1 file changed, 7 insertions(+), 32 deletions(-)

--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -76,9 +76,8 @@ static notrace const struct pvclock_vsys
 static notrace u64 vread_pvclock(void)
 {
 	const struct pvclock_vcpu_time_info *pvti = &get_pvti0()->pvti;
-	u64 ret;
-	u64 last;
 	u32 version;
+	u64 ret;
 
 	/*
 	 * Note: The kernel and hypervisor must guarantee that cpu ID
@@ -111,13 +110,7 @@ static notrace u64 vread_pvclock(void)
 		ret = __pvclock_read_cycles(pvti, rdtsc_ordered());
 	} while (pvclock_read_retry(pvti, version));
 
-	/* refer to vread_tsc() comment for rationale */
-	last = gtod->cycle_last;
-
-	if (likely(ret >= last))
-		return ret;
-
-	return last;
+	return ret;
 }
 #endif
 #ifdef CONFIG_HYPERV_TSCPAGE
@@ -130,30 +123,10 @@ static notrace u64 vread_hvclock(void)
 }
 #endif
 
-notrace static u64 vread_tsc(void)
-{
-	u64 ret = (u64)rdtsc_ordered();
-	u64 last = gtod->cycle_last;
-
-	if (likely(ret >= last))
-		return ret;
-
-	/*
-	 * GCC likes to generate cmov here, but this branch is extremely
-	 * predictable (it's just a function of time and the likely is
-	 * very likely) and there's a data dependence, so force GCC
-	 * to generate a branch instead.  I don't barrier() because
-	 * we don't actually need a barrier, and if this function
-	 * ever gets inlined it will generate worse code.
-	 */
-	asm volatile ("");
-	return last;
-}
-
 notrace static inline u64 vgetcyc(int mode)
 {
 	if (mode == VCLOCK_TSC)
-		return vread_tsc();
+		return (u64)rdtsc_ordered();
 #ifdef CONFIG_PARAVIRT_CLOCK
 	else if (mode == VCLOCK_PVCLOCK)
 		return vread_pvclock();
@@ -168,17 +141,19 @@ notrace static inline u64 vgetcyc(int mo
 notrace static int do_hres(clockid_t clk, struct timespec *ts)
 {
 	struct vgtod_ts *base = &gtod->basetime[clk];
+	u64 cycles, last, ns;
 	unsigned int seq;
-	u64 cycles, ns;
 
 	do {
 		seq = gtod_read_begin(gtod);
 		ts->tv_sec = base->sec;
 		ns = base->nsec;
+		last = gtod->cycle_last;
 		cycles = vgetcyc(gtod->vclock_mode);
 		if (unlikely((s64)cycles < 0))
 			return vdso_fallback_gettime(clk, ts);
-		ns += (cycles - gtod->cycle_last) * gtod->mult;
+		if (cycles > last)
+			ns += (cycles - last) * gtod->mult;
 		ns >>= gtod->shift;
 	} while (unlikely(gtod_read_retry(gtod, seq)));
 




* [patch 11/11] x86/vdso: Add CLOCK_TAI support
  2018-09-14 12:50 [patch 00/11] x86/vdso: Cleanups, simplifications and CLOCK_TAI support Thomas Gleixner
                   ` (9 preceding siblings ...)
  2018-09-14 12:50 ` [patch 10/11] x86/vdso: Move cycle_last handling into the caller Thomas Gleixner
@ 2018-09-14 12:50 ` Thomas Gleixner
  2018-09-14 14:04   ` Andy Lutomirski
  2018-09-14 12:56 ` [patch 00/11] x86/vdso: Cleanups, simplifications and " Florian Weimer
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 81+ messages in thread
From: Thomas Gleixner @ 2018-09-14 12:50 UTC (permalink / raw)
  To: LKML
  Cc: Andy Lutomirski, x86, Peter Zijlstra, Matt Rickard, Stephen Boyd,
	John Stultz, Florian Weimer, K. Y. Srinivasan, Vitaly Kuznetsov,
	devel, virtualization, Paolo Bonzini, Arnd Bergmann,
	Juergen Gross

With the storage array in place it's now trivial to support CLOCK_TAI in
the vdso. Instead of extending the array to accommodate CLOCK_TAI, make use
of the fact that:

 - CLOCK ids are set in stone
 - CLOCK_THREAD_CPUTIME_ID is never going to be supported in the VDSO so
   the array slot 3 is unused
 - CLOCK_TAI is id 11 which results in 3 when masked with 0x3

Add the mask to the basetime array lookup and set up the CLOCK_TAI base
time in update_vsyscall().

The performance impact of the mask operation is within the noise.
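
The index arithmetic spelled out (a sketch; the clock id values are fixed
by the uapi headers):

#include <assert.h>

#define VGTOD_HRES_MASK	0x3

int main(void)
{
	/* CLOCK_REALTIME = 0, CLOCK_MONOTONIC = 1, CLOCK_TAI = 11 */
	assert((0  & VGTOD_HRES_MASK) == 0);	/* REALTIME keeps slot 0 */
	assert((1  & VGTOD_HRES_MASK) == 1);	/* MONOTONIC keeps slot 1 */
	assert((11 & VGTOD_HRES_MASK) == 3);	/* TAI: 0b1011 & 0b0011 == 3 */
	return 0;
}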

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/entry/vdso/vclock_gettime.c    |    2 +-
 arch/x86/entry/vsyscall/vsyscall_gtod.c |    4 ++++
 arch/x86/include/asm/vgtod.h            |    6 +++++-
 3 files changed, 10 insertions(+), 2 deletions(-)

--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -140,7 +140,7 @@ notrace static inline u64 vgetcyc(int mo
 
 notrace static int do_hres(clockid_t clk, struct timespec *ts)
 {
-	struct vgtod_ts *base = &gtod->basetime[clk];
+	struct vgtod_ts *base = &gtod->basetime[clk & VGTOD_HRES_MASK];
 	u64 cycles, last, ns;
 	unsigned int seq;
 
--- a/arch/x86/entry/vsyscall/vsyscall_gtod.c
+++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c
@@ -51,6 +51,10 @@ void update_vsyscall(struct timekeeper *
 	base->sec = tk->xtime_sec;
 	base->nsec = tk->tkr_mono.xtime_nsec;
 
+	base = &vdata->basetime[VGTOD_TAI];
+	base->sec = tk->xtime_sec + (s64)tk->tai_offset;
+	base->nsec = tk->tkr_mono.xtime_nsec;
+
 	base = &vdata->basetime[CLOCK_MONOTONIC];
 	base->sec = tk->xtime_sec + tk->wall_to_monotonic.tv_sec;
 	nsec = tk->tkr_mono.xtime_nsec;
--- a/arch/x86/include/asm/vgtod.h
+++ b/arch/x86/include/asm/vgtod.h
@@ -19,9 +19,13 @@ struct vgtod_ts {
 };
 
 #define VGTOD_BASES	(CLOCK_MONOTONIC_COARSE + 1)
-#define VGTOD_HRES	(BIT(CLOCK_REALTIME) | BIT(CLOCK_MONOTONIC))
+#define VGTOD_HRES	(BIT(CLOCK_REALTIME) | BIT(CLOCK_MONOTONIC) | BIT(CLOCK_TAI))
 #define VGTOD_COARSE	(BIT(CLOCK_REALTIME_COARSE) | BIT(CLOCK_MONOTONIC_COARSE))
 
+/* Abuse CLOCK_THREAD_CPUTIME_ID for VGTOD CLOCK TAI */
+#define VGTOD_HRES_MASK	0x3
+#define VGTOD_TAI	(CLOCK_TAI & VGTOD_HRES_MASK)
+
 /*
  * vsyscall_gtod_data will be accessed by 32 and 64 bit code at the same time
  * so be carefull by modifying this structure.




* Re: [patch 00/11] x86/vdso: Cleanups, simplifications and CLOCK_TAI support
  2018-09-14 12:50 [patch 00/11] x86/vdso: Cleanups, simplifications and CLOCK_TAI support Thomas Gleixner
                   ` (10 preceding siblings ...)
  2018-09-14 12:50 ` [patch 11/11] x86/vdso: Add CLOCK_TAI support Thomas Gleixner
@ 2018-09-14 12:56 ` Florian Weimer
  2018-09-14 13:05   ` Thomas Gleixner
  2018-09-14 13:09   ` Peter Zijlstra
  2018-09-14 14:22 ` Arnd Bergmann
                   ` (2 subsequent siblings)
  14 siblings, 2 replies; 81+ messages in thread
From: Florian Weimer @ 2018-09-14 12:56 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Andy Lutomirski, x86, Peter Zijlstra, Matt Rickard, Stephen Boyd,
	John Stultz, K. Y. Srinivasan, Vitaly Kuznetsov, devel,
	virtualization, Paolo Bonzini, Arnd Bergmann, Juergen Gross

On 09/14/2018 02:50 PM, Thomas Gleixner wrote:
> Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime()
> implementation, which extended the clockid switch case and added yet
> another slightly different copy of the same code.
> 
> The extended switch case is especially problematic as the compiler tends to
> generate a jump table, which then requires the use of retpolines.

Does vDSO code really have to use retpolines?  It's in userspace, after all.

Thanks,
Florian


* Re: [patch 00/11] x86/vdso: Cleanups, simplifications and CLOCK_TAI support
  2018-09-14 12:56 ` [patch 00/11] x86/vdso: Cleanups, simplifications and " Florian Weimer
@ 2018-09-14 13:05   ` Thomas Gleixner
  2018-09-14 13:06     ` Florian Weimer
  2018-09-14 13:09   ` Peter Zijlstra
  1 sibling, 1 reply; 81+ messages in thread
From: Thomas Gleixner @ 2018-09-14 13:05 UTC (permalink / raw)
  To: Florian Weimer
  Cc: LKML, Andy Lutomirski, x86, Peter Zijlstra, Matt Rickard,
	Stephen Boyd, John Stultz, K. Y. Srinivasan, Vitaly Kuznetsov,
	devel, virtualization, Paolo Bonzini, Arnd Bergmann,
	Juergen Gross

On Fri, 14 Sep 2018, Florian Weimer wrote:

> On 09/14/2018 02:50 PM, Thomas Gleixner wrote:
> > Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime()
> > implementation, which extended the clockid switch case and added yet
> > another slightly different copy of the same code.
> > 
> > The extended switch case is especially problematic as the compiler tends to
> > generate a jump table, which then requires the use of retpolines.
> 
> Does vDSO code really have to use retpolines?  It's in userspace, after all.

Unless you have IBRS/STIBP enabled, you need user space ratpoutine as well.

Thanks,

	tglx


* Re: [patch 00/11] x86/vdso: Cleanups, simplifications and CLOCK_TAI support
  2018-09-14 13:05   ` Thomas Gleixner
@ 2018-09-14 13:06     ` Florian Weimer
  2018-09-14 13:19       ` Thomas Gleixner
  0 siblings, 1 reply; 81+ messages in thread
From: Florian Weimer @ 2018-09-14 13:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Andy Lutomirski, x86, Peter Zijlstra, Matt Rickard,
	Stephen Boyd, John Stultz, K. Y. Srinivasan, Vitaly Kuznetsov,
	devel, virtualization, Paolo Bonzini, Arnd Bergmann,
	Juergen Gross

On 09/14/2018 03:05 PM, Thomas Gleixner wrote:
> On Fri, 14 Sep 2018, Florian Weimer wrote:
> 
>> On 09/14/2018 02:50 PM, Thomas Gleixner wrote:
>>> Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime()
>>> implementation, which extended the clockid switch case and added yet
>>> another slightly different copy of the same code.
>>>
>>> The extended switch case is especially problematic as the compiler tends to
>>> generate a jump table, which then requires the use of retpolines.
>>
>> Does vDSO code really have to use retpolines?  It's in userspace, after all.
> 
> Unless you have IBRS/STIBP enabled, you need user space ratpoutine as well.

I don't think this is a consensus position, and it obviously depends on 
the (sub)architecture.

Thanks,
Florian



* Re: [patch 00/11] x86/vdso: Cleanups, simplifications and CLOCK_TAI support
  2018-09-14 12:56 ` [patch 00/11] x86/vdso: Cleanups, simplifications and " Florian Weimer
  2018-09-14 13:05   ` Thomas Gleixner
@ 2018-09-14 13:09   ` Peter Zijlstra
  1 sibling, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2018-09-14 13:09 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Thomas Gleixner, LKML, Andy Lutomirski, x86, Matt Rickard,
	Stephen Boyd, John Stultz, K. Y. Srinivasan, Vitaly Kuznetsov,
	devel, virtualization, Paolo Bonzini, Arnd Bergmann,
	Juergen Gross

On Fri, Sep 14, 2018 at 02:56:46PM +0200, Florian Weimer wrote:
> On 09/14/2018 02:50 PM, Thomas Gleixner wrote:
> > Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime()
> > implementation, which extended the clockid switch case and added yet
> > another slightly different copy of the same code.
> > 
> > The extended switch case is especially problematic as the compiler tends to
> > generate a jump table, which then requires the use of retpolines.
> 
> Does vDSO code really have to use retpolines?  It's in userspace, after all.

Userspace is equally susceptible to spectre-v2. Ideally we'd recompile
world with retpoline, but given the amount of inline asm in say things
like openssl and similar projects, validating that there are indeed no
indirect calls/jumps left is nontrivial.

There are currently pending patches to otherwise address user-user
spectre-v2 attacks.



* Re: [patch 00/11] x86/vdso: Cleanups, simplifications and CLOCK_TAI support
  2018-09-14 13:06     ` Florian Weimer
@ 2018-09-14 13:19       ` Thomas Gleixner
  0 siblings, 0 replies; 81+ messages in thread
From: Thomas Gleixner @ 2018-09-14 13:19 UTC (permalink / raw)
  To: Florian Weimer
  Cc: LKML, Andy Lutomirski, x86, Peter Zijlstra, Matt Rickard,
	Stephen Boyd, John Stultz, K. Y. Srinivasan, Vitaly Kuznetsov,
	devel, virtualization, Paolo Bonzini, Arnd Bergmann,
	Juergen Gross

On Fri, 14 Sep 2018, Florian Weimer wrote:

> On 09/14/2018 03:05 PM, Thomas Gleixner wrote:
> > On Fri, 14 Sep 2018, Florian Weimer wrote:
> > 
> > > On 09/14/2018 02:50 PM, Thomas Gleixner wrote:
> > > > Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime()
> > > > implementation, which extended the clockid switch case and added yet
> > > > another slightly different copy of the same code.
> > > > 
> > > > The extended switch case is especially problematic as the compiler
> > > > tends to generate a jump table, which then requires the use of
> > > > retpolines.
> > > 
> > > Does vDSO code really have to use retpolines?  It's in userspace, after
> > > all.
> > 
> > Unless you have IBRS/STIBP enabled, you need user space ratpoutine as well.
> 
> I don't think this is a consensus position, and it obviously depends on the
> (sub)architecture.

It does, but we are not building kernels for specific micro architectures
nor do distros AFAIK.

But that aside, even with jump tables the generated code (ratpoutine
disabled) is not any better than with the current approach. It's even
worse: it needs the same optimizations at the end, and then the main
difference is that you have 3 copies of one function and 2 of the other.
Not a win at all.

Thanks,

	tglx




* Re: [patch 11/11] x86/vdso: Add CLOCK_TAI support
  2018-09-14 12:50 ` [patch 11/11] x86/vdso: Add CLOCK_TAI support Thomas Gleixner
@ 2018-09-14 14:04   ` Andy Lutomirski
  2018-09-14 14:27     ` Thomas Gleixner
  0 siblings, 1 reply; 81+ messages in thread
From: Andy Lutomirski @ 2018-09-14 14:04 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Andy Lutomirski, x86, Peter Zijlstra, Matt Rickard,
	Stephen Boyd, John Stultz, Florian Weimer, K. Y. Srinivasan,
	Vitaly Kuznetsov, devel, virtualization, Paolo Bonzini,
	Arnd Bergmann, Juergen Gross



> On Sep 14, 2018, at 5:50 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> With the storage array in place it's now trivial to support CLOCK_TAI in
> the vdso. Instead of extending the array to accommodate CLOCK_TAI, make use
> of the fact that:
> 
> - CLOCK ids are set in stone
> - CLOCK_THREAD_CPUTIME_ID is never going to be supported in the VDSO so
>   the array slot 3 is unused
> - CLOCK_TAI is id 11 which results in 3 when masked with 0x3
> 
> Add the mask to the basetime array lookup and set up the CLOCK_TAI base
> time in update_vsyscall().

That’s... horrible. In an amazing way. Can you add BUILD_BUG_ON somewhere to assert that this actually works?

> 
> The performance impact of the mask operation is within the noise.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
> arch/x86/entry/vdso/vclock_gettime.c    |    2 +-
> arch/x86/entry/vsyscall/vsyscall_gtod.c |    4 ++++
> arch/x86/include/asm/vgtod.h            |    6 +++++-
> 3 files changed, 10 insertions(+), 2 deletions(-)
> 
> --- a/arch/x86/entry/vdso/vclock_gettime.c
> +++ b/arch/x86/entry/vdso/vclock_gettime.c
> @@ -140,7 +140,7 @@ notrace static inline u64 vgetcyc(int mo
> 
> notrace static int do_hres(clockid_t clk, struct timespec *ts)
> {
> -    struct vgtod_ts *base = &gtod->basetime[clk];
> +    struct vgtod_ts *base = &gtod->basetime[clk & VGTOD_HRES_MASK];
>    u64 cycles, last, ns;
>    unsigned int seq;
> 
> --- a/arch/x86/entry/vsyscall/vsyscall_gtod.c
> +++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c
> @@ -51,6 +51,10 @@ void update_vsyscall(struct timekeeper *
>    base->sec = tk->xtime_sec;
>    base->nsec = tk->tkr_mono.xtime_nsec;
> 
> +    base = &vdata->basetime[VGTOD_TAI];
> +    base->sec = tk->xtime_sec + (s64)tk->tai_offset;
> +    base->nsec = tk->tkr_mono.xtime_nsec;
> +
>    base = &vdata->basetime[CLOCK_MONOTONIC];
>    base->sec = tk->xtime_sec + tk->wall_to_monotonic.tv_sec;
>    nsec = tk->tkr_mono.xtime_nsec;
> --- a/arch/x86/include/asm/vgtod.h
> +++ b/arch/x86/include/asm/vgtod.h
> @@ -19,9 +19,13 @@ struct vgtod_ts {
> };
> 
> #define VGTOD_BASES    (CLOCK_MONOTONIC_COARSE + 1)
> -#define VGTOD_HRES    (BIT(CLOCK_REALTIME) | BIT(CLOCK_MONOTONIC))
> +#define VGTOD_HRES    (BIT(CLOCK_REALTIME) | BIT(CLOCK_MONOTONIC) | BIT(CLOCK_TAI))
> #define VGTOD_COARSE    (BIT(CLOCK_REALTIME_COARSE) | BIT(CLOCK_MONOTONIC_COARSE))
> 
> +/* Abuse CLOCK_THREAD_CPUTIME_ID for VGTOD CLOCK TAI */
> +#define VGTOD_HRES_MASK    0x3
> +#define VGTOD_TAI    (CLOCK_TAI & VGTOD_HRES_MASK)
> +
> /*
>  * vsyscall_gtod_data will be accessed by 32 and 64 bit code at the same time
>  * so be carefull by modifying this structure.
> 
> 


* Re: [patch 00/11] x86/vdso: Cleanups, simplifications and CLOCK_TAI support
  2018-09-14 12:50 [patch 00/11] x86/vdso: Cleanups, simplifications and CLOCK_TAI support Thomas Gleixner
                   ` (11 preceding siblings ...)
  2018-09-14 12:56 ` [patch 00/11] x86/vdso: Cleanups, simplifications and " Florian Weimer
@ 2018-09-14 14:22 ` Arnd Bergmann
  2018-09-17 13:00   ` Thomas Gleixner
  2018-10-03  5:15 ` Andy Lutomirski
  2018-10-04 20:32 ` Andy Lutomirski
  14 siblings, 1 reply; 81+ messages in thread
From: Arnd Bergmann @ 2018-09-14 14:22 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Linux Kernel Mailing List, Andy Lutomirski,
	the arch/x86 maintainers, Peter Zijlstra, matt, Stephen Boyd,
	John Stultz, Florian Weimer, K. Y. Srinivasan, vkuznets, devel,
	virtualization, Paolo Bonzini, Juergen Gross, Deepa Dinamani

On Fri, Sep 14, 2018 at 2:52 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime()
> implementation, which extended the clockid switch case and added yet
> another slightly different copy of the same code.
>
> The extended switch case is especially problematic as the compiler tends to
> generate a jump table, which then requires the use of retpolines. If jump
> tables are disabled, it adds yet another conditional to the existing maze.
>
> This series takes a different approach: the almost identical functions are
> consolidated into one implementation for high resolution clocks and one for
> the coarse grained clock ids, and the base data for each clock id is stored
> in an array indexed by the clock id.
>
> This completely eliminates the switch case and allows further
> simplifications of the code base, which all together gain a few cycles of
> performance or at least stay on par with today's code. The resulting
> performance depends heavily on the micro architecture and the compiler.

No objections from my side, just a few general remarks: Deepa
and I have discussed the vdso in the past for 2038. I have started
a patch that I'll have to redo on top of yours, but that is fine, your
cleanup is only going to help here.

More generally speaking, Deepa said that she would like to see
some generalization on the vdso before adding the time64_t
based variants. Your series probably makes x86 diverge more
from the others, which makes it harder to consolidate them again,
but we might just as well use your new implementation to base the
generic one on, and just move the other ones over to that.

A couple of architectures (s390, ia64, riscv, powerpc, arm64)
implement the vdso as assembler code at the moment, so they
won't be as easy to consolidate (other than outright replacing all
the code).

The other five:
arch/x86/entry/vdso/vclock_gettime.c
arch/sparc/vdso/vclock_gettime.c
arch/nds32/kernel/vdso/gettimeofday.c
arch/mips/vdso/gettimeofday.c
arch/arm/vdso/vgettimeofday.c

are basically all minor variations of the same code base and could be
consolidated to some degree.
Any suggestions here? Should we plan to do that consolidation based on
your new version, or just add clock_gettime64 in arm32 and x86-32, and then
be done with it? The other ones will obviously still be fast for 32-bit time_t
and will have a working non-vdso sys_clock_gettime64().

I also wonder about clock_getres(): half the architectures seem to implement
it in vdso, but notably arm32 and x86 don't, and I had not expected it to be
performance critical given that the result is easily cached in user space.

         Arnd

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 11/11] x86/vdso: Add CLOCK_TAI support
  2018-09-14 14:04   ` Andy Lutomirski
@ 2018-09-14 14:27     ` Thomas Gleixner
  2018-09-14 14:59       ` Andy Lutomirski
  0 siblings, 1 reply; 81+ messages in thread
From: Thomas Gleixner @ 2018-09-14 14:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, Andy Lutomirski, x86, Peter Zijlstra, Matt Rickard,
	Stephen Boyd, John Stultz, Florian Weimer, K. Y. Srinivasan,
	Vitaly Kuznetsov, devel, virtualization, Paolo Bonzini,
	Arnd Bergmann, Juergen Gross

[-- Attachment #1: Type: text/plain, Size: 836 bytes --]

On Fri, 14 Sep 2018, Andy Lutomirski wrote:
> > On Sep 14, 2018, at 5:50 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > 
> > With the storage array in place it's now trivial to support CLOCK_TAI in
> > the vdso. Instead of extending the array to accommodate CLOCK_TAI, make use
> > of the fact that:
> > 
> > - CLOCK ids are set in stone
> > - CLOCK_THREAD_CPUTIME is never going to be supported in the VDSO so
> >   the array slot 3 is unused
> > - CLOCK_TAI is id 11 which results in 3 when masked with 0x3
> > 
> > Add the mask to the basetime array lookup and set up the CLOCK_TAI base
> > time in update_vsyscall().
> 
> That’s... horrible. In an amazing way. Can you add BUILD_BUG_ON somewhere
> to assert that this actually works?

Sure, but changing any of the clock ids will cause more wreckage than that.
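
A compile-time assert along those lines can be as simple as this sketch,
using the defines from patch 11:

	/* The masking trick relies on CLOCK_TAI landing in the unused
	 * CLOCK_THREAD_CPUTIME_ID slot. Fail the build if that changes.
	 */
	BUILD_BUG_ON((CLOCK_TAI & VGTOD_HRES_MASK) != CLOCK_THREAD_CPUTIME_ID);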

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 11/11] x86/vdso: Add CLOCK_TAI support
  2018-09-14 14:27     ` Thomas Gleixner
@ 2018-09-14 14:59       ` Andy Lutomirski
  2018-09-16  9:39         ` Thomas Gleixner
  0 siblings, 1 reply; 81+ messages in thread
From: Andy Lutomirski @ 2018-09-14 14:59 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Andy Lutomirski, x86, Peter Zijlstra, Matt Rickard,
	Stephen Boyd, John Stultz, Florian Weimer, K. Y. Srinivasan,
	Vitaly Kuznetsov, devel, virtualization, Paolo Bonzini,
	Arnd Bergmann, Juergen Gross



> On Sep 14, 2018, at 7:27 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Fri, 14 Sep 2018, Andy Lutomirski wrote:
>>> On Sep 14, 2018, at 5:50 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> 
>>> With the storage array in place it's now trivial to support CLOCK_TAI in
>>> the vdso. Instead of extending the array to accommodate CLOCK_TAI, make use
>>> of the fact that:
>>> 
>>> - CLOCK ids are set in stone
>>> - CLOCK_THREAD_CPUTIME is never going to be supported in the VDSO so
>>>  the array slot 3 is unused
>>> - CLOCK_TAI is id 11 which results in 3 when masked with 0x3
>>> 
>>> Add the mask to the basetime array lookup and set up the CLOCK_TAI base
>>> time in update_vsyscall().
>> 
>> That’s... horrible. In an amazing way. Can you add BUILD_BUG_ON somewhere
>> to assert that this actually works?
> 
> Sure, but changing any of the clock ids will cause more wreckage than that.
> 

I’m more concerned that we add a new one and break the magic masking. Maybe two clocks would start sharing the same slot.

> Thanks,
> 
>    tglx

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 10/11] x86/vdso: Move cycle_last handling into the caller
  2018-09-14 12:50 ` [patch 10/11] x86/vdso: Move cycle_last handling into the caller Thomas Gleixner
@ 2018-09-14 15:26   ` Vitaly Kuznetsov
  0 siblings, 0 replies; 81+ messages in thread
From: Vitaly Kuznetsov @ 2018-09-14 15:26 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Andy Lutomirski, x86, Peter Zijlstra, Matt Rickard,
	Stephen Boyd, John Stultz, Florian Weimer, K. Y. Srinivasan,
	devel, virtualization, Paolo Bonzini, Arnd Bergmann,
	Juergen Gross

Thomas Gleixner <tglx@linutronix.de> writes:

> Dereferencing gtod->cycle_last all over the place and doing the cycles <
> last comparison in the vclock read functions generates horrible code. Doing
> it at the call site is much better and gains a few cycles both for TSC and
> pvclock.
>
> Caveat: This adds the comparison to the hyperv vclock as well, but I have
> no way to test that.

While the change looks obviously correct for the Hyper-V vclock too, by
saying that you encouraged me to give the whole series a shot. I did, and
it turns out the Hyper-V vclock is broken: a simple
clock_gettime(CLOCK_REALTIME, ...) test goes from 70 CPU cycles to 1000
(meaning that we're falling back to the syscall, I guess). Bisecting now,
stay tuned!
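
For reference, that kind of number comes from nothing fancier than a loop
like the following sketch (illustrative only, not the actual test):

	#include <stdio.h>
	#include <time.h>
	#include <x86intrin.h>

	int main(void)
	{
		struct timespec ts;
		const int iters = 1000000;
		unsigned long long start = __rdtsc();

		for (int i = 0; i < iters; i++)
			clock_gettime(CLOCK_REALTIME, &ts);

		printf("~%llu cycles per call\n",
		       (__rdtsc() - start) / iters);
		return 0;
	}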

-- 
  Vitaly

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 02/11] x86/time: Implement clocksource_arch_init()
  2018-09-14 12:50 ` [patch 02/11] x86/time: Implement clocksource_arch_init() Thomas Gleixner
@ 2018-09-14 15:45   ` Vitaly Kuznetsov
  2018-09-15  6:05     ` Thomas Gleixner
  0 siblings, 1 reply; 81+ messages in thread
From: Vitaly Kuznetsov @ 2018-09-14 15:45 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Andy Lutomirski, x86, Peter Zijlstra, Matt Rickard,
	Stephen Boyd, John Stultz, Florian Weimer, K. Y. Srinivasan,
	devel, virtualization, Paolo Bonzini, Arnd Bergmann,
	Juergen Gross

Thomas Gleixner <tglx@linutronix.de> writes:

> Runtime validate the VCLOCK_MODE in clocksource::archdata and disable
> VCLOCK if invalid, which disables the VDSO but keeps the system running.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  arch/x86/Kconfig       |    1 +
>  arch/x86/kernel/time.c |   16 ++++++++++++++++
>  2 files changed, 17 insertions(+)
>
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -48,6 +48,7 @@ config X86
>  	select ACPI_SYSTEM_POWER_STATES_SUPPORT	if ACPI
>  	select ANON_INODES
>  	select ARCH_CLOCKSOURCE_DATA
> +	select ARCH_CLOCKSOURCE_INIT
>  	select ARCH_DISCARD_MEMBLOCK
>  	select ARCH_HAS_ACPI_TABLE_UPGRADE	if ACPI
>  	select ARCH_HAS_DEBUG_VIRTUAL
> --- a/arch/x86/kernel/time.c
> +++ b/arch/x86/kernel/time.c
> @@ -10,6 +10,7 @@
>   *
>   */
>
> +#include <linux/clocksource.h>
>  #include <linux/clockchips.h>
>  #include <linux/interrupt.h>
>  #include <linux/irq.h>
> @@ -105,3 +106,18 @@ void __init time_init(void)
>  {
>  	late_time_init = x86_late_time_init;
>  }
> +
> +/*
> + * Sanity check the vdso related archdata content.
> + */
> +void clocksource_arch_init(struct clocksource *cs)
> +{
> +	if (cs->archdata.vclock_mode == VCLOCK_NONE)
> +		return;
> +
> +	if (cs->archdata.vclock_mode >= VCLOCK_MAX) {

It should be '>' as VCLOCK_MAX == VCLOCK_HVCLOCK == 3

> +		pr_warn("clocksource %s registered with invalid vclock_mode %d. Disabling vclock.\n",
> +			cs->name, cs->archdata.vclock_mode);
> +		cs->archdata.vclock_mode = VCLOCK_NONE;
> +	}
> +}

-- 
  Vitaly

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 02/11] x86/time: Implement clocksource_arch_init()
  2018-09-14 15:45   ` Vitaly Kuznetsov
@ 2018-09-15  6:05     ` Thomas Gleixner
  0 siblings, 0 replies; 81+ messages in thread
From: Thomas Gleixner @ 2018-09-15  6:05 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: LKML, Andy Lutomirski, x86, Peter Zijlstra, Matt Rickard,
	Stephen Boyd, John Stultz, Florian Weimer, K. Y. Srinivasan,
	devel, virtualization, Paolo Bonzini, Arnd Bergmann,
	Juergen Gross

On Fri, 14 Sep 2018, Vitaly Kuznetsov wrote:
> Thomas Gleixner <tglx@linutronix.de> writes:
> > +
> > +	if (cs->archdata.vclock_mode >= VCLOCK_MAX) {
> 
> It should be '>' as VCLOCK_MAX == VCLOCK_HVCLOCK == 3

Indeed.
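
The check then becomes (a sketch of the obvious fix):

	/* VCLOCK_MAX == VCLOCK_HVCLOCK == 3 is itself a valid mode, so
	 * only values strictly above it are bogus.
	 */
	if (cs->archdata.vclock_mode > VCLOCK_MAX) {
		pr_warn("clocksource %s registered with invalid vclock_mode %d. Disabling vclock.\n",
			cs->name, cs->archdata.vclock_mode);
		cs->archdata.vclock_mode = VCLOCK_NONE;
	}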

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 11/11] x86/vdso: Add CLOCK_TAI support
  2018-09-14 14:59       ` Andy Lutomirski
@ 2018-09-16  9:39         ` Thomas Gleixner
  0 siblings, 0 replies; 81+ messages in thread
From: Thomas Gleixner @ 2018-09-16  9:39 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, Andy Lutomirski, x86, Peter Zijlstra, Matt Rickard,
	Stephen Boyd, John Stultz, Florian Weimer, K. Y. Srinivasan,
	Vitaly Kuznetsov, devel, virtualization, Paolo Bonzini,
	Arnd Bergmann, Juergen Gross

[-- Attachment #1: Type: text/plain, Size: 627 bytes --]

On Fri, 14 Sep 2018, Andy Lutomirski wrote:
> > On Sep 14, 2018, at 7:27 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Fri, 14 Sep 2018, Andy Lutomirski wrote:
> >> That’s... horrible. In an amazing way. Can you add BUILD_BUG_ON somewhere
> >> to assert that this actually works?
> > 
> > Sure, but changing any of the clock ids will cause more wreckage than that.
> > 
> I’m more concerned that we add a new one and break the magic
> masking. Maybe two start sharing the same slot.

You are right. The obvious extension is CLOCK_BOOTTIME. That's id 7 which
indeed conflicts. I'll remove the magic.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 00/11] x86/vdso: Cleanups, simplifications and CLOCK_TAI support
  2018-09-14 14:22 ` Arnd Bergmann
@ 2018-09-17 13:00   ` Thomas Gleixner
  2018-09-24 21:08     ` Arnd Bergmann
  0 siblings, 1 reply; 81+ messages in thread
From: Thomas Gleixner @ 2018-09-17 13:00 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Linux Kernel Mailing List, Andy Lutomirski,
	the arch/x86 maintainers, Peter Zijlstra, matt, Stephen Boyd,
	John Stultz, Florian Weimer, K. Y. Srinivasan, vkuznets, devel,
	virtualization, Paolo Bonzini, Juergen Gross, Deepa Dinamani

On Fri, 14 Sep 2018, Arnd Bergmann wrote:
> On Fri, Sep 14, 2018 at 2:52 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> A couple of architectures (s390, ia64, riscv, powerpc, arm64)
> implement the vdso as assembler code at the moment, so they
> won't be as easy to consolidate (other than outright replacing all
> the code).
> 
> The other five:
> arch/x86/entry/vdso/vclock_gettime.c
> arch/sparc/vdso/vclock_gettime.c
> arch/nds32/kernel/vdso/gettimeofday.c
> arch/mips/vdso/gettimeofday.c
> arch/arm/vdso/vgettimeofday.c
>
> are basically all minor variations of the same code base and could be
> consolidated to some degree.
> Any suggestions here? Should we plan to do that consolitdation based on
> your new version, or just add clock_gettime64 in arm32 and x86-32, and then
> be done with it? The other ones will obviously still be fast for 32-bit time_t
> and will have a working non-vdso sys_clock_getttime64().

In principle, consolidating all those implementations should be possible
to some extent and is probably worthwhile. What's arch-specific is the
set of actual accessors to the hardware clocks.
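
To make that concrete, one possible shape is a single per-arch counter
hook plus a generic core. This is purely a sketch with made-up names;
fallback handling and the coarse clocks are left out:

	/* The only per-architecture part: the raw counter read. */
	u64 __arch_get_hw_counter(int clock_mode);

	/* Generic core, shared by all architectures. */
	static int do_hres(const struct vdso_data *vd, clockid_t clk,
			   struct timespec *ts)
	{
		const struct vdso_ts *base = &vd->basetime[clk];
		u64 cycles, ns;
		u32 seq;

		do {
			seq = read_seqcount_begin(&vd->seq);
			cycles = __arch_get_hw_counter(vd->clock_mode);
			ns = base->nsec;
			ns += (cycles - vd->cycle_last) * vd->mult;
			ns >>= vd->shift;
			ts->tv_sec = base->sec;
		} while (read_seqcount_retry(&vd->seq, seq));

		ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
		ts->tv_nsec = ns;
		return 0;
	}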

> I also wonder about clock_getres(): half the architectures seem to implement
> it in vdso, but notably arm32 and x86 don't, and I had not expected it to be
> performance critical given that the result is easily cached in user space.

getres() is not really performance critical, but adding it does not create
a huge problem either.

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case
  2018-09-14 12:50 ` [patch 09/11] x86/vdso: Simplify the invalid vclock case Thomas Gleixner
@ 2018-09-17 19:25   ` Andy Lutomirski
  2018-09-17 20:12     ` John Stultz
  0 siblings, 1 reply; 81+ messages in thread
From: Andy Lutomirski @ 2018-09-17 19:25 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Andy Lutomirski, X86 ML, Peter Zijlstra, Matt Rickard,
	Stephen Boyd, John Stultz, Florian Weimer, K. Y. Srinivasan,
	Vitaly Kuznetsov, devel, Linux Virtualization, Paolo Bonzini,
	Arnd Bergmann, Juergen Gross

On Fri, Sep 14, 2018 at 5:50 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> The code flow for the vclocks is convoluted, as the vclocks can be
> invalidated separately from the vsyscall_gtod_data sequence, which
> requires storing that fact in a separate variable. That's inefficient.
>

>  notrace static int do_hres(clockid_t clk, struct timespec *ts)
>  {
>         struct vgtod_ts *base = &gtod->basetime[clk];
>         unsigned int seq;
> -       int mode;
> -       u64 ns;
> +       u64 cycles, ns;
>
>         do {
>                 seq = gtod_read_begin(gtod);
> -               mode = gtod->vclock_mode;
>                 ts->tv_sec = base->sec;
>                 ns = base->nsec;
> -               ns += vgetsns(&mode);
> +               cycles = vgetcyc(gtod->vclock_mode);
> +               if (unlikely((s64)cycles < 0))
> +                       return vdso_fallback_gettime(clk, ts);

i was contemplating this, and I would suggest one of two optimizations:

1. have all the helpers return a struct containing a u64 and a bool,
and use that bool.  The compiler should do okay.

2. Be sneaky.  Later in the series, you do:

                if (unlikely((s64)cycles < 0))
                        return vdso_fallback_gettime(clk, ts);
-               ns += (cycles - gtod->cycle_last) * gtod->mult;
+               if (cycles > last)
+                       ns += (cycles - last) * gtod->mult;

How about:

if (unlikely((s64)cycles <= last)) {
  if (cycles < 0) [or cycles == -1 or whatever]
    return vdso_fallback_gettime;
} else {
  ns += (cycles - last) * gtod->mult;
}

which should actually make this whole mess be essentially free.
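
Spelled out, with the casts made explicit (a sketch; cycles is u64, so
the error test needs an (s64) cast, assuming the read helpers return
(u64)-1 on failure):

	if (unlikely((s64)cycles <= (s64)last)) {
		if ((s64)cycles < 0)
			return vdso_fallback_gettime(clk, ts);
		/* 0 <= cycles <= last: treat the tiny delta as zero */
	} else {
		ns += (cycles - last) * gtod->mult;
	}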

Also, I'm not entirely convinced that this "last" thing is needed at
all.  John, what's the scenario under which we need it?

--Andy

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case
  2018-09-17 19:25   ` Andy Lutomirski
@ 2018-09-17 20:12     ` John Stultz
  2018-09-18  7:52       ` Thomas Gleixner
  0 siblings, 1 reply; 81+ messages in thread
From: John Stultz @ 2018-09-17 20:12 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, LKML, X86 ML, Peter Zijlstra, Matt Rickard,
	Stephen Boyd, Florian Weimer, K. Y. Srinivasan, Vitaly Kuznetsov,
	devel, Linux Virtualization, Paolo Bonzini, Arnd Bergmann,
	Juergen Gross

On Mon, Sep 17, 2018 at 12:25 PM, Andy Lutomirski <luto@kernel.org> wrote:
> On Fri, Sep 14, 2018 at 5:50 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> The code flow for the vclocks is convoluted, as the vclocks can be
>> invalidated separately from the vsyscall_gtod_data sequence, which
>> requires storing that fact in a separate variable. That's inefficient.
>>
>
>>  notrace static int do_hres(clockid_t clk, struct timespec *ts)
>>  {
>>         struct vgtod_ts *base = &gtod->basetime[clk];
>>         unsigned int seq;
>> -       int mode;
>> -       u64 ns;
>> +       u64 cycles, ns;
>>
>>         do {
>>                 seq = gtod_read_begin(gtod);
>> -               mode = gtod->vclock_mode;
>>                 ts->tv_sec = base->sec;
>>                 ns = base->nsec;
>> -               ns += vgetsns(&mode);
>> +               cycles = vgetcyc(gtod->vclock_mode);
>> +               if (unlikely((s64)cycles < 0))
>> +                       return vdso_fallback_gettime(clk, ts);
>
> i was contemplating this, and I would suggest one of two optimizations:
>
> 1. have all the helpers return a struct containing a u64 and a bool,
> and use that bool.  The compiler should do okay.
>
> 2. Be sneaky.  Later in the series, you do:
>
>                 if (unlikely((s64)cycles < 0))
>                         return vdso_fallback_gettime(clk, ts);
> -               ns += (cycles - gtod->cycle_last) * gtod->mult;
> +               if (cycles > last)
> +                       ns += (cycles - last) * gtod->mult;
>
> How about:
>
> if (unlikely((s64)cycles <= last)) {
>   if (cycles < 0) [or cycles == -1 or whatever]
>     return vdso_fallback_gettime;
> } else {
>   ns += (cycles - last) * gtod->mult;
> }
>
> which should actually make this whole mess be essentially free.
>
> Also, I'm not entirely convinced that this "last" thing is needed at
> all.  John, what's the scenario under which we need it?

So my memory is probably a bit foggy, but I recall that as we
accelerated gettimeofday, we found that even on systems that claimed
to have synced TSCs, they were actually just slightly out of sync.
Enough that right after cycles_last had been updated, a read on
another cpu could come in just behind cycles_last, resulting in a
negative interval causing lots of havoc.

So the sanity check is needed to avoid that case.

thanks
-john

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case
  2018-09-17 20:12     ` John Stultz
@ 2018-09-18  7:52       ` Thomas Gleixner
  2018-09-18  8:30         ` Peter Zijlstra
  2018-09-18 14:01         ` Andy Lutomirski
  0 siblings, 2 replies; 81+ messages in thread
From: Thomas Gleixner @ 2018-09-18  7:52 UTC (permalink / raw)
  To: John Stultz
  Cc: Andy Lutomirski, LKML, X86 ML, Peter Zijlstra, Matt Rickard,
	Stephen Boyd, Florian Weimer, K. Y. Srinivasan, Vitaly Kuznetsov,
	devel, Linux Virtualization, Paolo Bonzini, Arnd Bergmann,
	Juergen Gross

On Mon, 17 Sep 2018, John Stultz wrote:
> On Mon, Sep 17, 2018 at 12:25 PM, Andy Lutomirski <luto@kernel.org> wrote:
> > Also, I'm not entirely convinced that this "last" thing is needed at
> > all.  John, what's the scenario under which we need it?
> 
> So my memory is probably a bit foggy, but I recall that as we
> accelerated gettimeofday, we found that even on systems that claimed
> to have synced TSCs, they were actually just slightly out of sync.
> Enough that right after cycles_last had been updated, a read on
> another cpu could come in just behind cycles_last, resulting in a
> negative interval causing lots of havoc.
> 
> So the sanity check is needed to avoid that case.

Your memory serves you right. That's indeed observable on CPUs which
lack TSC_ADJUST.

@Andy: Welcome to the wonderful world of TSC.

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case
  2018-09-18  7:52       ` Thomas Gleixner
@ 2018-09-18  8:30         ` Peter Zijlstra
  2018-09-18  8:52           ` Thomas Gleixner
  2018-09-18 14:01         ` Andy Lutomirski
  1 sibling, 1 reply; 81+ messages in thread
From: Peter Zijlstra @ 2018-09-18  8:30 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: John Stultz, Andy Lutomirski, LKML, X86 ML, Matt Rickard,
	Stephen Boyd, Florian Weimer, K. Y. Srinivasan, Vitaly Kuznetsov,
	devel, Linux Virtualization, Paolo Bonzini, Arnd Bergmann,
	Juergen Gross

On Tue, Sep 18, 2018 at 09:52:26AM +0200, Thomas Gleixner wrote:
> On Mon, 17 Sep 2018, John Stultz wrote:
> > On Mon, Sep 17, 2018 at 12:25 PM, Andy Lutomirski <luto@kernel.org> wrote:
> > > Also, I'm not entirely convinced that this "last" thing is needed at
> > > all.  John, what's the scenario under which we need it?
> > 
> > So my memory is probably a bit foggy, but I recall that as we
> > accelerated gettimeofday, we found that even on systems that claimed
> > to have synced TSCs, they were actually just slightly out of sync.
> > Enough that right after cycles_last had been updated, a read on
> > another cpu could come in just behind cycles_last, resulting in a
> > negative interval causing lots of havoc.
> > 
> > So the sanity check is needed to avoid that case.
> 
> Your memory serves you right. That's indeed observable on CPUs which
> lack TSC_ADJUST.

But, if the gtod code can observe this, then why doesn't the code that
checks the sync?

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case
  2018-09-18  8:30         ` Peter Zijlstra
@ 2018-09-18  8:52           ` Thomas Gleixner
  2018-09-18 10:06             ` Thomas Gleixner
  0 siblings, 1 reply; 81+ messages in thread
From: Thomas Gleixner @ 2018-09-18  8:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: John Stultz, Andy Lutomirski, LKML, X86 ML, Matt Rickard,
	Stephen Boyd, Florian Weimer, K. Y. Srinivasan, Vitaly Kuznetsov,
	devel, Linux Virtualization, Paolo Bonzini, Arnd Bergmann,
	Juergen Gross

On Tue, 18 Sep 2018, Peter Zijlstra wrote:
> On Tue, Sep 18, 2018 at 09:52:26AM +0200, Thomas Gleixner wrote:
> > On Mon, 17 Sep 2018, John Stultz wrote:
> > > On Mon, Sep 17, 2018 at 12:25 PM, Andy Lutomirski <luto@kernel.org> wrote:
> > > > Also, I'm not entirely convinced that this "last" thing is needed at
> > > > all.  John, what's the scenario under which we need it?
> > > 
> > > So my memory is probably a bit foggy, but I recall that as we
> > > accelerated gettimeofday, we found that even on systems that claimed
> > > to have synced TSCs, they were actually just slightly out of sync.
> > > Enough that right after cycles_last had been updated, a read on
> > > another cpu could come in just behind cycles_last, resulting in a
> > > negative interval causing lots of havoc.
> > > 
> > > So the sanity check is needed to avoid that case.
> > 
> > Your memory serves you right. That's indeed observable on CPUs which
> > lack TSC_ADJUST.
> 
> But, if the gtod code can observe this, then why doesn't the code that
> checks the sync?

Because it depends where the involved CPUs are in the topology. The sync
code might just run on the same package and simply not see it. Yes, w/o
TSC_ADJUST the TSC sync code can just fail to see the havoc.

Thanks,

	tglx





^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case
  2018-09-18  8:52           ` Thomas Gleixner
@ 2018-09-18 10:06             ` Thomas Gleixner
  2018-09-18 10:41               ` Thomas Gleixner
  0 siblings, 1 reply; 81+ messages in thread
From: Thomas Gleixner @ 2018-09-18 10:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: John Stultz, Andy Lutomirski, LKML, X86 ML, Matt Rickard,
	Stephen Boyd, Florian Weimer, K. Y. Srinivasan, Vitaly Kuznetsov,
	devel, Linux Virtualization, Paolo Bonzini, Arnd Bergmann,
	Juergen Gross

On Tue, 18 Sep 2018, Thomas Gleixner wrote:
> On Tue, 18 Sep 2018, Peter Zijlstra wrote:
> > > Your memory serves you right. That's indeed observable on CPUs which
> > > lack TSC_ADJUST.
> > 
> > But, if the gtod code can observe this, then why doesn't the code that
> > checks the sync?
> 
> Because it depends where the involved CPUs are in the topology. The sync
> > code might just run on the same package and simply not see it. Yes, w/o
> TSC_ADJUST the TSC sync code can just fail to see the havoc.

Even with TSC adjust the TSC can be slightly off by design on multi-socket
systems.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case
  2018-09-18 10:06             ` Thomas Gleixner
@ 2018-09-18 10:41               ` Thomas Gleixner
  2018-09-18 12:48                 ` Peter Zijlstra
  0 siblings, 1 reply; 81+ messages in thread
From: Thomas Gleixner @ 2018-09-18 10:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: John Stultz, Andy Lutomirski, LKML, X86 ML, Matt Rickard,
	Stephen Boyd, Florian Weimer, K. Y. Srinivasan, Vitaly Kuznetsov,
	devel, Linux Virtualization, Paolo Bonzini, Arnd Bergmann,
	Juergen Gross

On Tue, 18 Sep 2018, Thomas Gleixner wrote:
> On Tue, 18 Sep 2018, Thomas Gleixner wrote:
> > On Tue, 18 Sep 2018, Peter Zijlstra wrote:
> > > > Your memory serves you right. That's indeed observable on CPUs which
> > > > lack TSC_ADJUST.
> > > 
> > > But, if the gtod code can observe this, then why doesn't the code that
> > > checks the sync?
> > 
> > Because it depends where the involved CPUs are in the topology. The sync
> > code might just run on the same package and simply not see it. Yes, w/o
> > TSC_ADJUST the TSC sync code can just fail to see the havoc.
> 
> Even with TSC adjust the TSC can be slightly off by design on multi-socket
> systems.

Here are the gory details:

   https://lore.kernel.org/lkml/3c1737210708230408i7a8049a9m5db49e6c4d89ab62@mail.gmail.com/

The changelog has an explanation as well.

    d8bb6f4c1670 ("x86: tsc prevent time going backwards")

I still have one of the machines which is affected by this.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case
  2018-09-18 10:41               ` Thomas Gleixner
@ 2018-09-18 12:48                 ` Peter Zijlstra
  2018-09-18 13:23                   ` Thomas Gleixner
  0 siblings, 1 reply; 81+ messages in thread
From: Peter Zijlstra @ 2018-09-18 12:48 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: John Stultz, Andy Lutomirski, LKML, X86 ML, Matt Rickard,
	Stephen Boyd, Florian Weimer, K. Y. Srinivasan, Vitaly Kuznetsov,
	devel, Linux Virtualization, Paolo Bonzini, Arnd Bergmann,
	Juergen Gross

On Tue, Sep 18, 2018 at 12:41:57PM +0200, Thomas Gleixner wrote:
> On Tue, 18 Sep 2018, Thomas Gleixner wrote:
> > On Tue, 18 Sep 2018, Thomas Gleixner wrote:
> > > On Tue, 18 Sep 2018, Peter Zijlstra wrote:
> > > > > Your memory serves you right. That's indeed observable on CPUs which
> > > > > lack TSC_ADJUST.
> > > > 
> > > > But, if the gtod code can observe this, then why doesn't the code that
> > > > checks the sync?
> > > 
> > > Because it depends where the involved CPUs are in the topology. The sync
> > > code might just run on the same package and simply not see it. Yes, w/o
> > > TSC_ADJUST the TSC sync code can just fail to see the havoc.
> > 
> > Even with TSC adjust the TSC can be slightly off by design on multi-socket
> > systems.
> 
> Here are the gory details:
> 
>    https://lore.kernel.org/lkml/3c1737210708230408i7a8049a9m5db49e6c4d89ab62@mail.gmail.com/
> 
> The changelog has an explanation as well.
> 
>     d8bb6f4c1670 ("x86: tsc prevent time going backwards")
> 
> I still have one of the machines which is affected by this.

Are we sure this isn't a load vs rdtsc reorder? Because if I look at the
current code:

notrace static u64 vread_tsc(void)
{
	u64 ret = (u64)rdtsc_ordered();
	u64 last = gtod->cycle_last;

	if (likely(ret >= last))
		return ret;

	/*
	 * GCC likes to generate cmov here, but this branch is extremely
	 * predictable (it's just a function of time and the likely is
	 * very likely) and there's a data dependence, so force GCC
	 * to generate a branch instead.  I don't barrier() because
	 * we don't actually need a barrier, and if this function
	 * ever gets inlined it will generate worse code.
	 */
	asm volatile ("");
	return last;
}

That does:

	lfence
	rdtsc
	load gtod->cycle_last

Which obviously allows us to observe a cycles_last that is later than
the rdtsc itself, and thus time can trivially go backwards.

The new code:

		last = gtod->cycle_last;
		cycles = vgetcyc(gtod->vclock_mode);
		if (unlikely((s64)cycles < 0))
			return vdso_fallback_gettime(clk, ts);
		if (cycles > last)
			ns += (cycles - last) * gtod->mult;

looks like:

	load gtod->cycle_last
	lfence
	rdtsc

which avoids that possibility, the cycle_last load must have completed
before the rdtsc.



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case
  2018-09-18 12:48                 ` Peter Zijlstra
@ 2018-09-18 13:23                   ` Thomas Gleixner
  2018-09-18 13:38                     ` Peter Zijlstra
  2018-09-18 15:52                     ` Thomas Gleixner
  0 siblings, 2 replies; 81+ messages in thread
From: Thomas Gleixner @ 2018-09-18 13:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: John Stultz, Andy Lutomirski, LKML, X86 ML, Matt Rickard,
	Stephen Boyd, Florian Weimer, K. Y. Srinivasan, Vitaly Kuznetsov,
	devel, Linux Virtualization, Paolo Bonzini, Arnd Bergmann,
	Juergen Gross

On Tue, 18 Sep 2018, Peter Zijlstra wrote:
> On Tue, Sep 18, 2018 at 12:41:57PM +0200, Thomas Gleixner wrote:
> > I still have one of the machines which is affected by this.
> 
> Are we sure this isn't a load vs rdtsc reorder? Because if I look at the
> current code:

The load order of last vs. rdtsc does not matter at all.

CPU0					CPU1

....
now0 = rdtsc_ordered();
...
tk->cycle_last = now0;

gtod->seq++;
gtod->cycle_last = tk->cycle_last;
...
gtod->seq++;
					seq_begin(gtod->seq);
					now1 = rdtsc_ordered();

So if the TSC on CPU1 is slightly behind the TSC on CPU0 then now1 can be
smaller than cycle_last. The TSC sync stuff does not catch the small delta
for unknown raisins. I'll go and find that machine and test that again.

Thanks,

	tglx




^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case
  2018-09-18 13:23                   ` Thomas Gleixner
@ 2018-09-18 13:38                     ` Peter Zijlstra
  2018-09-18 15:52                     ` Thomas Gleixner
  1 sibling, 0 replies; 81+ messages in thread
From: Peter Zijlstra @ 2018-09-18 13:38 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: John Stultz, Andy Lutomirski, LKML, X86 ML, Matt Rickard,
	Stephen Boyd, Florian Weimer, K. Y. Srinivasan, Vitaly Kuznetsov,
	devel, Linux Virtualization, Paolo Bonzini, Arnd Bergmann,
	Juergen Gross

On Tue, Sep 18, 2018 at 03:23:08PM +0200, Thomas Gleixner wrote:
> On Tue, 18 Sep 2018, Peter Zijlstra wrote:
> > On Tue, Sep 18, 2018 at 12:41:57PM +0200, Thomas Gleixner wrote:
> > > I still have one of the machines which is affected by this.
> > 
> > Are we sure this isn't a load vs rdtsc reorder? Because if I look at the
> > current code:
> 
> The load order of last vs. rdtsc does not matter at all.
> 
> CPU0					CPU1
> 
> ....
> now0 = rdtsc_ordered();
> ...
> tk->cycle_last = now0;
> 
> gtod->seq++;
> gtod->cycle_last = tk->cycle_last;
> ...
> gtod->seq++;
> 					seq_begin(gtod->seq);
> 					now1 = rdtsc_ordered();
> 
> So if the TSC on CPU1 is slightly behind the TSC on CPU0 then now1 can be
> smaller than cycle_last. The TSC sync stuff does not catch the small delta
> for unknown raisins. I'll go and find that machine and test that again.

Yeah, somehow I forgot about rseq.. maybe I should go sleep or
something.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case
  2018-09-18  7:52       ` Thomas Gleixner
  2018-09-18  8:30         ` Peter Zijlstra
@ 2018-09-18 14:01         ` Andy Lutomirski
  2018-09-18 22:46           ` Thomas Gleixner
  1 sibling, 1 reply; 81+ messages in thread
From: Andy Lutomirski @ 2018-09-18 14:01 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: John Stultz, Andy Lutomirski, LKML, X86 ML, Peter Zijlstra,
	Matt Rickard, Stephen Boyd, Florian Weimer, K. Y. Srinivasan,
	Vitaly Kuznetsov, devel, Linux Virtualization, Paolo Bonzini,
	Arnd Bergmann, Juergen Gross


> On Sep 18, 2018, at 12:52 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
>> On Mon, 17 Sep 2018, John Stultz wrote:
>>> On Mon, Sep 17, 2018 at 12:25 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>> Also, I'm not entirely convinced that this "last" thing is needed at
>>> all.  John, what's the scenario under which we need it?
>> 
>> So my memory is probably a bit foggy, but I recall that as we
>> accelerated gettimeofday, we found that even on systems that claimed
>> to have synced TSCs, they were actually just slightly out of sync.
>> Enough that right after cycles_last had been updated, a read on
>> another cpu could come in just behind cycles_last, resulting in a
>> negative interval causing lots of havoc.
>> 
>> So the sanity check is needed to avoid that case.
> 
> Your memory serves you right. That's indeed observable on CPUs which
> lack TSC_ADJUST.
> 
> @Andy: Welcome to the wonderful world of TSC.
> 

Do we do better if we use signed arithmetic for the whole calculation? Then a small backwards movement would result in a small backwards result.  Or we could offset everything so that we’d have to go back several hundred ms before we cross zero.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case
  2018-09-18 13:23                   ` Thomas Gleixner
  2018-09-18 13:38                     ` Peter Zijlstra
@ 2018-09-18 15:52                     ` Thomas Gleixner
  2018-09-27 14:41                       ` Thomas Gleixner
  1 sibling, 1 reply; 81+ messages in thread
From: Thomas Gleixner @ 2018-09-18 15:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: John Stultz, Andy Lutomirski, LKML, X86 ML, Matt Rickard,
	Stephen Boyd, Florian Weimer, K. Y. Srinivasan, Vitaly Kuznetsov,
	devel, Linux Virtualization, Paolo Bonzini, Arnd Bergmann,
	Juergen Gross

On Tue, 18 Sep 2018, Thomas Gleixner wrote:
> So if the TSC on CPU1 is slightly behind the TSC on CPU0 then now1 can be
> smaller than cycle_last. The TSC sync stuff does not catch the small delta
> for unknown raisins. I'll go and find that machine and test that again.

Of course it does not trigger anymore. We accumulated code between the
point in timekeeping_advance() where the TSC is read and the update of the
VDSO data.

I might have to get a 2.6-ish kernel booted on that machine and try with
that again. /me shudders

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case
  2018-09-18 14:01         ` Andy Lutomirski
@ 2018-09-18 22:46           ` Thomas Gleixner
  2018-09-18 23:03             ` Andy Lutomirski
  2018-09-19  9:08             ` Rasmus Villemoes
  0 siblings, 2 replies; 81+ messages in thread
From: Thomas Gleixner @ 2018-09-18 22:46 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: John Stultz, Andy Lutomirski, LKML, X86 ML, Peter Zijlstra,
	Matt Rickard, Stephen Boyd, Florian Weimer, K. Y. Srinivasan,
	Vitaly Kuznetsov, devel, Linux Virtualization, Paolo Bonzini,
	Arnd Bergmann, Juergen Gross

[-- Attachment #1: Type: text/plain, Size: 1734 bytes --]

On Tue, 18 Sep 2018, Andy Lutomirski wrote:
> > On Sep 18, 2018, at 12:52 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > 
> >> On Mon, 17 Sep 2018, John Stultz wrote:
> >>> On Mon, Sep 17, 2018 at 12:25 PM, Andy Lutomirski <luto@kernel.org> wrote:
> >>> Also, I'm not entirely convinced that this "last" thing is needed at
> >>> all.  John, what's the scenario under which we need it?
> >> 
> >> So my memory is probably a bit foggy, but I recall that as we
> >> accelerated gettimeofday, we found that even on systems that claimed
> >> to have synced TSCs, they were actually just slightly out of sync.
> >> Enough that right after cycles_last had been updated, a read on
> >> another cpu could come in just behind cycles_last, resulting in a
> >> negative interval causing lots of havoc.
> >> 
> >> So the sanity check is needed to avoid that case.
> > 
> > Your memory serves you right. That's indeed observable on CPUs which
> > lack TSC_ADJUST.
> > 
> > @Andy: Welcome to the wonderful world of TSC.
> > 
> 
> Do we do better if we use signed arithmetic for the whole calculation?
> Then a small backwards movement would result in a small backwards result.
> Or we could offset everything so that we’d have to go back several
> hundred ms before we cross zero.

That would probably be the better solution, as signed math would be
problematic when the resulting ns value becomes negative. As the delta is
really small (otherwise the TSC sync check would have caught it), the
caller should never be able to observe time going backwards.

I'll have a look into that. It needs some thought vs. the fractional part
of the base time, but it should be not rocket science to get that
correct. Famous last words...

Thanks,

	tglx



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case
  2018-09-18 22:46           ` Thomas Gleixner
@ 2018-09-18 23:03             ` Andy Lutomirski
  2018-09-18 23:16               ` Thomas Gleixner
  2018-09-19  9:08             ` Rasmus Villemoes
  1 sibling, 1 reply; 81+ messages in thread
From: Andy Lutomirski @ 2018-09-18 23:03 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: John Stultz, Andy Lutomirski, LKML, X86 ML, Peter Zijlstra,
	Matt Rickard, Stephen Boyd, Florian Weimer, K. Y. Srinivasan,
	Vitaly Kuznetsov, devel, Linux Virtualization, Paolo Bonzini,
	Arnd Bergmann, Juergen Gross



> On Sep 18, 2018, at 3:46 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Tue, 18 Sep 2018, Andy Lutomirski wrote:
>>> On Sep 18, 2018, at 12:52 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> 
>>>>> On Mon, 17 Sep 2018, John Stultz wrote:
>>>>> On Mon, Sep 17, 2018 at 12:25 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>>>> Also, I'm not entirely convinced that this "last" thing is needed at
>>>>> all.  John, what's the scenario under which we need it?
>>>> 
>>>> So my memory is probably a bit foggy, but I recall that as we
>>>> accelerated gettimeofday, we found that even on systems that claimed
>>>> to have synced TSCs, they were actually just slightly out of sync.
>>>> Enough that right after cycles_last had been updated, a read on
>>>> another cpu could come in just behind cycles_last, resulting in a
>>>> negative interval causing lots of havoc.
>>>> 
>>>> So the sanity check is needed to avoid that case.
>>> 
>>> Your memory serves you right. That's indeed observable on CPUs which
>>> lack TSC_ADJUST.
>>> 
>>> @Andy: Welcome to the wonderful world of TSC.
>>> 
>> 
>> Do we do better if we use signed arithmetic for the whole calculation?
>> Then a small backwards movement would result in a small backwards result.
>> Or we could offset everything so that we’d have to go back several
>> hundred ms before we cross zero.
> 
> That would probably be the better solution, as signed math would be
> problematic when the resulting ns value becomes negative. As the delta is
> really small (otherwise the TSC sync check would have caught it), the
> caller should never be able to observe time going backwards.
> 
> I'll have a look into that. It needs some thought vs. the fractional part
> of the base time, but it should be not rocket science to get that
> correct. Famous last words...
> 

It’s also fiddly to tune. If you offset it too much, then the fancy divide-by-repeated-subtraction loop will hurt more than the comparison to last.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case
  2018-09-18 23:03             ` Andy Lutomirski
@ 2018-09-18 23:16               ` Thomas Gleixner
  2018-09-27 14:36                 ` Thomas Gleixner
  0 siblings, 1 reply; 81+ messages in thread
From: Thomas Gleixner @ 2018-09-18 23:16 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: John Stultz, Andy Lutomirski, LKML, X86 ML, Peter Zijlstra,
	Matt Rickard, Stephen Boyd, Florian Weimer, K. Y. Srinivasan,
	Vitaly Kuznetsov, devel, Linux Virtualization, Paolo Bonzini,
	Arnd Bergmann, Juergen Gross

[-- Attachment #1: Type: text/plain, Size: 1254 bytes --]

On Tue, 18 Sep 2018, Andy Lutomirski wrote:
> > On Sep 18, 2018, at 3:46 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Tue, 18 Sep 2018, Andy Lutomirski wrote:
> >> Do we do better if we use signed arithmetic for the whole calculation?
> >> Then a small backwards movement would result in a small backwards result.
> >> Or we could offset everything so that we’d have to go back several
> >> hundred ms before we cross zero.
> > 
> > That would probably be the better solution, as signed math would be
> > problematic when the resulting ns value becomes negative. As the delta is
> > really small (otherwise the TSC sync check would have caught it), the
> > caller should never be able to observe time going backwards.
> > 
> > I'll have a look into that. It needs some thought vs. the fractional part
> > of the base time, but it should be not rocket science to get that
> > correct. Famous last words...
> > 
> 
> It’s also fiddly to tune. If you offset it too much, then the fancy
> divide-by-repeated-subtraction loop will hurt more than the comparison to
> last.

Not really. It's sufficient to offset it by at max. 1000 cycles or so. That
won't hurt the magic loop, but it will definitely cover that slight offset
case.
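
In update_vsyscall() terms that amounts to something like the following
sketch. TSC_SLOP is a made-up constant, and the fractional part handling
mentioned above is glossed over:

	/* Publish cycle_last slightly in the past, so a TSC that is a
	 * few cycles behind still yields a small positive delta, and
	 * subtract the same amount from the base time (per clock) to
	 * compensate.
	 */
	vdata->cycle_last = tk->tkr_mono.cycle_last - TSC_SLOP;
	base->nsec = tk->tkr_mono.xtime_nsec - TSC_SLOP * tk->tkr_mono.mult;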

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case
  2018-09-18 22:46           ` Thomas Gleixner
  2018-09-18 23:03             ` Andy Lutomirski
@ 2018-09-19  9:08             ` Rasmus Villemoes
  2018-09-19 13:29               ` Thomas Gleixner
  1 sibling, 1 reply; 81+ messages in thread
From: Rasmus Villemoes @ 2018-09-19  9:08 UTC (permalink / raw)
  To: Thomas Gleixner, Andy Lutomirski
  Cc: John Stultz, Andy Lutomirski, LKML, X86 ML, Peter Zijlstra,
	Matt Rickard, Stephen Boyd, Florian Weimer, K. Y. Srinivasan,
	Vitaly Kuznetsov, devel, Linux Virtualization, Paolo Bonzini,
	Arnd Bergmann, Juergen Gross

On 2018-09-19 00:46, Thomas Gleixner wrote:
> On Tue, 18 Sep 2018, Andy Lutomirski wrote:
>>>
>>
>> Do we do better if we use signed arithmetic for the whole calculation?
>> Then a small backwards movement would result in a small backwards result.
>> Or we could offset everything so that we’d have to go back several
>> hundred ms before we cross zero.
> 
> That would probably be the better solution, as signed math would be
> problematic when the resulting ns value becomes negative. As the delta is
> really small (otherwise the TSC sync check would have caught it), the
> caller should never be able to observe time going backwards.
> 
> I'll have a look into that. It needs some thought vs. the fractional part
> of the base time, but it should be not rocket science to get that
> correct. Famous last words...

Does the sentinel need to be U64_MAX? What if vgetcyc and its minions
returned gtod->cycle_last-1 (for some value of 1), and the caller just
does "if ((s64)cycles - (s64)last < 0) return fallback; ns +=
(cycles-last)* ...". That should just be a "sub ; js ; ". It's an extra
load of ->cycle_last, but only on the path where we're heading for the
fallback anyway. The value of 1 can be adjusted so that in the "js"
path, we could detect and accept an rdtsc_ordered() call that's just a
few 10s of cycles behind last and treat that as 0 and continue back on
the normal path. But maybe it's hard to get gcc to generate the expected
code.
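
Spelled out, that refinement would be something like this sketch, where
ERR_OFF is the "some value of 1" above, chosen larger than any plausible
TSC skew, and the read helpers return gtod->cycle_last - ERR_OFF on error:

	s64 delta = (s64)cycles - (s64)last;

	if (unlikely(delta < 0)) {
		if (delta <= -ERR_OFF)	/* the helper signalled failure */
			return vdso_fallback_gettime(clk, ts);
		delta = 0;		/* tiny inter-CPU skew: clamp it */
	}
	ns += delta * gtod->mult;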

Rasmus

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case
  2018-09-19  9:08             ` Rasmus Villemoes
@ 2018-09-19 13:29               ` Thomas Gleixner
  0 siblings, 0 replies; 81+ messages in thread
From: Thomas Gleixner @ 2018-09-19 13:29 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Andy Lutomirski, John Stultz, Andy Lutomirski, LKML, X86 ML,
	Peter Zijlstra, Matt Rickard, Stephen Boyd, Florian Weimer,
	K. Y. Srinivasan, Vitaly Kuznetsov, devel, Linux Virtualization,
	Paolo Bonzini, Arnd Bergmann, Juergen Gross

[-- Attachment #1: Type: text/plain, Size: 1781 bytes --]

On Wed, 19 Sep 2018, Rasmus Villemoes wrote:
> On 2018-09-19 00:46, Thomas Gleixner wrote:
> > On Tue, 18 Sep 2018, Andy Lutomirski wrote:
> >>>
> >>
> >> Do we do better if we use signed arithmetic for the whole calculation?
> >> Then a small backwards movement would result in a small backwards result.
> >> Or we could offset everything so that we’d have to go back several
> >> hundred ms before we cross zero.
> > 
> > That would probably be the better solution, as signed math would be
> > problematic when the resulting ns value becomes negative. As the delta is
> > really small (otherwise the TSC sync check would have caught it), the
> > caller should never be able to observe time going backwards.
> > 
> > I'll have a look into that. It needs some thought vs. the fractional part
> > of the base time, but it should be not rocket science to get that
> > correct. Famous last words...
> 
> Does the sentinel need to be U64_MAX? What if vgetcyc and its minions
> returned gtod->cycle_last-1 (for some value of 1), and the caller just
> does "if ((s64)cycles - (s64)last < 0) return fallback; ns +=
> (cycles-last)* ...". That should just be a "sub ; js ; ". It's an extra
> load of ->cycle_last, but only on the path where we're heading for the
> fallback anyway. The value of 1 can be adjusted so that in the "js"
> path, we could detect and accept an rdtsc_ordered() call that's just a
> few 10s of cycles behind last and treat that as 0 and continue back on
> the normal path. But maybe it's hard to get gcc to generate the expected
> code.

I played around with a lot of variants and GCC generates all kinds of
interesting ASM. And at some point optimizing that math code is not buying
anything because the LFENCE before RDTSC is dominating all of it.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
  2018-09-17 13:00   ` Thomas Gleixner
@ 2018-09-24 21:08     ` Arnd Bergmann
  0 siblings, 0 replies; 81+ messages in thread
From: Arnd Bergmann @ 2018-09-24 21:08 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Linux Kernel Mailing List, Andy Lutomirski,
	the arch/x86 maintainers, Peter Zijlstra, matt, Stephen Boyd,
	John Stultz, Florian Weimer, K. Y. Srinivasan, vkuznets, devel,
	virtualization, Paolo Bonzini, Juergen Gross, Deepa Dinamani

On Mon, Sep 17, 2018 at 3:00 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Fri, 14 Sep 2018, Arnd Bergmann wrote:
> > On Fri, Sep 14, 2018 at 2:52 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> > A couple of architectures (s390, ia64, riscv, powerpc, arm64)
> > implement the vdso as assembler code at the moment, so they
> > won't be as easy to consolidate (other than outright replacing all
> > the code).
> >
> > The other five:
> > arch/x86/entry/vdso/vclock_gettime.c
> > arch/sparc/vdso/vclock_gettime.c
> > arch/nds32/kernel/vdso/gettimeofday.c
> > arch/mips/vdso/gettimeofday.c
> > arch/arm/vdso/vgettimeofday.c
> >
> > are basically all minor variations of the same code base and could be
> > consolidated to some degree.
> > Any suggestions here? Should we plan to do that consolitdation based on
> > your new version, or just add clock_gettime64 in arm32 and x86-32, and then
> > be done with it? The other ones will obviously still be fast for 32-bit time_t
> > and will have a working non-vdso sys_clock_getttime64().
>
> In principle consolidating all those implementations should be possible to
> some extent and probably worthwhile. What's arch specific are the actual
> accessors to the hardware clocks.

Ok.

> > I also wonder about clock_getres(): half the architectures seem to implement
> > it in vdso, but notably arm32 and x86 don't, and I had not expected it to be
> > performance critical given that the result is easily cached in user space.
>
> getres() is not really performance critical, but adding it does not create
> a huge problem either.

Right, I'd just not add a getres_time64() to the vdso then.

        Arnd

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case
  2018-09-18 23:16               ` Thomas Gleixner
@ 2018-09-27 14:36                 ` Thomas Gleixner
  2018-09-27 14:39                   ` Andy Lutomirski
  0 siblings, 1 reply; 81+ messages in thread
From: Thomas Gleixner @ 2018-09-27 14:36 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: John Stultz, Andy Lutomirski, LKML, X86 ML, Peter Zijlstra,
	Matt Rickard, Stephen Boyd, Florian Weimer, K. Y. Srinivasan,
	Vitaly Kuznetsov, devel, Linux Virtualization, Paolo Bonzini,
	Arnd Bergmann, Juergen Gross

[-- Attachment #1: Type: text/plain, Size: 1706 bytes --]

On Wed, 19 Sep 2018, Thomas Gleixner wrote:
> On Tue, 18 Sep 2018, Andy Lutomirski wrote:
> > > On Sep 18, 2018, at 3:46 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > > On Tue, 18 Sep 2018, Andy Lutomirski wrote:
> > >> Do we do better if we use signed arithmetic for the whole calculation?
> > >> Then a small backwards movement would result in a small backwards result.
> > >> Or we could offset everything so that we’d have to go back several
> > >> hundred ms before we cross zero.
> > > 
> > > That would probably be the better solution, as signed math would be
> > > problematic when the resulting ns value becomes negative. As the delta is
> > > really small (otherwise the TSC sync check would have caught it), the
> > > caller should never be able to observe time going backwards.
> > > 
> > > I'll have a look into that. It needs some thought vs. the fractional part
> > > of the base time, but it should be not rocket science to get that
> > > correct. Famous last words...
> > > 
> > 
> > It’s also fiddly to tune. If you offset it too much, then the fancy
> > divide-by-repeated-subtraction loop will hurt more than the comparison to
> > last.
> 
> Not really. It's sufficient to offset it by at max. 1000 cycles or so. That
> won't hurt the magic loop, but it will definitely cover that slight offset
> case.

I got it working, but first of all the gain is close to 0.

There is also the subtle issue that we've seen TSCs slowly drifting apart,
which is caught by the TSC watchdog eventually, but if the drift exceeds
the offset _before_ the watchdog triggers, we're back to square one.

So I rather stay on the safe side and just accept that we have to deal with
that. Sigh.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case
  2018-09-27 14:36                 ` Thomas Gleixner
@ 2018-09-27 14:39                   ` Andy Lutomirski
  0 siblings, 0 replies; 81+ messages in thread
From: Andy Lutomirski @ 2018-09-27 14:39 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: John Stultz, Andy Lutomirski, LKML, X86 ML, Peter Zijlstra,
	Matt Rickard, Stephen Boyd, Florian Weimer, K. Y. Srinivasan,
	Vitaly Kuznetsov, devel, Linux Virtualization, Paolo Bonzini,
	Arnd Bergmann, Juergen Gross



> On Sep 27, 2018, at 7:36 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
>> On Wed, 19 Sep 2018, Thomas Gleixner wrote:
>> On Tue, 18 Sep 2018, Andy Lutomirski wrote:
>>>> On Sep 18, 2018, at 3:46 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>>>> On Tue, 18 Sep 2018, Andy Lutomirski wrote:
>>>>> Do we do better if we use signed arithmetic for the whole calculation?
>>>>> Then a small backwards movement would result in a small backwards result.
>>>>> Or we could offset everything so that we’d have to go back several
>>>>> hundred ms before we cross zero.
>>>> 
>>>> That would probably be the better solution, as signed math would be
>>>> problematic when the resulting ns value becomes negative. As the delta is
>>>> really small (otherwise the TSC sync check would have caught it), the
>>>> caller should never be able to observe time going backwards.
>>>> 
>>>> I'll have a look into that. It needs some thought vs. the fractional part
>>>> of the base time, but it should be not rocket science to get that
>>>> correct. Famous last words...
>>>> 
>>> 
>>> It’s also fiddly to tune. If you offset it too much, then the fancy
>>> divide-by-repeated-subtraction loop will hurt more than the comparison to
>>> last.
>> 
>> Not really. It's sufficient to offset it by at max. 1000 cycles or so. That
>> won't hurt the magic loop, but it will definitely cover that slight offset
>> case.
> 
> I got it working, but first of all the gain is close to 0.
> 
> There is also the subtle issue that we've seen TSCs slowly drifting apart,
> which is caught by the TSC watchdog eventually, but if the drift exceeds
> the offset _before_ the watchdog triggers, we're back to square one.
> 
> So I rather stay on the safe side and just accept that we have to deal with
> that. Sigh.
> 
> 

Seems okay to me. Oh well. 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 09/11] x86/vdso: Simplify the invalid vclock case
  2018-09-18 15:52                     ` Thomas Gleixner
@ 2018-09-27 14:41                       ` Thomas Gleixner
  0 siblings, 0 replies; 81+ messages in thread
From: Thomas Gleixner @ 2018-09-27 14:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: John Stultz, Andy Lutomirski, LKML, X86 ML, Matt Rickard,
	Stephen Boyd, Florian Weimer, K. Y. Srinivasan, Vitaly Kuznetsov,
	devel, Linux Virtualization, Paolo Bonzini, Arnd Bergmann,
	Juergen Gross

On Tue, 18 Sep 2018, Thomas Gleixner wrote:
> On Tue, 18 Sep 2018, Thomas Gleixner wrote:
> > So if the TSC on CPU1 is slightly behind the TSC on CPU0 then now1 can be
> > smaller than cycle_last. The TSC sync stuff does not catch the small delta
> > for unknown raisins. I'll go and find that machine and test that again.
> 
> Of course it does not trigger anymore. We accumulated code between the
> point in timekeeping_advance() where the TSC is read and the update of the
> VDSO data.
> 
> I'll might have to get an 2.6ish kernel booted on that machine and try with
> that again. /me shudders

Actually it does happen, because the TSCs are very slowly drifting apart
due to SMI wreckage trying to hide itself. It just takes a very long time.

Thanks,

	tglx



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 00/11] x86/vdso: Cleanups, simplifications and CLOCK_TAI support
  2018-09-14 12:50 [patch 00/11] x86/vdso: Cleanups, simplifications and CLOCK_TAI support Thomas Gleixner
                   ` (12 preceding siblings ...)
  2018-09-14 14:22 ` Arnd Bergmann
@ 2018-10-03  5:15 ` Andy Lutomirski
  2018-10-03  9:22   ` Vitaly Kuznetsov
  2018-10-03 19:00   ` Marcelo Tosatti
  2018-10-04 20:32 ` Andy Lutomirski
  14 siblings, 2 replies; 81+ messages in thread
From: Andy Lutomirski @ 2018-10-03  5:15 UTC (permalink / raw)
  To: Thomas Gleixner, Marcelo Tosatti, Paolo Bonzini, Radim Krcmar,
	Wanpeng Li
  Cc: LKML, Andrew Lutomirski, X86 ML, Peter Zijlstra, Matt Rickard,
	Stephen Boyd, John Stultz, Florian Weimer, KY Srinivasan,
	Vitaly Kuznetsov, devel, Linux Virtualization, Arnd Bergmann,
	Juergen Gross

Hi Vitaly, Paolo, Radim, etc.,

On Fri, Sep 14, 2018 at 5:52 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime()
> implementation, which extended the clockid switch case and added yet
> another slightly different copy of the same code.
>
> Especially the extended switch case is problematic as the compiler tends to
> generate a jump table which then requires to use retpolines. If jump tables
> are disabled it adds yet another conditional to the existing maze.
>
> This series takes a different approach by consolidating the almost
> identical functions into one implementation for high resolution clocks and
> one for the coarse grained clock ids by storing the base data for each
> clock id in an array which is indexed by the clock id.
>

I was trying to understand more of the implications of this patch
series, and I was again reminded that there is an entire extra copy of
the vclock reading code in arch/x86/kvm/x86.c.  And the purpose of
that code is very, very opaque.

Can one of you explain what the code is even doing?  From a couple of
attempts to read through it, it's a whole bunch of
probably-extremely-buggy code that, drumroll please, tries to
atomically read the TSC value and the time.  And decide whether the
result is "based on the TSC".  And then synthesizes a TSC-to-ns
multiplier and shift, based on *something other than the actual
multiply and shift used*.

IOW, unless I'm totally misunderstanding it, the code digs into the
private arch clocksource data intended for the vDSO, uses a poorly
maintained copy of the vDSO code to read the time (instead of doing
the sane thing and using the kernel interfaces for this), and
propagates a totally made up copy to the guest.  And gets it entirely
wrong when doing nested virt, since, unless there's some secret in
this maze, it doesn't actually use the scaling factor from the host
when it tells the guest what to do.

I am really, seriously tempted to send a patch to simply delete all
this code.  The correct way to do it is to hook

And I don't see how it's even possible to pass kvmclock correctly to
the L2 guest when L0 is hyperv.  KVM could pass *hyperv's* clock, but
L1 isn't notified when the data structure changes, so how the heck is
it supposed to update the kvmclock structure?

--Andy

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 00/11] x86/vdso: Cleanups, simplifications and CLOCK_TAI support
  2018-10-03  5:15 ` Andy Lutomirski
@ 2018-10-03  9:22   ` Vitaly Kuznetsov
  2018-10-03 10:20     ` Andy Lutomirski
  2018-10-03 19:06     ` Marcelo Tosatti
  2018-10-03 19:00   ` Marcelo Tosatti
  1 sibling, 2 replies; 81+ messages in thread
From: Vitaly Kuznetsov @ 2018-10-03  9:22 UTC (permalink / raw)
  To: Andy Lutomirski, Thomas Gleixner, Marcelo Tosatti, Paolo Bonzini,
	Radim Krcmar, Wanpeng Li
  Cc: LKML, X86 ML, Peter Zijlstra, Matt Rickard, Stephen Boyd,
	John Stultz, Florian Weimer, KY Srinivasan, devel,
	Linux Virtualization, Arnd Bergmann, Juergen Gross

Andy Lutomirski <luto@kernel.org> writes:

> Hi Vitaly, Paolo, Radim, etc.,
>
> On Fri, Sep 14, 2018 at 5:52 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime()
>> implementation, which extended the clockid switch case and added yet
>> another slightly different copy of the same code.
>>
>> Especially the extended switch case is problematic as the compiler tends to
>> generate a jump table which then requires to use retpolines. If jump tables
>> are disabled it adds yet another conditional to the existing maze.
>>
>> This series takes a different approach by consolidating the almost
>> identical functions into one implementation for high resolution clocks and
>> one for the coarse grained clock ids by storing the base data for each
>> clock id in an array which is indexed by the clock id.
>>
>
> I was trying to understand more of the implications of this patch
> series, and I was again reminded that there is an entire extra copy of
> the vclock reading code in arch/x86/kvm/x86.c.  And the purpose of
> that code is very, very opaque.
>
> Can one of you explain what the code is even doing?  From a couple of
> attempts to read through it, it's a whole bunch of
> probably-extremely-buggy code that, drumroll please, tries to
> atomically read the TSC value and the time.  And decide whether the
> result is "based on the TSC".  And then synthesizes a TSC-to-ns
> multiplier and shift, based on *something other than the actual
> multiply and shift used*.
>
> IOW, unless I'm totally misunderstanding it, the code digs into the
> private arch clocksource data intended for the vDSO, uses a poorly
> maintained copy of the vDSO code to read the time (instead of doing
> the sane thing and using the kernel interfaces for this), and
> propagates a totally made up copy to the guest.  And gets it entirely
> wrong when doing nested virt, since, unless there's some secret in
> this maze, it doesn't actually use the scaling factor from the host
> when it tells the guest what to do.
>
> I am really, seriously tempted to send a patch to simply delete all
> this code.  The correct way to do it is to hook

"I have discovered a truly marvelous proof of this, which this margin is
too narrow to contain" :-)

There is a very long history of different (hardware) issues Marcelo was
fighting with, and the current code is the surviving Frankenstein. E.g. it
is very, very unclear what "catchup", "always catchup" and
masterclock-less mode in general are and if we still need them.

That said I'm all for simplification. I'm not sure if we still need to
care about buggy hardware though.

>
> And I don't see how it's even possible to pass kvmclock correctly to
> the L2 guest when L0 is hyperv.  KVM could pass *hyperv's* clock, but
> L1 isn't notified when the data structure changes, so how the heck is
> it supposed to update the kvmclock structure?

Well, this kind of works in the following way:
L1's clocksource is 'tsc_page' which is, basically, a complement to the
TSC: two numbers provided by L0, an offset and a scale, and KVM was
taught to treat this clocksource as a good one (see b0c39dc68e3b
"x86/kvm: Pass stable clocksource to guests when running nested on
Hyper-V").
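
For illustration, a minimal sketch of such a TSC-page read (field names
follow the TLFS layout; this is a simplified take on what
hv_read_tsc_page() does, with the READ_ONCE/barrier details omitted):

struct ms_hyperv_tsc_page {
    volatile u32 tsc_sequence;      /* 0 means invalid, fall back to MSR */
    u32 reserved1;
    volatile u64 tsc_scale;         /* 64.64 fixed-point scale */
    volatile s64 tsc_offset;
};

static u64 read_tsc_page(const struct ms_hyperv_tsc_page *pg)
{
    u64 seq, scale, tick;
    s64 offset;

    do {
        seq = pg->tsc_sequence;
        if (!seq)
            return U64_MAX;         /* caller falls back to the MSR */
        scale  = pg->tsc_scale;
        offset = pg->tsc_offset;
        /* high 64 bits of rdtsc() * scale, i.e. a 64.64 multiply */
        tick = mul_u64_u64_shr(rdtsc(), scale, 64) + offset;
    } while (pg->tsc_sequence != seq);  /* retry if L0 updated the page */

    return tick;
}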

The notification you're talking about exists; it is called
reenlightenment (see 0092e4346f49 "x86/kvm: Support Hyper-V
reenlightenment"). When the TSC page changes (and this only happens
when L1 is migrated to a different host with a different TSC frequency
and TSC scaling is not supported by the CPU) we receive an interrupt in
L1 (at this moment all TSC accesses are emulated, which guarantees the
correctness of the readings), pause all L2 guests, update their kvmclock
structures with new data (we already know the new TSC frequency) and
then tell L0 that we're done and it can stop emulating TSC accesses.
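
Condensed into pseudo-C, that sequence looks roughly like this (the
helper names are made up for illustration; the real logic lives around
kvm_hyperv_tsc_notifier()):

static void reenlightenment_notification(void)
{
    /* L0 still emulates all TSC accesses here, so readings stay correct */
    pause_all_l2_guests();                  /* hypothetical helper */

    recompute_tsc_khz_from_tsc_page();      /* hypothetical: new frequency */
    update_all_l2_kvmclock_structures();    /* hypothetical: refresh pvti */

    unpause_all_l2_guests();                /* hypothetical helper */

    /* ack to L0 so it can stop emulating TSC accesses */
    tell_l0_reenlightenment_done();         /* hypothetical MSR write */
}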

(Nothing like this exists for KVM-on-KVM, by the way, when L1's
clocksource is 'kvmclock' L2s won't get a stable kvmclock clocksource.)

-- 
Vitaly

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
  2018-10-03  9:22   ` Vitaly Kuznetsov
@ 2018-10-03 10:20     ` Andy Lutomirski
  2018-10-03 12:01       ` Vitaly Kuznetsov
  2018-10-03 19:06     ` Marcelo Tosatti
  1 sibling, 1 reply; 81+ messages in thread
From: Andy Lutomirski @ 2018-10-03 10:20 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Andy Lutomirski, Thomas Gleixner, Marcelo Tosatti, Paolo Bonzini,
	Radim Krcmar, Wanpeng Li, LKML, X86 ML, Peter Zijlstra,
	Matt Rickard, Stephen Boyd, John Stultz, Florian Weimer,
	KY Srinivasan, devel, Linux Virtualization, Arnd Bergmann,
	Juergen Gross



> On Oct 3, 2018, at 2:22 AM, Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
> 
> Andy Lutomirski <luto@kernel.org> writes:
> 
>> Hi Vitaly, Paolo, Radim, etc.,
>> 
>>> On Fri, Sep 14, 2018 at 5:52 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>>> 
>>> Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime()
>>> implementation, which extended the clockid switch case and added yet
>>> another slightly different copy of the same code.
>>> 
>>> Especially the extended switch case is problematic as the compiler tends to
>>> generate a jump table which then requires to use retpolines. If jump tables
>>> are disabled it adds yet another conditional to the existing maze.
>>> 
>>> This series takes a different approach by consolidating the almost
>>> identical functions into one implementation for high resolution clocks and
>>> one for the coarse grained clock ids by storing the base data for each
>>> clock id in an array which is indexed by the clock id.
>>> 
>> 
>> I was trying to understand more of the implications of this patch
>> series, and I was again reminded that there is an entire extra copy of
>> the vclock reading code in arch/x86/kvm/x86.c.  And the purpose of
>> that code is very, very opaque.
>> 
>> Can one of you explain what the code is even doing?  From a couple of
>> attempts to read through it, it's a whole bunch of
>> probably-extremely-buggy code that, drumroll please, tries to
>> atomically read the TSC value and the time.  And decide whether the
>> result is "based on the TSC".  And then synthesizes a TSC-to-ns
>> multiplier and shift, based on *something other than the actual
>> multiply and shift used*.
>> 
>> IOW, unless I'm totally misunderstanding it, the code digs into the
>> private arch clocksource data intended for the vDSO, uses a poorly
>> maintained copy of the vDSO code to read the time (instead of doing
>> the sane thing and using the kernel interfaces for this), and
>> propagates a totally made up copy to the guest.  And gets it entirely
>> wrong when doing nested virt, since, unless there's some secret in
>> this maze, it doesn't actually use the scaling factor from the host
>> when it tells the guest what to do.
>> 
>> I am really, seriously tempted to send a patch to simply delete all
>> this code.  The correct way to do it is to hook
> 
> "I have discovered a truly marvelous proof of this, which this margin is
> too narrow to contain" :-)
> 
> There is a very long history of different (hardware) issues Marcelo was
> fighting with, and the current code is the surviving Frankenstein. E.g. it
> is very, very unclear what "catchup", "always catchup" and
> masterclock-less mode in general are and if we still need them.
> 
> That said I'm all for simplification. I'm not sure if we still need to
> care about buggy hardware though.
> 
>> 
>> And I don't see how it's even possible to pass kvmclock correctly to
>> the L2 guest when L0 is hyperv.  KVM could pass *hyperv's* clock, but
>> L1 isn't notified when the data structure changes, so how the heck is
>> it supposed to update the kvmclock structure?
> 
> Well, this kind of works in the following way:
> L1's clocksource is 'tsc_page' which is, basically, a complement to the
> TSC: two numbers provided by L0, an offset and a scale, and KVM was
> taught to treat this clocksource as a good one (see b0c39dc68e3b
> "x86/kvm: Pass stable clocksource to guests when running nested on
> Hyper-V").
> 
> The notification you're talking about exists; it is called
> reenlightenment (see 0092e4346f49 "x86/kvm: Support Hyper-V
> reenlightenment"). When the TSC page changes (and this only happens when L1
> is migrated to a different host with a different TSC frequency and TSC
> scaling is not supported by the CPU) we receive an interrupt in L1 (at
> this moment all TSC accesses are emulated which guarantees the
> correctness of the readings), pause all L2 guests, update their kvmclock
> structures with new data (we already know the new TSC frequency) and
> then tell L0 that we're done and it can stop emulating TSC accesses.

That’s delightful!  Does the emulation magic also work for L1 user mode?  If so, couldn’t we drop the HyperV vclock entirely and just fold the adjustment into the core timekeeping data?  (Preferably the actual core data, which would require core changes, but it could plausibly be done in arch code, too.)

> 
> (Nothing like this exists for KVM-on-KVM, by the way, when L1's
> clocksource is 'kvmclock' L2s won't get a stable kvmclock clocksource.)
> 
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
  2018-10-03 10:20     ` Andy Lutomirski
@ 2018-10-03 12:01       ` Vitaly Kuznetsov
  2018-10-03 14:20         ` Andy Lutomirski
  0 siblings, 1 reply; 81+ messages in thread
From: Vitaly Kuznetsov @ 2018-10-03 12:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Thomas Gleixner, Marcelo Tosatti, Paolo Bonzini,
	Radim Krcmar, Wanpeng Li, LKML, X86 ML, Peter Zijlstra,
	Matt Rickard, Stephen Boyd, John Stultz, Florian Weimer,
	KY Srinivasan, devel, Linux Virtualization, Arnd Bergmann,
	Juergen Gross

Andy Lutomirski <luto@amacapital.net> writes:

>> On Oct 3, 2018, at 2:22 AM, Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
>> 
>> Andy Lutomirski <luto@kernel.org> writes:
>> 
>>> Hi Vitaly, Paolo, Radim, etc.,
>>> 
>> The notification you're talking about exists; it is called
>> reenlightenment (see 0092e4346f49 "x86/kvm: Support Hyper-V
>> reenlightenment"). When the TSC page changes (and this only happens when L1
>> is migrated to a different host with a different TSC frequency and TSC
>> scaling is not supported by the CPU) we receive an interrupt in L1 (at
>> this moment all TSC accesses are emulated which guarantees the
>> correctness of the readings), pause all L2 guests, update their kvmclock
>> structures with new data (we already know the new TSC frequency) and
>> then tell L0 that we're done and it can stop emulating TSC accesses.
>
> That’s delightful!  Does the emulation magic also work for L1 user
> mode?

As far as I understand - yes, all rdtsc* calls will trap into L0.

>  If so, couldn’t we drop the HyperV vclock entirely and just
> fold the adjustment into the core timekeeping data?  (Preferably the
> actual core data, which would require core changes, but it could
> plausibly be done in arch code, too.)

Not all Hyper-V hosts support reenlightenment notifications (and, if I'm
not mistaken, you need to enable nesting for the VM to get the feature -
and most VMs don't have this) so I think we'll have to keep Hyper-V
vclock for the time being.

-- 
Vitaly

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
  2018-10-03 12:01       ` Vitaly Kuznetsov
@ 2018-10-03 14:20         ` Andy Lutomirski
  2018-10-03 15:10           ` Thomas Gleixner
  0 siblings, 1 reply; 81+ messages in thread
From: Andy Lutomirski @ 2018-10-03 14:20 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Andy Lutomirski, Thomas Gleixner, Marcelo Tosatti, Paolo Bonzini,
	Radim Krcmar, Wanpeng Li, LKML, X86 ML, Peter Zijlstra,
	Matt Rickard, Stephen Boyd, John Stultz, Florian Weimer,
	KY Srinivasan, devel, Linux Virtualization, Arnd Bergmann,
	Juergen Gross



> On Oct 3, 2018, at 5:01 AM, Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
> 
> Andy Lutomirski <luto@amacapital.net> writes:
> 
>>> On Oct 3, 2018, at 2:22 AM, Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
>>> 
>>> Andy Lutomirski <luto@kernel.org> writes:
>>> 
>>>> Hi Vitaly, Paolo, Radim, etc.,
>>>> 
>>> The notification you're talking about exists; it is called
>>> reenlightenment (see 0092e4346f49 "x86/kvm: Support Hyper-V
>>> reenlightenment"). When the TSC page changes (and this only happens when L1
>>> is migrated to a different host with a different TSC frequency and TSC
>>> scaling is not supported by the CPU) we receive an interrupt in L1 (at
>>> this moment all TSC accesses are emulated which guarantees the
>>> correctness of the readings), pause all L2 guests, update their kvmclock
>>> structures with new data (we already know the new TSC frequency) and
>>> then tell L0 that we're done and it can stop emulating TSC accesses.
>> 
>> That’s delightful!  Does the emulation magic also work for L1 user
>> mode?
> 
> As far as I understand - yes, all rdtsc* calls will trap into L0.
> 
>> If so, couldn’t we drop the HyperV vclock entirely and just
>> fold the adjustment into the core timekeeping data?  (Preferably the
>> actual core data, which would require core changes, but it could
>> plausibly be done in arch code, too.)
> 
> Not all Hyper-V hosts support reenlightenment notifications (and, if I'm
> not mistaken, you need to enable nesting for the VM to get the feature -
> and most VMs don't have this) so I think we'll have to keep Hyper-V
> vclock for the time being.
> 
> 

But this does suggest that the correct way to pass a clock through to an L2 guest where L0 is HV is to make L1 use the “tsc” clock and L2 use kvmclock (or something newer and better).  This would require adding support for atomic frequency changes all the way through the timekeeping and arch code.

John, tglx, would that be okay or crazy?

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
  2018-10-03 14:20         ` Andy Lutomirski
@ 2018-10-03 15:10           ` Thomas Gleixner
  2018-10-03 16:18             ` Andy Lutomirski
  0 siblings, 1 reply; 81+ messages in thread
From: Thomas Gleixner @ 2018-10-03 15:10 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Vitaly Kuznetsov, Andy Lutomirski, Marcelo Tosatti,
	Paolo Bonzini, Radim Krcmar, Wanpeng Li, LKML, X86 ML,
	Peter Zijlstra, Matt Rickard, Stephen Boyd, John Stultz,
	Florian Weimer, KY Srinivasan, devel, Linux Virtualization,
	Arnd Bergmann, Juergen Gross

[-- Attachment #1: Type: text/plain, Size: 822 bytes --]

On Wed, 3 Oct 2018, Andy Lutomirski wrote:
> > On Oct 3, 2018, at 5:01 AM, Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
> > Not all Hyper-V hosts support reenlightenment notifications (and, if I'm
> > not mistaken, you need to enable nesting for the VM to get the feature -
> > and most VMs don't have this) so I think we'll have to keep Hyper-V
> > vclock for the time being.
> > 
> But this does suggest that the correct way to pass a clock through to an
> L2 guest where L0 is HV is to make L1 use the “tsc” clock and L2 use
> kvmclock (or something newer and better).  This would require adding
> support for atomic frequency changes all the way through the timekeeping
> and arch code.
>
> John, tglx, would that be okay or crazy?

Not sure what you mean. I think I lost you somewhere on the way.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
  2018-10-03 15:10           ` Thomas Gleixner
@ 2018-10-03 16:18             ` Andy Lutomirski
  0 siblings, 0 replies; 81+ messages in thread
From: Andy Lutomirski @ 2018-10-03 16:18 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Vitaly Kuznetsov, Andrew Lutomirski, Marcelo Tosatti,
	Paolo Bonzini, Radim Krcmar, Wanpeng Li, LKML, X86 ML,
	Peter Zijlstra, Matt Rickard, Stephen Boyd, John Stultz,
	Florian Weimer, KY Srinivasan, devel, Linux Virtualization,
	Arnd Bergmann, Juergen Gross

On Wed, Oct 3, 2018 at 8:10 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Wed, 3 Oct 2018, Andy Lutomirski wrote:
> > > On Oct 3, 2018, at 5:01 AM, Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
> > > Not all Hyper-V hosts support reenlightenment notifications (and, if I'm
> > > not mistaken, you need to enable nesting for the VM to get the feature -
> > > and most VMs don't have this) so I think we'll have to keep Hyper-V
> > > vclock for the time being.
> > >
> > But this does suggest that the correct way to pass a clock through to an
> > L2 guest where L0 is HV is to make L1 use the “tsc” clock and L2 use
> > kvmclock (or something newer and better).  This would require adding
> > support for atomic frequency changes all the way through the timekeeping
> > and arch code.
> >
> > John, tglx, would that be okay or crazy?
>
> Not sure what you mean. I think I lost you somewhere on the way.
>

What I mean is: currently we have a clocksource called
""hyperv_clocksource_tsc_page".  Reading it does:

static u64 read_hv_clock_tsc(struct clocksource *arg)
{
    u64 current_tick = hv_read_tsc_page(tsc_pg);

    if (current_tick == U64_MAX)
        rdmsrl(HV_X64_MSR_TIME_REF_COUNT, current_tick);

    return current_tick;
}

From Vitaly's email, it sounds like, on most (all?) hyperv systems
with nesting enabled, this clock is better behaved than it appears.
It sounds like the read behavior is that current_tick will never be
U64_MAX -- instead, the clock always works and, more importantly, the
actual scaling factor and offset only change observably on *guest*
request.

So why don't we improve the actual "tsc" clocksource to understand
this?  ISTM the best model would be where the
__clocksource_update_freq_xyz() mechanism gets called so we can use it
like this:

clocksource_begin_update();
clocksource_update_mult_shift();
tell_hv_that_we_reenlightened();
clocksource_end_update();

Where clocksource_begin_update() bumps the seqcount for the vDSO and
takes all the locks, clocksource_update_mult_shift() updates
everything, and clocksource_end_update() makes the updated parameters
usable.
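
A minimal sketch of what those (currently nonexistent) helpers could
look like, assuming the usual timekeeper seqcount pattern so that vDSO
readers retry instead of ever seeing a half-updated mult/shift pair:

/* hypothetical; none of these helpers exist in the tree today */
static unsigned long tk_update_flags;

void clocksource_begin_update(void)
{
    raw_spin_lock_irqsave(&timekeeper_lock, tk_update_flags);
    write_seqcount_begin(&tk_core.seq);     /* readers now retry */
}

void clocksource_update_mult_shift(struct clocksource *cs, u32 mult, u32 shift)
{
    cs->mult  = mult;   /* e.g. new scaling taken from the TSC page */
    cs->shift = shift;
    /* a real version would also fix up the timekeeper's cached copies */
}

void clocksource_end_update(void)
{
    write_seqcount_end(&tk_core.seq);
    raw_spin_unlock_irqrestore(&timekeeper_lock, tk_update_flags);
}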

(AFAICT there are currently no clocksources at all in the entire
kernel that update their frequency on the fly using
__clocksource_update_xyz().  Unless I'm missing something, the x86 tsc
cpufreq hooks don't call into the core timekeeping at all, so I'm
assuming that the tsc clocksource is simply unusable on systems that
change its frequency.)

Or we could keep the hyperv_clocksource_tsc_page clocksource but make
it use VCLOCK_TSC and a similar update mechanism.

I don't personally want to do this, because the timekeeping code is
subtle and I'm unfamiliar with it.  And I don't have *that* many spare
cycles :)

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
  2018-10-03  5:15 ` Andy Lutomirski
  2018-10-03  9:22   ` Vitaly Kuznetsov
@ 2018-10-03 19:00   ` Marcelo Tosatti
  2018-10-03 19:05     ` [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support\ Marcelo Tosatti
  2018-10-03 22:32     ` [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support Andy Lutomirski
  1 sibling, 2 replies; 81+ messages in thread
From: Marcelo Tosatti @ 2018-10-03 19:00 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Paolo Bonzini, Radim Krcmar, Wanpeng Li, LKML,
	X86 ML, Peter Zijlstra, Matt Rickard, Stephen Boyd, John Stultz,
	Florian Weimer, KY Srinivasan, Vitaly Kuznetsov, devel,
	Linux Virtualization, Arnd Bergmann, Juergen Gross

On Tue, Oct 02, 2018 at 10:15:49PM -0700, Andy Lutomirski wrote:
> Hi Vitaly, Paolo, Radim, etc.,
> 
> On Fri, Sep 14, 2018 at 5:52 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime()
> > implementation, which extended the clockid switch case and added yet
> > another slightly different copy of the same code.
> >
> > Especially the extended switch case is problematic as the compiler tends to
> > generate a jump table which then requires to use retpolines. If jump tables
> > are disabled it adds yet another conditional to the existing maze.
> >
> > This series takes a different approach by consolidating the almost
> > identical functions into one implementation for high resolution clocks and
> > one for the coarse grained clock ids by storing the base data for each
> > clock id in an array which is indexed by the clock id.
> >
> 
> I was trying to understand more of the implications of this patch
> series, and I was again reminded that there is an entire extra copy of
> the vclock reading code in arch/x86/kvm/x86.c.  And the purpose of
> that code is very, very opaque.
> 
> Can one of you explain what the code is even doing?  From a couple of
> attempts to read through it, it's a whole bunch of
> probably-extremely-buggy code that, 

Yes, probably.

> drumroll please, tries to atomically read the TSC value and the time.  And decide whether the
> result is "based on the TSC".  

I think "based on the TSC" refers to whether TSC clocksource is being
used.

> And then synthesizes a TSC-to-ns
> multiplier and shift, based on *something other than the actual
> multiply and shift used*.
> 
> IOW, unless I'm totally misunderstanding it, the code digs into the
> private arch clocksource data intended for the vDSO, uses a poorly
> maintained copy of the vDSO code to read the time (instead of doing
> the sane thing and using the kernel interfaces for this), and
> propagates a totally made up copy to the guest.

I posted kernel interfaces for this, and it was suggested to 
instead write an "in-kernel user of pvclock data".

If you can get kernel interfaces to replace that, go for it. I prefer
kernel interfaces as well.

>  And gets it entirely
> wrong when doing nested virt, since, unless there's some secret in
> this maze, it doesn't actually use the scaling factor from the host
> when it tells the guest what to do.
> 
> I am really, seriously tempted to send a patch to simply delete all
> this code.  

If your patch which deletes the code gets the necessary features right,
sure, go for it.

> The correct way to do it is to hook

Can you expand on the correct way to do it?

> And I don't see how it's even possible to pass kvmclock correctly to
> the L2 guest when L0 is hyperv.  KVM could pass *hyperv's* clock, but
> L1 isn't notified when the data structure changes, so how the heck is
> it supposed to update the kvmclock structure?

I don't parse your question.

> 
> --Andy

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support\
  2018-10-03 19:00   ` Marcelo Tosatti
@ 2018-10-03 19:05     ` Marcelo Tosatti
  2018-10-03 22:32     ` [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support Andy Lutomirski
  1 sibling, 0 replies; 81+ messages in thread
From: Marcelo Tosatti @ 2018-10-03 19:05 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Paolo Bonzini, Radim Krcmar, Wanpeng Li, LKML,
	X86 ML, Peter Zijlstra, Matt Rickard, Stephen Boyd, John Stultz,
	Florian Weimer, KY Srinivasan, Vitaly Kuznetsov, devel,
	Linux Virtualization, Arnd Bergmann, Juergen Gross

On Wed, Oct 03, 2018 at 04:00:29PM -0300, Marcelo Tosatti wrote:
> On Tue, Oct 02, 2018 at 10:15:49PM -0700, Andy Lutomirski wrote:
> > Hi Vitaly, Paolo, Radim, etc.,
> > 
> > On Fri, Sep 14, 2018 at 5:52 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> > >
> > > Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime()
> > > implementation, which extended the clockid switch case and added yet
> > > another slightly different copy of the same code.
> > >
> > > Especially the extended switch case is problematic as the compiler tends to
> > > generate a jump table which then requires to use retpolines. If jump tables
> > > are disabled it adds yet another conditional to the existing maze.
> > >
> > > This series takes a different approach by consolidating the almost
> > > identical functions into one implementation for high resolution clocks and
> > > one for the coarse grained clock ids by storing the base data for each
> > > clock id in an array which is indexed by the clock id.
> > >
> > 
> > I was trying to understand more of the implications of this patch
> > series, and I was again reminded that there is an entire extra copy of
> > the vclock reading code in arch/x86/kvm/x86.c.  And the purpose of
> > that code is very, very opaque.
> > 
> > Can one of you explain what the code is even doing?  From a couple of
> > attempts to read through it, it's a whole bunch of
> > probably-extremely-buggy code that, 
> 
> Yes, probably.
> 
> > drumroll please, tries to atomically read the TSC value and the time.  And decide whether the
> > result is "based on the TSC".  
> 
> I think "based on the TSC" refers to whether TSC clocksource is being
> used.
> 
> > And then synthesizes a TSC-to-ns
> > multiplier and shift, based on *something other than the actual
> > multiply and shift used*.
> > 
> > IOW, unless I'm totally misunderstanding it, the code digs into the
> > private arch clocksource data intended for the vDSO, uses a poorly
> > maintained copy of the vDSO code to read the time (instead of doing
> > the sane thing and using the kernel interfaces for this), and
> > propagates a totally made up copy to the guest.
> 
> I posted kernel interfaces for this, and it was suggested to 
> instead write an "in-kernel user of pvclock data".
> 
> If you can get kernel interfaces to replace that, go for it. I prefer
> kernel interfaces as well.

And cleanup patches, to make that code look nicer, are also very
welcome!


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
  2018-10-03  9:22   ` Vitaly Kuznetsov
  2018-10-03 10:20     ` Andy Lutomirski
@ 2018-10-03 19:06     ` Marcelo Tosatti
  2018-10-04  7:54       ` Vitaly Kuznetsov
  1 sibling, 1 reply; 81+ messages in thread
From: Marcelo Tosatti @ 2018-10-03 19:06 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Andy Lutomirski, Thomas Gleixner, Paolo Bonzini, Radim Krcmar,
	Wanpeng Li, LKML, X86 ML, Peter Zijlstra, Matt Rickard,
	Stephen Boyd, John Stultz, Florian Weimer, KY Srinivasan, devel,
	Linux Virtualization, Arnd Bergmann, Juergen Gross

On Wed, Oct 03, 2018 at 11:22:58AM +0200, Vitaly Kuznetsov wrote:
> Andy Lutomirski <luto@kernel.org> writes:
> 
> > Hi Vitaly, Paolo, Radim, etc.,
> >
> > On Fri, Sep 14, 2018 at 5:52 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> >>
> >> Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime()
> >> implementation, which extended the clockid switch case and added yet
> >> another slightly different copy of the same code.
> >>
> >> Especially the extended switch case is problematic as the compiler tends to
> >> generate a jump table which then requires to use retpolines. If jump tables
> >> are disabled it adds yet another conditional to the existing maze.
> >>
> >> This series takes a different approach by consolidating the almost
> >> identical functions into one implementation for high resolution clocks and
> >> one for the coarse grained clock ids by storing the base data for each
> >> clock id in an array which is indexed by the clock id.
> >>
> >
> > I was trying to understand more of the implications of this patch
> > series, and I was again reminded that there is an entire extra copy of
> > the vclock reading code in arch/x86/kvm/x86.c.  And the purpose of
> > that code is very, very opaque.
> >
> > Can one of you explain what the code is even doing?  From a couple of
> > attempts to read through it, it's a whole bunch of
> > probably-extremely-buggy code that, drumroll please, tries to
> > atomically read the TSC value and the time.  And decide whether the
> > result is "based on the TSC".  And then synthesizes a TSC-to-ns
> > multiplier and shift, based on *something other than the actual
> > multiply and shift used*.
> >
> > IOW, unless I'm totally misunderstanding it, the code digs into the
> > private arch clocksource data intended for the vDSO, uses a poorly
> > maintained copy of the vDSO code to read the time (instead of doing
> > the sane thing and using the kernel interfaces for this), and
> > propagates a totally made up copy to the guest.  And gets it entirely
> > wrong when doing nested virt, since, unless there's some secret in
> > this maze, it doesn't actually use the scaling factor from the host
> > when it tells the guest what to do.
> >
> > I am really, seriously tempted to send a patch to simply delete all
> > this code.  The correct way to do it is to hook
> 
> "I have discovered a truly marvelous proof of this, which this margin is
> too narrow to contain" :-)
> 
> There is a very long history of different (hardware) issues Marcelo was
> fighting with, and the current code is the surviving Frankenstein.

Right, the code has to handle different TSC modes.

>  E.g. it
> is very, very unclear what "catchup", "always catchup" and
> masterclock-less mode in general are and if we still need them.

Catchup means handling an exposed (to the guest) TSC frequency smaller
than the HW TSC frequency.

That is "frankenstein" code, could be removed.

> That said I'm all for simplification. I'm not sure if we still need to
> care about buggy hardware though.

What simplification is that again? 


> >
> > And I don't see how it's even possible to pass kvmclock correctly to
> > the L2 guest when L0 is hyperv.  KVM could pass *hyperv's* clock, but
> > L1 isn't notified when the data structure changes, so how the heck is
> > it supposed to update the kvmclock structure?
> 
> Well, this kind of works in the following way:
> L1's clocksource is 'tsc_page' which is, basically, a complement to the
> TSC: two numbers provided by L0, an offset and a scale, and KVM was
> taught to treat this clocksource as a good one (see b0c39dc68e3b
> "x86/kvm: Pass stable clocksource to guests when running nested on
> Hyper-V").
> 
> The notification you're talking about exists; it is called
> reenlightenment (see 0092e4346f49 "x86/kvm: Support Hyper-V
> reenlightenment"). When the TSC page changes (and this only happens when L1
> is migrated to a different host with a different TSC frequency and TSC
> scaling is not supported by the CPU) we receive an interrupt in L1 (at
> this moment all TSC accesses are emulated which guarantees the
> correctness of the readings), pause all L2 guests, update their kvmclock
> structures with new data (we already know the new TSC frequency) and
> then tell L0 that we're done and it can stop emulating TSC accesses.
> 
> (Nothing like this exists for KVM-on-KVM, by the way, when L1's
> clocksource is 'kvmclock' L2s won't get a stable kvmclock clocksource.)
> 
> -- 
> Vitaly

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
  2018-10-03 19:00   ` Marcelo Tosatti
  2018-10-03 19:05     ` [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support\ Marcelo Tosatti
@ 2018-10-03 22:32     ` Andy Lutomirski
  2018-10-04 16:37       ` Marcelo Tosatti
  1 sibling, 1 reply; 81+ messages in thread
From: Andy Lutomirski @ 2018-10-03 22:32 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Andrew Lutomirski, Thomas Gleixner, Paolo Bonzini, Radim Krcmar,
	Wanpeng Li, LKML, X86 ML, Peter Zijlstra, Matt Rickard,
	Stephen Boyd, John Stultz, Florian Weimer, KY Srinivasan,
	Vitaly Kuznetsov, devel, Linux Virtualization, Arnd Bergmann,
	Juergen Gross

On Wed, Oct 3, 2018 at 12:01 PM Marcelo Tosatti <mtosatti@redhat.com> wrote:
>
> On Tue, Oct 02, 2018 at 10:15:49PM -0700, Andy Lutomirski wrote:
> > Hi Vitaly, Paolo, Radim, etc.,
> >
> > On Fri, Sep 14, 2018 at 5:52 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> > >
> > > Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime()
> > > implementation, which extended the clockid switch case and added yet
> > > another slightly different copy of the same code.
> > >
> > > Especially the extended switch case is problematic as the compiler tends to
> > > generate a jump table which then requires to use retpolines. If jump tables
> > > are disabled it adds yet another conditional to the existing maze.
> > >
> > > This series takes a different approach by consolidating the almost
> > > identical functions into one implementation for high resolution clocks and
> > > one for the coarse grained clock ids by storing the base data for each
> > > clock id in an array which is indexed by the clock id.
> > >
> >
> > I was trying to understand more of the implications of this patch
> > series, and I was again reminded that there is an entire extra copy of
> > the vclock reading code in arch/x86/kvm/x86.c.  And the purpose of
> > that code is very, very opaque.
> >
> > Can one of you explain what the code is even doing?  From a couple of
> > attempts to read through it, it's a whole bunch of
> > probably-extremely-buggy code that,
>
> Yes, probably.
>
> > drumroll please, tries to atomically read the TSC value and the time.  And decide whether the
> > result is "based on the TSC".
>
> I think "based on the TSC" refers to whether TSC clocksource is being
> used.
>
> > And then synthesizes a TSC-to-ns
> > multiplier and shift, based on *something other than the actual
> > multiply and shift used*.
> >
> > IOW, unless I'm totally misunderstanding it, the code digs into the
> > private arch clocksource data intended for the vDSO, uses a poorly
> > maintained copy of the vDSO code to read the time (instead of doing
> > the sane thing and using the kernel interfaces for this), and
> > propagates a totally made up copy to the guest.
>
> I posted kernel interfaces for this, and it was suggested to
> instead write an "in-kernel user of pvclock data".
>
> If you can get kernel interfaces to replace that, go for it. I prefer
> kernel interfaces as well.
>
> >  And gets it entirely
> > wrong when doing nested virt, since, unless there's some secret in
> > this maze, it doesn't actually use the scaling factor from the host
> > when it tells the guest what to do.
> >
> > I am really, seriously tempted to send a patch to simply delete all
> > this code.
>
> If your patch which deletes the code gets the necessary features right,
> sure, go for it.
>
> > The correct way to do it is to hook
>
> Can you expand on the correct way to do it?
>
> > And I don't see how it's even possible to pass kvmclock correctly to
> > the L2 guest when L0 is hyperv.  KVM could pass *hyperv's* clock, but
> > L1 isn't notified when the data structure changes, so how the heck is
> > it supposed to update the kvmclock structure?
>
> I don't parse your question.

Let me ask it more intelligently: when the "reenlightenment" IRQ
happens, what tells KVM to do its own update for its guests?

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
  2018-10-03 19:06     ` Marcelo Tosatti
@ 2018-10-04  7:54       ` Vitaly Kuznetsov
  2018-10-04  8:11         ` Peter Zijlstra
                           ` (2 more replies)
  0 siblings, 3 replies; 81+ messages in thread
From: Vitaly Kuznetsov @ 2018-10-04  7:54 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Andy Lutomirski, Thomas Gleixner, Paolo Bonzini, Radim Krcmar,
	Wanpeng Li, LKML, X86 ML, Peter Zijlstra, Matt Rickard,
	Stephen Boyd, John Stultz, Florian Weimer, KY Srinivasan, devel,
	Linux Virtualization, Arnd Bergmann, Juergen Gross

Marcelo Tosatti <mtosatti@redhat.com> writes:

> On Wed, Oct 03, 2018 at 11:22:58AM +0200, Vitaly Kuznetsov wrote:
>> 
>> There is a very long history of different (hardware) issues Marcelo was
>> fighting with, and the current code is the surviving Frankenstein.
>
> Right, the code has to handle different TSC modes.
>
>>  E.g. it
>> is very, very unclear what "catchup", "always catchup" and
>> masterclock-less mode in general are and if we still need them.
>
> Catchup means handling an exposed (to the guest) TSC frequency smaller
> than the HW TSC frequency.
>
> That is "frankenstein" code, could be removed.
>
>> That said I'm all for simplification. I'm not sure if we still need to
>> care about buggy hardware though.
>
> What simplification is that again? 
>

I was hoping to hear this from you :-) If I am to suggest how we can
move forward I'd propose:
- Check if pure TSC can be used on SkyLake+ systems (where TSC scaling
is supported).
- Check if non-masterclock mode is still needed. E.g. HyperV's TSC page
clocksource is a single page for the whole VM, not a per-cpu thing. Can
we think that all the buggy hardware is already gone?

-- 
Vitaly

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
  2018-10-04  7:54       ` Vitaly Kuznetsov
@ 2018-10-04  8:11         ` Peter Zijlstra
  2018-10-04 14:00           ` Andy Lutomirski
  2018-10-04 12:00         ` Paolo Bonzini
  2018-10-05 21:18         ` Marcelo Tosatti
  2 siblings, 1 reply; 81+ messages in thread
From: Peter Zijlstra @ 2018-10-04  8:11 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Marcelo Tosatti, Andy Lutomirski, Thomas Gleixner, Paolo Bonzini,
	Radim Krcmar, Wanpeng Li, LKML, X86 ML, Matt Rickard,
	Stephen Boyd, John Stultz, Florian Weimer, KY Srinivasan, devel,
	Linux Virtualization, Arnd Bergmann, Juergen Gross

On Thu, Oct 04, 2018 at 09:54:45AM +0200, Vitaly Kuznetsov wrote:
> I was hoping to hear this from you :-) If I am to suggest how we can
> move forward I'd propose:
> - Check if pure TSC can be used on SkyLake+ systems (where TSC scaling
> is supported).
> - Check if non-masterclock mode is still needed. E.g. HyperV's TSC page
> clocksource is a single page for the whole VM, not a per-cpu thing. Can
> we think that all the buggy hardware is already gone?

No, and it is not the hardware you have to worry about (mostly), it is
the frigging PoS firmware people put on it.

Ever since Nehalem TSC is stable (unless you get to >4 socket systems,
after which it still can be, but bets are off). But even relatively
recent systems fail the TSC sync test because firmware messes it up by
writing to either MSR_TSC or MSR_TSC_ADJUST.

But the thing is, if the TSC is not synced, you cannot use it for
timekeeping, full stop. So having a single page is fine, it either
contains a mult/shift that is valid, or it indicates TSC is messed up
and you fall back to something else.

There is no in-between there.

For sched_clock we can still use the global page, because the rate will
still be the same for each cpu, it's just offset between CPUs and the
code compensates for that.
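
Roughly, in the spirit of the x86 cyc2ns machinery (a sketch, not the
actual implementation): the rate is shared, only the offset is per-CPU:

struct cyc2ns_sketch {
    u32 mult;           /* shared: same rate on every CPU */
    u32 shift;
    u64 ns_offset;      /* per-CPU: absorbs the inter-CPU TSC offsets */
};

static DEFINE_PER_CPU(struct cyc2ns_sketch, cyc2ns_sketch);

static u64 sched_clock_sketch(void)
{
    struct cyc2ns_sketch *d = this_cpu_ptr(&cyc2ns_sketch);

    return mul_u64_u32_shr(rdtsc(), d->mult, d->shift) + d->ns_offset;
}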

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
  2018-10-04  7:54       ` Vitaly Kuznetsov
  2018-10-04  8:11         ` Peter Zijlstra
@ 2018-10-04 12:00         ` Paolo Bonzini
  2018-10-04 14:04           ` Andy Lutomirski
  2018-10-05 21:18         ` Marcelo Tosatti
  2 siblings, 1 reply; 81+ messages in thread
From: Paolo Bonzini @ 2018-10-04 12:00 UTC (permalink / raw)
  To: Vitaly Kuznetsov, Marcelo Tosatti
  Cc: Andy Lutomirski, Thomas Gleixner, Radim Krcmar, Wanpeng Li, LKML,
	X86 ML, Peter Zijlstra, Matt Rickard, Stephen Boyd, John Stultz,
	Florian Weimer, KY Srinivasan, devel, Linux Virtualization,
	Arnd Bergmann, Juergen Gross

On 04/10/2018 09:54, Vitaly Kuznetsov wrote:
> - Check if pure TSC can be used on SkyLake+ systems (where TSC scaling
> is supported).

Not if you want to migrate to pre-Skylake systems.

> - Check if non-masterclock mode is still needed. E.g. HyperV's TSC page
> clocksource is a single page for the whole VM, not a per-cpu thing. Can
> we think that all the buggy hardware is already gone?

No. :(  We still get reports whenever we break 2007-2008 hardware.

Paolo

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
  2018-10-04  8:11         ` Peter Zijlstra
@ 2018-10-04 14:00           ` Andy Lutomirski
  2018-10-04 19:31             ` Peter Zijlstra
  0 siblings, 1 reply; 81+ messages in thread
From: Andy Lutomirski @ 2018-10-04 14:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vitaly Kuznetsov, Marcelo Tosatti, Andy Lutomirski,
	Thomas Gleixner, Paolo Bonzini, Radim Krcmar, Wanpeng Li, LKML,
	X86 ML, Matt Rickard, Stephen Boyd, John Stultz, Florian Weimer,
	KY Srinivasan, devel, Linux Virtualization, Arnd Bergmann,
	Juergen Gross



> On Oct 4, 2018, at 1:11 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
>> On Thu, Oct 04, 2018 at 09:54:45AM +0200, Vitaly Kuznetsov wrote:
>> I was hoping to hear this from you :-) If I am to suggest how we can
>> move forward I'd propose:
>> - Check if pure TSC can be used on SkyLake+ systems (where TSC scaling
>> is supported).
>> - Check if non-masterclock mode is still needed. E.g. HyperV's TSC page
>> clocksource is a single page for the whole VM, not a per-cpu thing. Can
>> we think that all the buggy hardware is already gone?
> 
> No, and it is not the hardware you have to worry about (mostly), it is
> the frigging PoS firmware people put on it.
> 
> Ever since Nehalem TSC is stable (unless you get to >4 socket systems,
> after which it still can be, but bets are off). But even relatively
> recent systems fail the TSC sync test because firmware messes it up by
> writing to either MSR_TSC or MSR_TSC_ADJUST.
> 
> But the thing is, if the TSC is not synced, you cannot use it for
> timekeeping, full stop. So having a single page is fine, it either
> contains a mult/shift that is valid, or it indicates TSC is messed up
> and you fall back to something else.
> 
> There is no inbetween there.
> 
> For sched_clock we can still use the global page, because the rate will
> still be the same for each cpu, it's just offset between CPUs and the
> code compensates for that.

But if we’re in a KVM guest, then the clock will jump around on the same *vCPU* when the vCPU migrates.

But I don’t see how kvmclock helps here, since I don’t think it’s used for sched_clock.


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
  2018-10-04 12:00         ` Paolo Bonzini
@ 2018-10-04 14:04           ` Andy Lutomirski
  0 siblings, 0 replies; 81+ messages in thread
From: Andy Lutomirski @ 2018-10-04 14:04 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Vitaly Kuznetsov, Marcelo Tosatti, Andy Lutomirski,
	Thomas Gleixner, Radim Krcmar, Wanpeng Li, LKML, X86 ML,
	Peter Zijlstra, Matt Rickard, Stephen Boyd, John Stultz,
	Florian Weimer, KY Srinivasan, devel, Linux Virtualization,
	Arnd Bergmann, Juergen Gross


> On Oct 4, 2018, at 5:00 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
>> On 04/10/2018 09:54, Vitaly Kuznetsov wrote:
>> - Check if pure TSC can be used on SkyLake+ systems (where TSC scaling
>> is supported).
> 
> Not if you want to migrate to pre-Skylake systems.
> 
>> - Check if non-masterclock mode is still needed. E.g. HyperV's TSC page
>> clocksource is a single page for the whole VM, not a per-cpu thing. Can
>> we think that all the buggy hardware is already gone?
> 
> No. :(  We still get reports whenever we break 2007-2008 hardware.
> 
> 

Does the KVM non-masterclock mode actually help?  It’s not clear to me exactly how it’s supposed to work, but it seems like it’s trying to expose per-vCPU adjustments to the guest, which is dubious at best: the guest can’t validly use them for anything other than sched_clock, since they aren’t fully corrected by anything KVM can do.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
  2018-10-03 22:32     ` [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support Andy Lutomirski
@ 2018-10-04 16:37       ` Marcelo Tosatti
  2018-10-04 17:08         ` Andy Lutomirski
  0 siblings, 1 reply; 81+ messages in thread
From: Marcelo Tosatti @ 2018-10-04 16:37 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, Paolo Bonzini, Radim Krcmar, Wanpeng Li, LKML,
	X86 ML, Peter Zijlstra, Matt Rickard, Stephen Boyd, John Stultz,
	Florian Weimer, KY Srinivasan, Vitaly Kuznetsov, devel,
	Linux Virtualization, Arnd Bergmann, Juergen Gross

On Wed, Oct 03, 2018 at 03:32:08PM -0700, Andy Lutomirski wrote:
> On Wed, Oct 3, 2018 at 12:01 PM Marcelo Tosatti <mtosatti@redhat.com> wrote:
> >
> > On Tue, Oct 02, 2018 at 10:15:49PM -0700, Andy Lutomirski wrote:
> > > Hi Vitaly, Paolo, Radim, etc.,
> > >
> > > On Fri, Sep 14, 2018 at 5:52 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> > > >
> > > > Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime()
> > > > implementation, which extended the clockid switch case and added yet
> > > > another slightly different copy of the same code.
> > > >
> > > > Especially the extended switch case is problematic as the compiler tends to
> > > > generate a jump table which then requires to use retpolines. If jump tables
> > > > are disabled it adds yet another conditional to the existing maze.
> > > >
> > > > This series takes a different approach by consolidating the almost
> > > > identical functions into one implementation for high resolution clocks and
> > > > one for the coarse grained clock ids by storing the base data for each
> > > > clock id in an array which is indexed by the clock id.
> > > >
> > >
> > > I was trying to understand more of the implications of this patch
> > > series, and I was again reminded that there is an entire extra copy of
> > > the vclock reading code in arch/x86/kvm/x86.c.  And the purpose of
> > > that code is very, very opaque.
> > >
> > > Can one of you explain what the code is even doing?  From a couple of
> > > attempts to read through it, it's a whole bunch of
> > > probably-extremely-buggy code that,
> >
> > Yes, probably.
> >
> > > drumroll please, tries to atomically read the TSC value and the time.  And decide whether the
> > > result is "based on the TSC".
> >
> > I think "based on the TSC" refers to whether TSC clocksource is being
> > used.
> >
> > > And then synthesizes a TSC-to-ns
> > > multiplier and shift, based on *something other than the actual
> > > multiply and shift used*.
> > >
> > > IOW, unless I'm totally misunderstanding it, the code digs into the
> > > private arch clocksource data intended for the vDSO, uses a poorly
> > > maintained copy of the vDSO code to read the time (instead of doing
> > > the sane thing and using the kernel interfaces for this), and
> > > propagates a totally made up copy to the guest.
> >
> > I posted kernel interfaces for this, and it was suggested to
> > instead write an "in-kernel user of pvclock data".
> >
> > If you can get kernel interfaces to replace that, go for it. I prefer
> > kernel interfaces as well.
> >
> > >  And gets it entirely
> > > wrong when doing nested virt, since, unless there's some secret in
> > > this maze, it doesn't actually use the scaling factor from the host
> > > when it tells the guest what to do.
> > >
> > > I am really, seriously tempted to send a patch to simply delete all
> > > this code.
> >
> > If your patch which deletes the code gets the necessary features right,
> > sure, go for it.
> >
> > > The correct way to do it is to hook
> >
> > Can you expand on the correct way to do it?
> >
> > > And I don't see how it's even possible to pass kvmclock correctly to
> > > the L2 guest when L0 is hyperv.  KVM could pass *hyperv's* clock, but
> > > L1 isn't notified when the data structure changes, so how the heck is
> > > it supposed to update the kvmclock structure?
> >
> > I don't parse your question.
> 
> Let me ask it more intelligently: when the "reenlightenment" IRQ
> happens, what tells KVM to do its own update for its guests?

Update of what, and why does it need to update anything from an IRQ?

The update I can think of is from the host kernel clocksource,
for which there is a notifier.



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
  2018-10-04 16:37       ` Marcelo Tosatti
@ 2018-10-04 17:08         ` Andy Lutomirski
  2018-10-04 17:28           ` Vitaly Kuznetsov
  0 siblings, 1 reply; 81+ messages in thread
From: Andy Lutomirski @ 2018-10-04 17:08 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Andrew Lutomirski, Thomas Gleixner, Paolo Bonzini, Radim Krcmar,
	Wanpeng Li, LKML, X86 ML, Peter Zijlstra, Matt Rickard,
	Stephen Boyd, John Stultz, Florian Weimer, KY Srinivasan,
	Vitaly Kuznetsov, devel, Linux Virtualization, Arnd Bergmann,
	Juergen Gross

On Thu, Oct 4, 2018 at 9:43 AM Marcelo Tosatti <mtosatti@redhat.com> wrote:
>
> On Wed, Oct 03, 2018 at 03:32:08PM -0700, Andy Lutomirski wrote:
> > On Wed, Oct 3, 2018 at 12:01 PM Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > >
> > > On Tue, Oct 02, 2018 at 10:15:49PM -0700, Andy Lutomirski wrote:
> > > > Hi Vitaly, Paolo, Radim, etc.,
> > > >
> > > > On Fri, Sep 14, 2018 at 5:52 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> > > > >
> > > > > Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime()
> > > > > implementation, which extended the clockid switch case and added yet
> > > > > another slightly different copy of the same code.
> > > > >
> > > > > Especially the extended switch case is problematic as the compiler tends to
> > > > > generate a jump table which then requires to use retpolines. If jump tables
> > > > > are disabled it adds yet another conditional to the existing maze.
> > > > >
> > > > > This series takes a different approach by consolidating the almost
> > > > > identical functions into one implementation for high resolution clocks and
> > > > > one for the coarse grained clock ids by storing the base data for each
> > > > > clock id in an array which is indexed by the clock id.
> > > > >
> > > >
> > > > I was trying to understand more of the implications of this patch
> > > > series, and I was again reminded that there is an entire extra copy of
> > > > the vclock reading code in arch/x86/kvm/x86.c.  And the purpose of
> > > > that code is very, very opaque.
> > > >
> > > > Can one of you explain what the code is even doing?  From a couple of
> > > > attempts to read through it, it's a whole bunch of
> > > > probably-extremely-buggy code that,
> > >
> > > Yes, probably.
> > >
> > > > drumroll please, tries to atomically read the TSC value and the time.  And decide whether the
> > > > result is "based on the TSC".
> > >
> > > I think "based on the TSC" refers to whether TSC clocksource is being
> > > used.
> > >
> > > > And then synthesizes a TSC-to-ns
> > > > multiplier and shift, based on *something other than the actual
> > > > multiply and shift used*.
> > > >
> > > > IOW, unless I'm totally misunderstanding it, the code digs into the
> > > > private arch clocksource data intended for the vDSO, uses a poorly
> > > > maintained copy of the vDSO code to read the time (instead of doing
> > > > the sane thing and using the kernel interfaces for this), and
> > > > propagates a totally made up copy to the guest.
> > >
> > > I posted kernel interfaces for this, and it was suggested to
> > > instead write an "in-kernel user of pvclock data".
> > >
> > > If you can get kernel interfaces to replace that, go for it. I prefer
> > > kernel interfaces as well.
> > >
> > > >  And gets it entirely
> > > > wrong when doing nested virt, since, unless there's some secret in
> > > > this maze, it doesn't actually use the scaling factor from the host
> > > > when it tells the guest what to do.
> > > >
> > > > I am really, seriously tempted to send a patch to simply delete all
> > > > this code.
> > >
> > > If your patch which deletes the code gets the necessary features right,
> > > sure, go for it.
> > >
> > > > The correct way to do it is to hook
> > >
> > > Can you expand on the correct way to do it?
> > >
> > > > And I don't see how it's even possible to pass kvmclock correctly to
> > > > the L2 guest when L0 is hyperv.  KVM could pass *hyperv's* clock, but
> > > > L1 isn't notified when the data structure changes, so how the heck is
> > > > it supposed to update the kvmclock structure?
> > >
> > > I don't parse your question.
> >
> > Let me ask it more intelligently: when the "reenlightenment" IRQ
> > happens, what tells KVM to do its own update for its guests?
>
> Update of what, and why does it need to update anything from an IRQ?
>
> The update I can think of is from the host kernel clocksource,
> for which there is a notifier.
>
>

Unless I've missed some serious magic, L2 guests see kvmclock, not hv.
So we have the following sequence of events:

 - L0 migrates the whole VM.  Starting now, RDTSC is emulated to match
the old host, which applies in L1 and L2.

 - An IRQ is queued to L1.

 - L1 acknowledges that it noticed the TSC change.  RDTSC stops being
emulated for L1 and L2.

 - L2 reads the TSC.  It has no idea that anything changed, and it
gets the wrong answer.

 - At some point, kvm clock updates.

What prevents this?  Vitaly, am I missing some subtlety of what
actually happens?

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
  2018-10-04 17:08         ` Andy Lutomirski
@ 2018-10-04 17:28           ` Vitaly Kuznetsov
  0 siblings, 0 replies; 81+ messages in thread
From: Vitaly Kuznetsov @ 2018-10-04 17:28 UTC (permalink / raw)
  To: Andy Lutomirski, Marcelo Tosatti
  Cc: Andrew Lutomirski, Thomas Gleixner, Paolo Bonzini, Radim Krcmar,
	Wanpeng Li, LKML, X86 ML, Peter Zijlstra, Matt Rickard,
	Stephen Boyd, John Stultz, Florian Weimer, KY Srinivasan, devel,
	Linux Virtualization, Arnd Bergmann, Juergen Gross

Andy Lutomirski <luto@kernel.org> writes:

> On Thu, Oct 4, 2018 at 9:43 AM Marcelo Tosatti <mtosatti@redhat.com> wrote:
>>
>> On Wed, Oct 03, 2018 at 03:32:08PM -0700, Andy Lutomirski wrote:
>> > On Wed, Oct 3, 2018 at 12:01 PM Marcelo Tosatti <mtosatti@redhat.com> wrote:
>> > >
>> > > On Tue, Oct 02, 2018 at 10:15:49PM -0700, Andy Lutomirski wrote:
>> > > > Hi Vitaly, Paolo, Radim, etc.,
>> > > >
>> > > > On Fri, Sep 14, 2018 at 5:52 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>> > > > >
>> > > > > Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime()
>> > > > > implementation, which extended the clockid switch case and added yet
>> > > > > another slightly different copy of the same code.
>> > > > >
>> > > > > Especially the extended switch case is problematic as the compiler tends to
>> > > > > generate a jump table which then requires to use retpolines. If jump tables
>> > > > > are disabled it adds yet another conditional to the existing maze.
>> > > > >
>> > > > > This series takes a different approach by consolidating the almost
>> > > > > identical functions into one implementation for high resolution clocks and
>> > > > > one for the coarse grained clock ids by storing the base data for each
>> > > > > clock id in an array which is indexed by the clock id.
>> > > > >
>> > > >
>> > > > I was trying to understand more of the implications of this patch
>> > > > series, and I was again reminded that there is an entire extra copy of
>> > > > the vclock reading code in arch/x86/kvm/x86.c.  And the purpose of
>> > > > that code is very, very opaque.
>> > > >
>> > > > Can one of you explain what the code is even doing?  From a couple of
>> > > > attempts to read through it, it's a whole bunch of
>> > > > probably-extremely-buggy code that,
>> > >
>> > > Yes, probably.
>> > >
>> > > > drumroll please, tries to atomically read the TSC value and the time.  And decide whether the
>> > > > result is "based on the TSC".
>> > >
>> > > I think "based on the TSC" refers to whether TSC clocksource is being
>> > > used.
>> > >
>> > > > And then synthesizes a TSC-to-ns
>> > > > multiplier and shift, based on *something other than the actual
>> > > > multiply and shift used*.
>> > > >
>> > > > IOW, unless I'm totally misunderstanding it, the code digs into the
>> > > > private arch clocksource data intended for the vDSO, uses a poorly
>> > > > maintained copy of the vDSO code to read the time (instead of doing
>> > > > the sane thing and using the kernel interfaces for this), and
>> > > > propagates a totally made up copy to the guest.
>> > >
>> > > I posted kernel interfaces for this, and it was suggested to
>> > > instead write a "in-kernel user of pvclock data".
>> > >
>> > > If you can get kernel interfaces to replace that, go for it. I prefer
>> > > kernel interfaces as well.
>> > >
>> > > >  And gets it entirely
>> > > > wrong when doing nested virt, since, unless there's some secret in
>> > > > this maze, it doesn't actually use the scaling factor from the host
>> > > > when it tells the guest what to do.
>> > > >
>> > > > I am really, seriously tempted to send a patch to simply delete all
>> > > > this code.
>> > >
>> > > If your patch which deletes the code gets the necessary features right,
>> > > sure, go for it.
>> > >
>> > > > The correct way to do it is to hook
>> > >
>> > > Can you expand on the correct way to do it?
>> > >
>> > > > And I don't see how it's even possible to pass kvmclock correctly to
>> > > > the L2 guest when L0 is hyperv.  KVM could pass *hyperv's* clock, but
>> > > > L1 isn't notified when the data structure changes, so how the heck is
>> > > > it supposed to update the kvmclock structure?
>> > >
>> > > I don't parse your question.
>> >
>> > Let me ask it more intelligently: when the "reenlightenment" IRQ
>> > happens, what tells KVM to do its own update for its guests?
>>
>> Update of what, and why does it need to update anything from an IRQ?
>>
>> The update I can think of is from the host kernel clocksource,
>> which there is a notifier for.
>>
>>
>
> Unless I've missed some serious magic, L2 guests see kvmclock, not hv.
> So we have the following sequence of events:
>
>  - L0 migrates the whole VM.  Starting now, RDTSC is emulated to match
> the old host, which applies in L1 and L2.
>
>  - An IRQ is queued to L1.
>
>  - L1 acknowledges that it noticed the TSC change.

Before the acknowledgement we actually pause all guests so they don't
notice the change ....

>  RDTSC stops being emulated for L1 and L2.

.... and right after that we update all kvmclocks for all L2s and
unpause them so all their readings are still correct (see
kvm_hyperv_tsc_notifier()).
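
(For reference, a condensed sketch of that notifier's flow, paraphrased
from arch/x86/kvm/x86.c circa v4.18 with locking elided; treat names and
details as approximate:)

	static void kvm_hyperv_tsc_notifier(void)
	{
		struct kvm *kvm;
		struct kvm_vcpu *vcpu;
		int i;

		/* Block guest entries while the clocks are inconsistent. */
		list_for_each_entry(kvm, &vm_list, vm_list)
			kvm_make_mclock_inprogress_request(kvm);

		/* Ack the reenlightenment IRQ: L0 stops emulating RDTSC. */
		hyperv_stop_tsc_emulation();

		list_for_each_entry(kvm, &vm_list, vm_list) {
			/* Recompute the masterclock pair and each vCPU's
			 * kvmclock against the no-longer-emulated TSC ... */
			pvclock_update_vm_gtod_copy(kvm);
			kvm_for_each_vcpu(i, vcpu, kvm)
				kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);

			/* ... and only then let the vCPUs run again. */
			kvm_for_each_vcpu(i, vcpu, kvm)
				kvm_clear_request(KVM_REQ_MCLOCK_INPROGRESS, vcpu);
		}
	}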

>
>  - L2 reads the TSC.  It has no idea that anything changed, and it
> gets the wrong answer.

I have to admit I forgot what happens if L2 uses raw TSC. I *think* that
we actually adjust TSC offset along with adjusting kvmclocks so the
reading is still correct. I'll have to check this.

All bets are off in case L2 was using TSC for time interval
measurements: frequency, of course, changes.

>
>  - At some point, kvm clock updates.
>
> What prevents this?  Vitaly, am I missing some subtlety of what
> actually happens?

-- 
Vitaly

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
  2018-10-04 14:00           ` Andy Lutomirski
@ 2018-10-04 19:31             ` Peter Zijlstra
  2018-10-04 20:05               ` Andy Lutomirski
  0 siblings, 1 reply; 81+ messages in thread
From: Peter Zijlstra @ 2018-10-04 19:31 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Vitaly Kuznetsov, Marcelo Tosatti, Andy Lutomirski,
	Thomas Gleixner, Paolo Bonzini, Radim Krcmar, Wanpeng Li, LKML,
	X86 ML, Matt Rickard, Stephen Boyd, John Stultz, Florian Weimer,
	KY Srinivasan, devel, Linux Virtualization, Arnd Bergmann,
	Juergen Gross

On Thu, Oct 04, 2018 at 07:00:45AM -0700, Andy Lutomirski wrote:
> > On Oct 4, 2018, at 1:11 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> >> On Thu, Oct 04, 2018 at 09:54:45AM +0200, Vitaly Kuznetsov wrote:
> >> I was hoping to hear this from you :-) If I am to suggest how we can
> >> move forward I'd propose:
> >> - Check if pure TSC can be used on SkyLake+ systems (where TSC scaling
> >> is supported).
> >> - Check if non-masterclock mode is still needed. E.g. HyperV's TSC page
> >> clocksource is a single page for the whole VM, not a per-cpu thing. Can
> >> we think that all the buggy hardware is already gone?
> > 
> > No, and it is not the hardware you have to worry about (mostly), it is
> > the frigging PoS firmware people put on it.
> > 
> > Ever since Nehalem TSC is stable (unless you get to >4 socket systems,
> > after which it still can be, but bets are off). But even relatively
> > recent systems fail the TSC sync test because firmware messes it up by
> > writing to either MSR_TSC or MSR_TSC_ADJUST.
> > 
> > But the thing is, if the TSC is not synced, you cannot use it for
> > timekeeping, full stop. So having a single page is fine, it either
> > contains a mult/shift that is valid, or it indicates TSC is messed up
> > and you fall back to something else.
> > 
> > There is no in-between there.
> > 
> > For sched_clock we can still use the global page, because the rate will
> > still be the same for each cpu, it's just offset between CPUs and the
> > code compensates for that.
> 
> But if we’re in a KVM guest, then the clock will jump around on the
> same *vCPU* when the vCPU migrates.

Urgh yes..

> But I don’t see how kvmclock helps here, since I don’t think it’s used
> for sched_clock.

I get hits on kvm_sched_clock, but haven't looked further.

Anyway, most of the argument still holds: either TSC is synced or it is
not and it _really_ should not be used. Not much middle ground there.
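
(The "single page, valid or fall back" scheme referred to above is the
Hyper-V reference TSC page. A minimal sketch of the read side, assuming
the TLFS layout and close to the kernel's hv_read_tsc_page(): a sequence
of 0 means "page invalid, fall back to the MSR", otherwise the reading
is (tsc * scale) >> 64 plus an offset.)

	static u64 read_hv_tsc_page(const struct ms_hyperv_tsc_page *tsc_pg)
	{
		u64 scale, offset, tsc;
		u32 seq;

		do {
			seq = READ_ONCE(tsc_pg->tsc_sequence);
			if (!seq)
				return U64_MAX;	/* caller falls back to the MSR */

			smp_rmb();
			scale  = READ_ONCE(tsc_pg->tsc_scale);
			offset = READ_ONCE(tsc_pg->tsc_offset);
			tsc    = rdtsc_ordered();
			smp_rmb();
		} while (READ_ONCE(tsc_pg->tsc_sequence) != seq);

		return mul_u64_u64_shr(tsc, scale, 64) + offset;
	}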

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
  2018-10-04 19:31             ` Peter Zijlstra
@ 2018-10-04 20:05               ` Andy Lutomirski
  2018-10-04 22:15                 ` Andy Lutomirski
  0 siblings, 1 reply; 81+ messages in thread
From: Andy Lutomirski @ 2018-10-04 20:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vitaly Kuznetsov, Marcelo Tosatti, Andy Lutomirski,
	Thomas Gleixner, Paolo Bonzini, Radim Krcmar, Wanpeng Li, LKML,
	X86 ML, Matt Rickard, Stephen Boyd, John Stultz, Florian Weimer,
	KY Srinivasan, devel, Linux Virtualization, Arnd Bergmann,
	Juergen Gross



> On Oct 4, 2018, at 12:31 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Thu, Oct 04, 2018 at 07:00:45AM -0700, Andy Lutomirski wrote:
>>> On Oct 4, 2018, at 1:11 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>> 
>>>> On Thu, Oct 04, 2018 at 09:54:45AM +0200, Vitaly Kuznetsov wrote:
>>>> I was hoping to hear this from you :-) If I am to suggest how we can
>>>> move forward I'd propose:
>>>> - Check if pure TSC can be used on SkyLake+ systems (where TSC scaling
>>>> is supported).
>>>> - Check if non-masterclock mode is still needed. E.g. HyperV's TSC page
>>>> clocksource is a single page for the whole VM, not a per-cpu thing. Can
>>>> we think that all the buggy hardware is already gone?
>>> 
>>> No, and it is not the hardware you have to worry about (mostly), it is
>>> the frigging PoS firmware people put on it.
>>> 
>>> Ever since Nehalem TSC is stable (unless you get to >4 socket systems,
>>> after which it still can be, but bets are off). But even relatively
>>> recent systems fail the TSC sync test because firmware messes it up by
>>> writing to either MSR_TSC or MSR_TSC_ADJUST.
>>> 
>>> But the thing is, if the TSC is not synced, you cannot use it for
>>> timekeeping, full stop. So having a single page is fine, it either
>>> contains a mult/shift that is valid, or it indicates TSC is messed up
>>> and you fall back to something else.
>>> 
>>> There is no in-between there.
>>> 
>>> For sched_clock we can still use the global page, because the rate will
>>> still be the same for each cpu, it's just offset between CPUs and the
>>> code compensates for that.
>> 
>> But if we’re in a KVM guest, then the clock will jump around on the
>> same *vCPU* when the vCPU migrates.
> 
> Urgh yes..
> 
>> But I don’t see how kvmclock helps here, since I don’t think it’s used
>> for sched_clock.
> 
> I get hits on kvm_sched_clock, but haven't looked further.

Ok, so KVM is using the per-vCPU pvclock data for sched_clock. Which hopefully does something intelligent.

Regardless of any TSC syncing issues, a paravirt clock should presumably be used for sched_clock to account for time that the vCPU was stopped.
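
(The guest-side hookup, roughly: a sketch paraphrased from
arch/x86/kernel/kvmclock.c, so names and details are approximate.)

	static u64 kvm_sched_clock_offset;	/* kvmclock value at setup time */

	static u64 kvm_sched_clock_read(void)
	{
		/* kvm_clock_read() is the per-vCPU pvclock read:
		 * (tsc_timestamp, system_time, mul, shift) applied to RDTSC,
		 * so it keeps counting while the vCPU is scheduled out. */
		return kvm_clock_read() - kvm_sched_clock_offset;
	}

	static void kvm_sched_clock_init(bool stable)
	{
		if (!stable)
			clear_sched_clock_stable();
		kvm_sched_clock_offset = kvm_clock_read();
		pv_time_ops.sched_clock = kvm_sched_clock_read;
	}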

It would be fantastic if this stuff were documented much better, both in terms of the data structures and the Linux code.


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
  2018-09-14 12:50 [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support Thomas Gleixner
                   ` (13 preceding siblings ...)
  2018-10-03  5:15 ` Andy Lutomirski
@ 2018-10-04 20:32 ` Andy Lutomirski
  14 siblings, 0 replies; 81+ messages in thread
From: Andy Lutomirski @ 2018-10-04 20:32 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, Andrew Lutomirski, X86 ML, Peter Zijlstra, Matt Rickard,
	Stephen Boyd, John Stultz, Florian Weimer, KY Srinivasan,
	Vitaly Kuznetsov, devel, Linux Virtualization, Paolo Bonzini,
	Arnd Bergmann, Juergen Gross

On Fri, Sep 14, 2018 at 5:52 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Matt attempted to add CLOCK_TAI support to the VDSO clock_gettime()
> implementation, which extended the clockid switch case and added yet
> another slightly different copy of the same code.
>
> Especially the extended switch case is problematic as the compiler tends to
> generate a jump table which then requires to use retpolines. If jump tables
> are disabled it adds yet another conditional to the existing maze.
>
> This series takes a different approach by consolidating the almost
> identical functions into one implementation for high resolution clocks and
> one for the coarse grained clock ids by storing the base data for each
> clock id in an array which is indexed by the clock id.
>
> This completely eliminates the switch case and allows further
> simplifications of the code base, which at the end all together gain a few
> cycles performance or at least stay on par with todays code. The resulting
> performance depends heavily on the micro architecture and the compiler.

tglx, please consider this whole series Acked-by: Andy Lutomirski
<luto@kernel.org>

Please feel free to push it to tip:x86/vdso, as long as you pull in
tip/x86/urgent first.  Once it lands, I'll email out a couple of
follow-up patches.

--Andy

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
  2018-10-04 20:05               ` Andy Lutomirski
@ 2018-10-04 22:15                 ` Andy Lutomirski
  2018-10-06 20:27                   ` Marcelo Tosatti
  2018-10-06 20:49                   ` Marcelo Tosatti
  0 siblings, 2 replies; 81+ messages in thread
From: Andy Lutomirski @ 2018-10-04 22:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vitaly Kuznetsov, Marcelo Tosatti, Andrew Lutomirski,
	Thomas Gleixner, Paolo Bonzini, Radim Krcmar, Wanpeng Li, LKML,
	X86 ML, Matt Rickard, Stephen Boyd, John Stultz, Florian Weimer,
	KY Srinivasan, devel, Linux Virtualization, Arnd Bergmann,
	Juergen Gross

For better or for worse, I'm trying to understand this code.  So far,
I've come up with this patch:

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso-tglx&id=14fd71e12b1c4492a06f368f75041f263e6862bf

Is it correct, or am I missing some subtlety?

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
  2018-10-04  7:54       ` Vitaly Kuznetsov
  2018-10-04  8:11         ` Peter Zijlstra
  2018-10-04 12:00         ` Paolo Bonzini
@ 2018-10-05 21:18         ` Marcelo Tosatti
  2 siblings, 0 replies; 81+ messages in thread
From: Marcelo Tosatti @ 2018-10-05 21:18 UTC (permalink / raw)
  To: Vitaly Kuznetsov
  Cc: Andy Lutomirski, Thomas Gleixner, Paolo Bonzini, Radim Krcmar,
	Wanpeng Li, LKML, X86 ML, Peter Zijlstra, Matt Rickard,
	Stephen Boyd, John Stultz, Florian Weimer, KY Srinivasan, devel,
	Linux Virtualization, Arnd Bergmann, Juergen Gross

On Thu, Oct 04, 2018 at 09:54:45AM +0200, Vitaly Kuznetsov wrote:
> Marcelo Tosatti <mtosatti@redhat.com> writes:
> 
> > On Wed, Oct 03, 2018 at 11:22:58AM +0200, Vitaly Kuznetsov wrote:
> >> 
> >> There is a very long history of different (hardware) issues Marcelo was
> >> fighting with and the current code is the survived Frankenstein.
> >
> > Right, the code has to handle different TSC modes.
> >
> >>  E.g. it
> >> is very, very unclear what "catchup", "always catchup" and
> >> masterclock-less mode in general are and if we still need them.
> >
> > Catchup means handling exposed (to guest) TSC frequency smaller than
> > HW TSC frequency.
> >
> > That is "frankenstein" code, could be removed.
> >
> >> That said I'm all for simplification. I'm not sure if we still need to
> >> care about buggy hardware though.
> >
> > What simplification is that again? 
> >
> 
> I was hoping to hear this from you :-) If I am to suggest how we can
> move forward I'd propose:
> - Check if pure TSC can be used on SkyLake+ systems (where TSC scaling
> is supported).

In that case just use the TSC clocksource on the guest directly: I am
writing code for that now (it's faster than the pvclock syscall).

> - Check if non-masterclock mode is still needed. E.g. HyperV's TSC page
> clocksource is a single page for the whole VM, not a per-cpu thing. Can
> we think that all the buggy hardware is already gone?

As Peter mentioned, non-synced TSC hardware might exist in the future.
And older hardware must be supported.



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
  2018-10-04 22:15                 ` Andy Lutomirski
@ 2018-10-06 20:27                   ` Marcelo Tosatti
  2018-10-06 22:28                     ` Andy Lutomirski
  2018-10-06 20:49                   ` Marcelo Tosatti
  1 sibling, 1 reply; 81+ messages in thread
From: Marcelo Tosatti @ 2018-10-06 20:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Vitaly Kuznetsov, Thomas Gleixner, Paolo Bonzini,
	Radim Krcmar, Wanpeng Li, LKML, X86 ML, Matt Rickard,
	Stephen Boyd, John Stultz, Florian Weimer, KY Srinivasan, devel,
	Linux Virtualization, Arnd Bergmann, Juergen Gross

On Thu, Oct 04, 2018 at 03:15:32PM -0700, Andy Lutomirski wrote:
> For better or for worse, I'm trying to understand this code.  So far,
> I've come up with this patch:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso-tglx&id=14fd71e12b1c4492a06f368f75041f263e6862bf
> 
> Is it correct, or am I missing some subtlety?

The master clock, when initialized, has a pair

masterclockvalues=(TSC value, time-of-day data).

When updating the guest clock, we only update relative to (TSC value)
that was read on masterclock initialization.

See the following comment on x86.c:

/*
 *
 * Assuming a stable TSC across physical CPUS, and a stable TSC
 * across virtual CPUs, the following condition is possible.
 * Each numbered line represents an event visible to both
 * CPUs at the next numbered event.
...

When updating the "masterclockvalues" pair, all vcpus are 
stopped.
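
(In other words, between those expensive updates every guest reading is
the frozen pair plus a scaled TSC delta. A minimal user-space sketch of
the guest-side arithmetic, following the pvclock ABI described in
Documentation/virtual/kvm/msr.txt; a real reader also retries while the
'version' field is odd or changes:)

	#include <stdint.h>

	struct pvclock_vcpu_time_info {
		uint32_t version;		/* even: stable, odd: update in progress */
		uint32_t pad0;
		uint64_t tsc_timestamp;		/* TSC half of the pair */
		uint64_t system_time;		/* time-of-day half, in ns */
		uint32_t tsc_to_system_mul;	/* 32.32 fixed-point cycles -> ns */
		int8_t   tsc_shift;
		uint8_t  flags;
		uint8_t  pad[2];
	};

	static uint64_t pvclock_read(const struct pvclock_vcpu_time_info *t,
				     uint64_t tsc)
	{
		uint64_t delta = tsc - t->tsc_timestamp;

		if (t->tsc_shift >= 0)
			delta <<= t->tsc_shift;
		else
			delta >>= -t->tsc_shift;

		return t->system_time +
		       (uint64_t)(((__uint128_t)delta * t->tsc_to_system_mul) >> 32);
	}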




^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
  2018-10-04 22:15                 ` Andy Lutomirski
  2018-10-06 20:27                   ` Marcelo Tosatti
@ 2018-10-06 20:49                   ` Marcelo Tosatti
  1 sibling, 0 replies; 81+ messages in thread
From: Marcelo Tosatti @ 2018-10-06 20:49 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Vitaly Kuznetsov, Thomas Gleixner, Paolo Bonzini,
	Radim Krcmar, Wanpeng Li, LKML, X86 ML, Matt Rickard,
	Stephen Boyd, John Stultz, Florian Weimer, KY Srinivasan, devel,
	Linux Virtualization, Arnd Bergmann, Juergen Gross

On Thu, Oct 04, 2018 at 03:15:32PM -0700, Andy Lutomirski wrote:
> For better or for worse, I'm trying to understand this code.  So far,
> I've come up with this patch:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso-tglx&id=14fd71e12b1c4492a06f368f75041f263e6862bf
> 
> Is it correct, or am I missing some subtlety?

"In the non-fallback case, a bunch of gnarly arithmetic is done: a
hopefully matched pair of (TSC, boot time) is read, then all locks
are dropped, then the TSC frequency is read, a brand new multiplier
and shift is read, and the result is turned into a time.

This seems quite racy to me.  Because locks are not held, I don't
see what keeps the TSC frequency used in sync with the master clock
data."

If tsc_khz changes, the host TSC clocksource will not be used, which 
disables masterclock:

	if ((val == CPUFREQ_PRECHANGE  && freq->old < freq->new) ||
	    (val == CPUFREQ_POSTCHANGE && freq->old > freq->new)) {
		*lpj = cpufreq_scale(loops_per_jiffy_ref, ref_freq, freq->new);

		tsc_khz = cpufreq_scale(tsc_khz_ref, ref_freq, freq->new);
		if (!(freq->flags & CPUFREQ_CONST_LOOPS))
			mark_tsc_unstable("cpufreq changes");

In which case it ends up in:

-	spin_lock(&ka->pvclock_gtod_sync_lock);
-	if (!ka->use_master_clock) {
-		spin_unlock(&ka->pvclock_gtod_sync_lock);
-		return ktime_get_boot_ns() + ka->kvmclock_offset;
-	}

The masterclock -> non-masterclock transition sets a REQUEST bit on
each vCPU, so as to invalidate any previous clock reads.

static void kvm_gen_update_masterclock(struct kvm *kvm)
{
#ifdef CONFIG_X86_64
        int i;
        struct kvm_vcpu *vcpu;
        struct kvm_arch *ka = &kvm->arch;

        spin_lock(&ka->pvclock_gtod_sync_lock);
        kvm_make_mclock_inprogress_request(kvm);
        /* no guest entries from this point */
        pvclock_update_vm_gtod_copy(kvm);

        kvm_for_each_vcpu(i, vcpu, kvm)
                kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);

        /* guest entries allowed */
        kvm_for_each_vcpu(i, vcpu, kvm)
                kvm_clear_request(KVM_REQ_MCLOCK_INPROGRESS, vcpu);

        spin_unlock(&ka->pvclock_gtod_sync_lock);
#endif

        /* (the fragment below is from pvclock_update_vm_gtod_copy()) */
        /*
         * If the host uses TSC clock, then passthrough TSC as stable
         * to the guest.
         */
        host_tsc_clocksource = kvm_get_time_and_clockread(
                                        &ka->master_kernel_ns,
                                        &ka->master_cycle_now);

        ka->use_master_clock = host_tsc_clocksource && vcpus_matched
                                && !ka->backwards_tsc_observed
                                && !ka->boot_vcpu_runs_old_kvmclock;





^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
  2018-10-06 20:27                   ` Marcelo Tosatti
@ 2018-10-06 22:28                     ` Andy Lutomirski
  2018-10-08 15:26                       ` Marcelo Tosatti
  0 siblings, 1 reply; 81+ messages in thread
From: Andy Lutomirski @ 2018-10-06 22:28 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Andrew Lutomirski, Peter Zijlstra, Vitaly Kuznetsov,
	Thomas Gleixner, Paolo Bonzini, Radim Krcmar, Wanpeng Li, LKML,
	X86 ML, Matt Rickard, Stephen Boyd, John Stultz, Florian Weimer,
	KY Srinivasan, devel, Linux Virtualization, Arnd Bergmann,
	Juergen Gross

On Sat, Oct 6, 2018 at 1:29 PM Marcelo Tosatti <mtosatti@redhat.com> wrote:
>
> On Thu, Oct 04, 2018 at 03:15:32PM -0700, Andy Lutomirski wrote:
> > For better or for worse, I'm trying to understand this code.  So far,
> > I've come up with this patch:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso-tglx&id=14fd71e12b1c4492a06f368f75041f263e6862bf
> >
> > Is it correct, or am I missing some subtlety?
>
> The master clock, when initialized, has a pair
>
> masterclockvalues=(TSC value, time-of-day data).
>
> When updating the guest clock, we only update relative to (TSC value)
> that was read on masterclock initialization.

I don't see the problem.  The masterclock data is updated here:

    host_tsc_clocksource = kvm_get_time_and_clockread(
                    &ka->master_kernel_ns,
                    &ka->master_cycle_now);

kvm_get_time_and_clockread() gets those values from
do_monotonic_boot(), which, barring bugs, should cause
get_kvmclock_ns() to return exactly the same thing as
ktime_get_boot_ns() + ka->kvmclock_offset, albeit in a rather
roundabout manner.
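
(do_monotonic_boot() is a seqcount loop that samples the timekeeper and
the TSC together; roughly, paraphrasing the 4.19-era code in
arch/x86/kvm/x86.c:)

	static int do_monotonic_boot(s64 *t, u64 *tsc_timestamp)
	{
		struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
		unsigned long seq;
		u64 ns;
		int mode;

		do {
			seq = read_seqcount_begin(&gtod->seq);
			ns = gtod->nsec_base;
			/* reads the TSC and accumulates it with the
			 * timekeeper's mult/shift; reports the vclock mode */
			ns += vgettsc(tsc_timestamp, &mode);
			ns >>= gtod->clock.shift;
			ns += gtod->boot_ns;
		} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
		*t = ns;

		return mode;
	}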

So what am I missing?  Is there actually something wrong with my patch?


>
> See the following comment on x86.c:

I read that comment, and it's not obvious to me how it's related.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
  2018-10-06 22:28                     ` Andy Lutomirski
@ 2018-10-08 15:26                       ` Marcelo Tosatti
  2018-10-08 17:38                         ` Andy Lutomirski
  0 siblings, 1 reply; 81+ messages in thread
From: Marcelo Tosatti @ 2018-10-08 15:26 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Vitaly Kuznetsov, Thomas Gleixner, Paolo Bonzini,
	Radim Krcmar, Wanpeng Li, LKML, X86 ML, Matt Rickard,
	Stephen Boyd, John Stultz, Florian Weimer, KY Srinivasan, devel,
	Linux Virtualization, Arnd Bergmann, Juergen Gross

On Sat, Oct 06, 2018 at 03:28:05PM -0700, Andy Lutomirski wrote:
> On Sat, Oct 6, 2018 at 1:29 PM Marcelo Tosatti <mtosatti@redhat.com> wrote:
> >
> > On Thu, Oct 04, 2018 at 03:15:32PM -0700, Andy Lutomirski wrote:
> > > For better or for worse, I'm trying to understand this code.  So far,
> > > I've come up with this patch:
> > >
> > > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso-tglx&id=14fd71e12b1c4492a06f368f75041f263e6862bf
> > >
> > > Is it correct, or am I missing some subtlety?
> >
> > The master clock, when initialized, has a pair
> >
> > masterclockvalues=(TSC value, time-of-day data).
> >
> > When updating the guest clock, we only update relative to (TSC value)
> > that was read on masterclock initialization.
> 
> I don't see the problem.  The masterclock data is updated here:
> 
>     host_tsc_clocksource = kvm_get_time_and_clockread(
>                     &ka->master_kernel_ns,
>                     &ka->master_cycle_now);
> 
> kvm_get_time_and_clockread() gets those values from
> do_monotonic_boot(), which, barring bugs, should cause
> get_kvmclock_ns() to return exactly the same thing as
> ktime_get_boot_ns() + ka->kvmclock_offset, albeit in a rather
> roundabout manner.
> 
> So what am I missing?  Is there actually something wrong with my patch?

For the bug mentioned in the comment not to happen, you must only read
the TSC and add it as an offset to (TSC value, time-of-day data).

It's more than "a roundabout manner".

Read the comment again.

> 
> 
> >
> > See the following comment on x86.c:
> 
> I read that comment, and it's not obvious to me how it's related.



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
  2018-10-08 15:26                       ` Marcelo Tosatti
@ 2018-10-08 17:38                         ` Andy Lutomirski
  2018-10-08 19:36                           ` Marcelo Tosatti
  0 siblings, 1 reply; 81+ messages in thread
From: Andy Lutomirski @ 2018-10-08 17:38 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Andrew Lutomirski, Peter Zijlstra, Vitaly Kuznetsov,
	Thomas Gleixner, Paolo Bonzini, Radim Krcmar, Wanpeng Li, LKML,
	X86 ML, Matt Rickard, Stephen Boyd, John Stultz, Florian Weimer,
	KY Srinivasan, devel, Linux Virtualization, Arnd Bergmann,
	Juergen Gross

On Mon, Oct 8, 2018 at 8:27 AM Marcelo Tosatti <mtosatti@redhat.com> wrote:
>
> On Sat, Oct 06, 2018 at 03:28:05PM -0700, Andy Lutomirski wrote:
> > On Sat, Oct 6, 2018 at 1:29 PM Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > >
> > > On Thu, Oct 04, 2018 at 03:15:32PM -0700, Andy Lutomirski wrote:
> > > > For better or for worse, I'm trying to understand this code.  So far,
> > > > I've come up with this patch:
> > > >
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso-tglx&id=14fd71e12b1c4492a06f368f75041f263e6862bf
> > > >
> > > > Is it correct, or am I missing some subtlety?
> > >
> > > The master clock, when initialized, has a pair
> > >
> > > masterclockvalues=(TSC value, time-of-day data).
> > >
> > > When updating the guest clock, we only update relative to (TSC value)
> > > that was read on masterclock initialization.
> >
> > I don't see the problem.  The masterclock data is updated here:
> >
> >     host_tsc_clocksource = kvm_get_time_and_clockread(
> >                     &ka->master_kernel_ns,
> >                     &ka->master_cycle_now);
> >
> > kvm_get_time_and_clockread() gets those values from
> > do_monotonic_boot(), which, barring bugs, should cause
> > get_kvmclock_ns() to return exactly the same thing as
> > ktime_get_boot_ns() + ka->kvmclock_offset, albeit in a rather
> > roundabout manner.
> >
> > So what am I missing?  Is there actually something wrong with my patch?
>
> For the bug mentioned in the comment not to happen, you must only read
> the TSC and add it as an offset to (TSC value, time-of-day data).
>
> It's more than "a roundabout manner".
>
> Read the comment again.
>

I read the comment three more times and even dug through the git
history.  It seems like what you're saying is that, under certain
conditions (which arguably would be bugs in the core Linux timing
code), actually calling ktime_get_boot_ns() could be non-monotonic
with respect to the kvmclock timing.  But get_kvmclock_ns() isn't used
for VM timing as such -- it's used for the IOCTL interfaces for
updating the time offset.  So can you explain how my patch is
incorrect?

--Andy

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
  2018-10-08 17:38                         ` Andy Lutomirski
@ 2018-10-08 19:36                           ` Marcelo Tosatti
  2018-10-09 20:09                             ` Andy Lutomirski
  0 siblings, 1 reply; 81+ messages in thread
From: Marcelo Tosatti @ 2018-10-08 19:36 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Vitaly Kuznetsov, Thomas Gleixner, Paolo Bonzini,
	Radim Krcmar, Wanpeng Li, LKML, X86 ML, Matt Rickard,
	Stephen Boyd, John Stultz, Florian Weimer, KY Srinivasan, devel,
	Linux Virtualization, Arnd Bergmann, Juergen Gross

On Mon, Oct 08, 2018 at 10:38:22AM -0700, Andy Lutomirski wrote:
> On Mon, Oct 8, 2018 at 8:27 AM Marcelo Tosatti <mtosatti@redhat.com> wrote:
> >
> > On Sat, Oct 06, 2018 at 03:28:05PM -0700, Andy Lutomirski wrote:
> > > On Sat, Oct 6, 2018 at 1:29 PM Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > > >
> > > > On Thu, Oct 04, 2018 at 03:15:32PM -0700, Andy Lutomirski wrote:
> > > > > For better or for worse, I'm trying to understand this code.  So far,
> > > > > I've come up with this patch:
> > > > >
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/vdso-tglx&id=14fd71e12b1c4492a06f368f75041f263e6862bf
> > > > >
> > > > > Is it correct, or am I missing some subtlety?
> > > >
> > > > The master clock, when initialized, has a pair
> > > >
> > > > masterclockvalues=(TSC value, time-of-day data).
> > > >
> > > > When updating the guest clock, we only update relative to (TSC value)
> > > > that was read on masterclock initialization.
> > >
> > > I don't see the problem.  The masterclock data is updated here:
> > >
> > >     host_tsc_clocksource = kvm_get_time_and_clockread(
> > >                     &ka->master_kernel_ns,
> > >                     &ka->master_cycle_now);
> > >
> > > kvm_get_time_and_clockread() gets those values from
> > > do_monotonic_boot(), which, barring bugs, should cause
> > > get_kvmclock_ns() to return exactly the same thing as
> > > ktime_get_boot_ns() + ka->kvmclock_offset, albeit in a rather
> > > roundabout manner.
> > >
> > > So what am I missing?  Is there actually something wrong with my patch?
> >
> > For the bug mentioned in the comment not to happen, you must only read
> > the TSC and add it as an offset to (TSC value, time-of-day data).
> >
> > It's more than "a roundabout manner".
> >
> > Read the comment again.
> >
> 
> I read the comment three more times and even dug through the git
> history.  It seems like what you're saying is that, under certain
> conditions (which arguably would be bugs in the core Linux timing
> code), 

I don't see that as a bug. It's just a side effect of reading two
different clocks (one is CLOCK_MONOTONIC and the other is TSC),
and using those two clocks as a "base + offset".

As the comment explains, if you do that, you can't guarantee monotonicity.

> actually calling ktime_get_boot_ns() could be non-monotonic
> with respect to the kvmclock timing.  But get_kvmclock_ns() isn't used
> for VM timing as such -- it's used for the IOCTL interfaces for
> updating the time offset.  So can you explain how my patch is
> incorrect?

ktime_get_boot_ns() has frequency correction applied, while 
reading masterclock + TSC offset does not.

So the clock reads differ.


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
  2018-10-08 19:36                           ` Marcelo Tosatti
@ 2018-10-09 20:09                             ` Andy Lutomirski
  2018-10-11 22:27                               ` Marcelo Tosatti
  0 siblings, 1 reply; 81+ messages in thread
From: Andy Lutomirski @ 2018-10-09 20:09 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Andrew Lutomirski, Peter Zijlstra, Vitaly Kuznetsov,
	Thomas Gleixner, Paolo Bonzini, Radim Krcmar, Wanpeng Li, LKML,
	X86 ML, Matt Rickard, Stephen Boyd, John Stultz, Florian Weimer,
	KY Srinivasan, devel, Linux Virtualization, Arnd Bergmann,
	Juergen Gross

On Tue, Oct 9, 2018 at 8:28 AM Marcelo Tosatti <mtosatti@redhat.com> wrote:
>
> On Mon, Oct 08, 2018 at 10:38:22AM -0700, Andy Lutomirski wrote:
> > On Mon, Oct 8, 2018 at 8:27 AM Marcelo Tosatti <mtosatti@redhat.com> wrote:

> > I read the comment three more times and even dug through the git
> > history.  It seems like what you're saying is that, under certain
> > conditions (which arguably would be bugs in the core Linux timing
> > code),
>
> I don't see that as a bug. It's just a side effect of reading two
> different clocks (one is CLOCK_MONOTONIC and the other is TSC),
> and using those two clocks as a "base + offset".
>
> As the comment explains, if you do that, you can't guarantee monotonicity.
>
> > actually calling ktime_get_boot_ns() could be non-monotonic
> > with respect to the kvmclock timing.  But get_kvmclock_ns() isn't used
> > for VM timing as such -- it's used for the IOCTL interfaces for
> > updating the time offset.  So can you explain how my patch is
> > incorrect?
>
> ktime_get_boot_ns() has frequency correction applied, while
> reading masterclock + TSC offset does not.
>
> So the clock reads differ.
>

Ah, okay, I finally think I see what's going on.  In the kvmclock data
exposed to the guest, tsc_shift and tsc_to_system_mul come from
tgt_tsc_khz, whereas master_kernel_ns and master_cycle_now come from
CLOCK_BOOTTIME.  So the kvmclock and kernel clock drift apart at a
rate given by the frequency shift and then suddenly agree again every
time the pvclock data is updated.

Is there a reason to do it this way?
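
(To put an assumed magnitude on the drift described above: if NTP slews
the host clock by f ppm and the pvclock data is refreshed every T
seconds, the step the guest sees at each refresh is roughly f * T
microseconds. With illustrative numbers, not values from this thread:

	step = 100 ppm * 300 s = 100e-6 * 300 s = 30 ms per update.)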

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
  2018-10-09 20:09                             ` Andy Lutomirski
@ 2018-10-11 22:27                               ` Marcelo Tosatti
  2018-10-11 23:00                                 ` Andy Lutomirski
  0 siblings, 1 reply; 81+ messages in thread
From: Marcelo Tosatti @ 2018-10-11 22:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Vitaly Kuznetsov, Thomas Gleixner, Paolo Bonzini,
	Radim Krcmar, Wanpeng Li, LKML, X86 ML, Matt Rickard,
	Stephen Boyd, John Stultz, Florian Weimer, KY Srinivasan, devel,
	Linux Virtualization, Arnd Bergmann, Juergen Gross

On Tue, Oct 09, 2018 at 01:09:42PM -0700, Andy Lutomirski wrote:
> On Tue, Oct 9, 2018 at 8:28 AM Marcelo Tosatti <mtosatti@redhat.com> wrote:
> >
> > On Mon, Oct 08, 2018 at 10:38:22AM -0700, Andy Lutomirski wrote:
> > > On Mon, Oct 8, 2018 at 8:27 AM Marcelo Tosatti <mtosatti@redhat.com> wrote:
> 
> > > I read the comment three more times and even dug through the git
> > > history.  It seems like what you're saying is that, under certain
> > > conditions (which arguably would be bugs in the core Linux timing
> > > code),
> >
> > I don't see that as a bug. It's just a side effect of reading two
> > different clocks (one is CLOCK_MONOTONIC and the other is TSC),
> > and using those two clocks as a "base + offset".
> >
> > As the comment explains, if you do that, you can't guarantee monotonicity.
> >
> > > actually calling ktime_get_boot_ns() could be non-monotonic
> > > with respect to the kvmclock timing.  But get_kvmclock_ns() isn't used
> > > for VM timing as such -- it's used for the IOCTL interfaces for
> > > updating the time offset.  So can you explain how my patch is
> > > incorrect?
> >
> > ktime_get_boot_ns() has frequency correction applied, while
> > reading masterclock + TSC offset does not.
> >
> > So the clock reads differ.
> >
> 
> Ah, okay, I finally think I see what's going on.  In the kvmclock data
> exposed to the guest, tsc_shift and tsc_to_system_mul come from
> tgt_tsc_khz, whereas master_kernel_ns and master_cycle_now come from
> CLOCK_BOOTTIME.  So the kvmclock and kernel clock drift apart at a
> rate given by the frequency shift and then suddenly agree again every
> time the pvclock data is updated.

Yes.

> Is there a reason to do it this way?

Since pvclock updates which update system_timestamp are expensive (must stop all vcpus), 
they should be avoided. 

So only the HW TSC counts, and it is used as an offset against the vcpu's tsc_timestamp.




^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
  2018-10-11 22:27                               ` Marcelo Tosatti
@ 2018-10-11 23:00                                 ` Andy Lutomirski
  2018-10-15 13:39                                   ` Marcelo Tosatti
  0 siblings, 1 reply; 81+ messages in thread
From: Andy Lutomirski @ 2018-10-11 23:00 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Andrew Lutomirski, Peter Zijlstra, Vitaly Kuznetsov,
	Thomas Gleixner, Paolo Bonzini, Radim Krcmar, Wanpeng Li, LKML,
	X86 ML, Matt Rickard, Stephen Boyd, John Stultz, Florian Weimer,
	KY Srinivasan, devel, Linux Virtualization, Arnd Bergmann,
	Juergen Gross

On Thu, Oct 11, 2018 at 3:28 PM Marcelo Tosatti <mtosatti@redhat.com> wrote:
>
> On Tue, Oct 09, 2018 at 01:09:42PM -0700, Andy Lutomirski wrote:
> > On Tue, Oct 9, 2018 at 8:28 AM Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > >
> > > On Mon, Oct 08, 2018 at 10:38:22AM -0700, Andy Lutomirski wrote:
> > > > On Mon, Oct 8, 2018 at 8:27 AM Marcelo Tosatti <mtosatti@redhat.com> wrote:
> >
> > > > I read the comment three more times and even dug through the git
> > > > history.  It seems like what you're saying is that, under certain
> > > > conditions (which arguably would be bugs in the core Linux timing
> > > > code),
> > >
> > > I don't see that as a bug. It's just a side effect of reading two
> > > different clocks (one is CLOCK_MONOTONIC and the other is TSC),
> > > and using those two clocks as a "base + offset".
> > >
> > > As the comment explains, if you do that, you can't guarantee monotonicity.
> > >
> > > > actually calling ktime_get_boot_ns() could be non-monotonic
> > > > with respect to the kvmclock timing.  But get_kvmclock_ns() isn't used
> > > > for VM timing as such -- it's used for the IOCTL interfaces for
> > > > updating the time offset.  So can you explain how my patch is
> > > > incorrect?
> > >
> > > ktime_get_boot_ns() has frequency correction applied, while
> > > reading masterclock + TSC offset does not.
> > >
> > > So the clock reads differ.
> > >
> >
> > Ah, okay, I finally think I see what's going on.  In the kvmclock data
> > exposed to the guest, tsc_shift and tsc_to_system_mul come from
> > tgt_tsc_khz, whereas master_kernel_ns and master_cycle_now come from
> > CLOCK_BOOTTIME.  So the kvmclock and kernel clock drift apart at a
> > rate given by the frequency shift and then suddenly agree again every
> > time the pvclock data is updated.
>
> Yes.
>
> > Is there a reason to do it this way?
>
> Since pvclock updates which update system_timestamp are expensive (must stop all vcpus),
> they should be avoided.
>

Fair enough.

> So only the HW TSC counts

makes sense.

>, and it is used as an offset against the vcpu's tsc_timestamp.
>

Why don't you just expose CLOCK_MONOTONIC_RAW or CLOCK_MONOTONIC_RAW
plus suspend time, though?  Then you would actually be tracking a real
kernel timekeeping mode, and you wouldn't need all this complicated
offsetting work to avoid accidentally going backwards.
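
(A sketch of one way that could look. This is purely illustrative:
suspend_time_ns and the master_mul/master_shift fields are invented
names, and the unpaired rdtsc()/ktime_get_raw_ns() reads shown here
would still need the paired-read treatment discussed above:)

	static void update_masterclock_from_raw(struct kvm_arch *ka)
	{
		/* Base pair sampled from the NTP-free raw clock, plus an
		 * accumulated-suspend term (assumed helper). */
		ka->master_cycle_now = rdtsc();
		ka->master_kernel_ns = ktime_get_raw_ns() + suspend_time_ns;

		/* Rate derived from the same unscaled TSC frequency, so the
		 * base and the guest-side extrapolation cannot drift apart. */
		kvm_get_time_scale(NSEC_PER_SEC, tsc_khz * 1000LL,
				   &ka->master_shift, &ka->master_mul);
	}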

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support
  2018-10-11 23:00                                 ` Andy Lutomirski
@ 2018-10-15 13:39                                   ` Marcelo Tosatti
  0 siblings, 0 replies; 81+ messages in thread
From: Marcelo Tosatti @ 2018-10-15 13:39 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Peter Zijlstra, Vitaly Kuznetsov, Thomas Gleixner, Paolo Bonzini,
	Radim Krcmar, Wanpeng Li, LKML, X86 ML, Matt Rickard,
	Stephen Boyd, John Stultz, Florian Weimer, KY Srinivasan, devel,
	Linux Virtualization, Arnd Bergmann, Juergen Gross

On Thu, Oct 11, 2018 at 04:00:29PM -0700, Andy Lutomirski wrote:
> On Thu, Oct 11, 2018 at 3:28 PM Marcelo Tosatti <mtosatti@redhat.com> wrote:
> >
> > On Tue, Oct 09, 2018 at 01:09:42PM -0700, Andy Lutomirski wrote:
> > > On Tue, Oct 9, 2018 at 8:28 AM Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > > >
> > > > On Mon, Oct 08, 2018 at 10:38:22AM -0700, Andy Lutomirski wrote:
> > > > > On Mon, Oct 8, 2018 at 8:27 AM Marcelo Tosatti <mtosatti@redhat.com> wrote:
> > >
> > > > > I read the comment three more times and even dug through the git
> > > > > history.  It seems like what you're saying is that, under certain
> > > > > conditions (which arguably would be bugs in the core Linux timing
> > > > > code),
> > > >
> > > > I don't see that as a bug. It's just a side effect of reading two
> > > > different clocks (one is CLOCK_MONOTONIC and the other is TSC),
> > > > and using those two clocks as a "base + offset".
> > > >
> > > > As the comment explains, if you do that, you can't guarantee monotonicity.
> > > >
> > > > > actually calling ktime_get_boot_ns() could be non-monotonic
> > > > > with respect to the kvmclock timing.  But get_kvmclock_ns() isn't used
> > > > > for VM timing as such -- it's used for the IOCTL interfaces for
> > > > > updating the time offset.  So can you explain how my patch is
> > > > > incorrect?
> > > >
> > > > ktime_get_boot_ns() has frequency correction applied, while
> > > > reading masterclock + TSC offset does not.
> > > >
> > > > So the clock reads differ.
> > > >
> > >
> > > Ah, okay, I finally think I see what's going on.  In the kvmclock data
> > > exposed to the guest, tsc_shift and tsc_to_system_mul come from
> > > tgt_tsc_khz, whereas master_kernel_ns and master_cycle_now come from
> > > CLOCK_BOOTTIME.  So the kvmclock and kernel clock drift apart at a
> > > rate given by the frequency shift and then suddenly agree again every
> > > time the pvclock data is updated.
> >
> > Yes.
> >
> > > Is there a reason to do it this way?
> >
> > Since pvclock updates which update system_timestamp are expensive (must stop all vcpus),
> > they should be avoided.
> >
> 
> Fair enough.
> 
> > So only the HW TSC counts
> 
> makes sense.
> 
> >, and it is used as an offset against the vcpu's tsc_timestamp.
> >
> 
> Why don't you just expose CLOCK_MONOTONIC_RAW or CLOCK_MONOTONIC_RAW
> plus suspend time, though?  Then you would actually be tracking a real
> kernel timekeeping mode, and you wouldn't need all this complicated
> offsetting work to avoid accidentally going backwards.

Can you outline how that would work?


^ permalink raw reply	[flat|nested] 81+ messages in thread

Thread overview: 81+ messages
2018-09-14 12:50 [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support Thomas Gleixner
2018-09-14 12:50 ` [patch 01/11] clocksource: Provide clocksource_arch_init() Thomas Gleixner
2018-09-14 12:50 ` [patch 02/11] x86/time: Implement clocksource_arch_init() Thomas Gleixner
2018-09-14 15:45   ` Vitaly Kuznetsov
2018-09-15  6:05     ` Thomas Gleixner
2018-09-14 12:50 ` [patch 03/11] x86/vdso: Enforce 64bit clocksource Thomas Gleixner
2018-09-14 12:50 ` [patch 04/11] x86/vdso: Use unsigned int consistently for vsyscall_gtod_data::seq Thomas Gleixner
2018-09-14 12:50 ` [patch 05/11] x86/vdso: Introduce and use vgtod_ts Thomas Gleixner
2018-09-14 12:50 ` [patch 06/11] x86/vdso: Collapse high resolution functions Thomas Gleixner
2018-09-14 12:50 ` [patch 07/11] x86/vdso: Collapse coarse functions Thomas Gleixner
2018-09-14 12:50 ` [patch 08/11] x86/vdso: Replace the clockid switch case Thomas Gleixner
2018-09-14 12:50 ` [patch 09/11] x86/vdso: Simplify the invalid vclock case Thomas Gleixner
2018-09-17 19:25   ` Andy Lutomirski
2018-09-17 20:12     ` John Stultz
2018-09-18  7:52       ` Thomas Gleixner
2018-09-18  8:30         ` Peter Zijlstra
2018-09-18  8:52           ` Thomas Gleixner
2018-09-18 10:06             ` Thomas Gleixner
2018-09-18 10:41               ` Thomas Gleixner
2018-09-18 12:48                 ` Peter Zijlstra
2018-09-18 13:23                   ` Thomas Gleixner
2018-09-18 13:38                     ` Peter Zijlstra
2018-09-18 15:52                     ` Thomas Gleixner
2018-09-27 14:41                       ` Thomas Gleixner
2018-09-18 14:01         ` Andy Lutomirski
2018-09-18 22:46           ` Thomas Gleixner
2018-09-18 23:03             ` Andy Lutomirski
2018-09-18 23:16               ` Thomas Gleixner
2018-09-27 14:36                 ` Thomas Gleixner
2018-09-27 14:39                   ` Andy Lutomirski
2018-09-19  9:08             ` Rasmus Villemoes
2018-09-19 13:29               ` Thomas Gleixner
2018-09-14 12:50 ` [patch 10/11] x86/vdso: Move cycle_last handling into the caller Thomas Gleixner
2018-09-14 15:26   ` Vitaly Kuznetsov
2018-09-14 12:50 ` [patch 11/11] x66/vdso: Add CLOCK_TAI support Thomas Gleixner
2018-09-14 14:04   ` Andy Lutomirski
2018-09-14 14:27     ` Thomas Gleixner
2018-09-14 14:59       ` Andy Lutomirski
2018-09-16  9:39         ` Thomas Gleixner
2018-09-14 12:56 ` [patch 00/11] x86/vdso: Cleanups, simmplifications and " Florian Weimer
2018-09-14 13:05   ` Thomas Gleixner
2018-09-14 13:06     ` Florian Weimer
2018-09-14 13:19       ` Thomas Gleixner
2018-09-14 13:09   ` Peter Zijlstra
2018-09-14 14:22 ` Arnd Bergmann
2018-09-17 13:00   ` Thomas Gleixner
2018-09-24 21:08     ` Arnd Bergmann
2018-10-03  5:15 ` Andy Lutomirski
2018-10-03  9:22   ` Vitaly Kuznetsov
2018-10-03 10:20     ` Andy Lutomirski
2018-10-03 12:01       ` Vitaly Kuznetsov
2018-10-03 14:20         ` Andy Lutomirski
2018-10-03 15:10           ` Thomas Gleixner
2018-10-03 16:18             ` Andy Lutomirski
2018-10-03 19:06     ` Marcelo Tosatti
2018-10-04  7:54       ` Vitaly Kuznetsov
2018-10-04  8:11         ` Peter Zijlstra
2018-10-04 14:00           ` Andy Lutomirski
2018-10-04 19:31             ` Peter Zijlstra
2018-10-04 20:05               ` Andy Lutomirski
2018-10-04 22:15                 ` Andy Lutomirski
2018-10-06 20:27                   ` Marcelo Tosatti
2018-10-06 22:28                     ` Andy Lutomirski
2018-10-08 15:26                       ` Marcelo Tosatti
2018-10-08 17:38                         ` Andy Lutomirski
2018-10-08 19:36                           ` Marcelo Tosatti
2018-10-09 20:09                             ` Andy Lutomirski
2018-10-11 22:27                               ` Marcelo Tosatti
2018-10-11 23:00                                 ` Andy Lutomirski
2018-10-15 13:39                                   ` Marcelo Tosatti
2018-10-06 20:49                   ` Marcelo Tosatti
2018-10-04 12:00         ` Paolo Bonzini
2018-10-04 14:04           ` Andy Lutomirski
2018-10-05 21:18         ` Marcelo Tosatti
2018-10-03 19:00   ` Marcelo Tosatti
2018-10-03 19:05     ` [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support\ Marcelo Tosatti
2018-10-03 22:32     ` [patch 00/11] x86/vdso: Cleanups, simmplifications and CLOCK_TAI support Andy Lutomirski
2018-10-04 16:37       ` Marcelo Tosatti
2018-10-04 17:08         ` Andy Lutomirski
2018-10-04 17:28           ` Vitaly Kuznetsov
2018-10-04 20:32 ` Andy Lutomirski
