linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/6][RFC] Rework vsyscall to avoid truncation/rounding issue in timekeeping core
From: John Stultz @ 2012-09-17 22:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: John Stultz, Tony Luck, Paul Mackerras, Benjamin Herrenschmidt,
	Andy Lutomirski, Martin Schwidefsky, Paul Turner, Steven Rostedt,
	Richard Cochran, Prarit Bhargava, Thomas Gleixner

This item has been on my todo list for a number of years.

One interesting bit of the timekeeping code is that we internally
keep sub-nanosecond precision. This allows for really fine-grained
error accounting, which lets us maintain long term accuracy even
if the clocksource can only make very coarse adjustments.

Since sub-nanosecond precision isn't useful to userland, we
normally truncate this extra data off before handing it to
userland. So only the timekeeping core deals with this extra
resolution.

Brief background here:

Timekeeping roughly works as follows:
	time_ns = base_ns + cyc2ns(cycles)

With our sub-ns resolution we can internally do calculations
like:
	base_ns = 0.9
	cyc2ns(cycles) =  0.9
	Thus:
	time_ns = 0.9 + 0.9 = 1.8 (which we truncate down to 1)


We periodically accumulate the cyc2ns(cycles) portion into
base_ns to keep cycles from growing so large that it might overflow.

So we might have a case where we accumulate in 3-cycle chunks, each
cycle being 10.3 ns long.

So before accumulation:
	base_ns = 0
	cyc2ns(4) = 41.2
	time_now = 41.2 (truncated to 41)

After accumulation:
	base_ns = 30.9
	cyc2ns(1) = 10.3
	time_now = 41.2 (truncated to 41)


One quirk is that when we export timekeeping data to the vsyscall
code, we also truncate the extra resolution. In the past this has
caused problems, where single-nanosecond inconsistencies could be
detected.

So before accumulation:
	base_ns = 0
	cyc2ns(4) = 41.2 (truncated to 41)
	time_now = 41

After accumulation:
	base_ns = 30.9 (truncated to 30)
	cyc2ns(1) = 10.3 (truncated to 10)
	time_now = 40

And time looks like it went backwards!
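
To make the failure mode concrete, here is a small standalone
userspace sketch (not kernel code; it uses shift = 3, so one unit
is 1/8 ns, and a 10.25ns cycle so the values are exactly
representable):

	#include <stdio.h>
	#include <stdint.h>

	#define SHIFT 3			/* 1 unit = 1/8 ns */

	int main(void)
	{
		uint64_t cycle = 82;	/* 10.25 ns per cycle, shifted */
		uint64_t base, cyc;

		/* Before accumulation: base_ns = 0, 4 pending cycles */
		base = 0;
		cyc = 4 * cycle;
		printf("core: %llu  vsyscall: %llu\n",
			(unsigned long long)((base + cyc) >> SHIFT),
			(unsigned long long)((base >> SHIFT) + (cyc >> SHIFT)));

		/* After accumulating a 3-cycle chunk into the base */
		base = 3 * cycle;
		cyc = 1 * cycle;
		printf("core: %llu  vsyscall: %llu\n",
			(unsigned long long)((base + cyc) >> SHIFT),
			(unsigned long long)((base >> SHIFT) + (cyc >> SHIFT)));
		return 0;
	}

This prints "core: 41  vsyscall: 41" and then "core: 41  vsyscall: 40":
truncating the parts separately makes time appear to go backwards,
while truncating only the final sum does not.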

In order to avoid this, we currently round up to the next
nanosecond when we do accumulation. To keep this from causing
long term drift (as much as 1ns per tick), we add the amount we
rounded up to the error accounting, which will slow the
clocksource frequency appropriately to avoid the drift.
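
In shifted-nanosecond form, that round-up is the chunk of
update_wall_time() which patch 5 below moves behind a config option:

	/* Round xtime_nsec up to a full nanosecond, and charge the
	 * amount we rounded up to the NTP error, so the clocksource
	 * adjustment code cancels the drift back out over time. */
	remainder = tk->xtime_nsec & ((1ULL << tk->shift) - 1);
	tk->xtime_nsec -= remainder;
	tk->xtime_nsec += 1ULL << tk->shift;
	tk->ntp_error += remainder << tk->ntp_error_shift;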

This works, but causes the clocksource adjustment code to do much
more work. Steven Rostedt pointed out that the unlikely() case in
timekeeping_adjust() ends up being true every time.

Further, this rounding up and slowing down adds more complexity to
the timekeeping core.

The better solution is to provide the full sub-nanosecond precision
data to the vsyscall code, so that we do the truncation on the final
data, in the exact same way the timekeeping core does, rather than
truncating some of the source data. This requires reworking the
vsyscall code paths (x86, ppc, s390, ia64) to be able to handle this
extra data.

This patch set provides an initial draft of how I'd like to solve it
(the resulting interface change is sketched below):
1) Introducing a way for the vsyscall code to access the entire
   timekeeper state
2) Transitioning the existing update_vsyscall methods to
   update_vsyscall_old
3) Introduce the new full-resolution update_vsyscall method
4) Limit the problematic extra rounding to only systems using the
   old vsyscall method
5) Convert x86 to use the new vsyscall update and full resolution
   gettime calculation.

Powerpc, s390 and ia64 will also need to be converted, but this
allows for a slow transition.
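
For reference, the interface change at the heart of the series
(patches 3 and 4) boils down to the following two prototypes:

	/* Old: pre-truncated values; the sub-ns remainder is lost */
	void update_vsyscall_old(struct timespec *ts, struct timespec *wtm,
				 struct clocksource *c, u32 mult);

	/* New: the arch copies what it needs, including the shifted
	 * (sub-ns) xtime_nsec, directly out of the timekeeper */
	void update_vsyscall(struct timekeeper *tk);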

Anyway, I'd greatly appreciate any thoughts or feedback on this
approach.

Thanks
-john

Cc: Tony Luck <tony.luck@intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>



John Stultz (6):
  time: Move timekeeper structure to timekeeper_internal.h for vsyscall
    changes
  time: Move update_vsyscall definitions to timekeeper_internal.h
  time: Convert CONFIG_GENERIC_TIME_VSYSCALL to
    CONFIG_GENERIC_TIME_VSYSCALL_OLD
  time: Introduce new GENERIC_TIME_VSYSCALL
  time: Only do nanosecond rounding on GENERIC_TIME_VSYSCALL_OLD
    systems
  time: Convert x86_64 to using new update_vsyscall

 arch/ia64/Kconfig                   |    2 +-
 arch/ia64/kernel/time.c             |    4 +-
 arch/powerpc/Kconfig                |    2 +-
 arch/powerpc/kernel/time.c          |    4 +-
 arch/s390/Kconfig                   |    2 +-
 arch/s390/kernel/time.c             |    4 +-
 arch/x86/include/asm/vgtod.h        |    4 +-
 arch/x86/kernel/vsyscall_64.c       |   49 +++++++++------
 arch/x86/vdso/vclock_gettime.c      |   22 ++++---
 include/linux/clocksource.h         |   16 -----
 include/linux/timekeeper_internal.h |  108 ++++++++++++++++++++++++++++++++
 kernel/time.c                       |    2 +-
 kernel/time/Kconfig                 |    4 ++
 kernel/time/timekeeping.c           |  115 ++++++++++-------------------------
 14 files changed, 200 insertions(+), 138 deletions(-)
 create mode 100644 include/linux/timekeeper_internal.h

-- 
1.7.9.5



* [PATCH 1/6][RFC] time: Move timekeeper structure to timekeeper_internal.h for vsyscall changes
From: John Stultz @ 2012-09-17 22:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: John Stultz, Tony Luck, Paul Mackerras, Benjamin Herrenschmidt,
	Andy Lutomirski, Martin Schwidefsky, Paul Turner, Steven Rostedt,
	Richard Cochran, Prarit Bhargava, Thomas Gleixner

We're going to need to access the timekeeper in update_vsyscall,
so make the structure available for those who need it.

Cc: Tony Luck <tony.luck@intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 include/linux/timekeeper_internal.h |   68 +++++++++++++++++++++++++++++++++++
 kernel/time/timekeeping.c           |   56 +----------------------------
 2 files changed, 69 insertions(+), 55 deletions(-)
 create mode 100644 include/linux/timekeeper_internal.h

diff --git a/include/linux/timekeeper_internal.h b/include/linux/timekeeper_internal.h
new file mode 100644
index 0000000..8ba43fa
--- /dev/null
+++ b/include/linux/timekeeper_internal.h
@@ -0,0 +1,68 @@
+/*
+ * You SHOULD NOT be including this unless you're vsyscall
+ * handling code or timekeeping internal code!
+ */
+
+#ifndef _LINUX_TIMEKEEPER_INTERNAL_H
+#define _LINUX_TIMEKEEPER_INTERNAL_H
+
+#include <linux/clocksource.h>
+#include <linux/jiffies.h>
+#include <linux/time.h>
+
+/* Structure holding internal timekeeping values. */
+struct timekeeper {
+	/* Current clocksource used for timekeeping. */
+	struct clocksource	*clock;
+	/* NTP adjusted clock multiplier */
+	u32			mult;
+	/* The shift value of the current clocksource. */
+	u32			shift;
+	/* Number of clock cycles in one NTP interval. */
+	cycle_t			cycle_interval;
+	/* Number of clock shifted nano seconds in one NTP interval. */
+	u64			xtime_interval;
+	/* shifted nano seconds left over when rounding cycle_interval */
+	s64			xtime_remainder;
+	/* Raw nano seconds accumulated per NTP interval. */
+	u32			raw_interval;
+
+	/* Current CLOCK_REALTIME time in seconds */
+	u64			xtime_sec;
+	/* Clock shifted nano seconds */
+	u64			xtime_nsec;
+
+	/* Difference between accumulated time and NTP time in ntp
+	 * shifted nano seconds. */
+	s64			ntp_error;
+	/* Shift conversion between clock shifted nano seconds and
+	 * ntp shifted nano seconds. */
+	u32			ntp_error_shift;
+
+	/*
+	 * wall_to_monotonic is what we need to add to xtime (or xtime corrected
+	 * for sub jiffie times) to get to monotonic time.  Monotonic is pegged
+	 * at zero at system boot time, so wall_to_monotonic will be negative,
+	 * however, we will ALWAYS keep the tv_nsec part positive so we can use
+	 * the usual normalization.
+	 *
+	 * wall_to_monotonic is moved after resume from suspend for the
+	 * monotonic time not to jump. We need to add total_sleep_time to
+	 * wall_to_monotonic to get the real boot based time offset.
+	 *
+	 * - wall_to_monotonic is no longer the boot time, getboottime must be
+	 * used instead.
+	 */
+	struct timespec		wall_to_monotonic;
+	/* Offset clock monotonic -> clock realtime */
+	ktime_t			offs_real;
+	/* time spent in suspend */
+	struct timespec		total_sleep_time;
+	/* Offset clock monotonic -> clock boottime */
+	ktime_t			offs_boot;
+	/* The raw monotonic time for the CLOCK_MONOTONIC_RAW posix clock. */
+	struct timespec		raw_time;
+	/* Seqlock for all timekeeper values */
+	seqlock_t		lock;
+};
+#endif /* _LINUX_TIMEKEEPER_INTERNAL_H */
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 34e5eac..61f5fba 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -8,6 +8,7 @@
  *
  */
 
+#include <linux/timekeeper_internal.h>
 #include <linux/module.h>
 #include <linux/interrupt.h>
 #include <linux/percpu.h>
@@ -21,61 +22,6 @@
 #include <linux/tick.h>
 #include <linux/stop_machine.h>
 
-/* Structure holding internal timekeeping values. */
-struct timekeeper {
-	/* Current clocksource used for timekeeping. */
-	struct clocksource	*clock;
-	/* NTP adjusted clock multiplier */
-	u32			mult;
-	/* The shift value of the current clocksource. */
-	u32			shift;
-	/* Number of clock cycles in one NTP interval. */
-	cycle_t			cycle_interval;
-	/* Number of clock shifted nano seconds in one NTP interval. */
-	u64			xtime_interval;
-	/* shifted nano seconds left over when rounding cycle_interval */
-	s64			xtime_remainder;
-	/* Raw nano seconds accumulated per NTP interval. */
-	u32			raw_interval;
-
-	/* Current CLOCK_REALTIME time in seconds */
-	u64			xtime_sec;
-	/* Clock shifted nano seconds */
-	u64			xtime_nsec;
-
-	/* Difference between accumulated time and NTP time in ntp
-	 * shifted nano seconds. */
-	s64			ntp_error;
-	/* Shift conversion between clock shifted nano seconds and
-	 * ntp shifted nano seconds. */
-	u32			ntp_error_shift;
-
-	/*
-	 * wall_to_monotonic is what we need to add to xtime (or xtime corrected
-	 * for sub jiffie times) to get to monotonic time.  Monotonic is pegged
-	 * at zero at system boot time, so wall_to_monotonic will be negative,
-	 * however, we will ALWAYS keep the tv_nsec part positive so we can use
-	 * the usual normalization.
-	 *
-	 * wall_to_monotonic is moved after resume from suspend for the
-	 * monotonic time not to jump. We need to add total_sleep_time to
-	 * wall_to_monotonic to get the real boot based time offset.
-	 *
-	 * - wall_to_monotonic is no longer the boot time, getboottime must be
-	 * used instead.
-	 */
-	struct timespec		wall_to_monotonic;
-	/* Offset clock monotonic -> clock realtime */
-	ktime_t			offs_real;
-	/* time spent in suspend */
-	struct timespec		total_sleep_time;
-	/* Offset clock monotonic -> clock boottime */
-	ktime_t			offs_boot;
-	/* The raw monotonic time for the CLOCK_MONOTONIC_RAW posix clock. */
-	struct timespec		raw_time;
-	/* Seqlock for all timekeeper values */
-	seqlock_t		lock;
-};
 
 static struct timekeeper timekeeper;
 
-- 
1.7.9.5



* [PATCH 2/6][RFC] time: Move update_vsyscall definitions to timekeeper_internal.h
From: John Stultz @ 2012-09-17 22:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: John Stultz, Tony Luck, Paul Mackerras, Benjamin Herrenschmidt,
	Andy Lutomirski, Martin Schwidefsky, Paul Turner, Steven Rostedt,
	Richard Cochran, Prarit Bhargava, Thomas Gleixner

Since users will need to include timekeeper_internal.h, move
update_vsyscall definitions to timekeeper_internal.h.

Cc: Tony Luck <tony.luck@intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 arch/ia64/kernel/time.c             |    2 +-
 arch/powerpc/kernel/time.c          |    2 +-
 arch/s390/kernel/time.c             |    2 +-
 arch/x86/kernel/vsyscall_64.c       |    2 +-
 include/linux/clocksource.h         |   16 ----------------
 include/linux/timekeeper_internal.h |   17 +++++++++++++++++
 kernel/time.c                       |    2 +-
 7 files changed, 22 insertions(+), 21 deletions(-)

diff --git a/arch/ia64/kernel/time.c b/arch/ia64/kernel/time.c
index ecc904b..acb688f 100644
--- a/arch/ia64/kernel/time.c
+++ b/arch/ia64/kernel/time.c
@@ -19,7 +19,7 @@
 #include <linux/interrupt.h>
 #include <linux/efi.h>
 #include <linux/timex.h>
-#include <linux/clocksource.h>
+#include <linux/timekeeper_internal.h>
 #include <linux/platform_device.h>
 
 #include <asm/machvec.h>
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index e49e931..613a830d 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -73,7 +73,7 @@
 /* powerpc clocksource/clockevent code */
 
 #include <linux/clockchips.h>
-#include <linux/clocksource.h>
+#include <linux/timekeeper_internal.h>
 
 static cycle_t rtc_read(struct clocksource *);
 static struct clocksource clocksource_rtc = {
diff --git a/arch/s390/kernel/time.c b/arch/s390/kernel/time.c
index dcec960..bfb62ad 100644
--- a/arch/s390/kernel/time.c
+++ b/arch/s390/kernel/time.c
@@ -34,7 +34,7 @@
 #include <linux/profile.h>
 #include <linux/timex.h>
 #include <linux/notifier.h>
-#include <linux/clocksource.h>
+#include <linux/timekeeper_internal.h>
 #include <linux/clockchips.h>
 #include <linux/gfp.h>
 #include <linux/kprobes.h>
diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
index 8d141b3..6ec8411 100644
--- a/arch/x86/kernel/vsyscall_64.c
+++ b/arch/x86/kernel/vsyscall_64.c
@@ -28,7 +28,7 @@
 #include <linux/jiffies.h>
 #include <linux/sysctl.h>
 #include <linux/topology.h>
-#include <linux/clocksource.h>
+#include <linux/timekeeper_internal.h>
 #include <linux/getcpu.h>
 #include <linux/cpu.h>
 #include <linux/smp.h>
diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index fbe89e1..4dceaf8 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -319,22 +319,6 @@ static inline void __clocksource_updatefreq_khz(struct clocksource *cs, u32 khz)
 	__clocksource_updatefreq_scale(cs, 1000, khz);
 }
 
-#ifdef CONFIG_GENERIC_TIME_VSYSCALL
-extern void
-update_vsyscall(struct timespec *ts, struct timespec *wtm,
-			struct clocksource *c, u32 mult);
-extern void update_vsyscall_tz(void);
-#else
-static inline void
-update_vsyscall(struct timespec *ts, struct timespec *wtm,
-			struct clocksource *c, u32 mult)
-{
-}
-
-static inline void update_vsyscall_tz(void)
-{
-}
-#endif
 
 extern void timekeeping_notify(struct clocksource *clock);
 
diff --git a/include/linux/timekeeper_internal.h b/include/linux/timekeeper_internal.h
index 8ba43fa..9c1c2cf 100644
--- a/include/linux/timekeeper_internal.h
+++ b/include/linux/timekeeper_internal.h
@@ -65,4 +65,21 @@ struct timekeeper {
 	/* Seqlock for all timekeeper values */
 	seqlock_t		lock;
 };
+
+
+#ifdef CONFIG_GENERIC_TIME_VSYSCALL
+extern void
+update_vsyscall(struct timespec *ts, struct timespec *wtm,
+			struct clocksource *c, u32 mult);
+extern void update_vsyscall_tz(void);
+#else
+static inline void update_vsyscall(struct timespec *ts, struct timespec *wtm,
+					struct clocksource *c, u32 mult)
+{
+}
+static inline void update_vsyscall_tz(void)
+{
+}
+#endif
+
 #endif /* _LINUX_TIMEKEEPER_INTERNAL_H */
diff --git a/kernel/time.c b/kernel/time.c
index ba744cf..d226c6a 100644
--- a/kernel/time.c
+++ b/kernel/time.c
@@ -30,7 +30,7 @@
 #include <linux/export.h>
 #include <linux/timex.h>
 #include <linux/capability.h>
-#include <linux/clocksource.h>
+#include <linux/timekeeper_internal.h>
 #include <linux/errno.h>
 #include <linux/syscalls.h>
 #include <linux/security.h>
-- 
1.7.9.5



* [PATCH 3/6][RFC] time: Convert CONFIG_GENERIC_TIME_VSYSCALL to CONFIG_GENERIC_TIME_VSYSCALL_OLD
From: John Stultz @ 2012-09-17 22:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: John Stultz, Tony Luck, Paul Mackerras, Benjamin Herrenschmidt,
	Andy Lutomirski, Martin Schwidefsky, Paul Turner, Steven Rostedt,
	Richard Cochran, Prarit Bhargava, Thomas Gleixner

To help migrate architectures over to the new update_vsyscall method,
redefine CONFIG_GENERIC_TIME_VSYSCALL as CONFIG_GENERIC_TIME_VSYSCALL_OLD.

Cc: Tony Luck <tony.luck@intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 arch/ia64/Kconfig                   |    2 +-
 arch/ia64/kernel/time.c             |    2 +-
 arch/powerpc/Kconfig                |    2 +-
 arch/powerpc/kernel/time.c          |    2 +-
 arch/s390/Kconfig                   |    2 +-
 arch/s390/kernel/time.c             |    2 +-
 arch/x86/Kconfig                    |    2 +-
 arch/x86/kernel/vsyscall_64.c       |    2 +-
 include/linux/timekeeper_internal.h |    7 ++++---
 kernel/time/Kconfig                 |    2 +-
 kernel/time/timekeeping.c           |    2 +-
 11 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig
index 310cf57..f9e673c 100644
--- a/arch/ia64/Kconfig
+++ b/arch/ia64/Kconfig
@@ -38,7 +38,7 @@ config IA64
 	select ARCH_TASK_STRUCT_ALLOCATOR
 	select ARCH_THREAD_INFO_ALLOCATOR
 	select ARCH_CLOCKSOURCE_DATA
-	select GENERIC_TIME_VSYSCALL
+	select GENERIC_TIME_VSYSCALL_OLD
 	default y
 	help
 	  The Itanium Processor Family is Intel's 64-bit successor to
diff --git a/arch/ia64/kernel/time.c b/arch/ia64/kernel/time.c
index acb688f..d2f4e26 100644
--- a/arch/ia64/kernel/time.c
+++ b/arch/ia64/kernel/time.c
@@ -454,7 +454,7 @@ void update_vsyscall_tz(void)
 {
 }
 
-void update_vsyscall(struct timespec *wall, struct timespec *wtm,
+void update_vsyscall_old(struct timespec *wall, struct timespec *wtm,
 			struct clocksource *c, u32 mult)
 {
 	write_seqcount_begin(&fsyscall_gtod_data.seq);
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 352f416..0881660 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -135,7 +135,7 @@ config PPC
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
 	select GENERIC_SMP_IDLE_THREAD
 	select GENERIC_CMOS_UPDATE
-	select GENERIC_TIME_VSYSCALL
+	select GENERIC_TIME_VSYSCALL_OLD
 	select GENERIC_CLOCKEVENTS
 	select GENERIC_STRNCPY_FROM_USER
 	select GENERIC_STRNLEN_USER
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 613a830d..c825809 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -712,7 +712,7 @@ static cycle_t timebase_read(struct clocksource *cs)
 	return (cycle_t)get_tb();
 }
 
-void update_vsyscall(struct timespec *wall_time, struct timespec *wtm,
+void update_vsyscall_old(struct timespec *wall_time, struct timespec *wtm,
 			struct clocksource *clock, u32 mult)
 {
 	u64 new_tb_to_xs, new_stamp_xsec;
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 107610e..ba488aa 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -121,7 +121,7 @@ config S390
 	select ARCH_INLINE_WRITE_UNLOCK_IRQRESTORE
 	select ARCH_WANT_IPC_PARSE_VERSION
 	select GENERIC_SMP_IDLE_THREAD
-	select GENERIC_TIME_VSYSCALL
+	select GENERIC_TIME_VSYSCALL_OLD
 	select GENERIC_CLOCKEVENTS
 	select KTIME_SCALAR if 32BIT
 	select HAVE_ARCH_SECCOMP_FILTER
diff --git a/arch/s390/kernel/time.c b/arch/s390/kernel/time.c
index bfb62ad..c5430bf 100644
--- a/arch/s390/kernel/time.c
+++ b/arch/s390/kernel/time.c
@@ -219,7 +219,7 @@ struct clocksource * __init clocksource_default_clock(void)
 	return &clocksource_tod;
 }
 
-void update_vsyscall(struct timespec *wall_time, struct timespec *wtm,
+void update_vsyscall_old(struct timespec *wall_time, struct timespec *wtm,
 			struct clocksource *clock, u32 mult)
 {
 	if (clock != &clocksource_tod)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 8ec3a1a..e3f3c1a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -93,7 +93,7 @@ config X86
 	select GENERIC_CLOCKEVENTS
 	select ARCH_CLOCKSOURCE_DATA if X86_64
 	select GENERIC_CLOCKEVENTS_BROADCAST if X86_64 || (X86_32 && X86_LOCAL_APIC)
-	select GENERIC_TIME_VSYSCALL if X86_64
+	select GENERIC_TIME_VSYSCALL_OLD if X86_64
 	select KTIME_SCALAR if X86_32
 	select GENERIC_STRNCPY_FROM_USER
 	select GENERIC_STRNLEN_USER
diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
index 6ec8411..77dde29 100644
--- a/arch/x86/kernel/vsyscall_64.c
+++ b/arch/x86/kernel/vsyscall_64.c
@@ -82,7 +82,7 @@ void update_vsyscall_tz(void)
 	vsyscall_gtod_data.sys_tz = sys_tz;
 }
 
-void update_vsyscall(struct timespec *wall_time, struct timespec *wtm,
+void update_vsyscall_old(struct timespec *wall_time, struct timespec *wtm,
 			struct clocksource *clock, u32 mult)
 {
 	struct timespec monotonic;
diff --git a/include/linux/timekeeper_internal.h b/include/linux/timekeeper_internal.h
index 9c1c2cf..a904d76 100644
--- a/include/linux/timekeeper_internal.h
+++ b/include/linux/timekeeper_internal.h
@@ -67,13 +67,14 @@ struct timekeeper {
 };
 
 
-#ifdef CONFIG_GENERIC_TIME_VSYSCALL
+#ifdef CONFIG_GENERIC_TIME_VSYSCALL_OLD
 extern void
-update_vsyscall(struct timespec *ts, struct timespec *wtm,
+update_vsyscall_old(struct timespec *ts, struct timespec *wtm,
 			struct clocksource *c, u32 mult);
 extern void update_vsyscall_tz(void);
 #else
-static inline void update_vsyscall(struct timespec *ts, struct timespec *wtm,
+static inline void
+update_vsyscall_old(struct timespec *ts, struct timespec *wtm,
 					struct clocksource *c, u32 mult)
 {
 }
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index fd42bd4..489c861 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -13,7 +13,7 @@ config ARCH_CLOCKSOURCE_DATA
 	bool
 
 # Timekeeping vsyscall support
-config GENERIC_TIME_VSYSCALL
+config GENERIC_TIME_VSYSCALL_OLD
 	bool
 
 # ktime_t scalar 64bit nsec representation
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 61f5fba..bdce688 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -199,7 +199,7 @@ static void timekeeping_update(struct timekeeper *tk, bool clearntp)
 		ntp_clear();
 	}
 	xt = tk_xtime(tk);
-	update_vsyscall(&xt, &tk->wall_to_monotonic, tk->clock, tk->mult);
+	update_vsyscall_old(&xt, &tk->wall_to_monotonic, tk->clock, tk->mult);
 }
 
 /**
-- 
1.7.9.5



* [PATCH 4/6][RFC] time: Introduce new GENERIC_TIME_VSYSCALL
From: John Stultz @ 2012-09-17 22:04 UTC (permalink / raw)
  To: linux-kernel
  Cc: John Stultz, Tony Luck, Paul Mackerras, Benjamin Herrenschmidt,
	Andy Lutomirski, Martin Schwidefsky, Paul Turner, Steven Rostedt,
	Richard Cochran, Prarit Bhargava, Thomas Gleixner

Now that we've moved everyone over to GENERIC_TIME_VSYSCALL_OLD,
introduce the new declaration and config option for the new
update_vsyscall method.

Cc: Tony Luck <tony.luck@intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 include/linux/timekeeper_internal.h |   36 ++++++++++++++++++++++++++++-------
 kernel/time/Kconfig                 |    4 ++++
 kernel/time/timekeeping.c           |   14 +-------------
 3 files changed, 34 insertions(+), 20 deletions(-)

diff --git a/include/linux/timekeeper_internal.h b/include/linux/timekeeper_internal.h
index a904d76..8896471 100644
--- a/include/linux/timekeeper_internal.h
+++ b/include/linux/timekeeper_internal.h
@@ -66,16 +66,38 @@ struct timekeeper {
 	seqlock_t		lock;
 };
 
+static inline struct timespec tk_xtime(struct timekeeper *tk)
+{
+	struct timespec ts;
+
+	ts.tv_sec = tk->xtime_sec;
+	ts.tv_nsec = (long)(tk->xtime_nsec >> tk->shift);
+	return ts;
+}
+
+
+#ifdef CONFIG_GENERIC_TIME_VSYSCALL
+
+extern void update_vsyscall(struct timekeeper *tk);
+extern void update_vsyscall_tz(void);
 
-#ifdef CONFIG_GENERIC_TIME_VSYSCALL_OLD
-extern void
-update_vsyscall_old(struct timespec *ts, struct timespec *wtm,
-			struct clocksource *c, u32 mult);
+#elif defined(CONFIG_GENERIC_TIME_VSYSCALL_OLD)
+
+extern void update_vsyscall_old(struct timespec *ts, struct timespec *wtm,
+				struct clocksource *c, u32 mult);
 extern void update_vsyscall_tz(void);
+
+static inline void update_vsyscall(struct timekeeper *tk)
+{
+	struct timespec xt;
+
+	xt = tk_xtime(tk);
+	update_vsyscall_old(&xt, &tk->wall_to_monotonic, tk->clock, tk->mult);
+}
+
 #else
-static inline void
-update_vsyscall_old(struct timespec *ts, struct timespec *wtm,
-					struct clocksource *c, u32 mult)
+
+static inline void update_vsyscall(struct timekeeper *tk)
 {
 }
 static inline void update_vsyscall_tz(void)
diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index 489c861..8601f0d 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -13,6 +13,10 @@ config ARCH_CLOCKSOURCE_DATA
 	bool
 
 # Timekeeping vsyscall support
+config GENERIC_TIME_VSYSCALL
+	bool
+
+# Timekeeping vsyscall support
 config GENERIC_TIME_VSYSCALL_OLD
 	bool
 
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index bdce688..61eaf3b 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -42,15 +42,6 @@ static inline void tk_normalize_xtime(struct timekeeper *tk)
 	}
 }
 
-static struct timespec tk_xtime(struct timekeeper *tk)
-{
-	struct timespec ts;
-
-	ts.tv_sec = tk->xtime_sec;
-	ts.tv_nsec = (long)(tk->xtime_nsec >> tk->shift);
-	return ts;
-}
-
 static void tk_set_xtime(struct timekeeper *tk, const struct timespec *ts)
 {
 	tk->xtime_sec = ts->tv_sec;
@@ -192,14 +183,11 @@ static inline s64 timekeeping_get_ns_raw(struct timekeeper *tk)
 /* must hold write on timekeeper.lock */
 static void timekeeping_update(struct timekeeper *tk, bool clearntp)
 {
-	struct timespec xt;
-
 	if (clearntp) {
 		tk->ntp_error = 0;
 		ntp_clear();
 	}
-	xt = tk_xtime(tk);
-	update_vsyscall_old(&xt, &tk->wall_to_monotonic, tk->clock, tk->mult);
+	update_vsyscall(tk);
 }
 
 /**
-- 
1.7.9.5



* [PATCH 5/6][RFC] time: Only do nanosecond rounding on GENERIC_TIME_VSYSCALL_OLD systems
From: John Stultz @ 2012-09-17 22:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: John Stultz, Tony Luck, Paul Mackerras, Benjamin Herrenschmidt,
	Andy Lutomirski, Martin Schwidefsky, Paul Turner, Steven Rostedt,
	Richard Cochran, Prarit Bhargava, Thomas Gleixner

We only round up to the next nanosecond so we don't see minor
1ns inconsistencies in the vsyscall implementations. Since we're
changing the vsyscall implementations to avoid this, make the
rounding conditional, applying only to GENERIC_TIME_VSYSCALL_OLD
architectures.

Cc: Tony Luck <tony.luck@intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 kernel/time/timekeeping.c |   45 +++++++++++++++++++++++++++++++--------------
 1 file changed, 31 insertions(+), 14 deletions(-)

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 61eaf3b..b976692 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1059,6 +1059,33 @@ static cycle_t logarithmic_accumulation(struct timekeeper *tk, cycle_t offset,
 	return offset;
 }
 
+#ifdef CONFIG_GENERIC_TIME_VSYSCALL_OLD
+static inline void old_vsyscall_fixup(struct timekeeper *tk)
+{
+	s64 remainder;
+
+	/*
+	* Store only full nanoseconds into xtime_nsec after rounding
+	* it up and add the remainder to the error difference.
+	* XXX - This is necessary to avoid small 1ns inconsistencies caused
+	* by truncating the remainder in vsyscalls. However, it causes
+	* additional work to be done in timekeeping_adjust(). Once
+	* the vsyscall implementations are converted to use xtime_nsec
+	* (shifted nanoseconds), and CONFIG_GENERIC_TIME_VSYSCALL_OLD
+	* users are removed, this can be killed.
+	*/
+	remainder = tk->xtime_nsec & ((1ULL << tk->shift) - 1);
+	tk->xtime_nsec -= remainder;
+	tk->xtime_nsec += 1ULL << tk->shift;
+	tk->ntp_error += remainder << tk->ntp_error_shift;
+
+}
+#else
+#define old_vsyscall_fixup(tk)
+#endif
+
+
+
 /**
  * update_wall_time - Uses the current clocksource to increment the wall time
  *
@@ -1070,7 +1097,6 @@ static void update_wall_time(void)
 	cycle_t offset;
 	int shift = 0, maxshift;
 	unsigned long flags;
-	s64 remainder;
 
 	write_seqlock_irqsave(&tk->lock, flags);
 
@@ -1112,20 +1138,11 @@ static void update_wall_time(void)
 	/* correct the clock when NTP error is too big */
 	timekeeping_adjust(tk, offset);
 
-
 	/*
-	* Store only full nanoseconds into xtime_nsec after rounding
-	* it up and add the remainder to the error difference.
-	* XXX - This is necessary to avoid small 1ns inconsistnecies caused
-	* by truncating the remainder in vsyscalls. However, it causes
-	* additional work to be done in timekeeping_adjust(). Once
-	* the vsyscall implementations are converted to use xtime_nsec
-	* (shifted nanoseconds), this can be killed.
-	*/
-	remainder = tk->xtime_nsec & ((1ULL << tk->shift) - 1);
-	tk->xtime_nsec -= remainder;
-	tk->xtime_nsec += 1ULL << tk->shift;
-	tk->ntp_error += remainder << tk->ntp_error_shift;
+	 * XXX This can be killed once everyone converts
+	 * to the new update_vsyscall.
+	 */
+	old_vsyscall_fixup(tk);
 
 	/*
 	 * Finally, make sure that after the rounding
-- 
1.7.9.5



* [PATCH 6/6][RFC] time: Convert x86_64 to using new update_vsyscall
From: John Stultz @ 2012-09-17 22:05 UTC (permalink / raw)
  To: linux-kernel
  Cc: John Stultz, Tony Luck, Paul Mackerras, Benjamin Herrenschmidt,
	Andy Lutomirski, Martin Schwidefsky, Paul Turner, Steven Rostedt,
	Richard Cochran, Prarit Bhargava, Thomas Gleixner

Switch x86_64 to using the new sub-ns precise vsyscall update.
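
Condensed from the diff below, the vdso read path now keeps
everything in shifted (sub-ns) units and truncates only once, on
the final sum, just like the timekeeping core:

	ts->tv_nsec = 0;
	do {
		seq = read_seqcount_begin(&gtod->seq);
		mode = gtod->clock.vclock_mode;
		ts->tv_sec = gtod->wall_time_sec;
		ns = gtod->wall_time_snsec;	/* shifted-ns base */
		ns += vgetsns();		/* cycles * mult, shifted */
		ns >>= gtod->clock.shift;	/* truncate once, at the end */
	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
	timespec_add_ns(ts, ns);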

Cc: Tony Luck <tony.luck@intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 arch/x86/Kconfig               |    2 +-
 arch/x86/include/asm/vgtod.h   |    4 ++--
 arch/x86/kernel/vsyscall_64.c  |   47 ++++++++++++++++++++++++----------------
 arch/x86/vdso/vclock_gettime.c |   22 ++++++++++++-------
 4 files changed, 45 insertions(+), 30 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e3f3c1a..8ec3a1a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -93,7 +93,7 @@ config X86
 	select GENERIC_CLOCKEVENTS
 	select ARCH_CLOCKSOURCE_DATA if X86_64
 	select GENERIC_CLOCKEVENTS_BROADCAST if X86_64 || (X86_32 && X86_LOCAL_APIC)
-	select GENERIC_TIME_VSYSCALL_OLD if X86_64
+	select GENERIC_TIME_VSYSCALL if X86_64
 	select KTIME_SCALAR if X86_32
 	select GENERIC_STRNCPY_FROM_USER
 	select GENERIC_STRNLEN_USER
diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h
index 8b38be2..46e24d3 100644
--- a/arch/x86/include/asm/vgtod.h
+++ b/arch/x86/include/asm/vgtod.h
@@ -17,8 +17,8 @@ struct vsyscall_gtod_data {
 
 	/* open coded 'struct timespec' */
 	time_t		wall_time_sec;
-	u32		wall_time_nsec;
-	u32		monotonic_time_nsec;
+	u64		wall_time_snsec;
+	u64		monotonic_time_snsec;
 	time_t		monotonic_time_sec;
 
 	struct timezone sys_tz;
diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
index 77dde29..3a3e8c9 100644
--- a/arch/x86/kernel/vsyscall_64.c
+++ b/arch/x86/kernel/vsyscall_64.c
@@ -82,32 +82,41 @@ void update_vsyscall_tz(void)
 	vsyscall_gtod_data.sys_tz = sys_tz;
 }
 
-void update_vsyscall_old(struct timespec *wall_time, struct timespec *wtm,
-			struct clocksource *clock, u32 mult)
+void update_vsyscall(struct timekeeper *tk)
 {
-	struct timespec monotonic;
+	struct vsyscall_gtod_data *vdata = &vsyscall_gtod_data;
 
-	write_seqcount_begin(&vsyscall_gtod_data.seq);
+	write_seqcount_begin(&vdata->seq);
 
 	/* copy vsyscall data */
-	vsyscall_gtod_data.clock.vclock_mode	= clock->archdata.vclock_mode;
-	vsyscall_gtod_data.clock.cycle_last	= clock->cycle_last;
-	vsyscall_gtod_data.clock.mask		= clock->mask;
-	vsyscall_gtod_data.clock.mult		= mult;
-	vsyscall_gtod_data.clock.shift		= clock->shift;
-
-	vsyscall_gtod_data.wall_time_sec	= wall_time->tv_sec;
-	vsyscall_gtod_data.wall_time_nsec	= wall_time->tv_nsec;
+	vdata->clock.vclock_mode	= tk->clock->archdata.vclock_mode;
+	vdata->clock.cycle_last		= tk->clock->cycle_last;
+	vdata->clock.mask		= tk->clock->mask;
+	vdata->clock.mult		= tk->mult;
+	vdata->clock.shift		= tk->shift;
+
+	vdata->wall_time_sec		= tk->xtime_sec;
+	vdata->wall_time_snsec		= tk->xtime_nsec;
+
+	vdata->monotonic_time_sec	= tk->xtime_sec
+					+ tk->wall_to_monotonic.tv_sec;
+	vdata->monotonic_time_snsec	= tk->xtime_nsec
+					+ (tk->wall_to_monotonic.tv_nsec
+						<< tk->shift);
+	while (vdata->monotonic_time_snsec >=
+					(((u64)NSEC_PER_SEC) << tk->shift)) {
+		vdata->monotonic_time_snsec -=
+					((u64)NSEC_PER_SEC) << tk->shift;
+		vdata->monotonic_time_sec++;
+	}
 
-	monotonic = timespec_add(*wall_time, *wtm);
-	vsyscall_gtod_data.monotonic_time_sec	= monotonic.tv_sec;
-	vsyscall_gtod_data.monotonic_time_nsec	= monotonic.tv_nsec;
+	vdata->wall_time_coarse.tv_sec	= tk->xtime_sec;
+	vdata->wall_time_coarse.tv_nsec	= (long)(tk->xtime_nsec >> tk->shift);
 
-	vsyscall_gtod_data.wall_time_coarse	= __current_kernel_time();
-	vsyscall_gtod_data.monotonic_time_coarse =
-		timespec_add(vsyscall_gtod_data.wall_time_coarse, *wtm);
+	vdata->monotonic_time_coarse	= timespec_add(vdata->wall_time_coarse,
+							tk->wall_to_monotonic);
 
-	write_seqcount_end(&vsyscall_gtod_data.seq);
+	write_seqcount_end(&vdata->seq);
 }
 
 static void warn_bad_vsyscall(const char *level, struct pt_regs *regs,
diff --git a/arch/x86/vdso/vclock_gettime.c b/arch/x86/vdso/vclock_gettime.c
index 885eff4..4df6c37 100644
--- a/arch/x86/vdso/vclock_gettime.c
+++ b/arch/x86/vdso/vclock_gettime.c
@@ -80,7 +80,7 @@ notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
 }
 
 
-notrace static inline long vgetns(void)
+notrace static inline u64 vgetsns(void)
 {
 	long v;
 	cycles_t cycles;
@@ -91,21 +91,24 @@ notrace static inline long vgetns(void)
 	else
 		return 0;
 	v = (cycles - gtod->clock.cycle_last) & gtod->clock.mask;
-	return (v * gtod->clock.mult) >> gtod->clock.shift;
+	return v * gtod->clock.mult;
 }
 
 /* Code size doesn't matter (vdso is 4k anyway) and this is faster. */
 notrace static int __always_inline do_realtime(struct timespec *ts)
 {
-	unsigned long seq, ns;
+	unsigned long seq;
+	u64 ns;
 	int mode;
 
+	ts->tv_nsec = 0;
 	do {
 		seq = read_seqcount_begin(&gtod->seq);
 		mode = gtod->clock.vclock_mode;
 		ts->tv_sec = gtod->wall_time_sec;
-		ts->tv_nsec = gtod->wall_time_nsec;
-		ns = vgetns();
+		ns = gtod->wall_time_snsec;
+		ns += vgetsns();
+		ns >>= gtod->clock.shift;
 	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
 
 	timespec_add_ns(ts, ns);
@@ -114,15 +117,18 @@ notrace static int __always_inline do_realtime(struct timespec *ts)
 
 notrace static int do_monotonic(struct timespec *ts)
 {
-	unsigned long seq, ns;
+	unsigned long seq;
+	u64 ns;
 	int mode;
 
+	ts->tv_nsec = 0;
 	do {
 		seq = read_seqcount_begin(&gtod->seq);
 		mode = gtod->clock.vclock_mode;
 		ts->tv_sec = gtod->monotonic_time_sec;
-		ts->tv_nsec = gtod->monotonic_time_nsec;
-		ns = vgetns();
+		ns = gtod->monotonic_time_snsec;
+		ns += vgetsns();
+		ns >>= gtod->clock.shift;
 	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
 	timespec_add_ns(ts, ns);
 
-- 
1.7.9.5



* Re: [PATCH 0/6][RFC] Rework vsyscall to avoid truncation/rounding issue in timekeeping core
From: Andy Lutomirski @ 2012-09-17 23:49 UTC (permalink / raw)
  To: John Stultz
  Cc: linux-kernel, Tony Luck, Paul Mackerras, Benjamin Herrenschmidt,
	Martin Schwidefsky, Paul Turner, Steven Rostedt, Richard Cochran,
	Prarit Bhargava, Thomas Gleixner

On Mon, Sep 17, 2012 at 3:04 PM, John Stultz <john.stultz@linaro.org> wrote:
> [... cover letter snipped ...]
>
> Anyway, I'd greatly appreciate any thoughts or feedback on this
> approach.

I haven't looked in any great detail, but the approach looks sensible
and should slow down the vsyscall code.

That being said, as long as you're playing with this, here are a
couple thoughts:

1. The TSC-reading code does this:

	ret = (cycle_t)vget_cycles();

	last = VVAR(vsyscall_gtod_data).clock.cycle_last;

	if (likely(ret >= last))
		return ret;

I haven't specifically benchmarked the cost of that branch, but I
suspect it's a fairly large fraction of the total cost of
vclock_gettime.  IIUC, the point is that there might be a few cycles
worth of clock skew even on systems with otherwise usable TSCs, and we
don't want a different CPU to return complete garbage if the cycle
count is just below cycle_last.

A different formulation would avoid the problem: set cycle_last to,
say, 100ms *before* the time of the last update_vsyscall, and adjust
the wall_time, etc variables accordingly.  That way a few cycles (or
anything up to 100ms) of skew won't cause an overflow.  Then you could
kill that branch.
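
As a rough, purely illustrative sketch (the 100ms margin is made up,
and normalizing the seconds/nanoseconds split on underflow is elided):

	/* Back-date the vsyscall reference point by more than any
	 * plausible inter-CPU TSC skew, compensating in the base, so
	 * a slightly stale TSC read still yields a non-negative
	 * delta and the ret >= last branch can go away. */
	u64 margin_cycles = div_u64((u64)100 * NSEC_PER_MSEC << tk->shift,
				    tk->mult);

	vdata->clock.cycle_last = tk->clock->cycle_last - margin_cycles;
	vdata->wall_time_snsec  = tk->xtime_nsec - margin_cycles * tk->mult;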


2. There's nothing vsyscall-specific about the code in
vclock_gettime.c.  In fact, the VVAR macro should work just fine in
kernel code.  If you moved all this code into a header, then in-kernel
uses could use it, and maybe even other arches could use it.  Last
time I checked, it seemed like vclock_gettime was considerably faster
than whatever the in-kernel equivalent did.


3. The mul_u64_u32_shr function [1] might show up soon, and it would
potentially allow much longer intervals between timekeeping updates.
I'm not sure whether the sub-ns formulation would still work, though (I
suspect it would with some care).
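
(For context, a sketch of what such a helper looks like; this is my
reading of the linked proposal, so details may differ:)

	/* Multiply a 64-bit value by a 32-bit multiplier and shift
	 * the 96-bit intermediate down, without overflowing the way
	 * a plain (a * mul) >> shift would for large 'a'. */
	static inline u64 mul_u64_u32_shr(u64 a, u32 mul, unsigned int shift)
	{
		u32 ah = a >> 32, al = a;
		u64 ret;

		ret = ((u64)al * mul) >> shift;
		if (ah)
			ret += ((u64)ah * mul) << (32 - shift);
		return ret;
	}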


[1] https://lkml.org/lkml/2012/4/25/150

--Andy



* Re: [PATCH 0/6][RFC] Rework vsyscall to avoid truncation/rounding issue in timekeeping core
From: John Stultz @ 2012-09-18  0:20 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, Tony Luck, Paul Mackerras, Benjamin Herrenschmidt,
	Martin Schwidefsky, Paul Turner, Steven Rostedt, Richard Cochran,
	Prarit Bhargava, Thomas Gleixner

On 09/17/2012 04:49 PM, Andy Lutomirski wrote:
> On Mon, Sep 17, 2012 at 3:04 PM, John Stultz <john.stultz@linaro.org> wrote:
>> [... cover letter snipped ...]
>>
>> Anyway, I'd greatly appreciate any thoughts or feedback on this
>> approach.
> I haven't looked in any great detail, but the approach looks sensible
> and should slow down the vsyscall code.

Did you mean "shouldn't"?   Or are you concerned about something 
specifically?

I know you've done quite a bit of tuning on the x86 vdso side, and I 
don't want to wreck that, so I'd appreciate any specific thoughts you 
have if you get a chance to look at it.

> That being said, as long as you're playing with this, here are a
> couple thoughts:
>
> 1. The TSC-reading code does this:
>
> 	ret = (cycle_t)vget_cycles();
>
> 	last = VVAR(vsyscall_gtod_data).clock.cycle_last;
>
> 	if (likely(ret >= last))
> 		return ret;
>
> I haven't specifically benchmarked the cost of that branch, but I
> suspect it's a fairly large fraction of the total cost of
> vclock_gettime.  IIUC, the point is that there might be a few cycles
> worth of clock skew even on systems with otherwise usable TSCs, and we
> don't want a different CPU to return complete garbage if the cycle
> count is just below cycle_last.
>
> A different formulation would avoid the problem: set cycle_last to,
> say, 100ms *before* the time of the last update_vsyscall, and adjust
> the wall_time, etc variables accordingly.  That way a few cycles (or
> anything up to 100ms) of skew won't cause an overflow.  Then you could
> kill that branch.
Interesting.  So I want to keep the scope of this patch set in check, so 
I'd probably hold off on something like this till later, but this might 
not be too complicated to do in the update_wall_time() function, 
basically delaying the accumulation by some amount of time. Although 
this would have odd effects on things like filesystem timestamps, which 
provide "the time at the last tick", which would then be "the time at 
the last tick + Xms".  So it probably needs more careful consideration.



> 2. There's nothing vsyscall-specific about the code in
> vclock_gettime.c.  In fact, the VVAR macro should work just fine in
> kernel code.  If you moved all this code into a header, then in-kernel
> uses could use it, and maybe even other arches could use it.  Last
> time I checked, it seemed like vclock_gettime was considerably faster
> than whatever the in-kernel equivalent did.
I like the idea of unifying the implementations, but I'd want to know 
more about why vclock_gettime was faster than the in-kernel 
getnstimeofday(). It might be due to the more limited locking (we 
only update vsyscall data under the vsyscall lock, whereas the 
timekeeper lock is held for the entire execution of update_wall_time()), 
or because some of the optimizations in the vsyscall code are focused 
on providing timespecs to userland, whereas in-kernel we also have to 
provide ktime_t values.

If you have any details here, I'd love to hear more. There is some work 
I have planned to address some of these differences, but I'm not sure 
when I'll eventually get to it.


> 3. The mul_u64_u32_shr function [1] might show up soon, and it would
> potentially allow much longer intervals between timekeeping updates.
> I'm not sure whether the sub-ns formulation would still work, though (I
> suspect it would with some care).
>
Yeah, we already have a number of calculations that try to maximize the 
interval while still allowing for fine-tuned adjustments. Even so, we 
are somewhat bound by the need to provide tick-granular timestamps for 
filesystem code, etc.  So it may be limited to extending idle times out 
until folks are ok with either going with more expensive clock-granular 
timestamps for filesystem code, or loosening the granularity requirements somewhat.

But thanks for pointing it out, I'll keep an eye on it.

Thanks again!
-john


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/6][RFC] Rework vsyscall to avoid truncation/rounding issue in timekeeping core
  2012-09-18  0:20   ` John Stultz
@ 2012-09-18  0:43     ` Andy Lutomirski
  2012-09-18 18:02     ` Richard Cochran
  1 sibling, 0 replies; 27+ messages in thread
From: Andy Lutomirski @ 2012-09-18  0:43 UTC (permalink / raw)
  To: John Stultz
  Cc: linux-kernel, Tony Luck, Paul Mackerras, Benjamin Herrenschmidt,
	Martin Schwidefsky, Paul Turner, Steven Rostedt, Richard Cochran,
	Prarit Bhargava, Thomas Gleixner

On Mon, Sep 17, 2012 at 5:20 PM, John Stultz <john.stultz@linaro.org> wrote:
> On 09/17/2012 04:49 PM, Andy Lutomirski wrote:
>>
>> On Mon, Sep 17, 2012 at 3:04 PM, John Stultz <john.stultz@linaro.org>
>> wrote:

>>>
>>> Anyway, I'd greatly appreciate any thoughts or feedback on this
>>> approach.
>>
>> I haven't looked in any great detail, but the approach looks sensible
>> and should slow down the vsyscall code.
>
>
> Did you mean "shouldn't"?  Or are you concerned about something
> specifically?
>
> I know you've done quite a bit of tuning on the x86 vdso side, and I don't
> want to wreck that, so I'd appreciate any specific thoughts you have if you
> get a chance to look at it.

I meant "shouldn't".  The thing I use for testing is:
https://gitorious.org/linux-test-utils/linux-clock-tests

>
>
>> That being said, as long as you're playing with this, here are a
>> couple thoughts:
>>
>> 1. The TSC-reading code does this:
>>
>>         ret = (cycle_t)vget_cycles();
>>
>>         last = VVAR(vsyscall_gtod_data).clock.cycle_last;
>>
>>         if (likely(ret >= last))
>>                 return ret;
>>
>> I haven't specifically benchmarked the cost of that branch, but I
>> suspect it's a fairly large fraction of the total cost of
>> vclock_gettime.  IIUC, the point is that there might be a few cycles
>> worth of clock skew even on systems with otherwise usable TSCs, and we
>> don't want a different CPU to return complete garbage if the cycle
>> count is just below cycle_last.
>>
>> A different formulation would avoid the problem: set cycle_last to,
>> say, 100ms *before* the time of the last update_vsyscall, and adjust
>> the wall_time, etc variables accordingly.  That way a few cycles (or
>> anything up to 100ms) of skew won't cause an overflow.  Then you could
>> kill that branch.
>
> Interesting.  So I want to keep the scope of this patch set in check, so I'd
> probably hold off on something like this till later, but this might not be
> too complicated to do in the update_wall_time() function, basically delaying
> the accumulation by some amount of time.  Although this would have odd
> effects on things like filesystem timestamps, which provide "the time at the
> last tick", which would then be "the time at the last tick + Xms".  So it
> probably needs more careful consideration.

Fair enough.  One way to do this cleanly might be to have helpers that
effectively read CLOCK_REALTIME_COARSE and use those instead of xtime.

>
>
>
>
>> 2. There's nothing vsyscall-specific about the code in
>> vclock_gettime.c.  In fact, the VVAR macro should work just fine in
>> kernel code.  If you moved all this code into a header, then in-kernel
>> uses could use it, and maybe even other arches could use it.  Last
>> time I checked, it seemed like vclock_gettime was considerably faster
>> than whatever the in-kernel equivalent did.
>
> I like the idea of unifying the implementations, but I'd want to know more
> about why vclock_gettime was faster than the in-kernel getnstimeofday(),
> since it might be due to the more limited locking (we only update vsyscall
> data under the vsyscall lock, whereas the timekeeper lock is held for the
> entire execution of update_wall_time()), or because some of the
> optimizations in the vsyscall code are focused on providing timespecs to
> userland, whereas in-kernel we also have to provide ktime_ts.
>
> If you have any details here, I'd love to hear more. There is some work I
> have planned to address some of these differences, but I'm not sure when
> I'll eventually get to it.
>

I suspect it's because everyone who's benchmarked and tuned the "what
time is it?" code has focused on the vdso version, so the in-kernel
version hasn't gotten much love.  The vclock_gettime code has
basically no abstractions and uses only direct (i.e. explicitly
devirtualized) calls.  ktime_get, OTOH, uses clock->read, which, in
the best case, is read_tsc.  read_tsc (in arch/x86/kernel/tsc.c) is
missing the don't-use-cmov optimization.

The vclock_gettime code also produces a struct timespec without
div/mod, which is probably impossible for any code that goes through
ktime.

In the other direction, AFAICS the in-kernel code is completely
missing the rdtsc_barrier call, which is empirically rather necessary.
It's pretty easy to detect clock skew without it.
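
For reference, the pattern the vdso relies on is roughly this (a sketch; 
the real rdtsc_barrier() is patched to LFENCE or MFENCE via alternatives 
depending on the CPU):

	static inline u64 ordered_rdtsc(void)
	{
		unsigned int lo, hi;

		asm volatile("lfence" ::: "memory");	/* rdtsc_barrier() stand-in */
		asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
		return ((u64)hi << 32) | lo;
	}

Without the fence, the counter read can be reordered ahead of earlier 
loads, which is where the observable skew comes from.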

>
>
>> 3. The mul_u64_u32_shr function [1] might show up soon, and it would
>> potentially allow much longer intervals between timekeeping updates.
>> I'm not sure whether the sub-ns formulation would still work, though (I
>> suspect it would with some care).
>>
> Yea, we already have a number of calculations to try to maximize the
> interval while still allowing for fine-tuned adjustments. Even so, we are
> somewhat bound by the need to provide tick-granular timestamps for
> filesystem code, etc.  So it may be limited to extending idle times out
> until folks are ok with either going with more expensive clock-granular
> timestamps for filesystem code, or loosening the granularity needs somewhat.

Maybe this could depend on which clocksource is in use.  On recent
Intel boxen, the entire vclock_gettime call takes under 20 ns.  When
called from the kernel, it could be even faster if static keys (or
whatever they're called these days) were used.

FWIW, I don't understand the NTP code at all.  I'm pretty sure I know
what a PLL is, and I don't understand how that has anything to do with
NTP (although it makes sense in a PPS context).  I've always wondered
why the kernel's idea of time was any more complicated than "here's
the clock frequency; in addition, slew the time by X ns (or sub-ns)
over the next Y ns" -- some driver or even userspace code could
program that and the kernel could wake up when it needed to for the
update.  (To make this seamless, vclock_gettime would either need a
branch or the kernel would need to wake up a little early and update
the vsyscall data.)
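
In code, the model I have in mind is nothing fancier than this (the 
interface is entirely made up, and overflow in the multiply is ignored):

	struct slew {
		u64	start_ns;	/* when the slew was programmed */
		u64	window_ns;	/* Y: spread the adjustment over this long */
		s64	total_ns;	/* X: total adjustment, may be negative */
	};

	static u64 apply_slew(u64 raw_ns, const struct slew *s)
	{
		u64 t = raw_ns - s->start_ns;

		if (t >= s->window_ns)
			return raw_ns + s->total_ns;	/* slew complete */
		/* real code would need div64_s64() here */
		return raw_ns + (s->total_ns * (s64)t) / (s64)s->window_ns;
	}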

--Andy

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/6][RFC] Rework vsyscall to avoid truncation/rounding issue in timekeeping core
  2012-09-18  0:20   ` John Stultz
  2012-09-18  0:43     ` Andy Lutomirski
@ 2012-09-18 18:02     ` Richard Cochran
  2012-09-18 18:17       ` Andy Lutomirski
  2012-09-18 18:29       ` John Stultz
  1 sibling, 2 replies; 27+ messages in thread
From: Richard Cochran @ 2012-09-18 18:02 UTC (permalink / raw)
  To: John Stultz
  Cc: Andy Lutomirski, linux-kernel, Tony Luck, Paul Mackerras,
	Benjamin Herrenschmidt, Martin Schwidefsky, Paul Turner,
	Steven Rostedt, Prarit Bhargava, Thomas Gleixner

On Mon, Sep 17, 2012 at 05:20:41PM -0700, John Stultz wrote:
> On 09/17/2012 04:49 PM, Andy Lutomirski wrote:
> >2. There's nothing vsyscall-specific about the code in
> >vclock_gettime.c.  In fact, the VVAR macro should work just fine in
> >kernel code.  If you moved all this code into a header, then in-kernel
> >uses could use it, and maybe even other arches could use it.  Last
> >time I checked, it seemed like vclock_gettime was considerably faster
> >than whatever the in-kernel equivalent did.
> I like the idea of unifying the implementations, but I'd want to
> know more about why vclock_gettime was faster than the in-kernel
> getnstimeofday(), since it might be due to the more limited locking
> (we only update vsyscall data under the vsyscall lock, whereas the
> timekeeper lock is held for the entire execution of
> update_wall_time()), or because some of the optimizations in the
> vsyscall code are focused on providing timespecs to userland, whereas
> in-kernel we also have to provide ktime_ts.

Is there a valid technical reason why each arch has its own vdso
implementation?

If not, I would suggest that the first step would be to refactor these
into one C-language header. If this can be shared with kernel code,
then all the better.

It would make it a lot easier to fix the leap second thing, too.

Thanks,
Richard


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/6][RFC] Rework vsyscall to avoid truncation/rounding issue in timekeeping core
  2012-09-18 18:02     ` Richard Cochran
@ 2012-09-18 18:17       ` Andy Lutomirski
  2012-09-18 18:29       ` John Stultz
  1 sibling, 0 replies; 27+ messages in thread
From: Andy Lutomirski @ 2012-09-18 18:17 UTC (permalink / raw)
  To: Richard Cochran
  Cc: John Stultz, linux-kernel, Tony Luck, Paul Mackerras,
	Benjamin Herrenschmidt, Martin Schwidefsky, Paul Turner,
	Steven Rostedt, Prarit Bhargava, Thomas Gleixner

On Tue, Sep 18, 2012 at 11:02 AM, Richard Cochran
<richardcochran@gmail.com> wrote:
> On Mon, Sep 17, 2012 at 05:20:41PM -0700, John Stultz wrote:
>> On 09/17/2012 04:49 PM, Andy Lutomirski wrote:
>> >2. There's nothing vsyscall-specific about the code in
>> >vclock_gettime.c.  In fact, the VVAR macro should work just fine in
>> >kernel code.  If you moved all this code into a header, then in-kernel
>> >uses could use it, and maybe even other arches could use it.  Last
>> >time I checked, it seemed like vclock_gettime was considerably faster
>> >than whatever the in-kernel equivalent did.
>> I like the idea of unifying the implementations, but I'd want to
>> know more about why vclock_gettime was faster than the in-kernel
>> getnstimeofday(), since it might be due to the more limited locking
>> (we only update vsyscall data under the vsyscall lock, whereas the
>> timekeeper lock is held for the entire execution of
>> update_wall_time()), or because some of the optimizations in the
>> vsyscall code are focused on providing timespecs to userland, whereas
>> in-kernel we also have to provide ktime_ts.
>
> Is there a valid technical reason why each arch has its own vdso
> implementation?

I don't know too much about other arch vdsos.  i386's doesn't have
clock functions.  x32 works exactly like x86-64, except that it
probably involves a bit of addressing mode weirdness.  ia64's is very
strange indeed, I think.

In any case, the VVAR macro is an x86-64-ism, although if it were to
be the beginning of a generic mechanism, #define VVAR(x) (x) would be
a perfectly fine start, I think.
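
Concretely, something like this would let the same reader build in both 
contexts (gtod_cycle_last is just an illustrative name):

	#ifndef VVAR
	#define VVAR(x) (x)	/* in-kernel: a plain variable reference */
	#endif

	static inline cycle_t gtod_cycle_last(void)
	{
		return VVAR(vsyscall_gtod_data).clock.cycle_last;
	}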

>
> If not, I would suggest that the first step would be to refactor these
> into one C-language header. If this can be shared with kernel code,
> then all the better.

That should be most straightforward to do.

One issue: you can't call a function pointer from vdso code (because
the vdso is in a different place in different processes).  The
vclock_mode stuff would need to be extended to work across
architectures, and the fallback to a real syscall would need to turn
into something else.

--Andy

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/6][RFC] Rework vsyscall to avoid truncation/rounding issue in timekeeping core
  2012-09-18 18:02     ` Richard Cochran
  2012-09-18 18:17       ` Andy Lutomirski
@ 2012-09-18 18:29       ` John Stultz
  2012-09-19  4:50         ` Richard Cochran
  1 sibling, 1 reply; 27+ messages in thread
From: John Stultz @ 2012-09-18 18:29 UTC (permalink / raw)
  To: Richard Cochran
  Cc: Andy Lutomirski, linux-kernel, Tony Luck, Paul Mackerras,
	Benjamin Herrenschmidt, Martin Schwidefsky, Paul Turner,
	Steven Rostedt, Prarit Bhargava, Thomas Gleixner

On 09/18/2012 11:02 AM, Richard Cochran wrote:
> On Mon, Sep 17, 2012 at 05:20:41PM -0700, John Stultz wrote:
>> On 09/17/2012 04:49 PM, Andy Lutomirski wrote:
>>> 2. There's nothing vsyscall-specific about the code in
>>> vclock_gettime.c.  In fact, the VVAR macro should work just fine in
>>> kernel code.  If you moved all this code into a header, then in-kernel
>>> uses could use it, and maybe even other arches could use it.  Last
>>> time I checked, it seemed like vclock_gettime was considerably faster
>>> than whatever the in-kernel equivalent did.
>> I like the idea of unifying the implementations, but I'd want to
>> know more about why vclock_gettime was faster than the in-kernel
>> getnstimeofday(), since it might be due to the more limited locking
>> (we only update vsyscall data under the vsyscall lock, whereas the
>> timekeeper lock is held for the entire execution of
>> update_wall_time()), or because some of the optimizations in the
>> vsyscall code are focused on providing timespecs to userland, whereas
>> in-kernel we also have to provide ktime_ts.
> Is there a valid technical reason why each arch has its own vdso
> implementation?
I believe it's mostly historical, but on some architectures that history 
has become an established ABI, making it technical.

powerpc, for example, exports timekeeping data at a specific address, and 
the code logic to use that data is in userland libraries, outside of 
kernel control.  ia64 uses an fsyscall method, which is (to my 
understanding) a mode that allows limited access to kernel data from 
userland, but restricts what instructions can be used, requiring it to 
be hand-written in asm.

Now, x86_64 too had its own magic vsyscall address that was hard-coded, 
but Andy did some very cool work allowing that to bounce to the normal 
syscall for compatibility, allowing the nicer vdso method to be used.  
It may be that such a vdso method could be introduced and migrated to on 
these other arches, but we'd still have to preserve the existing ABI as 
well (and in cases like ppc, that preservation would be just as 
complicated as it is now).

> If not, I would suggest that the first step would be to refactor these
> into one C-language header. If this can be shared with kernel code,
> then all the better.
>
> It would make it a lot easier to fix the leap second thing, too.
Indeed, it would be nice.  Tweaking the ia64 fsyscall isn't anything I 
look forward to. :)

But such heavy lifting will likely need to be done by arch maintainers. 
That's why with this patchset I preserve the existing method, but make 
it clear it's deprecated and allow arches that don't need the old method 
to avoid the extra overhead caused by the additional rounding fix. Then 
those arches can migrate when they can, rather than having to block 
change on everyone conforming to a new standard.

thanks
-john


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/6][RFC] Rework vsyscall to avoid truncation/rounding issue in timekeeping core
  2012-09-18 18:29       ` John Stultz
@ 2012-09-19  4:50         ` Richard Cochran
  2012-09-19  5:30           ` Andy Lutomirski
  2012-09-19 16:31           ` John Stultz
  0 siblings, 2 replies; 27+ messages in thread
From: Richard Cochran @ 2012-09-19  4:50 UTC (permalink / raw)
  To: John Stultz
  Cc: Andy Lutomirski, linux-kernel, Tony Luck, Paul Mackerras,
	Benjamin Herrenschmidt, Martin Schwidefsky, Paul Turner,
	Steven Rostedt, Prarit Bhargava, Thomas Gleixner

On Tue, Sep 18, 2012 at 11:29:50AM -0700, John Stultz wrote:
> I believe it's mostly historical, but on some architectures that
> history has become an established ABI, making it technical.

Fine, but what do you mean by "ABI?" Are you talking about magic
addresses for functions?

Without knowing the dirty details, what I imagine is a jump/branch
from the arch-specific code into the common implementation.

Can that be done?

Thanks,
Richard

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/6][RFC] Rework vsyscall to avoid truncation/rounding issue in timekeeping core
  2012-09-19  4:50         ` Richard Cochran
@ 2012-09-19  5:30           ` Andy Lutomirski
  2012-09-19 16:31           ` John Stultz
  1 sibling, 0 replies; 27+ messages in thread
From: Andy Lutomirski @ 2012-09-19  5:30 UTC (permalink / raw)
  To: Richard Cochran
  Cc: John Stultz, linux-kernel, Tony Luck, Paul Mackerras,
	Benjamin Herrenschmidt, Martin Schwidefsky, Paul Turner,
	Steven Rostedt, Prarit Bhargava, Thomas Gleixner

On Tue, Sep 18, 2012 at 9:50 PM, Richard Cochran
<richardcochran@gmail.com> wrote:
> On Tue, Sep 18, 2012 at 11:29:50AM -0700, John Stultz wrote:
>> I believe it's mostly historical, but on some architectures that
>> history has become an established ABI, making it technical.
>
> Fine, but what do you mean by "ABI?" Are you talking about magic
> addresses for functions?
>
> Without knowing the dirty details, what I imagine is a jump/branch
> from the arch-specific code into the common implementation.

I suspect (based mostly on speculation) that this is possible for
everything except ia64.  ia64 looks like it uses a highly magical
syscall entry, where user code jumps to a special address for *all*
syscalls, and that code does some kind of partial kernel entry.  Very
careful asm code can run in that weird state, and that code implements
gettimeofday.  Now, if that code could return to normal execution
and jump to a vdso, it could use common code, I guess.

This assumes that all architectures that want to use vclock_gettime,
etc are capable of sharing a page of read-only kernel memory with
userspace at a fixed address or can otherwise make the VVAR macro
work.  The only architectures I actually understand are x86-{32,64},
which can.  I don't think anyone cares about x86-32 enough to
implement it, though.

--Andy

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/6][RFC] Rework vsyscall to avoid truncation/rounding issue in timekeeping core
  2012-09-19  4:50         ` Richard Cochran
  2012-09-19  5:30           ` Andy Lutomirski
@ 2012-09-19 16:31           ` John Stultz
  2012-09-19 17:03             ` Richard Cochran
  1 sibling, 1 reply; 27+ messages in thread
From: John Stultz @ 2012-09-19 16:31 UTC (permalink / raw)
  To: Richard Cochran
  Cc: Andy Lutomirski, linux-kernel, Tony Luck, Paul Mackerras,
	Benjamin Herrenschmidt, Martin Schwidefsky, Paul Turner,
	Steven Rostedt, Prarit Bhargava, Thomas Gleixner

On 09/18/2012 09:50 PM, Richard Cochran wrote:
> On Tue, Sep 18, 2012 at 11:29:50AM -0700, John Stultz wrote:
>> I believe it's mostly historical, but on some architectures that
>> history has become an established ABI, making it technical.
> Fine, but what do you mean by "ABI?" Are you talking about magic
> addresses for functions?
On powerpc, I mean magic addresses where userland can find structures 
that it can use to calculate time.

On ia64 I mean the fsyscall method (which is arch specific).

> Without knowing the dirty details, what I imagine is a jump/branch
> from the arch-specific code into the common implementation.
>
> Can that be done?
In the two cases above, what you suggest unfortunately isn't possible 
(at least to my understanding - arch maintainers jump in to correct me).

With powerpc, there is no arch specific kernel code involved, it's just a 
data structure the kernel exports that is accessible to userland. The 
execution logic lives in userland libraries, or sometimes application 
code itself.

With ia64's fsyscall, it's a special mode that limits what you can do and 
which registers you can access. So you couldn't just jump to other code 
while in that mode.

But maybe someone has a neat idea on how to get around this?

thanks
-john


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/6][RFC] Rework vsyscall to avoid truncation/rounding issue in timekeeping core
  2012-09-19 16:31           ` John Stultz
@ 2012-09-19 17:03             ` Richard Cochran
  2012-09-19 17:54               ` John Stultz
  0 siblings, 1 reply; 27+ messages in thread
From: Richard Cochran @ 2012-09-19 17:03 UTC (permalink / raw)
  To: John Stultz
  Cc: Andy Lutomirski, linux-kernel, Tony Luck, Paul Mackerras,
	Benjamin Herrenschmidt, Martin Schwidefsky, Paul Turner,
	Steven Rostedt, Prarit Bhargava, Thomas Gleixner

On Wed, Sep 19, 2012 at 09:31:35AM -0700, John Stultz wrote:

> On powerpc, I mean magic addresses where userland can find
> structures that it can use to calculate time.

...
 
> With powerpc, there is no arch specific kernel code involved, it's
> just a data structure the kernel exports that is accessible to
> userland. The execution logic lives in userland libraries, or
> sometimes application code itself.

I took a brief look at arch/powerpc/kernel/vdso32/gettimeofday.S and
arch/powerpc/kernel/vdso64/gettimeofday.S, and I see what looks a lot
like functions

$ find arch/powerpc/kernel/vdso* -name gettimeofday.S|xargs grep FUNCTION_BEGIN

arch/powerpc/kernel/vdso32/gettimeofday.S:V_FUNCTION_BEGIN(__kernel_gettimeofday)
arch/powerpc/kernel/vdso32/gettimeofday.S:V_FUNCTION_BEGIN(__kernel_clock_gettime)
arch/powerpc/kernel/vdso32/gettimeofday.S:V_FUNCTION_BEGIN(__kernel_clock_getres)
arch/powerpc/kernel/vdso64/gettimeofday.S:V_FUNCTION_BEGIN(__kernel_gettimeofday)
arch/powerpc/kernel/vdso64/gettimeofday.S:V_FUNCTION_BEGIN(__kernel_clock_gettime)
arch/powerpc/kernel/vdso64/gettimeofday.S:V_FUNCTION_BEGIN(__kernel_clock_getres)
arch/powerpc/kernel/vdso64/gettimeofday.S:V_FUNCTION_BEGIN(__do_get_tspec)

and I wonder whether these could be done in C instead.

Thanks,
Richard

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/6][RFC] Rework vsyscall to avoid truncation/rounding issue in timekeeping core
  2012-09-19 17:03             ` Richard Cochran
@ 2012-09-19 17:54               ` John Stultz
  2012-09-19 18:26                 ` Andy Lutomirski
  0 siblings, 1 reply; 27+ messages in thread
From: John Stultz @ 2012-09-19 17:54 UTC (permalink / raw)
  To: Richard Cochran
  Cc: Andy Lutomirski, linux-kernel, Tony Luck, Paul Mackerras,
	Benjamin Herrenschmidt, Martin Schwidefsky, Paul Turner,
	Steven Rostedt, Prarit Bhargava, Thomas Gleixner

On 09/19/2012 10:03 AM, Richard Cochran wrote:
> On Wed, Sep 19, 2012 at 09:31:35AM -0700, John Stultz wrote:
>> With powerpc, there is no arch specific kernel code involved, it's
>> just a data structure the kernel exports that is accessible to
>> userland. The execution logic lives in userland libraries, or
>> sometimes application code itself.
> I took a brief look at arch/powerpc/kernel/vdso32/gettimeofday.S and
> arch/powerpc/kernel/vdso64/gettimeofday.S, and I see what looks a lot
> like functions
Sorry, yes. My statement wasn't subtle enough (and I may be confusing my 
history).

You are right, there is arch specific code involved, but the data 
structure that is exported is considered part of the ABI since some 
applications access it directly.

See the comments and structure in:
arch/powerpc/include/asm/vdso_datapage.h


> $ find arch/powerpc/kernel/vdso* -name gettimeofday.S|xargs grep FUNCTION_BEGIN
>
> arch/powerpc/kernel/vdso32/gettimeofday.S:V_FUNCTION_BEGIN(__kernel_gettimeofday)
> arch/powerpc/kernel/vdso32/gettimeofday.S:V_FUNCTION_BEGIN(__kernel_clock_gettime)
> arch/powerpc/kernel/vdso32/gettimeofday.S:V_FUNCTION_BEGIN(__kernel_clock_getres)
> arch/powerpc/kernel/vdso64/gettimeofday.S:V_FUNCTION_BEGIN(__kernel_gettimeofday)
> arch/powerpc/kernel/vdso64/gettimeofday.S:V_FUNCTION_BEGIN(__kernel_clock_gettime)
> arch/powerpc/kernel/vdso64/gettimeofday.S:V_FUNCTION_BEGIN(__kernel_clock_getres)
> arch/powerpc/kernel/vdso64/gettimeofday.S:V_FUNCTION_BEGIN(__do_get_tspec)
>
> and I wonder whether these could be done in C instead.

Possibly, but I suspect they're in asm for performance reasons.

Paul/Ben: Do you have any thoughts here?

thanks
-john


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/6][RFC] Rework vsyscall to avoid truncation/rounding issue in timekeeping core
  2012-09-19 17:54               ` John Stultz
@ 2012-09-19 18:26                 ` Andy Lutomirski
  2012-09-19 20:50                   ` Luck, Tony
  0 siblings, 1 reply; 27+ messages in thread
From: Andy Lutomirski @ 2012-09-19 18:26 UTC (permalink / raw)
  To: John Stultz
  Cc: Richard Cochran, linux-kernel, Tony Luck, Paul Mackerras,
	Benjamin Herrenschmidt, Martin Schwidefsky, Paul Turner,
	Steven Rostedt, Prarit Bhargava, Thomas Gleixner

On Wed, Sep 19, 2012 at 10:54 AM, John Stultz <john.stultz@linaro.org> wrote:
> On 09/19/2012 10:03 AM, Richard Cochran wrote:
>>
>> On Wed, Sep 19, 2012 at 09:31:35AM -0700, John Stultz wrote:
>>>
>>> With powerpc, there is no arch specific kernel code involved, it's
>>> just a data structure the kernel exports that is accessible to
>>> userland. The execution logic lives in userland libraries, or
>>> sometimes application code itself.
>>
>> I took a brief look at arch/powerpc/kernel/vdso32/gettimeofday.S and
>> arch/powerpc/kernel/vdso64/gettimeofday.S, and I see what looks a lot
>> like functions
>
> Sorry, yes. My statement wasn't subtle enough (and I may be confusing my
> history).
>
> You are right, there is arch specific code involved, but the data structure
> that is exported is considered part of the ABI since some applications
> access it directly.
>
> See the comments and structure in:
> arch/powerpc/include/asm/vdso_datapage.h
>
>
>
>> $ find arch/powerpc/kernel/vdso* -name gettimeofday.S|xargs grep
>> FUNCTION_BEGIN
>>
>>
>> arch/powerpc/kernel/vdso32/gettimeofday.S:V_FUNCTION_BEGIN(__kernel_gettimeofday)
>>
>> arch/powerpc/kernel/vdso32/gettimeofday.S:V_FUNCTION_BEGIN(__kernel_clock_gettime)
>>
>> arch/powerpc/kernel/vdso32/gettimeofday.S:V_FUNCTION_BEGIN(__kernel_clock_getres)
>>
>> arch/powerpc/kernel/vdso64/gettimeofday.S:V_FUNCTION_BEGIN(__kernel_gettimeofday)
>>
>> arch/powerpc/kernel/vdso64/gettimeofday.S:V_FUNCTION_BEGIN(__kernel_clock_gettime)
>>
>> arch/powerpc/kernel/vdso64/gettimeofday.S:V_FUNCTION_BEGIN(__kernel_clock_getres)
>> arch/powerpc/kernel/vdso64/gettimeofday.S:V_FUNCTION_BEGIN(__do_get_tspec)
>>
>> and I wonder whether these could be done in C instead.
>
>
> Possibly, but I suspect they're in asm for performance reasons.
>
> Paul/Ben: Do you have any thoughts here?
>

Does anything except the vDSO actually use the vDSO data page?  It's
mapped as part of the vDSO image (i.e. at a non-constant address), and
it's not immediately obvious how userspace would locate that page.

FWIW, this is kind of cute and is not obviously worse than the way
that x86-64 does it, other than the possible ABI issue.
Pros:
 - vDSO code could use RIP-relative addressing instead of absolute
addressing, although I'm not sure whether there's actually a RIP +
16-bit offset encoding, so this might not be a win
 - Avoids a read-only page at a constant address
 - The mapping is per-process, so evil tricks could be used to shove,
say, the pid in there.  This is almost certainly not worth it, especially
given the odd things that clone can do and that glibc already ought to
cache the pid.
Cons:
 - If ptrace or mprotect pokes that page, the clock stops for that process.
 - Lots of complexity for minimal gain.

In any case, it looks like #define VVAR(x) vdso_data->x would work on
ppc, although the vvar specification mechanism would not work.

--Andy

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [PATCH 0/6][RFC] Rework vsyscall to avoid truncation/rounding issue in timekeeping core
  2012-09-19 18:26                 ` Andy Lutomirski
@ 2012-09-19 20:50                   ` Luck, Tony
  2012-09-19 21:11                     ` John Stultz
  2012-09-19 21:15                     ` Andy Lutomirski
  0 siblings, 2 replies; 27+ messages in thread
From: Luck, Tony @ 2012-09-19 20:50 UTC (permalink / raw)
  To: Andy Lutomirski, John Stultz
  Cc: Richard Cochran, linux-kernel, Paul Mackerras,
	Benjamin Herrenschmidt, Martin Schwidefsky, Paul Turner,
	Steven Rostedt, Prarit Bhargava, Thomas Gleixner

> Does anything except the vDSO actually use the vDSO data page?  It's
> mapped as part of the vDSO image (i.e. at a non-constant address), and
> it's not immediately obvious how userspace would locate that page.

Just for reference - on ia64 the address of the entry point for the magic
fast system call page is passed to each application via the "auxv" structure
that exec(2) drops at the top of stack after args/env in the AT_SYSINFO
entry. Apps look for it to find out where to jump for fast system call entry
(if it isn't there, they fall back to regular slow syscall path).

The same method could be used to provide the address of a magic
read-only-for-users page that the kernel fills with stuff for simple
system calls.
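
From userland the lookup is just this (modern glibc spelling shown; 
older code walks the auxv words above the environment itself):

	#include <elf.h>
	#include <sys/auxv.h>

	static void *find_fast_syscall_entry(void)
	{
		/* returns NULL if the kernel didn't supply AT_SYSINFO */
		return (void *)getauxval(AT_SYSINFO);
	}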

-Tony

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/6][RFC] Rework vsyscall to avoid truncation/rounding issue in timekeeping core
  2012-09-19 20:50                   ` Luck, Tony
@ 2012-09-19 21:11                     ` John Stultz
  2012-09-20  7:36                       ` Richard Cochran
  2012-09-19 21:15                     ` Andy Lutomirski
  1 sibling, 1 reply; 27+ messages in thread
From: John Stultz @ 2012-09-19 21:11 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Andy Lutomirski, Richard Cochran, linux-kernel, Paul Mackerras,
	Benjamin Herrenschmidt, Martin Schwidefsky, Paul Turner,
	Steven Rostedt, Prarit Bhargava, Thomas Gleixner

On 09/19/2012 01:50 PM, Luck, Tony wrote:
>> Does anything except the vDSO actually use the vDSO data page?  It's
>> mapped as part of the vDSO image (i.e. at a non-constant address), and
>> it's not immediately obvious how userspace would locate that page.
> Just for reference - on ia64 the address of the entry point for the magic
> fast system call page is passed to each application via the "auxv" structure
> that exec(2) drops at the top of stack after args/env in the AT_SYSINFO
> entry. Apps look for it to find out where to jump for fast system call entry
> (if it isn't there, they fall back to regular slow syscall path).

Nice. So we could "disable" fsyscalls on a kernel and not break 
userland. That makes it easier to replace with the vdso method at some 
point. So that's good to hear!

> The same method could be used to provide the address of a magic
> read-only-for-users page that the kernel fills with stuff for simple
> system calls.

Now, I suspect the difficult part will be finding someone with the time 
and interest to try to get the vdso gettime working on ia64 (or s390 or 
powerpc), and then try to unify the kernel side implementation. Reducing 
the maintenance burden might not be inspirational enough, especially if 
there is a small performance cost along with it.

I sort of suspect that it's more likely that such unification work could 
be done as part of enabling vdso on an otherwise non-vsyscall-enabled 
arch (like ARM), since at least there they have the carrot of the 
performance gain to drive them.

And also, all this discussion is a bit far afield of the patchset I'm 
proposing here. :)  I'd still like to hear any thoughts on it from the 
various arch maintainers, otherwise I'll submit it to Thomas.

thanks
-john


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/6][RFC] Rework vsyscall to avoid truncation/rounding issue in timekeeping core
  2012-09-19 20:50                   ` Luck, Tony
  2012-09-19 21:11                     ` John Stultz
@ 2012-09-19 21:15                     ` Andy Lutomirski
  1 sibling, 0 replies; 27+ messages in thread
From: Andy Lutomirski @ 2012-09-19 21:15 UTC (permalink / raw)
  To: Luck, Tony
  Cc: John Stultz, Richard Cochran, linux-kernel, Paul Mackerras,
	Benjamin Herrenschmidt, Martin Schwidefsky, Paul Turner,
	Steven Rostedt, Prarit Bhargava, Thomas Gleixner

On Wed, Sep 19, 2012 at 1:50 PM, Luck, Tony <tony.luck@intel.com> wrote:
>> Does anything except the vDSO actually use the vDSO data page?  It's
>> mapped as part of the vDSO image (i.e. at a non-constant address), and
>> it's not immediately obvious how userspace would locate that page.
>
> Just for reference - on ia64 the address of the entry point for the magic
> fast system call page is passed to each application via the "auxv" structure
> that exec(2) drops at the top of stack after args/env in the AT_SYSINFO
> entry. Apps look for it to find out where to jump for fast system call entry
> (if it isn't there, they fall back to regular slow syscall path).
>
> The same method could be used to provide the address of a magic
> read-only-for-users page that the kernel fills with stuff for simple
> system calls.

Erk.  I'd much rather *not* pass this information to user apps -- if
they only ever access this stuff via the vDSO, then there's no ABI
issue the next time the data structure changes.  (e.g. on x86-64 it's
changed at least twice this year, and nothing noticed.)

--Andy

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/6][RFC] Rework vsyscall to avoid truncation/rounding issue in timekeeping core
  2012-09-19 21:11                     ` John Stultz
@ 2012-09-20  7:36                       ` Richard Cochran
  0 siblings, 0 replies; 27+ messages in thread
From: Richard Cochran @ 2012-09-20  7:36 UTC (permalink / raw)
  To: John Stultz
  Cc: Luck, Tony, Andy Lutomirski, linux-kernel, Paul Mackerras,
	Benjamin Herrenschmidt, Martin Schwidefsky, Paul Turner,
	Steven Rostedt, Prarit Bhargava, Thomas Gleixner

On Wed, Sep 19, 2012 at 02:11:02PM -0700, John Stultz wrote:
> Now, I suspect the difficult part will be finding someone with the
> time and interest to try to get the vdso gettime working on ia64 (or
> s390 or powerpc), and then try to unify the kernel side
> implementation. Reducing the maintenance burden might not be
> inspirational enough, especially if there is a small performance
> cost along with it.

Small performance cost versus correct timekeeping?

I couldn't help but notice all of the leap second issues and fixes
that, once again, appeared this summer. It is great that John looks
after this stuff, but I cannot avoid the image of Hans Brinker
stopping leaks with his fingers.

There is a way to fix this issue once and for all (as we discussed
before). But in order to implement it, one would have to change all of
the vdsos, too. So if there is a way to refactor the vdsos, then I see
this as a most "inspirational" task.

> And also, all this discussion is a bit far afield of the patchset
> I'm proposing here. :)  I'd still like to hear any thoughts on it
> from the various arch maintainers, otherwise I'll submit it to

(Sorry for the continuing off topic rant)

Thanks,
Richard

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/6][RFC] Rework vsyscall to avoid truncation/rounding issue in timekeeping core
  2012-09-17 23:49 ` [PATCH 0/6][RFC] Rework vsyscall to avoid truncation/rounding issue in timekeeping core Andy Lutomirski
  2012-09-18  0:20   ` John Stultz
@ 2012-09-20 14:31   ` Steven Rostedt
  2012-09-20 17:32     ` Andy Lutomirski
  1 sibling, 1 reply; 27+ messages in thread
From: Steven Rostedt @ 2012-09-20 14:31 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: John Stultz, linux-kernel, Tony Luck, Paul Mackerras,
	Benjamin Herrenschmidt, Martin Schwidefsky, Paul Turner,
	Richard Cochran, Prarit Bhargava, Thomas Gleixner

On Mon, 2012-09-17 at 16:49 -0700, Andy Lutomirski wrote:

> I haven't looked in any great detail, but the approach looks sensible
> and should slow down the vsyscall code.
> 
> That being said, as long as you're playing with this, here are a
> couple thoughts:
> 
> 1. The TSC-reading code does this:
> 
> 	ret = (cycle_t)vget_cycles();
> 
> 	last = VVAR(vsyscall_gtod_data).clock.cycle_last;
> 
> 	if (likely(ret >= last))
> 		return ret;
> 
> I haven't specifically benchmarked the cost of that branch, but I
> suspect it's a fairly large fraction of the total cost of
> vclock_gettime.  IIUC, the point is that there might be a few cycles
> worth of clock skew even on systems with otherwise usable TSCs, and we
> don't want a different CPU to return complete garbage if the cycle
> count is just below cycle_last.
> 
> A different formulation would avoid the problem: set cycle_last to,
> say, 100ms *before* the time of the last update_vsyscall, and adjust
> the wall_time, etc variables accordingly.  That way a few cycles (or
> anything up to 100ms) of skew won't cause an overflow.  Then you could
> kill that branch.
> 

I'm curious... If the task gets preempted after reading ret, and doesn't
get to run again for another 200ms, would that break it?

-- Steve



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/6][RFC] Rework vsyscall to avoid truncation/rounding issue in timekeeping core
  2012-09-20 14:31   ` Steven Rostedt
@ 2012-09-20 17:32     ` Andy Lutomirski
  0 siblings, 0 replies; 27+ messages in thread
From: Andy Lutomirski @ 2012-09-20 17:32 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: John Stultz, linux-kernel, Tony Luck, Paul Mackerras,
	Benjamin Herrenschmidt, Martin Schwidefsky, Paul Turner,
	Richard Cochran, Prarit Bhargava, Thomas Gleixner

On Thu, Sep 20, 2012 at 7:31 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Mon, 2012-09-17 at 16:49 -0700, Andy Lutomirski wrote:
>
>> I haven't looked in any great detail, but the approach looks sensible
>> and should slow down the vsyscall code.
>>
>> That being said, as long as you're playing with this, here are a
>> couple thoughts:
>>
>> 1. The TSC-reading code does this:
>>
>>       ret = (cycle_t)vget_cycles();
>>
>>       last = VVAR(vsyscall_gtod_data).clock.cycle_last;
>>
>>       if (likely(ret >= last))
>>               return ret;
>>
>> I haven't specifically benchmarked the cost of that branch, but I
>> suspect it's a fairly large fraction of the total cost of
>> vclock_gettime.  IIUC, the point is that there might be a few cycles
>> worth of clock skew even on systems with otherwise usable TSCs, and we
>> don't want a different CPU to return complete garbage if the cycle
>> count is just below cycle_last.
>>
>> A different formulation would avoid the problem: set cycle_last to,
>> say, 100ms *before* the time of the last update_vsyscall, and adjust
>> the wall_time, etc variables accordingly.  That way a few cycles (or
>> anything up to 100ms) of skew won't cause an overflow.  Then you could
>> kill that branch.
>>
>
> I'm curious... If the task gets preempted after reading ret, and doesn't
> get to run again for another 200ms, would that break it?

Only if cycle_last changes while preempted (or from a different CPU).
That case is covered by the seqlock in do_realtime and do_monotonic.
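
For reference, the read side is a retry loop, roughly like this 
(simplified, from memory):

	do {
		seq = read_seqbegin(&gtod->lock);
		ts->tv_sec = gtod->wall_time_sec;
		ts->tv_nsec = gtod->wall_time_nsec;
		ns = vgetns();		/* reads the TSC */
	} while (unlikely(read_seqretry(&gtod->lock, seq)));
	timespec_add_ns(ts, ns);

If cycle_last (or anything else) changed while we were preempted, the 
sequence count won't match and we just go around again.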

--Andy

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/6][RFC] time: Move update_vsyscall definitions to timekeeper_internal.h
  2012-09-17 22:04 ` [PATCH 2/6][RFC] time: Move update_vsyscall definitions to timekeeper_internal.h John Stultz
@ 2012-09-27  3:14   ` Paul Mackerras
  0 siblings, 0 replies; 27+ messages in thread
From: Paul Mackerras @ 2012-09-27  3:14 UTC (permalink / raw)
  To: John Stultz
  Cc: linux-kernel, Tony Luck, Benjamin Herrenschmidt, Andy Lutomirski,
	Martin Schwidefsky, Paul Turner, Steven Rostedt, Richard Cochran,
	Prarit Bhargava, Thomas Gleixner

On Mon, Sep 17, 2012 at 06:04:57PM -0400, John Stultz wrote:
> Since users will need to include timekeeper_internal.h, move
> update_vsyscall definitions to timekeeper_internal.h.
> 
> Cc: Tony Luck <tony.luck@intel.com>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Andy Lutomirski <luto@amacapital.net>
> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
> Cc: Paul Turner <pjt@google.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Richard Cochran <richardcochran@gmail.com>
> Cc: Prarit Bhargava <prarit@redhat.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: John Stultz <john.stultz@linaro.org>

Acked-by: Paul Mackerras <paulus@samba.org>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/6][RFC] time: Convert CONFIG_GENERIC_TIME_VSYSCALL to CONFIG_GENERIC_TIME_VSYSCALL_OLD
  2012-09-17 22:04 ` [PATCH 3/6][RFC] time: Convert CONFIG_GENERIC_TIME_VSYSCALL to CONFIG_GENERIC_TIME_VSYSCALL_OLD John Stultz
@ 2012-09-27  3:14   ` Paul Mackerras
  0 siblings, 0 replies; 27+ messages in thread
From: Paul Mackerras @ 2012-09-27  3:14 UTC (permalink / raw)
  To: John Stultz
  Cc: linux-kernel, Tony Luck, Benjamin Herrenschmidt, Andy Lutomirski,
	Martin Schwidefsky, Paul Turner, Steven Rostedt, Richard Cochran,
	Prarit Bhargava, Thomas Gleixner

On Mon, Sep 17, 2012 at 06:04:58PM -0400, John Stultz wrote:
> To help migrate architectures over to the new update_vsyscall method,
> redefine CONFIG_GENERIC_TIME_VSYSCALL as CONFIG_GENERIC_TIME_VSYSCALL_OLD
> 
> Cc: Tony Luck <tony.luck@intel.com>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Andy Lutomirski <luto@amacapital.net>
> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
> Cc: Paul Turner <pjt@google.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Richard Cochran <richardcochran@gmail.com>
> Cc: Prarit Bhargava <prarit@redhat.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: John Stultz <john.stultz@linaro.org>

Acked-by: Paul Mackerras <paulus@samba.org>

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2012-09-27  3:15 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-09-17 22:04 [PATCH 0/6][RFC] Rework vsyscall to avoid truncation/rounding issue in timekeeping core John Stultz
2012-09-17 22:04 ` [PATCH 1/6][RFC] time: Move timekeeper structure to timekeeper_internal.h for vsyscall changes John Stultz
2012-09-17 22:04 ` [PATCH 2/6][RFC] time: Move update_vsyscall definitions to timekeeper_internal.h John Stultz
2012-09-27  3:14   ` Paul Mackerras
2012-09-17 22:04 ` [PATCH 3/6][RFC] time: Convert CONFIG_GENERIC_TIME_VSYSCALL to CONFIG_GENERIC_TIME_VSYSCALL_OLD John Stultz
2012-09-27  3:14   ` Paul Mackerras
2012-09-17 22:04 ` [PATCH 4/6][RFC] time: Introduce new GENERIC_TIME_VSYSCALL John Stultz
2012-09-17 22:05 ` [PATCH 5/6][RFC] time: Only do nanosecond rounding on GENERIC_TIME_VSYSCALL_OLD systems John Stultz
2012-09-17 22:05 ` [PATCH 6/6][RFC] time: Convert x86_64 to using new update_vsyscall John Stultz
2012-09-17 23:49 ` [PATCH 0/6][RFC] Rework vsyscall to avoid truncation/rounding issue in timekeeping core Andy Lutomirski
2012-09-18  0:20   ` John Stultz
2012-09-18  0:43     ` Andy Lutomirski
2012-09-18 18:02     ` Richard Cochran
2012-09-18 18:17       ` Andy Lutomirski
2012-09-18 18:29       ` John Stultz
2012-09-19  4:50         ` Richard Cochran
2012-09-19  5:30           ` Andy Lutomirski
2012-09-19 16:31           ` John Stultz
2012-09-19 17:03             ` Richard Cochran
2012-09-19 17:54               ` John Stultz
2012-09-19 18:26                 ` Andy Lutomirski
2012-09-19 20:50                   ` Luck, Tony
2012-09-19 21:11                     ` John Stultz
2012-09-20  7:36                       ` Richard Cochran
2012-09-19 21:15                     ` Andy Lutomirski
2012-09-20 14:31   ` Steven Rostedt
2012-09-20 17:32     ` Andy Lutomirski
