* [PATCH v3] x86/vdso: Handle clock_gettime(CLOCK_TAI) in vDSO @ 2018-09-01 1:59 Matt Rickard 2018-09-01 3:39 ` Andy Lutomirski 0 siblings, 1 reply; 16+ messages in thread From: Matt Rickard @ 2018-09-01 1:59 UTC (permalink / raw) To: Thomas Gleixner; +Cc: Andy Lutomirski, Stephen Boyd, John Stultz, X86 ML, LKML Process clock_gettime(CLOCK_TAI) in vDSO. This makes the call about as fast as CLOCK_REALTIME and CLOCK_MONOTONIC: nanoseconds before after clockname ---- ----- --------- 233 87 CLOCK_TAI 96 93 CLOCK_REALTIME 88 87 CLOCK_MONOTONIC Signed-off-by: Matt Rickard <matt@softrans.com.au> --- arch/x86/entry/vdso/vclock_gettime.c | 25 +++++++++++++++++++++++++ arch/x86/entry/vsyscall/vsyscall_gtod.c | 2 ++ arch/x86/include/asm/vgtod.h | 1 + 3 files changed, 28 insertions(+) diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c index f19856d95c60..91ed1bb2a3bb 100644 --- a/arch/x86/entry/vdso/vclock_gettime.c +++ b/arch/x86/entry/vdso/vclock_gettime.c @@ -246,6 +246,27 @@ notrace static int __always_inline do_monotonic(struct timespec *ts) return mode; } +notrace static int __always_inline do_tai(struct timespec *ts) +{ + unsigned long seq; + u64 ns; + int mode; + + do { + seq = gtod_read_begin(gtod); + mode = gtod->vclock_mode; + ts->tv_sec = gtod->tai_time_sec; + ns = gtod->wall_time_snsec; + ns += vgetsns(&mode); + ns >>= gtod->shift; + } while (unlikely(gtod_read_retry(gtod, seq))); + + ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns); + ts->tv_nsec = ns; + + return mode; +} + notrace static void do_realtime_coarse(struct timespec *ts) { unsigned long seq; @@ -277,6 +298,10 @@ notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts) if (do_monotonic(ts) == VCLOCK_NONE) goto fallback; break; + case CLOCK_TAI: + if (do_tai(ts) == VCLOCK_NONE) + goto fallback; + break; case CLOCK_REALTIME_COARSE: do_realtime_coarse(ts); break; diff --git a/arch/x86/entry/vsyscall/vsyscall_gtod.c b/arch/x86/entry/vsyscall/vsyscall_gtod.c index e1216dd95c04..d61392fe17f6 100644 --- a/arch/x86/entry/vsyscall/vsyscall_gtod.c +++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c @@ -53,6 +53,8 @@ void update_vsyscall(struct timekeeper *tk) vdata->monotonic_time_snsec = tk->tkr_mono.xtime_nsec + ((u64)tk->wall_to_monotonic.tv_nsec << tk->tkr_mono.shift); + vdata->tai_time_sec = tk->xtime_sec + + tk->tai_offset; while (vdata->monotonic_time_snsec >= (((u64)NSEC_PER_SEC) << tk->tkr_mono.shift)) { vdata->monotonic_time_snsec -= diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h index fb856c9f0449..adc9f7b20b9c 100644 --- a/arch/x86/include/asm/vgtod.h +++ b/arch/x86/include/asm/vgtod.h @@ -32,6 +32,7 @@ struct vsyscall_gtod_data { gtod_long_t wall_time_coarse_nsec; gtod_long_t monotonic_time_coarse_sec; gtod_long_t monotonic_time_coarse_nsec; + gtod_long_t tai_time_sec; int tz_minuteswest; int tz_dsttime; ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH v3] x86/vdso: Handle clock_gettime(CLOCK_TAI) in vDSO 2018-09-01 1:59 [PATCH v3] x86/vdso: Handle clock_gettime(CLOCK_TAI) in vDSO Matt Rickard @ 2018-09-01 3:39 ` Andy Lutomirski 2018-09-01 9:33 ` Florian Weimer 2018-09-09 20:05 ` Thomas Gleixner 0 siblings, 2 replies; 16+ messages in thread From: Andy Lutomirski @ 2018-09-01 3:39 UTC (permalink / raw) To: Matt Rickard, Florian Weimer Cc: Thomas Gleixner, Andy Lutomirski, Stephen Boyd, John Stultz, X86 ML, LKML [-- Attachment #1: Type: text/plain, Size: 1580 bytes --] (Hi, Florian!) On Fri, Aug 31, 2018 at 6:59 PM, Matt Rickard <matt@softrans.com.au> wrote: > Process clock_gettime(CLOCK_TAI) in vDSO. > This makes the call about as fast as CLOCK_REALTIME and CLOCK_MONOTONIC: > > nanoseconds > before after clockname > ---- ----- --------- > 233 87 CLOCK_TAI > 96 93 CLOCK_REALTIME > 88 87 CLOCK_MONOTONIC Are you sure you did this right? With the clocksource set to TSC (which is the only reasonable choice unless KVM has seriously cleaned up its act), with retpolines enabled, I get 24ns for CLOCK_MONOTONIC without your patch and 32ns with your patch. And there is indeed a retpoline in the disassembled output: e5: e8 07 00 00 00 callq f1 <__vdso_clock_gettime+0x31> ea: f3 90 pause ec: 0f ae e8 lfence ef: eb f9 jmp ea <__vdso_clock_gettime+0x2a> f1: 48 89 04 24 mov %rax,(%rsp) f5: c3 retq You're probably going to have to set -fno-jump-tables or do something clever like adding a whole array of (seconds, nsec) in gtod and indexing that array by the clock id. Meanwhile, I wrote the following trivial patch to add a __vdso_clock_gettime_monotonic export. It runs in 21ns, and I suspect that the speedup is even a bit bigger when cache-cold because it avoids some branches. What do you all think? Florian, do you think glibc would be willing to add some magic to turn clock_gettime(CLOCK_MONOTONIC, t) into __vdso_clock_gettime_monotonic(t) when CLOCK_MONOTONIC is a constant? [-- Attachment #2: vclock_gettime_monotonic.patch --] [-- Type: text/x-patch, Size: 1107 bytes --] diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c index 91ed1bb2a3bb..4f22e9cb97a5 100644 --- a/arch/x86/entry/vdso/vclock_gettime.c +++ b/arch/x86/entry/vdso/vclock_gettime.c @@ -319,6 +319,14 @@ notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts) int clock_gettime(clockid_t, struct timespec *) __attribute__((weak, alias("__vdso_clock_gettime"))); +notrace int __vdso_clock_gettime_monotonic(struct timespec *ts) +{ + if (likely(do_monotonic(ts) != VCLOCK_NONE)) + return 0; + + return vdso_fallback_gettime(CLOCK_MONOTONIC, ts); +} + notrace int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz) { if (likely(tv != NULL)) { diff --git a/arch/x86/entry/vdso/vdso.lds.S b/arch/x86/entry/vdso/vdso.lds.S index d3a2dce4cfa9..28e23cbc02c9 100644 --- a/arch/x86/entry/vdso/vdso.lds.S +++ b/arch/x86/entry/vdso/vdso.lds.S @@ -15,6 +15,11 @@ * This controls what userland symbols we export from the vDSO. */ VERSION { + LINUX_4.19 { + global: + __vdso_clock_gettime_monotonic; + }; + LINUX_2.6 { global: clock_gettime; ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH v3] x86/vdso: Handle clock_gettime(CLOCK_TAI) in vDSO 2018-09-01 3:39 ` Andy Lutomirski @ 2018-09-01 9:33 ` Florian Weimer 2018-09-01 17:37 ` Andy Lutomirski 2018-09-09 20:05 ` Thomas Gleixner 1 sibling, 1 reply; 16+ messages in thread From: Florian Weimer @ 2018-09-01 9:33 UTC (permalink / raw) To: Andy Lutomirski, Matt Rickard Cc: Thomas Gleixner, Stephen Boyd, John Stultz, X86 ML, LKML On 09/01/2018 05:39 AM, Andy Lutomirski wrote: > Florian, do you think > glibc would be willing to add some magic to turn > clock_gettime(CLOCK_MONOTONIC, t) into > __vdso_clock_gettime_monotonic(t) when CLOCK_MONOTONIC is a constant? What's the goal here? Turn the indirect call/conditional jump/indirect call sequence into a single indirect call, purely for performance reasons? Thanks, Florian ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3] x86/vdso: Handle clock_gettime(CLOCK_TAI) in vDSO 2018-09-01 9:33 ` Florian Weimer @ 2018-09-01 17:37 ` Andy Lutomirski 0 siblings, 0 replies; 16+ messages in thread From: Andy Lutomirski @ 2018-09-01 17:37 UTC (permalink / raw) To: Florian Weimer Cc: Andy Lutomirski, Matt Rickard, Thomas Gleixner, Stephen Boyd, John Stultz, X86 ML, LKML On Sat, Sep 1, 2018 at 2:33 AM, Florian Weimer <fweimer@redhat.com> wrote: > On 09/01/2018 05:39 AM, Andy Lutomirski wrote: >> >> Florian, do you think >> glibc would be willing to add some magic to turn >> clock_gettime(CLOCK_MONOTONIC, t) into >> __vdso_clock_gettime_monotonic(t) when CLOCK_MONOTONIC is a constant? > > > What's the goal here? Turn the indirect call/conditional jump/indirect call > sequence into a single indirect call, purely for performance reasons? Almost. It's to bypass some of the branches in __vdso_clock_gettime(), which is supposed to be very fast. AFAIK most user code that uses clock_gettime() passes a constant for the first argument, and we can squeeze out some performance by optimizing that case. The indirect branches internal to the vDSO are a separate issue and should be solved separately. (It's really too bad that x86 doesn't have a 64-bit call instruction. If it did, then the PLT could get rewritten at dynamic link time to avoid indirect calls entirely, and presumably glibc could use the same technique to call into the vDSO without indirect calls.) ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3] x86/vdso: Handle clock_gettime(CLOCK_TAI) in vDSO 2018-09-01 3:39 ` Andy Lutomirski 2018-09-01 9:33 ` Florian Weimer @ 2018-09-09 20:05 ` Thomas Gleixner 2018-09-10 10:38 ` Thomas Gleixner 2018-09-12 13:01 ` Florian Weimer 1 sibling, 2 replies; 16+ messages in thread From: Thomas Gleixner @ 2018-09-09 20:05 UTC (permalink / raw) To: Andy Lutomirski Cc: Matt Rickard, Florian Weimer, Stephen Boyd, John Stultz, X86 ML, LKML On Fri, 31 Aug 2018, Andy Lutomirski wrote: > (Hi, Florian!) > > On Fri, Aug 31, 2018 at 6:59 PM, Matt Rickard <matt@softrans.com.au> wrote: > > Process clock_gettime(CLOCK_TAI) in vDSO. > > This makes the call about as fast as CLOCK_REALTIME and CLOCK_MONOTONIC: > > > > nanoseconds > > before after clockname > > ---- ----- --------- > > 233 87 CLOCK_TAI > > 96 93 CLOCK_REALTIME > > 88 87 CLOCK_MONOTONIC > > Are you sure you did this right? With the clocksource set to TSC > (which is the only reasonable choice unless KVM has seriously cleaned > up its act), with retpolines enabled, I get 24ns for CLOCK_MONOTONIC > without your patch and 32ns with your patch. And there is indeed a > retpoline in the disassembled output: > > e5: e8 07 00 00 00 callq f1 <__vdso_clock_gettime+0x31> > ea: f3 90 pause > ec: 0f ae e8 lfence > ef: eb f9 jmp ea <__vdso_clock_gettime+0x2a> > f1: 48 89 04 24 mov %rax,(%rsp) > f5: c3 retq > > You're probably going to have to set -fno-jump-tables or do something > clever like adding a whole array of (seconds, nsec) in gtod and > indexing that array by the clock id. See the patch below. It's integrating TAI without slowing down everything and it definitely does not result in indirect calls. On a HSW it slows down clock_gettime() by ~0.5ns. On a SKL I get a speedup by ~0.5ns. On a AMD Epyc server it's 1.2ns speedup. So it somehow depends on the uarch and I also observed compiler version dependend variations. Thanks, tglx ---- --- a/arch/x86/entry/vdso/vclock_gettime.c +++ b/arch/x86/entry/vdso/vclock_gettime.c @@ -203,39 +203,18 @@ notrace static inline u64 vgetsns(int *m return v * gtod->mult; } -/* Code size doesn't matter (vdso is 4k anyway) and this is faster. */ -notrace static int __always_inline do_realtime(struct timespec *ts) +notrace static int __always_inline do_hres(struct timespec *ts, clockid_t clk) { - unsigned long seq; - u64 ns; + struct vgtod_ts *base = >od->basetime[clk & VGTOD_HRES_MASK]; + unsigned int seq; int mode; - - do { - seq = gtod_read_begin(gtod); - mode = gtod->vclock_mode; - ts->tv_sec = gtod->wall_time_sec; - ns = gtod->wall_time_snsec; - ns += vgetsns(&mode); - ns >>= gtod->shift; - } while (unlikely(gtod_read_retry(gtod, seq))); - - ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns); - ts->tv_nsec = ns; - - return mode; -} - -notrace static int __always_inline do_monotonic(struct timespec *ts) -{ - unsigned long seq; u64 ns; - int mode; do { seq = gtod_read_begin(gtod); mode = gtod->vclock_mode; - ts->tv_sec = gtod->monotonic_time_sec; - ns = gtod->monotonic_time_snsec; + ts->tv_sec = base->sec; + ns = base->nsec; ns += vgetsns(&mode); ns >>= gtod->shift; } while (unlikely(gtod_read_retry(gtod, seq))); @@ -246,58 +225,50 @@ notrace static int __always_inline do_mo return mode; } -notrace static void do_realtime_coarse(struct timespec *ts) +notrace static void do_coarse(struct timespec *ts, clockid_t clk) { + struct vgtod_ts *base = >od->basetime[clk]; unsigned long seq; - do { - seq = gtod_read_begin(gtod); - ts->tv_sec = gtod->wall_time_coarse_sec; - ts->tv_nsec = gtod->wall_time_coarse_nsec; - } while (unlikely(gtod_read_retry(gtod, seq))); -} -notrace static void do_monotonic_coarse(struct timespec *ts) -{ - unsigned long seq; do { seq = gtod_read_begin(gtod); - ts->tv_sec = gtod->monotonic_time_coarse_sec; - ts->tv_nsec = gtod->monotonic_time_coarse_nsec; + ts->tv_sec = base->sec; + ts->tv_nsec = base->nsec; } while (unlikely(gtod_read_retry(gtod, seq))); } notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts) { - switch (clock) { - case CLOCK_REALTIME: - if (do_realtime(ts) == VCLOCK_NONE) - goto fallback; - break; - case CLOCK_MONOTONIC: - if (do_monotonic(ts) == VCLOCK_NONE) - goto fallback; - break; - case CLOCK_REALTIME_COARSE: - do_realtime_coarse(ts); - break; - case CLOCK_MONOTONIC_COARSE: - do_monotonic_coarse(ts); - break; - default: - goto fallback; - } + unsigned int msk; - return 0; -fallback: + /* Sort out negative (CPU/FD) and invalid clocks */ + if (unlikely((unsigned int) clock >= MAX_CLOCKS)) + return vdso_fallback_gettime(clock, ts); + + /* + * Convert the clockid to a bitmask and use it to check which + * clocks are handled in the VDSO directly. + */ + msk = 1U << clock; + if (likely(msk & VGTOD_HRES)) { + if (do_hres(ts, clock) != VCLOCK_NONE) + return 0; + } else if (msk & VGTOD_COARSE) { + do_coarse(ts, clock); + return 0; + } return vdso_fallback_gettime(clock, ts); } + int clock_gettime(clockid_t, struct timespec *) __attribute__((weak, alias("__vdso_clock_gettime"))); notrace int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz) { if (likely(tv != NULL)) { - if (unlikely(do_realtime((struct timespec *)tv) == VCLOCK_NONE)) + struct timespec *ts = (struct timespec *) tv; + + if (unlikely(do_hres(ts, CLOCK_REALTIME) == VCLOCK_NONE)) return vdso_fallback_gtod(tv, tz); tv->tv_usec /= 1000; } @@ -318,7 +289,7 @@ int gettimeofday(struct timeval *, struc notrace time_t __vdso_time(time_t *t) { /* This is atomic on x86 so we don't need any locks. */ - time_t result = READ_ONCE(gtod->wall_time_sec); + time_t result = READ_ONCE(gtod->basetime[CLOCK_REALTIME].sec); if (t) *t = result; --- a/arch/x86/entry/vsyscall/vsyscall_gtod.c +++ b/arch/x86/entry/vsyscall/vsyscall_gtod.c @@ -31,6 +31,8 @@ void update_vsyscall(struct timekeeper * { int vclock_mode = tk->tkr_mono.clock->archdata.vclock_mode; struct vsyscall_gtod_data *vdata = &vsyscall_gtod_data; + struct vgtod_ts *base; + u64 nsec; /* Mark the new vclock used. */ BUILD_BUG_ON(VCLOCK_MAX >= 32); @@ -45,34 +47,37 @@ void update_vsyscall(struct timekeeper * vdata->mult = tk->tkr_mono.mult; vdata->shift = tk->tkr_mono.shift; - vdata->wall_time_sec = tk->xtime_sec; - vdata->wall_time_snsec = tk->tkr_mono.xtime_nsec; - - vdata->monotonic_time_sec = tk->xtime_sec - + tk->wall_to_monotonic.tv_sec; - vdata->monotonic_time_snsec = tk->tkr_mono.xtime_nsec - + ((u64)tk->wall_to_monotonic.tv_nsec - << tk->tkr_mono.shift); - while (vdata->monotonic_time_snsec >= - (((u64)NSEC_PER_SEC) << tk->tkr_mono.shift)) { - vdata->monotonic_time_snsec -= - ((u64)NSEC_PER_SEC) << tk->tkr_mono.shift; - vdata->monotonic_time_sec++; + base = &vdata->basetime[CLOCK_REALTIME]; + base->sec = (gtod_long_t)tk->xtime_sec; + base->nsec = tk->tkr_mono.xtime_nsec; + + base = &vdata->basetime[VGTOD_TAI]; + base->sec = (gtod_long_t)(tk->xtime_sec + (s64)tk->tai_offset); + base->nsec = tk->tkr_mono.xtime_nsec; + + base = &vdata->basetime[CLOCK_MONOTONIC]; + base->sec = (gtod_long_t)(tk->xtime_sec + tk->wall_to_monotonic.tv_sec); + nsec = tk->tkr_mono.xtime_nsec; + nsec += ((u64)tk->wall_to_monotonic.tv_nsec << tk->tkr_mono.shift); + while (nsec >= (((u64)NSEC_PER_SEC) << tk->tkr_mono.shift)) { + nsec -= ((u64)NSEC_PER_SEC) << tk->tkr_mono.shift; + base->sec++; } + base->nsec = nsec; - vdata->wall_time_coarse_sec = tk->xtime_sec; - vdata->wall_time_coarse_nsec = (long)(tk->tkr_mono.xtime_nsec >> - tk->tkr_mono.shift); - - vdata->monotonic_time_coarse_sec = - vdata->wall_time_coarse_sec + tk->wall_to_monotonic.tv_sec; - vdata->monotonic_time_coarse_nsec = - vdata->wall_time_coarse_nsec + tk->wall_to_monotonic.tv_nsec; - - while (vdata->monotonic_time_coarse_nsec >= NSEC_PER_SEC) { - vdata->monotonic_time_coarse_nsec -= NSEC_PER_SEC; - vdata->monotonic_time_coarse_sec++; + base = &vdata->basetime[CLOCK_REALTIME_COARSE]; + base->sec = (gtod_long_t)tk->xtime_sec; + base->nsec = tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift; + + base = &vdata->basetime[CLOCK_MONOTONIC_COARSE]; + base->sec = (gtod_long_t)(tk->xtime_sec + tk->wall_to_monotonic.tv_sec); + nsec = tk->tkr_mono.xtime_nsec >> tk->tkr_mono.shift; + nsec += tk->wall_to_monotonic.tv_nsec; + while (nsec >= NSEC_PER_SEC) { + nsec -= NSEC_PER_SEC; + base->sec++; } + base->nsec = nsec; gtod_write_end(vdata); } --- a/arch/x86/include/asm/vgtod.h +++ b/arch/x86/include/asm/vgtod.h @@ -5,33 +5,41 @@ #include <linux/compiler.h> #include <linux/clocksource.h> +#include <uapi/linux/time.h> + #ifdef BUILD_VDSO32_64 typedef u64 gtod_long_t; #else typedef unsigned long gtod_long_t; #endif + +struct vgtod_ts { + gtod_long_t sec; + u64 nsec; +}; + +#define VGTOD_BASES (CLOCK_MONOTONIC_COARSE + 1) +#define VGTOD_HRES (BIT(CLOCK_REALTIME) | BIT(CLOCK_MONOTONIC) | BIT(CLOCK_TAI)) +#define VGTOD_COARSE (BIT(CLOCK_REALTIME_COARSE) | BIT(CLOCK_MONOTONIC_COARSE)) + +/* Abuse CLOCK_THREAD_CPUTIME_ID for VGTOD CLOCK TAI */ +#define VGTOD_HRES_MASK 0x3 +#define VGTOD_TAI (CLOCK_TAI & VGTOD_HRES_MASK) + /* * vsyscall_gtod_data will be accessed by 32 and 64 bit code at the same time * so be carefull by modifying this structure. */ struct vsyscall_gtod_data { - unsigned seq; + unsigned int seq; + + int vclock_mode; + u64 cycle_last; + u64 mask; + u32 mult; + u32 shift; - int vclock_mode; - u64 cycle_last; - u64 mask; - u32 mult; - u32 shift; - - /* open coded 'struct timespec' */ - u64 wall_time_snsec; - gtod_long_t wall_time_sec; - gtod_long_t monotonic_time_sec; - u64 monotonic_time_snsec; - gtod_long_t wall_time_coarse_sec; - gtod_long_t wall_time_coarse_nsec; - gtod_long_t monotonic_time_coarse_sec; - gtod_long_t monotonic_time_coarse_nsec; + struct vgtod_ts basetime[VGTOD_BASES]; int tz_minuteswest; int tz_dsttime; @@ -44,9 +52,9 @@ static inline bool vclock_was_used(int v return READ_ONCE(vclocks_used) & (1 << vclock); } -static inline unsigned gtod_read_begin(const struct vsyscall_gtod_data *s) +static inline unsigned int gtod_read_begin(const struct vsyscall_gtod_data *s) { - unsigned ret; + unsigned int ret; repeat: ret = READ_ONCE(s->seq); ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3] x86/vdso: Handle clock_gettime(CLOCK_TAI) in vDSO 2018-09-09 20:05 ` Thomas Gleixner @ 2018-09-10 10:38 ` Thomas Gleixner 2018-09-12 13:01 ` Florian Weimer 1 sibling, 0 replies; 16+ messages in thread From: Thomas Gleixner @ 2018-09-10 10:38 UTC (permalink / raw) To: Andy Lutomirski Cc: Matt Rickard, Florian Weimer, Stephen Boyd, John Stultz, X86 ML, LKML On Sun, 9 Sep 2018, Thomas Gleixner wrote: > #ifdef BUILD_VDSO32_64 > typedef u64 gtod_long_t; > #else > typedef unsigned long gtod_long_t; > #endif > + > +struct vgtod_ts { > + gtod_long_t sec; and actually this wants to become u64 unconditionally as we need to provide the full seconds even on 32bit for the upcoming y2038 support. We still have to truncate it for the current 32bit interface, but the core code can be made ready now. Thanks, tglx ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3] x86/vdso: Handle clock_gettime(CLOCK_TAI) in vDSO 2018-09-09 20:05 ` Thomas Gleixner 2018-09-10 10:38 ` Thomas Gleixner @ 2018-09-12 13:01 ` Florian Weimer 2018-09-12 14:17 ` Thomas Gleixner 1 sibling, 1 reply; 16+ messages in thread From: Florian Weimer @ 2018-09-12 13:01 UTC (permalink / raw) To: Thomas Gleixner, Andy Lutomirski Cc: Matt Rickard, Stephen Boyd, John Stultz, X86 ML, LKML On 09/09/2018 10:05 PM, Thomas Gleixner wrote: > See the patch below. It's integrating TAI without slowing down everything > and it definitely does not result in indirect calls. > > On a HSW it slows down clock_gettime() by ~0.5ns. On a SKL I get a speedup > by ~0.5ns. On a AMD Epyc server it's 1.2ns speedup. So it somehow depends > on the uarch and I also observed compiler version dependend variations. Does this mean glibc can keep using a single vDSO entrypoint, the one we have today? Thanks, Florian ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3] x86/vdso: Handle clock_gettime(CLOCK_TAI) in vDSO 2018-09-12 13:01 ` Florian Weimer @ 2018-09-12 14:17 ` Thomas Gleixner 2018-09-12 14:20 ` Florian Weimer 0 siblings, 1 reply; 16+ messages in thread From: Thomas Gleixner @ 2018-09-12 14:17 UTC (permalink / raw) To: Florian Weimer Cc: Andy Lutomirski, Matt Rickard, Stephen Boyd, John Stultz, X86 ML, LKML On Wed, 12 Sep 2018, Florian Weimer wrote: > On 09/09/2018 10:05 PM, Thomas Gleixner wrote: > > See the patch below. It's integrating TAI without slowing down everything > > and it definitely does not result in indirect calls. > > > > On a HSW it slows down clock_gettime() by ~0.5ns. On a SKL I get a speedup > > by ~0.5ns. On a AMD Epyc server it's 1.2ns speedup. So it somehow depends > > on the uarch and I also observed compiler version dependend variations. > > Does this mean glibc can keep using a single vDSO entrypoint, the one we > have today? We have no intention to change that. But we surely could provide separate entry points as an extra to avoid a bunch of conditionals. Thanks, tglx ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3] x86/vdso: Handle clock_gettime(CLOCK_TAI) in vDSO 2018-09-12 14:17 ` Thomas Gleixner @ 2018-09-12 14:20 ` Florian Weimer 2018-09-12 14:29 ` Thomas Gleixner 0 siblings, 1 reply; 16+ messages in thread From: Florian Weimer @ 2018-09-12 14:20 UTC (permalink / raw) To: Thomas Gleixner Cc: Andy Lutomirski, Matt Rickard, Stephen Boyd, John Stultz, X86 ML, LKML On 09/12/2018 04:17 PM, Thomas Gleixner wrote: > On Wed, 12 Sep 2018, Florian Weimer wrote: >> On 09/09/2018 10:05 PM, Thomas Gleixner wrote: >>> See the patch below. It's integrating TAI without slowing down everything >>> and it definitely does not result in indirect calls. >>> >>> On a HSW it slows down clock_gettime() by ~0.5ns. On a SKL I get a speedup >>> by ~0.5ns. On a AMD Epyc server it's 1.2ns speedup. So it somehow depends >>> on the uarch and I also observed compiler version dependend variations. >> >> Does this mean glibc can keep using a single vDSO entrypoint, the one we >> have today? > > We have no intention to change that. Okay, I was wondering because Andy seemed to have proposed just that. > But we surely could provide separate entry points as an extra to avoid a > bunch of conditionals. We could adjust to that, but the benefit would be long-term because it's an ABI change for glibc, and they tend to take a long time to propagate. But I must say that clock_gettime is an odd place to start. I would have expected any of the type-polymorphic multiplexer interfaces (fcntl, ioctl, ptrace, futex) to be a more natural starting point. 8-) Thanks, Florian ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3] x86/vdso: Handle clock_gettime(CLOCK_TAI) in vDSO 2018-09-12 14:20 ` Florian Weimer @ 2018-09-12 14:29 ` Thomas Gleixner 2018-09-12 17:11 ` Andy Lutomirski 0 siblings, 1 reply; 16+ messages in thread From: Thomas Gleixner @ 2018-09-12 14:29 UTC (permalink / raw) To: Florian Weimer Cc: Andy Lutomirski, Matt Rickard, Stephen Boyd, John Stultz, X86 ML, LKML On Wed, 12 Sep 2018, Florian Weimer wrote: > On 09/12/2018 04:17 PM, Thomas Gleixner wrote: > > On Wed, 12 Sep 2018, Florian Weimer wrote: > > > Does this mean glibc can keep using a single vDSO entrypoint, the one we > > > have today? > > > > We have no intention to change that. > > Okay, I was wondering because Andy seemed to have proposed just that. > > > But we surely could provide separate entry points as an extra to avoid a > > bunch of conditionals. > > We could adjust to that, but the benefit would be long-term because it's an > ABI change for glibc, and they tend to take a long time to propagate. > > But I must say that clock_gettime is an odd place to start. I would have > expected any of the type-polymorphic multiplexer interfaces (fcntl, ioctl, > ptrace, futex) to be a more natural starting point. 8-) Well, the starting point of this was to provide clock_tai support in the vdso. clock_gettime() in the vdso vs. the real syscall is a factor of 10 in speed. clock_gettime() is a pretty hot function in some workloads. Andy then noticed that some conditionals could be avoided entirely by using a different entry point and offered one along with a 10% speedup. We don't have to go there, we can. The multiplexer interfaces need much more surgery and talking about futex, we'd need to sit down with quite some people and identify the things they actually care about before just splitting it up and keeping the existing overloaded trainwreck the same. Thanks, tglx ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3] x86/vdso: Handle clock_gettime(CLOCK_TAI) in vDSO 2018-09-12 14:29 ` Thomas Gleixner @ 2018-09-12 17:11 ` Andy Lutomirski 2018-09-13 8:07 ` Florian Weimer 0 siblings, 1 reply; 16+ messages in thread From: Andy Lutomirski @ 2018-09-12 17:11 UTC (permalink / raw) To: Thomas Gleixner Cc: Florian Weimer, Andy Lutomirski, Matt Rickard, Stephen Boyd, John Stultz, X86 ML, LKML > On Sep 12, 2018, at 7:29 AM, Thomas Gleixner <tglx@linutronix.de> wrote: > >> On Wed, 12 Sep 2018, Florian Weimer wrote: >>> On 09/12/2018 04:17 PM, Thomas Gleixner wrote: >>>> On Wed, 12 Sep 2018, Florian Weimer wrote: >>>> Does this mean glibc can keep using a single vDSO entrypoint, the one we >>>> have today? >>> >>> We have no intention to change that. >> >> Okay, I was wondering because Andy seemed to have proposed just that. >> >>> But we surely could provide separate entry points as an extra to avoid a >>> bunch of conditionals. >> >> We could adjust to that, but the benefit would be long-term because it's an >> ABI change for glibc, and they tend to take a long time to propagate. >> >> But I must say that clock_gettime is an odd place to start. I would have >> expected any of the type-polymorphic multiplexer interfaces (fcntl, ioctl, >> ptrace, futex) to be a more natural starting point. 8-) > > Well, the starting point of this was to provide clock_tai support in the > vdso. clock_gettime() in the vdso vs. the real syscall is a factor of 10 in > speed. clock_gettime() is a pretty hot function in some workloads. > > Andy then noticed that some conditionals could be avoided entirely by using > a different entry point and offered one along with a 10% speedup. We don't > have to go there, we can. > > The multiplexer interfaces need much more surgery and talking about futex, > we'd need to sit down with quite some people and identify the things they > actually care about before just splitting it up and keeping the existing > overloaded trainwreck the same. > There’s also the issue of how much the speedup matters. For futex, maybe a better interface saves 3ns, but a futex syscall is hundreds of ns. clock_gettime() is called at high frequency and can be ~25ns. Saving a few ns is a bigger deal. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3] x86/vdso: Handle clock_gettime(CLOCK_TAI) in vDSO 2018-09-12 17:11 ` Andy Lutomirski @ 2018-09-13 8:07 ` Florian Weimer 2018-09-13 15:22 ` Andy Lutomirski 0 siblings, 1 reply; 16+ messages in thread From: Florian Weimer @ 2018-09-13 8:07 UTC (permalink / raw) To: Andy Lutomirski, Thomas Gleixner Cc: Andy Lutomirski, Matt Rickard, Stephen Boyd, John Stultz, X86 ML, LKML On 09/12/2018 07:11 PM, Andy Lutomirski wrote: >> The multiplexer interfaces need much more surgery and talking about futex, >> we'd need to sit down with quite some people and identify the things they >> actually care about before just splitting it up and keeping the existing >> overloaded trainwreck the same. >> > There’s also the issue of how much the speedup matters. For futex, maybe a better interface saves 3ns, but a futex syscall is hundreds of ns. clock_gettime() is called at high frequency and can be ~25ns. Saving a few ns is a bigger deal. My concern is that the userspace system call wrappers currently do not know how many arguments the individual operations take and what types the arguments have (hence the “type-polymorphic” nature I mentioned). This could be a problem for on-stack argument passing (where you might read values beyond the end of the stack, and glibc avoids that most of the time by having enough cruft on the stack), and for architectures which pass pointers and integers in different registers (like some m68k ABIs do for the return value). Thanks, Florian ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3] x86/vdso: Handle clock_gettime(CLOCK_TAI) in vDSO 2018-09-13 8:07 ` Florian Weimer @ 2018-09-13 15:22 ` Andy Lutomirski 2018-09-13 19:07 ` Florian Weimer 0 siblings, 1 reply; 16+ messages in thread From: Andy Lutomirski @ 2018-09-13 15:22 UTC (permalink / raw) To: Florian Weimer Cc: Thomas Gleixner, Andy Lutomirski, Matt Rickard, Stephen Boyd, John Stultz, X86 ML, LKML > On Sep 13, 2018, at 1:07 AM, Florian Weimer <fweimer@redhat.com> wrote: > > On 09/12/2018 07:11 PM, Andy Lutomirski wrote: >>> The multiplexer interfaces need much more surgery and talking about futex, >>> we'd need to sit down with quite some people and identify the things they >>> actually care about before just splitting it up and keeping the existing >>> overloaded trainwreck the same. >>> >> There’s also the issue of how much the speedup matters. For futex, maybe a better interface saves 3ns, but a futex syscall is hundreds of ns. clock_gettime() is called at high frequency and can be ~25ns. Saving a few ns is a bigger deal. > > My concern is that the userspace system call wrappers currently do not know how many arguments the individual operations take and what types the arguments have (hence the “type-polymorphic” nature I mentioned). This could be a problem for on-stack argument passing (where you might read values beyond the end of the stack, and glibc avoids that most of the time by having enough cruft on the stack), and for architectures which pass pointers and integers in different registers (like some m68k ABIs do for the return value). > > Isn’t clock_gettime already special because of the vDSO entry point, though? ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3] x86/vdso: Handle clock_gettime(CLOCK_TAI) in vDSO 2018-09-13 15:22 ` Andy Lutomirski @ 2018-09-13 19:07 ` Florian Weimer 2018-09-13 19:35 ` Andy Lutomirski 0 siblings, 1 reply; 16+ messages in thread From: Florian Weimer @ 2018-09-13 19:07 UTC (permalink / raw) To: Andy Lutomirski Cc: Thomas Gleixner, Andy Lutomirski, Matt Rickard, Stephen Boyd, John Stultz, X86 ML, LKML On 09/13/2018 05:22 PM, Andy Lutomirski wrote: > > >> On Sep 13, 2018, at 1:07 AM, Florian Weimer <fweimer@redhat.com> wrote: >> >> On 09/12/2018 07:11 PM, Andy Lutomirski wrote: >>>> The multiplexer interfaces need much more surgery and talking about futex, >>>> we'd need to sit down with quite some people and identify the things they >>>> actually care about before just splitting it up and keeping the existing >>>> overloaded trainwreck the same. >>>> >>> There’s also the issue of how much the speedup matters. For futex, maybe a better interface saves 3ns, but a futex syscall is hundreds of ns. clock_gettime() is called at high frequency and can be ~25ns. Saving a few ns is a bigger deal. >> >> My concern is that the userspace system call wrappers currently do not know how many arguments the individual operations take and what types the arguments have (hence the “type-polymorphic” nature I mentioned). This could be a problem for on-stack argument passing (where you might read values beyond the end of the stack, and glibc avoids that most of the time by having enough cruft on the stack), and for architectures which pass pointers and integers in different registers (like some m68k ABIs do for the return value). > Isn’t clock_gettime already special because of the vDSO entry point, though? Somewhat special, yes, but not overly so, and not in the type-polymorphic sense. We can't give direct access of the vDSO implementation to applications because the kernel does not know about the userspace errno variable. We do that for time on x86_64, where applications call into the vDSO directly, bypassing glibc completely after binding. I suspect most Linux libcs that know about the vDSO at all have generic vsyscall support, just like they have generic support for plain system calls. Thanks, Florian ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3] x86/vdso: Handle clock_gettime(CLOCK_TAI) in vDSO 2018-09-13 19:07 ` Florian Weimer @ 2018-09-13 19:35 ` Andy Lutomirski 2018-09-13 19:53 ` Florian Weimer 0 siblings, 1 reply; 16+ messages in thread From: Andy Lutomirski @ 2018-09-13 19:35 UTC (permalink / raw) To: Florian Weimer Cc: Thomas Gleixner, Andy Lutomirski, Matt Rickard, Stephen Boyd, John Stultz, X86 ML, LKML > On Sep 13, 2018, at 12:07 PM, Florian Weimer <fweimer@redhat.com> wrote: > > On 09/13/2018 05:22 PM, Andy Lutomirski wrote: >>> On Sep 13, 2018, at 1:07 AM, Florian Weimer <fweimer@redhat.com> wrote: >>> >>> On 09/12/2018 07:11 PM, Andy Lutomirski wrote: >>>>> The multiplexer interfaces need much more surgery and talking about futex, >>>>> we'd need to sit down with quite some people and identify the things they >>>>> actually care about before just splitting it up and keeping the existing >>>>> overloaded trainwreck the same. >>>>> >>>> There’s also the issue of how much the speedup matters. For futex, maybe a better interface saves 3ns, but a futex syscall is hundreds of ns. clock_gettime() is called at high frequency and can be ~25ns. Saving a few ns is a bigger deal. >>> >>> My concern is that the userspace system call wrappers currently do not know how many arguments the individual operations take and what types the arguments have (hence the “type-polymorphic” nature I mentioned). This could be a problem for on-stack argument passing (where you might read values beyond the end of the stack, and glibc avoids that most of the time by having enough cruft on the stack), and for architectures which pass pointers and integers in different registers (like some m68k ABIs do for the return value). > >> Isn’t clock_gettime already special because of the vDSO entry point, though? > > Somewhat special, yes, but not overly so, and not in the type-polymorphic sense. We can't give direct access of the vDSO implementation to applications because the kernel does not know about the userspace errno variable. We do that for time on x86_64, where applications call into the vDSO directly, bypassing glibc completely after binding. If the vDSO adds special helpers for CLOCK_MONOTONIC and CLOCK_REALTIME, I think we can reasonably safely promise that they never fail. (seccomp can obviously break that promise if there’s no TSC, but I think that seccomp users who do that get to keep both pieces.) ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v3] x86/vdso: Handle clock_gettime(CLOCK_TAI) in vDSO 2018-09-13 19:35 ` Andy Lutomirski @ 2018-09-13 19:53 ` Florian Weimer 0 siblings, 0 replies; 16+ messages in thread From: Florian Weimer @ 2018-09-13 19:53 UTC (permalink / raw) To: Andy Lutomirski Cc: Thomas Gleixner, Andy Lutomirski, Matt Rickard, Stephen Boyd, John Stultz, X86 ML, LKML On 09/13/2018 09:35 PM, Andy Lutomirski wrote: >> Somewhat special, yes, but not overly so, and not in the type-polymorphic sense. We can't give direct access of the vDSO implementation to applications because the kernel does not know about the userspace errno variable. We do that for time on x86_64, where applications call into the vDSO directly, bypassing glibc completely after binding. > > If the vDSO adds special helpers for CLOCK_MONOTONIC and CLOCK_REALTIME, I think we can reasonably safely promise that they never fail. (seccomp can obviously break that promise if there’s no TSC, but I think that seccomp users who do that get to keep both pieces.) I agree, I thought about the same thing. We already do not return EFAULT for invalid pointers, for obvious reasons. And if the clock ID is fixed, the EINVAL error is impossible. That would shave off a few nanoseconds more if the calling convention is identical to what glibc exposes to applications. If the vDSO is not available or the symbol is missing, we can provide an implementation based on the current clock_gettime in glibc. Thanks, Florian ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2018-09-13 19:53 UTC | newest] Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2018-09-01 1:59 [PATCH v3] x86/vdso: Handle clock_gettime(CLOCK_TAI) in vDSO Matt Rickard 2018-09-01 3:39 ` Andy Lutomirski 2018-09-01 9:33 ` Florian Weimer 2018-09-01 17:37 ` Andy Lutomirski 2018-09-09 20:05 ` Thomas Gleixner 2018-09-10 10:38 ` Thomas Gleixner 2018-09-12 13:01 ` Florian Weimer 2018-09-12 14:17 ` Thomas Gleixner 2018-09-12 14:20 ` Florian Weimer 2018-09-12 14:29 ` Thomas Gleixner 2018-09-12 17:11 ` Andy Lutomirski 2018-09-13 8:07 ` Florian Weimer 2018-09-13 15:22 ` Andy Lutomirski 2018-09-13 19:07 ` Florian Weimer 2018-09-13 19:35 ` Andy Lutomirski 2018-09-13 19:53 ` Florian Weimer
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.