* [PATCH 0/2] arm64: Add support for CLOCK_MONOTONIC_RAW in clock_gettime() vDSO
@ 2016-05-09 12:36 Kevin Brodsky
  2016-05-09 12:37 ` [PATCH 1/2] arm64: Refactor vDSO time functions Kevin Brodsky
  2016-05-09 12:37 ` [PATCH 2/2] arm64: Add support for CLOCK_MONOTONIC_RAW in clock_gettime() vDSO Kevin Brodsky
  0 siblings, 2 replies; 11+ messages in thread
From: Kevin Brodsky @ 2016-05-09 12:36 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

This small series adds support for CLOCK_MONOTONIC_RAW to
clock_gettime() (and clock_getres()) in the arm64 vDSO. This should make
calls to clock_gettime() with CLOCK_MONOTONIC_RAW just as fast as with
CLOCK_MONOTONIC and the other vDSO-accelerated clocks, instead of
falling back to the significantly slower syscall.

Given that gettimeofday.S is already quite difficult to understand, and
directly adding MONOTONIC_RAW support would make it even more so, the
first patch refactors it to make it simpler, saving a few instructions
along the way. The second patch then adds support for MONOTONIC_RAW,
which the refactoring makes quite easy.
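
For example, with the series applied, the following snippet
(illustrative, not part of the patches) is serviced entirely in
userspace whenever the "arch_sys_counter" clocksource is in use:

#include <stdio.h>
#include <time.h>

int main(void)
{
	struct timespec ts;

	/* Serviced by the vDSO after this series; previously a syscall. */
	if (clock_gettime(CLOCK_MONOTONIC_RAW, &ts))
		return 1;

	printf("%lld.%09ld\n", (long long)ts.tv_sec, ts.tv_nsec);
	return 0;
}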

Thanks,
Kevin

Kevin Brodsky (2):
  arm64: Refactor vDSO time functions
  arm64: Add support for CLOCK_MONOTONIC_RAW in clock_gettime() vDSO

 arch/arm64/include/asm/vdso_datapage.h |   8 +-
 arch/arm64/kernel/asm-offsets.c        |   6 +-
 arch/arm64/kernel/vdso.c               |   8 +-
 arch/arm64/kernel/vdso/gettimeofday.S  | 317 ++++++++++++++++++++-------------
 4 files changed, 215 insertions(+), 124 deletions(-)

-- 
2.8.0


* [PATCH 1/2] arm64: Refactor vDSO time functions
  2016-05-09 12:36 [PATCH 0/2] arm64: Add support for CLOCK_MONOTONIC_RAW in clock_gettime() vDSO Kevin Brodsky
@ 2016-05-09 12:37 ` Kevin Brodsky
  2016-06-22 13:24   ` Christopher Covington
                     ` (2 more replies)
  2016-05-09 12:37 ` [PATCH 2/2] arm64: Add support for CLOCK_MONOTONIC_RAW in clock_gettime() vDSO Kevin Brodsky
  1 sibling, 3 replies; 11+ messages in thread
From: Kevin Brodsky @ 2016-05-09 12:37 UTC (permalink / raw)
  To: linux-arm-kernel

Time functions are directly implemented in assembly in arm64, and it
is desirable to keep it this way for performance reasons (everything
fits in registers, so that the stack is not used at all). However, the
current implementation is quite difficult to read and understand (even
considering it's assembly).  Additionally, due to the structure of
__kernel_clock_gettime, which heavily uses conditional branches to
share code between the different clocks, it is difficult to support a
new clock without making the branches even harder to follow.

This commit completely refactors the structure of clock_gettime (and
gettimeofday along the way) while keeping exactly the same algorithms.
We no longer try to share code; instead, macros provide common
operations. This new approach comes with a number of advantages:
- In clock_gettime, clock implementations are no longer interspersed,
  making them much more readable. Additionally, macros only use
  registers passed as arguments or reserved with .req; this way it is
  easy to make sure that registers are properly allocated. To avoid a
  large number of branches in a given execution path, a jump table is
  used; a normal execution uses 3 unconditional branches.
- __do_get_tspec has been replaced with 2 macros (get_ts_realtime,
  get_clock_shifted_nsec) and explicit loading of data from the vDSO
  page. Consequently, clock_gettime and gettimeofday are now leaf
  functions, and saving x30 (lr) is no longer necessary.
- Variables protected by tb_seq_count are now loaded all at once,
  allowing the seqcnt_read macro to be merged into seqcnt_check.
- For CLOCK_REALTIME_COARSE, removed an unused load of the
  wall-to-monotonic timespec.
- For CLOCK_MONOTONIC_COARSE, removed a few shift instructions.

Obviously, the downside of sharing less code is an increase in code
size. However, since the vDSO has its own code page, this does not
really matter, as long as the size of the DSO remains below 4 kB. For
now this should be all right:
                    Before  After
  vdso.so size (B)  2776    2936
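
For reference, the dispatch logic is roughly equivalent to the C sketch
below (illustrative only: the do_* helper names are hypothetical and
the real implementation stays in assembly):

#include <time.h>

/* Hypothetical helpers standing in for the per-clock asm paths. */
extern int do_realtime(struct timespec *ts);
extern int do_monotonic(struct timespec *ts);
extern int do_realtime_coarse(struct timespec *ts);
extern int do_monotonic_coarse(struct timespec *ts);
extern int gettime_fallback(clockid_t id, struct timespec *ts);

int __kernel_clock_gettime(clockid_t id, struct timespec *ts)
{
	/* b.hi against JUMPSLOT_MAX, i.e. an unsigned comparison */
	if ((unsigned int)id > CLOCK_MONOTONIC_COARSE)
		return gettime_fallback(id, ts);

	switch (id) {	/* the jump table: one unconditional branch each */
	case CLOCK_REALTIME:		return do_realtime(ts);
	case CLOCK_MONOTONIC:		return do_monotonic(ts);
	case CLOCK_REALTIME_COARSE:	return do_realtime_coarse(ts);
	case CLOCK_MONOTONIC_COARSE:	return do_monotonic_coarse(ts);
	default:			return gettime_fallback(id, ts);
	}
}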

Cc: Will Deacon <will.deacon@arm.com>
Cc: Dave Martin <dave.martin@arm.com>
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 arch/arm64/kernel/vdso/gettimeofday.S | 282 +++++++++++++++++++---------------
 1 file changed, 162 insertions(+), 120 deletions(-)

diff --git a/arch/arm64/kernel/vdso/gettimeofday.S b/arch/arm64/kernel/vdso/gettimeofday.S
index efa79e8d4196..caff9dd6ba78 100644
--- a/arch/arm64/kernel/vdso/gettimeofday.S
+++ b/arch/arm64/kernel/vdso/gettimeofday.S
@@ -26,24 +26,91 @@
 #define NSEC_PER_SEC_HI16	0x3b9a
 
 vdso_data	.req	x6
-use_syscall	.req	w7
-seqcnt		.req	w8
+seqcnt		.req	w7
+w_tmp		.req	w8
+x_tmp		.req	x8
+
+/*
+ * Conventions for macro arguments:
+ * - An argument is write-only if its name starts with "res".
+ * - All other arguments are read-only, unless otherwise specified.
+ */
 
 	.macro	seqcnt_acquire
 9999:	ldr	seqcnt, [vdso_data, #VDSO_TB_SEQ_COUNT]
 	tbnz	seqcnt, #0, 9999b
 	dmb	ishld
-	ldr	use_syscall, [vdso_data, #VDSO_USE_SYSCALL]
 	.endm
 
-	.macro	seqcnt_read, cnt
+	.macro	seqcnt_check fail
 	dmb	ishld
-	ldr	\cnt, [vdso_data, #VDSO_TB_SEQ_COUNT]
+	ldr	w_tmp, [vdso_data, #VDSO_TB_SEQ_COUNT]
+	cmp	w_tmp, seqcnt
+	b.ne	\fail
 	.endm
 
-	.macro	seqcnt_check, cnt, fail
-	cmp	\cnt, seqcnt
-	b.ne	\fail
+	.macro	syscall_check fail
+	ldr	w_tmp, [vdso_data, #VDSO_USE_SYSCALL]
+	cbnz	w_tmp, \fail
+	.endm
+
+	.macro get_nsec_per_sec res
+	mov	\res, #NSEC_PER_SEC_LO16
+	movk	\res, #NSEC_PER_SEC_HI16, lsl #16
+	.endm
+
+	/*
+	 * Returns the clock delta, in nanoseconds left-shifted by the clock
+	 * shift.
+	 */
+	.macro	get_clock_shifted_nsec res, cycle_last, mult
+	/* Read the virtual counter. */
+	isb
+	mrs	x_tmp, cntvct_el0
+	/* Calculate cycle delta and convert to ns. */
+	sub	\res, x_tmp, \cycle_last
+	/* We can only guarantee 56 bits of precision. */
+	movn	x_tmp, #0xff00, lsl #48
+	and	\res, x_tmp, \res
+	mul	\res, \res, \mult
+	.endm
+
+	/*
+	 * Returns in res_{sec,nsec} the REALTIME timespec, based on the
+	 * "wall time" (xtime) and the clock_mono delta.
+	 */
+	.macro	get_ts_realtime res_sec, res_nsec, \
+			clock_nsec, xtime_sec, xtime_nsec, nsec_to_sec
+	add	\res_nsec, \clock_nsec, \xtime_nsec
+	udiv	x_tmp, \res_nsec, \nsec_to_sec
+	add	\res_sec, \xtime_sec, x_tmp
+	msub	\res_nsec, x_tmp, \nsec_to_sec, \res_nsec
+	.endm
+
+	/* sec and nsec are modified in place. */
+	.macro add_ts sec, nsec, ts_sec, ts_nsec, nsec_to_sec
+	/* Add timespec. */
+	add	\sec, \sec, \ts_sec
+	add	\nsec, \nsec, \ts_nsec
+
+	/* Normalise the new timespec. */
+	cmp	\nsec, \nsec_to_sec
+	b.lt	9999f
+	sub	\nsec, \nsec, \nsec_to_sec
+	add	\sec, \sec, #1
+9999:
+	cmp	\nsec, #0
+	b.ge	9998f
+	add	\nsec, \nsec, \nsec_to_sec
+	sub	\sec, \sec, #1
+9998:
+	.endm
+
+	.macro jump_slot jumptable, index, label
+	.if (. - \jumptable) != 4 * (\index)
+		.error "Jump slot index mismatch"
+	.endif
+	b	\label
 	.endm
 
 	.text
@@ -51,18 +118,24 @@ seqcnt		.req	w8
 /* int __kernel_gettimeofday(struct timeval *tv, struct timezone *tz); */
 ENTRY(__kernel_gettimeofday)
 	.cfi_startproc
-	mov	x2, x30
-	.cfi_register x30, x2
-
-	/* Acquire the sequence counter and get the timespec. */
 	adr	vdso_data, _vdso_data
-1:	seqcnt_acquire
-	cbnz	use_syscall, 4f
-
 	/* If tv is NULL, skip to the timezone code. */
 	cbz	x0, 2f
-	bl	__do_get_tspec
-	seqcnt_check w9, 1b
+
+	/* Compute the time of day. */
+1:	seqcnt_acquire
+	syscall_check fail=4f
+	ldr	x10, [vdso_data, #VDSO_CS_CYCLE_LAST]
+	ldp	w11, w12, [vdso_data, #VDSO_CS_MULT]
+	ldp	x13, x14, [vdso_data, #VDSO_XTIME_CLK_SEC]
+	seqcnt_check fail=1b
+
+	get_nsec_per_sec res=x9
+	lsl	x9, x9, x12
+
+	get_clock_shifted_nsec res=x15, cycle_last=x10, mult=x11
+	get_ts_realtime res_sec=x10, res_nsec=x11, \
+		clock_nsec=x15, xtime_sec=x13, xtime_nsec=x14, nsec_to_sec=x9
 
 	/* Convert ns to us. */
 	mov	x13, #1000
@@ -76,95 +149,107 @@ ENTRY(__kernel_gettimeofday)
 	stp	w4, w5, [x1, #TZ_MINWEST]
 3:
 	mov	x0, xzr
-	ret	x2
+	ret
 4:
 	/* Syscall fallback. */
 	mov	x8, #__NR_gettimeofday
 	svc	#0
-	ret	x2
+	ret
 	.cfi_endproc
 ENDPROC(__kernel_gettimeofday)
 
+#define JUMPSLOT_MAX CLOCK_MONOTONIC_COARSE
+
 /* int __kernel_clock_gettime(clockid_t clock_id, struct timespec *tp); */
 ENTRY(__kernel_clock_gettime)
 	.cfi_startproc
-	cmp	w0, #CLOCK_REALTIME
-	ccmp	w0, #CLOCK_MONOTONIC, #0x4, ne
-	b.ne	2f
+	cmp 	w0, #JUMPSLOT_MAX
+	b.hi 	syscall
+	adr	vdso_data, _vdso_data
+	adr 	x_tmp, jumptable
+	add 	x_tmp, x_tmp, w0, uxtw #2
+	br 	x_tmp
+
+jumptable:
+	jump_slot jumptable, CLOCK_REALTIME, realtime
+	jump_slot jumptable, CLOCK_MONOTONIC, monotonic
+	b 	syscall
+	b 	syscall
+	b 	syscall
+	jump_slot jumptable, CLOCK_REALTIME_COARSE, realtime_coarse
+	jump_slot jumptable, CLOCK_MONOTONIC_COARSE, monotonic_coarse
+
+	.if (. - jumptable) != 4 * (JUMPSLOT_MAX + 1)
+	.error	"Wrong jumptable size"
+	.endif
+
+realtime:
+	seqcnt_acquire
+	syscall_check fail=syscall
+	ldr	x10, [vdso_data, #VDSO_CS_CYCLE_LAST]
+	ldp	w11, w12, [vdso_data, #VDSO_CS_MULT]
+	ldp	x13, x14, [vdso_data, #VDSO_XTIME_CLK_SEC]
+	seqcnt_check fail=realtime
 
-	mov	x2, x30
-	.cfi_register x30, x2
+	/* All computations are done with left-shifted nsecs. */
+	get_nsec_per_sec res=x9
+	lsl	x9, x9, x12
 
-	/* Get kernel timespec. */
-	adr	vdso_data, _vdso_data
-1:	seqcnt_acquire
-	cbnz	use_syscall, 7f
+	get_clock_shifted_nsec res=x15, cycle_last=x10, mult=x11
+	get_ts_realtime res_sec=x10, res_nsec=x11, \
+		clock_nsec=x15, xtime_sec=x13, xtime_nsec=x14, nsec_to_sec=x9
 
-	bl	__do_get_tspec
-	seqcnt_check w9, 1b
+	b shift_store
 
-	mov	x30, x2
+monotonic:
+	seqcnt_acquire
+	syscall_check fail=syscall
+	ldr	x10, [vdso_data, #VDSO_CS_CYCLE_LAST]
+	ldp	w11, w12, [vdso_data, #VDSO_CS_MULT]
+	ldp	x13, x14, [vdso_data, #VDSO_XTIME_CLK_SEC]
+	ldp	x3, x4, [vdso_data, #VDSO_WTM_CLK_SEC]
+	seqcnt_check fail=monotonic
 
-	cmp	w0, #CLOCK_MONOTONIC
-	b.ne	6f
+	/* All computations are done with left-shifted nsecs. */
+	lsl	x4, x4, x12
+	get_nsec_per_sec res=x9
+	lsl	x9, x9, x12
 
-	/* Get wtm timespec. */
-	ldp	x13, x14, [vdso_data, #VDSO_WTM_CLK_SEC]
+	get_clock_shifted_nsec res=x15, cycle_last=x10, mult=x11
+	get_ts_realtime res_sec=x10, res_nsec=x11, \
+		clock_nsec=x15, xtime_sec=x13, xtime_nsec=x14, nsec_to_sec=x9
 
-	/* Check the sequence counter. */
-	seqcnt_read w9
-	seqcnt_check w9, 1b
-	b	4f
-2:
-	cmp	w0, #CLOCK_REALTIME_COARSE
-	ccmp	w0, #CLOCK_MONOTONIC_COARSE, #0x4, ne
-	b.ne	8f
+	add_ts sec=x10, nsec=x11, ts_sec=x3, ts_nsec=x4, nsec_to_sec=x9
 
-	/* xtime_coarse_nsec is already right-shifted */
-	mov	x12, #0
+	b shift_store
 
-	/* Get coarse timespec. */
-	adr	vdso_data, _vdso_data
-3:	seqcnt_acquire
+realtime_coarse:
+	seqcnt_acquire
 	ldp	x10, x11, [vdso_data, #VDSO_XTIME_CRS_SEC]
+	seqcnt_check fail=realtime_coarse
 
-	/* Get wtm timespec. */
-	ldp	x13, x14, [vdso_data, #VDSO_WTM_CLK_SEC]
+	b store
 
-	/* Check the sequence counter. */
-	seqcnt_read w9
-	seqcnt_check w9, 3b
+monotonic_coarse:
+	seqcnt_acquire
+	ldp	x10, x11, [vdso_data, #VDSO_XTIME_CRS_SEC]
+	ldp	x13, x14, [vdso_data, #VDSO_WTM_CLK_SEC]
+	seqcnt_check fail=monotonic_coarse
 
-	cmp	w0, #CLOCK_MONOTONIC_COARSE
-	b.ne	6f
-4:
-	/* Add on wtm timespec. */
-	add	x10, x10, x13
-	lsl	x14, x14, x12
-	add	x11, x11, x14
+	/* Computations are done in (non-shifted) nsecs. */
+	get_nsec_per_sec res=x9
+	add_ts sec=x10, nsec=x11, ts_sec=x13, ts_nsec=x14, nsec_to_sec=x9
 
-	/* Normalise the new timespec. */
-	mov	x15, #NSEC_PER_SEC_LO16
-	movk	x15, #NSEC_PER_SEC_HI16, lsl #16
-	lsl	x15, x15, x12
-	cmp	x11, x15
-	b.lt	5f
-	sub	x11, x11, x15
-	add	x10, x10, #1
-5:
-	cmp	x11, #0
-	b.ge	6f
-	add	x11, x11, x15
-	sub	x10, x10, #1
+	b store
 
-6:	/* Store to the user timespec. */
+shift_store:
 	lsr	x11, x11, x12
+store:
 	stp	x10, x11, [x1, #TSPEC_TV_SEC]
 	mov	x0, xzr
 	ret
-7:
-	mov	x30, x2
-8:	/* Syscall fallback. */
+
+syscall: /* Syscall fallback. */
 	mov	x8, #__NR_clock_gettime
 	svc	#0
 	ret
@@ -203,46 +288,3 @@ ENTRY(__kernel_clock_getres)
 	.quad	CLOCK_COARSE_RES
 	.cfi_endproc
 ENDPROC(__kernel_clock_getres)
-
-/*
- * Read the current time from the architected counter.
- * Expects vdso_data to be initialised.
- * Clobbers the temporary registers (x9 - x15).
- * Returns:
- *  - w9		= vDSO sequence counter
- *  - (x10, x11)	= (ts->tv_sec, shifted ts->tv_nsec)
- *  - w12		= cs_shift
- */
-ENTRY(__do_get_tspec)
-	.cfi_startproc
-
-	/* Read from the vDSO data page. */
-	ldr	x10, [vdso_data, #VDSO_CS_CYCLE_LAST]
-	ldp	x13, x14, [vdso_data, #VDSO_XTIME_CLK_SEC]
-	ldp	w11, w12, [vdso_data, #VDSO_CS_MULT]
-	seqcnt_read w9
-
-	/* Read the virtual counter. */
-	isb
-	mrs	x15, cntvct_el0
-
-	/* Calculate cycle delta and convert to ns. */
-	sub	x10, x15, x10
-	/* We can only guarantee 56 bits of precision. */
-	movn	x15, #0xff00, lsl #48
-	and	x10, x15, x10
-	mul	x10, x10, x11
-
-	/* Use the kernel time to calculate the new timespec. */
-	mov	x11, #NSEC_PER_SEC_LO16
-	movk	x11, #NSEC_PER_SEC_HI16, lsl #16
-	lsl	x11, x11, x12
-	add	x15, x10, x14
-	udiv	x14, x15, x11
-	add	x10, x13, x14
-	mul	x13, x14, x11
-	sub	x11, x15, x13
-
-	ret
-	.cfi_endproc
-ENDPROC(__do_get_tspec)
-- 
2.8.0


* [PATCH 2/2] arm64: Add support for CLOCK_MONOTONIC_RAW in clock_gettime() vDSO
  2016-05-09 12:36 [PATCH 0/2] arm64: Add support for CLOCK_MONOTONIC_RAW in clock_gettime() vDSO Kevin Brodsky
  2016-05-09 12:37 ` [PATCH 1/2] arm64: Refactor vDSO time functions Kevin Brodsky
@ 2016-05-09 12:37 ` Kevin Brodsky
  2016-07-01 13:48   ` Dave Martin
  1 sibling, 1 reply; 11+ messages in thread
From: Kevin Brodsky @ 2016-05-09 12:37 UTC (permalink / raw)
  To: linux-arm-kernel

So far, the arm64 clock_gettime() vDSO implementation has only
supported the following clocks, falling back to the syscall for all
others:
- CLOCK_REALTIME{,_COARSE}
- CLOCK_MONOTONIC{,_COARSE}

This patch adds support for the CLOCK_MONOTONIC_RAW clock, taking
advantage of the recent refactoring of the vDSO time functions. Like
the non-_COARSE clocks, this only works when the "arch_sys_counter"
clocksource is in use (allowing us to read the current time from the
virtual counter register), otherwise we also have to fall back to the
syscall.

Most of the data is shared with CLOCK_MONOTONIC, and the algorithm is
similar. The reference implementation in kernel/time/timekeeping.c
shows that:
- CLOCK_MONOTONIC = tk->wall_to_monotonic + tk->xtime_sec +
  timekeeping_get_ns(&tk->tkr_mono)
- CLOCK_MONOTONIC_RAW = tk->raw_time + timekeeping_get_ns(&tk->tkr_raw)
- tkr_mono and tkr_raw are identical (in particular, same
  clocksource), except these members:
  * mult (only mono's multiplier is NTP-adjusted)
  * xtime_nsec (always 0 for raw)

Therefore, tk->raw_time and tkr_raw->mult are now also stored in the
vDSO data page.
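
In C terms, and using the vdso_data fields above, the two computations
are roughly as follows (a sketch of the algorithm, not literal kernel
code; read_cntvct() stands for the isb + mrs cntvct_el0 sequence):

	u64 cycles, mono_ns, raw_ns;

	/* Counter delta, truncated to the guaranteed 56 bits */
	cycles = (read_cntvct() - vd->cs_cycle_last) & GENMASK_ULL(55, 0);

	/* CLOCK_MONOTONIC: NTP-adjusted mult, xtime base + wall-to-mono */
	mono_ns = (vd->xtime_clock_nsec + cycles * vd->cs_mono_mult)
			>> vd->cs_shift;
	/* ... then add xtime_clock_sec and the wtm_clock_{sec,nsec} offset */

	/* CLOCK_MONOTONIC_RAW: raw mult, raw_time base (xtime_nsec == 0) */
	raw_ns = (cycles * vd->cs_raw_mult) >> vd->cs_shift;
	/* ... then add the raw_time_{sec,nsec} base */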

Cc: Will Deacon <will.deacon@arm.com>
Cc: Dave Martin <dave.martin@arm.com>
Cc: Ali Saidi <ali.saidi@arm.com>
Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
---
 arch/arm64/include/asm/vdso_datapage.h |  8 +++--
 arch/arm64/kernel/asm-offsets.c        |  6 +++-
 arch/arm64/kernel/vdso.c               |  8 ++++-
 arch/arm64/kernel/vdso/gettimeofday.S  | 57 +++++++++++++++++++++++++++-------
 4 files changed, 64 insertions(+), 15 deletions(-)

diff --git a/arch/arm64/include/asm/vdso_datapage.h b/arch/arm64/include/asm/vdso_datapage.h
index de66199673d7..2b9a63771eda 100644
--- a/arch/arm64/include/asm/vdso_datapage.h
+++ b/arch/arm64/include/asm/vdso_datapage.h
@@ -22,6 +22,8 @@
 
 struct vdso_data {
 	__u64 cs_cycle_last;	/* Timebase at clocksource init */
+	__u64 raw_time_sec;	/* Raw time */
+	__u64 raw_time_nsec;
 	__u64 xtime_clock_sec;	/* Kernel time */
 	__u64 xtime_clock_nsec;
 	__u64 xtime_coarse_sec;	/* Coarse time */
@@ -29,8 +31,10 @@ struct vdso_data {
 	__u64 wtm_clock_sec;	/* Wall to monotonic time */
 	__u64 wtm_clock_nsec;
 	__u32 tb_seq_count;	/* Timebase sequence counter */
-	__u32 cs_mult;		/* Clocksource multiplier */
-	__u32 cs_shift;		/* Clocksource shift */
+	/* cs_* members must be adjacent and in this order (ldp accesses) */
+	__u32 cs_mono_mult;	/* NTP-adjusted clocksource multiplier */
+	__u32 cs_shift;		/* Clocksource shift (mono = raw) */
+	__u32 cs_raw_mult;	/* Raw clocksource multiplier */
 	__u32 tz_minuteswest;	/* Whacky timezone stuff */
 	__u32 tz_dsttime;
 	__u32 use_syscall;
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 3ae6b310ac9b..5ff88560e5ef 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -76,6 +76,7 @@ int main(void)
   BLANK();
   DEFINE(CLOCK_REALTIME,	CLOCK_REALTIME);
   DEFINE(CLOCK_MONOTONIC,	CLOCK_MONOTONIC);
+  DEFINE(CLOCK_MONOTONIC_RAW,	CLOCK_MONOTONIC_RAW);
   DEFINE(CLOCK_REALTIME_RES,	MONOTONIC_RES_NSEC);
   DEFINE(CLOCK_REALTIME_COARSE,	CLOCK_REALTIME_COARSE);
   DEFINE(CLOCK_MONOTONIC_COARSE,CLOCK_MONOTONIC_COARSE);
@@ -83,6 +84,8 @@ int main(void)
   DEFINE(NSEC_PER_SEC,		NSEC_PER_SEC);
   BLANK();
   DEFINE(VDSO_CS_CYCLE_LAST,	offsetof(struct vdso_data, cs_cycle_last));
+  DEFINE(VDSO_RAW_TIME_SEC,	offsetof(struct vdso_data, raw_time_sec));
+  DEFINE(VDSO_RAW_TIME_NSEC,	offsetof(struct vdso_data, raw_time_nsec));
   DEFINE(VDSO_XTIME_CLK_SEC,	offsetof(struct vdso_data, xtime_clock_sec));
   DEFINE(VDSO_XTIME_CLK_NSEC,	offsetof(struct vdso_data, xtime_clock_nsec));
   DEFINE(VDSO_XTIME_CRS_SEC,	offsetof(struct vdso_data, xtime_coarse_sec));
@@ -90,7 +93,8 @@ int main(void)
   DEFINE(VDSO_WTM_CLK_SEC,	offsetof(struct vdso_data, wtm_clock_sec));
   DEFINE(VDSO_WTM_CLK_NSEC,	offsetof(struct vdso_data, wtm_clock_nsec));
   DEFINE(VDSO_TB_SEQ_COUNT,	offsetof(struct vdso_data, tb_seq_count));
-  DEFINE(VDSO_CS_MULT,		offsetof(struct vdso_data, cs_mult));
+  DEFINE(VDSO_CS_MONO_MULT,	offsetof(struct vdso_data, cs_mono_mult));
+  DEFINE(VDSO_CS_RAW_MULT,	offsetof(struct vdso_data, cs_raw_mult));
   DEFINE(VDSO_CS_SHIFT,		offsetof(struct vdso_data, cs_shift));
   DEFINE(VDSO_TZ_MINWEST,	offsetof(struct vdso_data, tz_minuteswest));
   DEFINE(VDSO_TZ_DSTTIME,	offsetof(struct vdso_data, tz_dsttime));
diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
index 97bc68f4c689..54f7b327fd18 100644
--- a/arch/arm64/kernel/vdso.c
+++ b/arch/arm64/kernel/vdso.c
@@ -212,10 +212,16 @@ void update_vsyscall(struct timekeeper *tk)
 	vdso_data->wtm_clock_nsec		= tk->wall_to_monotonic.tv_nsec;
 
 	if (!use_syscall) {
+		/* tkr_mono.cycle_last == tkr_raw.cycle_last */
 		vdso_data->cs_cycle_last	= tk->tkr_mono.cycle_last;
+		vdso_data->raw_time_sec		= tk->raw_time.tv_sec;
+		vdso_data->raw_time_nsec	= tk->raw_time.tv_nsec;
 		vdso_data->xtime_clock_sec	= tk->xtime_sec;
 		vdso_data->xtime_clock_nsec	= tk->tkr_mono.xtime_nsec;
-		vdso_data->cs_mult		= tk->tkr_mono.mult;
+		/* tkr_raw.xtime_nsec == 0 */
+		vdso_data->cs_mono_mult		= tk->tkr_mono.mult;
+		vdso_data->cs_raw_mult		= tk->tkr_raw.mult;
+		/* tkr_mono.shift == tkr_raw.shift */
 		vdso_data->cs_shift		= tk->tkr_mono.shift;
 	}
 
diff --git a/arch/arm64/kernel/vdso/gettimeofday.S b/arch/arm64/kernel/vdso/gettimeofday.S
index caff9dd6ba78..f49b6755058a 100644
--- a/arch/arm64/kernel/vdso/gettimeofday.S
+++ b/arch/arm64/kernel/vdso/gettimeofday.S
@@ -87,6 +87,15 @@ x_tmp		.req	x8
 	msub	\res_nsec, x_tmp, \nsec_to_sec, \res_nsec
 	.endm
 
+	/*
+	 * Returns in res_{sec,nsec} the timespec based on the clock_raw delta,
+	 * used for CLOCK_MONOTONIC_RAW.
+	 */
+	.macro	get_ts_clock_raw res_sec, res_nsec, clock_nsec, nsec_to_sec
+	udiv	\res_sec, \clock_nsec, \nsec_to_sec
+	msub	\res_nsec, \res_sec, \nsec_to_sec, \clock_nsec
+	.endm
+
 	/* sec and nsec are modified in place. */
 	.macro add_ts sec, nsec, ts_sec, ts_nsec, nsec_to_sec
 	/* Add timespec. */
@@ -126,7 +135,8 @@ ENTRY(__kernel_gettimeofday)
 1:	seqcnt_acquire
 	syscall_check fail=4f
 	ldr	x10, [vdso_data, #VDSO_CS_CYCLE_LAST]
-	ldp	w11, w12, [vdso_data, #VDSO_CS_MULT]
+	/* w11 = cs_mono_mult, w12 = cs_shift */
+	ldp	w11, w12, [vdso_data, #VDSO_CS_MONO_MULT]
 	ldp	x13, x14, [vdso_data, #VDSO_XTIME_CLK_SEC]
 	seqcnt_check fail=1b
 
@@ -163,19 +173,19 @@ ENDPROC(__kernel_gettimeofday)
 /* int __kernel_clock_gettime(clockid_t clock_id, struct timespec *tp); */
 ENTRY(__kernel_clock_gettime)
 	.cfi_startproc
-	cmp 	w0, #JUMPSLOT_MAX
-	b.hi 	syscall
+	cmp	w0, #JUMPSLOT_MAX
+	b.hi	syscall
 	adr	vdso_data, _vdso_data
-	adr 	x_tmp, jumptable
-	add 	x_tmp, x_tmp, w0, uxtw #2
-	br 	x_tmp
+	adr	x_tmp, jumptable
+	add	x_tmp, x_tmp, w0, uxtw #2
+	br	x_tmp
 
 jumptable:
 	jump_slot jumptable, CLOCK_REALTIME, realtime
 	jump_slot jumptable, CLOCK_MONOTONIC, monotonic
-	b 	syscall
-	b 	syscall
-	b 	syscall
+	b	syscall
+	b	syscall
+	jump_slot jumptable, CLOCK_MONOTONIC_RAW, monotonic_raw
 	jump_slot jumptable, CLOCK_REALTIME_COARSE, realtime_coarse
 	jump_slot jumptable, CLOCK_MONOTONIC_COARSE, monotonic_coarse
 
@@ -187,7 +197,8 @@ realtime:
 	seqcnt_acquire
 	syscall_check fail=syscall
 	ldr	x10, [vdso_data, #VDSO_CS_CYCLE_LAST]
-	ldp	w11, w12, [vdso_data, #VDSO_CS_MULT]
+	/* w11 = cs_mono_mult, w12 = cs_shift */
+	ldp	w11, w12, [vdso_data, #VDSO_CS_MONO_MULT]
 	ldp	x13, x14, [vdso_data, #VDSO_XTIME_CLK_SEC]
 	seqcnt_check fail=realtime
 
@@ -205,7 +216,8 @@ monotonic:
 	seqcnt_acquire
 	syscall_check fail=syscall
 	ldr	x10, [vdso_data, #VDSO_CS_CYCLE_LAST]
-	ldp	w11, w12, [vdso_data, #VDSO_CS_MULT]
+	/* w11 = cs_mono_mult, w12 = cs_shift */
+	ldp	w11, w12, [vdso_data, #VDSO_CS_MONO_MULT]
 	ldp	x13, x14, [vdso_data, #VDSO_XTIME_CLK_SEC]
 	ldp	x3, x4, [vdso_data, #VDSO_WTM_CLK_SEC]
 	seqcnt_check fail=monotonic
@@ -223,6 +235,28 @@ monotonic:
 
 	b shift_store
 
+monotonic_raw:
+	seqcnt_acquire
+	syscall_check fail=syscall
+	ldr	x10, [vdso_data, #VDSO_CS_CYCLE_LAST]
+	/* w11 = cs_raw_mult, w12 = cs_shift */
+	ldp	w12, w11, [vdso_data, #VDSO_CS_SHIFT]
+	ldp	x13, x14, [vdso_data, #VDSO_RAW_TIME_SEC]
+	seqcnt_check fail=monotonic_raw
+
+	/* All computations are done with left-shifted nsecs. */
+	lsl	x14, x14, x12
+	get_nsec_per_sec res=x9
+	lsl	x9, x9, x12
+
+	get_clock_shifted_nsec res=x15, cycle_last=x10, mult=x11
+	get_ts_clock_raw res_sec=x10, res_nsec=x11, \
+		clock_nsec=x15, nsec_to_sec=x9
+
+	add_ts sec=x10, nsec=x11, ts_sec=x13, ts_nsec=x14, nsec_to_sec=x9
+
+	b shift_store
+
 realtime_coarse:
 	seqcnt_acquire
 	ldp	x10, x11, [vdso_data, #VDSO_XTIME_CRS_SEC]
@@ -261,6 +295,7 @@ ENTRY(__kernel_clock_getres)
 	.cfi_startproc
 	cmp	w0, #CLOCK_REALTIME
 	ccmp	w0, #CLOCK_MONOTONIC, #0x4, ne
+	ccmp	w0, #CLOCK_MONOTONIC_RAW, #0x4, ne
 	b.ne	1f
 
 	ldr	x2, 5f
-- 
2.8.0


* [PATCH 1/2] arm64: Refactor vDSO time functions
  2016-05-09 12:37 ` [PATCH 1/2] arm64: Refactor vDSO time functions Kevin Brodsky
@ 2016-06-22 13:24   ` Christopher Covington
  2016-07-01 13:46   ` Dave Martin
  2016-07-08 14:11   ` Will Deacon
  2 siblings, 0 replies; 11+ messages in thread
From: Christopher Covington @ 2016-06-22 13:24 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Kevin,

On 05/09/2016 08:37 AM, Kevin Brodsky wrote:
> Time functions are directly implemented in assembly in arm64, and it
> is desirable to keep it this way for performance reasons (everything
> fits in registers, so that the stack is not used at all). However, the

Why do you say that? My simple tests indicate passing -O1 to GCC (4.8)
is sufficient to make it only spill to the stack when necessary.
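
For instance, a direct C translation of the coarse path (a sketch of
the kind of code I mean, with a simplified vdso_data layout, not the
exact test case) compiles to a leaf function with no stack accesses:

#include <stdint.h>
#include <time.h>

struct vdso_data {
	volatile uint32_t tb_seq_count;
	uint64_t xtime_coarse_sec;
	uint64_t xtime_coarse_nsec;
	/* ... */
};

int do_realtime_coarse(const struct vdso_data *vd, struct timespec *ts)
{
	uint32_t seq;

	do {
		/* Barriers elided for brevity (dmb ishld in the asm). */
		while ((seq = vd->tb_seq_count) & 1)
			;
		ts->tv_sec  = vd->xtime_coarse_sec;
		ts->tv_nsec = vd->xtime_coarse_nsec;
	} while (vd->tb_seq_count != seq);

	return 0;
}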

Cheers,
Cov

-- 
Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


* [PATCH 1/2] arm64: Refactor vDSO time functions
  2016-05-09 12:37 ` [PATCH 1/2] arm64: Refactor vDSO time functions Kevin Brodsky
  2016-06-22 13:24   ` Christopher Covington
@ 2016-07-01 13:46   ` Dave Martin
  2016-07-04 17:12     ` Will Deacon
  2016-07-08 14:11   ` Will Deacon
  2 siblings, 1 reply; 11+ messages in thread
From: Dave Martin @ 2016-07-01 13:46 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, May 09, 2016 at 01:37:00PM +0100, Kevin Brodsky wrote:
> [...]

Reviewed-by: Dave Martin <Dave.Martin@arm.com>

I agree with Christopher that we shouldn't simply assume that code
should stay in asm just because it was asm to begin with, but the
refactoring seems reasonable here.

There's no hard limit on the size of the vDSO AFAIK, but in any case
the bloat here is slight and the total number of clocks we'll ever
support in the vDSO should be pretty small...

The code can always be ported to C later on if there's a compelling
reason, and if the compiler is shown to do a good job on it.

Cheers
---Dave


* [PATCH 2/2] arm64: Add support for CLOCK_MONOTONIC_RAW in clock_gettime() vDSO
  2016-05-09 12:37 ` [PATCH 2/2] arm64: Add support for CLOCK_MONOTONIC_RAW in clock_gettime() vDSO Kevin Brodsky
@ 2016-07-01 13:48   ` Dave Martin
  0 siblings, 0 replies; 11+ messages in thread
From: Dave Martin @ 2016-07-01 13:48 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, May 09, 2016 at 01:37:01PM +0100, Kevin Brodsky wrote:
> [...]

Reviewed-by: Dave Martin <Dave.Martin@arm.com>


* [PATCH 1/2] arm64: Refactor vDSO time functions
  2016-07-01 13:46   ` Dave Martin
@ 2016-07-04 17:12     ` Will Deacon
  0 siblings, 0 replies; 11+ messages in thread
From: Will Deacon @ 2016-07-04 17:12 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Jul 01, 2016 at 02:46:54PM +0100, Dave Martin wrote:
> On Mon, May 09, 2016 at 01:37:00PM +0100, Kevin Brodsky wrote:
> > [...]
> 
> Reviewed-by: Dave Martin <Dave.Martin@arm.com>
> 
> I agree with Christopher that we shouldn't simply assume that code
> should stay in asm just because it was asm to begin with, but the
> refactoring seems reasonable here.

FWIW, we did do some benchmarking on a variety of microarchitectures
comparing the existing asm code with a version written in C. Whilst the
asm code tended to be a small amount faster in most cases, there were
some CPUs which showed a significant benefit from keeping things as they
are.

> There's no hard limit on the size of the vDSO AFAIK, but in any case
> the bloat here is slight and the total number of clocks we'll ever
> support in the vDSO should be pretty small...
> 
> The code can always be ported to C later on if there's a compelling
> reason, and if the compiler is shown to do a good job on it.

One reason might be if we go down the route of offering a compat vdso,
but we'd also want to get to the bottom of any performance variations
as described above.

Will


* [PATCH 1/2] arm64: Refactor vDSO time functions
  2016-05-09 12:37 ` [PATCH 1/2] arm64: Refactor vDSO time functions Kevin Brodsky
  2016-06-22 13:24   ` Christopher Covington
  2016-07-01 13:46   ` Dave Martin
@ 2016-07-08 14:11   ` Will Deacon
  2016-07-11 17:31     ` Kevin Brodsky
  2 siblings, 1 reply; 11+ messages in thread
From: Will Deacon @ 2016-07-08 14:11 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Kevin,

On Mon, May 09, 2016 at 01:37:00PM +0100, Kevin Brodsky wrote:
> [...]

This patch is really hard to read, but after applying it the resulting
code looks really good. Thanks! Just a couple of minor comments inline.

>  ENTRY(__kernel_clock_gettime)
>  	.cfi_startproc
> -	cmp	w0, #CLOCK_REALTIME
> -	ccmp	w0, #CLOCK_MONOTONIC, #0x4, ne
> -	b.ne	2f
> +	cmp 	w0, #JUMPSLOT_MAX
> +	b.hi 	syscall
> +	adr	vdso_data, _vdso_data
> +	adr 	x_tmp, jumptable
> +	add 	x_tmp, x_tmp, w0, uxtw #2
> +	br 	x_tmp
> +
> +jumptable:
> +	jump_slot jumptable, CLOCK_REALTIME, realtime
> +	jump_slot jumptable, CLOCK_MONOTONIC, monotonic
> +	b 	syscall
> +	b 	syscall
> +	b 	syscall
> +	jump_slot jumptable, CLOCK_REALTIME_COARSE, realtime_coarse
> +	jump_slot jumptable, CLOCK_MONOTONIC_COARSE, monotonic_coarse

The branchiness of this code (into __kernel_clock_gettime, then not taking
the b.hi, the br followed by the b in the jumptable) is likely to hurt
most branch predictors. I found that aligning the jumptable and its
subsequent targets helped a bit here.

> +shift_store:
>  	lsr	x11, x11, x12
> +store:
>  	stp	x10, x11, [x1, #TSPEC_TV_SEC]
>  	mov	x0, xzr
>  	ret

I think it's simpler just to macroise this, which also avoids some of
the branches given that it ends in a ret anyway.

Fixup patch below. If you fold it in, then:

Acked-by: Will Deacon <will.deacon@arm.com>

for the series.

Will

--->8

diff --git a/arch/arm64/kernel/vdso/gettimeofday.S b/arch/arm64/kernel/vdso/gettimeofday.S
index f49b6755058a..85969cd2b2f7 100644
--- a/arch/arm64/kernel/vdso/gettimeofday.S
+++ b/arch/arm64/kernel/vdso/gettimeofday.S
@@ -115,6 +115,15 @@ x_tmp		.req	x8
 9998:
 	.endm
 
+	.macro clock_gettime_return, shift=0
+	.if \shift == 1
+	lsr	x11, x11, x12
+	.endif
+	stp	x10, x11, [x1, #TSPEC_TV_SEC]
+	mov	x0, xzr
+	ret
+	.endm
+
 	.macro jump_slot jumptable, index, label
 	.if (. - \jumptable) != 4 * (\index)
 		.error "Jump slot index mismatch"
@@ -180,6 +189,7 @@ ENTRY(__kernel_clock_gettime)
 	add	x_tmp, x_tmp, w0, uxtw #2
 	br	x_tmp
 
+	ALIGN
 jumptable:
 	jump_slot jumptable, CLOCK_REALTIME, realtime
 	jump_slot jumptable, CLOCK_MONOTONIC, monotonic
@@ -193,6 +203,7 @@ jumptable:
 	.error	"Wrong jumptable size"
 	.endif
 
+	ALIGN
 realtime:
 	seqcnt_acquire
 	syscall_check fail=syscall
@@ -209,9 +220,9 @@ realtime:
 	get_clock_shifted_nsec res=x15, cycle_last=x10, mult=x11
 	get_ts_realtime res_sec=x10, res_nsec=x11, \
 		clock_nsec=x15, xtime_sec=x13, xtime_nsec=x14, nsec_to_sec=x9
+	clock_gettime_return, shift=1
 
-	b shift_store
-
+	ALIGN
 monotonic:
 	seqcnt_acquire
 	syscall_check fail=syscall
@@ -232,9 +243,9 @@ monotonic:
 		clock_nsec=x15, xtime_sec=x13, xtime_nsec=x14, nsec_to_sec=x9
 
 	add_ts sec=x10, nsec=x11, ts_sec=x3, ts_nsec=x4, nsec_to_sec=x9
+	clock_gettime_return, shift=1
 
-	b shift_store
-
+	ALIGN
 monotonic_raw:
 	seqcnt_acquire
 	syscall_check fail=syscall
@@ -254,16 +265,16 @@ monotonic_raw:
 		clock_nsec=x15, nsec_to_sec=x9
 
 	add_ts sec=x10, nsec=x11, ts_sec=x13, ts_nsec=x14, nsec_to_sec=x9
+	clock_gettime_return, shift=1
 
-	b shift_store
-
+	ALIGN
 realtime_coarse:
 	seqcnt_acquire
 	ldp	x10, x11, [vdso_data, #VDSO_XTIME_CRS_SEC]
 	seqcnt_check fail=realtime_coarse
+	clock_gettime_return
 
-	b store
-
+	ALIGN
 monotonic_coarse:
 	seqcnt_acquire
 	ldp	x10, x11, [vdso_data, #VDSO_XTIME_CRS_SEC]
@@ -273,16 +284,9 @@ monotonic_coarse:
 	/* Computations are done in (non-shifted) nsecs. */
 	get_nsec_per_sec res=x9
 	add_ts sec=x10, nsec=x11, ts_sec=x13, ts_nsec=x14, nsec_to_sec=x9
+	clock_gettime_return
 
-	b store
-
-shift_store:
-	lsr	x11, x11, x12
-store:
-	stp	x10, x11, [x1, #TSPEC_TV_SEC]
-	mov	x0, xzr
-	ret
-
+	ALIGN
 syscall: /* Syscall fallback. */
 	mov	x8, #__NR_clock_gettime
 	svc	#0

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 1/2] arm64: Refactor vDSO time functions
  2016-07-08 14:11   ` Will Deacon
@ 2016-07-11 17:31     ` Kevin Brodsky
  2016-07-11 17:42       ` Will Deacon
  0 siblings, 1 reply; 11+ messages in thread
From: Kevin Brodsky @ 2016-07-11 17:31 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Will,


On 08/07/16 15:11, Will Deacon wrote:
> Hi Kevin,
>
> On Mon, May 09, 2016 at 01:37:00PM +0100, Kevin Brodsky wrote:
>> Time functions are directly implemented in assembly in arm64, and it
>> is desirable to keep it this way for performance reasons (everything
>> fits in registers, so that the stack is not used at all). However, the
>> current implementation is quite difficult to read and understand (even
>> considering it's assembly).  Additionally, due to the structure of
>> __kernel_clock_gettime, which heavily uses conditional branches to
>> share code between the different clocks, it is difficult to support a
>> new clock without making the branches even harder to follow.
>>
>> This commit completely refactors the structure of clock_gettime (and
>> gettimeofday along the way) while keeping exactly the same algorithms.
>> We no longer try to share code; instead, macros provide common
>> operations. This new approach comes with a number of advantages:
>> - In clock_gettime, clock implementations are no longer interspersed,
>>    making them much more readable. Additionally, macros only use
>>    registers passed as arguments or reserved with .req, this way it is
>>    easy to make sure that registers are properly allocated. To avoid a
>>    large number of branches in a given execution path, a jump table is
>>    used; a normal execution uses 3 unconditional branches.
>> - __do_get_tspec has been replaced with 2 macros (get_ts_clock_mono,
>>    get_clock_shifted_nsec) and explicit loading of data from the vDSO
>>    page. Consequently, clock_gettime and gettimeofday are now leaf
>>    functions, and saving x30 (lr) is no longer necessary.
>> - Variables protected by tb_seq_count are now loaded all at once,
>>    allowing to merge the seqcnt_read macro into seqcnt_check.
>> - For CLOCK_REALTIME_COARSE, removed an unused load of the wall to
>>    monotonic timespec.
>> - For CLOCK_MONOTONIC_COARSE, removed a few shift instructions.
>>
>> Obviously, the downside of sharing less code is an increase in code
>> size. However since the vDSO has its own code page, this does not
>> really matter, as long as the size of the DSO remains below 4 kB. For
>> now this should be all right:
>>                      Before  After
>>    vdso.so size (B)  2776    2936
>>
>> Cc: Will Deacon <will.deacon@arm.com>
>> Cc: Dave Martin <dave.martin@arm.com>
>> Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
>> ---
>>   arch/arm64/kernel/vdso/gettimeofday.S | 282 +++++++++++++++++++---------------
>>   1 file changed, 162 insertions(+), 120 deletions(-)
> This patch is really hard to read, but after applying it the resulting
> code looks really good. Thanks! Just a couple of minor comments inline.

Thanks for taking a look at this! Yes, the diff is almost unreadable; that's
bound to happen when modifying most of the file.

>
>>   ENTRY(__kernel_clock_gettime)
>>      .cfi_startproc
>> -    cmp     w0, #CLOCK_REALTIME
>> -    ccmp    w0, #CLOCK_MONOTONIC, #0x4, ne
>> -    b.ne    2f
>> +    cmp     w0, #JUMPSLOT_MAX
>> +    b.hi    syscall
>> +    adr     vdso_data, _vdso_data
>> +    adr     x_tmp, jumptable
>> +    add     x_tmp, x_tmp, w0, uxtw #2
>> +    br      x_tmp
>> +
>> +jumptable:
>> +    jump_slot jumptable, CLOCK_REALTIME, realtime
>> +    jump_slot jumptable, CLOCK_MONOTONIC, monotonic
>> +    b       syscall
>> +    b       syscall
>> +    b       syscall
>> +    jump_slot jumptable, CLOCK_REALTIME_COARSE, realtime_coarse
>> +    jump_slot jumptable, CLOCK_MONOTONIC_COARSE, monotonic_coarse
> The branchiness of this code (into __kernel_clock_gettime, then not taking
> the b.hi, the br followed by the b in the jumptable) is likely to hurt
> most branch predictors. I found that aligning the jumptable and its
> subsequent targets helped a bit here.

That sounds perfectly sensible; there are indeed a lot of branches (fortunately,
mostly unconditional).
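
For readers skimming the thread, the dispatch being discussed boils down to the
entry sequence quoted above; here is an annotated copy (the comments are added
for this summary and are not part of the patch):

	cmp	w0, #JUMPSLOT_MAX		// clock id beyond the table?
	b.hi	syscall				// -> fall back to the kernel
	adr	vdso_data, _vdso_data		// base of the vDSO data page
	adr	x_tmp, jumptable		// base of the jump table
	add	x_tmp, x_tmp, w0, uxtw #2	// + 4 * clock id (one 4-byte slot each)
	br	x_tmp				// lands on the b <clock-label> for this clock

Each jump_slot expands to a single branch instruction, and the .error check in
the macro guarantees that slot \index really sits at offset 4 * \index, so the
uxtw #2 scaling stays correct even if the table is reordered.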

>
>> +shift_store:
>>      lsr     x11, x11, x12
>> +store:
>>      stp     x10, x11, [x1, #TSPEC_TV_SEC]
>>      mov     x0, xzr
>>      ret
> I think it's simpler just to macroise this, which also avoids some of
> the branches given that it ends in a ret anyway.

Sounds good too; while we're macroising things, we might as well go all the way ;)

> Fixup patch below. If you fold it in, then:
>
> Acked-by: Will Deacon <will.deacon@arm.com>
>
> for the series.

I just have one small concern with your fixup patch: the ALIGN macro from linkage.h
(not the one from kernel.h, which is for C, not assembly) uses 0x90 as padding, which
apparently is NOP on x86 but does not make much sense on ARM64 (or ARM for that
matter). It is not currently used in arm{,64}. I am not sure if it could decode to a
valid instruction on ARM64, but maybe using simply 0x0 as a padding byte would be
less arbitrary. I also don't like this ALIGN macro too much, because the "4" argument
means something different depending on the architecture (4 bytes on x86, 2^4 on arm*:
https://sourceware.org/binutils/docs/as/Align.html).
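
As a concrete illustration (hypothetical directives, not taken from the patch),
these two lines are only equivalent on arm*; on x86 the first pads to a mere
4 bytes:

	.align	4	// x86: align to 4 bytes; arm*: align to 2^4 = 16 bytes
	.balign	16	// unambiguous on every target: align to 16 bytes

The .balign/.p2align directives avoid the per-architecture ambiguity, which is
why some assembly code prefers them over a bare .align.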

Thanks,
Kevin

>
> Will
>
> --->8
>
> diff --git a/arch/arm64/kernel/vdso/gettimeofday.S b/arch/arm64/kernel/vdso/gettimeofday.S
> index f49b6755058a..85969cd2b2f7 100644
> --- a/arch/arm64/kernel/vdso/gettimeofday.S
> +++ b/arch/arm64/kernel/vdso/gettimeofday.S
> @@ -115,6 +115,15 @@ x_tmp            .req    x8
>   9998:
>       .endm
>
> +     .macro clock_gettime_return, shift=0
> +     .if \shift == 1
> +     lsr     x11, x11, x12
> +     .endif
> +     stp     x10, x11, [x1, #TSPEC_TV_SEC]
> +     mov     x0, xzr
> +     ret
> +     .endm
> +
>       .macro jump_slot jumptable, index, label
>       .if (. - \jumptable) != 4 * (\index)
>               .error "Jump slot index mismatch"
> @@ -180,6 +189,7 @@ ENTRY(__kernel_clock_gettime)
>       add     x_tmp, x_tmp, w0, uxtw #2
>       br      x_tmp
>
> +     ALIGN
>   jumptable:
>       jump_slot jumptable, CLOCK_REALTIME, realtime
>       jump_slot jumptable, CLOCK_MONOTONIC, monotonic
> @@ -193,6 +203,7 @@ jumptable:
>       .error  "Wrong jumptable size"
>       .endif
>
> +     ALIGN
>   realtime:
>       seqcnt_acquire
>       syscall_check fail=syscall
> @@ -209,9 +220,9 @@ realtime:
>       get_clock_shifted_nsec res=x15, cycle_last=x10, mult=x11
>       get_ts_realtime res_sec=x10, res_nsec=x11, \
>               clock_nsec=x15, xtime_sec=x13, xtime_nsec=x14, nsec_to_sec=x9
> +     clock_gettime_return, shift=1
>
> -     b shift_store
> -
> +     ALIGN
>   monotonic:
>       seqcnt_acquire
>       syscall_check fail=syscall
> @@ -232,9 +243,9 @@ monotonic:
>               clock_nsec=x15, xtime_sec=x13, xtime_nsec=x14, nsec_to_sec=x9
>
>       add_ts sec=x10, nsec=x11, ts_sec=x3, ts_nsec=x4, nsec_to_sec=x9
> +     clock_gettime_return, shift=1
>
> -     b shift_store
> -
> +     ALIGN
>   monotonic_raw:
>       seqcnt_acquire
>       syscall_check fail=syscall
> @@ -254,16 +265,16 @@ monotonic_raw:
>               clock_nsec=x15, nsec_to_sec=x9
>
>       add_ts sec=x10, nsec=x11, ts_sec=x13, ts_nsec=x14, nsec_to_sec=x9
> +     clock_gettime_return, shift=1
>
> -     b shift_store
> -
> +     ALIGN
>   realtime_coarse:
>       seqcnt_acquire
>       ldp     x10, x11, [vdso_data, #VDSO_XTIME_CRS_SEC]
>       seqcnt_check fail=realtime_coarse
> +     clock_gettime_return
>
> -     b store
> -
> +     ALIGN
>   monotonic_coarse:
>       seqcnt_acquire
>       ldp     x10, x11, [vdso_data, #VDSO_XTIME_CRS_SEC]
> @@ -273,16 +284,9 @@ monotonic_coarse:
>       /* Computations are done in (non-shifted) nsecs. */
>       get_nsec_per_sec res=x9
>       add_ts sec=x10, nsec=x11, ts_sec=x13, ts_nsec=x14, nsec_to_sec=x9
> +     clock_gettime_return
>
> -     b store
> -
> -shift_store:
> -     lsr     x11, x11, x12
> -store:
> -     stp     x10, x11, [x1, #TSPEC_TV_SEC]
> -     mov     x0, xzr
> -     ret
> -
> +     ALIGN
>   syscall: /* Syscall fallback. */
>       mov     x8, #__NR_clock_gettime
>       svc     #0
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/2] arm64: Refactor vDSO time functions
  2016-07-11 17:31     ` Kevin Brodsky
@ 2016-07-11 17:42       ` Will Deacon
  2016-07-12  9:10         ` Kevin Brodsky
  0 siblings, 1 reply; 11+ messages in thread
From: Will Deacon @ 2016-07-11 17:42 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Jul 11, 2016 at 06:31:16PM +0100, Kevin Brodsky wrote:
> On 08/07/16 15:11, Will Deacon wrote:
> >Fixup patch below. If you fold it in, then:
> >
> >Acked-by: Will Deacon <will.deacon@arm.com>
> >
> >for the series.
> 
> I just have one small concern with your fixup patch: the ALIGN macro from
> linkage.h (not the one from kernel.h, which is for C, not assembly) uses
> 0x90 as padding, which apparently is NOP on x86 but does not make much sense
> on ARM64 (or ARM for that matter). It is not currently used in arm{,64}. I
> am not sure if it could decode to a valid instruction on ARM64, but maybe
> using simply 0x0 as a padding byte would be less arbitrary. I also don't
> like this ALIGN macro too much, because the "4" argument means something
> different depending on the architecture (4 bytes on x86, 2^4 on arm*:
> https://sourceware.org/binutils/docs/as/Align.html).

ALIGN expands to __ALIGN which is defined in
arch/arm64/include/asm/linkage.h as .align 4.

You're 4 years late ;)

http://git.kernel.org/linus/aeed41a9371e
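
For reference, the definitions in question (in the kernel tree of this era, per
the commit above) are: include/linux/linkage.h only supplies the x86-flavoured
fallback when the architecture has not already defined __ALIGN,

	#ifndef __ALIGN
	#define __ALIGN		.align 4,0x90
	#define __ALIGN_STR	".align 4,0x90"
	#endif

while arch/arm64/include/asm/linkage.h overrides it with

	#define __ALIGN		.align 4
	#define __ALIGN_STR	".align 4"

so the 0x90 fill byte never applies on arm64.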

Will

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/2] arm64: Refactor vDSO time functions
  2016-07-11 17:42       ` Will Deacon
@ 2016-07-12  9:10         ` Kevin Brodsky
  0 siblings, 0 replies; 11+ messages in thread
From: Kevin Brodsky @ 2016-07-12  9:10 UTC (permalink / raw)
  To: linux-arm-kernel

On 11/07/16 18:42, Will Deacon wrote:
> On Mon, Jul 11, 2016 at 06:31:16PM +0100, Kevin Brodsky wrote:
>> On 08/07/16 15:11, Will Deacon wrote:
>>> Fixup patch below. If you fold it in, then:
>>>
>>> Acked-by: Will Deacon <will.deacon@arm.com>
>>>
>>> for the series.
>> I just have one small concern with your fixup patch: the ALIGN macro from
>> linkage.h (not the one from kernel.h, which is for C, not assembly) uses
>> 0x90 as padding, which apparently is NOP on x86 but does not make much sense
>> on ARM64 (or ARM for that matter). It is not currently used in arm{,64}. I
>> am not sure if it could decode to a valid instruction on ARM64, but maybe
>> using simply 0x0 as a padding byte would be less arbitrary. I also don't
>> like this ALIGN macro too much, because the "4" argument means something
>> different depending on the architecture (4 bytes on x86, 2^4 on arm*:
>> https://sourceware.org/binutils/docs/as/Align.html).
> ALIGN expands to __ALIGN which is defined in
> arch/arm64/include/asm/linkage.h as .align 4.
>
> You're 4 years late ;)
>
> http://git.kernel.org/linus/aeed41a9371e
>
> Will
>

Oh, sorry, I didn't pay attention to the #ifndef __ALIGN in linux/linkage.h... All
good for me then; I will repost it.

Thanks,
Kevin

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2016-07-12  9:10 UTC | newest]

Thread overview: 11+ messages
2016-05-09 12:36 [PATCH 0/2] arm64: Add support for CLOCK_MONOTONIC_RAW in clock_gettime() vDSO Kevin Brodsky
2016-05-09 12:37 ` [PATCH 1/2] arm64: Refactor vDSO time functions Kevin Brodsky
2016-06-22 13:24   ` Christopher Covington
2016-07-01 13:46   ` Dave Martin
2016-07-04 17:12     ` Will Deacon
2016-07-08 14:11   ` Will Deacon
2016-07-11 17:31     ` Kevin Brodsky
2016-07-11 17:42       ` Will Deacon
2016-07-12  9:10         ` Kevin Brodsky
2016-05-09 12:37 ` [PATCH 2/2] arm64: Add support for CLOCK_MONOTONIC_RAW in clock_gettime() vDSO Kevin Brodsky
2016-07-01 13:48   ` Dave Martin
