* [PATCH 0/5 (rebase resend)] Arch-generic sched_clock NMI safety and optimizations for -tip
@ 2015-03-26 19:23 John Stultz
  2015-03-26 19:23 ` [PATCH 1/5] sched_clock: Match scope of read and write seqcounts John Stultz
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: John Stultz @ 2015-03-26 19:23 UTC (permalink / raw)
  To: lkml
  Cc: John Stultz, Russell King, Will Deacon, Catalin Marinas,
	Daniel Thompson, Thomas Gleixner, Stephen Boyd, Peter Zijlstra,
	Ingo Molnar

Ingo, Peter, Thomas,

(No real changes since last time, only rebased on top of tip/timers/core
to avoid the collision caused by some of my patches there. Also resent
as patches instead of a pull request, since Ingo asked.)

The following patches from Daniel extend the arch-generic sched_clock
implementation so that it can be safely called from NMI (or FIQ on ARM)
context. They also optimize the sched_clock logic to improve cache
performance. I wanted to send them along so they could be queued in
-tip for 4.1.

From Daniel's last post:
"The data cache profile of sched_clock() in both the original code and
my previous patch was somewhere between 2 and 3 (64-byte) cache lines,
depending on alignment of struct clock_data. After patching, the cache
profile for the normal case should be a single cacheline.

NMI safety was tested on i.MX6 with perf drowning the system in FIQs and
using the perf handler to check that sched_clock() returned monotonic
values. At the same time I forcefully reduced kt_wrap so that
update_sched_clock() is being called at >1000Hz.

Without the patches the above system is grossly unstable, surviving
[9K,115K,25K] perf event cycles during three separate runs. With the
patch I ran for over 9M perf event cycles before getting bored."

I'm relaying these because in the past I've queued and submitted
arch-generic sched_clock.c changes from Stephen via Thomas, but I wanted
to make sure this series got a look from Ingo and Peter.

In the future, I think Stephen is probably the right person to act as
patch-catcher for the kernel/time/sched_clock.c file, but those changes
should continue to be merged via the -tip tree with other scheduler
code.

Let me know if you have any objections or concerns.

thanks
-john

Cc: Russell King <linux@arm.linux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>

Daniel Thompson (5):
  sched_clock: Match scope of read and write seqcounts
  sched_clock: Optimize cache line usage
  sched_clock: Remove suspend from clock_read_data
  sched_clock: Remove redundant notrace from update function
  sched_clock: Avoid deadlock during read from NMI

 kernel/time/sched_clock.c | 195 ++++++++++++++++++++++++++++++++--------------
 1 file changed, 138 insertions(+), 57 deletions(-)

-- 
1.9.1



* [PATCH 1/5] sched_clock: Match scope of read and write seqcounts
  2015-03-26 19:23 [PATCH 0/5 (rebase resend)] Arch-generic sched_clock NMI safety and optimizations for -tip John Stultz
@ 2015-03-26 19:23 ` John Stultz
  2015-03-27  7:41   ` [tip:timers/core] timers, sched/clock: " tip-bot for Daniel Thompson
  2015-03-26 19:23 ` [PATCH 2/5] sched_clock: Optimize cache line usage John Stultz
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: John Stultz @ 2015-03-26 19:23 UTC (permalink / raw)
  To: lkml
  Cc: Daniel Thompson, Russell King, Will Deacon, Catalin Marinas,
	Thomas Gleixner, Stephen Boyd, Peter Zijlstra, Ingo Molnar,
	John Stultz

From: Daniel Thompson <daniel.thompson@linaro.org>

Currently the scope of the raw_write_seqcount_begin/end() in
sched_clock_register() far exceeds the scope of the read section in
sched_clock(). This gives the impression of safety during cursory review
but achieves little.

Note that this is likely to be a latent issue at present because
sched_clock_register() is typically called before we enable interrupts;
however, it does risk bugs being needlessly introduced as the code
evolves.

This patch fixes the problem by increasing the scope of the read locking
performed by sched_clock() to cover all data modified by
sched_clock_register().

We also improve clarity by moving writes to struct clock_data that do
not impact sched_clock() outside of the critical section.
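
To illustrate the discipline being restored here: the reader must
re-read, inside its retry loop, every field that the writer publishes
under the same seqcount. A minimal, self-contained sketch (hypothetical
names, not the hunk below):

#include <linux/types.h>
#include <linux/seqlock.h>

static seqcount_t example_seq = SEQCNT_ZERO(example_seq);
static struct { u64 a, b; } example_data;

static u64 example_read(void)
{
	unsigned long seq;
	u64 a, b;

	do {
		seq = raw_read_seqcount_begin(&example_seq);
		/* every field the writer publishes is (re)read here */
		a = example_data.a;
		b = example_data.b;
	} while (read_seqcount_retry(&example_seq, seq));

	return a + b;
}

static void example_write(u64 a, u64 b)
{
	/* the write side covers exactly the data the reader depends on */
	raw_write_seqcount_begin(&example_seq);
	example_data.a = a;
	example_data.b = b;
	raw_write_seqcount_end(&example_seq);
}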

Cc: Russell King <linux@arm.linux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
[jstultz: Slight rework to apply to tip/timers/core]
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 kernel/time/sched_clock.c | 26 +++++++++++---------------
 1 file changed, 11 insertions(+), 15 deletions(-)

diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c
index ca3bc5c..1751e95 100644
--- a/kernel/time/sched_clock.c
+++ b/kernel/time/sched_clock.c
@@ -58,23 +58,21 @@ static inline u64 notrace cyc_to_ns(u64 cyc, u32 mult, u32 shift)
 
 unsigned long long notrace sched_clock(void)
 {
-	u64 epoch_ns;
-	u64 epoch_cyc;
-	u64 cyc;
+	u64 cyc, res;
 	unsigned long seq;
 
-	if (cd.suspended)
-		return cd.epoch_ns;
-
 	do {
 		seq = raw_read_seqcount_begin(&cd.seq);
-		epoch_cyc = cd.epoch_cyc;
-		epoch_ns = cd.epoch_ns;
+
+		res = cd.epoch_ns;
+		if (!cd.suspended) {
+			cyc = read_sched_clock();
+			cyc = (cyc - cd.epoch_cyc) & sched_clock_mask;
+			res += cyc_to_ns(cyc, cd.mult, cd.shift);
+		}
 	} while (read_seqcount_retry(&cd.seq, seq));
 
-	cyc = read_sched_clock();
-	cyc = (cyc - epoch_cyc) & sched_clock_mask;
-	return epoch_ns + cyc_to_ns(cyc, cd.mult, cd.shift);
+	return res;
 }
 
 /*
@@ -111,7 +109,6 @@ void __init sched_clock_register(u64 (*read)(void), int bits,
 {
 	u64 res, wrap, new_mask, new_epoch, cyc, ns;
 	u32 new_mult, new_shift;
-	ktime_t new_wrap_kt;
 	unsigned long r;
 	char r_unit;
 
@@ -124,10 +121,11 @@ void __init sched_clock_register(u64 (*read)(void), int bits,
 	clocks_calc_mult_shift(&new_mult, &new_shift, rate, NSEC_PER_SEC, 3600);
 
 	new_mask = CLOCKSOURCE_MASK(bits);
+	cd.rate = rate;
 
 	/* calculate how many nanosecs until we risk wrapping */
 	wrap = clocks_calc_max_nsecs(new_mult, new_shift, 0, new_mask, NULL);
-	new_wrap_kt = ns_to_ktime(wrap);
+	cd.wrap_kt = ns_to_ktime(wrap);
 
 	/* update epoch for new counter and update epoch_ns from old counter*/
 	new_epoch = read();
@@ -138,8 +136,6 @@ void __init sched_clock_register(u64 (*read)(void), int bits,
 	raw_write_seqcount_begin(&cd.seq);
 	read_sched_clock = read;
 	sched_clock_mask = new_mask;
-	cd.rate = rate;
-	cd.wrap_kt = new_wrap_kt;
 	cd.mult = new_mult;
 	cd.shift = new_shift;
 	cd.epoch_cyc = new_epoch;
-- 
1.9.1



* [PATCH 2/5] sched_clock: Optimize cache line usage
  2015-03-26 19:23 [PATCH 0/5 (rebase resend)] Arch-generic sched_clock NMI safety and optimizations for -tip John Stultz
  2015-03-26 19:23 ` [PATCH 1/5] sched_clock: Match scope of read and write seqcounts John Stultz
@ 2015-03-26 19:23 ` John Stultz
  2015-03-27  7:41   ` [tip:timers/core] timers, sched/clock: " tip-bot for Daniel Thompson
  2015-03-26 19:23 ` [PATCH 3/5] sched_clock: Remove suspend from clock_read_data John Stultz
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: John Stultz @ 2015-03-26 19:23 UTC (permalink / raw)
  To: lkml
  Cc: Daniel Thompson, Russell King, Will Deacon, Catalin Marinas,
	Thomas Gleixner, Stephen Boyd, Peter Zijlstra, Ingo Molnar,
	John Stultz

From: Daniel Thompson <daniel.thompson@linaro.org>

Currently sched_clock(), a very hot code path, is not optimized to
minimise its cache profile. In particular:

  1. cd is not ____cacheline_aligned,

  2. struct clock_data does not distinguish between hotpath and
     coldpath data, reducing locality of reference in the hotpath,

  3. Some hotpath data is missing from struct clock_data and is marked
     __read_mostly (which more or less guarantees it will not share a
     cache line with cd).

This patch corrects these problems by extracting all hotpath data
into a separate structure and using ____cacheline_aligned to ensure
the hotpath uses a single (64 byte) cache line.
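
To make the sizing concrete: assuming a 64-bit kernel (8-byte pointers)
and no seqcount lockdep debugging, the hot fields come to at most 48
bytes, so together with the 4-byte seqcount they fit comfortably in one
64-byte line. An illustrative layout (placeholder names; the real
definitions are in the hunk below):

#include <linux/types.h>
#include <linux/seqlock.h>
#include <linux/ktime.h>
#include <linux/cache.h>

struct hot_read_data {			/* <= 48 bytes on 64-bit */
	u64 epoch_ns;			/*  8 */
	u64 epoch_cyc;			/*  8 */
	u64 mask;			/*  8 */
	u64 (*read_clock)(void);	/*  8 */
	u32 mult;			/*  4 */
	u32 shift;			/*  4 */
	bool suspended;			/*  1, padded to 48 */
};

struct all_clock_data {
	seqcount_t seq;			/* shares the hot cache line */
	struct hot_read_data read_data;
	/* cold, registration-time-only data below */
	ktime_t wrap_kt;
	unsigned long rate;
};

/* pin the hot data to the start of its own cache line */
static struct all_clock_data example_cd ____cacheline_aligned;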

Cc: Russell King <linux@arm.linux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 kernel/time/sched_clock.c | 112 +++++++++++++++++++++++++++++++---------------
 1 file changed, 77 insertions(+), 35 deletions(-)

diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c
index 1751e95..872e068 100644
--- a/kernel/time/sched_clock.c
+++ b/kernel/time/sched_clock.c
@@ -18,28 +18,59 @@
 #include <linux/seqlock.h>
 #include <linux/bitops.h>
 
-struct clock_data {
-	ktime_t wrap_kt;
+/**
+ * struct clock_read_data - data required to read from sched_clock
+ *
+ * @epoch_ns:		sched_clock value at last update
+ * @epoch_cyc:		Clock cycle value at last update
+ * @sched_clock_mask:   Bitmask for two's complement subtraction of non 64bit
+ *			clocks
+ * @read_sched_clock:	Current clock source (or dummy source when suspended)
+ * @mult:		Multipler for scaled math conversion
+ * @shift:		Shift value for scaled math conversion
+ * @suspended:		Flag to indicate if the clock is suspended (stopped)
+ *
+ * Care must be taken when updating this structure; it is read by
+ * some very hot code paths. It occupies <=48 bytes and, when combined
+ * with the seqcount used to synchronize access, comfortably fits into
+ * a 64 byte cache line.
+ */
+struct clock_read_data {
 	u64 epoch_ns;
 	u64 epoch_cyc;
-	seqcount_t seq;
-	unsigned long rate;
+	u64 sched_clock_mask;
+	u64 (*read_sched_clock)(void);
 	u32 mult;
 	u32 shift;
 	bool suspended;
 };
 
+/**
+ * struct clock_data - all data needed for sched_clock (including
+ *                     registration of a new clock source)
+ *
+ * @seq:		Sequence counter for protecting updates.
+ * @read_data:		Data required to read from sched_clock.
+ * @wrap_kt:		Duration for which clock can run before wrapping
+ * @rate:		Tick rate of the registered clock
+ * @actual_read_sched_clock: Registered clock read function
+ *
+ * The ordering of this structure has been chosen to optimize cache
+ * performance. In particular seq and read_data (combined) should fit
+ * into a single 64 byte cache line.
+ */
+struct clock_data {
+	seqcount_t seq;
+	struct clock_read_data read_data;
+	ktime_t wrap_kt;
+	unsigned long rate;
+};
+
 static struct hrtimer sched_clock_timer;
 static int irqtime = -1;
 
 core_param(irqtime, irqtime, int, 0400);
 
-static struct clock_data cd = {
-	.mult	= NSEC_PER_SEC / HZ,
-};
-
-static u64 __read_mostly sched_clock_mask;
-
 static u64 notrace jiffy_sched_clock_read(void)
 {
 	/*
@@ -49,7 +80,10 @@ static u64 notrace jiffy_sched_clock_read(void)
 	return (u64)(jiffies - INITIAL_JIFFIES);
 }
 
-static u64 __read_mostly (*read_sched_clock)(void) = jiffy_sched_clock_read;
+static struct clock_data cd ____cacheline_aligned = {
+	.read_data = { .mult = NSEC_PER_SEC / HZ,
+		       .read_sched_clock = jiffy_sched_clock_read, },
+};
 
 static inline u64 notrace cyc_to_ns(u64 cyc, u32 mult, u32 shift)
 {
@@ -60,15 +94,16 @@ unsigned long long notrace sched_clock(void)
 {
 	u64 cyc, res;
 	unsigned long seq;
+	struct clock_read_data *rd = &cd.read_data;
 
 	do {
 		seq = raw_read_seqcount_begin(&cd.seq);
 
-		res = cd.epoch_ns;
-		if (!cd.suspended) {
-			cyc = read_sched_clock();
-			cyc = (cyc - cd.epoch_cyc) & sched_clock_mask;
-			res += cyc_to_ns(cyc, cd.mult, cd.shift);
+		res = rd->epoch_ns;
+		if (!rd->suspended) {
+			cyc = rd->read_sched_clock();
+			cyc = (cyc - rd->epoch_cyc) & rd->sched_clock_mask;
+			res += cyc_to_ns(cyc, rd->mult, rd->shift);
 		}
 	} while (read_seqcount_retry(&cd.seq, seq));
 
@@ -83,16 +118,17 @@ static void notrace update_sched_clock(void)
 	unsigned long flags;
 	u64 cyc;
 	u64 ns;
+	struct clock_read_data *rd = &cd.read_data;
 
-	cyc = read_sched_clock();
-	ns = cd.epoch_ns +
-		cyc_to_ns((cyc - cd.epoch_cyc) & sched_clock_mask,
-			  cd.mult, cd.shift);
+	cyc = rd->read_sched_clock();
+	ns = rd->epoch_ns +
+	     cyc_to_ns((cyc - rd->epoch_cyc) & rd->sched_clock_mask,
+		       rd->mult, rd->shift);
 
 	raw_local_irq_save(flags);
 	raw_write_seqcount_begin(&cd.seq);
-	cd.epoch_ns = ns;
-	cd.epoch_cyc = cyc;
+	rd->epoch_ns = ns;
+	rd->epoch_cyc = cyc;
 	raw_write_seqcount_end(&cd.seq);
 	raw_local_irq_restore(flags);
 }
@@ -111,6 +147,7 @@ void __init sched_clock_register(u64 (*read)(void), int bits,
 	u32 new_mult, new_shift;
 	unsigned long r;
 	char r_unit;
+	struct clock_read_data *rd = &cd.read_data;
 
 	if (cd.rate > rate)
 		return;
@@ -129,17 +166,18 @@ void __init sched_clock_register(u64 (*read)(void), int bits,
 
 	/* update epoch for new counter and update epoch_ns from old counter*/
 	new_epoch = read();
-	cyc = read_sched_clock();
-	ns = cd.epoch_ns + cyc_to_ns((cyc - cd.epoch_cyc) & sched_clock_mask,
-			  cd.mult, cd.shift);
+	cyc = rd->read_sched_clock();
+	ns = rd->epoch_ns +
+	     cyc_to_ns((cyc - rd->epoch_cyc) & rd->sched_clock_mask,
+		       rd->mult, rd->shift);
 
 	raw_write_seqcount_begin(&cd.seq);
-	read_sched_clock = read;
-	sched_clock_mask = new_mask;
-	cd.mult = new_mult;
-	cd.shift = new_shift;
-	cd.epoch_cyc = new_epoch;
-	cd.epoch_ns = ns;
+	rd->read_sched_clock = read;
+	rd->sched_clock_mask = new_mask;
+	rd->mult = new_mult;
+	rd->shift = new_shift;
+	rd->epoch_cyc = new_epoch;
+	rd->epoch_ns = ns;
 	raw_write_seqcount_end(&cd.seq);
 
 	r = rate;
@@ -171,7 +209,7 @@ void __init sched_clock_postinit(void)
 	 * If no sched_clock function has been provided at that point,
 	 * make it the final one one.
 	 */
-	if (read_sched_clock == jiffy_sched_clock_read)
+	if (cd.read_data.read_sched_clock == jiffy_sched_clock_read)
 		sched_clock_register(jiffy_sched_clock_read, BITS_PER_LONG, HZ);
 
 	update_sched_clock();
@@ -187,17 +225,21 @@ void __init sched_clock_postinit(void)
 
 static int sched_clock_suspend(void)
 {
+	struct clock_read_data *rd = &cd.read_data;
+
 	update_sched_clock();
 	hrtimer_cancel(&sched_clock_timer);
-	cd.suspended = true;
+	rd->suspended = true;
 	return 0;
 }
 
 static void sched_clock_resume(void)
 {
-	cd.epoch_cyc = read_sched_clock();
+	struct clock_read_data *rd = &cd.read_data;
+
+	rd->epoch_cyc = rd->read_sched_clock();
 	hrtimer_start(&sched_clock_timer, cd.wrap_kt, HRTIMER_MODE_REL);
-	cd.suspended = false;
+	rd->suspended = false;
 }
 
 static struct syscore_ops sched_clock_ops = {
-- 
1.9.1



* [PATCH 3/5] sched_clock: Remove suspend from clock_read_data
  2015-03-26 19:23 [PATCH 0/5 (rebase resend)] Arch-generic sched_clock NMI safety and optimizations for -tip John Stultz
  2015-03-26 19:23 ` [PATCH 1/5] sched_clock: Match scope of read and write seqcounts John Stultz
  2015-03-26 19:23 ` [PATCH 2/5] sched_clock: Optimize cache line usage John Stultz
@ 2015-03-26 19:23 ` John Stultz
  2015-03-27  7:42   ` [tip:timers/core] timers, sched/clock: Remove suspend from clock_read_data() tip-bot for Daniel Thompson
  2015-03-26 19:23 ` [PATCH 4/5] sched_clock: Remove redundant notrace from update function John Stultz
  2015-03-26 19:23 ` [PATCH 5/5] sched_clock: Avoid deadlock during read from NMI John Stultz
  4 siblings, 1 reply; 12+ messages in thread
From: John Stultz @ 2015-03-26 19:23 UTC (permalink / raw)
  To: lkml
  Cc: Daniel Thompson, Russell King, Will Deacon, Catalin Marinas,
	Thomas Gleixner, Stephen Boyd, Peter Zijlstra, Ingo Molnar,
	John Stultz

From: Daniel Thompson <daniel.thompson@linaro.org>

Currently cd.read_data.suspended is read by the hotpath function
sched_clock(). This variable need not be accessed on the hotpath. In
fact, once it is removed, we can remove the conditional branches from
sched_clock() and install a dummy read_sched_clock function to suspend
the clock.

A new master copy of the function pointer (actual_read_sched_clock) is
introduced and is used for all reads of the clock hardware except those
within sched_clock() itself.
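
Condensed to its essence (placeholder names; the real change is in the
hunk below), the idea is to express "suspended" by swapping the reader
function rather than testing a flag on every call:

#include <linux/types.h>

/* the real hardware reader, set at registration time */
static u64 (*example_actual_read_clock)(void);

static struct {
	u64 epoch_cyc;
	u64 (*read_clock)(void);	/* what the hot path calls */
} example_rd;

static u64 frozen_clock_read(void)
{
	/* the cycle delta against epoch_cyc is zero: clock appears stopped */
	return example_rd.epoch_cyc;
}

static void example_suspend(void)
{
	example_rd.read_clock = frozen_clock_read;
}

static void example_resume(void)
{
	example_rd.epoch_cyc = example_actual_read_clock();
	example_rd.read_clock = example_actual_read_clock;
}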

Cc: Russell King <linux@arm.linux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 kernel/time/sched_clock.c | 40 +++++++++++++++++++++++++---------------
 1 file changed, 25 insertions(+), 15 deletions(-)

diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c
index 872e068..52ea5d9 100644
--- a/kernel/time/sched_clock.c
+++ b/kernel/time/sched_clock.c
@@ -28,10 +28,9 @@
  * @read_sched_clock:	Current clock source (or dummy source when suspended)
  * @mult:		Multipler for scaled math conversion
  * @shift:		Shift value for scaled math conversion
- * @suspended:		Flag to indicate if the clock is suspended (stopped)
  *
  * Care must be taken when updating this structure; it is read by
- * some very hot code paths. It occupies <=48 bytes and, when combined
+ * some very hot code paths. It occupies <=40 bytes and, when combined
  * with the seqcount used to synchronize access, comfortably fits into
  * a 64 byte cache line.
  */
@@ -42,7 +41,6 @@ struct clock_read_data {
 	u64 (*read_sched_clock)(void);
 	u32 mult;
 	u32 shift;
-	bool suspended;
 };
 
 /**
@@ -64,6 +62,7 @@ struct clock_data {
 	struct clock_read_data read_data;
 	ktime_t wrap_kt;
 	unsigned long rate;
+	u64 (*actual_read_sched_clock)(void);
 };
 
 static struct hrtimer sched_clock_timer;
@@ -83,6 +82,8 @@ static u64 notrace jiffy_sched_clock_read(void)
 static struct clock_data cd ____cacheline_aligned = {
 	.read_data = { .mult = NSEC_PER_SEC / HZ,
 		       .read_sched_clock = jiffy_sched_clock_read, },
+	.actual_read_sched_clock = jiffy_sched_clock_read,
+
 };
 
 static inline u64 notrace cyc_to_ns(u64 cyc, u32 mult, u32 shift)
@@ -99,12 +100,9 @@ unsigned long long notrace sched_clock(void)
 	do {
 		seq = raw_read_seqcount_begin(&cd.seq);
 
-		res = rd->epoch_ns;
-		if (!rd->suspended) {
-			cyc = rd->read_sched_clock();
-			cyc = (cyc - rd->epoch_cyc) & rd->sched_clock_mask;
-			res += cyc_to_ns(cyc, rd->mult, rd->shift);
-		}
+		cyc = (rd->read_sched_clock() - rd->epoch_cyc) &
+		      rd->sched_clock_mask;
+		res = rd->epoch_ns + cyc_to_ns(cyc, rd->mult, rd->shift);
 	} while (read_seqcount_retry(&cd.seq, seq));
 
 	return res;
@@ -120,7 +118,7 @@ static void notrace update_sched_clock(void)
 	u64 ns;
 	struct clock_read_data *rd = &cd.read_data;
 
-	cyc = rd->read_sched_clock();
+	cyc = cd.actual_read_sched_clock();
 	ns = rd->epoch_ns +
 	     cyc_to_ns((cyc - rd->epoch_cyc) & rd->sched_clock_mask,
 		       rd->mult, rd->shift);
@@ -166,10 +164,11 @@ void __init sched_clock_register(u64 (*read)(void), int bits,
 
 	/* update epoch for new counter and update epoch_ns from old counter*/
 	new_epoch = read();
-	cyc = rd->read_sched_clock();
+	cyc = cd.actual_read_sched_clock();
 	ns = rd->epoch_ns +
 	     cyc_to_ns((cyc - rd->epoch_cyc) & rd->sched_clock_mask,
 		       rd->mult, rd->shift);
+	cd.actual_read_sched_clock = read;
 
 	raw_write_seqcount_begin(&cd.seq);
 	rd->read_sched_clock = read;
@@ -209,7 +208,7 @@ void __init sched_clock_postinit(void)
 	 * If no sched_clock function has been provided at that point,
 	 * make it the final one one.
 	 */
-	if (cd.read_data.read_sched_clock == jiffy_sched_clock_read)
+	if (cd.actual_read_sched_clock == jiffy_sched_clock_read)
 		sched_clock_register(jiffy_sched_clock_read, BITS_PER_LONG, HZ);
 
 	update_sched_clock();
@@ -223,13 +222,24 @@ void __init sched_clock_postinit(void)
 	hrtimer_start(&sched_clock_timer, cd.wrap_kt, HRTIMER_MODE_REL);
 }
 
+/*
+ * Clock read function for use when the clock is suspended.
+ *
+ * This function makes it appear to sched_clock() as if the clock
+ * stopped counting at its last update.
+ */
+static u64 notrace suspended_sched_clock_read(void)
+{
+	return cd.read_data.epoch_cyc;
+}
+
 static int sched_clock_suspend(void)
 {
 	struct clock_read_data *rd = &cd.read_data;
 
 	update_sched_clock();
 	hrtimer_cancel(&sched_clock_timer);
-	rd->suspended = true;
+	rd->read_sched_clock = suspended_sched_clock_read;
 	return 0;
 }
 
@@ -237,9 +247,9 @@ static void sched_clock_resume(void)
 {
 	struct clock_read_data *rd = &cd.read_data;
 
-	rd->epoch_cyc = rd->read_sched_clock();
+	rd->epoch_cyc = cd.actual_read_sched_clock();
 	hrtimer_start(&sched_clock_timer, cd.wrap_kt, HRTIMER_MODE_REL);
-	rd->suspended = false;
+	rd->read_sched_clock = cd.actual_read_sched_clock;
 }
 
 static struct syscore_ops sched_clock_ops = {
-- 
1.9.1



* [PATCH 4/5] sched_clock: Remove redundant notrace from update function
  2015-03-26 19:23 [PATCH 0/5 (rebase resend)] Arch-generic sched_clock NMI safety and optimizations for -tip John Stultz
                   ` (2 preceding siblings ...)
  2015-03-26 19:23 ` [PATCH 3/5] sched_clock: Remove suspend from clock_read_data John Stultz
@ 2015-03-26 19:23 ` John Stultz
  2015-03-27  7:42   ` [tip:timers/core] timers, sched/clock: " tip-bot for Daniel Thompson
  2015-03-26 19:23 ` [PATCH 5/5] sched_clock: Avoid deadlock during read from NMI John Stultz
  4 siblings, 1 reply; 12+ messages in thread
From: John Stultz @ 2015-03-26 19:23 UTC (permalink / raw)
  To: lkml
  Cc: Daniel Thompson, Russell King, Will Deacon, Catalin Marinas,
	Thomas Gleixner, Stephen Boyd, Peter Zijlstra, Ingo Molnar,
	John Stultz

From: Daniel Thompson <daniel.thompson@linaro.org>

Currently update_sched_clock() is marked as notrace but this function
is not called by ftrace. This is trivially fixed by removing the
markup.

Cc: Russell King <linux@arm.linux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 kernel/time/sched_clock.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c
index 52ea5d9..8adb9d0 100644
--- a/kernel/time/sched_clock.c
+++ b/kernel/time/sched_clock.c
@@ -111,7 +111,7 @@ unsigned long long notrace sched_clock(void)
 /*
  * Atomically update the sched_clock epoch.
  */
-static void notrace update_sched_clock(void)
+static void update_sched_clock(void)
 {
 	unsigned long flags;
 	u64 cyc;
-- 
1.9.1



* [PATCH 5/5] sched_clock: Avoid deadlock during read from NMI
  2015-03-26 19:23 [PATCH 0/5 (rebase resend)] Arch-generic sched_clock NMI safety and optimizations for -tip John Stultz
                   ` (3 preceding siblings ...)
  2015-03-26 19:23 ` [PATCH 4/5] sched_clock: Remove redundant notrace from update function John Stultz
@ 2015-03-26 19:23 ` John Stultz
  2015-03-27  7:42   ` [tip:timers/core] timers, sched/clock: " tip-bot for Daniel Thompson
  4 siblings, 1 reply; 12+ messages in thread
From: John Stultz @ 2015-03-26 19:23 UTC (permalink / raw)
  To: lkml
  Cc: Daniel Thompson, Russell King, Will Deacon, Catalin Marinas,
	Thomas Gleixner, Stephen Boyd, Peter Zijlstra, Ingo Molnar,
	John Stultz

From: Daniel Thompson <daniel.thompson@linaro.org>

Currently it is possible for an NMI (or FIQ on ARM) to come in and
read sched_clock() whilst update_sched_clock() has locked the seqcount
for writing. This results in the NMI handler locking up when it calls
raw_read_seqcount_begin().

This patch fixes the NMI safety issues by providing banked clock data.
This is a similar approach to the one used in Thomas Gleixner's
4396e058c52e ("timekeeping: Provide fast and NMI safe access to
CLOCK_MONOTONIC").

Cc: Russell King <linux@arm.linux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Thompson <daniel.thompson@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Suggested-by: Stephen Boyd <sboyd@codeaurora.org>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 kernel/time/sched_clock.c | 103 ++++++++++++++++++++++++++++++----------------
 1 file changed, 68 insertions(+), 35 deletions(-)

diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c
index 8adb9d0..eeea1e9 100644
--- a/kernel/time/sched_clock.c
+++ b/kernel/time/sched_clock.c
@@ -47,19 +47,20 @@ struct clock_read_data {
  * struct clock_data - all data needed for sched_clock (including
  *                     registration of a new clock source)
  *
- * @seq:		Sequence counter for protecting updates.
+ * @seq:		Sequence counter for protecting updates. The lowest
+ *			bit is the index for @read_data.
  * @read_data:		Data required to read from sched_clock.
  * @wrap_kt:		Duration for which clock can run before wrapping
  * @rate:		Tick rate of the registered clock
  * @actual_read_sched_clock: Registered clock read function
  *
  * The ordering of this structure has been chosen to optimize cache
- * performance. In particular seq and read_data (combined) should fit
+ * performance. In particular seq and read_data[0] (combined) should fit
  * into a single 64 byte cache line.
  */
 struct clock_data {
 	seqcount_t seq;
-	struct clock_read_data read_data;
+	struct clock_read_data read_data[2];
 	ktime_t wrap_kt;
 	unsigned long rate;
 	u64 (*actual_read_sched_clock)(void);
@@ -80,10 +81,9 @@ static u64 notrace jiffy_sched_clock_read(void)
 }
 
 static struct clock_data cd ____cacheline_aligned = {
-	.read_data = { .mult = NSEC_PER_SEC / HZ,
-		       .read_sched_clock = jiffy_sched_clock_read, },
+	.read_data[0] = { .mult = NSEC_PER_SEC / HZ,
+			  .read_sched_clock = jiffy_sched_clock_read, },
 	.actual_read_sched_clock = jiffy_sched_clock_read,
-
 };
 
 static inline u64 notrace cyc_to_ns(u64 cyc, u32 mult, u32 shift)
@@ -95,10 +95,11 @@ unsigned long long notrace sched_clock(void)
 {
 	u64 cyc, res;
 	unsigned long seq;
-	struct clock_read_data *rd = &cd.read_data;
+	struct clock_read_data *rd;
 
 	do {
-		seq = raw_read_seqcount_begin(&cd.seq);
+		seq = raw_read_seqcount(&cd.seq);
+		rd = cd.read_data + (seq & 1);
 
 		cyc = (rd->read_sched_clock() - rd->epoch_cyc) &
 		      rd->sched_clock_mask;
@@ -109,26 +110,50 @@ unsigned long long notrace sched_clock(void)
 }
 
 /*
+ * Updating the data required to read the clock.
+ *
+ * sched_clock will never observe mis-matched data even if called from
+ * an NMI. We do this by maintaining an odd/even copy of the data and
+ * steering sched_clock to one or the other using a sequence counter.
+ * In order to preserve the data cache profile of sched_clock as much
+ * as possible the system reverts back to the even copy when the update
+ * completes; the odd copy is used *only* during an update.
+ */
+static void update_clock_read_data(struct clock_read_data *rd)
+{
+	/* update the backup (odd) copy with the new data */
+	cd.read_data[1] = *rd;
+
+	/* steer readers towards the odd copy */
+	raw_write_seqcount_latch(&cd.seq);
+
+	/* now its safe for us to update the normal (even) copy */
+	cd.read_data[0] = *rd;
+
+	/* switch readers back to the even copy */
+	raw_write_seqcount_latch(&cd.seq);
+}
+
+/*
  * Atomically update the sched_clock epoch.
  */
 static void update_sched_clock(void)
 {
-	unsigned long flags;
 	u64 cyc;
 	u64 ns;
-	struct clock_read_data *rd = &cd.read_data;
+	struct clock_read_data rd;
+
+	rd = cd.read_data[0];
 
 	cyc = cd.actual_read_sched_clock();
-	ns = rd->epoch_ns +
-	     cyc_to_ns((cyc - rd->epoch_cyc) & rd->sched_clock_mask,
-		       rd->mult, rd->shift);
-
-	raw_local_irq_save(flags);
-	raw_write_seqcount_begin(&cd.seq);
-	rd->epoch_ns = ns;
-	rd->epoch_cyc = cyc;
-	raw_write_seqcount_end(&cd.seq);
-	raw_local_irq_restore(flags);
+	ns = rd.epoch_ns +
+	     cyc_to_ns((cyc - rd.epoch_cyc) & rd.sched_clock_mask,
+		       rd.mult, rd.shift);
+
+	rd.epoch_ns = ns;
+	rd.epoch_cyc = cyc;
+
+	update_clock_read_data(&rd);
 }
 
 static enum hrtimer_restart sched_clock_poll(struct hrtimer *hrt)
@@ -145,7 +170,7 @@ void __init sched_clock_register(u64 (*read)(void), int bits,
 	u32 new_mult, new_shift;
 	unsigned long r;
 	char r_unit;
-	struct clock_read_data *rd = &cd.read_data;
+	struct clock_read_data rd;
 
 	if (cd.rate > rate)
 		return;
@@ -162,22 +187,23 @@ void __init sched_clock_register(u64 (*read)(void), int bits,
 	wrap = clocks_calc_max_nsecs(new_mult, new_shift, 0, new_mask, NULL);
 	cd.wrap_kt = ns_to_ktime(wrap);
 
+	rd = cd.read_data[0];
+
 	/* update epoch for new counter and update epoch_ns from old counter*/
 	new_epoch = read();
 	cyc = cd.actual_read_sched_clock();
-	ns = rd->epoch_ns +
-	     cyc_to_ns((cyc - rd->epoch_cyc) & rd->sched_clock_mask,
-		       rd->mult, rd->shift);
+	ns = rd.epoch_ns +
+	     cyc_to_ns((cyc - rd.epoch_cyc) & rd.sched_clock_mask,
+		       rd.mult, rd.shift);
 	cd.actual_read_sched_clock = read;
 
-	raw_write_seqcount_begin(&cd.seq);
-	rd->read_sched_clock = read;
-	rd->sched_clock_mask = new_mask;
-	rd->mult = new_mult;
-	rd->shift = new_shift;
-	rd->epoch_cyc = new_epoch;
-	rd->epoch_ns = ns;
-	raw_write_seqcount_end(&cd.seq);
+	rd.read_sched_clock = read;
+	rd.sched_clock_mask = new_mask;
+	rd.mult = new_mult;
+	rd.shift = new_shift;
+	rd.epoch_cyc = new_epoch;
+	rd.epoch_ns = ns;
+	update_clock_read_data(&rd);
 
 	r = rate;
 	if (r >= 4000000) {
@@ -227,15 +253,22 @@ void __init sched_clock_postinit(void)
  *
  * This function makes it appear to sched_clock() as if the clock
  * stopped counting at its last update.
+ *
+ * This function must only be called from the critical
+ * section in sched_clock(). It relies on the read_seqcount_retry()
+ * at the end of the critical section to be sure we observe the
+ * correct copy of epoch_cyc.
  */
 static u64 notrace suspended_sched_clock_read(void)
 {
-	return cd.read_data.epoch_cyc;
+	unsigned long seq = raw_read_seqcount(&cd.seq);
+
+	return cd.read_data[seq & 1].epoch_cyc;
 }
 
 static int sched_clock_suspend(void)
 {
-	struct clock_read_data *rd = &cd.read_data;
+	struct clock_read_data *rd = &cd.read_data[0];
 
 	update_sched_clock();
 	hrtimer_cancel(&sched_clock_timer);
@@ -245,7 +278,7 @@ static int sched_clock_suspend(void)
 
 static void sched_clock_resume(void)
 {
-	struct clock_read_data *rd = &cd.read_data;
+	struct clock_read_data *rd = &cd.read_data[0];
 
 	rd->epoch_cyc = cd.actual_read_sched_clock();
 	hrtimer_start(&sched_clock_timer, cd.wrap_kt, HRTIMER_MODE_REL);
-- 
1.9.1



* [tip:timers/core] timers, sched/clock: Match scope of read and write seqcounts
  2015-03-26 19:23 ` [PATCH 1/5] sched_clock: Match scope of read and write seqcounts John Stultz
@ 2015-03-27  7:41   ` tip-bot for Daniel Thompson
  0 siblings, 0 replies; 12+ messages in thread
From: tip-bot for Daniel Thompson @ 2015-03-27  7:41 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: peterz, will.deacon, sboyd, mingo, john.stultz, linux,
	daniel.thompson, hpa, catalin.marinas, linux-kernel, tglx

Commit-ID:  8710e914027e4f64058ebbf0501cc6db3cc8454f
Gitweb:     http://git.kernel.org/tip/8710e914027e4f64058ebbf0501cc6db3cc8454f
Author:     Daniel Thompson <daniel.thompson@linaro.org>
AuthorDate: Thu, 26 Mar 2015 12:23:22 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 27 Mar 2015 08:33:56 +0100

timers, sched/clock: Match scope of read and write seqcounts

Currently the scope of the raw_write_seqcount_begin/end() in
sched_clock_register() far exceeds the scope of the read section
in sched_clock(). This gives the impression of safety during
cursory review but achieves little.

Note that this is likely to be a latent issue at present because
sched_clock_register() is typically called before we enable
interrupts; however, it does risk bugs being needlessly
introduced as the code evolves.

This patch fixes the problem by increasing the scope of the read
locking performed by sched_clock() to cover all data modified by
sched_clock_register().

We also improve clarity by moving writes to struct clock_data
that do not impact sched_clock() outside of the critical
section.

Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
[ Reworked it slightly to apply to tip/timers/core]
Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/1427397806-20889-2-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/time/sched_clock.c | 26 +++++++++++---------------
 1 file changed, 11 insertions(+), 15 deletions(-)

diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c
index ca3bc5c..1751e95 100644
--- a/kernel/time/sched_clock.c
+++ b/kernel/time/sched_clock.c
@@ -58,23 +58,21 @@ static inline u64 notrace cyc_to_ns(u64 cyc, u32 mult, u32 shift)
 
 unsigned long long notrace sched_clock(void)
 {
-	u64 epoch_ns;
-	u64 epoch_cyc;
-	u64 cyc;
+	u64 cyc, res;
 	unsigned long seq;
 
-	if (cd.suspended)
-		return cd.epoch_ns;
-
 	do {
 		seq = raw_read_seqcount_begin(&cd.seq);
-		epoch_cyc = cd.epoch_cyc;
-		epoch_ns = cd.epoch_ns;
+
+		res = cd.epoch_ns;
+		if (!cd.suspended) {
+			cyc = read_sched_clock();
+			cyc = (cyc - cd.epoch_cyc) & sched_clock_mask;
+			res += cyc_to_ns(cyc, cd.mult, cd.shift);
+		}
 	} while (read_seqcount_retry(&cd.seq, seq));
 
-	cyc = read_sched_clock();
-	cyc = (cyc - epoch_cyc) & sched_clock_mask;
-	return epoch_ns + cyc_to_ns(cyc, cd.mult, cd.shift);
+	return res;
 }
 
 /*
@@ -111,7 +109,6 @@ void __init sched_clock_register(u64 (*read)(void), int bits,
 {
 	u64 res, wrap, new_mask, new_epoch, cyc, ns;
 	u32 new_mult, new_shift;
-	ktime_t new_wrap_kt;
 	unsigned long r;
 	char r_unit;
 
@@ -124,10 +121,11 @@ void __init sched_clock_register(u64 (*read)(void), int bits,
 	clocks_calc_mult_shift(&new_mult, &new_shift, rate, NSEC_PER_SEC, 3600);
 
 	new_mask = CLOCKSOURCE_MASK(bits);
+	cd.rate = rate;
 
 	/* calculate how many nanosecs until we risk wrapping */
 	wrap = clocks_calc_max_nsecs(new_mult, new_shift, 0, new_mask, NULL);
-	new_wrap_kt = ns_to_ktime(wrap);
+	cd.wrap_kt = ns_to_ktime(wrap);
 
 	/* update epoch for new counter and update epoch_ns from old counter*/
 	new_epoch = read();
@@ -138,8 +136,6 @@ void __init sched_clock_register(u64 (*read)(void), int bits,
 	raw_write_seqcount_begin(&cd.seq);
 	read_sched_clock = read;
 	sched_clock_mask = new_mask;
-	cd.rate = rate;
-	cd.wrap_kt = new_wrap_kt;
 	cd.mult = new_mult;
 	cd.shift = new_shift;
 	cd.epoch_cyc = new_epoch;


* [tip:timers/core] timers, sched/clock: Optimize cache line usage
  2015-03-26 19:23 ` [PATCH 2/5] sched_clock: Optimize cache line usage John Stultz
@ 2015-03-27  7:41   ` tip-bot for Daniel Thompson
  0 siblings, 0 replies; 12+ messages in thread
From: tip-bot for Daniel Thompson @ 2015-03-27  7:41 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: will.deacon, linux, hpa, linux-kernel, mingo, tglx,
	daniel.thompson, sboyd, catalin.marinas, john.stultz, peterz

Commit-ID:  cf7c9c170787d6870af54684822f58acc00a966c
Gitweb:     http://git.kernel.org/tip/cf7c9c170787d6870af54684822f58acc00a966c
Author:     Daniel Thompson <daniel.thompson@linaro.org>
AuthorDate: Thu, 26 Mar 2015 12:23:23 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 27 Mar 2015 08:33:57 +0100

timers, sched/clock: Optimize cache line usage

Currently sched_clock(), a very hot code path, is not optimized
to minimise its cache profile. In particular:

  1. cd is not ____cacheline_aligned,

  2. struct clock_data does not distinguish between hotpath and
     coldpath data, reducing locality of reference in the hotpath,

  3. Some hotpath data is missing from struct clock_data and is marked
     __read_mostly (which more or less guarantees it will not share a
     cache line with cd).

This patch corrects these problems by extracting all hotpath
data into a separate structure and using ____cacheline_aligned
to ensure the hotpath uses a single (64 byte) cache line.

Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/1427397806-20889-3-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/time/sched_clock.c | 112 +++++++++++++++++++++++++++++++---------------
 1 file changed, 77 insertions(+), 35 deletions(-)

diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c
index 1751e95..872e068 100644
--- a/kernel/time/sched_clock.c
+++ b/kernel/time/sched_clock.c
@@ -18,28 +18,59 @@
 #include <linux/seqlock.h>
 #include <linux/bitops.h>
 
-struct clock_data {
-	ktime_t wrap_kt;
+/**
+ * struct clock_read_data - data required to read from sched_clock
+ *
+ * @epoch_ns:		sched_clock value at last update
+ * @epoch_cyc:		Clock cycle value at last update
+ * @sched_clock_mask:   Bitmask for two's complement subtraction of non 64bit
+ *			clocks
+ * @read_sched_clock:	Current clock source (or dummy source when suspended)
+ * @mult:		Multipler for scaled math conversion
+ * @shift:		Shift value for scaled math conversion
+ * @suspended:		Flag to indicate if the clock is suspended (stopped)
+ *
+ * Care must be taken when updating this structure; it is read by
+ * some very hot code paths. It occupies <=48 bytes and, when combined
+ * with the seqcount used to synchronize access, comfortably fits into
+ * a 64 byte cache line.
+ */
+struct clock_read_data {
 	u64 epoch_ns;
 	u64 epoch_cyc;
-	seqcount_t seq;
-	unsigned long rate;
+	u64 sched_clock_mask;
+	u64 (*read_sched_clock)(void);
 	u32 mult;
 	u32 shift;
 	bool suspended;
 };
 
+/**
+ * struct clock_data - all data needed for sched_clock (including
+ *                     registration of a new clock source)
+ *
+ * @seq:		Sequence counter for protecting updates.
+ * @read_data:		Data required to read from sched_clock.
+ * @wrap_kt:		Duration for which clock can run before wrapping
+ * @rate:		Tick rate of the registered clock
+ * @actual_read_sched_clock: Registered clock read function
+ *
+ * The ordering of this structure has been chosen to optimize cache
+ * performance. In particular seq and read_data (combined) should fit
+ * into a single 64 byte cache line.
+ */
+struct clock_data {
+	seqcount_t seq;
+	struct clock_read_data read_data;
+	ktime_t wrap_kt;
+	unsigned long rate;
+};
+
 static struct hrtimer sched_clock_timer;
 static int irqtime = -1;
 
 core_param(irqtime, irqtime, int, 0400);
 
-static struct clock_data cd = {
-	.mult	= NSEC_PER_SEC / HZ,
-};
-
-static u64 __read_mostly sched_clock_mask;
-
 static u64 notrace jiffy_sched_clock_read(void)
 {
 	/*
@@ -49,7 +80,10 @@ static u64 notrace jiffy_sched_clock_read(void)
 	return (u64)(jiffies - INITIAL_JIFFIES);
 }
 
-static u64 __read_mostly (*read_sched_clock)(void) = jiffy_sched_clock_read;
+static struct clock_data cd ____cacheline_aligned = {
+	.read_data = { .mult = NSEC_PER_SEC / HZ,
+		       .read_sched_clock = jiffy_sched_clock_read, },
+};
 
 static inline u64 notrace cyc_to_ns(u64 cyc, u32 mult, u32 shift)
 {
@@ -60,15 +94,16 @@ unsigned long long notrace sched_clock(void)
 {
 	u64 cyc, res;
 	unsigned long seq;
+	struct clock_read_data *rd = &cd.read_data;
 
 	do {
 		seq = raw_read_seqcount_begin(&cd.seq);
 
-		res = cd.epoch_ns;
-		if (!cd.suspended) {
-			cyc = read_sched_clock();
-			cyc = (cyc - cd.epoch_cyc) & sched_clock_mask;
-			res += cyc_to_ns(cyc, cd.mult, cd.shift);
+		res = rd->epoch_ns;
+		if (!rd->suspended) {
+			cyc = rd->read_sched_clock();
+			cyc = (cyc - rd->epoch_cyc) & rd->sched_clock_mask;
+			res += cyc_to_ns(cyc, rd->mult, rd->shift);
 		}
 	} while (read_seqcount_retry(&cd.seq, seq));
 
@@ -83,16 +118,17 @@ static void notrace update_sched_clock(void)
 	unsigned long flags;
 	u64 cyc;
 	u64 ns;
+	struct clock_read_data *rd = &cd.read_data;
 
-	cyc = read_sched_clock();
-	ns = cd.epoch_ns +
-		cyc_to_ns((cyc - cd.epoch_cyc) & sched_clock_mask,
-			  cd.mult, cd.shift);
+	cyc = rd->read_sched_clock();
+	ns = rd->epoch_ns +
+	     cyc_to_ns((cyc - rd->epoch_cyc) & rd->sched_clock_mask,
+		       rd->mult, rd->shift);
 
 	raw_local_irq_save(flags);
 	raw_write_seqcount_begin(&cd.seq);
-	cd.epoch_ns = ns;
-	cd.epoch_cyc = cyc;
+	rd->epoch_ns = ns;
+	rd->epoch_cyc = cyc;
 	raw_write_seqcount_end(&cd.seq);
 	raw_local_irq_restore(flags);
 }
@@ -111,6 +147,7 @@ void __init sched_clock_register(u64 (*read)(void), int bits,
 	u32 new_mult, new_shift;
 	unsigned long r;
 	char r_unit;
+	struct clock_read_data *rd = &cd.read_data;
 
 	if (cd.rate > rate)
 		return;
@@ -129,17 +166,18 @@ void __init sched_clock_register(u64 (*read)(void), int bits,
 
 	/* update epoch for new counter and update epoch_ns from old counter*/
 	new_epoch = read();
-	cyc = read_sched_clock();
-	ns = cd.epoch_ns + cyc_to_ns((cyc - cd.epoch_cyc) & sched_clock_mask,
-			  cd.mult, cd.shift);
+	cyc = rd->read_sched_clock();
+	ns = rd->epoch_ns +
+	     cyc_to_ns((cyc - rd->epoch_cyc) & rd->sched_clock_mask,
+		       rd->mult, rd->shift);
 
 	raw_write_seqcount_begin(&cd.seq);
-	read_sched_clock = read;
-	sched_clock_mask = new_mask;
-	cd.mult = new_mult;
-	cd.shift = new_shift;
-	cd.epoch_cyc = new_epoch;
-	cd.epoch_ns = ns;
+	rd->read_sched_clock = read;
+	rd->sched_clock_mask = new_mask;
+	rd->mult = new_mult;
+	rd->shift = new_shift;
+	rd->epoch_cyc = new_epoch;
+	rd->epoch_ns = ns;
 	raw_write_seqcount_end(&cd.seq);
 
 	r = rate;
@@ -171,7 +209,7 @@ void __init sched_clock_postinit(void)
 	 * If no sched_clock function has been provided at that point,
 	 * make it the final one one.
 	 */
-	if (read_sched_clock == jiffy_sched_clock_read)
+	if (cd.read_data.read_sched_clock == jiffy_sched_clock_read)
 		sched_clock_register(jiffy_sched_clock_read, BITS_PER_LONG, HZ);
 
 	update_sched_clock();
@@ -187,17 +225,21 @@ void __init sched_clock_postinit(void)
 
 static int sched_clock_suspend(void)
 {
+	struct clock_read_data *rd = &cd.read_data;
+
 	update_sched_clock();
 	hrtimer_cancel(&sched_clock_timer);
-	cd.suspended = true;
+	rd->suspended = true;
 	return 0;
 }
 
 static void sched_clock_resume(void)
 {
-	cd.epoch_cyc = read_sched_clock();
+	struct clock_read_data *rd = &cd.read_data;
+
+	rd->epoch_cyc = rd->read_sched_clock();
 	hrtimer_start(&sched_clock_timer, cd.wrap_kt, HRTIMER_MODE_REL);
-	cd.suspended = false;
+	rd->suspended = false;
 }
 
 static struct syscore_ops sched_clock_ops = {


* [tip:timers/core] timers, sched/clock: Remove suspend from clock_read_data()
  2015-03-26 19:23 ` [PATCH 3/5] sched_clock: Remove suspend from clock_read_data John Stultz
@ 2015-03-27  7:42   ` tip-bot for Daniel Thompson
  0 siblings, 0 replies; 12+ messages in thread
From: tip-bot for Daniel Thompson @ 2015-03-27  7:42 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: catalin.marinas, tglx, will.deacon, daniel.thompson, mingo,
	peterz, linux, linux-kernel, hpa, john.stultz, sboyd

Commit-ID:  13dbeb384d2d3aa555ea48d511e8cb110bd172e0
Gitweb:     http://git.kernel.org/tip/13dbeb384d2d3aa555ea48d511e8cb110bd172e0
Author:     Daniel Thompson <daniel.thompson@linaro.org>
AuthorDate: Thu, 26 Mar 2015 12:23:24 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 27 Mar 2015 08:33:58 +0100

timers, sched/clock: Remove suspend from clock_read_data()

Currently cd.read_data.suspended is read by the hotpath function
sched_clock(). This variable need not be accessed on the
hotpath. In fact, once it is removed, we can remove the
conditional branches from sched_clock() and install a dummy
read_sched_clock function to suspend the clock.

A new master copy of the function pointer
(actual_read_sched_clock) is introduced and is used for all
reads of the clock hardware except those within sched_clock()
itself.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/1427397806-20889-4-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/time/sched_clock.c | 40 +++++++++++++++++++++++++---------------
 1 file changed, 25 insertions(+), 15 deletions(-)

diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c
index 872e068..52ea5d9 100644
--- a/kernel/time/sched_clock.c
+++ b/kernel/time/sched_clock.c
@@ -28,10 +28,9 @@
  * @read_sched_clock:	Current clock source (or dummy source when suspended)
  * @mult:		Multipler for scaled math conversion
  * @shift:		Shift value for scaled math conversion
- * @suspended:		Flag to indicate if the clock is suspended (stopped)
  *
  * Care must be taken when updating this structure; it is read by
- * some very hot code paths. It occupies <=48 bytes and, when combined
+ * some very hot code paths. It occupies <=40 bytes and, when combined
  * with the seqcount used to synchronize access, comfortably fits into
  * a 64 byte cache line.
  */
@@ -42,7 +41,6 @@ struct clock_read_data {
 	u64 (*read_sched_clock)(void);
 	u32 mult;
 	u32 shift;
-	bool suspended;
 };
 
 /**
@@ -64,6 +62,7 @@ struct clock_data {
 	struct clock_read_data read_data;
 	ktime_t wrap_kt;
 	unsigned long rate;
+	u64 (*actual_read_sched_clock)(void);
 };
 
 static struct hrtimer sched_clock_timer;
@@ -83,6 +82,8 @@ static u64 notrace jiffy_sched_clock_read(void)
 static struct clock_data cd ____cacheline_aligned = {
 	.read_data = { .mult = NSEC_PER_SEC / HZ,
 		       .read_sched_clock = jiffy_sched_clock_read, },
+	.actual_read_sched_clock = jiffy_sched_clock_read,
+
 };
 
 static inline u64 notrace cyc_to_ns(u64 cyc, u32 mult, u32 shift)
@@ -99,12 +100,9 @@ unsigned long long notrace sched_clock(void)
 	do {
 		seq = raw_read_seqcount_begin(&cd.seq);
 
-		res = rd->epoch_ns;
-		if (!rd->suspended) {
-			cyc = rd->read_sched_clock();
-			cyc = (cyc - rd->epoch_cyc) & rd->sched_clock_mask;
-			res += cyc_to_ns(cyc, rd->mult, rd->shift);
-		}
+		cyc = (rd->read_sched_clock() - rd->epoch_cyc) &
+		      rd->sched_clock_mask;
+		res = rd->epoch_ns + cyc_to_ns(cyc, rd->mult, rd->shift);
 	} while (read_seqcount_retry(&cd.seq, seq));
 
 	return res;
@@ -120,7 +118,7 @@ static void notrace update_sched_clock(void)
 	u64 ns;
 	struct clock_read_data *rd = &cd.read_data;
 
-	cyc = rd->read_sched_clock();
+	cyc = cd.actual_read_sched_clock();
 	ns = rd->epoch_ns +
 	     cyc_to_ns((cyc - rd->epoch_cyc) & rd->sched_clock_mask,
 		       rd->mult, rd->shift);
@@ -166,10 +164,11 @@ void __init sched_clock_register(u64 (*read)(void), int bits,
 
 	/* update epoch for new counter and update epoch_ns from old counter*/
 	new_epoch = read();
-	cyc = rd->read_sched_clock();
+	cyc = cd.actual_read_sched_clock();
 	ns = rd->epoch_ns +
 	     cyc_to_ns((cyc - rd->epoch_cyc) & rd->sched_clock_mask,
 		       rd->mult, rd->shift);
+	cd.actual_read_sched_clock = read;
 
 	raw_write_seqcount_begin(&cd.seq);
 	rd->read_sched_clock = read;
@@ -209,7 +208,7 @@ void __init sched_clock_postinit(void)
 	 * If no sched_clock function has been provided at that point,
 	 * make it the final one one.
 	 */
-	if (cd.read_data.read_sched_clock == jiffy_sched_clock_read)
+	if (cd.actual_read_sched_clock == jiffy_sched_clock_read)
 		sched_clock_register(jiffy_sched_clock_read, BITS_PER_LONG, HZ);
 
 	update_sched_clock();
@@ -223,13 +222,24 @@ void __init sched_clock_postinit(void)
 	hrtimer_start(&sched_clock_timer, cd.wrap_kt, HRTIMER_MODE_REL);
 }
 
+/*
+ * Clock read function for use when the clock is suspended.
+ *
+ * This function makes it appear to sched_clock() as if the clock
+ * stopped counting at its last update.
+ */
+static u64 notrace suspended_sched_clock_read(void)
+{
+	return cd.read_data.epoch_cyc;
+}
+
 static int sched_clock_suspend(void)
 {
 	struct clock_read_data *rd = &cd.read_data;
 
 	update_sched_clock();
 	hrtimer_cancel(&sched_clock_timer);
-	rd->suspended = true;
+	rd->read_sched_clock = suspended_sched_clock_read;
 	return 0;
 }
 
@@ -237,9 +247,9 @@ static void sched_clock_resume(void)
 {
 	struct clock_read_data *rd = &cd.read_data;
 
-	rd->epoch_cyc = rd->read_sched_clock();
+	rd->epoch_cyc = cd.actual_read_sched_clock();
 	hrtimer_start(&sched_clock_timer, cd.wrap_kt, HRTIMER_MODE_REL);
-	rd->suspended = false;
+	rd->read_sched_clock = cd.actual_read_sched_clock;
 }
 
 static struct syscore_ops sched_clock_ops = {


* [tip:timers/core] timers, sched/clock: Remove redundant notrace from update function
  2015-03-26 19:23 ` [PATCH 4/5] sched_clock: Remove redundant notrace from update function John Stultz
@ 2015-03-27  7:42   ` tip-bot for Daniel Thompson
  0 siblings, 0 replies; 12+ messages in thread
From: tip-bot for Daniel Thompson @ 2015-03-27  7:42 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, catalin.marinas, linux, mingo, peterz, john.stultz, sboyd,
	will.deacon, daniel.thompson, hpa, linux-kernel

Commit-ID:  9fee69a8c8070b38b558161a3f18bd5e2b664682
Gitweb:     http://git.kernel.org/tip/9fee69a8c8070b38b558161a3f18bd5e2b664682
Author:     Daniel Thompson <daniel.thompson@linaro.org>
AuthorDate: Thu, 26 Mar 2015 12:23:25 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 27 Mar 2015 08:33:59 +0100

timers, sched/clock: Remove redundant notrace from update function

Currently update_sched_clock() is marked as notrace but this
function is not called by ftrace. This is trivially fixed by
removing the markup.

Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/1427397806-20889-5-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/time/sched_clock.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c
index 52ea5d9..8adb9d0 100644
--- a/kernel/time/sched_clock.c
+++ b/kernel/time/sched_clock.c
@@ -111,7 +111,7 @@ unsigned long long notrace sched_clock(void)
 /*
  * Atomically update the sched_clock epoch.
  */
-static void notrace update_sched_clock(void)
+static void update_sched_clock(void)
 {
 	unsigned long flags;
 	u64 cyc;


* [tip:timers/core] timers, sched/clock: Avoid deadlock during read from NMI
  2015-03-26 19:23 ` [PATCH 5/5] sched_clock: Avoid deadlock during read from NMI John Stultz
@ 2015-03-27  7:42   ` tip-bot for Daniel Thompson
  0 siblings, 0 replies; 12+ messages in thread
From: tip-bot for Daniel Thompson @ 2015-03-27  7:42 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, daniel.thompson, peterz, linux, sboyd, will.deacon,
	catalin.marinas, tglx, hpa, mingo, john.stultz

Commit-ID:  1809bfa44e1019e397fabaa6f2349bb7237e57a4
Gitweb:     http://git.kernel.org/tip/1809bfa44e1019e397fabaa6f2349bb7237e57a4
Author:     Daniel Thompson <daniel.thompson@linaro.org>
AuthorDate: Thu, 26 Mar 2015 12:23:26 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 27 Mar 2015 08:34:00 +0100

timers, sched/clock: Avoid deadlock during read from NMI

Currently it is possible for an NMI (or FIQ on ARM) to come in
and read sched_clock() whilst update_sched_clock() has locked
the seqcount for writing. This results in the NMI handler
locking up when it calls raw_read_seqcount_begin().

This patch fixes the NMI safety issues by providing banked clock
data. This is a similar approach to the one used in Thomas
Gleixner's 4396e058c52e ("timekeeping: Provide fast and NMI safe
access to CLOCK_MONOTONIC").

Suggested-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/1427397806-20889-6-git-send-email-john.stultz@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/time/sched_clock.c | 103 ++++++++++++++++++++++++++++++----------------
 1 file changed, 68 insertions(+), 35 deletions(-)

diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c
index 8adb9d0..eeea1e9 100644
--- a/kernel/time/sched_clock.c
+++ b/kernel/time/sched_clock.c
@@ -47,19 +47,20 @@ struct clock_read_data {
  * struct clock_data - all data needed for sched_clock (including
  *                     registration of a new clock source)
  *
- * @seq:		Sequence counter for protecting updates.
+ * @seq:		Sequence counter for protecting updates. The lowest
+ *			bit is the index for @read_data.
  * @read_data:		Data required to read from sched_clock.
  * @wrap_kt:		Duration for which clock can run before wrapping
  * @rate:		Tick rate of the registered clock
  * @actual_read_sched_clock: Registered clock read function
  *
  * The ordering of this structure has been chosen to optimize cache
- * performance. In particular seq and read_data (combined) should fit
+ * performance. In particular seq and read_data[0] (combined) should fit
  * into a single 64 byte cache line.
  */
 struct clock_data {
 	seqcount_t seq;
-	struct clock_read_data read_data;
+	struct clock_read_data read_data[2];
 	ktime_t wrap_kt;
 	unsigned long rate;
 	u64 (*actual_read_sched_clock)(void);
@@ -80,10 +81,9 @@ static u64 notrace jiffy_sched_clock_read(void)
 }
 
 static struct clock_data cd ____cacheline_aligned = {
-	.read_data = { .mult = NSEC_PER_SEC / HZ,
-		       .read_sched_clock = jiffy_sched_clock_read, },
+	.read_data[0] = { .mult = NSEC_PER_SEC / HZ,
+			  .read_sched_clock = jiffy_sched_clock_read, },
 	.actual_read_sched_clock = jiffy_sched_clock_read,
-
 };
 
 static inline u64 notrace cyc_to_ns(u64 cyc, u32 mult, u32 shift)
@@ -95,10 +95,11 @@ unsigned long long notrace sched_clock(void)
 {
 	u64 cyc, res;
 	unsigned long seq;
-	struct clock_read_data *rd = &cd.read_data;
+	struct clock_read_data *rd;
 
 	do {
-		seq = raw_read_seqcount_begin(&cd.seq);
+		seq = raw_read_seqcount(&cd.seq);
+		rd = cd.read_data + (seq & 1);
 
 		cyc = (rd->read_sched_clock() - rd->epoch_cyc) &
 		      rd->sched_clock_mask;
@@ -109,26 +110,50 @@ unsigned long long notrace sched_clock(void)
 }
 
 /*
+ * Updating the data required to read the clock.
+ *
+ * sched_clock will never observe mis-matched data even if called from
+ * an NMI. We do this by maintaining an odd/even copy of the data and
+ * steering sched_clock to one or the other using a sequence counter.
+ * In order to preserve the data cache profile of sched_clock as much
+ * as possible the system reverts back to the even copy when the update
+ * completes; the odd copy is used *only* during an update.
+ */
+static void update_clock_read_data(struct clock_read_data *rd)
+{
+	/* update the backup (odd) copy with the new data */
+	cd.read_data[1] = *rd;
+
+	/* steer readers towards the odd copy */
+	raw_write_seqcount_latch(&cd.seq);
+
+	/* now it's safe for us to update the normal (even) copy */
+	cd.read_data[0] = *rd;
+
+	/* switch readers back to the even copy */
+	raw_write_seqcount_latch(&cd.seq);
+}
+
+/*
  * Atomically update the sched_clock epoch.
  */
 static void update_sched_clock(void)
 {
-	unsigned long flags;
 	u64 cyc;
 	u64 ns;
-	struct clock_read_data *rd = &cd.read_data;
+	struct clock_read_data rd;
+
+	rd = cd.read_data[0];
 
 	cyc = cd.actual_read_sched_clock();
-	ns = rd->epoch_ns +
-	     cyc_to_ns((cyc - rd->epoch_cyc) & rd->sched_clock_mask,
-		       rd->mult, rd->shift);
-
-	raw_local_irq_save(flags);
-	raw_write_seqcount_begin(&cd.seq);
-	rd->epoch_ns = ns;
-	rd->epoch_cyc = cyc;
-	raw_write_seqcount_end(&cd.seq);
-	raw_local_irq_restore(flags);
+	ns = rd.epoch_ns +
+	     cyc_to_ns((cyc - rd.epoch_cyc) & rd.sched_clock_mask,
+		       rd.mult, rd.shift);
+
+	rd.epoch_ns = ns;
+	rd.epoch_cyc = cyc;
+
+	update_clock_read_data(&rd);
 }
 
 static enum hrtimer_restart sched_clock_poll(struct hrtimer *hrt)
@@ -145,7 +170,7 @@ void __init sched_clock_register(u64 (*read)(void), int bits,
 	u32 new_mult, new_shift;
 	unsigned long r;
 	char r_unit;
-	struct clock_read_data *rd = &cd.read_data;
+	struct clock_read_data rd;
 
 	if (cd.rate > rate)
 		return;
@@ -162,22 +187,23 @@ void __init sched_clock_register(u64 (*read)(void), int bits,
 	wrap = clocks_calc_max_nsecs(new_mult, new_shift, 0, new_mask, NULL);
 	cd.wrap_kt = ns_to_ktime(wrap);
 
+	rd = cd.read_data[0];
+
 	/* update epoch for new counter and update epoch_ns from old counter*/
 	new_epoch = read();
 	cyc = cd.actual_read_sched_clock();
-	ns = rd->epoch_ns +
-	     cyc_to_ns((cyc - rd->epoch_cyc) & rd->sched_clock_mask,
-		       rd->mult, rd->shift);
+	ns = rd.epoch_ns +
+	     cyc_to_ns((cyc - rd.epoch_cyc) & rd.sched_clock_mask,
+		       rd.mult, rd.shift);
 	cd.actual_read_sched_clock = read;
 
-	raw_write_seqcount_begin(&cd.seq);
-	rd->read_sched_clock = read;
-	rd->sched_clock_mask = new_mask;
-	rd->mult = new_mult;
-	rd->shift = new_shift;
-	rd->epoch_cyc = new_epoch;
-	rd->epoch_ns = ns;
-	raw_write_seqcount_end(&cd.seq);
+	rd.read_sched_clock = read;
+	rd.sched_clock_mask = new_mask;
+	rd.mult = new_mult;
+	rd.shift = new_shift;
+	rd.epoch_cyc = new_epoch;
+	rd.epoch_ns = ns;
+	update_clock_read_data(&rd);
 
 	r = rate;
 	if (r >= 4000000) {
@@ -227,15 +253,22 @@ void __init sched_clock_postinit(void)
  *
  * This function makes it appear to sched_clock() as if the clock
  * stopped counting at its last update.
+ *
+ * This function must only be called from the critical
+ * section in sched_clock(). It relies on the read_seqcount_retry()
+ * at the end of the critical section to be sure we observe the
+ * correct copy of epoch_cyc.
  */
 static u64 notrace suspended_sched_clock_read(void)
 {
-	return cd.read_data.epoch_cyc;
+	unsigned long seq = raw_read_seqcount(&cd.seq);
+
+	return cd.read_data[seq & 1].epoch_cyc;
 }
 
 static int sched_clock_suspend(void)
 {
-	struct clock_read_data *rd = &cd.read_data;
+	struct clock_read_data *rd = &cd.read_data[0];
 
 	update_sched_clock();
 	hrtimer_cancel(&sched_clock_timer);
@@ -245,7 +278,7 @@ static int sched_clock_suspend(void)
 
 static void sched_clock_resume(void)
 {
-	struct clock_read_data *rd = &cd.read_data;
+	struct clock_read_data *rd = &cd.read_data[0];
 
 	rd->epoch_cyc = cd.actual_read_sched_clock();
 	hrtimer_start(&sched_clock_timer, cd.wrap_kt, HRTIMER_MODE_REL);

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 2/5] sched_clock: Optimize cache line usage
  2015-03-17 17:38 [PATCH 0/5] Arch-generic sched_clock NMI safety and optimizations for -tip John Stultz
@ 2015-03-17 17:38 ` John Stultz
  0 siblings, 0 replies; 12+ messages in thread
From: John Stultz @ 2015-03-17 17:38 UTC (permalink / raw)
  To: lkml
  Cc: Daniel Thompson, Russell King, Will Deacon, Catalin Marinas,
	Thomas Gleixner, Stephen Boyd, Ingo Molnar, Peter Zijlstra,
	John Stultz

From: Daniel Thompson <daniel.thompson@linaro.org>

Currently sched_clock(), a very hot code path, is not optimized to
minimize its cache profile. In particular:

  1. cd is not ____cacheline_aligned,

  2. struct clock_data does not distinguish between hotpath and
     coldpath data, reducing locality of reference in the hotpath,

  3. Some hotpath data is missing from struct clock_data and is marked
     __read_mostly (which more or less guarantees it will not share a
     cache line with cd).

This patch corrects these problems by extracting all hotpath data
into a separate structure and using ____cacheline_aligned to ensure
the hotpath uses a single (64 byte) cache line.
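
As a rough illustration of the layout this aims for, here is a userspace
sketch (field and type names are assumptions for the example, and
__attribute__((aligned(64))) stands in for the kernel's
____cacheline_aligned; these are not the kernel structures themselves):

#include <stdint.h>
#include <stdio.h>

/* Hot data: the only fields the clock read path would need to touch. */
struct hot_data {
	uint64_t epoch_ns;
	uint64_t epoch_cyc;
	uint64_t mask;
	uint64_t (*read_cycles)(void);
	uint32_t mult;
	uint32_t shift;
};				/* 40 bytes on LP64 */

/* Whole clock state, aligned so seq + hot_data share one 64-byte line. */
struct clock_state {
	uint32_t seq;		/* sequence counter, same line as hot data */
	struct hot_data hot;
	/* cold, update-time-only fields below */
	uint64_t wrap_ns;
	unsigned long rate;
} __attribute__((aligned(64)));

static uint64_t dummy_cycles(void) { return 0; }

static struct clock_state cd = {
	.hot = { .mult = 1, .read_cycles = dummy_cycles },
};

int main(void)
{
	/* seq + hot_data must stay <= 64 bytes for a one-line fast path. */
	printf("seq + hot: %zu bytes\n", sizeof cd.seq + sizeof cd.hot);
	printf("alignment: %zu\n", _Alignof(struct clock_state));
	return 0;
}

With a split along these lines the read path touches a single cache line,
rather than the two or three it could straddle when the hot fields were
scattered across cd and separate __read_mostly variables.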

Cc: Russell King <linux@arm.linux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Reviewed-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 kernel/time/sched_clock.c | 113 +++++++++++++++++++++++++++++++---------------
 1 file changed, 77 insertions(+), 36 deletions(-)

diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c
index 3d21a87..695b2ac 100644
--- a/kernel/time/sched_clock.c
+++ b/kernel/time/sched_clock.c
@@ -18,28 +18,59 @@
 #include <linux/seqlock.h>
 #include <linux/bitops.h>
 
-struct clock_data {
-	ktime_t wrap_kt;
+/**
+ * struct clock_read_data - data required to read from sched_clock
+ *
+ * @epoch_ns:		sched_clock value at last update
+ * @epoch_cyc:		Clock cycle value at last update
+ * @sched_clock_mask:   Bitmask for two's complement subtraction of non 64bit
+ *			clocks
+ * @read_sched_clock:	Current clock source (or dummy source when suspended)
+ * @mult:		Multiplier for scaled math conversion
+ * @shift:		Shift value for scaled math conversion
+ * @suspended:		Flag to indicate if the clock is suspended (stopped)
+ *
+ * Care must be taken when updating this structure; it is read by
+ * some very hot code paths. It occupies <=48 bytes and, when combined
+ * with the seqcount used to synchronize access, comfortably fits into
+ * a 64 byte cache line.
+ */
+struct clock_read_data {
 	u64 epoch_ns;
 	u64 epoch_cyc;
-	seqcount_t seq;
-	unsigned long rate;
+	u64 sched_clock_mask;
+	u64 (*read_sched_clock)(void);
 	u32 mult;
 	u32 shift;
 	bool suspended;
 };
 
+/**
+ * struct clock_data - all data needed for sched_clock (including
+ *                     registration of a new clock source)
+ *
+ * @seq:		Sequence counter for protecting updates.
+ * @read_data:		Data required to read from sched_clock.
+ * @wrap_kt:		Duration for which clock can run before wrapping
+ * @rate:		Tick rate of the registered clock
+ * @actual_read_sched_clock: Registered clock read function
+ *
+ * The ordering of this structure has been chosen to optimize cache
+ * performance. In particular seq and read_data (combined) should fit
+ * into a single 64 byte cache line.
+ */
+struct clock_data {
+	seqcount_t seq;
+	struct clock_read_data read_data;
+	ktime_t wrap_kt;
+	unsigned long rate;
+};
+
 static struct hrtimer sched_clock_timer;
 static int irqtime = -1;
 
 core_param(irqtime, irqtime, int, 0400);
 
-static struct clock_data cd = {
-	.mult	= NSEC_PER_SEC / HZ,
-};
-
-static u64 __read_mostly sched_clock_mask;
-
 static u64 notrace jiffy_sched_clock_read(void)
 {
 	/*
@@ -49,7 +80,10 @@ static u64 notrace jiffy_sched_clock_read(void)
 	return (u64)(jiffies - INITIAL_JIFFIES);
 }
 
-static u64 __read_mostly (*read_sched_clock)(void) = jiffy_sched_clock_read;
+static struct clock_data cd ____cacheline_aligned = {
+	.read_data = { .mult = NSEC_PER_SEC / HZ,
+		       .read_sched_clock = jiffy_sched_clock_read, },
+};
 
 static inline u64 notrace cyc_to_ns(u64 cyc, u32 mult, u32 shift)
 {
@@ -60,15 +94,16 @@ unsigned long long notrace sched_clock(void)
 {
 	u64 cyc, res;
 	unsigned long seq;
+	struct clock_read_data *rd = &cd.read_data;
 
 	do {
 		seq = raw_read_seqcount_begin(&cd.seq);
 
-		res = cd.epoch_ns;
-		if (!cd.suspended) {
-			cyc = read_sched_clock();
-			cyc = (cyc - cd.epoch_cyc) & sched_clock_mask;
-			res += cyc_to_ns(cyc, cd.mult, cd.shift);
+		res = rd->epoch_ns;
+		if (!rd->suspended) {
+			cyc = rd->read_sched_clock();
+			cyc = (cyc - rd->epoch_cyc) & rd->sched_clock_mask;
+			res += cyc_to_ns(cyc, rd->mult, rd->shift);
 		}
 	} while (read_seqcount_retry(&cd.seq, seq));
 
@@ -83,16 +118,17 @@ static void notrace update_sched_clock(void)
 	unsigned long flags;
 	u64 cyc;
 	u64 ns;
+	struct clock_read_data *rd = &cd.read_data;
 
-	cyc = read_sched_clock();
-	ns = cd.epoch_ns +
-		cyc_to_ns((cyc - cd.epoch_cyc) & sched_clock_mask,
-			  cd.mult, cd.shift);
+	cyc = rd->read_sched_clock();
+	ns = rd->epoch_ns +
+	     cyc_to_ns((cyc - rd->epoch_cyc) & rd->sched_clock_mask,
+		       rd->mult, rd->shift);
 
 	raw_local_irq_save(flags);
 	raw_write_seqcount_begin(&cd.seq);
-	cd.epoch_ns = ns;
-	cd.epoch_cyc = cyc;
+	rd->epoch_ns = ns;
+	rd->epoch_cyc = cyc;
 	raw_write_seqcount_end(&cd.seq);
 	raw_local_irq_restore(flags);
 }
@@ -109,9 +145,9 @@ void __init sched_clock_register(u64 (*read)(void), int bits,
 {
 	u64 res, wrap, new_mask, new_epoch, cyc, ns;
 	u32 new_mult, new_shift;
-	ktime_t new_wrap_kt;
 	unsigned long r;
 	char r_unit;
+	struct clock_read_data *rd = &cd.read_data;
 
 	if (cd.rate > rate)
 		return;
@@ -130,17 +166,18 @@ void __init sched_clock_register(u64 (*read)(void), int bits,
 
 	/* update epoch for new counter and update epoch_ns from old counter*/
 	new_epoch = read();
-	cyc = read_sched_clock();
-	ns = cd.epoch_ns + cyc_to_ns((cyc - cd.epoch_cyc) & sched_clock_mask,
-			  cd.mult, cd.shift);
+	cyc = rd->read_sched_clock();
+	ns = rd->epoch_ns +
+	     cyc_to_ns((cyc - rd->epoch_cyc) & rd->sched_clock_mask,
+		       rd->mult, rd->shift);
 
 	raw_write_seqcount_begin(&cd.seq);
-	read_sched_clock = read;
-	sched_clock_mask = new_mask;
-	cd.mult = new_mult;
-	cd.shift = new_shift;
-	cd.epoch_cyc = new_epoch;
-	cd.epoch_ns = ns;
+	rd->read_sched_clock = read;
+	rd->sched_clock_mask = new_mask;
+	rd->mult = new_mult;
+	rd->shift = new_shift;
+	rd->epoch_cyc = new_epoch;
+	rd->epoch_ns = ns;
 	raw_write_seqcount_end(&cd.seq);
 
 	r = rate;
@@ -172,7 +209,7 @@ void __init sched_clock_postinit(void)
 	 * If no sched_clock function has been provided at that point,
 	 * make it the final one one.
 	 */
-	if (read_sched_clock == jiffy_sched_clock_read)
+	if (cd.read_data.read_sched_clock == jiffy_sched_clock_read)
 		sched_clock_register(jiffy_sched_clock_read, BITS_PER_LONG, HZ);
 
 	update_sched_clock();
@@ -188,17 +225,21 @@ void __init sched_clock_postinit(void)
 
 static int sched_clock_suspend(void)
 {
+	struct clock_read_data *rd = &cd.read_data;
+
 	update_sched_clock();
 	hrtimer_cancel(&sched_clock_timer);
-	cd.suspended = true;
+	rd->suspended = true;
 	return 0;
 }
 
 static void sched_clock_resume(void)
 {
-	cd.epoch_cyc = read_sched_clock();
+	struct clock_read_data *rd = &cd.read_data;
+
+	rd->epoch_cyc = rd->read_sched_clock();
 	hrtimer_start(&sched_clock_timer, cd.wrap_kt, HRTIMER_MODE_REL);
-	cd.suspended = false;
+	rd->suspended = false;
 }
 
 static struct syscore_ops sched_clock_ops = {
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2015-03-27  7:44 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-26 19:23 [PATCH 0/5 (rebase resend)] Arch-generic sched_clock NMI safety and optimizations for -tip John Stultz
2015-03-26 19:23 ` [PATCH 1/5] sched_clock: Match scope of read and write seqcounts John Stultz
2015-03-27  7:41   ` [tip:timers/core] timers, sched/clock: " tip-bot for Daniel Thompson
2015-03-26 19:23 ` [PATCH 2/5] sched_clock: Optimize cache line usage John Stultz
2015-03-27  7:41   ` [tip:timers/core] timers, sched/clock: " tip-bot for Daniel Thompson
2015-03-26 19:23 ` [PATCH 3/5] sched_clock: Remove suspend from clock_read_data John Stultz
2015-03-27  7:42   ` [tip:timers/core] timers, sched/clock: Remove suspend from clock_read_data() tip-bot for Daniel Thompson
2015-03-26 19:23 ` [PATCH 4/5] sched_clock: Remove redundant notrace from update function John Stultz
2015-03-27  7:42   ` [tip:timers/core] timers, sched/clock: " tip-bot for Daniel Thompson
2015-03-26 19:23 ` [PATCH 5/5] sched_clock: Avoid deadlock during read from NMI John Stultz
2015-03-27  7:42   ` [tip:timers/core] timers, sched/clock: " tip-bot for Daniel Thompson
  -- strict thread matches above, loose matches on Subject: below --
2015-03-17 17:38 [PATCH 0/5] Arch-generic sched_clock NMI safety and optimizations for -tip John Stultz
2015-03-17 17:38 ` [PATCH 2/5] sched_clock: Optimize cache line usage John Stultz
