linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/3 v2] Fixes for two recently found timekeeping bugs
@ 2017-06-01  3:07 John Stultz
  2017-06-01  3:07 ` [PATCH 1/3 v2] time: Fix clock->read(clock) race around clocksource changes John Stultz
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: John Stultz @ 2017-06-01  3:07 UTC (permalink / raw)
  To: lkml
  Cc: John Stultz, Thomas Gleixner, Ingo Molnar, Miroslav Lichvar,
	Richard Cochran, Prarit Bhargava, Stephen Boyd, Daniel Mentz

As part of the Linaro Linux Kernel Functional Test (LKFT)
effort, test failures from kselftest/timer's
inconsistency-check were reported connected to
CLOCK_MONOTONIC_RAW, on the HiKey platform.

Digging in, I found that an old issue with how sub-ns
accounting is handled was present in the RAW time: it was
fixed long ago for the CLOCK_MONOTONIC/REALTIME ids, but the
fix was missed for the RAW time.

Additionally, running further tests, I uncovered an issue with
how the clocksource read function is handled when clocksources
are changed, which can cause crashes.

Neither of these issues had been uncovered in x86 based
testing, since x86 does not use the vDSO to accelerate
CLOCK_MONOTONIC_RAW, and none of the x86 clocksources make
use of the clocksource argument passed to the read function.
The HiKey's arch_timer clocksource, by contrast, is fast to
access but increments slowly enough that multiple reads can
return the same counter value (which helps uncover time
handling issues).

This patchset addresses these two issues.

Thanks so much to Will Deacon for his help in getting the
needed adjustments to the arm64 vDSO in place. Also thanks to
Daniel Mentz, who properly diagnosed the MONOTONIC_RAW issue
in parallel while I was working on this patchset.

As always, feedback would be appreciated!

thanks
-john

v2:
* Addressed style/phrasing feedback from Ingo
* Dropped the final cleanup patch, since it can wait for 4.13-rc

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Stephen Boyd <stephen.boyd@linaro.org>
Cc: Daniel Mentz <danielmentz@google.com>

John Stultz (2):
  time: Fix clock->read(clock) race around clocksource changes
  time: Fix CLOCK_MONOTONIC_RAW sub-nanosecond accounting

Will Deacon (1):
  arm64: vdso: Fix nsec handling for CLOCK_MONOTONIC_RAW

 arch/arm64/kernel/vdso.c              |  5 +--
 arch/arm64/kernel/vdso/gettimeofday.S |  1 -
 include/linux/timekeeper_internal.h   |  4 +--
 kernel/time/timekeeping.c             | 59 +++++++++++++++++++++++------------
 4 files changed, 44 insertions(+), 25 deletions(-)

-- 
2.7.4

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH 1/3 v2] time: Fix clock->read(clock) race around clocksource changes
  2017-06-01  3:07 [PATCH 0/3 v2] Fixes for two recently found timekeeping bugs John Stultz
@ 2017-06-01  3:07 ` John Stultz
  2017-06-04 18:52   ` Thomas Gleixner
  2017-06-01  3:07 ` [PATCH 2/3 v2] time: Fix CLOCK_MONOTONIC_RAW sub-nanosecond accounting John Stultz
  2017-06-01  3:07 ` [PATCH 3/3 v2] arm64: vdso: Fix nsec handling for CLOCK_MONOTONIC_RAW John Stultz
  2 siblings, 1 reply; 6+ messages in thread
From: John Stultz @ 2017-06-01  3:07 UTC (permalink / raw)
  To: lkml
  Cc: John Stultz, Thomas Gleixner, Ingo Molnar, Miroslav Lichvar,
	Richard Cochran, Prarit Bhargava, Stephen Boyd, Daniel Mentz,
	stable

In some testing on arm64 platforms, I was seeing null ptr
crashes in the kselftest/timers clocksource-switch test.

This was happening in a read function like:
u64 clocksource_mmio_readl_down(struct clocksource *c)
{
    return ~(u64)readl_relaxed(to_mmio_clksrc(c)->reg) & c->mask;
}

Where the callers enter the seqlock, and then call something
like:
    cycle_now = tkr->read(tkr->clock);

The problem seeming to be that since the ->read() and ->clock
pointer references are happening separately, it's possible the
clocksource change happens in between and we end up calling the
old ->read() function with the new clocksource, (or vice-versa)
which causes the to_mmio_clksrc() in the read function to run
off into space.

This patch tries to address the issue by providing a helper
function that atomically reads the clock value and then calls
the clock->read(clock) function so that we always call the read
function with the appropriate clocksource and don't accidentally
mix them.

The one exception where this helper isn't necessary is for the
fast-timekeepers, which use their own locking and update logic
to the tkr structures.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Stephen Boyd <stephen.boyd@linaro.org>
Cc: Daniel Mentz <danielmentz@google.com>
Cc: stable <stable@vger.kernel.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
v2: Addressed Ingo's feedback on wording
---
 kernel/time/timekeeping.c | 40 +++++++++++++++++++++++++++++-----------
 1 file changed, 29 insertions(+), 11 deletions(-)

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 9652bc5..797c73e 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -118,6 +118,26 @@ static inline void tk_update_sleep_time(struct timekeeper *tk, ktime_t delta)
 	tk->offs_boot = ktime_add(tk->offs_boot, delta);
 }
 
+/*
+ * tk_clock_read - atomic clocksource read() helper
+ *
+ * This helper is necessary to use in the read paths because, while the
+ * seqlock ensures we don't return a bad value while structures are updated,
+ * it doesn't protect from potential crashes. There is the possibility that
+ * the tkr's clocksource may change between the read reference, and the
+ * clock reference passed to the read function.  This can cause crashes if
+ * the wrong clocksource is passed to the wrong read function.
+ * This isn't necessary to use when holding the timekeeper_lock or doing
+ * a read of the fast-timekeeper tkrs (which is protected by its own locking
+ * and update logic).
+ */
+static inline u64 tk_clock_read(struct tk_read_base *tkr)
+{
+	struct clocksource *clock = READ_ONCE(tkr->clock);
+
+	return clock->read(clock);
+}
+
 #ifdef CONFIG_DEBUG_TIMEKEEPING
 #define WARNING_FREQ (HZ*300) /* 5 minute rate-limiting */
 
@@ -175,7 +195,7 @@ static inline u64 timekeeping_get_delta(struct tk_read_base *tkr)
 	 */
 	do {
 		seq = read_seqcount_begin(&tk_core.seq);
-		now = tkr->read(tkr->clock);
+		now = tk_clock_read(tkr);
 		last = tkr->cycle_last;
 		mask = tkr->mask;
 		max = tkr->clock->max_cycles;
@@ -209,7 +229,7 @@ static inline u64 timekeeping_get_delta(struct tk_read_base *tkr)
 	u64 cycle_now, delta;
 
 	/* read clocksource */
-	cycle_now = tkr->read(tkr->clock);
+	cycle_now = tk_clock_read(tkr);
 
 	/* calculate the delta since the last update_wall_time */
 	delta = clocksource_delta(cycle_now, tkr->cycle_last, tkr->mask);
@@ -240,7 +260,7 @@ static void tk_setup_internals(struct timekeeper *tk, struct clocksource *clock)
 	tk->tkr_mono.clock = clock;
 	tk->tkr_mono.read = clock->read;
 	tk->tkr_mono.mask = clock->mask;
-	tk->tkr_mono.cycle_last = tk->tkr_mono.read(clock);
+	tk->tkr_mono.cycle_last = tk_clock_read(&tk->tkr_mono);
 
 	tk->tkr_raw.clock = clock;
 	tk->tkr_raw.read = clock->read;
@@ -477,7 +497,7 @@ static void halt_fast_timekeeper(struct timekeeper *tk)
 	struct tk_read_base *tkr = &tk->tkr_mono;
 
 	memcpy(&tkr_dummy, tkr, sizeof(tkr_dummy));
-	cycles_at_suspend = tkr->read(tkr->clock);
+	cycles_at_suspend = tk_clock_read(tkr);
 	tkr_dummy.read = dummy_clock_read;
 	update_fast_timekeeper(&tkr_dummy, &tk_fast_mono);
 
@@ -649,11 +669,10 @@ static void timekeeping_update(struct timekeeper *tk, unsigned int action)
  */
 static void timekeeping_forward_now(struct timekeeper *tk)
 {
-	struct clocksource *clock = tk->tkr_mono.clock;
 	u64 cycle_now, delta;
 	u64 nsec;
 
-	cycle_now = tk->tkr_mono.read(clock);
+	cycle_now = tk_clock_read(&tk->tkr_mono);
 	delta = clocksource_delta(cycle_now, tk->tkr_mono.cycle_last, tk->tkr_mono.mask);
 	tk->tkr_mono.cycle_last = cycle_now;
 	tk->tkr_raw.cycle_last  = cycle_now;
@@ -929,8 +948,7 @@ void ktime_get_snapshot(struct system_time_snapshot *systime_snapshot)
 
 	do {
 		seq = read_seqcount_begin(&tk_core.seq);
-
-		now = tk->tkr_mono.read(tk->tkr_mono.clock);
+		now = tk_clock_read(&tk->tkr_mono);
 		systime_snapshot->cs_was_changed_seq = tk->cs_was_changed_seq;
 		systime_snapshot->clock_was_set_seq = tk->clock_was_set_seq;
 		base_real = ktime_add(tk->tkr_mono.base,
@@ -1108,7 +1126,7 @@ int get_device_system_crosststamp(int (*get_time_fn)
 		 * Check whether the system counter value provided by the
 		 * device driver is on the current timekeeping interval.
 		 */
-		now = tk->tkr_mono.read(tk->tkr_mono.clock);
+		now = tk_clock_read(&tk->tkr_mono);
 		interval_start = tk->tkr_mono.cycle_last;
 		if (!cycle_between(interval_start, cycles, now)) {
 			clock_was_set_seq = tk->clock_was_set_seq;
@@ -1629,7 +1647,7 @@ void timekeeping_resume(void)
 	 * The less preferred source will only be tried if there is no better
 	 * usable source. The rtc part is handled separately in rtc core code.
 	 */
-	cycle_now = tk->tkr_mono.read(clock);
+	cycle_now = tk_clock_read(&tk->tkr_mono);
 	if ((clock->flags & CLOCK_SOURCE_SUSPEND_NONSTOP) &&
 		cycle_now > tk->tkr_mono.cycle_last) {
 		u64 nsec, cyc_delta;
@@ -2030,7 +2048,7 @@ void update_wall_time(void)
 #ifdef CONFIG_ARCH_USES_GETTIMEOFFSET
 	offset = real_tk->cycle_interval;
 #else
-	offset = clocksource_delta(tk->tkr_mono.read(tk->tkr_mono.clock),
+	offset = clocksource_delta(tk_clock_read(&tk->tkr_mono),
 				   tk->tkr_mono.cycle_last, tk->tkr_mono.mask);
 #endif
 
-- 
2.7.4


* [PATCH 2/3 v2] time: Fix CLOCK_MONOTONIC_RAW sub-nanosecond accounting
  2017-06-01  3:07 [PATCH 0/3 v2] Fixes for two recently found timekeeping bugs John Stultz
  2017-06-01  3:07 ` [PATCH 1/3 v2] time: Fix clock->read(clock) race around clocksource changes John Stultz
@ 2017-06-01  3:07 ` John Stultz
  2017-06-01  3:07 ` [PATCH 3/3 v2] arm64: vdso: Fix nsec handling for CLOCK_MONOTONIC_RAW John Stultz
  2 siblings, 0 replies; 6+ messages in thread
From: John Stultz @ 2017-06-01  3:07 UTC (permalink / raw)
  To: lkml
  Cc: John Stultz, Thomas Gleixner, Ingo Molnar, Miroslav Lichvar,
	Richard Cochran, Prarit Bhargava, Stephen Boyd, Kevin Brodsky,
	Will Deacon, Daniel Mentz, stable # 4.8+

Due to how the MONOTONIC_RAW accumulation logic was handled,
there is the potential for a 1ns discontinuity when we do
accumulations. This small discontinuity has for the most part
gone unnoticed, but since ARM64 enabled CLOCK_MONOTONIC_RAW
in their vDSO clock_gettime implementation, we've seen failures
with the inconsistency-check test in kselftest.

This patch addresses the issue by using the same sub-ns
accumulation handling that CLOCK_MONOTONIC uses, which avoids
the issue for in-kernel users.

Since the ARM64 vDSO implementation has its own clock_gettime
calculation logic, this patch reduces the frequency of errors,
but failures are still seen. The ARM64 vDSO will need to be
updated to include the sub-nanosecond xtime_nsec values in its
calculation for this issue to be completely fixed.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Stephen Boyd <stephen.boyd@linaro.org>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Daniel Mentz <danielmentz@google.com>
Cc: stable <stable@vger.kernel.org> #4.8+
Tested-by: Daniel Mentz <danielmentz@google.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
v2: Address Ingo's style feedback
---
 include/linux/timekeeper_internal.h |  4 ++--
 kernel/time/timekeeping.c           | 19 ++++++++++---------
 2 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/include/linux/timekeeper_internal.h b/include/linux/timekeeper_internal.h
index 110f453..528cc86 100644
--- a/include/linux/timekeeper_internal.h
+++ b/include/linux/timekeeper_internal.h
@@ -58,7 +58,7 @@ struct tk_read_base {
  *			interval.
  * @xtime_remainder:	Shifted nano seconds left over when rounding
  *			@cycle_interval
- * @raw_interval:	Raw nano seconds accumulated per NTP interval.
+ * @raw_interval:	Shifted raw nano seconds accumulated per NTP interval.
  * @ntp_error:		Difference between accumulated time and NTP time in ntp
  *			shifted nano seconds.
  * @ntp_error_shift:	Shift conversion between clock shifted nano seconds and
@@ -100,7 +100,7 @@ struct timekeeper {
 	u64			cycle_interval;
 	u64			xtime_interval;
 	s64			xtime_remainder;
-	u32			raw_interval;
+	u64			raw_interval;
 	/* The ntp_tick_length() value currently being used.
 	 * This cached copy ensures we consistently apply the tick
 	 * length for an entire tick, as ntp_tick_length may change
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 797c73e..8eaa95c 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -282,7 +282,7 @@ static void tk_setup_internals(struct timekeeper *tk, struct clocksource *clock)
 	/* Go back from cycles -> shifted ns */
 	tk->xtime_interval = interval * clock->mult;
 	tk->xtime_remainder = ntpinterval - tk->xtime_interval;
-	tk->raw_interval = (interval * clock->mult) >> clock->shift;
+	tk->raw_interval = interval * clock->mult;
 
 	 /* if changing clocks, convert xtime_nsec shift units */
 	if (old_clock) {
@@ -1994,7 +1994,7 @@ static u64 logarithmic_accumulation(struct timekeeper *tk, u64 offset,
 				    u32 shift, unsigned int *clock_set)
 {
 	u64 interval = tk->cycle_interval << shift;
-	u64 raw_nsecs;
+	u64 snsec_per_sec;
 
 	/* If the offset is smaller than a shifted interval, do nothing */
 	if (offset < interval)
@@ -2009,14 +2009,15 @@ static u64 logarithmic_accumulation(struct timekeeper *tk, u64 offset,
 	*clock_set |= accumulate_nsecs_to_secs(tk);
 
 	/* Accumulate raw time */
-	raw_nsecs = (u64)tk->raw_interval << shift;
-	raw_nsecs += tk->raw_time.tv_nsec;
-	if (raw_nsecs >= NSEC_PER_SEC) {
-		u64 raw_secs = raw_nsecs;
-		raw_nsecs = do_div(raw_secs, NSEC_PER_SEC);
-		tk->raw_time.tv_sec += raw_secs;
+	tk->tkr_raw.xtime_nsec += (u64)tk->raw_time.tv_nsec << tk->tkr_raw.shift;
+	tk->tkr_raw.xtime_nsec += tk->raw_interval << shift;
+	snsec_per_sec = (u64)NSEC_PER_SEC << tk->tkr_raw.shift;
+	while (tk->tkr_raw.xtime_nsec >= snsec_per_sec) {
+		tk->tkr_raw.xtime_nsec -= snsec_per_sec;
+		tk->raw_time.tv_sec++;
 	}
-	tk->raw_time.tv_nsec = raw_nsecs;
+	tk->raw_time.tv_nsec = tk->tkr_raw.xtime_nsec >> tk->tkr_raw.shift;
+	tk->tkr_raw.xtime_nsec -= (u64)tk->raw_time.tv_nsec << tk->tkr_raw.shift;
 
 	/* Accumulate error between NTP and clock interval */
 	tk->ntp_error += tk->ntp_tick << shift;
-- 
2.7.4


* [PATCH 3/3 v2] arm64: vdso: Fix nsec handling for CLOCK_MONOTONIC_RAW
  2017-06-01  3:07 [PATCH 0/3 v2] Fixes for two recently found timekeeping bugs John Stultz
  2017-06-01  3:07 ` [PATCH 1/3 v2] time: Fix clock->read(clock) race around clocksource changes John Stultz
  2017-06-01  3:07 ` [PATCH 2/3 v2] time: Fix CLOCK_MONOTONIC_RAW sub-nanosecond accounting John Stultz
@ 2017-06-01  3:07 ` John Stultz
  2 siblings, 0 replies; 6+ messages in thread
From: John Stultz @ 2017-06-01  3:07 UTC (permalink / raw)
  To: lkml
  Cc: Will Deacon, Thomas Gleixner, Ingo Molnar, Miroslav Lichvar,
	Richard Cochran, Prarit Bhargava, Stephen Boyd, Kevin Brodsky,
	Daniel Mentz, stable # 4.8+,
	John Stultz

From: Will Deacon <will.deacon@arm.com>

Recently vDSO support for CLOCK_MONOTONIC_RAW was added in
49eea433b326 ("arm64: Add support for CLOCK_MONOTONIC_RAW in
clock_gettime() vDSO"). Noticing that the core timekeeping code
never set tkr_raw.xtime_nsec, the vDSO implementation didn't
bother exposing it via the data page and instead took the
unshifted tk->raw_time.tv_nsec value which was then immediately
shifted left in the vDSO code.

Unfortunately, by accelerating the MONOTONIC_RAW clockid, it
uncovered potential 1ns time inconsistencies caused by the
timekeeping core not handling sub-ns resolution.

Now that the core code has been fixed and is actually setting
tkr_raw.xtime_nsec, we need to take that into account in the
vDSO by adding it to the shifted raw_time value, in order to
fix the user-visible inconsistency. Rather than do that at each
use (and expand the data page in the process), instead perform
the shift/addition operation when populating the data page and
remove the shift from the vDSO code entirely.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Miroslav Lichvar <mlichvar@redhat.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Stephen Boyd <stephen.boyd@linaro.org>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Daniel Mentz <danielmentz@google.com>
Cc: stable <stable@vger.kernel.org> #4.8+
Reported-by: John Stultz <john.stultz@linaro.org>
Acked-by: Kevin Brodsky <kevin.brodsky@arm.com>
Tested-by: Daniel Mentz <danielmentz@google.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
[jstultz: minor whitespace tweak, tried to improve commit
 message to make it more clear this fixes a regression]
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
v2: Tweak commit message to address Ingo's feedback
---
 arch/arm64/kernel/vdso.c              | 5 +++--
 arch/arm64/kernel/vdso/gettimeofday.S | 1 -
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
index 41b6e31..d0cb007 100644
--- a/arch/arm64/kernel/vdso.c
+++ b/arch/arm64/kernel/vdso.c
@@ -221,10 +221,11 @@ void update_vsyscall(struct timekeeper *tk)
 		/* tkr_mono.cycle_last == tkr_raw.cycle_last */
 		vdso_data->cs_cycle_last	= tk->tkr_mono.cycle_last;
 		vdso_data->raw_time_sec		= tk->raw_time.tv_sec;
-		vdso_data->raw_time_nsec	= tk->raw_time.tv_nsec;
+		vdso_data->raw_time_nsec	= (tk->raw_time.tv_nsec <<
+						   tk->tkr_raw.shift) +
+						  tk->tkr_raw.xtime_nsec;
 		vdso_data->xtime_clock_sec	= tk->xtime_sec;
 		vdso_data->xtime_clock_nsec	= tk->tkr_mono.xtime_nsec;
-		/* tkr_raw.xtime_nsec == 0 */
 		vdso_data->cs_mono_mult		= tk->tkr_mono.mult;
 		vdso_data->cs_raw_mult		= tk->tkr_raw.mult;
 		/* tkr_mono.shift == tkr_raw.shift */
diff --git a/arch/arm64/kernel/vdso/gettimeofday.S b/arch/arm64/kernel/vdso/gettimeofday.S
index e00b467..76320e9 100644
--- a/arch/arm64/kernel/vdso/gettimeofday.S
+++ b/arch/arm64/kernel/vdso/gettimeofday.S
@@ -256,7 +256,6 @@ monotonic_raw:
 	seqcnt_check fail=monotonic_raw
 
 	/* All computations are done with left-shifted nsecs. */
-	lsl	x14, x14, x12
 	get_nsec_per_sec res=x9
 	lsl	x9, x9, x12
 
-- 
2.7.4


* Re: [PATCH 1/3 v2] time: Fix clock->read(clock) race around clocksource changes
  2017-06-01  3:07 ` [PATCH 1/3 v2] time: Fix clock->read(clock) race around clocksource changes John Stultz
@ 2017-06-04 18:52   ` Thomas Gleixner
  2017-06-08 19:02     ` John Stultz
  0 siblings, 1 reply; 6+ messages in thread
From: Thomas Gleixner @ 2017-06-04 18:52 UTC (permalink / raw)
  To: John Stultz
  Cc: lkml, Ingo Molnar, Miroslav Lichvar, Richard Cochran,
	Prarit Bhargava, Stephen Boyd, Daniel Mentz, stable

On Wed, 31 May 2017, John Stultz wrote:

> In some testing on arm64 platforms, I was seeing null ptr
> crashes in the kselftest/timers clocksource-switch test.
> 
> This was happening in a read function like:
> u64 clocksource_mmio_readl_down(struct clocksource *c)
> {
>     return ~(u64)readl_relaxed(to_mmio_clksrc(c)->reg) & c->mask;
> }
> 
> Where the callers enter the seqlock, and then call something
> like:
>     cycle_now = tkr->read(tkr->clock);
> 
> The problem seeming to be that since the ->read() and ->clock
> pointer references are happening separately, it's possible the
> clocksource change happens in between and we end up calling the
> old ->read() function with the new clocksource, (or vice-versa)
> which causes the to_mmio_clksrc() in the read function to run
> off into space.
> 
> This patch tries to address the issue by providing a helper
> function that atomically reads the clock value and then calls
> the clock->read(clock) function so that we always call the read
> function with the appropriate clocksource and don't accidentally
> mix them.

This changelog is still horrible to read. This really wants proper
explanations and not 'seeming to be', 'tries to address' ....

Something like this:

  "In tests, which exercise switching of clocksources, a NULL pointer
   dereference can be observed on ARM64 platforms in the clocksource read()
   function:

   u64 clocksource_mmio_readl_down(struct clocksource *c)
   {
	return ~(u64)readl_relaxed(to_mmio_clksrc(c)->reg) & c->mask;
   }

   This is called from the core timekeeping code via:

    	cycle_now = tkr->read(tkr->clock);

   tkr->read is the cached tkr->clock->read() function pointer. When the
   clocksource is changed then tkr->clock and tkr->read are updated
   sequentially. The code above results in a sequential load operation of
   tkr->read and tkr->clock as well.

   If the store to tkr->clock hits between the loads of tkr->read and
   tkr->clock, then the old read() function is called with the new clock
   pointer. As a consequence the read() function dereferences a different data
   structure and the resulting 'reg' pointer can point anywhere including
   NULL.

   This problem was introduced when the timekeeping code was switched over to
   use struct tk_read_base. Before that, it was theoretically possible as well
   when the compiler decided to reload clock in the code sequence:

     now = tk->clock->read(tk->clock);

   Add a helper function which avoids the issue by reading tk_read_base->clock
   once into a local variable clk and then issue the read function via
   clk->read(clk). This guarantees that the read() function always gets the
   proper clocksource pointer handed in."

The whole problem was introduced by me, when I (over)optimized the cache
line footprint of the timekeeping stuff and wanted to avoid touching the
clocksource cache line when the clocksource does not need it, like TSC on
x86. The above race did not come to my mind at all when I wrote that
code. Bummer..

> The one exception where this helper isn't necessary is for the
>> fast-timekeepers which use their own locking and update logic
> to the tkr structures.

That's simply wrong. The fast time keepers have exactly the same issue.

   seq = tkf->seq;
   tkr = tkf->base + (seq & 0x01);
   now = tkr->read(tkr->clock);

So this is exactly the same because this decomposes to

   rd = tkr->read;
   cl = tkr->clock;
   now = rd(cl);

So if you put the update in context:

CPU0  	      	  	CPU1
   rd = tkr->read;
			update_fast_timekeeper()
			write_seqcount_latch(tkr->seq);
			memcpy(tkr->base[0], newtkr);
			write_seqcount_latch(tkr->seq);
			memcpy(tkr->base[1], newtkr);
   cl = tkr->clock;
   now = rd(cl);

Then you end up with the very same problem as with the general timekeeping
itself.

The two bases and the seqcount_latch() magic are there to allow using the
fast timekeeper in NMI context, which can interrupt the update
sequence. That guarantees that the reader which interrupted the update will
always use a consistent tkr->base. But in no way does it protect against
the read -> clock inconsistency caused by a concurrent or interrupting
update.

> +/*
> + * tk_clock_read - atomic clocksource read() helper
> + *
> + * This helper is necessary to use in the read paths because, while the
> + * seqlock ensures we don't return a bad value while structures are updated,
> + * it doesn't protect from potential crashes. There is the possibility that
> + * the tkr's clocksource may change between the read reference, and the
> + * clock reference passed to the read function.  This can cause crashes if
> + * the wrong clocksource is passed to the wrong read function.

Come on. The problem is not that it can cause crashes.

The problem is that it hands in the wrong pointer. Even if it does not
crash, it still can read from a location which has other way harder to
debug side effects.

Comments and changelogs should be written in a factual manner not like
fairy tales.

Thanks,

	tglx


* Re: [PATCH 1/3 v2] time: Fix clock->read(clock) race around clocksource changes
  2017-06-04 18:52   ` Thomas Gleixner
@ 2017-06-08 19:02     ` John Stultz
  0 siblings, 0 replies; 6+ messages in thread
From: John Stultz @ 2017-06-08 19:02 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: lkml, Ingo Molnar, Miroslav Lichvar, Richard Cochran,
	Prarit Bhargava, Stephen Boyd, Daniel Mentz, stable

On Sun, Jun 4, 2017 at 11:52 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Wed, 31 May 2017, John Stultz wrote:
>>
>> The one exception where this helper isn't necessary is for the
>> fast-timekepers which use their own locking and update logic
>> to the tkr structures.
>
> That's simply wrong. The fast time keepers have exactly the same issue.
>
...
>
> So if you put the update in context:
>
> CPU0                    CPU1
>    rd = tkr->read;
>                         update_fast_timekeeper()
>                         write_seqcount_latch(tkr->seq);
>                         memcpy(tkr->base[0], newtkr);
>                         write_seqcount_latch(tkr->seq);
>                         memcpy(tkr->base[1], newtkr);
>    cl = tkr->clock;
>    now = rd(cl);
>
> Then you end up with the very same problem as with the general timekeeping
> itself.
>
> The two bases and the seqcount_latch() magic are there to allow using the
> fast timekeeper in NMI context, which can interrupt the update
> sequence. That guarantees that the reader which interrupted the update will
> always use a consistent tkr->base. But in no way does it protect against
> the read -> clock inconsistency caused by a concurrent or interrupting
> update.

Ah. I mistakenly thought the fast-timekeepers alternated on updates,
rather than both being updated at once.

Thanks for the clarification. I guess we'll need a fix there too.

>> + * tk_clock_read - atomic clocksource read() helper
>> + *
>> + * This helper is necessary to use in the read paths because, while the
>> + * seqlock ensures we don't return a bad value while structures are updated,
>> + * it doesn't protect from potential crashes. There is the possibility that
>> + * the tkr's clocksource may change between the read reference, and the
>> + * clock reference passed to the read function.  This can cause crashes if
>> + * the wrong clocksource is passed to the wrong read function.
>
> Come on. The problem is not that it can cause crashes.
>
> The problem is that it hands in the wrong pointer. Even if it does not
> crash, it still can read from a location which has other way harder to
> debug side effects.
>
> Comments and changelogs should be written in a factual manner not like
> fairy tales.

Apologies, I've long been criticized for using the passive voice, and
I tend to hedge statements when I'm not totally confident I'm correct
(which with my batting avg, is most always). I'll try to improve this.

While I'm reworking this patch, if you have no objections to the other
two, are you open to queuing them up?

thanks
-john


end of thread, other threads:[~2017-06-08 19:02 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-01  3:07 [PATCH 0/3 v2] Fixes for two recently found timekeeping bugs John Stultz
2017-06-01  3:07 ` [PATCH 1/3 v2] time: Fix clock->read(clock) race around clocksource changes John Stultz
2017-06-04 18:52   ` Thomas Gleixner
2017-06-08 19:02     ` John Stultz
2017-06-01  3:07 ` [PATCH 2/3 v2] time: Fix CLOCK_MONOTONIC_RAW sub-nanosecond accounting John Stultz
2017-06-01  3:07 ` [PATCH 3/3 v2] arm64: vdso: Fix nsec handling for CLOCK_MONOTONIC_RAW John Stultz
