linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/9] sched_clock fixes
@ 2017-04-21 14:57 Peter Zijlstra
  2017-04-21 14:57 ` [PATCH 1/9] x86/tsc: Provide tsc=unstable boot parameter Peter Zijlstra
                   ` (8 more replies)
  0 siblings, 9 replies; 13+ messages in thread
From: Peter Zijlstra @ 2017-04-21 14:57 UTC (permalink / raw)
  To: tglx, mingo
  Cc: linux-kernel, ville.syrjala, daniel.lezcano, rafael.j.wysocki,
	marta.lofstedt, martin.peres, pasha.tatashin, peterz,
	daniel.vetter

Hi,

These patches were inspired by (and hopefully fix) two independent bug reports
on Core2 machines.

I never could quite reproduce one, but my Core2 machine no longer switches to
stable sched_clock and therefore no longer tickles the problematic stable ->
unstable transition either.

Before I dug up my Core2 machine, I tried emulating TSC wreckage by poking
random values into the TSC MSR from userspace. Behaviour in that case is
improved as well.
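
(For reference, poking the TSC MSR from userspace can be done with something
like msr-tools; MSR 0x10 is IA32_TIME_STAMP_COUNTER. Illustrative only, and
obviously not something to do on a machine you care about:

	# modprobe msr
	# wrmsr -p 0 0x10 0xdeadbeef	# scribble over CPU0's TSC
)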

People have to realize that if we manage to boot with the TSC 'stable' (both
sched_clock and clocksource) and we later find out we were mistaken (we observe
a TSC wobble), the clocks that are derived from it _will_ have had an observable
hiccup. This is fundamentally unfixable.

If you own a machine where the BIOS tries to hide SMI latencies by rewinding
TSC (yes, this is a thing), the very best we can do is mark TSC unstable with a
boot parameter.

For example, this is me writing a stupid value into the TSC:

[   46.745082] random: crng init done
[18443029775.010069] clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:
[18443029775.023141] clocksource:                       'hpet' wd_now: 3ebec538 wd_last: 3e486ec9 mask: ffffffff
[18443029775.034214] clocksource:                       'tsc' cs_now: 5025acce9 cs_last: 24dc3bd21c88ee mask: ffffffffffffffff
[18443029775.046651] tsc: Marking TSC unstable due to clocksource watchdog
[18443029775.054211] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
[18443029775.064434] sched_clock: Marking unstable (70569005835, -17833788)<-(-3714295689546517, -2965802361)
[   70.573700] clocksource: Switched to clocksource hpet

With some trace_printk()s (not included) I could tell that the wobble
occurred at 69.965474. The clock now resumes where it 'should' have been.

But an unfortunate scheduling event could have resulted in one task
having seen a runtime of ~584 years with 'obvious' effects. Similar
jumps can also be observed from userspace GTOD usage.
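
For the curious, the ~584 years is just the bogus timestamps above read back
as an age: ~18443029775 seconds is roughly 2^64 ns, and

	18443029775 s / (365.25 * 24 * 3600 s/year) ~= 584 years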

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/9] x86/tsc: Provide tsc=unstable boot parameter
  2017-04-21 14:57 [PATCH 0/9] sched_clock fixes Peter Zijlstra
@ 2017-04-21 14:57 ` Peter Zijlstra
  2017-04-21 14:57 ` [PATCH 2/9] x86,tsc: Feed refined TSC calibration into sched_clock Peter Zijlstra
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2017-04-21 14:57 UTC (permalink / raw)
  To: tglx, mingo
  Cc: linux-kernel, ville.syrjala, daniel.lezcano, rafael.j.wysocki,
	marta.lofstedt, martin.peres, pasha.tatashin, peterz,
	daniel.vetter

[-- Attachment #1: peterz-sched-clock-rubarb-1.patch --]
[-- Type: text/plain, Size: 779 bytes --]

Since the clocksource watchdog will only detect broken TSC after the
fact, all TSC based clocks will likely have observed non-continuous
values before/when switching away from TSC.

Therefore the only way to fully avoid random clock movement when your
BIOS randomly mucks with TSC values from SMI handlers is to report the
TSC as unstable at boot.
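
For reference, the parameter simply goes on the kernel command line;
something like the below (illustrative only, the config file and update
command differ per distro):

	# /etc/default/grub
	GRUB_CMDLINE_LINUX_DEFAULT="... tsc=unstable"

	$ sudo update-grub	# or: grub2-mkconfig -o /boot/grub2/grub.cfg
	$ sudo reboot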

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/kernel/tsc.c |    2 ++
 1 file changed, 2 insertions(+)

--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -374,6 +374,8 @@ static int __init tsc_setup(char *str)
 		tsc_clocksource_reliable = 1;
 	if (!strncmp(str, "noirqtime", 9))
 		no_sched_irq_time = 1;
+	if (!strcmp(str, "unstable"))
+		mark_tsc_unstable("boot parameter");
 	return 1;
 }
 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 2/9] x86,tsc: Feed refined TSC calibration into sched_clock
  2017-04-21 14:57 [PATCH 0/9] sched_clock fixes Peter Zijlstra
  2017-04-21 14:57 ` [PATCH 1/9] x86/tsc: Provide tsc=unstable boot parameter Peter Zijlstra
@ 2017-04-21 14:57 ` Peter Zijlstra
  2017-04-21 14:57 ` [PATCH 3/9] sched/clock: Initialize all per-cpu state before switching (back) to unstable Peter Zijlstra
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2017-04-21 14:57 UTC (permalink / raw)
  To: tglx, mingo
  Cc: linux-kernel, ville.syrjala, daniel.lezcano, rafael.j.wysocki,
	marta.lofstedt, martin.peres, pasha.tatashin, peterz,
	daniel.vetter

[-- Attachment #1: peterz-sched-clock-rubarb-2.patch --]
[-- Type: text/plain, Size: 915 bytes --]

For the (older) CPUs that still need the refined TSC calibration, also
update the sched_clock rate.
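
As a rough sketch (simplified; the real cyc2ns code also keeps per-cpu epoch
offsets), sched_clock() on x86 turns TSC cycles into nanoseconds with a
(mult, shift) pair derived from tsc_khz:

	/*
	 * ns ~= (cycles * mult) >> shift,  with  mult / 2^shift ~= 10^6 / tsc_khz
	 *
	 * so a refined tsc_khz has to be fed back into these pairs, otherwise
	 * sched_clock() keeps ticking at the rough early-boot rate.
	 */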

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/kernel/tsc.c |    5 +++++
 1 file changed, 5 insertions(+)

--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1267,6 +1267,7 @@ static void tsc_refine_calibration_work(
 	static int hpet;
 	u64 tsc_stop, ref_stop, delta;
 	unsigned long freq;
+	int cpu;
 
 	/* Don't bother refining TSC on unstable systems */
 	if (check_tsc_unstable())
@@ -1317,6 +1318,10 @@ static void tsc_refine_calibration_work(
 	/* Inform the TSC deadline clockevent devices about the recalibration */
 	lapic_update_tsc_freq();
 
+	/* Update the sched_clock() rate to match the clocksource one */
+	for_each_possible_cpu(cpu)
+		set_cyc2ns_scale(tsc_khz, cpu);
+
 out:
 	if (boot_cpu_has(X86_FEATURE_ART))
 		art_related_clocksource = &clocksource_tsc;

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 3/9] sched/clock: Initialize all per-cpu state before switching (back) to unstable
  2017-04-21 14:57 [PATCH 0/9] sched_clock fixes Peter Zijlstra
  2017-04-21 14:57 ` [PATCH 1/9] x86/tsc: Provide tsc=unstable boot parameter Peter Zijlstra
  2017-04-21 14:57 ` [PATCH 2/9] x86,tsc: Feed refined TSC calibration into sched_clock Peter Zijlstra
@ 2017-04-21 14:57 ` Peter Zijlstra
  2017-04-21 14:58 ` [PATCH 4/9] x86/tsc,sched/clock,clocksource: Use clocksource watchdog to provide stable sync points Peter Zijlstra
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2017-04-21 14:57 UTC (permalink / raw)
  To: tglx, mingo
  Cc: linux-kernel, ville.syrjala, daniel.lezcano, rafael.j.wysocki,
	marta.lofstedt, martin.peres, pasha.tatashin, peterz,
	daniel.vetter

[-- Attachment #1: peterz-sched-clock-rubarb-4.patch --]
[-- Type: text/plain, Size: 3201 bytes --]

In preparation for no longer keeping sched_clock_tick() active for a
stable TSC, we need to explicitly initialize all per-cpu state before
switching back to unstable.

Note: this patch loses the __gtod_offset calculation; it will be
restored in the next one.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/clock.c |   60 +++++++++++++++++++++++++++++++++------------------
 1 file changed, 39 insertions(+), 21 deletions(-)

--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -124,6 +124,12 @@ int sched_clock_stable(void)
 	return static_branch_likely(&__sched_clock_stable);
 }
 
+static void __scd_stamp(struct sched_clock_data *scd)
+{
+	scd->tick_gtod = ktime_get_ns();
+	scd->tick_raw = sched_clock();
+}
+
 static void __set_sched_clock_stable(void)
 {
 	struct sched_clock_data *scd = this_scd();
@@ -141,8 +147,37 @@ static void __set_sched_clock_stable(voi
 	tick_dep_clear(TICK_DEP_BIT_CLOCK_UNSTABLE);
 }
 
+/*
+ * If we ever get here, we're screwed, because we found out -- typically after
+ * the fact -- that TSC wasn't good. This means all our clocksources (including
+ * ktime) could have reported wrong values.
+ *
+ * What we do here is an attempt to fix up and continue sort of where we left
+ * off in a coherent manner.
+ *
+ * The only way to fully avoid random clock jumps is to boot with:
+ * "tsc=unstable".
+ */
 static void __sched_clock_work(struct work_struct *work)
 {
+	struct sched_clock_data *scd;
+	int cpu;
+
+	/* take a current timestamp and set 'now' */
+	preempt_disable();
+	scd = this_scd();
+	__scd_stamp(scd);
+	scd->clock = scd->tick_gtod + __gtod_offset;
+	preempt_enable();
+
+	/* clone to all CPUs */
+	for_each_possible_cpu(cpu)
+		per_cpu(sched_clock_data, cpu) = *scd;
+
+	printk(KERN_INFO "sched_clock: Marking unstable (%lld, %lld)<-(%lld, %lld)\n",
+			scd->tick_gtod, __gtod_offset,
+			scd->tick_raw,  __sched_clock_offset);
+
 	static_branch_disable(&__sched_clock_stable);
 }
 
@@ -150,27 +185,11 @@ static DECLARE_WORK(sched_clock_work, __
 
 static void __clear_sched_clock_stable(void)
 {
-	struct sched_clock_data *scd = this_scd();
-
-	/*
-	 * Attempt to make the stable->unstable transition continuous.
-	 *
-	 * Trouble is, this is typically called from the TSC watchdog
-	 * timer, which is late per definition. This means the tick
-	 * values can already be screwy.
-	 *
-	 * Still do what we can.
-	 */
-	__gtod_offset = (scd->tick_raw + __sched_clock_offset) - (scd->tick_gtod);
-
-	printk(KERN_INFO "sched_clock: Marking unstable (%lld, %lld)<-(%lld, %lld)\n",
-			scd->tick_gtod, __gtod_offset,
-			scd->tick_raw,  __sched_clock_offset);
+	if (!sched_clock_stable())
+		return;
 
 	tick_dep_set(TICK_DEP_BIT_CLOCK_UNSTABLE);
-
-	if (sched_clock_stable())
-		schedule_work(&sched_clock_work);
+	schedule_work(&sched_clock_work);
 }
 
 void clear_sched_clock_stable(void)
@@ -357,8 +376,7 @@ void sched_clock_tick(void)
 	 * XXX arguably we can skip this if we expose tsc_clocksource_reliable
 	 */
 	scd = this_scd();
-	scd->tick_raw  = sched_clock();
-	scd->tick_gtod = ktime_get_ns();
+	__scd_stamp(scd);
 
 	if (!sched_clock_stable() && likely(sched_clock_running))
 		sched_clock_local(scd);

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 4/9] x86/tsc,sched/clock,clocksource: Use clocksource watchdog to provide stable sync points
  2017-04-21 14:57 [PATCH 0/9] sched_clock fixes Peter Zijlstra
                   ` (2 preceding siblings ...)
  2017-04-21 14:57 ` [PATCH 3/9] sched/clock: Initialize all per-cpu state before switching (back) to unstable Peter Zijlstra
@ 2017-04-21 14:58 ` Peter Zijlstra
  2017-04-21 14:58 ` [PATCH 5/9] sched/clock: Remove unused argument to sched_clock_idle_wakeup_event() Peter Zijlstra
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2017-04-21 14:58 UTC (permalink / raw)
  To: tglx, mingo
  Cc: linux-kernel, ville.syrjala, daniel.lezcano, rafael.j.wysocki,
	marta.lofstedt, martin.peres, pasha.tatashin, peterz,
	daniel.vetter

[-- Attachment #1: peterz-sched-clock-rubarb-5.patch --]
[-- Type: text/plain, Size: 4028 bytes --]

Currently we keep sched_clock_tick() active for stable TSC in order to
keep the per-cpu state semi up-to-date. The (obvious) problem is that
by the time we detect TSC is borked, our per-cpu state is also borked.

So hook into the clocksource watchdog and have it call back into
sched_clock after it has found the TSC to (still) be stable.

There's the obvious race where the TSC goes wonky between finding it
stable and us running the callback, but closing that is too much work
and not really worth it, since we're already detecting TSC wobbles
after the fact, so we cannot, by definition, fully avoid funny clock
values.

And since the watchdog runs less often than the tick, this is also an
optimization.
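
Roughly, the invariant the offsets maintain (a sketch, not part of the
patch) is:

	/* stable:   clock ~= sched_clock()  + __sched_clock_offset
	 * unstable: clock ~= scd->tick_gtod + __gtod_offset (+ local delta)
	 *
	 * keeping the two continuous across a transition therefore wants:
	 */
	__gtod_offset = (sched_clock() + __sched_clock_offset) - ktime_get_ns();

which is what the new sched_clock_tick_stable() below keeps up to date while
the TSC is still deemed stable.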

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/kernel/tsc.c       |   10 ++++++++++
 include/linux/clocksource.h |    1 +
 include/linux/sched/clock.h |    2 +-
 kernel/sched/clock.c        |   36 +++++++++++++++++++++++++++---------
 kernel/time/clocksource.c   |    3 +++
 5 files changed, 42 insertions(+), 10 deletions(-)

--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1129,6 +1129,15 @@ static void tsc_cs_mark_unstable(struct
 	pr_info("Marking TSC unstable due to clocksource watchdog\n");
 }
 
+static void tsc_cs_tick_stable(struct clocksource *cs)
+{
+	if (tsc_unstable)
+		return;
+
+	if (using_native_sched_clock())
+		sched_clock_tick_stable();
+}
+
 /*
  * .mask MUST be CLOCKSOURCE_MASK(64). See comment above read_tsc()
  */
@@ -1142,6 +1151,7 @@ static struct clocksource clocksource_ts
 	.archdata               = { .vclock_mode = VCLOCK_TSC },
 	.resume			= tsc_resume,
 	.mark_unstable		= tsc_cs_mark_unstable,
+	.tick_stable		= tsc_cs_tick_stable,
 };
 
 void mark_tsc_unstable(char *reason)
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -96,6 +96,7 @@ struct clocksource {
 	void (*suspend)(struct clocksource *cs);
 	void (*resume)(struct clocksource *cs);
 	void (*mark_unstable)(struct clocksource *cs);
+	void (*tick_stable)(struct clocksource *cs);
 
 	/* private: */
 #ifdef CONFIG_CLOCKSOURCE_WATCHDOG
--- a/include/linux/sched/clock.h
+++ b/include/linux/sched/clock.h
@@ -63,8 +63,8 @@ extern void clear_sched_clock_stable(voi
  */
 extern u64 __sched_clock_offset;
 
-
 extern void sched_clock_tick(void);
+extern void sched_clock_tick_stable(void);
 extern void sched_clock_idle_sleep_event(void);
 extern void sched_clock_idle_wakeup_event(u64 delta_ns);
 
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -367,20 +367,38 @@ void sched_clock_tick(void)
 {
 	struct sched_clock_data *scd;
 
+	if (sched_clock_stable())
+		return;
+
+	if (unlikely(!sched_clock_running))
+		return;
+
 	WARN_ON_ONCE(!irqs_disabled());
 
-	/*
-	 * Update these values even if sched_clock_stable(), because it can
-	 * become unstable at any point in time at which point we need some
-	 * values to fall back on.
-	 *
-	 * XXX arguably we can skip this if we expose tsc_clocksource_reliable
-	 */
 	scd = this_scd();
 	__scd_stamp(scd);
+	sched_clock_local(scd);
+}
 
-	if (!sched_clock_stable() && likely(sched_clock_running))
-		sched_clock_local(scd);
+void sched_clock_tick_stable(void)
+{
+	u64 gtod, clock;
+
+	if (!sched_clock_stable())
+		return;
+
+	/*
+	 * Called under watchdog_lock.
+	 *
+	 * The watchdog just found this TSC to (still) be stable, so now is a
+	 * good moment to update our __gtod_offset. Because once we find the
+	 * TSC to be unstable, any computation will be computing crap.
+	 */
+	local_irq_disable();
+	gtod = ktime_get_ns();
+	clock = sched_clock();
+	__gtod_offset = (clock + __sched_clock_offset) - gtod;
+	local_irq_enable();
 }
 
 /*
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -233,6 +233,9 @@ static void clocksource_watchdog(unsigne
 			continue;
 		}
 
+		if (cs == curr_clocksource && cs->tick_stable)
+			cs->tick_stable(cs);
+
 		if (!(cs->flags & CLOCK_SOURCE_VALID_FOR_HRES) &&
 		    (cs->flags & CLOCK_SOURCE_IS_CONTINUOUS) &&
 		    (watchdog->flags & CLOCK_SOURCE_IS_CONTINUOUS)) {

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 5/9] sched/clock: Remove unused argument to sched_clock_idle_wakeup_event()
  2017-04-21 14:57 [PATCH 0/9] sched_clock fixes Peter Zijlstra
                   ` (3 preceding siblings ...)
  2017-04-21 14:58 ` [PATCH 4/9] x86/tsc,sched/clock,clocksource: Use clocksource watchdog to provide stable sync points Peter Zijlstra
@ 2017-04-21 14:58 ` Peter Zijlstra
  2017-04-21 14:58 ` [PATCH 6/9] sched/clock: Remove watchdog touching Peter Zijlstra
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2017-04-21 14:58 UTC (permalink / raw)
  To: tglx, mingo
  Cc: linux-kernel, ville.syrjala, daniel.lezcano, rafael.j.wysocki,
	marta.lofstedt, martin.peres, pasha.tatashin, peterz,
	daniel.vetter

[-- Attachment #1: peterz-sched-clock-rubarb-6.patch --]
[-- Type: text/plain, Size: 2049 bytes --]

The argument to sched_clock_idle_wakeup_event() has not been used in a
long time. Remove it.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/kernel/tsc.c       |    2 +-
 include/linux/sched/clock.h |    4 ++--
 kernel/sched/clock.c        |    4 ++--
 kernel/time/tick-sched.c    |    2 +-
 4 files changed, 6 insertions(+), 6 deletions(-)

--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -284,7 +284,7 @@ static void set_cyc2ns_scale(unsigned lo
 	cyc2ns_write_end(cpu, data);
 
 done:
-	sched_clock_idle_wakeup_event(0);
+	sched_clock_idle_wakeup_event();
 	local_irq_restore(flags);
 }
 /*
--- a/include/linux/sched/clock.h
+++ b/include/linux/sched/clock.h
@@ -39,7 +39,7 @@ static inline void sched_clock_idle_slee
 {
 }
 
-static inline void sched_clock_idle_wakeup_event(u64 delta_ns)
+static inline void sched_clock_idle_wakeup_event(void)
 {
 }
 
@@ -66,7 +66,7 @@ extern u64 __sched_clock_offset;
 extern void sched_clock_tick(void);
 extern void sched_clock_tick_stable(void);
 extern void sched_clock_idle_sleep_event(void);
-extern void sched_clock_idle_wakeup_event(u64 delta_ns);
+extern void sched_clock_idle_wakeup_event(void);
 
 /*
  * As outlined in clock.c, provides a fast, high resolution, nanosecond
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -410,9 +410,9 @@ void sched_clock_idle_sleep_event(void)
 EXPORT_SYMBOL_GPL(sched_clock_idle_sleep_event);
 
 /*
- * We just idled delta nanoseconds (called with irqs disabled):
+ * We just idled; resync with ktime. (called with irqs disabled):
  */
-void sched_clock_idle_wakeup_event(u64 delta_ns)
+void sched_clock_idle_wakeup_event(void)
 {
 	if (timekeeping_suspended)
 		return;
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -573,7 +573,7 @@ static void tick_nohz_stop_idle(struct t
 	update_ts_time_stats(smp_processor_id(), ts, now, NULL);
 	ts->idle_active = 0;
 
-	sched_clock_idle_wakeup_event(0);
+	sched_clock_idle_wakeup_event();
 }
 
 static ktime_t tick_nohz_start_idle(struct tick_sched *ts)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 6/9] sched/clock: Remove watchdog touching
  2017-04-21 14:57 [PATCH 0/9] sched_clock fixes Peter Zijlstra
                   ` (4 preceding siblings ...)
  2017-04-21 14:58 ` [PATCH 5/9] sched/clock: Remove unused argument to sched_clock_idle_wakeup_event() Peter Zijlstra
@ 2017-04-21 14:58 ` Peter Zijlstra
  2017-04-21 15:28   ` Peter Zijlstra
  2017-04-21 14:58 ` [PATCH 8/9] sched/clock: Use late_initcall() instead of sched_init_smp() Peter Zijlstra
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2017-04-21 14:58 UTC (permalink / raw)
  To: tglx, mingo
  Cc: linux-kernel, ville.syrjala, daniel.lezcano, rafael.j.wysocki,
	marta.lofstedt, martin.peres, pasha.tatashin, peterz,
	daniel.vetter

[-- Attachment #1: peterz-sched-clock-rubarb-7.patch --]
[-- Type: text/plain, Size: 1135 bytes --]

Commit:

  2bacec8c318c ("sched: touch softlockup watchdog after idling")

introduced the touch_softlockup_watchdog_sched() call without
justification, and I feel sched_clock management is not the right
place for it; it should only be concerned with producing semi-coherent
time.

If this causes softlockup watchdog false positives, we can find a
better place for the touch.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/clock.c |   12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -410,15 +410,21 @@ void sched_clock_idle_sleep_event(void)
 EXPORT_SYMBOL_GPL(sched_clock_idle_sleep_event);
 
 /*
- * We just idled; resync with ktime. (called with irqs disabled):
+ * We just idled; resync with ktime.
  */
 void sched_clock_idle_wakeup_event(void)
 {
-	if (timekeeping_suspended)
+	unsigned long flags;
+
+	if (sched_clock_stable())
+		return;
+
+	if (unlikely(timekeeping_suspended))
 		return;
 
+	local_irq_save(flags);
 	sched_clock_tick();
-	touch_softlockup_watchdog_sched();
+	local_irq_restore(flags);
 }
 EXPORT_SYMBOL_GPL(sched_clock_idle_wakeup_event);
 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 8/9] sched/clock: Use late_initcall() instead of sched_init_smp()
  2017-04-21 14:57 [PATCH 0/9] sched_clock fixes Peter Zijlstra
                   ` (5 preceding siblings ...)
  2017-04-21 14:58 ` [PATCH 6/9] sched/clock: Remove watchdog touching Peter Zijlstra
@ 2017-04-21 14:58 ` Peter Zijlstra
  2017-04-21 14:58 ` [PATCH 9/9] sched/clock: Print a warning recommending tsc=unstable Peter Zijlstra
  2017-04-25  9:31 ` [PATCH 0/9] sched_clock fixes Lofstedt, Marta
  8 siblings, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2017-04-21 14:58 UTC (permalink / raw)
  To: tglx, mingo
  Cc: linux-kernel, ville.syrjala, daniel.lezcano, rafael.j.wysocki,
	marta.lofstedt, martin.peres, pasha.tatashin, peterz,
	daniel.vetter

[-- Attachment #1: peterz-sched-clock-rubarb-9.patch --]
[-- Type: text/plain, Size: 2607 bytes --]

Core2 marks its TSC unstable in ACPI Processor Idle, which is probed
after sched_init_smp(). Luckily it appears both acpi_processor and
intel_idle (which has a similar check) are mandatory built-in.

This means we can delay switching to stable until after these drivers
have run (if they were modules, this would be impossible).

Delay the stable switch to late_initcall() to allow these drivers to
mark TSC unstable and avoid difficult stable->unstable transitions.
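
For reference (the exact initcall level the idle drivers end up at is my
reading, treat it as an assumption): built-in initcalls run strictly in
level order and late_initcall() is the last level, so:

	device_initcall() level:  acpi_processor, intel_idle  (when built in)
	late_initcall()   level:  sched_clock_init_late()     (runs after the above)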

Cc: "Rafael J . Wysocki" <rafael.j.wysocki@intel.com>
Reported-by: "Lofstedt, Marta" <marta.lofstedt@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched/clock.h |    5 -----
 kernel/sched/clock.c        |   10 +++++++++-
 kernel/sched/core.c         |    2 --
 3 files changed, 9 insertions(+), 8 deletions(-)

--- a/include/linux/sched/clock.h
+++ b/include/linux/sched/clock.h
@@ -23,10 +23,6 @@ extern u64 sched_clock_cpu(int cpu);
 extern void sched_clock_init(void);
 
 #ifndef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
-static inline void sched_clock_init_late(void)
-{
-}
-
 static inline void sched_clock_tick(void)
 {
 }
@@ -53,7 +49,6 @@ static inline u64 local_clock(void)
 	return sched_clock();
 }
 #else
-extern void sched_clock_init_late(void);
 extern int sched_clock_stable(void);
 extern void clear_sched_clock_stable(void);
 
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -64,6 +64,7 @@
 #include <linux/workqueue.h>
 #include <linux/compiler.h>
 #include <linux/tick.h>
+#include <linux/init.h>
 
 /*
  * Scheduler clock - returns current time in nanosec units.
@@ -203,7 +204,11 @@ void clear_sched_clock_stable(void)
 		__clear_sched_clock_stable();
 }
 
-void sched_clock_init_late(void)
+/*
+ * We run this as late_initcall() such that it runs after all built-in drivers,
+ * notably: acpi_processor and intel_idle, which can mark the TSC as unstable.
+ */
+static int __init sched_clock_init_late(void)
 {
 	sched_clock_running = 2;
 	/*
@@ -217,7 +222,10 @@ void sched_clock_init_late(void)
 
 	if (__sched_clock_stable_early)
 		__set_sched_clock_stable();
+
+	return 0;
 }
+late_initcall(sched_clock_init_late);
 
 /*
  * min, max except they take wrapping into account
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5962,7 +5962,6 @@ void __init sched_init_smp(void)
 	init_sched_dl_class();
 
 	sched_init_smt();
-	sched_clock_init_late();
 
 	sched_smp_initialized = true;
 }
@@ -5978,7 +5977,6 @@ early_initcall(migration_init);
 void __init sched_init_smp(void)
 {
 	sched_init_granularity();
-	sched_clock_init_late();
 }
 #endif /* CONFIG_SMP */
 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 9/9] sched/clock: Print a warning recommending tsc=unstable
  2017-04-21 14:57 [PATCH 0/9] sched_clock fixes Peter Zijlstra
                   ` (6 preceding siblings ...)
  2017-04-21 14:58 ` [PATCH 8/9] sched/clock: Use late_initcall() instead of sched_init_smp() Peter Zijlstra
@ 2017-04-21 14:58 ` Peter Zijlstra
  2017-04-25  9:31 ` [PATCH 0/9] sched_clock fixes Lofstedt, Marta
  8 siblings, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2017-04-21 14:58 UTC (permalink / raw)
  To: tglx, mingo
  Cc: linux-kernel, ville.syrjala, daniel.lezcano, rafael.j.wysocki,
	marta.lofstedt, martin.peres, pasha.tatashin, peterz,
	daniel.vetter

[-- Attachment #1: peterz-sched-clock-rubarb-10.patch --]
[-- Type: text/plain, Size: 1090 bytes --]

With our switch to stable delayed until late_initcall(), the most
likely cause of hitting mark_tsc_unstable() is the watchdog. The
watchdog typically only triggers when creative BIOSes fiddle with the
TSC to hide SMI latency.

Since the watchdog can only detect TSC fiddling after the fact, all TSC
clocks (including userspace GTOD) can already have reported funny
values.

The only way to fully avoid this is to manually mark the TSC unstable
at boot. Suggest people do this on their broken systems.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/clock.c |    1 +
 1 file changed, 1 insertion(+)

--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -175,6 +175,7 @@ static void __sched_clock_work(struct wo
 	for_each_possible_cpu(cpu)
 		per_cpu(sched_clock_data, cpu) = *scd;
 
+	printk(KERN_WARNING "TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.\n");
 	printk(KERN_INFO "sched_clock: Marking unstable (%lld, %lld)<-(%lld, %lld)\n",
 			scd->tick_gtod, __gtod_offset,
 			scd->tick_raw,  __sched_clock_offset);

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 6/9] sched/clock: Remove watchdog touching
  2017-04-21 14:58 ` [PATCH 6/9] sched/clock: Remove watchdog touching Peter Zijlstra
@ 2017-04-21 15:28   ` Peter Zijlstra
  0 siblings, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2017-04-21 15:28 UTC (permalink / raw)
  To: tglx, mingo
  Cc: linux-kernel, ville.syrjala, daniel.lezcano, rafael.j.wysocki,
	marta.lofstedt, martin.peres, pasha.tatashin, daniel.vetter

On Fri, Apr 21, 2017 at 04:58:02PM +0200, Peter Zijlstra wrote:
> Commit:
> 
>   2bacec8c318c ("sched: touch softlockup watchdog after idling")
> 
> introduced the touch_softlockup_watchdog_sched() call without
> justification and I feel sched_clock management is not the right
> place, it should only be concerned with producing semi coherent time.
> 
> If this causes watchdog thingies, we can find a better place.

Hurmph, the rest should've gone in the next patch, which calls
sched_clock_idle_wakeup_event() from a place with IRQs enabled.

> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  kernel/sched/clock.c |   12 +++++++++---
>  1 file changed, 9 insertions(+), 3 deletions(-)
> 
> --- a/kernel/sched/clock.c
> +++ b/kernel/sched/clock.c
> @@ -410,15 +410,21 @@ void sched_clock_idle_sleep_event(void)
>  EXPORT_SYMBOL_GPL(sched_clock_idle_sleep_event);
>  
>  /*
> - * We just idled; resync with ktime. (called with irqs disabled):
> + * We just idled; resync with ktime.
>   */
>  void sched_clock_idle_wakeup_event(void)
>  {
> -	if (timekeeping_suspended)
> +	unsigned long flags;
> +
> +	if (sched_clock_stable())
> +		return;
> +
> +	if (unlikely(timekeeping_suspended))
>  		return;
>  
> +	local_irq_save(flags);
>  	sched_clock_tick();
> -	touch_softlockup_watchdog_sched();
> +	local_irq_restore(flags);
>  }
>  EXPORT_SYMBOL_GPL(sched_clock_idle_wakeup_event);
>  
> 
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: [PATCH 0/9] sched_clock fixes
  2017-04-21 14:57 [PATCH 0/9] sched_clock fixes Peter Zijlstra
                   ` (7 preceding siblings ...)
  2017-04-21 14:58 ` [PATCH 9/9] sched/clock: Print a warning recommending tsc=unstable Peter Zijlstra
@ 2017-04-25  9:31 ` Lofstedt, Marta
  2017-04-25 13:44   ` Peter Zijlstra
  8 siblings, 1 reply; 13+ messages in thread
From: Lofstedt, Marta @ 2017-04-25  9:31 UTC (permalink / raw)
  To: Peter Zijlstra, tglx, mingo
  Cc: linux-kernel, ville.syrjala, daniel.lezcano, Wysocki, Rafael J,
	martin.peres, pasha.tatashin, daniel.vetter

Hi Peterz,

I tested your patch-set on the same Core2 machine on which we discovered the regression.
With the tsc=unstable boot param the pass rate has improved significantly; 350 fails -> 15 fails.

BR,
Marta

> -----Original Message-----
> From: Peter Zijlstra [mailto:peterz@infradead.org]
> Sent: Friday, April 21, 2017 5:58 PM
> To: tglx@linutronix.de; mingo@kernel.org
> Cc: linux-kernel@vger.kernel.org; ville.syrjala@linux.intel.com;
> daniel.lezcano@linaro.org; Wysocki, Rafael J <rafael.j.wysocki@intel.com>;
> Lofstedt, Marta <marta.lofstedt@intel.com>; martin.peres@linux.intel.com;
> pasha.tatashin@oracle.com; peterz@infradead.org; daniel.vetter@ffwll.ch
> Subject: [PATCH 0/9] sched_clock fixes
> 
> Hi,
> 
> These patches were inspired by (and hopefully fix) two independent bug
> reports on
> Core2 machines.
> 
> I never could quite reproduce one, but my Core2 machine no longer switches
> to stable sched_clock and therefore no longer tickles the problematic stable -
> > unstable transition either.
> 
> Before I dug up my Core2 machine, I tried emulating TSC wreckage by poking
> random values into the TSC MSR from userspace. Behaviour in that case is
> improved as well.
> 
> People have to realize that if we manage to boot with TSC 'stable' (both
> sched_clock and clocksource) and we later find out we were mistaken (we
> observe a TSC wobble) the clocks that are derived from it _will_ have had an
> observable hiccup. This is fundamentally unfixable.
> 
> If you own a machine where the BIOS tries to hide SMI latencies by
> rewinding TSC (yes, this is a thing), the very best we can do is mark TSC
> unstable with a boot parameter.
> 
> For example, this is me writing a stupid value into the TSC:
> 
> [   46.745082] random: crng init done
> [18443029775.010069] clocksource: timekeeping watchdog on CPU0: Marking
> clocksource 'tsc' as unstable because the skew is too large:
> [18443029775.023141] clocksource:                       'hpet' wd_now: 3ebec538
> wd_last: 3e486ec9 mask: ffffffff
> [18443029775.034214] clocksource:                       'tsc' cs_now: 5025acce9 cs_last:
> 24dc3bd21c88ee mask: ffffffffffffffff
> [18443029775.046651] tsc: Marking TSC unstable due to clocksource
> watchdog [18443029775.054211] TSC found unstable after boot, most likely
> due to broken BIOS. Use 'tsc=unstable'.
> [18443029775.064434] sched_clock: Marking unstable (70569005835, -
> 17833788)<-(-3714295689546517, -2965802361)
> [   70.573700] clocksource: Switched to clocksource hpet
> 
> With some trace_printk()s (not included) I could tell that the wobble occurred
> at 69.965474. The clock now resumes where it 'should' have been.
> 
> But an unfortunate scheduling event could have resulted in one task having
> seen a runtime of ~584 years with 'obvious' effects. Similar jumps can also be
> observed from userspace GTOD usage.
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/9] sched_clock fixes
  2017-04-25  9:31 ` [PATCH 0/9] sched_clock fixes Lofstedt, Marta
@ 2017-04-25 13:44   ` Peter Zijlstra
  2017-04-26  6:41     ` Lofstedt, Marta
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2017-04-25 13:44 UTC (permalink / raw)
  To: Lofstedt, Marta
  Cc: tglx, mingo, linux-kernel, ville.syrjala, daniel.lezcano,
	Wysocki, Rafael J, martin.peres, pasha.tatashin, daniel.vetter

On Tue, Apr 25, 2017 at 09:31:40AM +0000, Lofstedt, Marta wrote:
> Hi Peterz,
> 
> I tested your patch-set on the same Core2 machine on which we discovered the regression.
> With the tsc=unstable boot param the pass rate has improved significantly; 350 fails -> 15 fails.

So is that the same as before, or still worse? I don't really have a
handle on what your benchmark is here, nor how 'good' is defined.

If it's still worse than before, I'm completely confused, because with
"tsc=unstable" the patch you fingered is a complete no-op (__gtod_offset
== 0).

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: [PATCH 0/9] sched_clock fixes
  2017-04-25 13:44   ` Peter Zijlstra
@ 2017-04-26  6:41     ` Lofstedt, Marta
  0 siblings, 0 replies; 13+ messages in thread
From: Lofstedt, Marta @ 2017-04-26  6:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, linux-kernel, ville.syrjala, daniel.lezcano,
	Wysocki, Rafael J, martin.peres, pasha.tatashin, daniel.vetter

For bisecting the regression we ran 14 tests for 50 repetitions.

Before the bisected regression:

commit 7b09cc5a9debc86c903c2eff8f8a1fdef773c649
Author: Pavel Tatashin <pasha.tatashin@oracle.com>
Date:   Wed Mar 22 16:24:17 2017 -0400

    sched/clock: Fix broken stable to unstable transfer

there were ~0 failing tests on the Core2 machine.
After the regression: ~350 failing tests.
With your patch-set: ~15 failing tests.

To be honest, I must say that these tests used to give unstable results on the Core2. But some time ago the results magically stabilized at ~0 fails, thanks to timing-related fixes for other issues. Ville Syrjala now has a patch-set that we believe really solves the graphics part of the issue. However, I believe your patch-set still improves the situation related to the TSC instability of the Core2.

/Marta

> -----Original Message-----
> From: Peter Zijlstra [mailto:peterz@infradead.org]
> Sent: Tuesday, April 25, 2017 4:45 PM
> To: Lofstedt, Marta <marta.lofstedt@intel.com>
> Cc: tglx@linutronix.de; mingo@kernel.org; linux-kernel@vger.kernel.org;
> ville.syrjala@linux.intel.com; daniel.lezcano@linaro.org; Wysocki, Rafael J
> <rafael.j.wysocki@intel.com>; martin.peres@linux.intel.com;
> pasha.tatashin@oracle.com; daniel.vetter@ffwll.ch
> Subject: Re: [PATCH 0/9] sched_clock fixes
> 
> On Tue, Apr 25, 2017 at 09:31:40AM +0000, Lofstedt, Marta wrote:
> > Hi Peterz,
> >
> > I tested your patch-set on the same Core2 machine as where we
> discovered the regression.
> > With the tsc=unstable boot param the pass rate has improved significantly;
> 350 fails -> 15 fails.
> 
> So is that the same as before, or still worse? I don't really have a handle on
> what your benchmark is here, nor what how 'good' is defined.
> 
> If its still worse than before, I'm completely confused. Because with
> "tsc=unstable" the patch you fingered is a complete no-op (__gtod_offset
> == 0).

^ permalink raw reply	[flat|nested] 13+ messages in thread


Thread overview: 13+ messages
2017-04-21 14:57 [PATCH 0/9] sched_clock fixes Peter Zijlstra
2017-04-21 14:57 ` [PATCH 1/9] x86/tsc: Provide tsc=unstable boot parameter Peter Zijlstra
2017-04-21 14:57 ` [PATCH 2/9] x86,tsc: Feed refined TSC calibration into sched_clock Peter Zijlstra
2017-04-21 14:57 ` [PATCH 3/9] sched/clock: Initialize all per-cpu state before switching (back) to unstable Peter Zijlstra
2017-04-21 14:58 ` [PATCH 4/9] x86/tsc,sched/clock,clocksource: Use clocksource watchdog to provide stable sync points Peter Zijlstra
2017-04-21 14:58 ` [PATCH 5/9] sched/clock: Remove unused argument to sched_clock_idle_wakeup_event() Peter Zijlstra
2017-04-21 14:58 ` [PATCH 6/9] sched/clock: Remove watchdog touching Peter Zijlstra
2017-04-21 15:28   ` Peter Zijlstra
2017-04-21 14:58 ` [PATCH 8/9] sched/clock: Use late_initcall() instead of sched_init_smp() Peter Zijlstra
2017-04-21 14:58 ` [PATCH 9/9] sched/clock: Print a warning recommending tsc=unstable Peter Zijlstra
2017-04-25  9:31 ` [PATCH 0/9] sched_clock fixes Lofstedt, Marta
2017-04-25 13:44   ` Peter Zijlstra
2017-04-26  6:41     ` Lofstedt, Marta
