* [patch 0/2] tsc/adjust: Cure suspend/resume issues and prevent TSC deadline timer irq storm @ 2016-12-13 13:14 Thomas Gleixner 2016-12-13 13:14 ` [patch 1/2] x86/tsc: Validate TSC_ADJUST after resume Thomas Gleixner ` (2 more replies) 0 siblings, 3 replies; 22+ messages in thread From: Thomas Gleixner @ 2016-12-13 13:14 UTC (permalink / raw) To: LKML Cc: x86, Peter Zijlstra, Borislav Petkov, Bruce Schlobohm, Roland Scheidegger, Kevin Stanton, Allen Hung Roland reported interesting TSC ADJUST register wreckage on his DELL machine, which seems to populate that MSR with a random number generator. Deeper investagation into fixing this wreckage unearthed another special feature which is designed by Intel: Negative TSC adjuste values cause interrupt storms on the TSC deadline timer. Further details in patch 2/2 Thanks tglx ^ permalink raw reply [flat|nested] 22+ messages in thread
* [patch 1/2] x86/tsc: Validate TSC_ADJUST after resume 2016-12-13 13:14 [patch 0/2] tsc/adjust: Cure suspend/resume issues and prevent TSC deadline timer irq storm Thomas Gleixner @ 2016-12-13 13:14 ` Thomas Gleixner 2016-12-13 13:22 ` Peter Zijlstra 2016-12-15 10:52 ` [tip:x86/timers] " tip-bot for Thomas Gleixner 2016-12-13 13:14 ` [patch 2/2] x86/tsc: Force TSC_ADJUST register to value >= zero Thomas Gleixner 2016-12-13 16:34 ` [patch 0/2] tsc/adjust: Cure suspend/resume issues and prevent TSC deadline timer irq storm Roland Scheidegger 2 siblings, 2 replies; 22+ messages in thread From: Thomas Gleixner @ 2016-12-13 13:14 UTC (permalink / raw) To: LKML Cc: x86, Peter Zijlstra, Borislav Petkov, Bruce Schlobohm, Roland Scheidegger, Kevin Stanton, Allen Hung [-- Attachment #1: x86-tsc--Validate-TSC_ADJUST-after-resume.patch --] [-- Type: text/plain, Size: 3313 bytes --] Some 'feature' BIOSes fiddle with the TSC_ADJUST register during suspend/resume which renders the TSC unusable. Add sanity checks into the resume path and restore the original value if it was adjusted. Reported-by: Roland Scheidegger <rscheidegger_lists@hispeed.ch> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- arch/x86/include/asm/tsc.h | 4 ++-- arch/x86/kernel/process.c | 2 +- arch/x86/kernel/tsc.c | 6 ++++++ arch/x86/kernel/tsc_sync.c | 6 +++--- arch/x86/power/cpu.c | 1 + 5 files changed, 13 insertions(+), 6 deletions(-) --- a/arch/x86/include/asm/tsc.h +++ b/arch/x86/include/asm/tsc.h @@ -47,12 +47,12 @@ extern int tsc_clocksource_reliable; */ #ifdef CONFIG_X86_TSC extern bool tsc_store_and_check_tsc_adjust(void); -extern void tsc_verify_tsc_adjust(void); +extern void tsc_verify_tsc_adjust(bool resume); extern void check_tsc_sync_source(int cpu); extern void check_tsc_sync_target(void); #else static inline bool tsc_store_and_check_tsc_adjust(void) { return false; } -static inline void tsc_verify_tsc_adjust(void) { } +static inline void tsc_verify_tsc_adjust(bool resume) { } static inline void check_tsc_sync_source(int cpu) { } static inline void check_tsc_sync_target(void) { } #endif --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -277,7 +277,7 @@ void exit_idle(void) void arch_cpu_idle_enter(void) { - tsc_verify_tsc_adjust(); + tsc_verify_tsc_adjust(false); local_touch_nmi(); enter_idle(); } --- a/arch/x86/kernel/tsc.c +++ b/arch/x86/kernel/tsc.c @@ -1080,6 +1080,11 @@ static void detect_art(void) static struct clocksource clocksource_tsc; +static void tsc_resume(struct clocksource *cs) +{ + tsc_verify_tsc_adjust(true); +} + /* * We used to compare the TSC to the cycle_last value in the clocksource * structure to avoid a nasty time-warp. This can be observed in a @@ -1112,6 +1117,7 @@ static struct clocksource clocksource_ts .flags = CLOCK_SOURCE_IS_CONTINUOUS | CLOCK_SOURCE_MUST_VERIFY, .archdata = { .vclock_mode = VCLOCK_TSC }, + .resume = tsc_resume, }; void mark_tsc_unstable(char *reason) --- a/arch/x86/kernel/tsc_sync.c +++ b/arch/x86/kernel/tsc_sync.c @@ -30,7 +30,7 @@ struct tsc_adjust { static DEFINE_PER_CPU(struct tsc_adjust, tsc_adjust); -void tsc_verify_tsc_adjust(void) +void tsc_verify_tsc_adjust(bool resume) { struct tsc_adjust *adj = this_cpu_ptr(&tsc_adjust); s64 curval; @@ -39,7 +39,7 @@ void tsc_verify_tsc_adjust(void) return; /* Rate limit the MSR check */ - if (time_before(jiffies, adj->nextcheck)) + if (!resume && time_before(jiffies, adj->nextcheck)) return; adj->nextcheck = jiffies + HZ; @@ -51,7 +51,7 @@ void tsc_verify_tsc_adjust(void) /* Restore the original value */ wrmsrl(MSR_IA32_TSC_ADJUST, adj->adjusted); - if (!adj->warned) { + if (!adj->warned || resume) { pr_warn(FW_BUG "TSC ADJUST differs: CPU%u %lld --> %lld. Restoring\n", smp_processor_id(), adj->adjusted, curval); adj->warned = true; --- a/arch/x86/power/cpu.c +++ b/arch/x86/power/cpu.c @@ -256,6 +256,7 @@ static void notrace __restore_processor_ mtrr_bp_restore(); perf_restore_debug_store(); msr_restore_context(ctxt); + tsc_verify_tsc_adjust(true); } /* Needed by apm.c */ ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch 1/2] x86/tsc: Validate TSC_ADJUST after resume 2016-12-13 13:14 ` [patch 1/2] x86/tsc: Validate TSC_ADJUST after resume Thomas Gleixner @ 2016-12-13 13:22 ` Peter Zijlstra 2016-12-13 13:23 ` Thomas Gleixner 2016-12-15 10:52 ` [tip:x86/timers] " tip-bot for Thomas Gleixner 1 sibling, 1 reply; 22+ messages in thread From: Peter Zijlstra @ 2016-12-13 13:22 UTC (permalink / raw) To: Thomas Gleixner Cc: LKML, x86, Borislav Petkov, Bruce Schlobohm, Roland Scheidegger, Kevin Stanton, Allen Hung On Tue, Dec 13, 2016 at 01:14:17PM -0000, Thomas Gleixner wrote: > --- a/arch/x86/power/cpu.c > +++ b/arch/x86/power/cpu.c > @@ -256,6 +256,7 @@ static void notrace __restore_processor_ > mtrr_bp_restore(); > perf_restore_debug_store(); > msr_restore_context(ctxt); > + tsc_verify_tsc_adjust(true); > } Should we do that sooner, as in before calling restore_sched_clock_state() ? Otherwise we recompute the sched_clock deltas vs the wrecked TSC and then fix it up through the ADJUST, wrecking our sched clock again. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch 1/2] x86/tsc: Validate TSC_ADJUST after resume 2016-12-13 13:22 ` Peter Zijlstra @ 2016-12-13 13:23 ` Thomas Gleixner 0 siblings, 0 replies; 22+ messages in thread From: Thomas Gleixner @ 2016-12-13 13:23 UTC (permalink / raw) To: Peter Zijlstra Cc: LKML, x86, Borislav Petkov, Bruce Schlobohm, Roland Scheidegger, Kevin Stanton, Allen Hung On Tue, 13 Dec 2016, Peter Zijlstra wrote: > On Tue, Dec 13, 2016 at 01:14:17PM -0000, Thomas Gleixner wrote: > > --- a/arch/x86/power/cpu.c > > +++ b/arch/x86/power/cpu.c > > @@ -256,6 +256,7 @@ static void notrace __restore_processor_ > > mtrr_bp_restore(); > > perf_restore_debug_store(); > > msr_restore_context(ctxt); > > + tsc_verify_tsc_adjust(true); > > } > > Should we do that sooner, as in before calling > restore_sched_clock_state() ? Otherwise we recompute the sched_clock > deltas vs the wrecked TSC and then fix it up through the ADJUST, > wrecking our sched clock again. Ah. Indeed. Will fix. That needs some thought on the secondary CPUs as well... Thanks, tglx ^ permalink raw reply [flat|nested] 22+ messages in thread
* [tip:x86/timers] x86/tsc: Validate TSC_ADJUST after resume 2016-12-13 13:14 ` [patch 1/2] x86/tsc: Validate TSC_ADJUST after resume Thomas Gleixner 2016-12-13 13:22 ` Peter Zijlstra @ 2016-12-15 10:52 ` tip-bot for Thomas Gleixner 1 sibling, 0 replies; 22+ messages in thread From: tip-bot for Thomas Gleixner @ 2016-12-15 10:52 UTC (permalink / raw) To: linux-tip-commits Cc: linux-kernel, bruce.schlobohm, hpa, tglx, bp, rscheidegger_lists, mingo, kevin.b.stanton, allen_hung, peterz Commit-ID: 6a369583178d0b89c2c3919c4456ee22fee0f249 Gitweb: http://git.kernel.org/tip/6a369583178d0b89c2c3919c4456ee22fee0f249 Author: Thomas Gleixner <tglx@linutronix.de> AuthorDate: Tue, 13 Dec 2016 13:14:17 +0000 Committer: Thomas Gleixner <tglx@linutronix.de> CommitDate: Thu, 15 Dec 2016 11:44:29 +0100 x86/tsc: Validate TSC_ADJUST after resume Some 'feature' BIOSes fiddle with the TSC_ADJUST register during suspend/resume which renders the TSC unusable. Add sanity checks into the resume path and restore the original value if it was adjusted. Reported-and-tested-by: Roland Scheidegger <rscheidegger_lists@hispeed.ch> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Bruce Schlobohm <bruce.schlobohm@intel.com> Cc: Kevin Stanton <kevin.b.stanton@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Allen Hung <allen_hung@dell.com> Cc: Borislav Petkov <bp@alien8.de> Link: http://lkml.kernel.org/r/20161213131211.317654500@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- arch/x86/include/asm/tsc.h | 4 ++-- arch/x86/kernel/process.c | 2 +- arch/x86/kernel/tsc.c | 6 ++++++ arch/x86/kernel/tsc_sync.c | 6 +++--- arch/x86/power/cpu.c | 1 + 5 files changed, 13 insertions(+), 6 deletions(-) diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h index c054eaa..372ad0c 100644 --- a/arch/x86/include/asm/tsc.h +++ b/arch/x86/include/asm/tsc.h @@ -47,12 +47,12 @@ extern int tsc_clocksource_reliable; */ #ifdef CONFIG_X86_TSC extern bool tsc_store_and_check_tsc_adjust(void); -extern void tsc_verify_tsc_adjust(void); +extern void tsc_verify_tsc_adjust(bool resume); extern void check_tsc_sync_source(int cpu); extern void check_tsc_sync_target(void); #else static inline bool tsc_store_and_check_tsc_adjust(void) { return false; } -static inline void tsc_verify_tsc_adjust(void) { } +static inline void tsc_verify_tsc_adjust(bool resume) { } static inline void check_tsc_sync_source(int cpu) { } static inline void check_tsc_sync_target(void) { } #endif diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c index 4fe5dc8..a67e0f0 100644 --- a/arch/x86/kernel/process.c +++ b/arch/x86/kernel/process.c @@ -277,7 +277,7 @@ void exit_idle(void) void arch_cpu_idle_enter(void) { - tsc_verify_tsc_adjust(); + tsc_verify_tsc_adjust(false); local_touch_nmi(); enter_idle(); } diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c index 2bb8de4..bfb541a 100644 --- a/arch/x86/kernel/tsc.c +++ b/arch/x86/kernel/tsc.c @@ -1080,6 +1080,11 @@ static void detect_art(void) static struct clocksource clocksource_tsc; +static void tsc_resume(struct clocksource *cs) +{ + tsc_verify_tsc_adjust(true); +} + /* * We used to compare the TSC to the cycle_last value in the clocksource * structure to avoid a nasty time-warp. This can be observed in a @@ -1112,6 +1117,7 @@ static struct clocksource clocksource_tsc = { .flags = CLOCK_SOURCE_IS_CONTINUOUS | CLOCK_SOURCE_MUST_VERIFY, .archdata = { .vclock_mode = VCLOCK_TSC }, + .resume = tsc_resume, }; void mark_tsc_unstable(char *reason) diff --git a/arch/x86/kernel/tsc_sync.c b/arch/x86/kernel/tsc_sync.c index a75f696..94f2ce5 100644 --- a/arch/x86/kernel/tsc_sync.c +++ b/arch/x86/kernel/tsc_sync.c @@ -30,7 +30,7 @@ struct tsc_adjust { static DEFINE_PER_CPU(struct tsc_adjust, tsc_adjust); -void tsc_verify_tsc_adjust(void) +void tsc_verify_tsc_adjust(bool resume) { struct tsc_adjust *adj = this_cpu_ptr(&tsc_adjust); s64 curval; @@ -39,7 +39,7 @@ void tsc_verify_tsc_adjust(void) return; /* Rate limit the MSR check */ - if (time_before(jiffies, adj->nextcheck)) + if (!resume && time_before(jiffies, adj->nextcheck)) return; adj->nextcheck = jiffies + HZ; @@ -51,7 +51,7 @@ void tsc_verify_tsc_adjust(void) /* Restore the original value */ wrmsrl(MSR_IA32_TSC_ADJUST, adj->adjusted); - if (!adj->warned) { + if (!adj->warned || resume) { pr_warn(FW_BUG "TSC ADJUST differs: CPU%u %lld --> %lld. Restoring\n", smp_processor_id(), adj->adjusted, curval); adj->warned = true; diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c index 53cace2..66ade16 100644 --- a/arch/x86/power/cpu.c +++ b/arch/x86/power/cpu.c @@ -252,6 +252,7 @@ static void notrace __restore_processor_state(struct saved_context *ctxt) fix_processor_context(); do_fpu_end(); + tsc_verify_tsc_adjust(true); x86_platform.restore_sched_clock_state(); mtrr_bp_restore(); perf_restore_debug_store(); ^ permalink raw reply related [flat|nested] 22+ messages in thread
* [patch 2/2] x86/tsc: Force TSC_ADJUST register to value >= zero 2016-12-13 13:14 [patch 0/2] tsc/adjust: Cure suspend/resume issues and prevent TSC deadline timer irq storm Thomas Gleixner 2016-12-13 13:14 ` [patch 1/2] x86/tsc: Validate TSC_ADJUST after resume Thomas Gleixner @ 2016-12-13 13:14 ` Thomas Gleixner 2016-12-13 13:43 ` Peter Zijlstra ` (2 more replies) 2016-12-13 16:34 ` [patch 0/2] tsc/adjust: Cure suspend/resume issues and prevent TSC deadline timer irq storm Roland Scheidegger 2 siblings, 3 replies; 22+ messages in thread From: Thomas Gleixner @ 2016-12-13 13:14 UTC (permalink / raw) To: LKML Cc: x86, Peter Zijlstra, Borislav Petkov, Bruce Schlobohm, Roland Scheidegger, Kevin Stanton, Allen Hung [-- Attachment #1: x86-tsc--Force-TSC_ADJUST-register-to-zero.patch --] [-- Type: text/plain, Size: 7452 bytes --] Roland reported that his DELL T5810 sports a value add BIOS which completely wreckages the TSC. The squirmware [(TM) Ingo Molnar] boots with random negative TSC_ADJUST values, different on all CPUs. That renders the TSC useless because the sycnchronization check fails. Roland tested the new TSC_ADJUST mechanism. While it manages to readjust the TSCs he needs to disable the TSC deadline timer, otherwise the machine just stops booting. Deeper investigation unearthed that the TSC deadline timer is sensitive to the TSC_ADJUST value. Writing TSC_ADJUST to a negative value results in an interrupt storm caused by the TSC deadline timer. This does not make any sense and it's hard to imagine what kind of hardware wreckage is behind that misfeature, but it's reliably reproducible on other systems which have TSC_ADJUST and TSC deadline timer. While it would be understandable that a big enough negative value which moves the resulting TSC readout into the negative space could have the described effect, this happens even with a adjust value of -1, which keeps the TSC readout definitely in the positive space. The compare register for the TSC deadline timer is set to a positive value larger than the TSC, but despite not having reached the deadline the interrupt is raised immediately. If this happens on the boot CPU, then the machine dies silently because this setup happens before the NMI watchdog is armed. Further experiments showed that any other adjustment of TSC_ADJUST works as expected as long as it stays in the positive range. The direction of the adjustment has no influence either. Yet another proof for the theory that timers are designed by janitors and the underlying (obviously undocumented) mechanisms which allow BIOSes to wreckage them are considered a feature. Well done Intel - NOT! To address this wreckage add the following sanity measures: - If the TSC_ADJUST value on the boot cpu is not 0, set it to 0 - If the TSC_ADJUST value on any cpu is negative, set it to 0 - Prevent the cross package synchronization mechanism from setting negative TSC_ADJUST values. Reported-by: Roland Scheidegger <rscheidegger_lists@hispeed.ch> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- arch/x86/include/asm/tsc.h | 4 +-- arch/x86/kernel/tsc.c | 2 - arch/x86/kernel/tsc_sync.c | 55 +++++++++++++++++++++++++++++++-------------- 3 files changed, 42 insertions(+), 19 deletions(-) --- a/arch/x86/include/asm/tsc.h +++ b/arch/x86/include/asm/tsc.h @@ -46,12 +46,12 @@ extern int tsc_clocksource_reliable; * all CPUs/cores: */ #ifdef CONFIG_X86_TSC -extern bool tsc_store_and_check_tsc_adjust(void); +extern bool tsc_store_and_check_tsc_adjust(bool bootcpu); extern void tsc_verify_tsc_adjust(bool resume); extern void check_tsc_sync_source(int cpu); extern void check_tsc_sync_target(void); #else -static inline bool tsc_store_and_check_tsc_adjust(void) { return false; } +static inline bool tsc_store_and_check_tsc_adjust(bool bootcpu) { return false; } static inline void tsc_verify_tsc_adjust(bool resume) { } static inline void check_tsc_sync_source(int cpu) { } static inline void check_tsc_sync_target(void) { } --- a/arch/x86/kernel/tsc.c +++ b/arch/x86/kernel/tsc.c @@ -1386,7 +1386,7 @@ void __init tsc_init(void) if (unsynchronized_tsc()) mark_tsc_unstable("TSCs unsynchronized"); else - tsc_store_and_check_tsc_adjust(); + tsc_store_and_check_tsc_adjust(true); check_system_tsc_reliable(); --- a/arch/x86/kernel/tsc_sync.c +++ b/arch/x86/kernel/tsc_sync.c @@ -58,8 +58,33 @@ void tsc_verify_tsc_adjust(bool resume) } } +static void tsc_sanitize_first_cpu(struct tsc_adjust *cur, s64 bootval, + unsigned int cpu, bool bootcpu) +{ + /* + * First online CPU in a package stores the boot value in the + * adjustment value. This value might change later via the sync + * mechanism. If that fails we still can yell about boot values not + * being consistent. + * + * On the boot cpu we just force set the ADJUST value to 0 if it's + * non zero. We don't do that on non boot cpus because physical + * hotplug should have set the ADJUST register to a value > 0 so + * the TSC is in sync with the already running cpus. + * + * But we always force positive ADJUST values. Otherwise the TSC + * deadline timer creates an interrupt storm. Sigh! + */ + if ((bootcpu && bootval != 0) || (!bootcpu && bootval < 0)) { + pr_warn("TSC ADJUST: CPU%u: %lld force to 0\n", cpu, bootval); + wrmsrl(MSR_IA32_TSC_ADJUST, 0); + bootval = 0; + } + cur->adjusted = bootval; +} + #ifndef CONFIG_SMP -bool __init tsc_store_and_check_tsc_adjust(void) +bool __init tsc_store_and_check_tsc_adjust(bool bootcpu) { struct tsc_adjust *cur = this_cpu_ptr(&tsc_adjust); s64 bootval; @@ -69,9 +94,8 @@ bool __init tsc_store_and_check_tsc_adju rdmsrl(MSR_IA32_TSC_ADJUST, bootval); cur->bootval = bootval; - cur->adjusted = bootval; cur->nextcheck = jiffies + HZ; - pr_info("TSC ADJUST: Boot CPU0: %lld\n", bootval); + tsc_sanitize_first_cpu(cur, bootval, smp_processor_id(), bootcpu); return false; } @@ -80,7 +104,7 @@ bool __init tsc_store_and_check_tsc_adju /* * Store and check the TSC ADJUST MSR if available */ -bool tsc_store_and_check_tsc_adjust(void) +bool tsc_store_and_check_tsc_adjust(bool bootcpu) { struct tsc_adjust *ref, *cur = this_cpu_ptr(&tsc_adjust); unsigned int refcpu, cpu = smp_processor_id(); @@ -98,22 +122,16 @@ bool tsc_store_and_check_tsc_adjust(void /* * Check whether this CPU is the first in a package to come up. In * this case do not check the boot value against another package - * because the package might have been physically hotplugged, where - * TSC_ADJUST is expected to be different. When called on the boot - * CPU topology_core_cpumask() might not be available yet. + * because the new package might have been physically hotplugged, + * where TSC_ADJUST is expected to be different. When called on the + * boot CPU topology_core_cpumask() might not be available yet. */ mask = topology_core_cpumask(cpu); refcpu = mask ? cpumask_any_but(mask, cpu) : nr_cpu_ids; if (refcpu >= nr_cpu_ids) { - /* - * First online CPU in a package stores the boot value in - * the adjustment value. This value might change later via - * the sync mechanism. If that fails we still can yell - * about boot values not being consistent. - */ - cur->adjusted = bootval; - pr_info_once("TSC ADJUST: Boot CPU%u: %lld\n", cpu, bootval); + tsc_sanitize_first_cpu(cur, bootval, smp_processor_id(), + bootcpu); return false; } @@ -366,7 +384,7 @@ void check_tsc_sync_target(void) * Store, verify and sanitize the TSC adjust register. If * successful skip the test. */ - if (tsc_store_and_check_tsc_adjust()) { + if (tsc_store_and_check_tsc_adjust(false)) { atomic_inc(&skip_test); return; } @@ -429,8 +447,13 @@ void check_tsc_sync_target(void) * that the warp is not longer detectable when the observed warp * value is used. In the worst case the adjustment needs to go * through a 3rd run for fine tuning. + * + * But we must make sure that the value doesn't become negative + * otherwise TSC deadline timer will create an interrupt storm. */ cur->adjusted += cur_max_warp; + if (cur->adjusted < 0) + cur->adjusted = 0; pr_warn("TSC ADJUST compensate: CPU%u observed %lld warp. Adjust: %lld\n", cpu, cur_max_warp, cur->adjusted); ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch 2/2] x86/tsc: Force TSC_ADJUST register to value >= zero 2016-12-13 13:14 ` [patch 2/2] x86/tsc: Force TSC_ADJUST register to value >= zero Thomas Gleixner @ 2016-12-13 13:43 ` Peter Zijlstra 2016-12-13 15:49 ` Thomas Gleixner 2016-12-15 10:53 ` [tip:x86/timers] " tip-bot for Thomas Gleixner 2016-12-16 11:46 ` [patch 2/2] " Thomas Gleixner 2 siblings, 1 reply; 22+ messages in thread From: Peter Zijlstra @ 2016-12-13 13:43 UTC (permalink / raw) To: Thomas Gleixner Cc: LKML, x86, Borislav Petkov, Bruce Schlobohm, Roland Scheidegger, Kevin Stanton, Allen Hung On Tue, Dec 13, 2016 at 01:14:17PM -0000, Thomas Gleixner wrote: > @@ -429,8 +447,13 @@ void check_tsc_sync_target(void) > * that the warp is not longer detectable when the observed warp > * value is used. In the worst case the adjustment needs to go > * through a 3rd run for fine tuning. > + * > + * But we must make sure that the value doesn't become negative > + * otherwise TSC deadline timer will create an interrupt storm. > */ > cur->adjusted += cur_max_warp; > + if (cur->adjusted < 0) > + cur->adjusted = 0; > So depending on how horrid we want to make this we could work around this by increasing the ADJUST of the other CPUs by the amount we're negative. But yes, yuck. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch 2/2] x86/tsc: Force TSC_ADJUST register to value >= zero 2016-12-13 13:43 ` Peter Zijlstra @ 2016-12-13 15:49 ` Thomas Gleixner 0 siblings, 0 replies; 22+ messages in thread From: Thomas Gleixner @ 2016-12-13 15:49 UTC (permalink / raw) To: Peter Zijlstra Cc: LKML, x86, Borislav Petkov, Bruce Schlobohm, Roland Scheidegger, Kevin Stanton, Allen Hung On Tue, 13 Dec 2016, Peter Zijlstra wrote: > On Tue, Dec 13, 2016 at 01:14:17PM -0000, Thomas Gleixner wrote: > > @@ -429,8 +447,13 @@ void check_tsc_sync_target(void) > > * that the warp is not longer detectable when the observed warp > > * value is used. In the worst case the adjustment needs to go > > * through a 3rd run for fine tuning. > > + * > > + * But we must make sure that the value doesn't become negative > > + * otherwise TSC deadline timer will create an interrupt storm. > > */ > > cur->adjusted += cur_max_warp; > > + if (cur->adjusted < 0) > > + cur->adjusted = 0; > > > > So depending on how horrid we want to make this we could work around > this by increasing the ADJUST of the other CPUs by the amount we're > negative. That means to increase all CPUs in the package of the other CPU by that amount .... > But yes, yuck. Indeed. tglx ^ permalink raw reply [flat|nested] 22+ messages in thread
* [tip:x86/timers] x86/tsc: Force TSC_ADJUST register to value >= zero 2016-12-13 13:14 ` [patch 2/2] x86/tsc: Force TSC_ADJUST register to value >= zero Thomas Gleixner 2016-12-13 13:43 ` Peter Zijlstra @ 2016-12-15 10:53 ` tip-bot for Thomas Gleixner 2016-12-16 11:46 ` [patch 2/2] " Thomas Gleixner 2 siblings, 0 replies; 22+ messages in thread From: tip-bot for Thomas Gleixner @ 2016-12-15 10:53 UTC (permalink / raw) To: linux-tip-commits Cc: linux-kernel, bp, bruce.schlobohm, mingo, tglx, allen_hung, rscheidegger_lists, peterz, hpa, kevin.b.stanton Commit-ID: 5bae156241e05d25171b18ee43e49f103c3f8097 Gitweb: http://git.kernel.org/tip/5bae156241e05d25171b18ee43e49f103c3f8097 Author: Thomas Gleixner <tglx@linutronix.de> AuthorDate: Tue, 13 Dec 2016 13:14:17 +0000 Committer: Thomas Gleixner <tglx@linutronix.de> CommitDate: Thu, 15 Dec 2016 11:44:29 +0100 x86/tsc: Force TSC_ADJUST register to value >= zero Roland reported that his DELL T5810 sports a value add BIOS which completely wreckages the TSC. The squirmware [(TM) Ingo Molnar] boots with random negative TSC_ADJUST values, different on all CPUs. That renders the TSC useless because the sycnchronization check fails. Roland tested the new TSC_ADJUST mechanism. While it manages to readjust the TSCs he needs to disable the TSC deadline timer, otherwise the machine just stops booting. Deeper investigation unearthed that the TSC deadline timer is sensitive to the TSC_ADJUST value. Writing TSC_ADJUST to a negative value results in an interrupt storm caused by the TSC deadline timer. This does not make any sense and it's hard to imagine what kind of hardware wreckage is behind that misfeature, but it's reliably reproducible on other systems which have TSC_ADJUST and TSC deadline timer. While it would be understandable that a big enough negative value which moves the resulting TSC readout into the negative space could have the described effect, this happens even with a adjust value of -1, which keeps the TSC readout definitely in the positive space. The compare register for the TSC deadline timer is set to a positive value larger than the TSC, but despite not having reached the deadline the interrupt is raised immediately. If this happens on the boot CPU, then the machine dies silently because this setup happens before the NMI watchdog is armed. Further experiments showed that any other adjustment of TSC_ADJUST works as expected as long as it stays in the positive range. The direction of the adjustment has no influence either. See the lkml link for further analysis. Yet another proof for the theory that timers are designed by janitors and the underlying (obviously undocumented) mechanisms which allow BIOSes to wreckage them are considered a feature. Well done Intel - NOT! To address this wreckage add the following sanity measures: - If the TSC_ADJUST value on the boot cpu is not 0, set it to 0 - If the TSC_ADJUST value on any cpu is negative, set it to 0 - Prevent the cross package synchronization mechanism from setting negative TSC_ADJUST values. Reported-and-tested-by: Roland Scheidegger <rscheidegger_lists@hispeed.ch> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Bruce Schlobohm <bruce.schlobohm@intel.com> Cc: Kevin Stanton <kevin.b.stanton@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Allen Hung <allen_hung@dell.com> Cc: Borislav Petkov <bp@alien8.de> Link: http://lkml.kernel.org/r/20161213131211.397588033@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- arch/x86/include/asm/tsc.h | 4 ++-- arch/x86/kernel/tsc.c | 2 +- arch/x86/kernel/tsc_sync.c | 55 ++++++++++++++++++++++++++++++++-------------- 3 files changed, 42 insertions(+), 19 deletions(-) diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h index 372ad0c..abb1fdc 100644 --- a/arch/x86/include/asm/tsc.h +++ b/arch/x86/include/asm/tsc.h @@ -46,12 +46,12 @@ extern int tsc_clocksource_reliable; * all CPUs/cores: */ #ifdef CONFIG_X86_TSC -extern bool tsc_store_and_check_tsc_adjust(void); +extern bool tsc_store_and_check_tsc_adjust(bool bootcpu); extern void tsc_verify_tsc_adjust(bool resume); extern void check_tsc_sync_source(int cpu); extern void check_tsc_sync_target(void); #else -static inline bool tsc_store_and_check_tsc_adjust(void) { return false; } +static inline bool tsc_store_and_check_tsc_adjust(bool bootcpu) { return false; } static inline void tsc_verify_tsc_adjust(bool resume) { } static inline void check_tsc_sync_source(int cpu) { } static inline void check_tsc_sync_target(void) { } diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c index bfb541a..0aed75a 100644 --- a/arch/x86/kernel/tsc.c +++ b/arch/x86/kernel/tsc.c @@ -1386,7 +1386,7 @@ void __init tsc_init(void) if (unsynchronized_tsc()) mark_tsc_unstable("TSCs unsynchronized"); else - tsc_store_and_check_tsc_adjust(); + tsc_store_and_check_tsc_adjust(true); check_system_tsc_reliable(); diff --git a/arch/x86/kernel/tsc_sync.c b/arch/x86/kernel/tsc_sync.c index 94f2ce5..9151f0c 100644 --- a/arch/x86/kernel/tsc_sync.c +++ b/arch/x86/kernel/tsc_sync.c @@ -58,8 +58,33 @@ void tsc_verify_tsc_adjust(bool resume) } } +static void tsc_sanitize_first_cpu(struct tsc_adjust *cur, s64 bootval, + unsigned int cpu, bool bootcpu) +{ + /* + * First online CPU in a package stores the boot value in the + * adjustment value. This value might change later via the sync + * mechanism. If that fails we still can yell about boot values not + * being consistent. + * + * On the boot cpu we just force set the ADJUST value to 0 if it's + * non zero. We don't do that on non boot cpus because physical + * hotplug should have set the ADJUST register to a value > 0 so + * the TSC is in sync with the already running cpus. + * + * But we always force positive ADJUST values. Otherwise the TSC + * deadline timer creates an interrupt storm. Sigh! + */ + if ((bootcpu && bootval != 0) || (!bootcpu && bootval < 0)) { + pr_warn("TSC ADJUST: CPU%u: %lld force to 0\n", cpu, bootval); + wrmsrl(MSR_IA32_TSC_ADJUST, 0); + bootval = 0; + } + cur->adjusted = bootval; +} + #ifndef CONFIG_SMP -bool __init tsc_store_and_check_tsc_adjust(void) +bool __init tsc_store_and_check_tsc_adjust(bool bootcpu) { struct tsc_adjust *cur = this_cpu_ptr(&tsc_adjust); s64 bootval; @@ -69,9 +94,8 @@ bool __init tsc_store_and_check_tsc_adjust(void) rdmsrl(MSR_IA32_TSC_ADJUST, bootval); cur->bootval = bootval; - cur->adjusted = bootval; cur->nextcheck = jiffies + HZ; - pr_info("TSC ADJUST: Boot CPU0: %lld\n", bootval); + tsc_sanitize_first_cpu(cur, bootval, smp_processor_id(), bootcpu); return false; } @@ -80,7 +104,7 @@ bool __init tsc_store_and_check_tsc_adjust(void) /* * Store and check the TSC ADJUST MSR if available */ -bool tsc_store_and_check_tsc_adjust(void) +bool tsc_store_and_check_tsc_adjust(bool bootcpu) { struct tsc_adjust *ref, *cur = this_cpu_ptr(&tsc_adjust); unsigned int refcpu, cpu = smp_processor_id(); @@ -98,22 +122,16 @@ bool tsc_store_and_check_tsc_adjust(void) /* * Check whether this CPU is the first in a package to come up. In * this case do not check the boot value against another package - * because the package might have been physically hotplugged, where - * TSC_ADJUST is expected to be different. When called on the boot - * CPU topology_core_cpumask() might not be available yet. + * because the new package might have been physically hotplugged, + * where TSC_ADJUST is expected to be different. When called on the + * boot CPU topology_core_cpumask() might not be available yet. */ mask = topology_core_cpumask(cpu); refcpu = mask ? cpumask_any_but(mask, cpu) : nr_cpu_ids; if (refcpu >= nr_cpu_ids) { - /* - * First online CPU in a package stores the boot value in - * the adjustment value. This value might change later via - * the sync mechanism. If that fails we still can yell - * about boot values not being consistent. - */ - cur->adjusted = bootval; - pr_info_once("TSC ADJUST: Boot CPU%u: %lld\n", cpu, bootval); + tsc_sanitize_first_cpu(cur, bootval, smp_processor_id(), + bootcpu); return false; } @@ -366,7 +384,7 @@ void check_tsc_sync_target(void) * Store, verify and sanitize the TSC adjust register. If * successful skip the test. */ - if (tsc_store_and_check_tsc_adjust()) { + if (tsc_store_and_check_tsc_adjust(false)) { atomic_inc(&skip_test); return; } @@ -429,8 +447,13 @@ retry: * that the warp is not longer detectable when the observed warp * value is used. In the worst case the adjustment needs to go * through a 3rd run for fine tuning. + * + * But we must make sure that the value doesn't become negative + * otherwise TSC deadline timer will create an interrupt storm. */ cur->adjusted += cur_max_warp; + if (cur->adjusted < 0) + cur->adjusted = 0; pr_warn("TSC ADJUST compensate: CPU%u observed %lld warp. Adjust: %lld\n", cpu, cur_max_warp, cur->adjusted); ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [patch 2/2] x86/tsc: Force TSC_ADJUST register to value >= zero 2016-12-13 13:14 ` [patch 2/2] x86/tsc: Force TSC_ADJUST register to value >= zero Thomas Gleixner 2016-12-13 13:43 ` Peter Zijlstra 2016-12-15 10:53 ` [tip:x86/timers] " tip-bot for Thomas Gleixner @ 2016-12-16 11:46 ` Thomas Gleixner 2016-12-16 11:52 ` Ingo Molnar 2016-12-16 13:33 ` Thomas Gleixner 2 siblings, 2 replies; 22+ messages in thread From: Thomas Gleixner @ 2016-12-16 11:46 UTC (permalink / raw) To: LKML Cc: x86, Peter Zijlstra, Borislav Petkov, Bruce Schlobohm, Roland Scheidegger, Kevin Stanton, Allen Hung, stable On Tue, 13 Dec 2016, Thomas Gleixner wrote: > Roland reported that his DELL T5810 sports a value add BIOS which > completely wreckages the TSC. The squirmware [(TM) Ingo Molnar] boots with > random negative TSC_ADJUST values, different on all CPUs. That renders the > TSC useless because the sycnchronization check fails. While everyone assumed that this is the usual DELL squirmware problem, I have to say it's not. Just got my hands on a Skylake based Lenovo S510 box and it shows the same feature: TSC ADJUST: CPU0: -10123656703215 CPU1: -10123656796701 CPU2: -10123656797460 CPU3: -10123656798366 Which causes the TSC to be out of sync on a stock upstream kernel and the TSC deadline timer wreckage is happening on that machine as well. I'm pretty sure, that this well thought out feature to 'hide power on time' from TSC has not been independently 'invented' by DELL and Lenovo BIOS tinkerers. I rather have the impression that this is an advisory or feature kit from some other entity. Whoever came up with this misfeature at Intel and/or Microsoft (sorry, I could not come up with any other suspects) should be promoted to run the 'Linux on feature-plagued systems' hot line. As this seems to be more wide spread than we thought initially, we have to think about a solution for stable kernels, especially 4.9. And distros will have to think about that as well.... We have two options: 1) Disable TSC deadline timer by default and force users with sane machines to enable it on the kernel command line. Upside: Very small patch Downside: Degrades existing setups on sane machines, keeps TSC unusable on affected machines. We have no idea what other hidden side effects the TSC_ADJUST tinkering has. If there are any, they ain't be nice ones. 2) Push the whole TSC_ADJUST sanitizing machinery into stable Upside: Does not affect sane machines and gives a benefit to users of affected machines Downside: Rather large patch, but not that risky either. Needs a few eyes and good test coverage though Thoughts? tglx ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch 2/2] x86/tsc: Force TSC_ADJUST register to value >= zero 2016-12-16 11:46 ` [patch 2/2] " Thomas Gleixner @ 2016-12-16 11:52 ` Ingo Molnar 2016-12-16 11:53 ` Thomas Gleixner 2016-12-16 13:33 ` Thomas Gleixner 1 sibling, 1 reply; 22+ messages in thread From: Ingo Molnar @ 2016-12-16 11:52 UTC (permalink / raw) To: Thomas Gleixner Cc: LKML, x86, Peter Zijlstra, Borislav Petkov, Bruce Schlobohm, Roland Scheidegger, Kevin Stanton, Allen Hung, stable * Thomas Gleixner <tglx@linutronix.de> wrote: > We have two options: > > 1) Disable TSC deadline timer by default and force users with sane machines > to enable it on the kernel command line. > > Upside: Very small patch > > Downside: Degrades existing setups on sane machines, keeps TSC unusable > on affected machines. We have no idea what other hidden side > effects the TSC_ADJUST tinkering has. If there are any, they > ain't be nice ones. > > 2) Push the whole TSC_ADJUST sanitizing machinery into stable > > Upside: Does not affect sane machines and gives a benefit to users of > affected machines > > Downside: Rather large patch, but not that risky either. Needs a few > eyes and good test coverage though > > Thoughts? I'd go for #2, because #1 is essentially turning it off for almost everyone. We can still do #1 and push it back to -stable as well if #2 fails. But I'd suggest we delay the stable backporting until it's been upstream a bit. Thanks, Ingo ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch 2/2] x86/tsc: Force TSC_ADJUST register to value >= zero 2016-12-16 11:52 ` Ingo Molnar @ 2016-12-16 11:53 ` Thomas Gleixner 0 siblings, 0 replies; 22+ messages in thread From: Thomas Gleixner @ 2016-12-16 11:53 UTC (permalink / raw) To: Ingo Molnar Cc: LKML, x86, Peter Zijlstra, Borislav Petkov, Bruce Schlobohm, Roland Scheidegger, Kevin Stanton, Allen Hung, stable On Fri, 16 Dec 2016, Ingo Molnar wrote: > * Thomas Gleixner <tglx@linutronix.de> wrote: > > > We have two options: > > > > 1) Disable TSC deadline timer by default and force users with sane machines > > to enable it on the kernel command line. > > > > Upside: Very small patch > > > > Downside: Degrades existing setups on sane machines, keeps TSC unusable > > on affected machines. We have no idea what other hidden side > > effects the TSC_ADJUST tinkering has. If there are any, they > > ain't be nice ones. > > > > 2) Push the whole TSC_ADJUST sanitizing machinery into stable > > > > Upside: Does not affect sane machines and gives a benefit to users of > > affected machines > > > > Downside: Rather large patch, but not that risky either. Needs a few > > eyes and good test coverage though > > > > Thoughts? > > I'd go for #2, because #1 is essentially turning it off for almost everyone. > > We can still do #1 and push it back to -stable as well if #2 fails. > > But I'd suggest we delay the stable backporting until it's been upstream a bit. I agree. None of these patches is tagged stable. I just wanted to mention it so it can be discussed before distros/stable users are swamped with failure reports. Thanks, tglx ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch 2/2] x86/tsc: Force TSC_ADJUST register to value >= zero 2016-12-16 11:46 ` [patch 2/2] " Thomas Gleixner 2016-12-16 11:52 ` Ingo Molnar @ 2016-12-16 13:33 ` Thomas Gleixner 1 sibling, 0 replies; 22+ messages in thread From: Thomas Gleixner @ 2016-12-16 13:33 UTC (permalink / raw) To: LKML Cc: x86, Peter Zijlstra, Borislav Petkov, Bruce Schlobohm, Roland Scheidegger, Kevin Stanton, Allen Hung, stable On Fri, 16 Dec 2016, Thomas Gleixner wrote: > On Tue, 13 Dec 2016, Thomas Gleixner wrote: > > Roland reported that his DELL T5810 sports a value add BIOS which > > completely wreckages the TSC. The squirmware [(TM) Ingo Molnar] boots with > > random negative TSC_ADJUST values, different on all CPUs. That renders the > > TSC useless because the sycnchronization check fails. > > While everyone assumed that this is the usual DELL squirmware problem, I > have to say it's not. > > Just got my hands on a Skylake based Lenovo S510 box and it shows the same > feature: > > TSC ADJUST: CPU0: -10123656703215 > CPU1: -10123656796701 > CPU2: -10123656797460 > CPU3: -10123656798366 > > Which causes the TSC to be out of sync on a stock upstream kernel and the > TSC deadline timer wreckage is happening on that machine as well. > > I'm pretty sure, that this well thought out feature to 'hide power on time' > from TSC has not been independently 'invented' by DELL and Lenovo BIOS > tinkerers. > > I rather have the impression that this is an advisory or feature kit from > some other entity. Whoever came up with this misfeature at Intel and/or > Microsoft (sorry, I could not come up with any other suspects) should be > promoted to run the 'Linux on feature-plagued systems' hot line. Just to add another data point here. On cold boot the TSC_ADJUST value on that LENOVO machine is: -24534293, which is about 9ms. So assumed that the SDM is correct in this point and the counter starts at 0 after power on, then 9ms later might be right in that magic blob which does the low level bringup of CPUs. That comes from the CPU vendor and runs _BEFORE_ the system vendor BIOS can create havoc. Dealing with timers on x86 feels like a Sisyphean task. Thanks, tglx ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch 0/2] tsc/adjust: Cure suspend/resume issues and prevent TSC deadline timer irq storm 2016-12-13 13:14 [patch 0/2] tsc/adjust: Cure suspend/resume issues and prevent TSC deadline timer irq storm Thomas Gleixner 2016-12-13 13:14 ` [patch 1/2] x86/tsc: Validate TSC_ADJUST after resume Thomas Gleixner 2016-12-13 13:14 ` [patch 2/2] x86/tsc: Force TSC_ADJUST register to value >= zero Thomas Gleixner @ 2016-12-13 16:34 ` Roland Scheidegger 2016-12-13 16:46 ` Thomas Gleixner 2 siblings, 1 reply; 22+ messages in thread From: Roland Scheidegger @ 2016-12-13 16:34 UTC (permalink / raw) To: Thomas Gleixner, LKML Cc: x86, Peter Zijlstra, Borislav Petkov, Bruce Schlobohm, Kevin Stanton, Allen Hung Am 13.12.2016 um 14:14 schrieb Thomas Gleixner: > Roland reported interesting TSC ADJUST register wreckage on his DELL > machine, which seems to populate that MSR with a random number generator. FWIW, I thought about the actual values some more and I don't actually think they are all that random any more: the behavior is consistent with the bios trying to zero the TSC of all cpus. If I understand this right, writing a zero to TSC would cause somewhat small negative values in the TSC_ADJ register at boot time, and larger negative values at suspend time (at least if the TSC just stops when suspended and isn't reset) - exactly what I'm seeing. (And of course the different TSC_ADJ values would be because the bios is writing TSC without any thoughts of synchronization, just one cpu after another). > > Deeper investagation into fixing this wreckage unearthed another special > feature which is designed by Intel: Negative TSC adjuste values cause > interrupt storms on the TSC deadline timer. Further details in patch 2/2 This actually looks like quite a serious hw bug to me, shouldn't there be an errata for such a bug? And I still don't quite understand why the lockup doesn't happen after a warm boot, there must be something different there... (I didn't have the chance to test the patch yet.) Roland ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch 0/2] tsc/adjust: Cure suspend/resume issues and prevent TSC deadline timer irq storm 2016-12-13 16:34 ` [patch 0/2] tsc/adjust: Cure suspend/resume issues and prevent TSC deadline timer irq storm Roland Scheidegger @ 2016-12-13 16:46 ` Thomas Gleixner 2016-12-14 1:36 ` Roland Scheidegger 0 siblings, 1 reply; 22+ messages in thread From: Thomas Gleixner @ 2016-12-13 16:46 UTC (permalink / raw) To: Roland Scheidegger Cc: LKML, x86, Peter Zijlstra, Borislav Petkov, Bruce Schlobohm, Kevin Stanton, Allen Hung On Tue, 13 Dec 2016, Roland Scheidegger wrote: > Am 13.12.2016 um 14:14 schrieb Thomas Gleixner: > > Roland reported interesting TSC ADJUST register wreckage on his DELL > > machine, which seems to populate that MSR with a random number generator. > > FWIW, I thought about the actual values some more and I don't actually > think they are all that random any more: the behavior is consistent with > the bios trying to zero the TSC of all cpus. If I understand this right, > writing a zero to TSC would cause somewhat small negative values in the > TSC_ADJ register at boot time, and larger negative values at suspend > time (at least if the TSC just stops when suspended and isn't reset) - > exactly what I'm seeing. > (And of course the different TSC_ADJ values would be because the bios is > writing TSC without any thoughts of synchronization, just one cpu after > another). Yeah, that might be. Still it looks like random nonsense and definitely the BIOS developers did not follow the secrit boot protocol. > > Deeper investagation into fixing this wreckage unearthed another special > > feature which is designed by Intel: Negative TSC adjuste values cause > > interrupt storms on the TSC deadline timer. Further details in patch 2/2 > > This actually looks like quite a serious hw bug to me, shouldn't there > be an errata for such a bug? > > And I still don't quite understand why the lockup doesn't happen after a > warm boot, there must be something different there... What are the adjust values after a warm boot? Thanks, tglx ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch 0/2] tsc/adjust: Cure suspend/resume issues and prevent TSC deadline timer irq storm 2016-12-13 16:46 ` Thomas Gleixner @ 2016-12-14 1:36 ` Roland Scheidegger 2016-12-14 7:31 ` Thomas Gleixner 0 siblings, 1 reply; 22+ messages in thread From: Roland Scheidegger @ 2016-12-14 1:36 UTC (permalink / raw) To: Thomas Gleixner Cc: LKML, x86, Peter Zijlstra, Borislav Petkov, Bruce Schlobohm, Kevin Stanton, Allen Hung Am 13.12.2016 um 17:46 schrieb Thomas Gleixner: > On Tue, 13 Dec 2016, Roland Scheidegger wrote: > >> Am 13.12.2016 um 14:14 schrieb Thomas Gleixner: >>> Roland reported interesting TSC ADJUST register wreckage on his DELL >>> machine, which seems to populate that MSR with a random number generator. >> >> FWIW, I thought about the actual values some more and I don't actually >> think they are all that random any more: the behavior is consistent with >> the bios trying to zero the TSC of all cpus. If I understand this right, >> writing a zero to TSC would cause somewhat small negative values in the >> TSC_ADJ register at boot time, and larger negative values at suspend >> time (at least if the TSC just stops when suspended and isn't reset) - >> exactly what I'm seeing. >> (And of course the different TSC_ADJ values would be because the bios is >> writing TSC without any thoughts of synchronization, just one cpu after >> another). > > Yeah, that might be. Still it looks like random nonsense and definitely the > BIOS developers did not follow the secrit boot protocol. > >>> Deeper investagation into fixing this wreckage unearthed another special >>> feature which is designed by Intel: Negative TSC adjuste values cause >>> interrupt storms on the TSC deadline timer. Further details in patch 2/2 >> >> This actually looks like quite a serious hw bug to me, shouldn't there >> be an errata for such a bug? >> >> And I still don't quite understand why the lockup doesn't happen after a >> warm boot, there must be something different there... > > What are the adjust values after a warm boot? > So, after cold boot with a kernel which doesn't adjust TSCs, then warm boot I got: [ 0.000000] TSC ADJUST: CPU0: -602358264300 176072418728 [ 0.000000] TSC ADJUST: Boot CPU0: -602358264300 [ 0.172245] TSC ADJUST: CPU1: -602360207584 176587932558 [ 0.172245] TSC ADJUST differs: Reference CPU0: -602358264300 CPU1: -602360207584 [ 0.172246] TSC ADJUST synchronize: Reference CPU0: -602358264300 CPU1: -602360207584 [ 0.252663] TSC ADJUST: CPU2: -602359000822 176828627154 [ 0.252663] TSC ADJUST differs: Reference CPU0: -602358264300 CPU2: -602359000822 [ 0.252664] TSC ADJUST synchronize: Reference CPU0: -602358264300 CPU2: -602359000822 [ 0.337014] TSC ADJUST: CPU3: -602360177680 177081093132 [ 0.337014] TSC ADJUST differs: Reference CPU0: -602358264300 CPU3: -602360177680 [ 0.337015] TSC ADJUST synchronize: Reference CPU0: -602358264300 CPU3: -602360177680 and so on. Albeit after another reboot (some minutes later), it actually straight locked up again: TSC ADJUST: CPU1: -8257481427958 165112676430 TSC ADJUST differs: Reference CPU0: -8257479484330 CPU1: -8257481427958 TSC ADJUST synchronize: Reference CPU0: -8257479484330 CPU1: -8254781427958 TSC target sync skip ... smpboot: Target CPU is online So, actually I thought the TSC would get reset too on warm boot, but clearly looks like that isn't the case... But I don't know what's the difference between first and second reboot - the adjust values have just more magnitude, but otherwise even the direction of the adjustments and everything looks all the same (just like cold boot, which also looks all the same to me). Roland ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch 0/2] tsc/adjust: Cure suspend/resume issues and prevent TSC deadline timer irq storm 2016-12-14 1:36 ` Roland Scheidegger @ 2016-12-14 7:31 ` Thomas Gleixner 2016-12-14 20:59 ` Thomas Gleixner 0 siblings, 1 reply; 22+ messages in thread From: Thomas Gleixner @ 2016-12-14 7:31 UTC (permalink / raw) To: Roland Scheidegger Cc: LKML, x86, Peter Zijlstra, Borislav Petkov, Bruce Schlobohm, Kevin Stanton, Allen Hung On Wed, 14 Dec 2016, Roland Scheidegger wrote: > Am 13.12.2016 um 17:46 schrieb Thomas Gleixner: > > What are the adjust values after a warm boot? > > So, after cold boot with a kernel which doesn't adjust TSCs, then warm > boot I got: > [ 0.000000] TSC ADJUST: CPU0: -602358264300 176072418728 > [ 0.000000] TSC ADJUST: Boot CPU0: -602358264300 > [ 0.172245] TSC ADJUST: CPU1: -602360207584 176587932558 > [ 0.172245] TSC ADJUST differs: Reference CPU0: -602358264300 CPU1: > -602360207584 > [ 0.172246] TSC ADJUST synchronize: Reference CPU0: -602358264300 > CPU1: -602360207584 > [ 0.252663] TSC ADJUST: CPU2: -602359000822 176828627154 > [ 0.252663] TSC ADJUST differs: Reference CPU0: -602358264300 CPU2: > -602359000822 > [ 0.252664] TSC ADJUST synchronize: Reference CPU0: -602358264300 > CPU2: -602359000822 > [ 0.337014] TSC ADJUST: CPU3: -602360177680 177081093132 > [ 0.337014] TSC ADJUST differs: Reference CPU0: -602358264300 CPU3: > -602360177680 > [ 0.337015] TSC ADJUST synchronize: Reference CPU0: -602358264300 > CPU3: -602360177680 > > and so on. > > Albeit after another reboot (some minutes later), it actually straight > locked up again: > > TSC ADJUST: CPU1: -8257481427958 165112676430 > TSC ADJUST differs: Reference CPU0: -8257479484330 CPU1: -8257481427958 > TSC ADJUST synchronize: Reference CPU0: -8257479484330 CPU1: -8254781427958 > TSC target sync skip > ... > smpboot: Target CPU is online > > So, actually I thought the TSC would get reset too on warm boot, but > clearly looks like that isn't the case... > But I don't know what's the difference between first and second reboot - > the adjust values have just more magnitude, but otherwise even the > direction of the adjustments and everything looks all the same (just > like cold boot, which also looks all the same to me). I haven't found a pattern for the lockups yet and we have to wait for Intel to provide useful information about that issue. All we know so far is that negative adjust values are dangerous. Could you test the two patches on top of tip x86/timers branch so we can make progress with that whole disaster while waiting for Intel to come forth with a proper explanation? Thanks, tglx ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch 0/2] tsc/adjust: Cure suspend/resume issues and prevent TSC deadline timer irq storm 2016-12-14 7:31 ` Thomas Gleixner @ 2016-12-14 20:59 ` Thomas Gleixner 2016-12-14 21:40 ` Thomas Gleixner 0 siblings, 1 reply; 22+ messages in thread From: Thomas Gleixner @ 2016-12-14 20:59 UTC (permalink / raw) To: Roland Scheidegger Cc: LKML, x86, Peter Zijlstra, Borislav Petkov, Bruce Schlobohm, Kevin Stanton, Allen Hung On Wed, 14 Dec 2016, Thomas Gleixner wrote: > On Wed, 14 Dec 2016, Roland Scheidegger wrote: > > Am 13.12.2016 um 17:46 schrieb Thomas Gleixner: > > > What are the adjust values after a warm boot? > > > > So, after cold boot with a kernel which doesn't adjust TSCs, then warm > > boot I got: > > [ 0.000000] TSC ADJUST: CPU0: -602358264300 176072418728 > > [ 0.000000] TSC ADJUST: Boot CPU0: -602358264300 > > [ 0.172245] TSC ADJUST: CPU1: -602360207584 176587932558 > > [ 0.172245] TSC ADJUST differs: Reference CPU0: -602358264300 CPU1: > > -602360207584 > > [ 0.172246] TSC ADJUST synchronize: Reference CPU0: -602358264300 > > CPU1: -602360207584 > > [ 0.252663] TSC ADJUST: CPU2: -602359000822 176828627154 > > [ 0.252663] TSC ADJUST differs: Reference CPU0: -602358264300 CPU2: > > -602359000822 > > [ 0.252664] TSC ADJUST synchronize: Reference CPU0: -602358264300 > > CPU2: -602359000822 > > [ 0.337014] TSC ADJUST: CPU3: -602360177680 177081093132 > > [ 0.337014] TSC ADJUST differs: Reference CPU0: -602358264300 CPU3: > > -602360177680 > > [ 0.337015] TSC ADJUST synchronize: Reference CPU0: -602358264300 > > CPU3: -602360177680 > > > > and so on. > > > > Albeit after another reboot (some minutes later), it actually straight > > locked up again: > > > > TSC ADJUST: CPU1: -8257481427958 165112676430 > > TSC ADJUST differs: Reference CPU0: -8257479484330 CPU1: -8257481427958 > > TSC ADJUST synchronize: Reference CPU0: -8257479484330 CPU1: -8254781427958 > > TSC target sync skip > > ... > > smpboot: Target CPU is online > > > > So, actually I thought the TSC would get reset too on warm boot, but > > clearly looks like that isn't the case... > > But I don't know what's the difference between first and second reboot - > > the adjust values have just more magnitude, but otherwise even the > > direction of the adjustments and everything looks all the same (just > > like cold boot, which also looks all the same to me). > > I haven't found a pattern for the lockups yet and we have to wait for Intel > to provide useful information about that issue. All we know so far is that > negative adjust values are dangerous. Did some futher investigation. The values which cause the interrupt storms have very clear identifiable points which reliably reproduce: Positive space, results in timer not firing anymore - at least not in a time frame you are willing to wait for. 0x0000 0000 8000 0000 Negative space, results in an interrupt storm. 0xffff ffff 0000 0000 0xffff fffe 0000 0000 0xffff fffd 0000 0000 0xffff fffc 0000 0000 0xffff fffb 0000 0000 .... These points are independent of the underlying counter value (cold boot, warm boot) and even reproduce after hours of power on reliably. And looking at the values makes me wonder about 32bit vs. 64bit wreckage combined with sign expansion done wrong. Im really impressed! In the negative space there is something else going on which is dependent on the counter value. Right after cold boot the space is closer to zero than after hours of power on. So the approach of forbidding negative values is definitely not wrong. Thanks, tglx ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch 0/2] tsc/adjust: Cure suspend/resume issues and prevent TSC deadline timer irq storm 2016-12-14 20:59 ` Thomas Gleixner @ 2016-12-14 21:40 ` Thomas Gleixner 2016-12-14 22:54 ` Roland Scheidegger 0 siblings, 1 reply; 22+ messages in thread From: Thomas Gleixner @ 2016-12-14 21:40 UTC (permalink / raw) To: Roland Scheidegger Cc: LKML, x86, Peter Zijlstra, Borislav Petkov, Bruce Schlobohm, Kevin Stanton, Allen Hung On Wed, 14 Dec 2016, Thomas Gleixner wrote: > Positive space, results in timer not firing anymore - at least not in a > time frame you are willing to wait for. > > 0x0000 0000 8000 0000 > > Negative space, results in an interrupt storm. > > 0xffff ffff 0000 0000 > 0xffff fffe 0000 0000 > 0xffff fffd 0000 0000 > 0xffff fffc 0000 0000 > 0xffff fffb 0000 0000 > .... > > These points are independent of the underlying counter value (cold boot, > warm boot) and even reproduce after hours of power on reliably. > > And looking at the values makes me wonder about 32bit vs. 64bit wreckage > combined with sign expansion done wrong. Im really impressed! And the whole mess stems from the fact that the deadline is not as one would expect simply compared against the sum of the counter and the adjust MSR. No, they subtract the adjust value from the MSR when you write the deadline and latch the result to compare it against the counter. So when the following happens: ADJUST = 0 RDTSC = 10000000 DEADLINE = 11000000 ADJUST = 1000000 INTERRUPT RDTSC = 12000000 DEADLINE = 13000000 ADJUST = 0 INTERRUPT RDTSC = 12000000 So depending on the direction of the adjustment the timer fires late or early. Combined with that math wreckage this is a complete disaster. And of course nothing is documented anywhere and the SDM is outright wrong: 10.5.4.1 TSC-Deadline Mode The processor generates a timer interrupt when the value of time-stamp counter is greater than or equal to that of IA32_TSC_DEADLINE. It then disarms the timer and clear the IA32_TSC_DEADLINE MSR. (Both the time-stamp counter and the IA32_TSC_DEADLINE MSR are 64-bit unsigned integers.) See the example above. 1200000 is neither equal nor greater than 1300000, at least not in my universe. I serioulsy doubt that Intel manages it to design at least ONE functional non broken timer before I retire. Thanks, tglx ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch 0/2] tsc/adjust: Cure suspend/resume issues and prevent TSC deadline timer irq storm 2016-12-14 21:40 ` Thomas Gleixner @ 2016-12-14 22:54 ` Roland Scheidegger 2016-12-15 9:31 ` Thomas Gleixner 0 siblings, 1 reply; 22+ messages in thread From: Roland Scheidegger @ 2016-12-14 22:54 UTC (permalink / raw) To: Thomas Gleixner Cc: LKML, x86, Peter Zijlstra, Borislav Petkov, Bruce Schlobohm, Kevin Stanton, Allen Hung Am 14.12.2016 um 22:40 schrieb Thomas Gleixner: > On Wed, 14 Dec 2016, Thomas Gleixner wrote: >> Positive space, results in timer not firing anymore - at least not in a >> time frame you are willing to wait for. >> >> 0x0000 0000 8000 0000 >> >> Negative space, results in an interrupt storm. >> >> 0xffff ffff 0000 0000 >> 0xffff fffe 0000 0000 >> 0xffff fffd 0000 0000 >> 0xffff fffc 0000 0000 >> 0xffff fffb 0000 0000 >> .... >> >> These points are independent of the underlying counter value (cold boot, >> warm boot) and even reproduce after hours of power on reliably. >> >> And looking at the values makes me wonder about 32bit vs. 64bit wreckage >> combined with sign expansion done wrong. Im really impressed! > > And the whole mess stems from the fact that the deadline is not as one > would expect simply compared against the sum of the counter and the adjust > MSR. Why would it be compared against the sum? As far as I can tell the adjust value should never come into play when using deadline timer (other than indirectly because the TSC would change). (And I'd guess better avoid an armed deadline timer while changing TSC_ADJ...) In any case, I've tested the two patches on top of x86-timers and they work just fine - all TSC_ADJ values get set to zero both on boot and resume, no lockups, and tsc clocksource active, with some whining in the log of course :-). So, Tested-by: Roland Scheidegger <rscheidegger_lists@hispeed.ch> > No, they subtract the adjust value from the MSR when you write the deadline > and latch the result to compare it against the counter. > > So when the following happens: > > ADJUST = 0 > RDTSC = 10000000 > DEADLINE = 11000000 > > ADJUST = 1000000 > > INTERRUPT > RDTSC = 12000000 > > DEADLINE = 13000000 > > ADJUST = 0 > > INTERRUPT > RDTSC = 12000000 > > So depending on the direction of the adjustment the timer fires late or > early. > > Combined with that math wreckage this is a complete disaster. And of course > nothing is documented anywhere and the SDM is outright wrong: > > 10.5.4.1 TSC-Deadline Mode > > The processor generates a timer interrupt when the value of time-stamp > counter is greater than or equal to that of IA32_TSC_DEADLINE. It then > disarms the timer and clear the IA32_TSC_DEADLINE MSR. (Both the time-stamp > counter and the IA32_TSC_DEADLINE MSR are 64-bit unsigned integers.) > > See the example above. 1200000 is neither equal nor greater than 1300000, at > least not in my universe. > > I serioulsy doubt that Intel manages it to design at least ONE functional > non broken timer before I retire. > > Thanks, > > tglx > ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [patch 0/2] tsc/adjust: Cure suspend/resume issues and prevent TSC deadline timer irq storm 2016-12-14 22:54 ` Roland Scheidegger @ 2016-12-15 9:31 ` Thomas Gleixner 2017-01-26 23:40 ` Stanton, Kevin B 0 siblings, 1 reply; 22+ messages in thread From: Thomas Gleixner @ 2016-12-15 9:31 UTC (permalink / raw) To: Roland Scheidegger Cc: LKML, x86, Peter Zijlstra, Borislav Petkov, Bruce Schlobohm, Kevin Stanton, Allen Hung On Wed, 14 Dec 2016, Roland Scheidegger wrote: > Am 14.12.2016 um 22:40 schrieb Thomas Gleixner: > > And the whole mess stems from the fact that the deadline is not as one > > would expect simply compared against the sum of the counter and the adjust > > MSR. > Why would it be compared against the sum? As far as I can tell the adjust > value should never come into play when using deadline timer (other than > indirectly because the TSC would change). See the SDM. It suggests that the deadline is compared to the TSC value. I don't care how it is implemented, but I very much care about it being documented in the way it is implemented, which is obviously not the case. And even if the adjust value is subtracted once when the timer is armed the whole thing should not blow up due to 32bit/sign extension bugs or whatever they decided to cobble together. Thanks, tglx ^ permalink raw reply [flat|nested] 22+ messages in thread
* RE: [patch 0/2] tsc/adjust: Cure suspend/resume issues and prevent TSC deadline timer irq storm 2016-12-15 9:31 ` Thomas Gleixner @ 2017-01-26 23:40 ` Stanton, Kevin B 0 siblings, 0 replies; 22+ messages in thread From: Stanton, Kevin B @ 2017-01-26 23:40 UTC (permalink / raw) To: Thomas Gleixner, Roland Scheidegger Cc: LKML, x86, Peter Zijlstra, Borislav Petkov, Schlobohm, Bruce, Allen Hung, Sharon, Barak, Segev, Gil On Thurs, 15 Dec 2016, Thomas Gleixner wrote: >See the SDM. It suggests that the deadline is compared to the TSC value. >I don't care how it is implemented, but I very much care about it being >documented in the way it is implemented, which is obviously not the case. The TSC_DEADLINE behavior definitely diverges from the SDM and we've root caused the issue and are finalizing a fix that addresses the problem. Glad that a software workaround was also possible. We also plan to expand our documentation on what good behavior of a BIOS is as it relates to adjusting TSC at boot (in almost every circumstance you shouldn't do it). The official errata will be forthcoming. Thank you for helping to identify this. Kevin ^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2017-01-26 23:40 UTC | newest] Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2016-12-13 13:14 [patch 0/2] tsc/adjust: Cure suspend/resume issues and prevent TSC deadline timer irq storm Thomas Gleixner 2016-12-13 13:14 ` [patch 1/2] x86/tsc: Validate TSC_ADJUST after resume Thomas Gleixner 2016-12-13 13:22 ` Peter Zijlstra 2016-12-13 13:23 ` Thomas Gleixner 2016-12-15 10:52 ` [tip:x86/timers] " tip-bot for Thomas Gleixner 2016-12-13 13:14 ` [patch 2/2] x86/tsc: Force TSC_ADJUST register to value >= zero Thomas Gleixner 2016-12-13 13:43 ` Peter Zijlstra 2016-12-13 15:49 ` Thomas Gleixner 2016-12-15 10:53 ` [tip:x86/timers] " tip-bot for Thomas Gleixner 2016-12-16 11:46 ` [patch 2/2] " Thomas Gleixner 2016-12-16 11:52 ` Ingo Molnar 2016-12-16 11:53 ` Thomas Gleixner 2016-12-16 13:33 ` Thomas Gleixner 2016-12-13 16:34 ` [patch 0/2] tsc/adjust: Cure suspend/resume issues and prevent TSC deadline timer irq storm Roland Scheidegger 2016-12-13 16:46 ` Thomas Gleixner 2016-12-14 1:36 ` Roland Scheidegger 2016-12-14 7:31 ` Thomas Gleixner 2016-12-14 20:59 ` Thomas Gleixner 2016-12-14 21:40 ` Thomas Gleixner 2016-12-14 22:54 ` Roland Scheidegger 2016-12-15 9:31 ` Thomas Gleixner 2017-01-26 23:40 ` Stanton, Kevin B
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).