Re: [PATCH] x86/tsc: RFC: re-synchronize TSCs to boot cpu TSC

From: Roland Scheidegger <rscheidegger_lists@hispeed.ch>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>,
	x86@kernel.org, LKML <linux-kernel@vger.kernel.org>,
	Allen Hung <allen_hung@dell.com>
Subject: Re: [PATCH] x86/tsc: RFC: re-synchronize TSCs to boot cpu TSC
Date: Sat, 10 Dec 2016 02:55:16 +0100	[thread overview]
Message-ID: <b6134d9e-276f-032c-dfef-9c3173632ea8@hispeed.ch> (raw)
In-Reply-To: <alpine.DEB.2.20.1612092203430.3612@nanos>

Am 09.12.2016 um 23:59 schrieb Thomas Gleixner:
> On Fri, 9 Dec 2016, Roland Scheidegger wrote:
> 
> Cc'ed someone from Dell. 
> 
>> Am 09.12.2016 um 18:33 schrieb Thomas Gleixner:
>>> Can you add the patch below to gather more information? There is a hunk in
>>> there with an '#if 0' which sets the TSC ADJUST to 0 on boot, which you can
>>> turn on as second step.
>>
>> Ok, here's the results:
>> ...
>> TSC ADJUST synchronize: Reference CPU0: -2820267100 CPU1: -2822498296
>> TSC target sync skipped
>> smpboot: Vector locked
>> smpboot: Vector setup done
>> smpboot: Clock setup
>> TSC source sync skipped
>> smpboot: Target CPU is online
> 
> I did not expect that to happen. Now I'm puzzled and curious where the
> machine gets lost after that. See below.
> 
>> With the #if 0 block activated, it boots up fine, the output was:
> 
> That does not make any sense at all, but yes, nothing in this context makes
> sense.
> 
>> [    1.038892] x86: Booting SMP configuration:
>> [    1.038930] .... node  #0, CPUs:        #1
>> [    0.171851] TSC ADJUST: CPU1: -2830353064 218577682002
>> [    1.117495] TSC source sync 0 -> 1 runs 3
>> [    0.171852] TSC ADJUST differs: Reference CPU0: -2828600940 CPU1:
>> -2830353064
>> [    0.171853] TSC ADJUST synchronize: Reference CPU0: 0 CPU1: -2830353064
>> [    1.117497] TSC target sync skip
> 
>> (And fwiw with my quick hack the lockups disappear to when I change that
>> back to blast a zero into TSC_ADJ for all cpus.)
> 
> Right, That's what that hunk does as well.
> 
> Now what's interesting is that the adjustement of CPU1 in the non write to
> zero case results in the following:
> 
> TSC ADJUST: CPU1: -2830353064 218577682002 <-- TSC value
> TSC ADJUST differs: Reference CPU0: -2828600940 CPU1: -2830353064
> 
> We write CPU1 adjust register to -2828600940 which makes the TSC on CPU1
> jump forwards by -2828600940 - -2830353064 = 1752124 cycles.
> 
> In the write to zero case the jump is forward as well, but this time it's
> huge, i.e. 2830353064 cycles.
> 
> I tried to wreckage the TSC by writing similar values to the adjust MSR on
> early boot, but independent of the values and independent of the write to
> zero part the machine comes up happily all the time.
> 
> The only difference is that my machine has a somewhat saner BIOS. So the
> thing might just die in the value add SMM crap, but who knows.
> 
> In the patch below is another bunch of debug prints which emit the state
> information of CPU1 during bringup. Maybe that gives a hint where the
> system gets stuck when you disable the 'write to zero' magic again.
> 
> The NMI watchdog does not catch anything, right?
Nope. (Though as mentioned earlier, with my hack when not writing zero
it did - but the lockup there was later after all 16 cpus were online,
and I only really tried that with the ubuntu 4.4 kernel. I never got to
see the full output from that NMI though due to limited screen space, my
attempts to try anything different than text mode were met with a blank
screen, and from the parts I did see I didn't really see anything
interesting albeit that's not saying much as I really have no idea about
that code...)

With the new patch here's the output (albeit the typing gets a bit
annoying...)
...
Invoking state 32 CB replay_prepare_cpu+0x0/0xe0
CB returned 0
Invoking state 35 CB rcutree_prepare_cpu+0x0/0x50
CB returned 0
Invoking state 41 CB notify_prepare+0x0/0xa0
CBreturned 0
Invoking state 48 CB bringup_cpu+0x0/0x90
x86: Booting SMP configuration:
.... node  #0, CPUs:       #1
Invoking state 51 CB sched_cpu_starting+0x0/0x60
CB returned 0
Invoking state 62 CB x86_pmu_starting_cpu+0x0/0x20
CB returned 0
TSC ADJUST: CPU1: -2846131604 175264328618
TSC ADJUST differs: Reference CPU0: -2843967660 CPU1: -2846131604
TSC ADJUST synchronize: Reference CPU0: -2843967660 CPU1: -2846131604
TSC target sync skip
smpboot: Vector locked
TSC source sync 0 -> 1 runs 3
smpboot: Vector setup done
smpboot: Clock setup
TSC source sync skipped
smpboot: Target CPU is online

> 
>> The system also came back up fine from suspend with this (well - still
>> minus graphics...), however disabled tsc clocksource:
>>
>> [  579.931739] Enabling non-boot CPUs ...
>> [  579.943107] smpboot: Booting Node 0 Processor 1 APIC 0x2
>> [  579.943189] TSC ADJUST: CPU1: -1504429974 21601834126
> 
> Fun, yet another adjust value. Are they set by a random number generator?
> 
>> [  579.944093] CPU1 is up
> 
>> [  580.458983] clocksource: timekeeping watchdog on CPU1: Marking
>> clocksource 'tsc' as unstable because the skew is too large:
>> [  580.458985] clocksource:                       'hpet' wd_now: 587c1
>> wd_last: 437c7 mask: ffffffff
>> [  580.458986] clocksource:                       'tsc' cs_now:
>> 563963cd8 cs_last: 508f5a02a mask: ffffffffffffffff
> 
> Ok, that's caused by the fact that we do not sanitize the TSC adjust
> register on the boot CPU in the resume path.
Yep. (Which is why my hack hacked the restore path as well - I am still
really confused about boot/resume boot cpu initialization...)

> 
>> [  581.006760] [Firmware Bug]: TSC ADJUST differs: CPU0 0 -->
>> -1502494750. Restoring
> 
> We only detect it later and correct it, but that's too late. The untested
> patch below should cure that.
> 
>>> This BIOS seems to be rather strange ...
>> Don't tell me...
>>
>> No idea what it's doing, but I think it's safe to say whatever it's
>> trying to do, it's doing it wrong...
> 
> Amen to that. I've seen a lot of Value Add BIOSes which broke things left
> and right, but this one takes it to a new level of insanity.
> 
> Did you report that back to DELL already?
No, I never did (even though the box actually shipped with ubuntu
(albeit a really old version)). Since it was basically working - just
without TSCs...
I was quite happy with using my hack for quite some time, it just got
annoying I'll either have to stick to one kernel or reapply it all the
time :-).

Roland

> 
> Thanks,
> 
> 	tglx
> 
> 8<--------------------
> --- a/arch/x86/include/asm/tsc.h
> +++ b/arch/x86/include/asm/tsc.h
> @@ -47,7 +47,7 @@ extern int tsc_clocksource_reliable;
>   */
>  #ifdef CONFIG_X86_TSC
>  extern bool tsc_store_and_check_tsc_adjust(void);
> -extern void tsc_verify_tsc_adjust(void);
> +extern void tsc_verify_tsc_adjust(bool resume);
>  extern void check_tsc_sync_source(int cpu);
>  extern void check_tsc_sync_target(void);
>  #else
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -277,7 +277,7 @@ void exit_idle(void)
>  
>  void arch_cpu_idle_enter(void)
>  {
> -	tsc_verify_tsc_adjust();
> +	tsc_verify_tsc_adjust(false);
>  	local_touch_nmi();
>  	enter_idle();
>  }
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -1080,6 +1080,11 @@ static void detect_art(void)
>  
>  static struct clocksource clocksource_tsc;
>  
> +static void tsc_resume(struct clocksource *cs)
> +{
> +	tsc_verify_tsc_adjust(true);
> +}
> +
>  /*
>   * We used to compare the TSC to the cycle_last value in the clocksource
>   * structure to avoid a nasty time-warp. This can be observed in a
> @@ -1112,6 +1117,7 @@ static struct clocksource clocksource_ts
>  	.flags                  = CLOCK_SOURCE_IS_CONTINUOUS |
>  				  CLOCK_SOURCE_MUST_VERIFY,
>  	.archdata               = { .vclock_mode = VCLOCK_TSC },
> +	.resume			= tsc_resume,
>  };
>  
>  void mark_tsc_unstable(char *reason)
> --- a/arch/x86/kernel/tsc_sync.c
> +++ b/arch/x86/kernel/tsc_sync.c
> @@ -30,7 +30,7 @@ struct tsc_adjust {
>  
>  static DEFINE_PER_CPU(struct tsc_adjust, tsc_adjust);
>  
> -void tsc_verify_tsc_adjust(void)
> +void tsc_verify_tsc_adjust(bool resume)
>  {
>  	struct tsc_adjust *adj = this_cpu_ptr(&tsc_adjust);
>  	s64 curval;
> @@ -39,7 +39,7 @@ void tsc_verify_tsc_adjust(void)
>  		return;
>  
>  	/* Rate limit the MSR check */
> -	if (time_before(jiffies, adj->nextcheck))
> +	if (!resume && time_before(jiffies, adj->nextcheck))
>  		return;
>  
>  	adj->nextcheck = jiffies + HZ;
> @@ -51,7 +51,7 @@ void tsc_verify_tsc_adjust(void)
>  	/* Restore the original value */
>  	wrmsrl(MSR_IA32_TSC_ADJUST, adj->adjusted);
>  
> -	if (!adj->warned) {
> +	if (!adj->warned || resume) {
>  		pr_warn(FW_BUG "TSC ADJUST differs: CPU%u %lld --> %lld. Restoring\n",
>  			smp_processor_id(), adj->adjusted, curval);
>  		adj->warned = true;
> --- a/arch/x86/power/cpu.c
> +++ b/arch/x86/power/cpu.c
> @@ -256,6 +256,7 @@ static void notrace __restore_processor_
>  	mtrr_bp_restore();
>  	perf_restore_debug_store();
>  	msr_restore_context(ctxt);
> +	tsc_verify_tsc_adjust(true);
>  }
>  
>  /* Needed by apm.c */
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -133,7 +133,9 @@ static int cpuhp_invoke_callback(unsigne
>  		if (!cb)
>  			return 0;
>  		trace_cpuhp_enter(cpu, st->target, state, cb);
> +		pr_info("Invoking state %d CB %pF\n", state, cb);
>  		ret = cb(cpu);
> +		pr_info("CB returned %d\n", ret);
>  		trace_cpuhp_exit(cpu, st->state, state, ret);
>  		return ret;
>  	}
> @@ -153,7 +155,10 @@ static int cpuhp_invoke_callback(unsigne
>  	cnt = 0;
>  	hlist_for_each(node, &step->list) {
>  		trace_cpuhp_multi_enter(cpu, st->target, state, cbm, node);
> +		pr_info("Invoking state %d CB %pF instance %p\n", state, cbm,
> +			node);
>  		ret = cbm(cpu, node);
> +		pr_info("CB returned %d\n", ret);
>  		trace_cpuhp_exit(cpu, st->state, state, ret);
>  		if (ret)
>  			goto err;
> @@ -550,6 +555,7 @@ static void cpuhp_thread_fun(unsigned in
>  
>  	st->should_run = false;
>  
> +	pr_info("CPUHP thread running\n");
>  	/* Single callback invocation for [un]install ? */
>  	if (st->single) {
>  		if (st->cb_state < CPUHP_AP_ONLINE) {
>