From mboxrd@z Thu Jan 1 00:00:00 1970
From: Olof Johansson
Date: Mon, 6 Feb 2017 17:31:35 -0800
Subject: Re: [GIT pull] x86/timers for 4.10
To: Thomas Gleixner
Cc: Linus Torvalds, LKML, Andrew Morton, Ingo Molnar, "H. Peter Anvin"
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
X-Originating-IP: [209.133.79.6]
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Thomas,

I just now updated my build box from 4.9-rc to 4.10-rc, and picked up
these changes. My machine went from doing:

[ 0.000000] tsc: Fast TSC calibration using PIT
[ 0.060669] TSC deadline timer enabled
[ 0.142701] TSC synchronization [CPU#0 -> CPU#1]:
[ 0.142704] Measured 3127756 cycles TSC warp between CPUs, turning off TSC clock.
[ 0.142708] tsc: Marking TSC unstable due to check_tsc_sync_source failed

To:

[ 0.000000] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 133484882848 ns
[ 0.000000] hpet clockevent registered
[ 0.000000] tsc: Fast TSC calibration using PIT
[ 0.000000] tsc: Detected 2793.624 MHz processor
[ 0.000000] [Firmware Bug]: TSC ADJUST: CPU0: -6495898515190607 force to 0
[2325258.699535] Calibrating delay loop (skipped), value calculated using timer frequency.. 5587.24 BogoMIPS (lpj=11174496)
[2325258.699537] pid_max: default: 32768 minimum: 301

[... SMP bringup and for each CPU:]

[ 0.177102] [Firmware Bug]: TSC ADJUST differs: Reference CPU0: -6495898515190607 CPU1: -6495898517158354
[ 0.177104] TSC ADJUST synchronize: Reference CPU0: 0 CPU1: -6495898517158354
[2325258.877496] #2
[ 0.257232] [Firmware Bug]: TSC ADJUST differs: Reference CPU0: -6495898515190607 CPU2: -6495898516849701
[ 0.257234] TSC ADJUST synchronize: Reference CPU0: 0 CPU2: -6495898516849701
[2325258.957514] #3

(Once SMP bringup is done, the system settles down at the 232525x printk
timestamps)

...

So, a couple of obvious notes:

1) The timestamp jumps around during SMP bringup.
2) The timestamp jumps forward a lot. It is ~26 days, which likely
   corresponds to the last cold boot of the system, similar to the
   original reports (6495898515190607 cycles at 2793.624 MHz is about
   2325258 seconds, i.e. ~26.9 days, matching the 232525x timestamps
   above).

I do find it somewhat annoying when printk timestamps aren't 0-based at
boot, but I can cope with it. I'm not sure whether that is intended
behavior, though? The jumping around definitely does not seem to be.

If someone cares, the hardware is a Dell T7810 with 2x E5-2663 v3, BIOS
date 03/09/2016.


-Olof

On Sun, Dec 18, 2016 at 12:06 PM, Thomas Gleixner wrote:
> Linus,
>
> please pull the latest x86-timers-for-linus git tree from:
>
>    git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86-timers-for-linus
>
> This is the last functional update from the tip tree for 4.10. It got
> delayed due to a newly reported and analyzed variant of a BIOS bug and
> the resulting wreckage:
>
>  - Separation of the TSC being marked reliable from the fact that the
>    platform provides the TSC frequency via CPUID/MSRs, and making use
>    of that for GOLDMONT.
>
>  - TSC adjust MSR validation and sanitizing:
>
>    The TSC adjust MSR contains the offset to the hardware counter. The
>    sum of the adjust MSR and the counter is the TSC value which is read
>    via RDTSC.
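(For reference: because RDTSC already returns counter + IA32_TSC_ADJUST,
a firmware-written negative offset like the one in the dmesg above is
directly visible from userspace via the msr driver. The sketch below is
only an illustration, not part of the series; it assumes the msr module
is loaded, root privileges, and CPU 0's /dev/cpu/0/msr node.)

/*
 * Illustration only: dump IA32_TSC_ADJUST (MSR 0x3b) next to the TSC for
 * one CPU.  A value outside 0..0x7FFFFFFF is the condition the series
 * below warns about and sanitizes.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <x86intrin.h>

#define MSR_IA32_TSC_ADJUST	0x3b

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "/dev/cpu/0/msr";
	long long adjust;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd < 0 || pread(fd, &adjust, sizeof(adjust), MSR_IA32_TSC_ADJUST) != sizeof(adjust)) {
		perror(path);
		return EXIT_FAILURE;
	}
	close(fd);

	printf("TSC        = %llu\n", (unsigned long long)__rdtsc());
	printf("TSC_ADJUST = %lld%s\n", adjust,
	       (adjust < 0 || adjust > 0x7FFFFFFFLL) ?
	       "   (outside the 0..0x7FFFFFFF range the deadline timer copes with)" : "");

	return 0;
}

(Run once per /dev/cpu/N/msr node to compare the CPUs in a package.)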
> > On at least two machines from different vendors the BIOS sets the TSC > adjust MSR to negative values. This happens on cold and warm boot. While > on cold boot the offset is a few milliseconds, on warm boot it basically > compensates the power on time of the system. The BIOSes are not even > using the adjust MSR to set all CPUs in the package to the same > offset. The offsets are different which renders the TSC unusable, > > What's worse is that the TSC deadline timer has a HW feature^Wbug. It > malfunctions when the TSC adjust value is negative or greater equal > 0x80000000 resulting in silent boot failures, hard lockups or non firing > timers. This looks like some hardware internal 32/64bit issue with a > sign extension problem. Intel has been silent so far on the issue. > > The update contains sanity checks and keeps the adjust register within > working limits and in sync on the package. > > As it looks like this disease is spreading via BIOS crapware, we need to > address this urgently as the boot failures are hard to debug for users. > > > Thanks, > > tglx > > ------------------> > Bin Gao (4): > x86/tsc: Add X86_FEATURE_TSC_KNOWN_FREQ flag > x86/tsc: Mark TSC frequency determined by CPUID as known > x86/tsc: Mark Intel ATOM_GOLDMONT TSC reliable > x86/tsc: Set TSC_KNOWN_FREQ and TSC_RELIABLE flags on Intel Atom SoCs > > Thomas Gleixner (15): > x86/tsc: Finalize the split of the TSC_RELIABLE flag > x86/tsc: Use X86_FEATURE_TSC_ADJUST in detect_art() > x86/tsc: Detect random warps > x86/tsc: Store and check TSC ADJUST MSR > x86/tsc: Verify TSC_ADJUST from idle > x86/tsc: Sync test only for the first cpu in a package > x86/tsc: Move sync cleanup to a safe place > x86/tsc: Prepare warp test for TSC adjustment > x86/tsc: Try to adjust TSC if sync test fails > x86/tsc: Fix broken CONFIG_X86_TSC=n build > x86/tsc: Validate cpumask pointer before accessing it > x86/tsc: Validate TSC_ADJUST after resume > x86/tsc: Force TSC_ADJUST register to value >= zero > x86/tsc: Annotate printouts as firmware bug > x86/tsc: Limit the adjust value further > > > arch/x86/include/asm/cpufeatures.h | 1 + > arch/x86/include/asm/tsc.h | 9 ++ > arch/x86/kernel/Makefile | 2 +- > arch/x86/kernel/process.c | 1 + > arch/x86/kernel/tsc.c | 42 ++++-- > arch/x86/kernel/tsc_msr.c | 19 +++ > arch/x86/kernel/tsc_sync.c | 290 ++++++++++++++++++++++++++++++++++-- > arch/x86/platform/intel-mid/mfld.c | 9 +- > arch/x86/platform/intel-mid/mrfld.c | 8 +- > arch/x86/power/cpu.c | 1 + > 10 files changed, 355 insertions(+), 27 deletions(-) > > diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h > index a39629206864..7f6a5f88d5ae 100644 > --- a/arch/x86/include/asm/cpufeatures.h > +++ b/arch/x86/include/asm/cpufeatures.h > @@ -106,6 +106,7 @@ > #define X86_FEATURE_APERFMPERF ( 3*32+28) /* APERFMPERF */ > #define X86_FEATURE_EAGER_FPU ( 3*32+29) /* "eagerfpu" Non lazy FPU restore */ > #define X86_FEATURE_NONSTOP_TSC_S3 ( 3*32+30) /* TSC doesn't stop in S3 state */ > +#define X86_FEATURE_TSC_KNOWN_FREQ ( 3*32+31) /* TSC has known frequency */ > > /* Intel-defined CPU features, CPUID level 0x00000001 (ecx), word 4 */ > #define X86_FEATURE_XMM3 ( 4*32+ 0) /* "pni" SSE-3 */ > diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h > index 33b6365c22fe..abb1fdcc545a 100644 > --- a/arch/x86/include/asm/tsc.h > +++ b/arch/x86/include/asm/tsc.h > @@ -45,8 +45,17 @@ extern int tsc_clocksource_reliable; > * Boot-time check whether the TSCs are synchronized across > * all CPUs/cores: > */ > 
+#ifdef CONFIG_X86_TSC > +extern bool tsc_store_and_check_tsc_adjust(bool bootcpu); > +extern void tsc_verify_tsc_adjust(bool resume); > extern void check_tsc_sync_source(int cpu); > extern void check_tsc_sync_target(void); > +#else > +static inline bool tsc_store_and_check_tsc_adjust(bool bootcpu) { return false; } > +static inline void tsc_verify_tsc_adjust(bool resume) { } > +static inline void check_tsc_sync_source(int cpu) { } > +static inline void check_tsc_sync_target(void) { } > +#endif > > extern int notsc_setup(char *); > extern void tsc_save_sched_clock_state(void); > diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile > index 79076d75bdbf..c0ac317dd372 100644 > --- a/arch/x86/kernel/Makefile > +++ b/arch/x86/kernel/Makefile > @@ -75,7 +75,7 @@ apm-y := apm_32.o > obj-$(CONFIG_APM) += apm.o > obj-$(CONFIG_SMP) += smp.o > obj-$(CONFIG_SMP) += smpboot.o > -obj-$(CONFIG_SMP) += tsc_sync.o > +obj-$(CONFIG_X86_TSC) += tsc_sync.o > obj-$(CONFIG_SMP) += setup_percpu.o > obj-$(CONFIG_X86_MPPARSE) += mpparse.o > obj-y += apic/ > diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c > index 0888a879120f..a67e0f0cdaab 100644 > --- a/arch/x86/kernel/process.c > +++ b/arch/x86/kernel/process.c > @@ -277,6 +277,7 @@ void exit_idle(void) > > void arch_cpu_idle_enter(void) > { > + tsc_verify_tsc_adjust(false); > local_touch_nmi(); > enter_idle(); > } > diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c > index 46b2f41f8b05..0aed75a1e31b 100644 > --- a/arch/x86/kernel/tsc.c > +++ b/arch/x86/kernel/tsc.c > @@ -702,6 +702,20 @@ unsigned long native_calibrate_tsc(void) > } > } > > + /* > + * TSC frequency determined by CPUID is a "hardware reported" > + * frequency and is the most accurate one so far we have. This > + * is considered a known frequency. > + */ > + setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ); > + > + /* > + * For Atom SoCs TSC is the only reliable clocksource. > + * Mark TSC reliable so no watchdog on it. > + */ > + if (boot_cpu_data.x86_model == INTEL_FAM6_ATOM_GOLDMONT) > + setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE); > + > return crystal_khz * ebx_numerator / eax_denominator; > } > > @@ -1043,18 +1057,20 @@ static void detect_art(void) > if (boot_cpu_data.cpuid_level < ART_CPUID_LEAF) > return; > > - cpuid(ART_CPUID_LEAF, &art_to_tsc_denominator, > - &art_to_tsc_numerator, unused, unused+1); > - > - /* Don't enable ART in a VM, non-stop TSC required */ > + /* Don't enable ART in a VM, non-stop TSC and TSC_ADJUST required */ > if (boot_cpu_has(X86_FEATURE_HYPERVISOR) || > !boot_cpu_has(X86_FEATURE_NONSTOP_TSC) || > - art_to_tsc_denominator < ART_MIN_DENOMINATOR) > + !boot_cpu_has(X86_FEATURE_TSC_ADJUST)) > return; > > - if (rdmsrl_safe(MSR_IA32_TSC_ADJUST, &art_to_tsc_offset)) > + cpuid(ART_CPUID_LEAF, &art_to_tsc_denominator, > + &art_to_tsc_numerator, unused, unused+1); > + > + if (art_to_tsc_denominator < ART_MIN_DENOMINATOR) > return; > > + rdmsrl(MSR_IA32_TSC_ADJUST, art_to_tsc_offset); > + > /* Make this sticky over multiple CPU init calls */ > setup_force_cpu_cap(X86_FEATURE_ART); > } > @@ -1064,6 +1080,11 @@ static void detect_art(void) > > static struct clocksource clocksource_tsc; > > +static void tsc_resume(struct clocksource *cs) > +{ > + tsc_verify_tsc_adjust(true); > +} > + > /* > * We used to compare the TSC to the cycle_last value in the clocksource > * structure to avoid a nasty time-warp. 
This can be observed in a > @@ -1096,6 +1117,7 @@ static struct clocksource clocksource_tsc = { > .flags = CLOCK_SOURCE_IS_CONTINUOUS | > CLOCK_SOURCE_MUST_VERIFY, > .archdata = { .vclock_mode = VCLOCK_TSC }, > + .resume = tsc_resume, > }; > > void mark_tsc_unstable(char *reason) > @@ -1283,10 +1305,10 @@ static int __init init_tsc_clocksource(void) > clocksource_tsc.flags |= CLOCK_SOURCE_SUSPEND_NONSTOP; > > /* > - * Trust the results of the earlier calibration on systems > - * exporting a reliable TSC. > + * When TSC frequency is known (retrieved via MSR or CPUID), we skip > + * the refined calibration and directly register it as a clocksource. > */ > - if (boot_cpu_has(X86_FEATURE_TSC_RELIABLE)) { > + if (boot_cpu_has(X86_FEATURE_TSC_KNOWN_FREQ)) { > clocksource_register_khz(&clocksource_tsc, tsc_khz); > return 0; > } > @@ -1363,6 +1385,8 @@ void __init tsc_init(void) > > if (unsynchronized_tsc()) > mark_tsc_unstable("TSCs unsynchronized"); > + else > + tsc_store_and_check_tsc_adjust(true); > > check_system_tsc_reliable(); > > diff --git a/arch/x86/kernel/tsc_msr.c b/arch/x86/kernel/tsc_msr.c > index 0fe720d64fef..19afdbd7d0a7 100644 > --- a/arch/x86/kernel/tsc_msr.c > +++ b/arch/x86/kernel/tsc_msr.c > @@ -100,5 +100,24 @@ unsigned long cpu_khz_from_msr(void) > #ifdef CONFIG_X86_LOCAL_APIC > lapic_timer_frequency = (freq * 1000) / HZ; > #endif > + > + /* > + * TSC frequency determined by MSR is always considered "known" > + * because it is reported by HW. > + * Another fact is that on MSR capable platforms, PIT/HPET is > + * generally not available so calibration won't work at all. > + */ > + setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ); > + > + /* > + * Unfortunately there is no way for hardware to tell whether the > + * TSC is reliable. We were told by silicon design team that TSC > + * on Atom SoCs are always "reliable". TSC is also the only > + * reliable clocksource on these SoCs (HPET is either not present > + * or not functional) so mark TSC reliable which removes the > + * requirement for a watchdog clocksource. > + */ > + setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE); > + > return res; > } > diff --git a/arch/x86/kernel/tsc_sync.c b/arch/x86/kernel/tsc_sync.c > index 78083bf23ed1..d0db011051a5 100644 > --- a/arch/x86/kernel/tsc_sync.c > +++ b/arch/x86/kernel/tsc_sync.c > @@ -14,18 +14,166 @@ > * ( The serial nature of the boot logic and the CPU hotplug lock > * protects against more than 2 CPUs entering this code. ) > */ > +#include > #include > #include > #include > #include > #include > > +struct tsc_adjust { > + s64 bootval; > + s64 adjusted; > + unsigned long nextcheck; > + bool warned; > +}; > + > +static DEFINE_PER_CPU(struct tsc_adjust, tsc_adjust); > + > +void tsc_verify_tsc_adjust(bool resume) > +{ > + struct tsc_adjust *adj = this_cpu_ptr(&tsc_adjust); > + s64 curval; > + > + if (!boot_cpu_has(X86_FEATURE_TSC_ADJUST)) > + return; > + > + /* Rate limit the MSR check */ > + if (!resume && time_before(jiffies, adj->nextcheck)) > + return; > + > + adj->nextcheck = jiffies + HZ; > + > + rdmsrl(MSR_IA32_TSC_ADJUST, curval); > + if (adj->adjusted == curval) > + return; > + > + /* Restore the original value */ > + wrmsrl(MSR_IA32_TSC_ADJUST, adj->adjusted); > + > + if (!adj->warned || resume) { > + pr_warn(FW_BUG "TSC ADJUST differs: CPU%u %lld --> %lld. 
Restoring\n", > + smp_processor_id(), adj->adjusted, curval); > + adj->warned = true; > + } > +} > + > +static void tsc_sanitize_first_cpu(struct tsc_adjust *cur, s64 bootval, > + unsigned int cpu, bool bootcpu) > +{ > + /* > + * First online CPU in a package stores the boot value in the > + * adjustment value. This value might change later via the sync > + * mechanism. If that fails we still can yell about boot values not > + * being consistent. > + * > + * On the boot cpu we just force set the ADJUST value to 0 if it's > + * non zero. We don't do that on non boot cpus because physical > + * hotplug should have set the ADJUST register to a value > 0 so > + * the TSC is in sync with the already running cpus. > + * > + * But we always force positive ADJUST values. Otherwise the TSC > + * deadline timer creates an interrupt storm. We also have to > + * prevent values > 0x7FFFFFFF as those wreckage the timer as well. > + */ > + if ((bootcpu && bootval != 0) || (!bootcpu && bootval < 0) || > + (bootval > 0x7FFFFFFF)) { > + pr_warn(FW_BUG "TSC ADJUST: CPU%u: %lld force to 0\n", cpu, > + bootval); > + wrmsrl(MSR_IA32_TSC_ADJUST, 0); > + bootval = 0; > + } > + cur->adjusted = bootval; > +} > + > +#ifndef CONFIG_SMP > +bool __init tsc_store_and_check_tsc_adjust(bool bootcpu) > +{ > + struct tsc_adjust *cur = this_cpu_ptr(&tsc_adjust); > + s64 bootval; > + > + if (!boot_cpu_has(X86_FEATURE_TSC_ADJUST)) > + return false; > + > + rdmsrl(MSR_IA32_TSC_ADJUST, bootval); > + cur->bootval = bootval; > + cur->nextcheck = jiffies + HZ; > + tsc_sanitize_first_cpu(cur, bootval, smp_processor_id(), bootcpu); > + return false; > +} > + > +#else /* !CONFIG_SMP */ > + > +/* > + * Store and check the TSC ADJUST MSR if available > + */ > +bool tsc_store_and_check_tsc_adjust(bool bootcpu) > +{ > + struct tsc_adjust *ref, *cur = this_cpu_ptr(&tsc_adjust); > + unsigned int refcpu, cpu = smp_processor_id(); > + struct cpumask *mask; > + s64 bootval; > + > + if (!boot_cpu_has(X86_FEATURE_TSC_ADJUST)) > + return false; > + > + rdmsrl(MSR_IA32_TSC_ADJUST, bootval); > + cur->bootval = bootval; > + cur->nextcheck = jiffies + HZ; > + cur->warned = false; > + > + /* > + * Check whether this CPU is the first in a package to come up. In > + * this case do not check the boot value against another package > + * because the new package might have been physically hotplugged, > + * where TSC_ADJUST is expected to be different. When called on the > + * boot CPU topology_core_cpumask() might not be available yet. > + */ > + mask = topology_core_cpumask(cpu); > + refcpu = mask ? cpumask_any_but(mask, cpu) : nr_cpu_ids; > + > + if (refcpu >= nr_cpu_ids) { > + tsc_sanitize_first_cpu(cur, bootval, smp_processor_id(), > + bootcpu); > + return false; > + } > + > + ref = per_cpu_ptr(&tsc_adjust, refcpu); > + /* > + * Compare the boot value and complain if it differs in the > + * package. > + */ > + if (bootval != ref->bootval) { > + pr_warn(FW_BUG "TSC ADJUST differs: Reference CPU%u: %lld CPU%u: %lld\n", > + refcpu, ref->bootval, cpu, bootval); > + } > + /* > + * The TSC_ADJUST values in a package must be the same. If the boot > + * value on this newly upcoming CPU differs from the adjustment > + * value of the already online CPU in this package, set it to that > + * adjusted value. 
> + */ > + if (bootval != ref->adjusted) { > + pr_warn("TSC ADJUST synchronize: Reference CPU%u: %lld CPU%u: %lld\n", > + refcpu, ref->adjusted, cpu, bootval); > + cur->adjusted = ref->adjusted; > + wrmsrl(MSR_IA32_TSC_ADJUST, ref->adjusted); > + } > + /* > + * We have the TSCs forced to be in sync on this package. Skip sync > + * test: > + */ > + return true; > +} > + > /* > * Entry/exit counters that make sure that both CPUs > * run the measurement code at once: > */ > static atomic_t start_count; > static atomic_t stop_count; > +static atomic_t skip_test; > +static atomic_t test_runs; > > /* > * We use a raw spinlock in this exceptional case, because > @@ -37,15 +185,16 @@ static arch_spinlock_t sync_lock = __ARCH_SPIN_LOCK_UNLOCKED; > static cycles_t last_tsc; > static cycles_t max_warp; > static int nr_warps; > +static int random_warps; > > /* > * TSC-warp measurement loop running on both CPUs. This is not called > * if there is no TSC. > */ > -static void check_tsc_warp(unsigned int timeout) > +static cycles_t check_tsc_warp(unsigned int timeout) > { > - cycles_t start, now, prev, end; > - int i; > + cycles_t start, now, prev, end, cur_max_warp = 0; > + int i, cur_warps = 0; > > start = rdtsc_ordered(); > /* > @@ -85,13 +234,22 @@ static void check_tsc_warp(unsigned int timeout) > if (unlikely(prev > now)) { > arch_spin_lock(&sync_lock); > max_warp = max(max_warp, prev - now); > + cur_max_warp = max_warp; > + /* > + * Check whether this bounces back and forth. Only > + * one CPU should observe time going backwards. > + */ > + if (cur_warps != nr_warps) > + random_warps++; > nr_warps++; > + cur_warps = nr_warps; > arch_spin_unlock(&sync_lock); > } > } > WARN(!(now-start), > "Warning: zero tsc calibration delta: %Ld [max: %Ld]\n", > now-start, end-start); > + return cur_max_warp; > } > > /* > @@ -136,15 +294,26 @@ void check_tsc_sync_source(int cpu) > } > > /* > - * Reset it - in case this is a second bootup: > + * Set the maximum number of test runs to > + * 1 if the CPU does not provide the TSC_ADJUST MSR > + * 3 if the MSR is available, so the target can try to adjust > */ > - atomic_set(&stop_count, 0); > - > + if (!boot_cpu_has(X86_FEATURE_TSC_ADJUST)) > + atomic_set(&test_runs, 1); > + else > + atomic_set(&test_runs, 3); > +retry: > /* > - * Wait for the target to arrive: > + * Wait for the target to start or to skip the test: > */ > - while (atomic_read(&start_count) != cpus-1) > + while (atomic_read(&start_count) != cpus - 1) { > + if (atomic_read(&skip_test) > 0) { > + atomic_set(&skip_test, 0); > + return; > + } > cpu_relax(); > + } > + > /* > * Trigger the target to continue into the measurement too: > */ > @@ -155,21 +324,35 @@ void check_tsc_sync_source(int cpu) > while (atomic_read(&stop_count) != cpus-1) > cpu_relax(); > > - if (nr_warps) { > + /* > + * If the test was successful set the number of runs to zero and > + * stop. If not, decrement the number of runs an check if we can > + * retry. In case of random warps no retry is attempted. 
> + */ > + if (!nr_warps) { > + atomic_set(&test_runs, 0); > + > + pr_debug("TSC synchronization [CPU#%d -> CPU#%d]: passed\n", > + smp_processor_id(), cpu); > + > + } else if (atomic_dec_and_test(&test_runs) || random_warps) { > + /* Force it to 0 if random warps brought us here */ > + atomic_set(&test_runs, 0); > + > pr_warning("TSC synchronization [CPU#%d -> CPU#%d]:\n", > smp_processor_id(), cpu); > pr_warning("Measured %Ld cycles TSC warp between CPUs, " > "turning off TSC clock.\n", max_warp); > + if (random_warps) > + pr_warning("TSC warped randomly between CPUs\n"); > mark_tsc_unstable("check_tsc_sync_source failed"); > - } else { > - pr_debug("TSC synchronization [CPU#%d -> CPU#%d]: passed\n", > - smp_processor_id(), cpu); > } > > /* > * Reset it - just in case we boot another CPU later: > */ > atomic_set(&start_count, 0); > + random_warps = 0; > nr_warps = 0; > max_warp = 0; > last_tsc = 0; > @@ -178,6 +361,12 @@ void check_tsc_sync_source(int cpu) > * Let the target continue with the bootup: > */ > atomic_inc(&stop_count); > + > + /* > + * Retry, if there is a chance to do so. > + */ > + if (atomic_read(&test_runs) > 0) > + goto retry; > } > > /* > @@ -185,6 +374,9 @@ void check_tsc_sync_source(int cpu) > */ > void check_tsc_sync_target(void) > { > + struct tsc_adjust *cur = this_cpu_ptr(&tsc_adjust); > + unsigned int cpu = smp_processor_id(); > + cycles_t cur_max_warp, gbl_max_warp; > int cpus = 2; > > /* Also aborts if there is no TSC. */ > @@ -192,6 +384,16 @@ void check_tsc_sync_target(void) > return; > > /* > + * Store, verify and sanitize the TSC adjust register. If > + * successful skip the test. > + */ > + if (tsc_store_and_check_tsc_adjust(false)) { > + atomic_inc(&skip_test); > + return; > + } > + > +retry: > + /* > * Register this CPU's participation and wait for the > * source CPU to start the measurement: > */ > @@ -199,7 +401,12 @@ void check_tsc_sync_target(void) > while (atomic_read(&start_count) != cpus) > cpu_relax(); > > - check_tsc_warp(loop_timeout(smp_processor_id())); > + cur_max_warp = check_tsc_warp(loop_timeout(cpu)); > + > + /* > + * Store the maximum observed warp value for a potential retry: > + */ > + gbl_max_warp = max_warp; > > /* > * Ok, we are done: > @@ -211,4 +418,61 @@ void check_tsc_sync_target(void) > */ > while (atomic_read(&stop_count) != cpus) > cpu_relax(); > + > + /* > + * Reset it for the next sync test: > + */ > + atomic_set(&stop_count, 0); > + > + /* > + * Check the number of remaining test runs. If not zero, the test > + * failed and a retry with adjusted TSC is possible. If zero the > + * test was either successful or failed terminally. > + */ > + if (!atomic_read(&test_runs)) > + return; > + > + /* > + * If the warp value of this CPU is 0, then the other CPU > + * observed time going backwards so this TSC was ahead and > + * needs to move backwards. > + */ > + if (!cur_max_warp) > + cur_max_warp = -gbl_max_warp; > + > + /* > + * Add the result to the previous adjustment value. > + * > + * The adjustement value is slightly off by the overhead of the > + * sync mechanism (observed values are ~200 TSC cycles), but this > + * really depends on CPU, node distance and frequency. So > + * compensating for this is hard to get right. Experiments show > + * that the warp is not longer detectable when the observed warp > + * value is used. In the worst case the adjustment needs to go > + * through a 3rd run for fine tuning. 
> + */ > + cur->adjusted += cur_max_warp; > + > + /* > + * TSC deadline timer stops working or creates an interrupt storm > + * with adjust values < 0 and > x07ffffff. > + * > + * To allow adjust values > 0x7FFFFFFF we need to disable the > + * deadline timer and use the local APIC timer, but that requires > + * more intrusive changes and we do not have any useful information > + * from Intel about the underlying HW wreckage yet. > + */ > + if (cur->adjusted < 0) > + cur->adjusted = 0; > + if (cur->adjusted > 0x7FFFFFFF) > + cur->adjusted = 0x7FFFFFFF; > + > + pr_warn("TSC ADJUST compensate: CPU%u observed %lld warp. Adjust: %lld\n", > + cpu, cur_max_warp, cur->adjusted); > + > + wrmsrl(MSR_IA32_TSC_ADJUST, cur->adjusted); > + goto retry; > + > } > + > +#endif /* CONFIG_SMP */ > diff --git a/arch/x86/platform/intel-mid/mfld.c b/arch/x86/platform/intel-mid/mfld.c > index 1eb47b6298c2..e793fe509971 100644 > --- a/arch/x86/platform/intel-mid/mfld.c > +++ b/arch/x86/platform/intel-mid/mfld.c > @@ -49,8 +49,13 @@ static unsigned long __init mfld_calibrate_tsc(void) > fast_calibrate = ratio * fsb; > pr_debug("read penwell tsc %lu khz\n", fast_calibrate); > lapic_timer_frequency = fsb * 1000 / HZ; > - /* mark tsc clocksource as reliable */ > - set_cpu_cap(&boot_cpu_data, X86_FEATURE_TSC_RELIABLE); > + > + /* > + * TSC on Intel Atom SoCs is reliable and of known frequency. > + * See tsc_msr.c for details. > + */ > + setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ); > + setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE); > > return fast_calibrate; > } > diff --git a/arch/x86/platform/intel-mid/mrfld.c b/arch/x86/platform/intel-mid/mrfld.c > index 59253db41bbc..e0607c77a1bd 100644 > --- a/arch/x86/platform/intel-mid/mrfld.c > +++ b/arch/x86/platform/intel-mid/mrfld.c > @@ -78,8 +78,12 @@ static unsigned long __init tangier_calibrate_tsc(void) > pr_debug("Setting lapic_timer_frequency = %d\n", > lapic_timer_frequency); > > - /* mark tsc clocksource as reliable */ > - set_cpu_cap(&boot_cpu_data, X86_FEATURE_TSC_RELIABLE); > + /* > + * TSC on Intel Atom SoCs is reliable and of known frequency. > + * See tsc_msr.c for details. > + */ > + setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ); > + setup_force_cpu_cap(X86_FEATURE_TSC_RELIABLE); > > return fast_calibrate; > } > diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c > index 53cace2ec0e2..66ade16c7693 100644 > --- a/arch/x86/power/cpu.c > +++ b/arch/x86/power/cpu.c > @@ -252,6 +252,7 @@ static void notrace __restore_processor_state(struct saved_context *ctxt) > fix_processor_context(); > > do_fpu_end(); > + tsc_verify_tsc_adjust(true); > x86_platform.restore_sched_clock_state(); > mtrr_bp_restore(); > perf_restore_debug_store();
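To tie the dmesg at the top of this mail back to the sanitizing code in
tsc_sync.c: condensed into standalone C, the rules the series enforces
look roughly like the sketch below. sanitize_first_cpu() mirrors
tsc_sanitize_first_cpu() and the package-sync step in
tsc_store_and_check_tsc_adjust(); the main() harness, the printfs and the
constants merely replay the values the T7810 reported -- this is a
sketch, not kernel code.

/*
 * Condensed restatement of the TSC_ADJUST sanitizing rules added by this
 * series, fed with the boot values from the Dell T7810 dmesg above.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static int64_t sanitize_first_cpu(int64_t bootval, unsigned int cpu, bool bootcpu)
{
	/*
	 * Boot CPU: any non-zero value is forced to 0.  First CPU of another
	 * (possibly hotplugged) package: only negative values are forced to 0.
	 * Values above 0x7FFFFFFF are always rejected because they break the
	 * TSC deadline timer.
	 */
	if ((bootcpu && bootval != 0) || (!bootcpu && bootval < 0) ||
	    bootval > 0x7FFFFFFF) {
		printf("[Firmware Bug]: TSC ADJUST: CPU%u: %lld force to 0\n",
		       cpu, (long long)bootval);
		return 0;
	}
	return bootval;
}

int main(void)
{
	/* Boot values reported by the firmware on the T7810 above */
	int64_t cpu0_boot = -6495898515190607LL;
	int64_t cpu1_boot = -6495898517158354LL;
	int64_t ref_adjusted;

	/* CPU0 is the boot CPU and the reference for its package */
	ref_adjusted = sanitize_first_cpu(cpu0_boot, 0, true);

	/*
	 * CPU1 sits in the same package: warn about the differing boot value,
	 * then force its TSC_ADJUST to the reference CPU's adjusted value.
	 */
	if (cpu1_boot != cpu0_boot)
		printf("[Firmware Bug]: TSC ADJUST differs: Reference CPU0: %lld CPU1: %lld\n",
		       (long long)cpu0_boot, (long long)cpu1_boot);
	if (cpu1_boot != ref_adjusted)
		printf("TSC ADJUST synchronize: Reference CPU0: %lld CPU1: %lld\n",
		       (long long)ref_adjusted, (long long)cpu1_boot);

	return 0;
}

With the adjust registers pinned to the same non-negative, < 0x80000000
value per package, the warp test in check_tsc_sync_source()/_target() can
then be skipped or retried as described in the changelog above.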