linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] Speed MTRR programming up when we can
@ 2019-06-27 18:43 Ricardo Neri
  2019-06-27 18:43 ` [PATCH 1/2] x86/cpu/intel: Clear cache self-snoop capability in CPUs with known errata Ricardo Neri
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Ricardo Neri @ 2019-06-27 18:43 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: Alan Cox, Tony Luck, H. Peter Anvin, Andy Shevchenko, Andi Kleen,
	Hans de Goede, Greg Kroah-Hartman, Jordan Borgner,
	Ravi V. Shankar, Mohammad Etemadi, Ricardo Neri, linux-kernel,
	x86, Ricardo Neri

Programming MTRR registers in multi-processor systems is a rather lengthy
process. Furthermore, all processors must program these registers in lock
step and with interrupts disabled; the process also involves flushing
caches and TLBs twice. As a result, the process may take a considerable
amount of time.

In some platforms, this can lead to a large skew of the refined-jiffies
clock source. Early when booting, if no other clock is available (e.g.,
booting with hpet=disabled), the refined-jiffies clock source is used to
monitor the TSC clock source. If the skew of refined-jiffies is too large,
Linux wrongly assumes that the TSC is unstable:

     clocksource: timekeeping watchdog on CPU1: Marking clocksource
          'tsc-early' as unstable because the skew is too large:
     clocksource: 'refined-jiffies' wd_now: fffedc10 wd_last:
          fffedb90 mask: ffffffff
     clocksource: 'tsc-early' cs_now: 5eccfddebc cs_last: 5e7e3303d4
          mask: ffffffffffffffff
     tsc: Marking TSC unstable due to clocksource watchdog

As per my measurements, around 98% of the time needed by the procedure to
program MTRRs in multi-processor systems is spent flushing caches with
wbinvd(). As per the Section 11.11.8 of the Intel 64 and IA 32
Architectures Software Developer's Manual, it is not necessary to flush
caches if the CPU supports cache self-snooping. Thus, skipping the cache
flushes can reduce by several tens of milliseconds the time needed to
complete the programming of the MTRR registers.

By measuring the execution time of mtrr_aps_init() (from which MTRRs
in all CPUs are programmed in lock-step at boot), I find savings in the
time required to program MTRRs as follows:

Platform                      time-with-wbinvd(ms) time-no-wbinvd(ms)
104-core (208 LP) Skylake            1437                 28
2-core (4 LP) Haswell                 114                  2

However, there exist CPU models with errata that affect their self-
snooping capability. Such errata may cause unpredictable behavior, machine
check errors, or hangs. For instance:

     "Where two different logical processors have the same code page
      mapped with two different memory types Specifically if one code
      page is mapped by one logical processor as write back and by
      another as uncacheable and certain instruction timing conditions
      occur the system may experience unpredictable behaviour." [1].

Similar errata are reported in other processors as well [2], [3], [4],
[5], and [6].

Thus, in order to confidently leverage self-snooping for the MTRR
programming algorithm, we must first clear such feature in models with
known errata.


Thanks and BR,
Ricardo

LP = Logical Processor

[1]. Erratum BF52, 
https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-3600-specification-update.pdf
[2]. Erratum BK47, 
https://www.mouser.com/pdfdocs/2ndgencorefamilymobilespecificationupdate.pdf
[3]. Erratum AAO54, 
https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-c5500-c3500-spec-update.pdf
[4]. Errata AZ39, AZ42, 
https://www.intel.com/content/dam/support/us/en/documents/processors/mobile/celeron/sb/320121.pdf
[5]. Errata AQ51, AQ102, AQ104, 
https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/pentium-dual-core-desktop-e2000-specification-update.pdf
[6]. Errata AN107, AN109, 
https://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/pentium-dual-core-specification-update.pdf

Ricardo Neri (2):
  x86/cpu/intel: Clear cache self-snoop capability in CPUs with known
    errata
  x86, mtrr: generic: Skip cache flushes on CPUs with cache
    self-snooping

 arch/x86/kernel/cpu/intel.c        | 30 ++++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/mtrr/generic.c |  8 ++++++--
 2 files changed, 36 insertions(+), 2 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH 1/2] x86/cpu/intel: Clear cache self-snoop capability in CPUs with known errata
  2019-06-27 18:43 [PATCH 0/2] Speed MTRR programming up when we can Ricardo Neri
@ 2019-06-27 18:43 ` Ricardo Neri
  2019-06-27 20:38   ` Thomas Gleixner
  2019-06-27 18:43 ` [PATCH 2/2] x86, mtrr: generic: Skip cache flushes on CPUs with cache self-snooping Ricardo Neri
  2019-06-27 20:39 ` [PATCH 0/2] Speed MTRR programming up when we can Thomas Gleixner
  2 siblings, 1 reply; 6+ messages in thread
From: Ricardo Neri @ 2019-06-27 18:43 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: Alan Cox, Tony Luck, H. Peter Anvin, Andy Shevchenko, Andi Kleen,
	Hans de Goede, Greg Kroah-Hartman, Jordan Borgner,
	Ravi V. Shankar, Mohammad Etemadi, Ricardo Neri, linux-kernel,
	x86, Ricardo Neri, Andy Shevchenko, Andi Kleen, Peter Feiner,
	Rafael J. Wysocki

Processors which have self-snooping capability can handle conflicting
memory type across CPUs by snooping its own cache. However, there exists
CPU models in which having conflicting memory types still leads to
unpredictable behavior, machine check errors, or hangs. Clear this feature
to prevent its use. For instance, the algorithm to program the Memory Type
Region Registers and the Page Attribute Table MSR can skip expensive cache
flushes if self-snooping is supported.

Cc: Tony Luck <tony.luck@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andy Shevchenko <andriy.shevchenko@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Hans de Goede <hdegoede@redhat.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jordan Borgner <mail@jordan-borgner.de>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Suggested-by: Alan Cox <alan.cox@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
 arch/x86/kernel/cpu/intel.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index f17c1a714779..8684c66e71e0 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -66,6 +66,34 @@ void check_mpx_erratum(struct cpuinfo_x86 *c)
 	}
 }
 
+/*
+ * Processors which have self-snooping capability can handle conflicting
+ * memory type across CPUs by snooping its own cache. However, there exists
+ * CPU models in which having conflicting memory types still leads to
+ * unpredictable behavior, machine check errors, or hangs. Clear this feature
+ * to prevent its use. For instance, the algorithm to program the Memory Type
+ * Region Registers and the Page Attribute Table MSR can skip expensive cache
+ * flushes if self-snooping is supported.
+ */
+static void check_memory_type_self_snoop_errata(struct cpuinfo_x86 *c)
+{
+	switch (c->x86_model) {
+	case INTEL_FAM6_CORE_YONAH:
+	case INTEL_FAM6_CORE2_MEROM:
+	case INTEL_FAM6_CORE2_MEROM_L:
+	case INTEL_FAM6_CORE2_PENRYN:
+	case INTEL_FAM6_CORE2_DUNNINGTON:
+	case INTEL_FAM6_NEHALEM:
+	case INTEL_FAM6_NEHALEM_G:
+	case INTEL_FAM6_NEHALEM_EP:
+	case INTEL_FAM6_NEHALEM_EX:
+	case INTEL_FAM6_WESTMERE:
+	case INTEL_FAM6_WESTMERE_EP:
+	case INTEL_FAM6_SANDYBRIDGE:
+		setup_clear_cpu_cap(X86_FEATURE_SELFSNOOP);
+	}
+}
+
 static bool ring3mwait_disabled __read_mostly;
 
 static int __init ring3mwait_disable(char *__unused)
@@ -311,6 +339,8 @@ static void early_init_intel(struct cpuinfo_x86 *c)
 	 */
 	if (detect_extended_topology_early(c) < 0)
 		detect_ht_early(c);
+
+	check_memory_type_self_snoop_errata(c);
 }
 
 #ifdef CONFIG_X86_32
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH 2/2] x86, mtrr: generic: Skip cache flushes on CPUs with cache self-snooping
  2019-06-27 18:43 [PATCH 0/2] Speed MTRR programming up when we can Ricardo Neri
  2019-06-27 18:43 ` [PATCH 1/2] x86/cpu/intel: Clear cache self-snoop capability in CPUs with known errata Ricardo Neri
@ 2019-06-27 18:43 ` Ricardo Neri
  2019-06-27 20:39 ` [PATCH 0/2] Speed MTRR programming up when we can Thomas Gleixner
  2 siblings, 0 replies; 6+ messages in thread
From: Ricardo Neri @ 2019-06-27 18:43 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: Alan Cox, Tony Luck, H. Peter Anvin, Andy Shevchenko, Andi Kleen,
	Hans de Goede, Greg Kroah-Hartman, Jordan Borgner,
	Ravi V. Shankar, Mohammad Etemadi, Ricardo Neri, linux-kernel,
	x86, Ricardo Neri, Andy Shevchenko, Andi Kleen, Peter Feiner,
	Rafael J. Wysocki

Programming MTRR registers in multi-processor systems is a rather lengthy
process. Furthermore, all processors must program these registers in lock
step and with interrupts disabled; the process also involves flushing
caches and TLBs twice. As a result, the process may take a considerable
amount of time.

In some platforms, this can lead to a large skew of the refined-jiffies
clock source. Early when booting, if no other clock is available (e.g.,
booting with hpet=disabled), the refined-jiffies clock source is used to
monitor the TSC clock source. If the skew of refined-jiffies is too large,
Linux wrongly assumes that the TSC is unstable:

  clocksource: timekeeping watchdog on CPU1: Marking clocksource
               'tsc-early' as unstable because the skew is too large:
  clocksource: 'refined-jiffies' wd_now: fffedc10 wd_last:
               fffedb90 mask: ffffffff
  clocksource: 'tsc-early' cs_now: 5eccfddebc cs_last: 5e7e3303d4
               mask: ffffffffffffffff
  tsc: Marking TSC unstable due to clocksource watchdog

As per my measurements, around 98% of the time needed by the procedure to
program MTRRs in multi-processor systems is spent flushing caches with
wbinvd(). As per the Section 11.11.8 of the Intel 64 and IA 32
Architectures Software Developer's Manual, it is not necessary to flush
caches if the CPU supports cache self-snooping. Thus, skipping the cache
flushes can reduce by several tens of milliseconds the time needed to
complete the programming of the MTRR registers.

Cc: Alan Cox <alan.cox@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andy Shevchenko <andriy.shevchenko@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Hans de Goede <hdegoede@redhat.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jordan Borgner <mail@jordan-borgner.de>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Reported-by: Mohammad Etemadi <mohammad.etemadi@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
 arch/x86/kernel/cpu/mtrr/generic.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/mtrr/generic.c b/arch/x86/kernel/cpu/mtrr/generic.c
index 9356c1c9024d..169672a6935c 100644
--- a/arch/x86/kernel/cpu/mtrr/generic.c
+++ b/arch/x86/kernel/cpu/mtrr/generic.c
@@ -743,7 +743,9 @@ static void prepare_set(void) __acquires(set_atomicity_lock)
 	/* Enter the no-fill (CD=1, NW=0) cache mode and flush caches. */
 	cr0 = read_cr0() | X86_CR0_CD;
 	write_cr0(cr0);
-	wbinvd();
+
+	if (!static_cpu_has(X86_FEATURE_SELFSNOOP))
+		wbinvd();
 
 	/* Save value of CR4 and clear Page Global Enable (bit 7) */
 	if (boot_cpu_has(X86_FEATURE_PGE)) {
@@ -760,7 +762,9 @@ static void prepare_set(void) __acquires(set_atomicity_lock)
 
 	/* Disable MTRRs, and set the default type to uncached */
 	mtrr_wrmsr(MSR_MTRRdefType, deftype_lo & ~0xcff, deftype_hi);
-	wbinvd();
+
+	if (!static_cpu_has(X86_FEATURE_SELFSNOOP))
+		wbinvd();
 }
 
 static void post_set(void) __releases(set_atomicity_lock)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH 1/2] x86/cpu/intel: Clear cache self-snoop capability in CPUs with known errata
  2019-06-27 18:43 ` [PATCH 1/2] x86/cpu/intel: Clear cache self-snoop capability in CPUs with known errata Ricardo Neri
@ 2019-06-27 20:38   ` Thomas Gleixner
  2019-06-28  1:43     ` Ricardo Neri
  0 siblings, 1 reply; 6+ messages in thread
From: Thomas Gleixner @ 2019-06-27 20:38 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Ingo Molnar, Borislav Petkov, Alan Cox, Tony Luck,
	H. Peter Anvin, Andy Shevchenko, Andi Kleen, Hans de Goede,
	Greg Kroah-Hartman, Jordan Borgner, Ravi V. Shankar,
	Mohammad Etemadi, Ricardo Neri, LKML, x86, Andy Shevchenko,
	Andi Kleen, Peter Feiner, Rafael J. Wysocki

Ricardo,

On Thu, 27 Jun 2019, Ricardo Neri wrote:
>  
> +/*
> + * Processors which have self-snooping capability can handle conflicting
> + * memory type across CPUs by snooping its own cache. However, there exists
> + * CPU models in which having conflicting memory types still leads to
> + * unpredictable behavior, machine check errors, or hangs. Clear this feature
> + * to prevent its use. For instance, the algorithm to program the Memory Type
> + * Region Registers and the Page Attribute Table MSR can skip expensive cache
> + * flushes if self-snooping is supported.

I appreciate informative comments, but this is the part which disables a
feature on errata inflicted CPUs. So the whole information about what
self-snooping helps with is not that interesting here. It's broken, we
disable it and be done with it.

> + */
> +static void check_memory_type_self_snoop_errata(struct cpuinfo_x86 *c)
> +{
> +	switch (c->x86_model) {
> +	case INTEL_FAM6_CORE_YONAH:
> +	case INTEL_FAM6_CORE2_MEROM:
> +	case INTEL_FAM6_CORE2_MEROM_L:
> +	case INTEL_FAM6_CORE2_PENRYN:
> +	case INTEL_FAM6_CORE2_DUNNINGTON:
> +	case INTEL_FAM6_NEHALEM:
> +	case INTEL_FAM6_NEHALEM_G:
> +	case INTEL_FAM6_NEHALEM_EP:
> +	case INTEL_FAM6_NEHALEM_EX:
> +	case INTEL_FAM6_WESTMERE:
> +	case INTEL_FAM6_WESTMERE_EP:
> +	case INTEL_FAM6_SANDYBRIDGE:
> +		setup_clear_cpu_cap(X86_FEATURE_SELFSNOOP);
> +	}
> +}
> +

But looking at the actual interesting part of the 2nd patch:

> @@ -743,7 +743,9 @@ static void prepare_set(void) __acquires(set_atomicity_lock)
>        /* Enter the no-fill (CD=1, NW=0) cache mode and flush caches. */
>        cr0 = read_cr0() | X86_CR0_CD;
>        write_cr0(cr0);
> -       wbinvd();
> +
> +       if (!static_cpu_has(X86_FEATURE_SELFSNOOP))
> +               wbinvd();

This part lacks any form of explanation. So I'd rather have the comment
about why we can avoid the wbindv() here. I''d surely never would look at
that errata handling function to get that information.

Other than that detail, the patches are well done!

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 0/2] Speed MTRR programming up when we can
  2019-06-27 18:43 [PATCH 0/2] Speed MTRR programming up when we can Ricardo Neri
  2019-06-27 18:43 ` [PATCH 1/2] x86/cpu/intel: Clear cache self-snoop capability in CPUs with known errata Ricardo Neri
  2019-06-27 18:43 ` [PATCH 2/2] x86, mtrr: generic: Skip cache flushes on CPUs with cache self-snooping Ricardo Neri
@ 2019-06-27 20:39 ` Thomas Gleixner
  2 siblings, 0 replies; 6+ messages in thread
From: Thomas Gleixner @ 2019-06-27 20:39 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Ingo Molnar, Borislav Petkov, Alan Cox, Tony Luck,
	H. Peter Anvin, Andy Shevchenko, Andi Kleen, Hans de Goede,
	Greg Kroah-Hartman, Jordan Borgner, Ravi V. Shankar,
	Mohammad Etemadi, Ricardo Neri, linux-kernel, x86

On Thu, 27 Jun 2019, Ricardo Neri wrote:
> By measuring the execution time of mtrr_aps_init() (from which MTRRs
> in all CPUs are programmed in lock-step at boot), I find savings in the
> time required to program MTRRs as follows:
> 
> Platform                      time-with-wbinvd(ms) time-no-wbinvd(ms)
> 104-core (208 LP) Skylake            1437                 28
> 2-core (4 LP) Haswell                 114                  2

Impressive!

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 1/2] x86/cpu/intel: Clear cache self-snoop capability in CPUs with known errata
  2019-06-27 20:38   ` Thomas Gleixner
@ 2019-06-28  1:43     ` Ricardo Neri
  0 siblings, 0 replies; 6+ messages in thread
From: Ricardo Neri @ 2019-06-28  1:43 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, Borislav Petkov, Alan Cox, Tony Luck,
	H. Peter Anvin, Andy Shevchenko, Andi Kleen, Hans de Goede,
	Greg Kroah-Hartman, Jordan Borgner, Ravi V. Shankar,
	Mohammad Etemadi, Ricardo Neri, LKML, x86, Andy Shevchenko,
	Andi Kleen, Peter Feiner, Rafael J. Wysocki

On Thu, Jun 27, 2019 at 10:38:13PM +0200, Thomas Gleixner wrote:
> Ricardo,
> 
> On Thu, 27 Jun 2019, Ricardo Neri wrote:
> >  
> > +/*
> > + * Processors which have self-snooping capability can handle conflicting
> > + * memory type across CPUs by snooping its own cache. However, there exists
> > + * CPU models in which having conflicting memory types still leads to
> > + * unpredictable behavior, machine check errors, or hangs. Clear this feature
> > + * to prevent its use. For instance, the algorithm to program the Memory Type
> > + * Region Registers and the Page Attribute Table MSR can skip expensive cache
> > + * flushes if self-snooping is supported.
> 
> I appreciate informative comments, but this is the part which disables a
> feature on errata inflicted CPUs. So the whole information about what
> self-snooping helps with is not that interesting here. It's broken, we
> disable it and be done with it.

Sure, Thomas. I will move the the usefulness of self-snooping to the
MTRR programming function as you mention below.
> 
> > + */
> > +static void check_memory_type_self_snoop_errata(struct cpuinfo_x86 *c)
> > +{
> > +	switch (c->x86_model) {
> > +	case INTEL_FAM6_CORE_YONAH:
> > +	case INTEL_FAM6_CORE2_MEROM:
> > +	case INTEL_FAM6_CORE2_MEROM_L:
> > +	case INTEL_FAM6_CORE2_PENRYN:
> > +	case INTEL_FAM6_CORE2_DUNNINGTON:
> > +	case INTEL_FAM6_NEHALEM:
> > +	case INTEL_FAM6_NEHALEM_G:
> > +	case INTEL_FAM6_NEHALEM_EP:
> > +	case INTEL_FAM6_NEHALEM_EX:
> > +	case INTEL_FAM6_WESTMERE:
> > +	case INTEL_FAM6_WESTMERE_EP:
> > +	case INTEL_FAM6_SANDYBRIDGE:
> > +		setup_clear_cpu_cap(X86_FEATURE_SELFSNOOP);
> > +	}
> > +}
> > +
> 
> But looking at the actual interesting part of the 2nd patch:
> 
> > @@ -743,7 +743,9 @@ static void prepare_set(void) __acquires(set_atomicity_lock)
> >        /* Enter the no-fill (CD=1, NW=0) cache mode and flush caches. */
> >        cr0 = read_cr0() | X86_CR0_CD;
> >        write_cr0(cr0);
> > -       wbinvd();
> > +
> > +       if (!static_cpu_has(X86_FEATURE_SELFSNOOP))
> > +               wbinvd();
> 
> This part lacks any form of explanation. So I'd rather have the comment
> about why we can avoid the wbindv() here. I''d surely never would look at
> that errata handling function to get that information.
> 
> Other than that detail, the patches are well done!

Thank you, Thomas!

BR,
Ricardo

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2019-06-28  1:43 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-27 18:43 [PATCH 0/2] Speed MTRR programming up when we can Ricardo Neri
2019-06-27 18:43 ` [PATCH 1/2] x86/cpu/intel: Clear cache self-snoop capability in CPUs with known errata Ricardo Neri
2019-06-27 20:38   ` Thomas Gleixner
2019-06-28  1:43     ` Ricardo Neri
2019-06-27 18:43 ` [PATCH 2/2] x86, mtrr: generic: Skip cache flushes on CPUs with cache self-snooping Ricardo Neri
2019-06-27 20:39 ` [PATCH 0/2] Speed MTRR programming up when we can Thomas Gleixner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).