linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2] x86/tsc: Extend watchdog check exemption to 4-Sockets platform
@ 2022-10-13 13:12 Feng Tang
  2022-10-13 16:02 ` Dave Hansen
  2022-10-19  9:18 ` Thomas Gleixner
  0 siblings, 2 replies; 6+ messages in thread
From: Feng Tang @ 2022-10-13 13:12 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H . Peter Anvin, Peter Zijlstra, x86, linux-kernel
  Cc: rui.zhang, tim.c.chen, Xiongfeng Wang, Feng Tang, Yu Liao

There is report again that the tsc clocksource on a 4 sockets x86
Skylake server was wrongly judged as 'unstable' by 'jiffies' watchdog,
and disabled [1].

Commit b50db7095fe0 ("x86/tsc: Disable clocksource watchdog for TSC
on qualified platorms") was introduce to deal with these false
alarms of tsc unstable issues, covering qualified platforms for 2
sockets or smaller ones.

Extend the exemption to 4 sockets to fix the issue.

We also got similar reports on 8 sockets platform from internal test,
but as Peter pointed out, there was tsc sync issues for 8-sockets
platform, and it'd better be handled architecture by architecture,
instead of directly changing the threshold to 8 here.

Rui also proposed another way to disable 'jiffies' as clocksource
watchdog [2], which can also solve this specific problem in an
architecture independent way, with one limitation that some tsc false
alarms are reported by other watchdogs like HPET in post-boot time,
while 'jiffies' is mostly used in boot phase before hardware
clocksources are initialized.

[1]. https://lore.kernel.org/all/9d3bf570-3108-0336-9c52-9bee15767d29@huawei.com/
[2]. https://lore.kernel.org/all/bd5b97f89ab2887543fc262348d1c7cafcaae536.camel@intel.com/

Reported-by: Yu Liao <liaoyu15@huawei.com>
Tested-by: Yu Liao <liaoyu15@huawei.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
---

Hi reviewers:

In the v1 review cycle, Dave raised the issue that 'nr_online_nodes'
is not accurate, and could have problem with fakenuma (numa=fake=4 in
cmdline) case and system with CPU-less HBM/PMEM nodes, which we have
discussed in https://lore.kernel.org/lkml/Y0UgeUIJSFNR4mQB@feng-clx/
and will post the solution as a separate RFC patch.

Thanks,
Feng

Changelog:
  
  Since v1:
  * Change the max socket number from 8 to 4, as Peter Zijlstra
    pointed the 8S machine could have tsc sync issue, and should
    not be exempted generally
  
 arch/x86/kernel/tsc.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index cafacb2e58cc..1fa3fdf43159 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1209,7 +1209,7 @@ static void __init check_system_tsc_reliable(void)
 	 *  - TSC which does not stop in C-States
 	 *  - the TSC_ADJUST register which allows to detect even minimal
 	 *    modifications
-	 *  - not more than two sockets. As the number of sockets cannot be
+	 *  - not more than four sockets. As the number of sockets cannot be
 	 *    evaluated at the early boot stage where this has to be
 	 *    invoked, check the number of online memory nodes as a
 	 *    fallback solution which is an reasonable estimate.
@@ -1217,7 +1217,7 @@ static void __init check_system_tsc_reliable(void)
 	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
 	    boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
 	    boot_cpu_has(X86_FEATURE_TSC_ADJUST) &&
-	    nr_online_nodes <= 2)
+	    nr_online_nodes <= 4)
 		tsc_disable_clocksource_watchdog();
 }
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH v2] x86/tsc: Extend watchdog check exemption to 4-Sockets platform
  2022-10-13 13:12 [PATCH v2] x86/tsc: Extend watchdog check exemption to 4-Sockets platform Feng Tang
@ 2022-10-13 16:02 ` Dave Hansen
  2022-10-14  0:37   ` Feng Tang
  2022-10-19  9:18 ` Thomas Gleixner
  1 sibling, 1 reply; 6+ messages in thread
From: Dave Hansen @ 2022-10-13 16:02 UTC (permalink / raw)
  To: Feng Tang, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, Peter Zijlstra, x86, linux-kernel
  Cc: rui.zhang, tim.c.chen, Xiongfeng Wang, Yu Liao

On 10/13/22 06:12, Feng Tang wrote:
> @@ -1217,7 +1217,7 @@ static void __init check_system_tsc_reliable(void)
>  	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
>  	    boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
>  	    boot_cpu_has(X86_FEATURE_TSC_ADJUST) &&
> -	    nr_online_nodes <= 2)
> +	    nr_online_nodes <= 4)
>  		tsc_disable_clocksource_watchdog();

I still don't think we should perpetuate this hack.

This just plain doesn't work in numa=off numa=fake=... or presumably in
cases where NUMA is disabled in the firmware and memory is interleaved
across all sockets.

It also presumably doesn't work on two-socket systems that have
Cluster-on-Die or Sub-NUMA-Clustering where a single socket is chopped
up into multiple nodes.


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v2] x86/tsc: Extend watchdog check exemption to 4-Sockets platform
  2022-10-13 16:02 ` Dave Hansen
@ 2022-10-14  0:37   ` Feng Tang
  2022-10-14  1:14     ` Feng Tang
  0 siblings, 1 reply; 6+ messages in thread
From: Feng Tang @ 2022-10-14  0:37 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Peter Zijlstra, x86, linux-kernel, rui.zhang, tim.c.chen,
	Xiongfeng Wang, Yu Liao

On Thu, Oct 13, 2022 at 09:02:43AM -0700, Dave Hansen wrote:
> On 10/13/22 06:12, Feng Tang wrote:
> > @@ -1217,7 +1217,7 @@ static void __init check_system_tsc_reliable(void)
> >  	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> >  	    boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
> >  	    boot_cpu_has(X86_FEATURE_TSC_ADJUST) &&
> > -	    nr_online_nodes <= 2)
> > +	    nr_online_nodes <= 4)
> >  		tsc_disable_clocksource_watchdog();
> 
> I still don't think we should perpetuate this hack.
> 
> This just plain doesn't work in numa=off numa=fake=... or presumably in
> cases where NUMA is disabled in the firmware and memory is interleaved
> across all sockets.
> 
> It also presumably doesn't work on two-socket systems that have
> Cluster-on-Die or Sub-NUMA-Clustering where a single socket is chopped
> up into multiple nodes.

Yes, after you raised the 'nr_online_nodes' issue, Peter, Rui and I
have discussed the problem, and plan to post a RFC patch as in 
 https://lore.kernel.org/lkml/Y0UgeUIJSFNR4mQB@feng-clx/

Which can cover:
 - numa=fake=... case
 - platform has DRAM nodes and cpu-less HBM/PMEM nodes

and 'sub-numa-clustering' can't be covered, and the tsc will be
watchdoged as before.

For numa=off case, there is only one CPU up, and I think lifting this
watchdog for tsc is fine.

Thanks,
Feng





^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v2] x86/tsc: Extend watchdog check exemption to 4-Sockets platform
  2022-10-14  0:37   ` Feng Tang
@ 2022-10-14  1:14     ` Feng Tang
  0 siblings, 0 replies; 6+ messages in thread
From: Feng Tang @ 2022-10-14  1:14 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, H . Peter Anvin,
	Peter Zijlstra, x86, linux-kernel, rui.zhang, tim.c.chen,
	Xiongfeng Wang, Yu Liao

On Fri, Oct 14, 2022 at 08:37:18AM +0800, Feng Tang wrote:
> On Thu, Oct 13, 2022 at 09:02:43AM -0700, Dave Hansen wrote:
> > On 10/13/22 06:12, Feng Tang wrote:
> > > @@ -1217,7 +1217,7 @@ static void __init check_system_tsc_reliable(void)
> > >  	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> > >  	    boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
> > >  	    boot_cpu_has(X86_FEATURE_TSC_ADJUST) &&
> > > -	    nr_online_nodes <= 2)
> > > +	    nr_online_nodes <= 4)
> > >  		tsc_disable_clocksource_watchdog();
> > 
> > I still don't think we should perpetuate this hack.
> > 
> > This just plain doesn't work in numa=off numa=fake=... or presumably in
> > cases where NUMA is disabled in the firmware and memory is interleaved
> > across all sockets.
> > 
> > It also presumably doesn't work on two-socket systems that have
> > Cluster-on-Die or Sub-NUMA-Clustering where a single socket is chopped
> > up into multiple nodes.
> 
> Yes, after you raised the 'nr_online_nodes' issue, Peter, Rui and I
> have discussed the problem, and plan to post a RFC patch as in 
>  https://lore.kernel.org/lkml/Y0UgeUIJSFNR4mQB@feng-clx/
> 
> Which can cover:
>  - numa=fake=... case
>  - platform has DRAM nodes and cpu-less HBM/PMEM nodes
> 
> and 'sub-numa-clustering' can't be covered, and the tsc will be
> watchdoged as before.
[...] 


> For numa=off case, there is only one CPU up, and I think lifting this
> watchdog for tsc is fine.

Sorry, I was wrong about this. 'numa=off' will still boot all CPUs up,
but skip SRAT table init and only show one node. so this is another
case that the fix patch can't cover.

Thanks,
Feng



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v2] x86/tsc: Extend watchdog check exemption to 4-Sockets platform
  2022-10-13 13:12 [PATCH v2] x86/tsc: Extend watchdog check exemption to 4-Sockets platform Feng Tang
  2022-10-13 16:02 ` Dave Hansen
@ 2022-10-19  9:18 ` Thomas Gleixner
  2022-10-20  4:44   ` Feng Tang
  1 sibling, 1 reply; 6+ messages in thread
From: Thomas Gleixner @ 2022-10-19  9:18 UTC (permalink / raw)
  To: Feng Tang, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H . Peter Anvin, Peter Zijlstra, x86, linux-kernel
  Cc: rui.zhang, tim.c.chen, Xiongfeng Wang, Feng Tang, Yu Liao

On Thu, Oct 13 2022 at 21:12, Feng Tang wrote:
> There is report again that the tsc clocksource on a 4 sockets x86
> Skylake server was wrongly judged as 'unstable' by 'jiffies' watchdog,
> and disabled [1].
>
> Commit b50db7095fe0 ("x86/tsc: Disable clocksource watchdog for TSC
> on qualified platorms") was introduce to deal with these false
> alarms of tsc unstable issues, covering qualified platforms for 2
> sockets or smaller ones.
>
> Extend the exemption to 4 sockets to fix the issue.
>
> We also got similar reports on 8 sockets platform from internal test,
> but as Peter pointed out, there was tsc sync issues for 8-sockets
> platform, and it'd better be handled architecture by architecture,
> instead of directly changing the threshold to 8 here.
>
> Rui also proposed another way to disable 'jiffies' as clocksource
> watchdog [2], which can also solve this specific problem in an
> architecture independent way, with one limitation that some tsc false
> alarms are reported by other watchdogs like HPET in post-boot time,
> while 'jiffies' is mostly used in boot phase before hardware
> clocksources are initialized.

HPET is initialized early, but if HPET is disabled or not advertised
then the only other hardware clocksource is PMTIMER which is initialized
late via fs_initcall. PMTIMER is initialized late due to broken Pentium
era chipsets which are sorted with PCI quirks. For anything else we can
initialize it early. Something like the below.

I'm sure I said this more than once, but I'm happy to repeat myself
forever:

  Instead of proliferating lousy hacks, can the X86 vendors finaly get
  their act together and provide some architected information whether
  the TSC is trustworthy or not?

Thanks,

        tglx
---

--- a/arch/x86/kernel/time.c
+++ b/arch/x86/kernel/time.c
@@ -10,6 +10,7 @@
  *
  */
 
+#include <linux/acpi_pmtmr.h>
 #include <linux/clocksource.h>
 #include <linux/clockchips.h>
 #include <linux/interrupt.h>
@@ -75,6 +76,14 @@ static void __init setup_default_timer_i
 void __init hpet_time_init(void)
 {
 	if (!hpet_enable()) {
+		/*
+		 * Some Pentium chipsets have broken HPETs and need
+		 * PCI quirks to run before init.
+		 */
+		if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
+		    boot_cpu_data.family != 5)
+			init_acpi_pm_clocksource();
+
 		if (!pit_timer_init())
 			return;
 	}
--- a/drivers/clocksource/acpi_pm.c
+++ b/drivers/clocksource/acpi_pm.c
@@ -30,6 +30,7 @@
  * in arch/i386/kernel/acpi/boot.c
  */
 u32 pmtmr_ioport __read_mostly;
+static bool pmtmr_initialized __init_data;
 
 static inline u32 read_pmtmr(void)
 {
@@ -142,7 +143,7 @@ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_SE
  * Some boards have the PMTMR running way too fast. We check
  * the PMTMR rate against PIT channel 2 to catch these cases.
  */
-static int verify_pmtmr_rate(void)
+static int __init verify_pmtmr_rate(void)
 {
 	u64 value1, value2;
 	unsigned long count, delta;
@@ -172,14 +173,18 @@ static int verify_pmtmr_rate(void)
 /* Number of reads we try to get two different values */
 #define ACPI_PM_READ_CHECKS 10000
 
-static int __init init_acpi_pm_clocksource(void)
+int __init init_acpi_pm_clocksource(void)
 {
 	u64 value1, value2;
 	unsigned int i, j = 0;
+	int ret;
 
 	if (!pmtmr_ioport)
 		return -ENODEV;
 
+	if (pmtmr_initialized)
+		return 0;
+
 	/* "verify" this timing source: */
 	for (j = 0; j < ACPI_PM_MONOTONICITY_CHECKS; j++) {
 		udelay(100 * j);
@@ -210,10 +215,11 @@ static int __init init_acpi_pm_clocksour
 		return -ENODEV;
 	}
 
-	return clocksource_register_hz(&clocksource_acpi_pm,
-						PMTMR_TICKS_PER_SEC);
+	ret = clocksource_register_hz(&clocksource_acpi_pm, PMTMR_TICKS_PER_SEC);
+	if (!ret)
+		pmtimer_initialized = true;
+	return ret;
 }
-
 /* We use fs_initcall because we want the PCI fixups to have run
  * but we still need to load before device_initcall
  */
--- a/include/linux/acpi_pmtmr.h
+++ b/include/linux/acpi_pmtmr.h
@@ -13,6 +13,8 @@
 /* Overrun value */
 #define ACPI_PM_OVRRUN	(1<<24)
 
+extern int __init init_acpi_pm_clocksource(void);
+
 #ifdef CONFIG_X86_PM_TIMER
 
 extern u32 acpi_pm_read_verified(void);



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v2] x86/tsc: Extend watchdog check exemption to 4-Sockets platform
  2022-10-19  9:18 ` Thomas Gleixner
@ 2022-10-20  4:44   ` Feng Tang
  0 siblings, 0 replies; 6+ messages in thread
From: Feng Tang @ 2022-10-20  4:44 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, Borislav Petkov, Dave Hansen, H . Peter Anvin,
	Peter Zijlstra, x86, linux-kernel, rui.zhang, tim.c.chen,
	Xiongfeng Wang, Yu Liao

On Wed, Oct 19, 2022 at 11:18:43AM +0200, Thomas Gleixner wrote:
> On Thu, Oct 13 2022 at 21:12, Feng Tang wrote:
> > There is report again that the tsc clocksource on a 4 sockets x86
> > Skylake server was wrongly judged as 'unstable' by 'jiffies' watchdog,
> > and disabled [1].
> >
> > Commit b50db7095fe0 ("x86/tsc: Disable clocksource watchdog for TSC
> > on qualified platorms") was introduce to deal with these false
> > alarms of tsc unstable issues, covering qualified platforms for 2
> > sockets or smaller ones.
> >
> > Extend the exemption to 4 sockets to fix the issue.
> >
> > We also got similar reports on 8 sockets platform from internal test,
> > but as Peter pointed out, there was tsc sync issues for 8-sockets
> > platform, and it'd better be handled architecture by architecture,
> > instead of directly changing the threshold to 8 here.
> >
> > Rui also proposed another way to disable 'jiffies' as clocksource
> > watchdog [2], which can also solve this specific problem in an
> > architecture independent way, with one limitation that some tsc false
> > alarms are reported by other watchdogs like HPET in post-boot time,
> > while 'jiffies' is mostly used in boot phase before hardware
> > clocksources are initialized.
> 
> HPET is initialized early, but if HPET is disabled or not advertised
> then the only other hardware clocksource is PMTIMER which is initialized
> late via fs_initcall. PMTIMER is initialized late due to broken Pentium
> era chipsets which are sorted with PCI quirks. For anything else we can
> initialize it early. Something like the below.

Thanks for sharing the background and the code! It can reduce the
time of 'jiffies' being a watchdog on client platforms whose HPET
are disabled. And there were still false positive reports for
HPET/PMTIMER as watchdogs, so I still vote to your suggestion of
lifting the check for qualified platforms.

For that, Dave raised the accuracy issue of 'nr_online_nodes' and
we proposed new patch in https://lore.kernel.org/lkml/20221017132942.1646934-1-feng.tang@intel.com/
while the topology_max_packages() still has issue as providing socket
number, and I plan to use 'logical_packages' instead. Do you think
it's in the right direction?

> I'm sure I said this more than once, but I'm happy to repeat myself
> forever:
> 
>   Instead of proliferating lousy hacks, can the X86 vendors finaly get
>   their act together and provide some architected information whether
>   the TSC is trustworthy or not?
 
Yes it will save us a lot of trouble. Maybe better in CPUID info, as
if there is some bug in HW/BIOS, it may get fixed with microcode update.

Thanks,
Feng

> Thanks,
> 
>         tglx
> ---
> 
> --- a/arch/x86/kernel/time.c
> +++ b/arch/x86/kernel/time.c
> @@ -10,6 +10,7 @@
>   *
>   */
>  
> +#include <linux/acpi_pmtmr.h>
>  #include <linux/clocksource.h>
>  #include <linux/clockchips.h>
>  #include <linux/interrupt.h>
> @@ -75,6 +76,14 @@ static void __init setup_default_timer_i
>  void __init hpet_time_init(void)
>  {
>  	if (!hpet_enable()) {
> +		/*
> +		 * Some Pentium chipsets have broken HPETs and need
> +		 * PCI quirks to run before init.
> +		 */
> +		if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
> +		    boot_cpu_data.family != 5)
> +			init_acpi_pm_clocksource();
> +
>  		if (!pit_timer_init())
>  			return;
>  	}
> --- a/drivers/clocksource/acpi_pm.c
> +++ b/drivers/clocksource/acpi_pm.c
> @@ -30,6 +30,7 @@
>   * in arch/i386/kernel/acpi/boot.c
>   */
>  u32 pmtmr_ioport __read_mostly;
> +static bool pmtmr_initialized __init_data;
>  
>  static inline u32 read_pmtmr(void)
>  {
> @@ -142,7 +143,7 @@ DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_SE
>   * Some boards have the PMTMR running way too fast. We check
>   * the PMTMR rate against PIT channel 2 to catch these cases.
>   */
> -static int verify_pmtmr_rate(void)
> +static int __init verify_pmtmr_rate(void)
>  {
>  	u64 value1, value2;
>  	unsigned long count, delta;
> @@ -172,14 +173,18 @@ static int verify_pmtmr_rate(void)
>  /* Number of reads we try to get two different values */
>  #define ACPI_PM_READ_CHECKS 10000
>  
> -static int __init init_acpi_pm_clocksource(void)
> +int __init init_acpi_pm_clocksource(void)
>  {
>  	u64 value1, value2;
>  	unsigned int i, j = 0;
> +	int ret;
>  
>  	if (!pmtmr_ioport)
>  		return -ENODEV;
>  
> +	if (pmtmr_initialized)
> +		return 0;
> +
>  	/* "verify" this timing source: */
>  	for (j = 0; j < ACPI_PM_MONOTONICITY_CHECKS; j++) {
>  		udelay(100 * j);
> @@ -210,10 +215,11 @@ static int __init init_acpi_pm_clocksour
>  		return -ENODEV;
>  	}
>  
> -	return clocksource_register_hz(&clocksource_acpi_pm,
> -						PMTMR_TICKS_PER_SEC);
> +	ret = clocksource_register_hz(&clocksource_acpi_pm, PMTMR_TICKS_PER_SEC);
> +	if (!ret)
> +		pmtimer_initialized = true;
> +	return ret;
>  }
> -
>  /* We use fs_initcall because we want the PCI fixups to have run
>   * but we still need to load before device_initcall
>   */
> --- a/include/linux/acpi_pmtmr.h
> +++ b/include/linux/acpi_pmtmr.h
> @@ -13,6 +13,8 @@
>  /* Overrun value */
>  #define ACPI_PM_OVRRUN	(1<<24)
>  
> +extern int __init init_acpi_pm_clocksource(void);
> +
>  #ifdef CONFIG_X86_PM_TIMER
>  
>  extern u32 acpi_pm_read_verified(void);
> 
> 

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2022-10-20  4:45 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-13 13:12 [PATCH v2] x86/tsc: Extend watchdog check exemption to 4-Sockets platform Feng Tang
2022-10-13 16:02 ` Dave Hansen
2022-10-14  0:37   ` Feng Tang
2022-10-14  1:14     ` Feng Tang
2022-10-19  9:18 ` Thomas Gleixner
2022-10-20  4:44   ` Feng Tang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).