* [RFC PATCH v5 00/16] x86: Implement an HPET-based hardlockup detector
@ 2021-05-04 19:05 Ricardo Neri
  2021-05-04 19:05 ` [RFC PATCH v5 01/16] x86/hpet: Expose hpet_writel() in header Ricardo Neri
                   ` (15 more replies)
  0 siblings, 16 replies; 18+ messages in thread
From: Ricardo Neri @ 2021-05-04 19:05 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: H. Peter Anvin, Ashok Raj, Andi Kleen, Tony Luck,
	Nicholas Piggin, Peter Zijlstra (Intel),
	Andrew Morton, Stephane Eranian, Suravee Suthikulpanit,
	Ravi V. Shankar, Ricardo Neri, x86, linux-kernel, Ricardo Neri

Hi,

This is the long overdue fifth attempt to implement a hardlockup detector
that uses an HPET channel. Previous versions can be found in [1], [2],
[3], and [4]. I will soon post in the IOMMU mailing list a separate series
to deal with interrupt remapping. Thus, x86 reviewers may hold off on their
review momentarily while the IOMMU experts decide on whether what I propose
makes sense.

This RFC series has been reviewed by Tony Luck <tony.luck@intel.com>.

== Introduction ==

In CPU architectures that do not have an NMI watchdog, one can be
constructed using a counter of the Performance Monitoring Unit (PMU).
Counters in the PMU have high granularity and high visibility of the CPU.
These capabilities and their limited number make these counters precious
resources. Unfortunately, the perf-based hardlockup detector permanently
consumes one of these counters per CPU. These counters could be freed
if the hardlockup detector were driven by another timer.

The hardlockup detector runs relatively infrequently and does not require
visibility of the CPU activity (other than to detect locked-up CPUs). A
timer that is external to the CPU (e.g., in the chipset) can be used to
drive the detector.

A key requirement is that the timer needs to be capable of issuing a
non-maskable interrupt to the CPU. In most cases, this can be achieved
by tweaking the delivery mode of the interrupt. It is especially
straightforward for MSI interrupts.

This implementation uses an HPET timer to deliver an NMI interrupt via
an MSI message.
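
As an illustration of "tweaking the delivery mode", the sketch below composes
the data word of an x86 MSI message. It assumes the architectural layout
(vector in bits 7:0, delivery mode in bits 10:8, NMI encoded as 100b). The
helper and macro names are hypothetical and are not taken from this series.

```c
#include <assert.h>
#include <stdint.h>

/* x86 MSI data word: vector in bits 7:0, delivery mode in bits 10:8. */
#define TOY_MSI_VECTOR_MASK	0xffu
#define TOY_MSI_DELIVERY_SHIFT	8
#define TOY_MSI_DELIVERY_FIXED	0x0u
#define TOY_MSI_DELIVERY_NMI	0x4u

/* Hypothetical helper: build the data word of an MSI message. */
static uint32_t toy_msi_data(uint32_t delivery_mode, uint32_t vector)
{
	/* The delivery mode field is three bits wide. */
	assert(delivery_mode <= 0x7u);

	return (delivery_mode << TOY_MSI_DELIVERY_SHIFT) |
	       (vector & TOY_MSI_VECTOR_MASK);
}
```

With NMI delivery the vector field carries no information, so
toy_msi_data(TOY_MSI_DELIVERY_NMI, 0) is enough to turn a regular MSI
message into one that raises an NMI.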

== Details of this implementation ==

Unlike the perf-based hardlockup detector, this implementation is
driven by a single timer. Driving the detector with a single timer brings
certain complexities to the implementation: accessing the HPET timer is
slow, and the frequency and the affinity of the timer interrupt need to be
adjusted periodically. These operations need to happen in addition to
servicing the NMI. In order to address these complexities, this design
meets the following goals:

  * Minimize updates to the affinity of the HPET timer interrupt and do
    it outside of NMI context.

  * Avoid races with System Management Mode that may cause the HPET
    count to appear to stall or run backwards.

  * Minimize the number of reads and writes to the HPET registers.

Also, as per feedback from Thomas Gleixner,

  * Do not implement an IRQF_NMI to request an NMI interrupt for x86 due
    to the difficulty of identifying its source.

In order to meet the goals above, I implemented what Thomas suggested [5]:
a detector that mixes inter-processor interrupts (IPIs) and updating the
affinity of the HPET interrupt in a round-robin fashion.

The CPUs that the hardlockup detector monitors are partitioned into groups.
A CPU from such group, the handling CPU, handles the NMI interrupt from the
HPET timer and then issues an NMI IPI to the rest of the CPUs in the group.
Each of the CPUs looks for hardlockups upon reception of the NMI. To
minimize IPIs among packages, a group comprises all the CPUs in a
package.

Since the monitored CPUs now have been partitioned in groups, the HPET
timer needs to target the handling CPU of each group. Hence, the affinity
of the HPET timer is updated in round-robin manner to sequentially target
each group of CPUs. Each group of CPUs will be monitored every 1 second or
less frequently. This is the frequency of the HPET timer interrupt. In the
unlikely case of having more packages than the value of watchdog_thresh,
several packages will be grouped together to keep the HPET timer interrupt
at 1 second.
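
The grouping arithmetic can be sketched as follows. This is only one reading
of the description above, not the code in the patches: it assumes every group
is visited once per watchdog_thresh seconds, and that packages are merged
only when that would otherwise push the timer period below 1 second. All
names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result of planning the round-robin schedule. */
struct toy_schedule {
	unsigned int pkgs_per_group;	/* packages merged into one group */
	unsigned int nr_groups;		/* groups visited per watchdog_thresh */
	uint64_t ticks_per_group;	/* HPET ticks between timer interrupts */
};

static struct toy_schedule toy_plan(unsigned int nr_packages,
				    unsigned int watchdog_thresh,
				    uint64_t ticks_per_second)
{
	struct toy_schedule s;

	assert(nr_packages > 0 && watchdog_thresh > 0);

	/* Merge packages only if there are more of them than seconds. */
	s.pkgs_per_group = (nr_packages + watchdog_thresh - 1) / watchdog_thresh;
	s.nr_groups = (nr_packages + s.pkgs_per_group - 1) / s.pkgs_per_group;

	/* Each group gets an equal share of the watchdog_thresh window. */
	s.ticks_per_group = ticks_per_second * watchdog_thresh / s.nr_groups;

	return s;
}
```

Under these assumptions, two packages with watchdog_thresh = 10 give two
groups and one interrupt every 5 seconds. Sixteen packages with the same
threshold are merged in pairs, so the period stays above 1 second.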

In order to avoid reading HPET registers in every NMI, the time-stamp
counter is used to determine whether the HPET caused the interrupt. At
every timer expiration, we compute the value the time-stamp counter is
expected to have the next time the timer expires. I have found
experimentally that the expected TSC value consistently has an error of
less than 1%.
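
A minimal sketch of such a check, assuming the error margin is taken as 1% of
the TSC ticks in one timer period (that choice, and all names below, are
illustrative and not taken from the patches). The comparison uses unsigned
wrap-around so that a single test covers both bounds of the window.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed margin: 1% of the TSC ticks elapsed in one timer period. */
static uint64_t toy_tsc_margin(uint64_t tsc_ticks_per_period)
{
	return tsc_ticks_per_period / 100;
}

/*
 * Return true if tsc_now lies in [tsc_expected - margin, tsc_expected + margin].
 * A value below the window wraps to a huge unsigned number and fails the test.
 */
static bool toy_is_hpet_nmi(uint64_t tsc_now, uint64_t tsc_expected,
			    uint64_t margin)
{
	assert(margin <= tsc_expected);

	return (tsc_now - (tsc_expected - margin)) <= 2 * margin;
}
```

An NMI that arrives outside the window would be treated as coming from
another source, and no HPET register needs to be read to decide that.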

== Parts of this series ==

For clarity, patches are grouped as follows:

 1) HPET updates. Patches 1-3 prepare the HPET code to accommodate the
    new detector: rework periodic programming, reserve and configure a
    timer for the detector and expose a few existing functions.

 2) NMI watchdog. Patches 4-6 update the existing hardlockup detector
    to uncouple it from perf, and introduce a new NMI handler category
    intended to run after the NMI_LOCAL handlers.

 3) New HPET-based hardlockup detector. Patches 7-11 include changes to
    probe the hardware resources, configure the interrupt, and rotate the
    destination of the interrupts among all monitored CPUs.

 4) Hardlockup detector management. Patches 12-16 are a collection of
    miscellaneous patches to determine when to use the HPET hardlockup
    detector and stop it if necessary. They also include an x86-specific
    shim hardlockup detector that selects between the perf- and hpet-based
    implementations and switches back to the perf implementation if
    the TSC becomes unstable.

== Testing ==

I tested this series in both client and server systems, single-package and
dual-package systems. I let the system run normally and observed the NMI
interrupts keep coming by inspecting the file /proc/interrupts.

I also implemented a test module to hog a CPU and observed the detector
complain about soft and hard lockups.

Lastly, I wrote a script to keep only one CPU online at a time in a
round-robin manner by doing 500 back-to-back CPU hotplug online/offline
cycles. During the test, I set /proc/sys/kernel/watchdog_thresh to 1 second.
The HPET-based hardlockup detector is still functional after the test. I
used the latest master branch from the tip tree.

The tests above were performed with interrupt remapping both enabled and
disabled.

Thanks and BR,
Ricardo

Changes since v4:
 * Added commentary on the tests performed on this feature. (Andi)
 * Added a stub version of hardlockup_detector_switch_to_perf() for
   !CONFIG_HPET_TIMER. (lkp)
 * Use switch to select the type of x86 hardlockup detector. (Andi)
 * Renamed a local variable in update_ticks_per_group(). (Andi)
 * Made this hardlockup detector available to X86_32.
 * Reworked logic to kick the HPET timer to remove a local variable.
   (Andi)
 * Added a comment on what type of timer channel will be assigned to the
   detector. (Andi)
 * Reworded help comment in the X86_HARDLOCKUP_DETECTOR_HPET Kconfig
   option. (Andi)
 * Removed unnecessary switch to level interrupt mode when disabling the
   timer. (Andi)
 * Disabled the HPET timer to avoid a race between an incoming interrupt
   and an update of the MSI destination ID. (Ashok)
 * Renamed hpet_hardlockup_detector_get_timer() as hpet_hld_get_timer()
 * Added commentary on an undocumented udelay() when programming an
   HPET channel in periodic mode. (Ashok)
 * Reworked code to use new enumeration apic_delivery_modes and reworked
   MSI message composition fields [6].
 * Partitioned monitored CPUs into groups. Each CPU in the group is
   inspected for hardlockups using an IPI.
 * Use a round-robin mechanism to update the affinity of the HPET timer.
   Affinity is updated every watchdog_thresh seconds to target the
   handling CPU of the group.
 * Moved update of the HPET interrupt affinity to an irq_work. (Thomas
   Gleixner).
 * Updated expiration of the HPET timer and the expected value of the
   TSC based on the number of groups of monitored CPUs.
 * Renamed hpet_set_comparator() to hpet_set_comparator_periodic() to
   remove decision logic for periodic case. (Thomas Gleixner)
 * Reworked timer reservation to use Thomas' rework on HPET channel
   management[7].
 * Removed hard-coded channel number for the hardlockup detector.
 * Provided more details on the sequence of HPET channel reservation.
   (Thomas Gleixner)
 * Only reserve a channel for the hardlockup detector if enabled via
   kernel command line. The function reserving the channel is called from
   hardlockup detector. (Thomas Gleixner)
 * Dropped hpet_hld_data::enabled_cpus and instead use cpumask_weight().
 * Renamed hpet_hld_data::cpu_monitored_mask to
   hld_data_data.cpu_monitored_mask and converted it to cpumask_var_t.
 * Flushed out any outstanding interrupt before enabling the HPET channel.
 * Removed unnecessary MSI_DATA_LEVEL_ASSERT from the MSI message.
 * Added comments in hardlockup_detector_nmi_handler() to explain how
   CPUs are targeted for an IPI.
 * Updated code to only issue an IPI when needed (i.e., there are CPUs in
   the group other than the handling CPU).
 * Reworked hardlockup_detector_hpet_init() for readability.
 * Now reserve the cpumasks in the hardlockup detector code and not in the
   generic HPET code.
 * Handle the case of watchdog_thresh = 0 when disabling the detector.

Changes since v3:
 * Fixed yet another bug in periodic programming of the HPET timer that
   prevented the system from booting.
 * Fixed computation of HPET frequency to use hpet_readl() only.
 * Added a missing #include in the watchdog_hld_hpet.c
 * Fixed various typos and grammar errors (Randy Dunlap)

Changes since v2:
 * Added functionality to switch to the perf-based hardlockup
   detector if the TSC becomes unstable (Thomas Gleixner).
 * Brought back the round-robin mechanism proposed in v1 (this time not
   using the interrupt subsystem). This also requires computing
   expiration times as in v1 (Andi Kleen, Stephane Eranian).
 * Fixed a bug in which using a periodic timer was not working (thanks
   to Suravee Suthikulpanit!).
 * In this version, I incorporate support for interrupt remapping in the
   last 4 patches so that they can be reviewed separately if needed.
 * Removed redundant documentation of functions (Thomas Gleixner).
 * Added a new category of NMI handler, NMI_WATCHDOG, which executes after
   NMI_LOCAL handlers (Andi Kleen).
 * Updated handling of "nmi_watchdog" to support comma-separated
   arguments.
 * Undid split of the generic hardlockup detector into a separate file
   (Thomas Gleixner).
 * Added a new intermediate symbol CONFIG_HARDLOCKUP_DETECTOR_CORE to
   select generic parts of the detector (Paul E. McKenney,
   Thomas Gleixner).
 * Removed use of struct cpumask in favor of a variable length array in
   conjunction with kzalloc (Peter Zijlstra).
 * Added CPU as argument to hardlockup_detector_hpet_enable()/disable()
   (Thomas Gleixner).
 * Remove unnecessary export of function declarations, flags, and bit
   fields (Thomas Gleixner).
 * Removed unnecessary check for FSB support when reserving timer for the
   detector (Thomas Gleixner).
 * Separated TSC code from HPET code in kick_timer() (Thomas Gleixner).
 * Reworked condition to check if the expected TSC value is within the
   error margin to avoid conditional (Peter Zijlstra).
 * Removed TSC error margin from struct hld_data; use global variable
   instead (Peter Zijlstra).
 * Removed previously introduced watchdog_get_allowed_cpumask*() and
   reworked hardlockup_detector_hpet_enable()/disable() to not need
   access to watchdog_allowed_mask (Thomas Gleixner).

Changes since v1:

 * Removed reads to HPET registers at every NMI. Instead use the time-stamp
   counter to infer the interrupt source (Thomas Gleixner, Andi Kleen).
 * Do not target CPUs in a round-robin manner. Instead, the HPET timer
   always targets the same CPU; other CPUs are monitored via an
   interprocessor interrupt.
 * Removed use of generic irq code to set interrupt affinity and NMI
   delivery. Instead, configure the interrupt directly in HPET registers
   (Thomas Gleixner).
 * Removed the proposed ops structure for NMI watchdogs. Instead, split
   the existing implementation into a generic library and perf-specific
   infrastructure (Thomas Gleixner, Nicholas Piggin).
 * Added an x86-specific shim hardlockup detector that selects between
   HPET and perf infrastructures as needed (Nicholas Piggin).
 * Removed locks taken in NMI and !NMI context. This was wrong and is no
   longer needed (Thomas Gleixner).
 * Fixed unconditional return NMI_HANDLED when the HPET timer is programmed
   for FSB/MSI delivery (Peter Zijlstra).

[1]. https://lore.kernel.org/lkml/1528851463-21140-1-git-send-email-ricardo.neri-calderon@linux.intel.com/
[2]. https://lore.kernel.org/lkml/1551283518-18922-1-git-send-email-ricardo.neri-calderon@linux.intel.com/
[3]. https://lore.kernel.org/lkml/1557842534-4266-1-git-send-email-ricardo.neri-calderon@linux.intel.com/
[4]. https://lore.kernel.org/lkml/1558660583-28561-1-git-send-email-ricardo.neri-calderon@linux.intel.com/
[5]. https://lore.kernel.org/lkml/alpine.DEB.2.21.1906172343120.1963@nanos.tec.linutronix.de/
[6]. https://lore.kernel.org/r/20201024213535.443185-6-dwmw2@infradead.org
[7]. https://lore.kernel.org/lkml/20190623132340.463097504@linutronix.de/

Ricardo Neri (16):
  x86/hpet: Expose hpet_writel() in header
  x86/hpet: Add helper function hpet_set_comparator_periodic()
  x86/hpet: Reserve an HPET channel for the hardlockup detector
  watchdog/hardlockup: Define a generic function to detect hardlockups
  watchdog/hardlockup: Decouple the hardlockup detector from perf
  x86/nmi: Add an NMI_WATCHDOG NMI handler category
  x86/watchdog/hardlockup: Add an HPET-based hardlockup detector
  x86/watchdog/hardlockup/hpet: Introduce a target_cpumask
  watchdog/hardlockup/hpet: Group packages receiving IPIs when needed
  watchdog/hardlockup/hpet: Adjust timer expiration on the number of
    monitored groups
  x86/watchdog/hardlockup/hpet: Determine if HPET timer caused NMI
  watchdog/hardlockup: Use parse_option_str() to handle "nmi_watchdog"
  watchdog/hardlockup/hpet: Only enable the HPET watchdog via a boot
    parameter
  x86/watchdog: Add a shim hardlockup detector
  watchdog: Expose lockup_detector_reconfigure()
  x86/tsc: Switch to perf-based hardlockup detector if TSC become
    unstable

 .../admin-guide/kernel-parameters.txt         |   8 +-
 arch/x86/Kconfig.debug                        |  14 +
 arch/x86/include/asm/hpet.h                   |  67 ++
 arch/x86/include/asm/nmi.h                    |   1 +
 arch/x86/kernel/Makefile                      |   2 +
 arch/x86/kernel/hpet.c                        | 124 +++-
 arch/x86/kernel/nmi.c                         |  10 +
 arch/x86/kernel/tsc.c                         |   2 +
 arch/x86/kernel/watchdog_hld.c                |  86 +++
 arch/x86/kernel/watchdog_hld_hpet.c           | 633 ++++++++++++++++++
 include/linux/nmi.h                           |   8 +-
 kernel/Makefile                               |   2 +-
 kernel/watchdog.c                             |  12 +-
 kernel/watchdog_hld.c                         |  50 +-
 lib/Kconfig.debug                             |   4 +
 15 files changed, 982 insertions(+), 41 deletions(-)
 create mode 100644 arch/x86/kernel/watchdog_hld.c
 create mode 100644 arch/x86/kernel/watchdog_hld_hpet.c

-- 
2.17.1



* [RFC PATCH v5 01/16] x86/hpet: Expose hpet_writel() in header
  2021-05-04 19:05 [RFC PATCH v5 00/16] x86: Implement an HPET-based hardlockup detector Ricardo Neri
@ 2021-05-04 19:05 ` Ricardo Neri
  2021-05-04 19:05 ` [RFC PATCH v5 02/16] x86/hpet: Add helper function hpet_set_comparator_periodic() Ricardo Neri
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 18+ messages in thread
From: Ricardo Neri @ 2021-05-04 19:05 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: H. Peter Anvin, Ashok Raj, Andi Kleen, Tony Luck,
	Nicholas Piggin, Peter Zijlstra (Intel),
	Andrew Morton, Stephane Eranian, Suravee Suthikulpanit,
	Ravi V. Shankar, Ricardo Neri, x86, linux-kernel, Ricardo Neri,
	Andi Kleen

In order to allow hpet_writel() to be used by other components (e.g.,
the HPET-based hardlockup detector), expose it in the HPET header file.

Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: x86@kernel.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v4:
 * Dropped exposing hpet_readq() as it is not needed.

Changes since v3:
 * None

Changes since v2:
 * None

Changes since v1:
 * None
---
 arch/x86/include/asm/hpet.h | 1 +
 arch/x86/kernel/hpet.c      | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index ab9f3dd87c80..be9848f0883f 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -72,6 +72,7 @@ extern int is_hpet_enabled(void);
 extern int hpet_enable(void);
 extern void hpet_disable(void);
 extern unsigned int hpet_readl(unsigned int a);
+extern void hpet_writel(unsigned int d, unsigned int a);
 extern void force_hpet_resume(void);
 
 #ifdef CONFIG_HPET_EMULATE_RTC
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index 08651a4e6aa0..326af9a55129 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -78,7 +78,7 @@ inline unsigned int hpet_readl(unsigned int a)
 	return readl(hpet_virt_address + a);
 }
 
-static inline void hpet_writel(unsigned int d, unsigned int a)
+inline void hpet_writel(unsigned int d, unsigned int a)
 {
 	writel(d, hpet_virt_address + a);
 }
-- 
2.17.1



* [RFC PATCH v5 02/16] x86/hpet: Add helper function hpet_set_comparator_periodic()
  2021-05-04 19:05 [RFC PATCH v5 00/16] x86: Implement an HPET-based hardlockup detector Ricardo Neri
  2021-05-04 19:05 ` [RFC PATCH v5 01/16] x86/hpet: Expose hpet_writel() in header Ricardo Neri
@ 2021-05-04 19:05 ` Ricardo Neri
  2021-05-04 19:05 ` [RFC PATCH v5 03/16] x86/hpet: Reserve an HPET channel for the hardlockup detector Ricardo Neri
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 18+ messages in thread
From: Ricardo Neri @ 2021-05-04 19:05 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: H. Peter Anvin, Ashok Raj, Andi Kleen, Tony Luck,
	Nicholas Piggin, Peter Zijlstra (Intel),
	Andrew Morton, Stephane Eranian, Suravee Suthikulpanit,
	Ravi V. Shankar, Ricardo Neri, x86, linux-kernel, Ricardo Neri,
	Andi Kleen

Programming an HPET channel as periodic requires setting the
HPET_TN_SETVAL bit in the channel configuration. Plus, the comparator
register must be written twice (once for the comparator value and once for
the periodic value). Since this programming might be needed in several
places (e.g., the HPET clocksource and the HPET-based hardlockup detector),
add a helper function for this purpose.
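
The two-write sequence can be illustrated with a toy software model of one
channel. The model below is an assumption made for illustration: it mimics
only the behavior described above (a write to the comparator register loads
the comparator while the SETVAL bit is set, clears that bit, and subsequent
writes load the period). It is not the kernel code and touches no hardware.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_TN_SETVAL	0x040u	/* illustrative copy of the SETVAL bit */

/* Toy model of the state held by one HPET channel. */
struct toy_channel {
	uint32_t cfg;
	uint32_t comparator;
	uint32_t period;
};

/* Model a write to the channel's comparator register. */
static void toy_write_cmp(struct toy_channel *ch, uint32_t val)
{
	if (ch->cfg & TOY_TN_SETVAL) {
		/* First write: load the comparator; SETVAL self-clears. */
		ch->comparator = val;
		ch->cfg &= ~TOY_TN_SETVAL;
	} else {
		/* Second write: load the period. */
		ch->period = val;
	}
}

/* Mirror the sequence of the helper: set SETVAL, then write twice. */
static void toy_set_comparator_periodic(struct toy_channel *ch,
					uint32_t cmp, uint32_t period)
{
	assert(ch);

	ch->cfg |= TOY_TN_SETVAL;
	toy_write_cmp(ch, cmp);
	toy_write_cmp(ch, period);
}
```

The model makes the ordering constraint visible: without setting the SETVAL
bit first, both writes would land in the period.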

A helper function hpet_set_comparator_oneshot() could also be implemented.
However, such a function would only program the comparator register and
would be quite small. Hence, it is better to not bloat the code with such
an obvious function.

Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: x86@kernel.org
Originally-by: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
When programming the HPET channel in periodic mode, a udelay(1) between
the two successive writes to HPET_Tn_CMP was introduced in commit
e9e2cdb41241 ("[PATCH] clockevents: i386 drivers"). The commit message
does not give any reason for such delay. The hardware specification does
not seem to require it. The refactoring in this patch simply carries the
delay over.
---
Changes since v4:
 * Implement function only for periodic mode. This removed extra logic to
   use a non-zero period value as a proxy for periodic mode
   programming. (Thomas)
 * Added a comment on the history of the udelay() when programming the
   channel in periodic mode. (Ashok)

Changes since v3:
 * Added back a missing hpet_writel() for time configuration.

Changes since v2:
 * Introduced this patch.

Changes since v1:
 * N/A
---
 arch/x86/include/asm/hpet.h |  2 ++
 arch/x86/kernel/hpet.c      | 49 ++++++++++++++++++++++++++++---------
 2 files changed, 39 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index be9848f0883f..486e001413c7 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -74,6 +74,8 @@ extern void hpet_disable(void);
 extern unsigned int hpet_readl(unsigned int a);
 extern void hpet_writel(unsigned int d, unsigned int a);
 extern void force_hpet_resume(void);
+extern void hpet_set_comparator_periodic(int channel, unsigned int cmp,
+					 unsigned int period);
 
 #ifdef CONFIG_HPET_EMULATE_RTC
 
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index 326af9a55129..8be1d3d9162e 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -293,6 +293,39 @@ static void hpet_enable_legacy_int(void)
 	hpet_legacy_int_enabled = true;
 }
 
+/**
+ * hpet_set_comparator_periodic() - Helper function to set periodic channel
+ * @channel:	The HPET channel
+ * @cmp:	The value to be written to the comparator/accumulator
+ * @period:	Number of ticks per period
+ *
+ * Helper function for updating comparator, accumulator and period values.
+ *
+ * In periodic mode, HPET needs HPET_TN_SETVAL to be set before writing
+ * to the Tn_CMP to update the accumulator. Then, HPET needs a second
+ * write (with HPET_TN_SETVAL cleared) to Tn_CMP to set the period.
+ * The HPET_TN_SETVAL bit is automatically cleared after the first write.
+ *
+ * This function takes a 1 microsecond delay. However, this function is supposed
+ * to be called only once (or when reprogramming the timer) as it deals with a
+ * periodic timer channel.
+ *
+ * See the following documents:
+ *   - Intel IA-PC HPET (High Precision Event Timers) Specification
+ *   - AMD-8111 HyperTransport I/O Hub Data Sheet, Publication # 24674
+ */
+void hpet_set_comparator_periodic(int channel, unsigned int cmp, unsigned int period)
+{
+	unsigned int v = hpet_readl(HPET_Tn_CFG(channel));
+
+	hpet_writel(v | HPET_TN_SETVAL, HPET_Tn_CFG(channel));
+
+	hpet_writel(cmp, HPET_Tn_CMP(channel));
+
+	udelay(1);
+	hpet_writel(period, HPET_Tn_CMP(channel));
+}
+
 static int hpet_clkevt_set_state_periodic(struct clock_event_device *evt)
 {
 	unsigned int channel = clockevent_to_channel(evt)->num;
@@ -305,19 +338,11 @@ static int hpet_clkevt_set_state_periodic(struct clock_event_device *evt)
 	now = hpet_readl(HPET_COUNTER);
 	cmp = now + (unsigned int)delta;
 	cfg = hpet_readl(HPET_Tn_CFG(channel));
-	cfg |= HPET_TN_ENABLE | HPET_TN_PERIODIC | HPET_TN_SETVAL |
-	       HPET_TN_32BIT;
+	cfg |= HPET_TN_ENABLE | HPET_TN_PERIODIC | HPET_TN_32BIT;
 	hpet_writel(cfg, HPET_Tn_CFG(channel));
-	hpet_writel(cmp, HPET_Tn_CMP(channel));
-	udelay(1);
-	/*
-	 * HPET on AMD 81xx needs a second write (with HPET_TN_SETVAL
-	 * cleared) to T0_CMP to set the period. The HPET_TN_SETVAL
-	 * bit is automatically cleared after the first write.
-	 * (See AMD-8111 HyperTransport I/O Hub Data Sheet,
-	 * Publication # 24674)
-	 */
-	hpet_writel((unsigned int)delta, HPET_Tn_CMP(channel));
+
+	hpet_set_comparator_periodic(channel, cmp, (unsigned int)delta);
+
 	hpet_start_counter();
 	hpet_print_config();
 
-- 
2.17.1



* [RFC PATCH v5 03/16] x86/hpet: Reserve an HPET channel for the hardlockup detector
  2021-05-04 19:05 [RFC PATCH v5 00/16] x86: Implement an HPET-based hardlockup detector Ricardo Neri
  2021-05-04 19:05 ` [RFC PATCH v5 01/16] x86/hpet: Expose hpet_writel() in header Ricardo Neri
  2021-05-04 19:05 ` [RFC PATCH v5 02/16] x86/hpet: Add helper function hpet_set_comparator_periodic() Ricardo Neri
@ 2021-05-04 19:05 ` Ricardo Neri
  2021-05-04 19:05 ` [RFC PATCH v5 04/16] watchdog/hardlockup: Define a generic function to detect hardlockups Ricardo Neri
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 18+ messages in thread
From: Ricardo Neri @ 2021-05-04 19:05 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: H. Peter Anvin, Ashok Raj, Andi Kleen, Tony Luck,
	Nicholas Piggin, Peter Zijlstra (Intel),
	Andrew Morton, Stephane Eranian, Suravee Suthikulpanit,
	Ravi V. Shankar, Ricardo Neri, x86, linux-kernel, Ricardo Neri,
	Andi Kleen

The HPET hardlockup detector needs a dedicated HPET channel. Hence, create
a new HPET_MODE_NMI_WATCHDOG mode category to indicate that it cannot be
used for other purposes. Using MSI interrupts greatly simplifies the
implementation of the detector. Specifically, it helps to avoid the
complexities of routing the interrupt via the IO-APIC (e.g., potential
race conditions that arise from re-programming the IO-APIC while also
servicing an NMI). Therefore, only reserve the timer if it supports Front
Side Bus interrupt delivery.

HPET channels are reserved at various stages. First, from
x86_late_time_init(), hpet_time_init() checks if the HPET timer supports
Legacy Replacement Routing. If this is the case, channels 0 and 1 are
reserved as HPET_MODE_LEGACY.

At a later stage, from lockup_detector_init(), reserve the HPET channel
for the hardlockup detector. Then, the HPET clocksource reserves the
channels it needs and then the remaining channels are given to the HPET
char driver via hpet_alloc().

Hence, the channel assigned to the HPET hardlockup detector depends on
whether the first two channels are reserved for legacy mode.

Lastly, only reserve the channel for the hardlockup detector if enabled
in the kernel command line.
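
The selection policy described above can be sketched on a toy channel table:
take the first unused channel and give up if none exists or if that channel
lacks FSB delivery. The types and names below are hypothetical stand-ins and
do not match the kernel's data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum toy_mode {
	TOY_MODE_UNUSED,
	TOY_MODE_LEGACY,
	TOY_MODE_CLOCKEVT,
	TOY_MODE_NMI_WATCHDOG,
};

#define TOY_TN_FSB_CAP	0x8000u	/* illustrative FSB capability bit */

struct toy_hpet_channel {
	enum toy_mode mode;
	uint32_t boot_cfg;
};

/* Return the index of the reserved channel, or -1 if none qualifies. */
static int toy_reserve_hld_channel(struct toy_hpet_channel *chans, size_t n)
{
	size_t i;

	assert(chans || n == 0);

	/* Find the first channel that nobody has claimed yet. */
	for (i = 0; i < n; i++) {
		if (chans[i].mode == TOY_MODE_UNUSED)
			break;
	}

	if (i == n)
		return -1;

	/* The detector relies on FSB (MSI) delivery. */
	if (!(chans[i].boot_cfg & TOY_TN_FSB_CAP))
		return -1;

	chans[i].mode = TOY_MODE_NMI_WATCHDOG;
	return (int)i;
}
```

When channels 0 and 1 are already marked as legacy, the sketch picks channel
2, which illustrates why the assigned channel depends on legacy replacement
routing.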

Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: x86@kernel.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v4:
 * Reworked timer reservation to use Thomas' rework on HPET channel
   management.
 * Removed hard-coded channel number for the hardlockup detector.
 * Provided more details on the sequence of HPET channel reservations.
   (Thomas Gleixner)
 * Only reserve a channel for the hardlockup detector if enabled via
   kernel command line. The function reserving the channel is called from
   hardlockup detector. (Thomas Gleixner)
 * Shorten the name of hpet_hardlockup_detector_get_timer() to
   hpet_hld_get_timer(). (Andi)
 * Simplify error handling when a channel is not found. (Tony)

Changes since v3:
 * None

Changes since v2:
 * None

Changes since v1:
 * None
---
 arch/x86/include/asm/hpet.h | 18 +++++++++
 arch/x86/kernel/hpet.c      | 73 +++++++++++++++++++++++++++++++++++++
 2 files changed, 91 insertions(+)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index 486e001413c7..f1e41c11c29f 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -95,6 +95,24 @@ extern void hpet_unregister_irq_handler(rtc_irq_handler handler);
 
 #endif /* CONFIG_HPET_EMULATE_RTC */
 
+#ifdef CONFIG_X86_HARDLOCKUP_DETECTOR_HPET
+/**
+ * struct hpet_hld_data - Data needed to operate the detector
+ * @has_periodic:		The HPET channel supports periodic mode
+ * @channel:			HPET channel assigned to the detector
+ * @ticks_per_second:		Frequency of the HPET timer
+ * @irq:			IRQ number assigned to the HPET channel
+ */
+struct hpet_hld_data {
+	bool		has_periodic;
+	u32		channel;
+	u64		ticks_per_second;
+	int		irq;
+};
+
+extern struct hpet_hld_data *hpet_hld_get_timer(void);
+#endif /* CONFIG_X86_HARDLOCKUP_DETECTOR_HPET */
+
 #else /* CONFIG_HPET_TIMER */
 
 static inline int hpet_enable(void) { return 0; }
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index 8be1d3d9162e..5012590dc1b8 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -19,6 +19,7 @@ enum hpet_mode {
 	HPET_MODE_LEGACY,
 	HPET_MODE_CLOCKEVT,
 	HPET_MODE_DEVICE,
+	HPET_MODE_NMI_WATCHDOG,
 };
 
 struct hpet_channel {
@@ -215,6 +216,7 @@ static void __init hpet_reserve_platform_timers(void)
 			break;
 		case HPET_MODE_CLOCKEVT:
 		case HPET_MODE_LEGACY:
+		case HPET_MODE_NMI_WATCHDOG:
 			hpet_reserve_timer(&hd, hc->num);
 			break;
 		}
@@ -1408,4 +1410,75 @@ irqreturn_t hpet_rtc_interrupt(int irq, void *dev_id)
 	return IRQ_HANDLED;
 }
 EXPORT_SYMBOL_GPL(hpet_rtc_interrupt);
+
+#ifdef CONFIG_X86_HARDLOCKUP_DETECTOR_HPET
+static struct hpet_hld_data *hld_data;
+
+/**
+ * hpet_hld_get_timer - Get an HPET channel for the hardlockup detector
+ *
+ * Reserve an HPET channel and return the timer information to caller only if a
+ * channel is available and supports FSB mode. This function is called by the
+ * hardlockup detector only if enabled in the kernel command line.
+ *
+ * Returns: a pointer to the timer data on success, NULL otherwise
+ */
+struct hpet_hld_data *hpet_hld_get_timer(void)
+{
+	struct hpet_channel *hc = hpet_base.channels;
+	int i, irq;
+
+	for (i = 0; i < hpet_base.nr_channels; i++) {
+		hc = hpet_base.channels + i;
+
+		/*
+		 * Associate the first unused channel to the hardlockup
+		 * detector. Bailout if we cannot find one. This may happen if
+		 * the HPET clocksource has taken all the timers. The HPET driver
+		 * (/dev/hpet) should not take timers at this point as channels
+		 * for such driver can only be reserved from user space.
+		 */
+		if (hc->mode == HPET_MODE_UNUSED)
+			break;
+	}
+
+	if (i == hpet_base.nr_channels)
+		return NULL;
+
+	if (!(hc->boot_cfg & HPET_TN_FSB_CAP))
+		return NULL;
+
+	hld_data = kzalloc(sizeof(*hld_data), GFP_KERNEL);
+	if (!hld_data)
+		return NULL;
+
+	if (hc->boot_cfg & HPET_TN_PERIODIC_CAP)
+		hld_data->has_periodic = true;
+
+	hld_data->channel = i;
+	hld_data->ticks_per_second = hpet_freq;
+
+	if (!hpet_domain)
+		hpet_domain = hpet_create_irq_domain(hpet_blockid);
+
+	if (!hpet_domain)
+		goto err;
+
+	hc->mode = HPET_MODE_NMI_WATCHDOG;
+	irq = hpet_assign_irq(hpet_domain, hc, hc->num);
+	if (irq <= 0)
+		goto err;
+
+	hc->irq = irq;
+	hld_data->irq = irq;
+	return hld_data;
+
+err:
+	hc->mode = HPET_MODE_UNUSED;
+	kfree(hld_data);
+	hld_data = NULL;
+	return NULL;
+}
+#endif /* CONFIG_X86_HARDLOCKUP_DETECTOR_HPET */
+
 #endif
-- 
2.17.1



* [RFC PATCH v5 04/16] watchdog/hardlockup: Define a generic function to detect hardlockups
  2021-05-04 19:05 [RFC PATCH v5 00/16] x86: Implement an HPET-based hardlockup detector Ricardo Neri
                   ` (2 preceding siblings ...)
  2021-05-04 19:05 ` [RFC PATCH v5 03/16] x86/hpet: Reserve an HPET channel for the hardlockup detector Ricardo Neri
@ 2021-05-04 19:05 ` Ricardo Neri
  2021-05-04 19:05 ` [RFC PATCH v5 05/16] watchdog/hardlockup: Decouple the hardlockup detector from perf Ricardo Neri
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 18+ messages in thread
From: Ricardo Neri @ 2021-05-04 19:05 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: H. Peter Anvin, Ashok Raj, Andi Kleen, Tony Luck,
	Nicholas Piggin, Peter Zijlstra (Intel),
	Andrew Morton, Stephane Eranian, Suravee Suthikulpanit,
	Ravi V. Shankar, Ricardo Neri, x86, linux-kernel, Ricardo Neri,
	Andi Kleen, linuxppc-dev

The procedure to detect hardlockups is independent of the underlying
mechanism that generates the non-maskable interrupt used to drive the
detector. Thus, it can be put in a separate, generic function. In this
manner, it can be invoked by various implementations of the NMI watchdog.

For this purpose, move the bulk of watchdog_overflow_callback() to the
new function inspect_for_hardlockups(). This function can then be called
from the applicable NMI handlers.
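
As a rough user-space illustration of this split (not the kernel code), the
sketch below models one generic inspection routine called from two
source-specific handlers. All names ending in _sketch, the counters, and the
"no timer tick since the last check" lockup test are simplifying assumptions
made for this example only.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct pt_regs; not the kernel type. */
struct pt_regs_sketch {
	unsigned long ip;
};

/* Simplified model of per-CPU watchdog state (assumption). */
static bool nmi_touch;                   /* models watchdog_nmi_touch */
static unsigned long hrtimer_interrupts; /* advances while the CPU is alive */
static unsigned long saved_interrupts;   /* value seen at the last check */
static int lockups_reported;

/* Generic part: does not care which source raised the NMI. */
void inspect_for_hardlockups_sketch(struct pt_regs_sketch *regs)
{
	assert(regs);

	/* A touched watchdog skips exactly one check. */
	if (nmi_touch) {
		nmi_touch = false;
		return;
	}

	/* No tick since the previous check: treat it as a lockup. */
	if (hrtimer_interrupts == saved_interrupts) {
		lockups_reported++;
		return;
	}

	saved_interrupts = hrtimer_interrupts;
}

/* Source-specific wrapper: perf keeps its own housekeeping here. */
void perf_overflow_callback_sketch(struct pt_regs_sketch *regs)
{
	/* e.g., reset throttling state before the generic check */
	inspect_for_hardlockups_sketch(regs);
}

/* Source-specific wrapper: another NMI source reuses the same check. */
void timer_nmi_handler_sketch(struct pt_regs_sketch *regs)
{
	inspect_for_hardlockups_sketch(regs);
}
```

Calling either wrapper exercises the same detection logic; only the
source-specific bookkeeping differs between them.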

Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: x86@kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v4:
 * None

Changes since v3:
 * None

Changes since v2:
 * None

Changes since v1:
 * None
---
 include/linux/nmi.h   |  1 +
 kernel/watchdog_hld.c | 18 +++++++++++-------
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index 750c7f395ca9..1b68f48ad440 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -207,6 +207,7 @@ int proc_nmi_watchdog(struct ctl_table *, int , void *, size_t *, loff_t *);
 int proc_soft_watchdog(struct ctl_table *, int , void *, size_t *, loff_t *);
 int proc_watchdog_thresh(struct ctl_table *, int , void *, size_t *, loff_t *);
 int proc_watchdog_cpumask(struct ctl_table *, int, void *, size_t *, loff_t *);
+void inspect_for_hardlockups(struct pt_regs *regs);
 
 #ifdef CONFIG_HAVE_ACPI_APEI_NMI
 #include <asm/nmi.h>
diff --git a/kernel/watchdog_hld.c b/kernel/watchdog_hld.c
index 247bf0b1582c..b352e507b17f 100644
--- a/kernel/watchdog_hld.c
+++ b/kernel/watchdog_hld.c
@@ -106,14 +106,8 @@ static struct perf_event_attr wd_hw_attr = {
 	.disabled	= 1,
 };
 
-/* Callback function for perf event subsystem */
-static void watchdog_overflow_callback(struct perf_event *event,
-				       struct perf_sample_data *data,
-				       struct pt_regs *regs)
+void inspect_for_hardlockups(struct pt_regs *regs)
 {
-	/* Ensure the watchdog never gets throttled */
-	event->hw.interrupts = 0;
-
 	if (__this_cpu_read(watchdog_nmi_touch) == true) {
 		__this_cpu_write(watchdog_nmi_touch, false);
 		return;
@@ -163,6 +157,16 @@ static void watchdog_overflow_callback(struct perf_event *event,
 	return;
 }
 
+/* Callback function for perf event subsystem */
+static void watchdog_overflow_callback(struct perf_event *event,
+				       struct perf_sample_data *data,
+				       struct pt_regs *regs)
+{
+	/* Ensure the watchdog never gets throttled */
+	event->hw.interrupts = 0;
+	inspect_for_hardlockups(regs);
+}
+
 static int hardlockup_detector_event_create(void)
 {
 	unsigned int cpu = smp_processor_id();
-- 
2.17.1



* [RFC PATCH v5 05/16] watchdog/hardlockup: Decouple the hardlockup detector from perf
  2021-05-04 19:05 [RFC PATCH v5 00/16] x86: Implement an HPET-based hardlockup detector Ricardo Neri
                   ` (3 preceding siblings ...)
  2021-05-04 19:05 ` [RFC PATCH v5 04/16] watchdog/hardlockup: Define a generic function to detect hardlockups Ricardo Neri
@ 2021-05-04 19:05 ` Ricardo Neri
  2021-05-04 19:05 ` [RFC PATCH v5 06/16] x86/nmi: Add an NMI_WATCHDOG NMI handler category Ricardo Neri
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 18+ messages in thread
From: Ricardo Neri @ 2021-05-04 19:05 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: H. Peter Anvin, Ashok Raj, Andi Kleen, Tony Luck,
	Nicholas Piggin, Peter Zijlstra (Intel),
	Andrew Morton, Stephane Eranian, Suravee Suthikulpanit,
	Ravi V. Shankar, Ricardo Neri, x86, linux-kernel, Ricardo Neri,
	Andi Kleen, linuxppc-dev

The current default implementation of the hardlockup detector assumes that
it is implemented using perf events. However, the hardlockup detector can
be driven by other sources of non-maskable interrupts (e.g., a properly
configured timer).

Group and wrap in #ifdef CONFIG_HARDLOCKUP_DETECTOR_PERF all the code
specific to perf: creating and managing perf events, and stopping and
starting the perf-based detector.

The generic portion of the detector (monitor the timers' thresholds, check
timestamps and detect hardlockups as well as the implementation of
arch_touch_nmi_watchdog()) is now selected with the new intermediate config
symbol CONFIG_HARDLOCKUP_DETECTOR_CORE.

The perf-based implementation of the detector selects the new intermediate
symbol. Other implementations should do the same.

Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: x86@kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v4:
 * None

Changes since v3:
 * Squashed into this patch a previous patch to make
   arch_touch_nmi_watchdog() part of the core detector code.

Changes since v2:
 * Undid split of the generic hardlockup detector into a separate file.
   (Thomas Gleixner)
 * Added a new intermediate symbol CONFIG_HARDLOCKUP_DETECTOR_CORE to
   select generic parts of the detector (Paul E. McKenney,
   Thomas Gleixner).

Changes since v1:
 * Built the generic detector code with CONFIG_HARDLOCKUP_DETECTOR.
---
 include/linux/nmi.h   |  5 ++++-
 kernel/Makefile       |  2 +-
 kernel/watchdog_hld.c | 32 ++++++++++++++++++++------------
 lib/Kconfig.debug     |  4 ++++
 4 files changed, 29 insertions(+), 14 deletions(-)

diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index 1b68f48ad440..cf12380e51b3 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -94,8 +94,11 @@ static inline void hardlockup_detector_disable(void) {}
 # define NMI_WATCHDOG_SYSCTL_PERM	0444
 #endif
 
-#if defined(CONFIG_HARDLOCKUP_DETECTOR_PERF)
+#if defined(CONFIG_HARDLOCKUP_DETECTOR_CORE)
 extern void arch_touch_nmi_watchdog(void);
+#endif
+
+#if defined(CONFIG_HARDLOCKUP_DETECTOR_PERF)
 extern void hardlockup_detector_perf_stop(void);
 extern void hardlockup_detector_perf_restart(void);
 extern void hardlockup_detector_perf_disable(void);
diff --git a/kernel/Makefile b/kernel/Makefile
index e8a6715f38dc..03ac041abfff 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -95,7 +95,7 @@ obj-$(CONFIG_FAIL_FUNCTION) += fail_function.o
 obj-$(CONFIG_KGDB) += debug/
 obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
 obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
-obj-$(CONFIG_HARDLOCKUP_DETECTOR_PERF) += watchdog_hld.o
+obj-$(CONFIG_HARDLOCKUP_DETECTOR_CORE) += watchdog_hld.o
 obj-$(CONFIG_SECCOMP) += seccomp.o
 obj-$(CONFIG_RELAY) += relay.o
 obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
diff --git a/kernel/watchdog_hld.c b/kernel/watchdog_hld.c
index b352e507b17f..bb6435978c46 100644
--- a/kernel/watchdog_hld.c
+++ b/kernel/watchdog_hld.c
@@ -22,12 +22,8 @@
 
 static DEFINE_PER_CPU(bool, hard_watchdog_warn);
 static DEFINE_PER_CPU(bool, watchdog_nmi_touch);
-static DEFINE_PER_CPU(struct perf_event *, watchdog_ev);
-static DEFINE_PER_CPU(struct perf_event *, dead_event);
-static struct cpumask dead_events_mask;
 
 static unsigned long hardlockup_allcpu_dumped;
-static atomic_t watchdog_cpus = ATOMIC_INIT(0);
 
 notrace void arch_touch_nmi_watchdog(void)
 {
@@ -98,14 +94,6 @@ static inline bool watchdog_check_timestamp(void)
 }
 #endif
 
-static struct perf_event_attr wd_hw_attr = {
-	.type		= PERF_TYPE_HARDWARE,
-	.config		= PERF_COUNT_HW_CPU_CYCLES,
-	.size		= sizeof(struct perf_event_attr),
-	.pinned		= 1,
-	.disabled	= 1,
-};
-
 void inspect_for_hardlockups(struct pt_regs *regs)
 {
 	if (__this_cpu_read(watchdog_nmi_touch) == true) {
@@ -157,6 +145,24 @@ void inspect_for_hardlockups(struct pt_regs *regs)
 	return;
 }
 
+#ifdef CONFIG_HARDLOCKUP_DETECTOR_PERF
+#undef pr_fmt
+#define pr_fmt(fmt) "NMI perf watchdog: " fmt
+
+static DEFINE_PER_CPU(struct perf_event *, watchdog_ev);
+static DEFINE_PER_CPU(struct perf_event *, dead_event);
+static struct cpumask dead_events_mask;
+
+static atomic_t watchdog_cpus = ATOMIC_INIT(0);
+
+static struct perf_event_attr wd_hw_attr = {
+	.type		= PERF_TYPE_HARDWARE,
+	.config		= PERF_COUNT_HW_CPU_CYCLES,
+	.size		= sizeof(struct perf_event_attr),
+	.pinned		= 1,
+	.disabled	= 1,
+};
+
 /* Callback function for perf event subsystem */
 static void watchdog_overflow_callback(struct perf_event *event,
 				       struct perf_sample_data *data,
@@ -298,3 +304,5 @@ int __init hardlockup_detector_perf_init(void)
 	}
 	return ret;
 }
+
+#endif /* CONFIG_HARDLOCKUP_DETECTOR_PERF */
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 5918e1686736..08332b5ad4fe 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1021,9 +1021,13 @@ config BOOTPARAM_SOFTLOCKUP_PANIC_VALUE
 	default 0 if !BOOTPARAM_SOFTLOCKUP_PANIC
 	default 1 if BOOTPARAM_SOFTLOCKUP_PANIC
 
+config HARDLOCKUP_DETECTOR_CORE
+	bool
+
 config HARDLOCKUP_DETECTOR_PERF
 	bool
 	select SOFTLOCKUP_DETECTOR
+	select HARDLOCKUP_DETECTOR_CORE
 
 #
 # Enables a timestamp based low pass filter to compensate for perf based
-- 
2.17.1



* [RFC PATCH v5 06/16] x86/nmi: Add an NMI_WATCHDOG NMI handler category
  2021-05-04 19:05 [RFC PATCH v5 00/16] x86: Implement an HPET-based hardlockup detector Ricardo Neri
                   ` (4 preceding siblings ...)
  2021-05-04 19:05 ` [RFC PATCH v5 05/16] watchdog/hardlockup: Decouple the hardlockup detector from perf Ricardo Neri
@ 2021-05-04 19:05 ` Ricardo Neri
  2021-05-04 19:05 ` [RFC PATCH v5 07/16] x86/watchdog/hardlockup: Add an HPET-based hardlockup detector Ricardo Neri
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 18+ messages in thread
From: Ricardo Neri @ 2021-05-04 19:05 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: H. Peter Anvin, Ashok Raj, Andi Kleen, Tony Luck,
	Nicholas Piggin, Peter Zijlstra (Intel),
	Andrew Morton, Stephane Eranian, Suravee Suthikulpanit,
	Ravi V. Shankar, Ricardo Neri, x86, linux-kernel, Ricardo Neri,
	Andi Kleen

Add NMI_WATCHDOG as a new category of NMI handler. This new category
is to be used with the HPET-based hardlockup detector. This detector
does not have a direct way of checking if the HPET timer is the source of
the NMI. Instead, it indirectly estimates it using the time-stamp counter.

Therefore, we may have false positives in case another NMI occurs within
the estimated time window. For this reason, we want the handler of the
detector to be called after all the NMI_LOCAL handlers. A simple way
of achieving this is with a new NMI handler category.
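
The ordering argument can be illustrated with a small user-space model. This
is only a sketch under stated assumptions: the category struct, the handler
names, and the two flags are hypothetical, and the real default_do_nmi() also
handles external NMI reasons between the two steps shown here.

```c
#include <assert.h>
#include <stdbool.h>

enum { NMI_DONE_SKETCH = 0, NMI_HANDLED_SKETCH = 1 };

typedef int (*nmi_handler_fn)(void);

/* Hypothetical, fixed-size model of one handler category. */
struct nmi_category_sketch {
	nmi_handler_fn handlers[4];
	int count;
};

/* A local source that can positively identify its own NMI. */
static bool pmu_nmi_pending;
/* The watchdog can only estimate: "are we inside the expected window?" */
static bool within_hpet_window;
static int watchdog_calls;

int local_handler_sketch(void)
{
	if (pmu_nmi_pending) {
		pmu_nmi_pending = false;
		return 1;
	}
	return 0;
}

int watchdog_handler_sketch(void)
{
	watchdog_calls++;
	return within_hpet_window ? 1 : 0;
}

/* Run every handler of a category; return how many claimed the NMI. */
static int run_category(const struct nmi_category_sketch *cat)
{
	int handled = 0;

	for (int i = 0; i < cat->count; i++)
		handled += cat->handlers[i]();

	return handled;
}

/* Local handlers go first; the watchdog only sees unclaimed NMIs. */
int dispatch_nmi_sketch(const struct nmi_category_sketch *local,
			const struct nmi_category_sketch *watchdog)
{
	assert(local && watchdog);

	if (run_category(local) > 0)
		return NMI_HANDLED_SKETCH;

	if (run_category(watchdog) > 0)
		return NMI_HANDLED_SKETCH;

	return NMI_DONE_SKETCH;
}
```

With this ordering, an NMI that a local handler can positively identify never
reaches the watchdog handler, even if it arrives inside the estimated window.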

Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: x86@kernel.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v4:
 * None

Changes since v3:
 * None

Changes since v2:
 * Introduced this patch.

Changes since v1:
 * N/A
---
 arch/x86/include/asm/nmi.h |  1 +
 arch/x86/kernel/nmi.c      | 10 ++++++++++
 2 files changed, 11 insertions(+)

diff --git a/arch/x86/include/asm/nmi.h b/arch/x86/include/asm/nmi.h
index 1cb9c17a4cb4..4a0d5b562c91 100644
--- a/arch/x86/include/asm/nmi.h
+++ b/arch/x86/include/asm/nmi.h
@@ -28,6 +28,7 @@ enum {
 	NMI_UNKNOWN,
 	NMI_SERR,
 	NMI_IO_CHECK,
+	NMI_WATCHDOG,
 	NMI_MAX
 };
 
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index bf250a339655..5016bc45e16c 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -61,6 +61,10 @@ static struct nmi_desc nmi_desc[NMI_MAX] =
 		.lock = __RAW_SPIN_LOCK_UNLOCKED(&nmi_desc[3].lock),
 		.head = LIST_HEAD_INIT(nmi_desc[3].head),
 	},
+	{
+		.lock = __RAW_SPIN_LOCK_UNLOCKED(&nmi_desc[4].lock),
+		.head = LIST_HEAD_INIT(nmi_desc[4].head),
+	},
 
 };
 
@@ -168,6 +172,8 @@ int __register_nmi_handler(unsigned int type, struct nmiaction *action)
 	 */
 	WARN_ON_ONCE(type == NMI_SERR && !list_empty(&desc->head));
 	WARN_ON_ONCE(type == NMI_IO_CHECK && !list_empty(&desc->head));
+	WARN_ON_ONCE(type == NMI_WATCHDOG && !list_empty(&desc->head));
+
 
 	/*
 	 * some handlers need to be executed first otherwise a fake
@@ -380,6 +386,10 @@ static noinstr void default_do_nmi(struct pt_regs *regs)
 	}
 	raw_spin_unlock(&nmi_reason_lock);
 
+	handled = nmi_handle(NMI_WATCHDOG, regs);
+	if (handled == NMI_HANDLED)
+		return;
+
 	/*
 	 * Only one NMI can be latched at a time.  To handle
 	 * this we may process multiple nmi handlers at once to
-- 
2.17.1



* [RFC PATCH v5 07/16] x86/watchdog/hardlockup: Add an HPET-based hardlockup detector
  2021-05-04 19:05 [RFC PATCH v5 00/16] x86: Implement an HPET-based hardlockup detector Ricardo Neri
                   ` (5 preceding siblings ...)
  2021-05-04 19:05 ` [RFC PATCH v5 06/16] x86/nmi: Add an NMI_WATCHDOG NMI handler category Ricardo Neri
@ 2021-05-04 19:05 ` Ricardo Neri
  2021-05-04 20:53   ` Thomas Gleixner
  2021-05-04 19:05 ` [RFC PATCH v5 08/16] x86/watchdog/hardlockup/hpet: Introduce a target_cpumask Ricardo Neri
                   ` (8 subsequent siblings)
  15 siblings, 1 reply; 18+ messages in thread
From: Ricardo Neri @ 2021-05-04 19:05 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: H. Peter Anvin, Ashok Raj, Andi Kleen, Tony Luck,
	Nicholas Piggin, Peter Zijlstra (Intel),
	Andrew Morton, Stephane Eranian, Suravee Suthikulpanit,
	Ravi V. Shankar, Ricardo Neri, x86, linux-kernel, Ricardo Neri,
	Andi Kleen

Implement a hardlockup detector that uses the HPET timer as the source of
the non-maskable interrupt. Implement the basic functionality to start
and stop the timer and configure it to issue an MSI interrupt (write
directly to the HPET MSI registers instead of using the interrupt
subsystem). Hence, the detector relies on an HPET channel capable of Front
Side Bus interrupts.

A single CPU, the handling CPU, will service the non-maskable interrupt
from the HPET timer. Then, it will issue an inter-processor interrupt
to the rest of the CPUs monitored by the detector.
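
The bookkeeping behind this scheme can be modeled in user space with plain
bitmasks. The sketch below is an illustration only: it assumes at most 64
CPUs, replaces cpumask_var_t with uint64_t, and all names ending in _sketch
are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bitmask model of the detector state (at most 64 CPUs). */
struct hld_sketch {
	uint64_t monitored_mask; /* CPUs watched by the detector */
	uint64_t ipi_mask;       /* CPUs that still owe an IPI-driven check */
	unsigned int handling_cpu;
};

/*
 * Runs on the handling CPU when the timer NMI arrives. Returns the set of
 * CPUs that would receive the NMI IPI: all monitored CPUs except self.
 */
uint64_t on_timer_nmi_sketch(struct hld_sketch *h)
{
	assert(h->handling_cpu < 64);

	h->ipi_mask = h->monitored_mask & ~(UINT64_C(1) << h->handling_cpu);
	return h->ipi_mask;
}

/*
 * Runs on any CPU that receives an NMI. Returns true only if this CPU was
 * expecting the detector's IPI, and clears its bit so that a later,
 * unrelated NMI is not claimed by mistake.
 */
bool on_ipi_nmi_sketch(struct hld_sketch *h, unsigned int cpu)
{
	uint64_t bit;

	assert(cpu < 64);
	bit = UINT64_C(1) << cpu;

	if (!(h->ipi_mask & bit))
		return false;

	h->ipi_mask &= ~bit;
	return true;
}
```

Keeping a separate IPI mask matters because both the sender and the receivers
write to it, so the mask of monitored CPUs cannot serve both purposes.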

The detector is not functional at this stage. A subsequent changeset will
invoke the interfaces implemented in this changeset and add the
functionality to determine if the HPET timer caused the NMI. For now,
implement a stub function.

HPET registers are only accessed to kick the timer after looking for
hardlockups. If the HPET channel is periodic, there is no need to write
to the HPET registers at all in interrupt context. On all the systems I
inspected, only the first channel of the timer is periodic and it will
usually be reserved for the HPET legacy clockevent. Thus, in most of the
cases the detector will be assigned a non-periodic channel.
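
For a non-periodic channel, kicking the timer amounts to simple arithmetic on
the 32-bit comparator. The sketch below is a user-space illustration, not the
driver code: the function name is hypothetical, and the 14318180 Hz frequency
used in the usage note is only an example value.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Compute the next 32-bit comparator value: the current main counter plus
 * thresh_secs worth of ticks. The sum is allowed to wrap around, as it
 * does for a channel operating in 32-bit mode.
 */
uint32_t next_comparator_sketch(uint32_t count, unsigned int thresh_secs,
				uint64_t ticks_per_second)
{
	uint64_t delta = (uint64_t)thresh_secs * ticks_per_second;

	/* The interval itself must fit in the 32-bit comparator. */
	assert(delta <= UINT32_MAX);

	return (uint32_t)(count + delta);
}
```

For instance, with a 10-second threshold and an assumed frequency of
14318180 Hz, the interval is 143181800 ticks. Starting from a counter value of
0xFFFFFFF0, the new comparator wraps around to 143181784.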

Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: x86@kernel.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v4:
 * Dropped hpet_hld_data.enabled_cpus and instead use cpumask_weight().
 * Renamed hpet_hld_data.cpu_monitored_mask to
   hpet_hld_data.monitored_cpumask and converted it to cpumask_var_t.
 * Flushed out any outstanding interrupt before enabling the HPET channel.
 * Removed unnecessary MSI_DATA_LEVEL_ASSERT from the MSI message.
 * Added comments in hardlockup_detector_nmi_handler() to explain how
   CPUs are targeted for an IPI.
 * Updated code to only issue an IPI when needed (i.e., there are monitored
   CPUs to be inspected via an IPI).
 * Reworked hardlockup_detector_hpet_init() for readability.
 * Now reserve the cpumasks in the hardlockup detector code and not in the
   generic HPET code.
 * Handled the case of watchdog_thresh = 0 when disabling the detector.
 * Made this detector available to i386.
 * Reworked logic to kick the timer to remove a local variable. (Andi)
 * Added a comment on what type of timer channel will be assigned to the
   detector. (Andi)
 * Reworded prompt comment in Kconfig. (Andi)
 * Removed unneeded switch to level interrupt mode when disabling the
   timer. (Andi)
 * Disabled the HPET timer to avoid a race between an incoming interrupt
   and an update of the MSI destination ID. (Ashok)
 * Corrected a typo in an inline comment. (Tony)
 * Made the HPET hardlockup detector depend on HARDLOCKUP_DETECTOR instead
   of selecting it.

Changes since v3:
 * Fixed typo in Kconfig.debug. (Randy Dunlap)
 * Added missing slab.h to include the definition of kfree to fix a build
   break.

Changes since v2:
 * Removed use of struct cpumask in favor of a variable length array in
   conjunction with kzalloc. (Peter Zijlstra)
 * Removed redundant documentation of functions. (Thomas Gleixner)
 * Added CPU as an argument to hardlockup_detector_hpet_enable()/disable().
   (Thomas Gleixner).

Changes since v1:
 * Do not target CPUs in a round-robin manner. Instead, the HPET timer
   always targets the same CPU; other CPUs are monitored via an
   interprocessor interrupt.
 * Dropped support for IO APIC interrupts and instead use only MSI
   interrupts.
 * Removed use of generic irq code to set interrupt affinity and NMI
   delivery. Instead, configure the interrupt directly in HPET registers.
   (Thomas Gleixner)
 * Fixed unconditional return NMI_HANDLED when the HPET timer is
   programmed for FSB/MSI delivery. (Peter Zijlstra)
---
 arch/x86/Kconfig.debug              |  10 +
 arch/x86/include/asm/hpet.h         |  21 ++
 arch/x86/kernel/Makefile            |   1 +
 arch/x86/kernel/watchdog_hld_hpet.c | 384 ++++++++++++++++++++++++++++
 4 files changed, 416 insertions(+)
 create mode 100644 arch/x86/kernel/watchdog_hld_hpet.c

diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index 80b57e7f4947..0731da557a6d 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -117,6 +117,16 @@ config IOMMU_LEAK
 config HAVE_MMIOTRACE_SUPPORT
 	def_bool y
 
+config X86_HARDLOCKUP_DETECTOR_HPET
+	bool "HPET Timer for Hard Lockup Detection"
+	select HARDLOCKUP_DETECTOR_CORE
+	depends on HARDLOCKUP_DETECTOR && HPET_TIMER && HPET && (X86_64 || X86_32)
+	help
+	  The hardlockup detector is driven by one counter of the Performance
+	  Monitoring Unit (PMU) per CPU. Say y to instead drive the
+	  hardlockup detector using a High-Precision Event Timer and make the
+	  PMU counters available for other purposes.
+
 config X86_DECODER_SELFTEST
 	bool "x86 instruction decoder selftest"
 	depends on DEBUG_KERNEL && INSTRUCTION_DECODER
diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index f1e41c11c29f..b1c8d5ce7e13 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -3,6 +3,7 @@
 #define _ASM_X86_HPET_H
 
 #include <linux/msi.h>
+#include <linux/irq_work.h>
 
 #ifdef CONFIG_HPET_TIMER
 
@@ -102,15 +103,35 @@ extern void hpet_unregister_irq_handler(rtc_irq_handler handler);
  * @channel:			HPET channel assigned to the detector
  * @ticks_per_second:		Frequency of the HPET timer
  * @irq:			IRQ number assigned to the HPET channel
+ * @handling_cpu:		CPU handling the HPET interrupt
+ * @msi_msg:			MSI message to be written to the HPET registers
+ * @monitored_cpumask:		CPUs monitored by the hardlockup detector
+ * @ipi_cpumask:		Auxiliary mask to handle IPIs. Both sending and
+ *				receiving CPUs write to it. Hence, we cannot
+ *				reuse @monitored_cpumask.
  */
 struct hpet_hld_data {
 	bool		has_periodic;
 	u32		channel;
 	u64		ticks_per_second;
 	int		irq;
+	u32		handling_cpu;
+	struct msi_msg	msi_msg;
+	cpumask_var_t	monitored_cpumask;
+	cpumask_var_t	ipi_cpumask;
 };
 
 extern struct hpet_hld_data *hpet_hld_get_timer(void);
+extern int hardlockup_detector_hpet_init(void);
+extern void hardlockup_detector_hpet_stop(void);
+extern void hardlockup_detector_hpet_enable(unsigned int cpu);
+extern void hardlockup_detector_hpet_disable(unsigned int cpu);
+#else
+static inline int hardlockup_detector_hpet_init(void)
+{ return -ENODEV; }
+static inline void hardlockup_detector_hpet_stop(void) {}
+static inline void hardlockup_detector_hpet_enable(unsigned int cpu) {}
+static inline void hardlockup_detector_hpet_disable(unsigned int cpu) {}
 #endif /* CONFIG_X86_HARDLOCKUP_DETECTOR_HPET */
 
 #else /* CONFIG_HPET_TIMER */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 0704c2a94272..5d1a90b23577 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -115,6 +115,7 @@ obj-$(CONFIG_VM86)		+= vm86_32.o
 obj-$(CONFIG_EARLY_PRINTK)	+= early_printk.o
 
 obj-$(CONFIG_HPET_TIMER) 	+= hpet.o
+obj-$(CONFIG_X86_HARDLOCKUP_DETECTOR_HPET) += watchdog_hld_hpet.o
 
 obj-$(CONFIG_AMD_NB)		+= amd_nb.o
 obj-$(CONFIG_DEBUG_NMI_SELFTEST) += nmi_selftest.o
diff --git a/arch/x86/kernel/watchdog_hld_hpet.c b/arch/x86/kernel/watchdog_hld_hpet.c
new file mode 100644
index 000000000000..47fc2cb540de
--- /dev/null
+++ b/arch/x86/kernel/watchdog_hld_hpet.c
@@ -0,0 +1,384 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * A hardlockup detector driven by an HPET timer.
+ *
+ * Copyright (C) Intel Corporation 2021
+ *
+ * A hardlockup detector driven by an HPET timer. It implements the same
+ * interfaces as the PMU-based hardlockup detector.
+ *
+ * A single HPET timer is used to monitor all the CPUs from the allowed_mask
+ * from kernel/watchdog.c. The timer expires every watchdog_thresh seconds and
+ * always targets the same CPU, the handling CPU. The handling CPU inspects
+ * itself and issues an NMI IPI to the rest of the monitored CPUs. Thus, every
+ * monitored CPU is inspected every watchdog_thresh seconds.
+ */
+
+#define pr_fmt(fmt) "NMI hpet watchdog: " fmt
+
+#include <linux/nmi.h>
+#include <linux/slab.h>
+
+#include <asm/apic.h>
+#include <asm/hpet.h>
+
+static struct hpet_hld_data *hld_data;
+static bool hardlockup_use_hpet;
+
+/**
+ * kick_timer() - Reprogram timer to expire in the future
+ * @hdata:	A data structure with the timer instance to update
+ * @force:	Force reprogramming
+ *
+ * Reprogram the timer to expire within watchdog_thresh seconds in the future.
+ * If the timer supports periodic mode, it is not kicked unless @force is
+ * true.
+ */
+static void kick_timer(struct hpet_hld_data *hdata, bool force)
+{
+	u64 new_compare, count, period = 0;
+
+	/* kick the timer only when needed */
+	if (!force && hdata->has_periodic)
+		return;
+
+	/*
+	 * Update the comparator in increments of watchdog_thresh seconds
+	 * relative to the current count. Since watchdog_thresh is given in
+	 * seconds, we are able to update the comparator before the counter
+	 * reaches such new value.
+	 *
+	 * Let it wrap around if needed.
+	 */
+
+	count = hpet_readl(HPET_COUNTER);
+	new_compare = count + watchdog_thresh * hdata->ticks_per_second;
+
+	if (!hdata->has_periodic) {
+		hpet_writel(new_compare, HPET_Tn_CMP(hdata->channel));
+		return;
+	}
+
+	period = watchdog_thresh * hdata->ticks_per_second;
+	hpet_set_comparator_periodic(hdata->channel, (u32)new_compare,
+				     (u32)period);
+}
+
+static void disable_timer(struct hpet_hld_data *hdata)
+{
+	u32 v;
+
+	v = hpet_readl(HPET_Tn_CFG(hdata->channel));
+	v &= ~HPET_TN_ENABLE;
+	if (hdata->has_periodic)
+		v &= ~HPET_TN_PERIODIC;
+
+	hpet_writel(v, HPET_Tn_CFG(hdata->channel));
+}
+
+static void enable_timer(struct hpet_hld_data *hdata)
+{
+	u32 v;
+
+	v = hpet_readl(HPET_Tn_CFG(hdata->channel));
+	/* Make sure we flush any outstanding interrupt. */
+	v |= HPET_TN_LEVEL;
+	hpet_writel(v, HPET_Tn_CFG(hdata->channel));
+	hpet_writel(1 << hdata->channel, HPET_STATUS);
+
+	v &= ~HPET_TN_LEVEL;
+	if (hdata->has_periodic)
+		v |= HPET_TN_PERIODIC;
+	else
+		v &= ~HPET_TN_PERIODIC;
+
+	v |= HPET_TN_ENABLE;
+	hpet_writel(v, HPET_Tn_CFG(hdata->channel));
+}
+
+/**
+ * is_hpet_wdt_interrupt() - Check if an HPET timer caused the interrupt
+ * @hdata:	A data structure with the timer instance to enable
+ *
+ * Returns:
+ * True if the HPET watchdog timer caused the interrupt. False otherwise.
+ */
+static bool is_hpet_wdt_interrupt(struct hpet_hld_data *hdata)
+{
+	return false;
+}
+
+/**
+ * compose_msi_msg() - Populate address and data fields of an MSI message
+ * @hdata:	A data structure with the message to populate
+ *
+ * Initialize the fields of the MSI message to deliver an NMI interrupt. This
+ * function only initializes the fields that don't change during the operation
+ * of the detector. This function does not populate the Destination ID, which
+ * should be populated using update_msi_destid().
+ */
+static void compose_msi_msg(struct hpet_hld_data *hdata)
+{
+	struct msi_msg *msg = &hdata->msi_msg;
+
+	memset(msg, 0, sizeof(*msg));
+	/*
+	 * The HPET FSB Interrupt Route register does not have an
+	 * address_hi part.
+	 */
+	msg->address_hi = X86_MSI_BASE_ADDRESS_HIGH;
+	msg->arch_addr_lo.base_address = X86_MSI_BASE_ADDRESS_LOW;
+	msg->arch_addr_lo.dest_mode_logical = apic->dest_mode_logical;
+
+	/*
+	 * Since delivery mode is NMI, no irq vector is needed.
+	 */
+	msg->arch_data.delivery_mode = APIC_DELIVERY_MODE_NMI;
+}
+
+/**
+ * update_msi_destid() - Update APIC destid of handling CPU
+ * @hdata:	A data structure with the MSI message to update
+ *
+ * Update the APIC destid of the MSI message generated by the HPET timer
+ * on expiration.
+ */
+static int update_msi_destid(struct hpet_hld_data *hdata)
+{
+	u32 destid;
+
+	destid = apic->calc_dest_apicid(hdata->handling_cpu);
+	/*
+	 * HPET only supports a 32-bit MSI address register. Thus, only
+	 * 8-bit APIC IDs are supported. Systems with more than 256 CPUs
+	 * should use interrupt remapping.
+	 */
+	WARN_ON_ONCE(destid > 0xff);
+
+	hdata->msi_msg.arch_addr_lo.destid_0_7 = destid & 0xff;
+
+	/*
+	 * Disable the timer to avoid getting an interrupt while updating
+	 * the handling CPU.
+	 */
+	disable_timer(hdata);
+	hpet_writel(hdata->msi_msg.address_lo,
+		    HPET_Tn_ROUTE(hdata->channel) + 4);
+	enable_timer(hdata);
+
+	return 0;
+}
+
+/**
+ * hardlockup_detector_nmi_handler() - NMI Interrupt handler
+ * @type:	Type of NMI handler; not used.
+ * @regs:	Register values as seen when the NMI was asserted
+ *
+ * Check if it was caused by the expiration of the HPET timer. If yes, inspect
+ * for lockups by issuing an IPI to all the monitored CPUs. Also, kick the
+ * timer if it is non-periodic.
+ *
+ * Returns:
+ * NMI_DONE if the HPET timer did not cause the interrupt. NMI_HANDLED
+ * otherwise.
+ */
+static int hardlockup_detector_nmi_handler(unsigned int type,
+					   struct pt_regs *regs)
+{
+	struct hpet_hld_data *hdata = hld_data;
+	int cpu = smp_processor_id();
+
+	if (is_hpet_wdt_interrupt(hdata)) {
+		/*
+		 * Make a copy of the target mask. We need this as once a CPU
+		 * gets the watchdog NMI it will clear itself from ipi_cpumask.
+		 * Also, target_cpumask will be updated in a workqueue for the
+		 * next NMI IPI.
+		 */
+		cpumask_copy(hld_data->ipi_cpumask, hld_data->monitored_cpumask);
+		/*
+		 * Even though the NMI IPI will be sent to all CPUs but self,
+		 * clear the CPU to identify a potential unrelated NMI.
+		 */
+		cpumask_clear_cpu(cpu, hld_data->ipi_cpumask);
+		if (cpumask_weight(hld_data->ipi_cpumask))
+			apic->send_IPI_mask_allbutself(hld_data->ipi_cpumask,
+						       NMI_VECTOR);
+
+		kick_timer(hdata, !(hdata->has_periodic));
+
+		inspect_for_hardlockups(regs);
+
+		return NMI_HANDLED;
+	}
+
+	if (cpumask_test_and_clear_cpu(cpu, hld_data->ipi_cpumask)) {
+		inspect_for_hardlockups(regs);
+		return NMI_HANDLED;
+	}
+
+	return NMI_DONE;
+}
+
+/**
+ * setup_irq_msi_mode() - Configure the timer to deliver an MSI interrupt
+ * @hdata:	Data associated with the instance of the HPET timer to configure
+ *
+ * Configure the HPET timer to deliver interrupts via the Front-
+ * Side Bus.
+ */
+static void setup_irq_msi_mode(struct hpet_hld_data *hdata)
+{
+	u32 v;
+
+	compose_msi_msg(hdata);
+	hpet_writel(hdata->msi_msg.data, HPET_Tn_ROUTE(hdata->channel));
+	hpet_writel(hdata->msi_msg.address_lo,
+		    HPET_Tn_ROUTE(hdata->channel) + 4);
+
+	v = hpet_readl(HPET_Tn_CFG(hdata->channel));
+	v |= HPET_TN_FSB;
+
+	hpet_writel(v, HPET_Tn_CFG(hdata->channel));
+}
+
+/**
+ * setup_hpet_irq() - Configure the interrupt delivery of an HPET timer
+ * @hdata:	Data associated with the instance of the HPET timer to configure
+ *
+ * Configure the interrupt parameters of an HPET timer. If supported, configure
+ * interrupts to be delivered via the Front-Side Bus. Also, install an interrupt
+ * handler.
+ *
+ * Returns:
+ * 0 on success. An error code if setup was unsuccessful.
+ */
+static int setup_hpet_irq(struct hpet_hld_data *hdata)
+{
+	int ret;
+
+	setup_irq_msi_mode(hdata);
+
+	ret = register_nmi_handler(NMI_WATCHDOG,
+				   hardlockup_detector_nmi_handler, 0,
+				   "hpet_hld");
+
+	return ret;
+}
+
+/**
+ * hardlockup_detector_hpet_enable() - Enable the hardlockup detector
+ * @cpu:	CPU Index in which the watchdog will be enabled.
+ *
+ * Enable the hardlockup detector in @cpu. This means adding it to the
+ * cpumask of monitored CPUs and starting the detector if not done
+ * before.
+ */
+void hardlockup_detector_hpet_enable(unsigned int cpu)
+{
+	cpumask_set_cpu(cpu, hld_data->monitored_cpumask);
+
+	/*
+	 * If this is the first CPU on which the detector is enabled,
+	 * start everything.
+	 */
+	if (cpumask_weight(hld_data->monitored_cpumask) == 1) {
+		hld_data->handling_cpu = cpu;
+		update_msi_destid(hld_data);
+		kick_timer(hld_data, true);
+		enable_timer(hld_data);
+	}
+}
+
+/**
+ * hardlockup_detector_hpet_disable() - Disable the hardlockup detector
+ * @cpu:	CPU index in which the watchdog will be disabled
+ *
+ * @cpu is removed from the cpumask of monitored CPUs. If @cpu is also the CPU
+ * handling the timer interrupt, update it to be the next available, monitored,
+ * CPU.
+ */
+void hardlockup_detector_hpet_disable(unsigned int cpu)
+{
+	cpumask_clear_cpu(cpu, hld_data->monitored_cpumask);
+
+	if (hld_data->handling_cpu != cpu)
+		return;
+
+	disable_timer(hld_data);
+	if (!cpumask_weight(hld_data->monitored_cpumask))
+		return;
+
+	/*
+	 * If watchdog_thresh is zero, then the hardlockup detector is being
+	 * disabled.
+	 */
+	if (!watchdog_thresh)
+		return;
+
+	hld_data->handling_cpu = cpumask_first(hld_data->monitored_cpumask);
+	update_msi_destid(hld_data);
+	enable_timer(hld_data);
+}
+
+void hardlockup_detector_hpet_stop(void)
+{
+	disable_timer(hld_data);
+}
+
+/**
+ * hardlockup_detector_hpet_init() - Initialize the hardlockup detector
+ *
+ * Only initialize and configure the detector if an HPET is available on the
+ * system.
+ *
+ * Returns:
+ * 0 on success. An error code if initialization was unsuccessful.
+ */
+int __init hardlockup_detector_hpet_init(void)
+{
+	int ret;
+	u32 v;
+
+	if (!hardlockup_use_hpet)
+		return -ENODEV;
+
+	if (!is_hpet_enabled())
+		return -ENODEV;
+
+	if (check_tsc_unstable())
+		return -ENODEV;
+
+	hld_data = hpet_hld_get_timer();
+	if (!hld_data)
+		return -ENODEV;
+
+	disable_timer(hld_data);
+
+	ret = setup_hpet_irq(hld_data);
+	if (ret)
+		goto err_no_irq;
+
+	if (!zalloc_cpumask_var(&hld_data->monitored_cpumask, GFP_KERNEL))
+		goto err_no_monitored_cpumask;
+
+	if (!zalloc_cpumask_var(&hld_data->ipi_cpumask, GFP_KERNEL))
+		goto err_no_ipi_cpumask;
+
+	v = hpet_readl(HPET_Tn_CFG(hld_data->channel));
+	v |= HPET_TN_32BIT;
+
+	hpet_writel(v, HPET_Tn_CFG(hld_data->channel));
+
+	return ret;
+
+err_no_ipi_cpumask:
+	free_cpumask_var(hld_data->monitored_cpumask);
+err_no_monitored_cpumask:
+	ret = -ENOMEM;
+err_no_irq:
+	kfree(hld_data);
+	hld_data = NULL;
+
+	return ret;
+}
-- 
2.17.1



* [RFC PATCH v5 08/16] x86/watchdog/hardlockup/hpet: Introduce a target_cpumask
  2021-05-04 19:05 [RFC PATCH v5 00/16] x86: Implement an HPET-based hardlockup detector Ricardo Neri
                   ` (6 preceding siblings ...)
  2021-05-04 19:05 ` [RFC PATCH v5 07/16] x86/watchdog/hardlockup: Add an HPET-based hardlockup detector Ricardo Neri
@ 2021-05-04 19:05 ` Ricardo Neri
  2021-05-04 19:05 ` [RFC PATCH v5 09/16] watchdog/hardlockup/hpet: Group packages receiving IPIs when needed Ricardo Neri
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 18+ messages in thread
From: Ricardo Neri @ 2021-05-04 19:05 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: H. Peter Anvin, Ashok Raj, Andi Kleen, Tony Luck,
	Nicholas Piggin, Peter Zijlstra (Intel),
	Andrew Morton, Stephane Eranian, Suravee Suthikulpanit,
	Ravi V. Shankar, Ricardo Neri, x86, linux-kernel, Ricardo Neri,
	Andi Kleen

The HPET hardlockup detector uses struct hpet_hld_data::monitored_cpumask
to keep track of the CPUs that must be monitored for hard lockups. When
the HPET timer expires, the CPU handling the interrupt issues an IPI to
a subset of those monitored CPUs. The IPI will be sent only to the
monitored CPUs that reside in the same package as the CPU handling the
HPET interrupt. Designate this subset of monitored CPUs as target CPUs and
track them using hpet_hld_data::target_cpumask.

In order to eventually check all the monitored CPUs, at each HPET timer
expiration rotate target_cpumask to target the CPUs in the next package.
Also update the CPU handling the next HPET interrupt accordingly. Do all
this from an irq_work queued from the NMI handler.

Changes in the monitored CPUs mask may change the handling CPU or the set
of CPUs targeted for the next HPET interrupt. Thus, update both when the
hardlockup detector is enabled or disabled on any CPU.
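
The rotation described above can be pictured with a small userspace sketch.
It is not the code of this patch: cpumasks are modelled as a 32-bit bitmap,
and NR_CPUS, pkg_of_cpu[], first_cpu_in_next_pkg() and target_mask_for() are
hypothetical names with an assumed 2-package, 8-CPU topology.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 8

/* Hypothetical topology: CPU i lives in package pkg_of_cpu[i]. */
static const int pkg_of_cpu[NR_CPUS] = { 0, 0, 0, 0, 1, 1, 1, 1 };

/*
 * Return the first monitored CPU after @start (wrapping around) that sits
 * in a different package. Return @start if no such CPU is monitored.
 */
static int first_cpu_in_next_pkg(int start, uint32_t monitored)
{
	assert(start >= 0 && start < NR_CPUS);

	for (int step = 1; step < NR_CPUS; step++) {
		int cpu = (start + step) % NR_CPUS;

		if (!(monitored & (1u << cpu)))
			continue;
		if (pkg_of_cpu[cpu] != pkg_of_cpu[start])
			return cpu;
	}
	return start;
}

/* Return the monitored CPUs that share a package with @cpu. */
static uint32_t target_mask_for(int cpu, uint32_t monitored)
{
	uint32_t mask = 0;

	for (int i = 0; i < NR_CPUS; i++)
		if (pkg_of_cpu[i] == pkg_of_cpu[cpu])
			mask |= 1u << i;
	return mask & monitored;
}
```

With all eight CPUs monitored and CPU 0 handling the interrupt, the next
handling CPU in this model is CPU 4 and the next target mask is 0xf0; one
expiration later the rotation returns to CPU 0.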

Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: x86@kernel.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v4:
 * Introduced this patch.
 * Moved configuration of IPI of target CPUs and affinity of the HPET
   interrupt to an irq_work. (Thomas Gleixner)

Changes since v3:
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 arch/x86/include/asm/hpet.h         |   7 ++
 arch/x86/kernel/watchdog_hld_hpet.c | 114 +++++++++++++++++++++++++++-
 2 files changed, 119 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index b1c8d5ce7e13..8aea54f412e0 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -105,10 +105,15 @@ extern void hpet_unregister_irq_handler(rtc_irq_handler handler);
  * @irq:			IRQ number assigned to the HPET channel
  * @handling_cpu:		CPU handling the HPET interrupt
  * @msi_msg:			MSI message to be written it the HPET registers
+ * @affinity_work:		Used to update the affinity of the detector
+ *				interrupts, both IPI and NMI.
  * @monitored_cpumask:		CPUs monitored by the hardlockup detector
  * @ipi_cpumask:		Auxiliary mask to handle IPIs. Both sending and
  *				receiving CPUs write to it. Hence, we cannot
  *				reuse @monitored_cpumask.
+ * @target_cpumask:		Subset of @monitored_cpumask receiving a
+ *				particular IPI upon HPET interrupt. It changes
+ *				based on which CPU handles such interrupt.
  */
 struct hpet_hld_data {
 	bool		has_periodic;
@@ -117,8 +122,10 @@ struct hpet_hld_data {
 	int		irq;
 	u32		handling_cpu;
 	struct msi_msg	msi_msg;
+	struct irq_work	affinity_work;
 	cpumask_var_t	monitored_cpumask;
 	cpumask_var_t	ipi_cpumask;
+	cpumask_var_t	target_cpumask;
 };
 
 extern struct hpet_hld_data *hpet_hld_get_timer(void);
diff --git a/arch/x86/kernel/watchdog_hld_hpet.c b/arch/x86/kernel/watchdog_hld_hpet.c
index 47fc2cb540de..a363f3cd45dd 100644
--- a/arch/x86/kernel/watchdog_hld_hpet.c
+++ b/arch/x86/kernel/watchdog_hld_hpet.c
@@ -169,6 +169,103 @@ static int update_msi_destid(struct hpet_hld_data *hdata)
 	return 0;
 }
 
+/**
+ * get_first_cpu_in_next_pkg() - Find the first CPU in the next package
+ * @start_cpu:	CPU from which we start looking
+ * @hdata:	A data structure with the monitored CPUs mask
+ *
+ * Find the first CPU in the package next to the package of @start_cpu.
+ * If there is only one package in the system, return @start_cpu.
+ */
+static unsigned int get_first_cpu_in_next_pkg(int start_cpu,
+					      struct hpet_hld_data *hdata)
+{
+	u16 this_cpu_pkg_id, next_cpu_pkg_id;
+	int next_cpu = start_cpu;
+
+	if (start_cpu < 0 || start_cpu >= nr_cpu_ids)
+		return -EINVAL;
+
+	if (!cpumask_test_cpu(start_cpu, hdata->monitored_cpumask))
+		return -ENODEV;
+
+	this_cpu_pkg_id = topology_physical_package_id(start_cpu);
+	next_cpu_pkg_id = this_cpu_pkg_id;
+
+	/* If there is only one online package, return @start_cpu */
+	while (this_cpu_pkg_id == next_cpu_pkg_id) {
+		next_cpu = cpumask_next_wrap(next_cpu,
+					     hdata->monitored_cpumask,
+					     nr_cpu_ids,
+					     true);
+		/* Wrapped-around */
+		if (next_cpu >= nr_cpu_ids)
+			continue;
+
+		/* Returned to starting point */
+		next_cpu_pkg_id = topology_physical_package_id(next_cpu);
+		if (next_cpu == start_cpu)
+			break;
+	}
+
+	return next_cpu;
+}
+
+/**
+ * update_ipi_target_cpumask() - Update IPI mask for the next HPET interrupt
+ * @hdata:	 Data structure with the monitored cpumask and handling CPU info
+ *
+ * Update the target_cpumask of @hdata with the set of CPUs to which the
+ * handling_cpu of @hdata will issue an IPI. Normally, the handling_cpu and the
+ * CPUs in the updated target_cpumask are in the same package.
+ */
+static void update_ipi_target_cpumask(struct hpet_hld_data *hdata)
+{
+	int next_cpu, i;
+
+	next_cpu = hld_data->handling_cpu;
+
+	/*
+	 * If we start from an invalid CPU, instead of failing, just use the
+	 * first monitored CPU.
+	 */
+	if (next_cpu < 0 || next_cpu >= nr_cpu_ids)
+		next_cpu = cpumask_first(hdata->monitored_cpumask);
+
+retry:
+	cpumask_clear(hdata->target_cpumask);
+
+	next_cpu = get_first_cpu_in_next_pkg(next_cpu, hdata);
+	if (next_cpu < 0 || next_cpu >= nr_cpu_ids) {
+		/*
+		 * If a CPU in a next package was not identified,
+		 * fallback to the first monitored CPU instead of
+		 * bailing out.
+		 */
+		next_cpu = cpumask_first(hdata->monitored_cpumask);
+		goto retry;
+	}
+
+	/* Select all the CPUs in the same package as @next_cpu */
+	cpumask_or(hdata->target_cpumask, hdata->target_cpumask,
+		   topology_core_cpumask(next_cpu));
+
+	/* Only select the CPUs that need to be monitored */
+	cpumask_and(hdata->target_cpumask, hdata->target_cpumask,
+		    hdata->monitored_cpumask);
+}
+
+static void update_timer_irq_affinity(struct irq_work *work)
+{
+	struct hpet_hld_data *hdata = container_of(work, struct hpet_hld_data,
+						   affinity_work);
+
+	update_ipi_target_cpumask(hdata);
+
+	hdata->handling_cpu = cpumask_first(hdata->target_cpumask);
+	update_msi_destid(hdata);
+}
+
 /**
  * hardlockup_detector_nmi_handler() - NMI Interrupt handler
  * @type:	Type of NMI handler; not used.
@@ -176,7 +273,8 @@ static int update_msi_destid(struct hpet_hld_data *hdata)
  *
  * Check if it was caused by the expiration of the HPET timer. If yes, inspect
  * for lockups by issuing an IPI to all the monitored CPUs. Also, kick the
- * timer if it is non-periodic.
+ * timer if it is non-periodic. Lastly, start IRQ work to update the
+ * target_cpumask.
  *
  * Returns:
  * NMI_DONE if the HPET timer did not cause the interrupt. NMI_HANDLED
@@ -195,7 +293,7 @@ static int hardlockup_detector_nmi_handler(unsigned int type,
 		 * Also, target_cpumask will be updated in a workqueue for the
 		 * next NMI IPI.
 		 */
-		cpumask_copy(hld_data->ipi_cpumask, hld_data->monitored_cpumask);
+		cpumask_copy(hld_data->ipi_cpumask, hld_data->target_cpumask);
 		/*
 		 * Even though the NMI IPI will be sent to all CPUs but self,
 		 * clear the CPU to identify a potential unrelated NMI.
@@ -205,6 +303,7 @@ static int hardlockup_detector_nmi_handler(unsigned int type,
 			apic->send_IPI_mask_allbutself(hld_data->ipi_cpumask,
 						       NMI_VECTOR);
 
+		irq_work_queue(&hdata->affinity_work);
 		kick_timer(hdata, !(hdata->has_periodic));
 
 		inspect_for_hardlockups(regs);
@@ -263,6 +362,7 @@ static int setup_hpet_irq(struct hpet_hld_data *hdata)
 				   hardlockup_detector_nmi_handler, 0,
 				   "hpet_hld");
 
+	init_irq_work(&hdata->affinity_work, update_timer_irq_affinity);
 	return ret;
 }
 
@@ -278,6 +378,8 @@ void hardlockup_detector_hpet_enable(unsigned int cpu)
 {
 	cpumask_set_cpu(cpu, hld_data->monitored_cpumask);
 
+	update_ipi_target_cpumask(hld_data);
+
 	/*
 	 * If this is the first CPU on which the detector is enabled,
 	 * start everything.
@@ -318,6 +420,9 @@ void hardlockup_detector_hpet_disable(unsigned int cpu)
 
 	hld_data->handling_cpu = cpumask_first(hld_data->monitored_cpumask);
 	update_msi_destid(hld_data);
+
+	update_ipi_target_cpumask(hld_data);
+
 	enable_timer(hld_data);
 }
 
@@ -365,6 +470,9 @@ int __init hardlockup_detector_hpet_init(void)
 	if (!zalloc_cpumask_var(&hld_data->ipi_cpumask, GFP_KERNEL))
 		goto err_no_ipi_cpumask;
 
+	if (!zalloc_cpumask_var(&hld_data->target_cpumask, GFP_KERNEL))
+		goto err_no_target_cpumask;
+
 	v = hpet_readl(HPET_Tn_CFG(hld_data->channel));
 	v |= HPET_TN_32BIT;
 
@@ -372,6 +480,8 @@ int __init hardlockup_detector_hpet_init(void)
 
 	return ret;
 
+err_no_target_cpumask:
+	free_cpumask_var(hld_data->ipi_cpumask);
 err_no_ipi_cpumask:
 	free_cpumask_var(hld_data->monitored_cpumask);
 err_no_monitored_cpumask:
-- 
2.17.1



* [RFC PATCH v5 09/16] watchdog/hardlockup/hpet: Group packages receiving IPIs when needed
  2021-05-04 19:05 [RFC PATCH v5 00/16] x86: Implement an HPET-based hardlockup detector Ricardo Neri
                   ` (7 preceding siblings ...)
  2021-05-04 19:05 ` [RFC PATCH v5 08/16] x86/watchdog/hardlockup/hpet: Introduce a target_cpumask Ricardo Neri
@ 2021-05-04 19:05 ` Ricardo Neri
  2021-05-04 19:05 ` [RFC PATCH v5 10/16] watchdog/hardlockup/hpet: Adjust timer expiration on the number of monitored groups Ricardo Neri
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 18+ messages in thread
From: Ricardo Neri @ 2021-05-04 19:05 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: H. Peter Anvin, Ashok Raj, Andi Kleen, Tony Luck,
	Nicholas Piggin, Peter Zijlstra (Intel),
	Andrew Morton, Stephane Eranian, Suravee Suthikulpanit,
	Ravi V. Shankar, Ricardo Neri, x86, linux-kernel, Ricardo Neri,
	Andi Kleen

In order to keep the HPET interrupts of the hardlockup detector at a rate
of one per second or lower, the HPET timer only targets one of
the CPUs monitored by the detector. This is the handling CPU. The rest of
the CPUs are monitored via an IPI issued by the handling CPU.
Furthermore, the monitored CPUs are partitioned into groups. Groups are
targeted by the HPET timer in a round-robin manner. A group is composed of
all the CPUs in a physical package.

There may be situations in which it is not possible to keep the
aforementioned HPET interrupt rate. This may happen if, for instance,
watchdog_thresh is set to 1 second and there is more than one package in
the system. In such a case, the HPET timer would have to expire every
1/nr_packages seconds.

It is possible to keep the HPET timer expiring once per second or less
frequently if the packages receiving the IPI are grouped together. Hence,
in the example above, all packages would be grouped together.

This approach has the drawback of having to issue IPIs across packages.
However, these cases should be rare: only when there are more packages
than the value of watchdog_thresh in seconds.

Implement functionality to use the logic above: when the hardlockup
detector is enabled on a CPU, check if grouping is necessary based on the
value of watchdog_thresh. When updating target_cpumask, do it as many
times as there are packages in the group.
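
As a rough illustration of the grouping rule, the sketch below picks the
smallest number of packages per group for which the number of groups does
not exceed the threshold in seconds. It is a standalone model, not the
patch's setup_cpu_groups(): struct cpu_groups and group_packages() are
hypothetical names, and the sketch assumes that both inputs are at least 1.

```c
#include <assert.h>

struct cpu_groups {
	unsigned int pkgs_per_group;
	unsigned int nr_groups;
};

/* Find the smallest pkgs_per_group such that nr_groups <= thresh_secs. */
static struct cpu_groups group_packages(unsigned int monitored_pkgs,
					unsigned int thresh_secs)
{
	struct cpu_groups g = {
		.pkgs_per_group = 1,
		.nr_groups = monitored_pkgs,
	};

	/* Assumption of this sketch: both inputs are at least 1. */
	assert(monitored_pkgs >= 1 && thresh_secs >= 1);

	while (g.nr_groups > thresh_secs) {
		g.pkgs_per_group++;
		/* Round up so that a partially filled group still counts. */
		g.nr_groups = (monitored_pkgs + g.pkgs_per_group - 1) /
			      g.pkgs_per_group;
	}
	return g;
}
```

For four monitored packages and a threshold of 1 second, this yields a
single group of four packages, which matches the example in the text. With
a threshold of 10 seconds, each package remains its own group.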

Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: x86@kernel.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v4:
 * Introduced this patch.

Changes since v3:
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 arch/x86/include/asm/hpet.h         |  6 +++
 arch/x86/kernel/watchdog_hld_hpet.c | 75 ++++++++++++++++++++++++-----
 2 files changed, 68 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index 8aea54f412e0..bb76f54effe4 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -104,6 +104,10 @@ extern void hpet_unregister_irq_handler(rtc_irq_handler handler);
  * @ticks_per_second:		Frequency of the HPET timer
  * @irq:			IRQ number assigned to the HPET channel
  * @handling_cpu:		CPU handling the HPET interrupt
+ * @pkgs_per_group:		Number of physical packages in a group of CPUs
+ *				receiving an IPI
+ * @nr_groups:			Number of groups into which @monitored_cpumask
+ *				is partitioned
  * @msi_msg:			MSI message to be written it the HPET registers
  * @affinity_work:		Used to update the affinity of the detector
  *				interrupts, both IPI and NMI.
@@ -121,6 +125,8 @@ struct hpet_hld_data {
 	u64		ticks_per_second;
 	int		irq;
 	u32		handling_cpu;
+	u32		pkgs_per_group;
+	u32		nr_groups;
 	struct msi_msg	msi_msg;
 	struct irq_work	affinity_work;
 	cpumask_var_t	monitored_cpumask;
diff --git a/arch/x86/kernel/watchdog_hld_hpet.c b/arch/x86/kernel/watchdog_hld_hpet.c
index a363f3cd45dd..04b354a35e68 100644
--- a/arch/x86/kernel/watchdog_hld_hpet.c
+++ b/arch/x86/kernel/watchdog_hld_hpet.c
@@ -235,26 +235,71 @@ static void update_ipi_target_cpumask(struct hpet_hld_data *hdata)
 retry:
 	cpumask_clear(hdata->target_cpumask);
 
-	next_cpu = get_first_cpu_in_next_pkg(next_cpu, hdata);
-	if (next_cpu < 0 || next_cpu >= nr_cpu_ids) {
-		/*
-		 * If a CPU in a next package was not identified,
-		 * fallback to the first monitored CPU instead of
-		 * bailing out.
-		 */
-		next_cpu = cpumask_first(hdata->monitored_cpumask);
-		goto retry;
+	for (i = 0 ; i < hdata->pkgs_per_group; i++) {
+		next_cpu = get_first_cpu_in_next_pkg(next_cpu, hdata);
+		if (next_cpu < 0 || next_cpu >= nr_cpu_ids) {
+			/*
+			 * If a CPU in a next package was not identified,
+			 * fallback to the first monitored CPU instead of
+			 * bailing out.
+			 */
+			next_cpu = cpumask_first(hdata->monitored_cpumask);
+			goto retry;
+		}
+
+		/* Select all the CPUs in the same package as @next_cpu */
+		cpumask_or(hdata->target_cpumask, hdata->target_cpumask,
+			   topology_core_cpumask(next_cpu));
 	}
 
-	/* Select all the CPUs in the same package as @next_cpu */
-	cpumask_or(hdata->target_cpumask, hdata->target_cpumask,
-		   topology_core_cpumask(next_cpu));
-
 	/* Only select the CPUs that need to be monitored */
 	cpumask_and(hdata->target_cpumask, hdata->target_cpumask,
 		    hdata->monitored_cpumask);
 }
 
+/**
+ * count_monitored_packages() - Count the packages with monitored CPUs
+ * @hdata:	A data structure with the monitored cpumask
+ *
+ * Return the number of packages with at least one CPU in the monitored_cpumask
+ * of @hdata
+ */
+static u32 count_monitored_packages(struct hpet_hld_data *hdata)
+{
+	int c = cpumask_first(hdata->monitored_cpumask);
+	u16 start_id, id;
+	u32 nr_pkgs = 0;
+
+	start_id = topology_physical_package_id(c);
+
+	do {
+		nr_pkgs++;
+		c = get_first_cpu_in_next_pkg(c, hdata);
+		id = topology_physical_package_id(c);
+	} while (start_id != id);
+
+	return nr_pkgs;
+}
+
+static void setup_cpu_groups(struct hpet_hld_data *hdata)
+{
+	u32 monitored_pkgs = count_monitored_packages(hdata);
+
+	hdata->pkgs_per_group = 0;
+	hdata->nr_groups = U32_MAX;
+
+	/*
+	 * To keep the HPET timer firing once per second or less frequently,
+	 * the condition watchdog_thresh >= nr_groups must be met. Thus,
+	 * group together one or more packages until such condition is reached.
+	 */
+	while (watchdog_thresh < hdata->nr_groups) {
+		hdata->pkgs_per_group++;
+		hdata->nr_groups = DIV_ROUND_UP(monitored_pkgs,
+						hdata->pkgs_per_group);
+	}
+}
+
 static void update_timer_irq_affinity(struct irq_work *work)
 {
 	struct hpet_hld_data *hdata = container_of(work, struct hpet_hld_data,
@@ -378,6 +423,8 @@ void hardlockup_detector_hpet_enable(unsigned int cpu)
 {
 	cpumask_set_cpu(cpu, hld_data->monitored_cpumask);
 
+	setup_cpu_groups(hld_data);
+
 	update_ipi_target_cpumask(hld_data);
 
 	/*
@@ -421,6 +468,8 @@ void hardlockup_detector_hpet_disable(unsigned int cpu)
 	hld_data->handling_cpu = cpumask_first(hld_data->monitored_cpumask);
 	update_msi_destid(hld_data);
 
+	setup_cpu_groups(hld_data);
+
 	update_ipi_target_cpumask(hld_data);
 
 	enable_timer(hld_data);
-- 
2.17.1



* [RFC PATCH v5 10/16] watchdog/hardlockup/hpet: Adjust timer expiration on the number of monitored groups
  2021-05-04 19:05 [RFC PATCH v5 00/16] x86: Implement an HPET-based hardlockup detector Ricardo Neri
                   ` (8 preceding siblings ...)
  2021-05-04 19:05 ` [RFC PATCH v5 09/16] watchdog/hardlockup/hpet: Group packages receiving IPIs when needed Ricardo Neri
@ 2021-05-04 19:05 ` Ricardo Neri
  2021-05-04 19:05 ` [RFC PATCH v5 11/16] x86/watchdog/hardlockup/hpet: Determine if HPET timer caused NMI Ricardo Neri
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 18+ messages in thread
From: Ricardo Neri @ 2021-05-04 19:05 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: H. Peter Anvin, Ashok Raj, Andi Kleen, Tony Luck,
	Nicholas Piggin, Peter Zijlstra (Intel),
	Andrew Morton, Stephane Eranian, Suravee Suthikulpanit,
	Ravi V. Shankar, Ricardo Neri, x86, linux-kernel, Ricardo Neri,
	Andi Kleen

The CPUs monitored by the HPET hardlockup detector are partitioned into
groups based on their location in a given package. The HPET timer only
targets a single CPU per package (or group of packages) at expiration. The
rest of the CPUs in the group are monitored via an IPI issued by the
handling CPU.

Each monitored CPU must be checked for hardlockups every watchdog_thresh
seconds. This also means that each group of CPUs must be monitored in the
same interval. Therefore, the HPET timer expiration is determined by
watchdog_thresh divided by the number of groups to monitor.

Add a new member, hpet_hld_data::ticks_per_group, that represents the number
of times the HPET timer must tick before interrupting the handling CPU.
Derive this value from the frequency of the HPET timer and the number of
groups of CPUs.

Furthermore, update the timer expiration whenever there is a change
in the number of monitored CPUs. Namely, when enabling or disabling the
hardlockup detector in a given CPU.
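
The arithmetic can be sketched in userspace as follows. The helper names
hpet_ticks_to_expiry() and next_comparator() are hypothetical; the sketch
only assumes, as in the diff below, that the per-group tick count is the
HPET frequency divided by the number of groups and that the comparator is
32 bits wide and allowed to wrap.

```c
#include <assert.h>
#include <stdint.h>

/* HPET ticks between expirations: thresh_secs * (ticks_per_second / nr_groups). */
static uint64_t hpet_ticks_to_expiry(uint64_t ticks_per_second,
				     uint32_t nr_groups,
				     uint32_t thresh_secs)
{
	/* Assumption of this sketch: there is at least one group. */
	assert(nr_groups > 0);

	return (uint64_t)thresh_secs * (ticks_per_second / nr_groups);
}

/* New 32-bit comparator value; the addition is allowed to wrap around. */
static uint32_t next_comparator(uint32_t count, uint64_t delta)
{
	return (uint32_t)(count + delta);
}
```

Assuming a 14318180 Hz HPET, two groups and a threshold of 10 seconds, the
timer would expire every 71590900 ticks, that is, every 5 seconds, so each
group is still visited once per 10 seconds.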

Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: x86@kernel.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v4:
 * Reworked computation for the number of package groups instead of the
   number of CPUs.
 * Renamed local variable temp as ticks in update_ticks_per_group().
   (Andi)

Changes since v3:
 * None

Changes since v2:
 * Since the round-robin mechanism to set the affinity of the HPET timer
   interrupt is in use again, it also becomes necessary to adjust
   the timer expiration.

Changes since v1:
 * Dropped this patch as there was no need to readjust the timer
   expiration when the HPET timer only targets a single CPU.
---
 arch/x86/include/asm/hpet.h         |  3 ++
 arch/x86/kernel/watchdog_hld_hpet.c | 47 +++++++++++++++++++++++++++--
 2 files changed, 48 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index bb76f54effe4..738fcf256b14 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -102,6 +102,8 @@ extern void hpet_unregister_irq_handler(rtc_irq_handler handler);
  * @has_periodic:		The HPET channel supports periodic mode
  * @channel:			HPET channel assigned to the detector
  * @ticks_per_second:		Frequency of the HPET timer
+ * @ticks_per_group:		HPET ticks per group that must elapse before
+ *				the timer expires
  * @irq:			IRQ number assigned to the HPET channel
  * @handling_cpu:		CPU handling the HPET interrupt
  * @pkgs_per_group:		Number of physical packages in a group of CPUs
@@ -123,6 +125,7 @@ struct hpet_hld_data {
 	bool		has_periodic;
 	u32		channel;
 	u64		ticks_per_second;
+	u64		ticks_per_group;
 	int		irq;
 	u32		handling_cpu;
 	u32		pkgs_per_group;
diff --git a/arch/x86/kernel/watchdog_hld_hpet.c b/arch/x86/kernel/watchdog_hld_hpet.c
index 04b354a35e68..bf3ee354907f 100644
--- a/arch/x86/kernel/watchdog_hld_hpet.c
+++ b/arch/x86/kernel/watchdog_hld_hpet.c
@@ -48,18 +48,26 @@ static void kick_timer(struct hpet_hld_data *hdata, bool force)
 	 * are able to update the comparator before the counter reaches such new
 	 * value.
 	 *
+	 * Each CPU must be monitored every watchdog_thresh seconds. In order to
+	 * keep the HPET channel interrupt under 1 per second, CPUs are targeted
+	 * by groups. Each group is targeted separately, every
+	 *
+	 *   watchdog_thresh * ticks_per_group ticks
+	 *
+	 * with ticks_per_group as computed in update_ticks_per_group().
+	 *
 	 * Let it wrap around if needed.
 	 */
 
 	count = hpet_readl(HPET_COUNTER);
-	new_compare = count + watchdog_thresh * hdata->ticks_per_second;
+	new_compare = count + watchdog_thresh * hdata->ticks_per_group;
 
 	if (!hdata->has_periodic) {
 		hpet_writel(new_compare, HPET_Tn_CMP(hdata->channel));
 		return;
 	}
 
-	period = watchdog_thresh * hdata->ticks_per_second;
+	period = watchdog_thresh * hdata->ticks_per_group;
 	hpet_set_comparator_periodic(hdata->channel, (u32)new_compare,
 				     (u32)period);
 }
@@ -411,6 +419,27 @@ static int setup_hpet_irq(struct hpet_hld_data *hdata)
 	return ret;
 }
 
+/**
+ * update_ticks_per_group() - Update the number of HPET ticks per CPU group
+ * @hdata:     struct with the timer's ticks-per-second and CPU mask
+ *
+ * From the overall ticks-per-second of the timer, compute the number of ticks
+ * after which the timer should expire to monitor each CPU every
+ * watchdog_thresh seconds. The monitored CPUs have been partitioned into
+ * groups, and the HPET channel targets one group at a time.
+ */
+static void update_ticks_per_group(struct hpet_hld_data *hdata)
+{
+	u64 ticks = hdata->ticks_per_second;
+
+	/* Only update if there are CPUs to monitor. */
+	if (!hdata->nr_groups)
+		return;
+
+	do_div(ticks, hdata->nr_groups);
+	hdata->ticks_per_group = ticks;
+}
+
 /**
  * hardlockup_detector_hpet_enable() - Enable the hardlockup detector
  * @cpu:	CPU Index in which the watchdog will be enabled.
@@ -424,6 +453,7 @@ void hardlockup_detector_hpet_enable(unsigned int cpu)
 	cpumask_set_cpu(cpu, hld_data->monitored_cpumask);
 
 	setup_cpu_groups(hld_data);
+	update_ticks_per_group(hld_data);
 
 	update_ipi_target_cpumask(hld_data);
 
@@ -436,7 +466,14 @@ void hardlockup_detector_hpet_enable(unsigned int cpu)
 		update_msi_destid(hld_data);
 		kick_timer(hld_data, true);
 		enable_timer(hld_data);
+		return;
 	}
+
+	/*
+	 * Kick timer in case the number of monitored CPUs requires a change in
+	 * the timer period.
+	 */
+	kick_timer(hld_data, hld_data->has_periodic);
 }
 
 /**
@@ -469,9 +506,15 @@ void hardlockup_detector_hpet_disable(unsigned int cpu)
 	update_msi_destid(hld_data);
 
 	setup_cpu_groups(hld_data);
+	update_ticks_per_group(hld_data);
 
 	update_ipi_target_cpumask(hld_data);
 
+	/*
+	 * Kick timer in case the number of monitored CPUs requires a change in
+	 * the timer period.
+	 */
+	kick_timer(hld_data, hld_data->has_periodic);
 	enable_timer(hld_data);
 }
 
-- 
2.17.1



* [RFC PATCH v5 11/16] x86/watchdog/hardlockup/hpet: Determine if HPET timer caused NMI
  2021-05-04 19:05 [RFC PATCH v5 00/16] x86: Implement an HPET-based hardlockup detector Ricardo Neri
                   ` (9 preceding siblings ...)
  2021-05-04 19:05 ` [RFC PATCH v5 10/16] watchdog/hardlockup/hpet: Adjust timer expiration on the number of monitored groups Ricardo Neri
@ 2021-05-04 19:05 ` Ricardo Neri
  2021-05-04 19:05 ` [RFC PATCH v5 12/16] watchdog/hardlockup: Use parse_option_str() to handle "nmi_watchdog" Ricardo Neri
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 18+ messages in thread
From: Ricardo Neri @ 2021-05-04 19:05 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: H. Peter Anvin, Ashok Raj, Andi Kleen, Tony Luck,
	Nicholas Piggin, Peter Zijlstra (Intel),
	Andrew Morton, Stephane Eranian, Suravee Suthikulpanit,
	Ravi V. Shankar, Ricardo Neri, x86, linux-kernel, Ricardo Neri,
	Andi Kleen

The only direct method to determine whether an HPET timer caused an
interrupt is to read the Interrupt Status register. Unfortunately,
reading HPET registers is slow and, therefore, it is not recommended to
read them while in NMI context. Furthermore, status is not available if
the interrupt is generated via the Front Side Bus.

An indirect manner to infer if a given non-maskable interrupt was caused
by the HPET timer is to use the time-stamp counter. Compute the value that
the time-stamp counter should have at the next interrupt of the HPET timer.
Since the hardlockup detector operates in seconds, high precision is not
needed. This implementation considers that the HPET caused the NMI if the
time-stamp counter reads the expected value +/- 1.5%. This margin is
selected because it is approximately 1/64 (about 1.56%) of the interval, so
the division can be performed using a bit shift operation. Experimentally,
the error in the estimation is consistently less than 1%.

The computation of the expected value of the time-stamp counter must be
performed in relation to watchdog_thresh divided by the number of groups
of packages with monitored CPUs. This quantity is stored in
tsc_ticks_per_group and must be updated whenever the number of monitored
CPUs changes. Namely, when enabling or disabling the hardlockup detector
on a given CPU.
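
The window test can be written as a single unsigned comparison, which is
the form used in the diff below. The following userspace sketch shows why
it works; tsc_in_window() is a hypothetical name and the TSC readings are
passed in as plain integers instead of being read with rdtsc().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if @tsc_now lies in [tsc_next - err, tsc_next + err), where
 * err is tsc_delta / 64. If @tsc_now is below the window, the unsigned
 * subtraction wraps to a huge value and the comparison fails.
 */
static bool tsc_in_window(uint64_t tsc_now, uint64_t tsc_next,
			  uint64_t tsc_delta)
{
	uint64_t err = tsc_delta >> 6;

	return (tsc_now - tsc_next) + err < 2 * err;
}
```

With tsc_delta = 6400 the margin is 100 ticks, so an expected value of
10000 accepts readings from 9900 up to and including 10099.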

Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: x86@kernel.org
Suggested-by: Andi Kleen <andi.kleen@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v4:
 * Compute the TSC expected value at the next HPET interrupt based on the
   number of monitored packages and not the number of monitored CPUs.

Changes since v3:
 * None

Changes since v2:
 * Reworked condition to check if the expected TSC value is within the
   error margin to avoid an unnecessary conditional. (Peter Zijlstra)
 * Removed TSC error margin from struct hld_data; use a global variable
   instead. (Peter Zijlstra)

Changes since v1:
 * Introduced this patch.
---
 arch/x86/include/asm/hpet.h         |  6 ++++++
 arch/x86/kernel/watchdog_hld_hpet.c | 27 ++++++++++++++++++++++++++-
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index 738fcf256b14..1ff7436c1ce6 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -104,6 +104,10 @@ extern void hpet_unregister_irq_handler(rtc_irq_handler handler);
  * @ticks_per_second:		Frequency of the HPET timer
  * @ticks_per_group:		HPET ticks per group that must elapse before
  *				the timer expires
+ * @tsc_next:			Estimated value of the TSC at the next
+ *				HPET timer interrupt
+ * @tsc_ticks_per_group:	TSC ticks that must elapse for each group of
+ *				monitored CPUs.
  * @irq:			IRQ number assigned to the HPET channel
  * @handling_cpu:		CPU handling the HPET interrupt
  * @pkgs_per_group:		Number of physical packages in a group of CPUs
@@ -126,6 +130,8 @@ struct hpet_hld_data {
 	u32		channel;
 	u64		ticks_per_second;
 	u64		ticks_per_group;
+	u64		tsc_next;
+	u64		tsc_ticks_per_group;
 	int		irq;
 	u32		handling_cpu;
 	u32		pkgs_per_group;
diff --git a/arch/x86/kernel/watchdog_hld_hpet.c b/arch/x86/kernel/watchdog_hld_hpet.c
index bf3ee354907f..cd5f59b7c01b 100644
--- a/arch/x86/kernel/watchdog_hld_hpet.c
+++ b/arch/x86/kernel/watchdog_hld_hpet.c
@@ -24,6 +24,7 @@
 
 static struct hpet_hld_data *hld_data;
 static bool hardlockup_use_hpet;
+static u64 tsc_next_error;
 
 /**
  * kick_timer() - Reprogram timer to expire in the future
@@ -33,10 +34,21 @@ static bool hardlockup_use_hpet;
  * Reprogram the timer to expire within watchdog_thresh seconds in the future.
  * If the timer supports periodic mode, it is not kicked unless @force is
  * true.
+ *
+ * Also, compute the expected value of the time-stamp counter at the time of
+ * expiration as well as a deviation from the expected value. The maximum
+ * deviation is ~1.5%. This deviation can be easily computed by shifting
+ * by 6 positions the delta between the current and expected time-stamp values.
  */
 static void kick_timer(struct hpet_hld_data *hdata, bool force)
 {
-	u64 new_compare, count, period = 0;
+	u64 tsc_curr, tsc_delta, new_compare, count, period = 0;
+
+	tsc_curr = rdtsc();
+
+	tsc_delta = (unsigned long)watchdog_thresh * hdata->tsc_ticks_per_group;
+	hdata->tsc_next = tsc_curr + tsc_delta;
+	tsc_next_error = tsc_delta >> 6;
 
 	/* kick the timer only when needed */
 	if (!force && hdata->has_periodic)
@@ -113,6 +125,15 @@ static void enable_timer(struct hpet_hld_data *hdata)
  */
 static bool is_hpet_wdt_interrupt(struct hpet_hld_data *hdata)
 {
+	if (smp_processor_id() == hdata->handling_cpu) {
+		u64 tsc_curr;
+
+		tsc_curr = rdtsc();
+
+		return (tsc_curr - hdata->tsc_next) + tsc_next_error <
+		       2 * tsc_next_error;
+	}
+
 	return false;
 }
 
@@ -438,6 +459,10 @@ static void update_ticks_per_group(struct hpet_hld_data *hdata)
 
 	do_div(ticks, hdata->nr_groups);
 	hdata->ticks_per_group = ticks;
+
+	ticks = (unsigned long)tsc_khz * 1000L;
+	do_div(ticks, hdata->nr_groups);
+	hdata->tsc_ticks_per_group = ticks;
 }
 
 /**
-- 
2.17.1



* [RFC PATCH v5 12/16] watchdog/hardlockup: Use parse_option_str() to handle "nmi_watchdog"
  2021-05-04 19:05 [RFC PATCH v5 00/16] x86: Implement an HPET-based hardlockup detector Ricardo Neri
                   ` (10 preceding siblings ...)
  2021-05-04 19:05 ` [RFC PATCH v5 11/16] x86/watchdog/hardlockup/hpet: Determine if HPET timer caused NMI Ricardo Neri
@ 2021-05-04 19:05 ` Ricardo Neri
  2021-05-04 19:05 ` [RFC PATCH v5 13/16] watchdog/hardlockup/hpet: Only enable the HPET watchdog via a boot parameter Ricardo Neri
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 18+ messages in thread
From: Ricardo Neri @ 2021-05-04 19:05 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: H. Peter Anvin, Ashok Raj, Andi Kleen, Tony Luck,
	Nicholas Piggin, Peter Zijlstra (Intel),
	Andrew Morton, Stephane Eranian, Suravee Suthikulpanit,
	Ravi V. Shankar, Ricardo Neri, x86, linux-kernel, Ricardo Neri,
	Andi Kleen, linuxppc-dev

Prepare hardlockup_panic_setup() to handle a comma-separated list of
options. Thus, it can continue parsing its own command-line options while
ignoring parameters that are relevant only to specific implementations of
the hardlockup detector. Such implementations may use an early_param to
parse their own options.
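
To illustrate the difference from the strncmp() prefix checks, here is a
userspace sketch of whole-token matching in a comma-separated list. It is
modeled on my reading of the kernel's parse_option_str() and is not a copy
of it; the name option_in_list() is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Return true if @option appears as a complete token in the
 * comma-separated list @str. Hypothetical userspace approximation.
 */
static bool option_in_list(const char *str, const char *option)
{
	size_t len = strlen(option);

	assert(len > 0);

	while (*str) {
		/* A match must end at a comma or at the end of the string. */
		if (!strncmp(str, option, len) &&
		    (str[len] == '\0' || str[len] == ','))
			return true;

		/* Skip the rest of this token and the comma after it. */
		while (*str && *str != ',')
			str++;
		if (*str == ',')
			str++;
	}

	return false;
}
```

With this behavior a string such as "panic,hpet" matches both "panic" and
"hpet" regardless of position, whereas strncmp(str, "panic", 5) only
matches when "panic" is at the very start of the string.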

Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: x86@kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v4:
 * None

Changes since v3:
 * None

Changes since v2:
 * Introduced this patch.

Changes since v1:
 * None
---
 kernel/watchdog.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 107bc38b1945..4615064ee282 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -73,13 +73,13 @@ void __init hardlockup_detector_disable(void)
 
 static int __init hardlockup_panic_setup(char *str)
 {
-	if (!strncmp(str, "panic", 5))
+	if (parse_option_str(str, "panic"))
 		hardlockup_panic = 1;
-	else if (!strncmp(str, "nopanic", 7))
+	else if (parse_option_str(str, "nopanic"))
 		hardlockup_panic = 0;
-	else if (!strncmp(str, "0", 1))
+	else if (parse_option_str(str, "0"))
 		nmi_watchdog_user_enabled = 0;
-	else if (!strncmp(str, "1", 1))
+	else if (parse_option_str(str, "1"))
 		nmi_watchdog_user_enabled = 1;
 	return 1;
 }
-- 
2.17.1



* [RFC PATCH v5 13/16] watchdog/hardlockup/hpet: Only enable the HPET watchdog via a boot parameter
  2021-05-04 19:05 [RFC PATCH v5 00/16] x86: Implement an HPET-based hardlockup detector Ricardo Neri
                   ` (11 preceding siblings ...)
  2021-05-04 19:05 ` [RFC PATCH v5 12/16] watchdog/hardlockup: Use parse_option_str() to handle "nmi_watchdog" Ricardo Neri
@ 2021-05-04 19:05 ` Ricardo Neri
  2021-05-04 19:05 ` [RFC PATCH v5 14/16] x86/watchdog: Add a shim hardlockup detector Ricardo Neri
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 18+ messages in thread
From: Ricardo Neri @ 2021-05-04 19:05 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: H. Peter Anvin, Ashok Raj, Andi Kleen, Tony Luck,
	Nicholas Piggin, Peter Zijlstra (Intel),
	Andrew Morton, Stephane Eranian, Suravee Suthikulpanit,
	Ravi V. Shankar, Ricardo Neri, x86, linux-kernel, Ricardo Neri,
	Andi Kleen

Keep the HPET-based hardlockup detector disabled unless explicitly enabled
via a command-line argument. If this parameter is not given, the
initialization of the HPET-based hardlockup detector fails and the NMI
watchdog falls back to the perf-based implementation.

Implement the command-line parsing using an early_param, as
__setup("nmi_watchdog=") only parses generic options.

Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: x86@kernel.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v4:
 * None

Changes since v3:
 * None

Changes since v2:
 * Do not imply that using nmi_watchdog=hpet means the detector is
   enabled. Instead, print a warning in such case.

Changes since v1:
 * Added documentation to the function handling the nmi_watchdog
   kernel command-line argument.
---
 .../admin-guide/kernel-parameters.txt         |  8 ++++++-
 arch/x86/kernel/watchdog_hld_hpet.c           | 22 +++++++++++++++++++
 2 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 74db44ce4d9a..eafa38867270 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3136,7 +3136,7 @@
 			Format: [state][,regs][,debounce][,die]
 
 	nmi_watchdog=	[KNL,BUGS=X86] Debugging features for SMP kernels
-			Format: [panic,][nopanic,][num]
+			Format: [panic,][nopanic,][num,][hpet]
 			Valid num: 0 or 1
 			0 - turn hardlockup detector in nmi_watchdog off
 			1 - turn hardlockup detector in nmi_watchdog on
@@ -3147,6 +3147,12 @@
 			please see 'nowatchdog'.
 			This is useful when you use a panic=... timeout and
 			need the box quickly up again.
+			When hpet is specified, the NMI watchdog will be driven
+			by an HPET timer, if available in the system. Otherwise,
+			it falls back to the default implementation (perf or
+			architecture-specific). Specifying hpet has no effect
+			if the NMI watchdog is not enabled (either at build time
+			or via the command line).
 
 			These settings can be accessed at runtime via
 			the nmi_watchdog and hardlockup_panic sysctls.
diff --git a/arch/x86/kernel/watchdog_hld_hpet.c b/arch/x86/kernel/watchdog_hld_hpet.c
index cd5f59b7c01b..3fd2405b31fa 100644
--- a/arch/x86/kernel/watchdog_hld_hpet.c
+++ b/arch/x86/kernel/watchdog_hld_hpet.c
@@ -548,6 +548,28 @@ void hardlockup_detector_hpet_stop(void)
 	disable_timer(hld_data);
 }
 
+/**
+ * hardlockup_detector_hpet_setup() - Parse command-line parameters
+ * @str:	A string containing the kernel command line
+ *
+ * Parse the nmi_watchdog parameter from the kernel command line. If
+ * selected by the user, use this implementation to detect hardlockups.
+ */
+static int __init hardlockup_detector_hpet_setup(char *str)
+{
+	if (!str)
+		return -EINVAL;
+
+	if (parse_option_str(str, "hpet"))
+		hardlockup_use_hpet = true;
+
+	if (!nmi_watchdog_user_enabled && hardlockup_use_hpet)
+		pr_err("Selecting HPET NMI watchdog has no effect with NMI watchdog disabled\n");
+
+	return 0;
+}
+early_param("nmi_watchdog", hardlockup_detector_hpet_setup);
+
 /**
  * hardlockup_detector_hpet_init() - Initialize the hardlockup detector
  *
-- 
2.17.1



* [RFC PATCH v5 14/16] x86/watchdog: Add a shim hardlockup detector
  2021-05-04 19:05 [RFC PATCH v5 00/16] x86: Implement an HPET-based hardlockup detector Ricardo Neri
                   ` (12 preceding siblings ...)
  2021-05-04 19:05 ` [RFC PATCH v5 13/16] watchdog/hardlockup/hpet: Only enable the HPET watchdog via a boot parameter Ricardo Neri
@ 2021-05-04 19:05 ` Ricardo Neri
  2021-05-04 19:05 ` [RFC PATCH v5 15/16] watchdog: Expose lockup_detector_reconfigure() Ricardo Neri
  2021-05-04 19:05 ` [RFC PATCH v5 16/16] x86/tsc: Switch to perf-based hardlockup detector if TSC become unstable Ricardo Neri
  15 siblings, 0 replies; 18+ messages in thread
From: Ricardo Neri @ 2021-05-04 19:05 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: H. Peter Anvin, Ashok Raj, Andi Kleen, Tony Luck,
	Nicholas Piggin, Peter Zijlstra (Intel),
	Andrew Morton, Stephane Eranian, Suravee Suthikulpanit,
	Ravi V. Shankar, Ricardo Neri, x86, linux-kernel, Ricardo Neri,
	Andi Kleen

The generic hardlockup detector is based on perf. It also provides a set
of weak stubs that CPU architectures can override. Add a shim hardlockup
detector for x86 that selects between perf and hpet implementations.

Specifically, this shim implementation is needed for the HPET-based
hardlockup detector; it can also be used for future implementations.
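
The selection order can be sketched in userspace as follows. This is only
a model of the probe logic under stated assumptions: struct probe_env,
the *_init() stand-ins, and the DETECTOR_NONE value are hypothetical and
do not exist in the patch, whose enum has only the perf and HPET entries.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum detector { DETECTOR_NONE, DETECTOR_PERF, DETECTOR_HPET };

/* Hypothetical description of what the system offers. */
struct probe_env {
	bool hpet_built_in;	/* HPET detector selected at build time */
	bool hpet_requested;	/* "hpet" given in nmi_watchdog= */
	bool perf_available;	/* perf-based detector can initialize */
};

/* Stand-in for the HPET init: succeeds only if built in and requested. */
static int hpet_init(const struct probe_env *env)
{
	return (env->hpet_built_in && env->hpet_requested) ? 0 : -ENODEV;
}

/* Stand-in for the perf init. */
static int perf_init(const struct probe_env *env)
{
	return env->perf_available ? 0 : -ENODEV;
}

/* Try HPET first, then fall back to perf; report which one was chosen. */
static int probe(const struct probe_env *env, enum detector *type)
{
	int ret;

	assert(env && type);
	*type = DETECTOR_NONE;

	ret = hpet_init(env);
	if (!ret) {
		*type = DETECTOR_HPET;
		return 0;
	}

	ret = perf_init(env);
	if (!ret)
		*type = DETECTOR_PERF;

	return ret;
}
```

Because the HPET stand-in fails unless it was explicitly requested, the
perf-based detector remains the default in this model, which is the
behavior the shim aims for.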

Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: x86@kernel.org
Suggested-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v4:
 * Use a switch to enable and disable the various available detectors.
   (Andi)

Changes since v3:
 * Fixed style in multi-line comment. (Randy Dunlap)

Changes since v2:
 * Pass cpu number as argument to hardlockup_detector_[enable|disable].
   (Thomas Gleixner)

Changes since v1:
 * Introduced this patch: Added an x86-specific shim hardlockup
   detector. (Nicholas Piggin)
---
 arch/x86/Kconfig.debug         |  4 ++
 arch/x86/kernel/Makefile       |  1 +
 arch/x86/kernel/watchdog_hld.c | 80 ++++++++++++++++++++++++++++++++++
 3 files changed, 85 insertions(+)
 create mode 100644 arch/x86/kernel/watchdog_hld.c

diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index 0731da557a6d..4cdac3f80e80 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -117,9 +117,13 @@ config IOMMU_LEAK
 config HAVE_MMIOTRACE_SUPPORT
 	def_bool y
 
+config X86_HARDLOCKUP_DETECTOR
+	bool
+
 config X86_HARDLOCKUP_DETECTOR_HPET
 	bool "HPET Timer for Hard Lockup Detection"
 	select HARDLOCKUP_DETECTOR_CORE
+	select X86_HARDLOCKUP_DETECTOR
 	depends on HARDLOCKUP_DETECTOR && HPET_TIMER && HPET && (X86_64 || X86_32)
 	help
 	  The hardlockup detector is driven by one counter of the Performance
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 5d1a90b23577..3303f10b3d0a 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -115,6 +115,7 @@ obj-$(CONFIG_VM86)		+= vm86_32.o
 obj-$(CONFIG_EARLY_PRINTK)	+= early_printk.o
 
 obj-$(CONFIG_HPET_TIMER) 	+= hpet.o
+obj-$(CONFIG_X86_HARDLOCKUP_DETECTOR) += watchdog_hld.o
 obj-$(CONFIG_X86_HARDLOCKUP_DETECTOR_HPET) += watchdog_hld_hpet.o
 
 obj-$(CONFIG_AMD_NB)		+= amd_nb.o
diff --git a/arch/x86/kernel/watchdog_hld.c b/arch/x86/kernel/watchdog_hld.c
new file mode 100644
index 000000000000..8947a7644421
--- /dev/null
+++ b/arch/x86/kernel/watchdog_hld.c
@@ -0,0 +1,80 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * A shim hardlockup detector. It overrides the weak stubs of the generic
+ * implementation to select between the perf- or the hpet-based implementation.
+ *
+ * Copyright (C) Intel Corporation 2021
+ */
+
+#include <linux/nmi.h>
+#include <asm/hpet.h>
+
+enum x86_hardlockup_detector {
+	X86_HARDLOCKUP_DETECTOR_PERF,
+	X86_HARDLOCKUP_DETECTOR_HPET,
+};
+
+static enum __read_mostly x86_hardlockup_detector detector_type;
+
+int watchdog_nmi_enable(unsigned int cpu)
+{
+	int ret = 0;
+
+	switch (detector_type) {
+	case X86_HARDLOCKUP_DETECTOR_PERF:
+		hardlockup_detector_perf_enable();
+		break;
+	case X86_HARDLOCKUP_DETECTOR_HPET:
+		hardlockup_detector_hpet_enable(cpu);
+		break;
+	default:
+		ret = -ENODEV;
+	}
+
+	return ret;
+}
+
+void watchdog_nmi_disable(unsigned int cpu)
+{
+	switch (detector_type) {
+	case X86_HARDLOCKUP_DETECTOR_PERF:
+		hardlockup_detector_perf_disable();
+		break;
+	case X86_HARDLOCKUP_DETECTOR_HPET:
+		hardlockup_detector_hpet_disable(cpu);
+		break;
+	}
+}
+
+int __init watchdog_nmi_probe(void)
+{
+	int ret;
+
+	/*
+	 * Try first with the HPET hardlockup detector. It will only
+	 * succeed if selected at build time and the nmi_watchdog
+	 * command-line parameter is configured. This ensures that the
+	 * perf-based detector is used by default, if selected at
+	 * build time.
+	 */
+	ret = hardlockup_detector_hpet_init();
+	if (!ret) {
+		detector_type = X86_HARDLOCKUP_DETECTOR_HPET;
+		return ret;
+	}
+
+	ret = hardlockup_detector_perf_init();
+	if (!ret) {
+		detector_type = X86_HARDLOCKUP_DETECTOR_PERF;
+		return ret;
+	}
+
+	return ret;
+}
+
+void watchdog_nmi_stop(void)
+{
+	/* Only the HPET lockup detector defines a stop function. */
+	if (detector_type == X86_HARDLOCKUP_DETECTOR_HPET)
+		hardlockup_detector_hpet_stop();
+}
-- 
2.17.1



* [RFC PATCH v5 15/16] watchdog: Expose lockup_detector_reconfigure()
  2021-05-04 19:05 [RFC PATCH v5 00/16] x86: Implement an HPET-based hardlockup detector Ricardo Neri
                   ` (13 preceding siblings ...)
  2021-05-04 19:05 ` [RFC PATCH v5 14/16] x86/watchdog: Add a shim hardlockup detector Ricardo Neri
@ 2021-05-04 19:05 ` Ricardo Neri
  2021-05-04 19:05 ` [RFC PATCH v5 16/16] x86/tsc: Switch to perf-based hardlockup detector if TSC become unstable Ricardo Neri
  15 siblings, 0 replies; 18+ messages in thread
From: Ricardo Neri @ 2021-05-04 19:05 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: H. Peter Anvin, Ashok Raj, Andi Kleen, Tony Luck,
	Nicholas Piggin, Peter Zijlstra (Intel),
	Andrew Morton, Stephane Eranian, Suravee Suthikulpanit,
	Ravi V. Shankar, Ricardo Neri, x86, linux-kernel, Ricardo Neri,
	Andi Kleen, linuxppc-dev

When there is more than one implementation of the NMI watchdog, there may
be situations in which switching from one to another is needed. For
example, if the time-stamp counter becomes unstable, the HPET-based NMI
watchdog can no longer be used.

Switching to another hardlockup detector can be done cleanly by updating
the arch-specific stub and then reconfiguring the whole lockup detector.
Expose lockup_detector_reconfigure() to achieve this goal.

Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: x86@kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v4:
 * Switching to the perf-based lockup detector under the hood is hacky.
   Instead, reconfigure the whole lockup detector.

Changes since v3:
 * None

Changes since v2:
 * Introduced this patch.

Changes since v1:
 * N/A
---
 include/linux/nmi.h | 2 ++
 kernel/watchdog.c   | 4 ++--
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index cf12380e51b3..73827a477288 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -16,6 +16,7 @@ void lockup_detector_init(void);
 void lockup_detector_soft_poweroff(void);
 void lockup_detector_cleanup(void);
 bool is_hardlockup(void);
+void lockup_detector_reconfigure(void);
 
 extern int watchdog_user_enabled;
 extern int nmi_watchdog_user_enabled;
@@ -37,6 +38,7 @@ extern int sysctl_hardlockup_all_cpu_backtrace;
 static inline void lockup_detector_init(void) { }
 static inline void lockup_detector_soft_poweroff(void) { }
 static inline void lockup_detector_cleanup(void) { }
+static inline void lockup_detector_reconfigure(void) { }
 #endif /* !CONFIG_LOCKUP_DETECTOR */
 
 #ifdef CONFIG_SOFTLOCKUP_DETECTOR
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 4615064ee282..96f06938dc83 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -531,7 +531,7 @@ int lockup_detector_offline_cpu(unsigned int cpu)
 	return 0;
 }
 
-static void lockup_detector_reconfigure(void)
+void lockup_detector_reconfigure(void)
 {
 	cpus_read_lock();
 	watchdog_nmi_stop();
@@ -577,7 +577,7 @@ static __init void lockup_detector_setup(void)
 }
 
 #else /* CONFIG_SOFTLOCKUP_DETECTOR */
-static void lockup_detector_reconfigure(void)
+void lockup_detector_reconfigure(void)
 {
 	cpus_read_lock();
 	watchdog_nmi_stop();
-- 
2.17.1



* [RFC PATCH v5 16/16] x86/tsc: Switch to perf-based hardlockup detector if TSC become unstable
  2021-05-04 19:05 [RFC PATCH v5 00/16] x86: Implement an HPET-based hardlockup detector Ricardo Neri
                   ` (14 preceding siblings ...)
  2021-05-04 19:05 ` [RFC PATCH v5 15/16] watchdog: Expose lockup_detector_reconfigure() Ricardo Neri
@ 2021-05-04 19:05 ` Ricardo Neri
  15 siblings, 0 replies; 18+ messages in thread
From: Ricardo Neri @ 2021-05-04 19:05 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: H. Peter Anvin, Ashok Raj, Andi Kleen, Tony Luck,
	Nicholas Piggin, Peter Zijlstra (Intel),
	Andrew Morton, Stephane Eranian, Suravee Suthikulpanit,
	Ravi V. Shankar, Ricardo Neri, x86, linux-kernel, Ricardo Neri,
	Andi Kleen

The HPET-based hardlockup detector relies on the TSC to determine whether
an observed NMI originated from the HPET timer. Hence, this detector
can no longer be used with an unstable TSC.

In such a case, permanently stop the HPET-based hardlockup detector and
start the perf-based detector.

Add stub versions of hardlockup_detector_switch_to_perf() to be used when
the HPET hardlockup detector and/or CONFIG_HPET_TIMER is not
selected.

Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: x86@kernel.org
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v4:
 * Added a stub version of hardlockup_detector_switch_to_perf() for
   !CONFIG_HPET_TIMER. (lkp)
 * Reconfigure the whole lockup detector instead of unconditionally
   starting the perf-based hardlockup detector.

Changes since v3:
 * None

Changes since v2:
 * Introduced this patch.

Changes since v1:
 * N/A
---
 arch/x86/include/asm/hpet.h    | 3 +++
 arch/x86/kernel/tsc.c          | 2 ++
 arch/x86/kernel/watchdog_hld.c | 6 ++++++
 3 files changed, 11 insertions(+)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index 1ff7436c1ce6..df11c7d4af44 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -148,18 +148,21 @@ extern int hardlockup_detector_hpet_init(void);
 extern void hardlockup_detector_hpet_stop(void);
 extern void hardlockup_detector_hpet_enable(unsigned int cpu);
 extern void hardlockup_detector_hpet_disable(unsigned int cpu);
+extern void hardlockup_detector_switch_to_perf(void);
 #else
 static inline int hardlockup_detector_hpet_init(void)
 { return -ENODEV; }
 static inline void hardlockup_detector_hpet_stop(void) {}
 static inline void hardlockup_detector_hpet_enable(unsigned int cpu) {}
 static inline void hardlockup_detector_hpet_disable(unsigned int cpu) {}
+static inline void hardlockup_detector_switch_to_perf(void) {}
 #endif /* CONFIG_X86_HARDLOCKUP_DETECTOR_HPET */
 
 #else /* CONFIG_HPET_TIMER */
 
 static inline int hpet_enable(void) { return 0; }
 static inline int is_hpet_enabled(void) { return 0; }
+static inline void hardlockup_detector_switch_to_perf(void) {}
 #define hpet_readl(a) 0
 #define default_setup_hpet_msi	NULL
 
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 57ec01192180..86ac13a83884 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1174,6 +1174,8 @@ void mark_tsc_unstable(char *reason)
 
 	clocksource_mark_unstable(&clocksource_tsc_early);
 	clocksource_mark_unstable(&clocksource_tsc);
+
+	hardlockup_detector_switch_to_perf();
 }
 
 EXPORT_SYMBOL_GPL(mark_tsc_unstable);
diff --git a/arch/x86/kernel/watchdog_hld.c b/arch/x86/kernel/watchdog_hld.c
index 8947a7644421..a4415ad4aa85 100644
--- a/arch/x86/kernel/watchdog_hld.c
+++ b/arch/x86/kernel/watchdog_hld.c
@@ -78,3 +78,9 @@ void watchdog_nmi_stop(void)
 	if (detector_type == X86_HARDLOCKUP_DETECTOR_HPET)
 		hardlockup_detector_hpet_stop();
 }
+
+void hardlockup_detector_switch_to_perf(void)
+{
+	detector_type = X86_HARDLOCKUP_DETECTOR_PERF;
+	lockup_detector_reconfigure();
+}
-- 
2.17.1



* Re: [RFC PATCH v5 07/16] x86/watchdog/hardlockup: Add an HPET-based hardlockup detector
  2021-05-04 19:05 ` [RFC PATCH v5 07/16] x86/watchdog/hardlockup: Add an HPET-based hardlockup detector Ricardo Neri
@ 2021-05-04 20:53   ` Thomas Gleixner
  0 siblings, 0 replies; 18+ messages in thread
From: Thomas Gleixner @ 2021-05-04 20:53 UTC (permalink / raw)
  To: Ricardo Neri, Ingo Molnar, Borislav Petkov
  Cc: H. Peter Anvin, Ashok Raj, Andi Kleen, Tony Luck,
	Nicholas Piggin, Peter Zijlstra (Intel),
	Andrew Morton, Stephane Eranian, Suravee Suthikulpanit,
	Ravi V. Shankar, Ricardo Neri, x86, linux-kernel, Ricardo Neri,
	Andi Kleen

Ricardo,

On Tue, May 04 2021 at 12:05, Ricardo Neri wrote:
> +static int hardlockup_detector_nmi_handler(unsigned int type,
> +					   struct pt_regs *regs)
> +{
> +	struct hpet_hld_data *hdata = hld_data;
> +	int cpu = smp_processor_id();
> +
> +	if (is_hpet_wdt_interrupt(hdata)) {
> +		/*
> +		 * Make a copy of the target mask. We need this as once a CPU
> +		 * gets the watchdog NMI it will clear itself from ipi_cpumask.
> +		 * Also, target_cpumask will be updated in a workqueue for the
> +		 * next NMI IPI.
> +		 */
> +		cpumask_copy(hld_data->ipi_cpumask, hld_data->monitored_cpumask);
> +		/*
> +		 * Even though the NMI IPI will be sent to all CPUs but self,
> +		 * clear the CPU to identify a potential unrelated NMI.
> +		 */
> +		cpumask_clear_cpu(cpu, hld_data->ipi_cpumask);
> +		if (cpumask_weight(hld_data->ipi_cpumask))
> +			apic->send_IPI_mask_allbutself(hld_data->ipi_cpumask,
> +						       NMI_VECTOR);

How is this supposed to work correctly?

x2apic_cluster:
 x2apic_send_IPI_mask_allbutself()
  __x2apic_send_IPI_mask()
    	tmpmsk = this_cpu_cpumask_var_ptr(ipi_mask);
	cpumask_copy(tmpmsk, mask);

So if an NMI hits right after or in the middle of the cpumask_copy()
then the IPI sent from that NMI overwrites tmpmsk and when it's done
then tmpmsk is empty. Similar to when it hits in the middle of
processing, just with the difference that maybe a few IPIs have been
sent already. But the not yet sent ones are lost...

Also anything which ends up in __default_send_IPI_dest_field() is
borked:

__default_send_IPI_dest_field()
	cfg = __prepare_ICR2(mask);
	native_apic_mem_write(APIC_ICR2, cfg);

          <- NMI hits and invokes IPI which invokes __default_send_IPI_dest_field()...

	cfg = __prepare_ICR(0, vector, dest);
	native_apic_mem_write(APIC_ICR, cfg);
        
IOW, when the NMI returns ICR2 has been overwritten and the interrupted
IPI goes into lala land.

Thanks,

        tglx
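
The ICR2/ICR race described above can be reproduced with a small
deterministic userspace model. Everything in it is hypothetical (struct
fake_apic, send_ipi(), the nmi hook); it only models the two-step write
sequence, not real APIC behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the two registers involved. */
struct fake_apic {
	uint32_t icr2;		/* destination latch, models APIC_ICR2 */
	uint32_t delivered[8];	/* destinations that actually got an IPI */
	size_t nr_delivered;
};

typedef void (*nmi_hook)(struct fake_apic *apic);

/*
 * Two-step send: write the destination, then write ICR to trigger the
 * IPI. @nmi, if set, runs between the two writes to model an NMI that
 * hits in that window.
 */
static void send_ipi(struct fake_apic *apic, uint32_t dest, nmi_hook nmi)
{
	apic->icr2 = dest;		/* step 1: latch destination */

	if (nmi)
		nmi(apic);		/* NMI hits here */

	/* step 2: the ICR write sends to whatever ICR2 holds right now */
	assert(apic->nr_delivered < 8);
	apic->delivered[apic->nr_delivered++] = apic->icr2;
}

/* NMI handler that itself sends an IPI, clobbering the latch. */
static void nmi_sends_ipi_to_cpu7(struct fake_apic *apic)
{
	send_ipi(apic, 7, NULL);
}
```

In this model an uninterrupted send to CPU 3 reaches CPU 3, but when the
hook runs between the two writes both IPIs end up at CPU 7 and the one
intended for CPU 3 is never sent.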


