IOMMU Archive on lore.kernel.org
 help / color / Atom feed
* [RFC PATCH v4 00/21] Implement an HPET-based hardlockup detector
@ 2019-05-24  1:16 Ricardo Neri
  2019-05-24  1:16 ` [RFC PATCH v4 01/21] x86/msi: Add definition for NMI delivery mode Ricardo Neri
                   ` (20 more replies)
  0 siblings, 21 replies; 54+ messages in thread
From: Ricardo Neri @ 2019-05-24  1:16 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: Ravi V. Shankar, x86, Ashok Raj, Peter Zijlstra, Ricardo Neri,
	Randy Dunlap, linux-kernel, Stephane Eranian, Ricardo Neri,
	iommu, Andi Kleen

Hi,

This is the third attempt to demonstrate the implementation of a
hardlockup detector driven by the High-Precision Event Timer. This
version provides a few but important updates with respect the previous
version (please refer to the Changes since v3 section). The three initial
implementations can be found here [1], [2], and [3].

== Introduction ==

In CPU architectures that do not have an NMI watchdog, one can be
constructed using a counter of the Performance Monitoring Unit (PMU).
Counters in the PMU have high granularity and high visibility of the CPU.
These capabilities and their limited number make these counters precious
resources. Unfortunately, the perf-based hardlockup detector permanently
consumes one of these counters per CPU.

These counters could be freed for profiling purposes if the hardlockup
detector were driven by another timer.

The hardlockup detector runs relatively infrequently and does not require
visibility of the CPU activity (in addition to detect locked-up CPUs). A
timer that is external to the CPU (e.g., in the chipset) can be used to
drive the detector.

A key requirement is that the timer needs to be capable of issuing a
non-maskable interrupt to the CPU. In most cases, this can be achieved
by tweaking the delivery mode of the interrupt. It is especially
straightforward for MSI interrupts.

== Details of this implementation

This implementation uses an HPET timer to deliver an NMI interrupt via
an MSI message.

Unlike the the perf-based hardlockup detector, this implementation is
driven by a single timer. The timer targets one CPU at a time in a round-
robin manner. This means that if a CPU must be monitored every watch_thresh
seconds, in a system with N monitored CPUs the timer must expire every
watch_thresh/N. A timer expiration per CPU attribute is maintained.

The timer expiration time per CPU is updated every time CPUs are put
online or offline (a CPU hotplug thread enables and disables the watchdog
in these events) or the user changes the file /proc/sys/kernel/
watchdog_cpumask.

Also, given that a single timer drives the detector, a cpumask is needed
to keep track of which online CPUs are allowed to be monitored. This mask
is also updated every time a CPU is put online or offline or when the user
modifies the mask in /proc/sys/kernel/watchdog_cpumask. This mask
is needed to keep the current behavior of the lockup detector.

In order to avoid reading HPET registers in every NMI, the time-stamp
counter is used to determine whether the HPET caused the interrupt. At
every timer expiration, we compute the value the time-stamp counter is
expected to have the next time the timer expires. I have found
experimentally that expected TSC value consistently has an error of less
than 1.5%

Furthermore, only one write to HPET registers is done every
watchdog_thresh seconds. This write can be eliminated if the HPET timer
is periodic.

== Parts of this series ==

For clarity, patches are grouped as follows:

 1) New irq definition. Patch 1 adds a definition for NMI delivery mode
    in MSI interrupts. No other changes are done to generic irq code.

 2) HPET updates. Patches 2-7 prepare the HPET code to accommodate the
    new detector: rework periodic programming, reserve and configure a
    timer for the detector and expose a few existing functions.

 3) NMI watchdog. Patches 8-11 updates the existing hardlockup detector
    to uncouple it from perf, switch back to the perf implementation if
    TSC becomes unstable, and introduce a new NMI handler category
    intended to run after the NMI_LOCAL handlers.

 4) New HPET-based hardlockup detector. Patches 12-17 includes changes to
    probe the hardware resources, configure the interrupt and rotate the
    destination of the interrupts among all monitored CPUs. Also, it
    includes an x86-specific shim hardlockup detector that selects
    between HPET and perf implementations.

 5) Interrupt remapping. Patches 18-22 add support to operate this new
    detector with interrupt remapping enabled.

Thanks and BR,
Ricardo

Change since v3:
 * Fixed yet another bug in periodic programming of the HPET timer that
   prevented the system from booting.
 * Fixed computation of HPET frequency to use hpet_readl() only.
 * Added a missing #include in the watchdog_hld_hpet.c
 * Fixed various typos and grammar errors (Randy Dunlap)

Changes since v2:
 * Added functionality to switch to the perf-based hardlockup
   detector if the TSC becomes unstable (Thomas Gleixner).
 * Brought back the round-robin mechanism proposed in v1 (this time not
   using the interrupt subsystem). This also requires to compute
   expiration times as in v1 (Andi Kleen, Stephane Eranian).
 * Fixed a bug in which using a periodic timer was not working(thanks
   to Suravee Suthikulpanit!).
 * In this version, I incorporate support for interrupt remapping in the
   last 4 patches so that they can be reviewed separately if needed.
 * Removed redundant documentation of functions (Thomas Gleixner).
 * Added a new category of NMI handler, NMI_WATCHDOG, which executes after
   NMI_LOCAL handlers (Andi Kleen).
 * Updated handling of "nmi_watchdog" to support comma-separated
   arguments.
 * Undid split of the generic hardlockup detector into a separate file
   (Thomas Gleixner).
 * Added a new intermediate symbol CONFIG_HARDLOCKUP_DETECTOR_CORE to
   select generic parts of the detector (Paul E. McKenney,
   Thomas Gleixner).
 * Removed use of struct cpumask in favor of a variable length array in
   conjunction with kzalloc (Peter Zijlstra).
 * Added CPU as argument hardlockup_detector_hpet_enable()/disable()
   (Thomas Gleixner).
 * Remove unnecessary export of function declarations, flags and bit
   fields (Thomas Gleixner).
 * Removed  unnecessary check for FSB support when reserving timer for the
   detector (Thomas Gleixner).
 * Separated TSC code from HPET code in kick_timer() (Thomas Gleixner).
 * Reworked condition to check if the expected TSC value is within the
   error margin to avoid conditional (Peter Zijlstra).
 * Removed TSC error margin from struct hld_data; use global variable
   instead (Peter Zijlstra).
 * Removed previously introduced watchdog_get_allowed_cpumask*() and
   reworked hardlockup_detector_hpet_enable()/disable() to not need
   access to watchdog_allowed_mask (Thomas Gleixner).

Changes since v1:

 * Removed reads to HPET registers at every NMI. Instead use the time-stamp
   counter to infer the interrupt source (Thomas Gleixner, Andi Kleen).
 * Do not target CPUs in a round-robin manner. Instead, the HPET timer
   always targets the same CPU; other CPUs are monitored via an
   interprocessor interrupt.
 * Removed use of generic irq code to set interrupt affinity and NMI
   delivery. Instead, configure the interrupt directly in HPET registers
   (Thomas Gleixner).
 * Removed the proposed ops structure for NMI watchdogs. Instead, split
   the existing implementation into a generic library and perf-specific
   infrastructure (Thomas Gleixner, Nicholas Piggin).
 * Added an x86-specific shim hardlockup detector that selects between
   HPET and perf infrastructures as needed (Nicholas Piggin).
 * Removed locks taken in NMI and !NMI context. This was wrong and is no
   longer needed (Thomas Gleixner).
 * Fixed unconditonal return NMI_HANDLED when the HPET timer is programmed
   for FSB/MSI delivery (Peter Zijlstra).

References:

[1]. https://lkml.org/lkml/2018/6/12/1027
[2]. https://lkml.org/lkml/2019/2/27/402
[3]. https://lkml.org/lkml/2019/5/14/386

Ricardo Neri (21):
  x86/msi: Add definition for NMI delivery mode
  x86/hpet: Expose hpet_writel() in header
  x86/hpet: Calculate ticks-per-second in a separate function
  x86/hpet: Add hpet_set_comparator() for periodic and one-shot modes
  x86/hpet: Reserve timer for the HPET hardlockup detector
  x86/hpet: Configure the timer used by the hardlockup detector
  watchdog/hardlockup: Define a generic function to detect hardlockups
  watchdog/hardlockup: Decouple the hardlockup detector from perf
  x86/nmi: Add a NMI_WATCHDOG NMI handler category
  watchdog/hardlockup: Add function to enable NMI watchdog on all
    allowed CPUs at once
  x86/watchdog/hardlockup: Add an HPET-based hardlockup detector
  watchdog/hardlockup/hpet: Adjust timer expiration on the number of
    monitored CPUs
  x86/watchdog/hardlockup/hpet: Determine if HPET timer caused NMI
  watchdog/hardlockup: Use parse_option_str() to handle "nmi_watchdog"
  watchdog/hardlockup/hpet: Only enable the HPET watchdog via a boot
    parameter
  x86/watchdog: Add a shim hardlockup detector
  x86/tsc: Switch to perf-based hardlockup detector if TSC become
    unstable
  x86/apic: Add a parameter for the APIC delivery mode
  iommu/vt-d: Rework prepare_irte() to support per-irq delivery mode
  iommu/vt-d: hpet: Reserve an interrupt remampping table entry for
    watchdog
  x86/watchdog/hardlockup/hpet: Support interrupt remapping

 .../admin-guide/kernel-parameters.txt         |   8 +-
 arch/x86/Kconfig.debug                        |  15 +
 arch/x86/include/asm/hpet.h                   |  47 ++
 arch/x86/include/asm/hw_irq.h                 |   5 +-
 arch/x86/include/asm/msidef.h                 |   4 +
 arch/x86/include/asm/nmi.h                    |   1 +
 arch/x86/kernel/Makefile                      |   2 +
 arch/x86/kernel/apic/vector.c                 |  10 +
 arch/x86/kernel/hpet.c                        | 115 ++++-
 arch/x86/kernel/nmi.c                         |  10 +
 arch/x86/kernel/tsc.c                         |   2 +
 arch/x86/kernel/watchdog_hld.c                |  85 ++++
 arch/x86/kernel/watchdog_hld_hpet.c           | 453 ++++++++++++++++++
 drivers/char/hpet.c                           |  31 +-
 drivers/iommu/intel_irq_remapping.c           |  59 ++-
 include/linux/hpet.h                          |   1 +
 include/linux/nmi.h                           |   8 +-
 kernel/Makefile                               |   2 +-
 kernel/watchdog.c                             |  23 +-
 kernel/watchdog_hld.c                         |  50 +-
 lib/Kconfig.debug                             |   4 +
 21 files changed, 879 insertions(+), 56 deletions(-)
 create mode 100644 arch/x86/kernel/watchdog_hld.c
 create mode 100644 arch/x86/kernel/watchdog_hld_hpet.c

-- 
2.17.1

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [RFC PATCH v4 01/21] x86/msi: Add definition for NMI delivery mode
  2019-05-24  1:16 [RFC PATCH v4 00/21] Implement an HPET-based hardlockup detector Ricardo Neri
@ 2019-05-24  1:16 ` Ricardo Neri
  2019-05-24  1:16 ` [RFC PATCH v4 02/21] x86/hpet: Expose hpet_writel() in header Ricardo Neri
                   ` (19 subsequent siblings)
  20 siblings, 0 replies; 54+ messages in thread
From: Ricardo Neri @ 2019-05-24  1:16 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: Kate Stewart, Peter Zijlstra, Jan Kiszka, Ricardo Neri,
	Stephane Eranian, H. Peter Anvin, Wincy Van, Ashok Raj, x86,
	Andi Kleen, Ravi V. Shankar, Ricardo Neri, Bjorn Helgaas,
	Juergen Gross, Tony Luck, Randy Dunlap, linux-kernel, iommu,
	Eric W. Biederman, Philippe Ombredanne

Until now, the delivery mode of MSI interrupts is set to the default
mode set in the APIC driver. However, there are no restrictions in hardware
to configure each interrupt with a different delivery mode. Specifying the
delivery mode per interrupt is useful when one is interested in changing
the delivery mode of a particular interrupt. For instance, this can be used
to deliver an interrupt as non-maskable.

Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Wincy Van <fanwenyi0529@gmail.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: x86@kernel.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
 arch/x86/include/asm/msidef.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/include/asm/msidef.h b/arch/x86/include/asm/msidef.h
index ee2f8ccc32d0..38ccfdc2d96e 100644
--- a/arch/x86/include/asm/msidef.h
+++ b/arch/x86/include/asm/msidef.h
@@ -18,6 +18,7 @@
 #define MSI_DATA_DELIVERY_MODE_SHIFT	8
 #define  MSI_DATA_DELIVERY_FIXED	(0 << MSI_DATA_DELIVERY_MODE_SHIFT)
 #define  MSI_DATA_DELIVERY_LOWPRI	(1 << MSI_DATA_DELIVERY_MODE_SHIFT)
+#define  MSI_DATA_DELIVERY_NMI		(4 << MSI_DATA_DELIVERY_MODE_SHIFT)
 
 #define MSI_DATA_LEVEL_SHIFT		14
 #define	 MSI_DATA_LEVEL_DEASSERT	(0 << MSI_DATA_LEVEL_SHIFT)
-- 
2.17.1

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [RFC PATCH v4 02/21] x86/hpet: Expose hpet_writel() in header
  2019-05-24  1:16 [RFC PATCH v4 00/21] Implement an HPET-based hardlockup detector Ricardo Neri
  2019-05-24  1:16 ` [RFC PATCH v4 01/21] x86/msi: Add definition for NMI delivery mode Ricardo Neri
@ 2019-05-24  1:16 ` Ricardo Neri
  2019-05-24  1:16 ` [RFC PATCH v4 03/21] x86/hpet: Calculate ticks-per-second in a separate function Ricardo Neri
                   ` (18 subsequent siblings)
  20 siblings, 0 replies; 54+ messages in thread
From: Ricardo Neri @ 2019-05-24  1:16 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: Kate Stewart, Ravi V. Shankar, x86, Ashok Raj, Peter Zijlstra,
	Ricardo Neri, Randy Dunlap, linux-kernel, Stephane Eranian,
	Ricardo Neri, Rafael J. Wysocki, iommu, Tony Luck,
	H. Peter Anvin, Andi Kleen, Philippe Ombredanne

In order to allow hpet_writel() to be used by other components (e.g.,
the HPET-based hardlockup detector) expose it in the HPET header file.

No empty definition is needed if CONFIG_HPET is not selected as all
existing callers select such config symbol.

Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: x86@kernel.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
 arch/x86/include/asm/hpet.h | 1 +
 arch/x86/kernel/hpet.c      | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index 67385d56d4f4..f132fbf984d4 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -72,6 +72,7 @@ extern int is_hpet_enabled(void);
 extern int hpet_enable(void);
 extern void hpet_disable(void);
 extern unsigned int hpet_readl(unsigned int a);
+extern void hpet_writel(unsigned int d, unsigned int a);
 extern void force_hpet_resume(void);
 
 struct irq_data;
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index a0573f2e7763..5e86e024c489 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -62,7 +62,7 @@ inline unsigned int hpet_readl(unsigned int a)
 	return readl(hpet_virt_address + a);
 }
 
-static inline void hpet_writel(unsigned int d, unsigned int a)
+inline void hpet_writel(unsigned int d, unsigned int a)
 {
 	writel(d, hpet_virt_address + a);
 }
-- 
2.17.1

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [RFC PATCH v4 03/21] x86/hpet: Calculate ticks-per-second in a separate function
  2019-05-24  1:16 [RFC PATCH v4 00/21] Implement an HPET-based hardlockup detector Ricardo Neri
  2019-05-24  1:16 ` [RFC PATCH v4 01/21] x86/msi: Add definition for NMI delivery mode Ricardo Neri
  2019-05-24  1:16 ` [RFC PATCH v4 02/21] x86/hpet: Expose hpet_writel() in header Ricardo Neri
@ 2019-05-24  1:16 ` Ricardo Neri
  2019-06-14 15:54   ` Thomas Gleixner
  2019-05-24  1:16 ` [RFC PATCH v4 04/21] x86/hpet: Add hpet_set_comparator() for periodic and one-shot modes Ricardo Neri
                   ` (17 subsequent siblings)
  20 siblings, 1 reply; 54+ messages in thread
From: Ricardo Neri @ 2019-05-24  1:16 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: Kate Stewart, Ravi V. Shankar, x86, Ashok Raj, Arnd Bergmann,
	Peter Zijlstra, Ricardo Neri, Randy Dunlap, Clemens Ladisch,
	linux-kernel, Stephane Eranian, Ricardo Neri, Rafael J. Wysocki,
	iommu, Tony Luck, H. Peter Anvin, Andi Kleen,
	Philippe Ombredanne

It is easier to compute the expiration times of an HPET timer by using
its frequency (i.e., the number of times it ticks in a second) than its
period, as given in the capabilities register.

In addition to the HPET char driver, the HPET-based hardlockup detector
will also need to know the timer's frequency. Thus, create a common
function that both can use.

Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: x86@kernel.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
 drivers/char/hpet.c  | 31 ++++++++++++++++++++++++-------
 include/linux/hpet.h |  1 +
 2 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/drivers/char/hpet.c b/drivers/char/hpet.c
index 3a1e6b3ccd10..747255f552a9 100644
--- a/drivers/char/hpet.c
+++ b/drivers/char/hpet.c
@@ -836,6 +836,29 @@ static unsigned long hpet_calibrate(struct hpets *hpetp)
 	return ret;
 }
 
+u64 hpet_get_ticks_per_sec(u64 hpet_caps)
+{
+	u64 ticks_per_sec, period;
+
+	period = (hpet_caps & HPET_COUNTER_CLK_PERIOD_MASK) >>
+		 HPET_COUNTER_CLK_PERIOD_SHIFT; /* fs, 10^-15 */
+
+	/*
+	 * The frequency is the reciprocal of the period. The period is given
+	 * in femtoseconds per second. Thus, prepare a dividend to obtain the
+	 * frequency in ticks per second.
+	 */
+
+	/* 10^15 femtoseconds per second */
+	ticks_per_sec = 1000000000000000ULL;
+	ticks_per_sec += period >> 1; /* round */
+
+	/* The quotient is put in the dividend. We drop the remainder. */
+	do_div(ticks_per_sec, period);
+
+	return ticks_per_sec;
+}
+
 int hpet_alloc(struct hpet_data *hdp)
 {
 	u64 cap, mcfg;
@@ -844,7 +867,6 @@ int hpet_alloc(struct hpet_data *hdp)
 	struct hpets *hpetp;
 	struct hpet __iomem *hpet;
 	static struct hpets *last;
-	unsigned long period;
 	unsigned long long temp;
 	u32 remainder;
 
@@ -894,12 +916,7 @@ int hpet_alloc(struct hpet_data *hdp)
 
 	last = hpetp;
 
-	period = (cap & HPET_COUNTER_CLK_PERIOD_MASK) >>
-		HPET_COUNTER_CLK_PERIOD_SHIFT; /* fs, 10^-15 */
-	temp = 1000000000000000uLL; /* 10^15 femtoseconds per second */
-	temp += period >> 1; /* round */
-	do_div(temp, period);
-	hpetp->hp_tick_freq = temp; /* ticks per second */
+	hpetp->hp_tick_freq = hpet_get_ticks_per_sec(cap);
 
 	printk(KERN_INFO "hpet%d: at MMIO 0x%lx, IRQ%s",
 		hpetp->hp_which, hdp->hd_phys_address,
diff --git a/include/linux/hpet.h b/include/linux/hpet.h
index 8604564b985d..e7b36bcf4699 100644
--- a/include/linux/hpet.h
+++ b/include/linux/hpet.h
@@ -107,5 +107,6 @@ static inline void hpet_reserve_timer(struct hpet_data *hd, int timer)
 }
 
 int hpet_alloc(struct hpet_data *);
+u64 hpet_get_ticks_per_sec(u64 hpet_caps);
 
 #endif				/* !__HPET__ */
-- 
2.17.1

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [RFC PATCH v4 04/21] x86/hpet: Add hpet_set_comparator() for periodic and one-shot modes
  2019-05-24  1:16 [RFC PATCH v4 00/21] Implement an HPET-based hardlockup detector Ricardo Neri
                   ` (2 preceding siblings ...)
  2019-05-24  1:16 ` [RFC PATCH v4 03/21] x86/hpet: Calculate ticks-per-second in a separate function Ricardo Neri
@ 2019-05-24  1:16 ` Ricardo Neri
  2019-06-14 18:17   ` Thomas Gleixner
  2019-05-24  1:16 ` [RFC PATCH v4 05/21] x86/hpet: Reserve timer for the HPET hardlockup detector Ricardo Neri
                   ` (16 subsequent siblings)
  20 siblings, 1 reply; 54+ messages in thread
From: Ricardo Neri @ 2019-05-24  1:16 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: Kate Stewart, Ravi V. Shankar, x86, Ashok Raj, Peter Zijlstra,
	Ricardo Neri, Randy Dunlap, linux-kernel, Stephane Eranian,
	Ricardo Neri, Rafael J. Wysocki, iommu, Tony Luck,
	H. Peter Anvin, Andi Kleen, Philippe Ombredanne

Instead of setting the timer period directly in hpet_set_periodic(), add a
new helper function hpet_set_comparator() that only sets the accumulator
and comparator. hpet_set_periodic() will only prepare the timer for
periodic mode and leave the expiration programming to
hpet_set_comparator().

This new function can also be used by other components (e.g., the HPET-
based hardlockup detector) which also need to configure HPET timers. Thus,
add its declaration into the hpet header file.

Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: x86@kernel.org
Originally-by: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
 arch/x86/include/asm/hpet.h |  1 +
 arch/x86/kernel/hpet.c      | 57 +++++++++++++++++++++++++++++--------
 2 files changed, 46 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index f132fbf984d4..e7098740f5ee 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -102,6 +102,7 @@ extern int hpet_rtc_timer_init(void);
 extern irqreturn_t hpet_rtc_interrupt(int irq, void *dev_id);
 extern int hpet_register_irq_handler(rtc_irq_handler handler);
 extern void hpet_unregister_irq_handler(rtc_irq_handler handler);
+extern void hpet_set_comparator(int num, unsigned int cmp, unsigned int period);
 
 #endif /* CONFIG_HPET_EMULATE_RTC */
 
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index 5e86e024c489..1723d55219e8 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -290,6 +290,47 @@ static void hpet_legacy_clockevent_register(void)
 	printk(KERN_DEBUG "hpet clockevent registered\n");
 }
 
+/**
+ * hpet_set_comparator() - Helper function for setting comparator register
+ * @num:	The timer ID
+ * @cmp:	The value to be written to the comparator/accumulator
+ * @period:	The value to be written to the period (0 = oneshot mode)
+ *
+ * Helper function for updating comparator, accumulator and period values.
+ *
+ * In periodic mode, HPET needs HPET_TN_SETVAL to be set before writing
+ * to the Tn_CMP to update the accumulator. Then, HPET needs a second
+ * write (with HPET_TN_SETVAL cleared) to Tn_CMP to set the period.
+ * The HPET_TN_SETVAL bit is automatically cleared after the first write.
+ *
+ * For one-shot mode, HPET_TN_SETVAL does not need to be set.
+ *
+ * See the following documents:
+ *   - Intel IA-PC HPET (High Precision Event Timers) Specification
+ *   - AMD-8111 HyperTransport I/O Hub Data Sheet, Publication # 24674
+ */
+void hpet_set_comparator(int num, unsigned int cmp, unsigned int period)
+{
+	if (period) {
+		unsigned int v = hpet_readl(HPET_Tn_CFG(num));
+
+		hpet_writel(v | HPET_TN_SETVAL, HPET_Tn_CFG(num));
+	}
+
+	hpet_writel(cmp, HPET_Tn_CMP(num));
+
+	if (!period)
+		return;
+
+	/*
+	 * This delay is seldom used: never in one-shot mode and in periodic
+	 * only when reprogramming the timer.
+	 */
+	udelay(1);
+	hpet_writel(period, HPET_Tn_CMP(num));
+}
+EXPORT_SYMBOL_GPL(hpet_set_comparator);
+
 static int hpet_set_periodic(struct clock_event_device *evt, int timer)
 {
 	unsigned int cfg, cmp, now;
@@ -301,19 +342,11 @@ static int hpet_set_periodic(struct clock_event_device *evt, int timer)
 	now = hpet_readl(HPET_COUNTER);
 	cmp = now + (unsigned int)delta;
 	cfg = hpet_readl(HPET_Tn_CFG(timer));
-	cfg |= HPET_TN_ENABLE | HPET_TN_PERIODIC | HPET_TN_SETVAL |
-	       HPET_TN_32BIT;
+	cfg |= HPET_TN_ENABLE | HPET_TN_PERIODIC | HPET_TN_32BIT;
 	hpet_writel(cfg, HPET_Tn_CFG(timer));
-	hpet_writel(cmp, HPET_Tn_CMP(timer));
-	udelay(1);
-	/*
-	 * HPET on AMD 81xx needs a second write (with HPET_TN_SETVAL
-	 * cleared) to T0_CMP to set the period. The HPET_TN_SETVAL
-	 * bit is automatically cleared after the first write.
-	 * (See AMD-8111 HyperTransport I/O Hub Data Sheet,
-	 * Publication # 24674)
-	 */
-	hpet_writel((unsigned int)delta, HPET_Tn_CMP(timer));
+
+	hpet_set_comparator(timer, cmp, (unsigned int)delta);
+
 	hpet_start_counter();
 	hpet_print_config();
 
-- 
2.17.1

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [RFC PATCH v4 05/21] x86/hpet: Reserve timer for the HPET hardlockup detector
  2019-05-24  1:16 [RFC PATCH v4 00/21] Implement an HPET-based hardlockup detector Ricardo Neri
                   ` (3 preceding siblings ...)
  2019-05-24  1:16 ` [RFC PATCH v4 04/21] x86/hpet: Add hpet_set_comparator() for periodic and one-shot modes Ricardo Neri
@ 2019-05-24  1:16 ` Ricardo Neri
  2019-06-11 19:54   ` Thomas Gleixner
  2019-05-24  1:16 ` [RFC PATCH v4 06/21] x86/hpet: Configure the timer used by the " Ricardo Neri
                   ` (15 subsequent siblings)
  20 siblings, 1 reply; 54+ messages in thread
From: Ricardo Neri @ 2019-05-24  1:16 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: Kate Stewart, Ravi V. Shankar, x86, Ashok Raj, Arnd Bergmann,
	Peter Zijlstra, Ricardo Neri, Randy Dunlap, Clemens Ladisch,
	linux-kernel, Stephane Eranian, Ricardo Neri, Rafael J. Wysocki,
	iommu, Tony Luck, H. Peter Anvin, Andi Kleen,
	Philippe Ombredanne

HPET timer 2 will be used to drive the HPET-based hardlockup detector.
Reserve such timer to ensure it cannot be used by user space programs or
for clock events.

When looking for MSI-capable timers for clock events, skip timer 2 if
the HPET hardlockup detector is selected.

Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: x86@kernel.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
 arch/x86/include/asm/hpet.h |  3 +++
 arch/x86/kernel/hpet.c      | 19 ++++++++++++++++---
 2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index e7098740f5ee..6f099e2781ce 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -61,6 +61,9 @@
  */
 #define HPET_MIN_PERIOD		100000UL
 
+/* Timer used for the hardlockup detector */
+#define HPET_WD_TIMER_NR 2
+
 /* hpet memory map physical address */
 extern unsigned long hpet_address;
 extern unsigned long force_hpet_address;
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index 1723d55219e8..ff0250831786 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -173,7 +173,8 @@ do {								\
 
 /*
  * When the hpet driver (/dev/hpet) is enabled, we need to reserve
- * timer 0 and timer 1 in case of RTC emulation.
+ * timer 0 and timer 1 in case of RTC emulation. Timer 2 is reserved in case
+ * the HPET-based hardlockup detector is used.
  */
 #ifdef CONFIG_HPET
 
@@ -183,7 +184,7 @@ static void hpet_reserve_platform_timers(unsigned int id)
 {
 	struct hpet __iomem *hpet = hpet_virt_address;
 	struct hpet_timer __iomem *timer = &hpet->hpet_timers[2];
-	unsigned int nrtimers, i;
+	unsigned int nrtimers, i, start_timer;
 	struct hpet_data hd;
 
 	nrtimers = ((id & HPET_ID_NUMBER) >> HPET_ID_NUMBER_SHIFT) + 1;
@@ -198,6 +199,13 @@ static void hpet_reserve_platform_timers(unsigned int id)
 	hpet_reserve_timer(&hd, 1);
 #endif
 
+	if (IS_ENABLED(CONFIG_X86_HARDLOCKUP_DETECTOR_HPET)) {
+		hpet_reserve_timer(&hd, HPET_WD_TIMER_NR);
+		start_timer = HPET_WD_TIMER_NR + 1;
+	} else {
+		start_timer = HPET_WD_TIMER_NR;
+	}
+
 	/*
 	 * NOTE that hd_irq[] reflects IOAPIC input pins (LEGACY_8254
 	 * is wrong for i8259!) not the output IRQ.  Many BIOS writers
@@ -206,7 +214,7 @@ static void hpet_reserve_platform_timers(unsigned int id)
 	hd.hd_irq[0] = HPET_LEGACY_8254;
 	hd.hd_irq[1] = HPET_LEGACY_RTC;
 
-	for (i = 2; i < nrtimers; timer++, i++) {
+	for (i = start_timer; i < nrtimers; timer++, i++) {
 		hd.hd_irq[i] = (readl(&timer->hpet_config) &
 			Tn_INT_ROUTE_CNF_MASK) >> Tn_INT_ROUTE_CNF_SHIFT;
 	}
@@ -651,6 +659,11 @@ static void hpet_msi_capability_lookup(unsigned int start_timer)
 		struct hpet_dev *hdev = &hpet_devs[num_timers_used];
 		unsigned int cfg = hpet_readl(HPET_Tn_CFG(i));
 
+		/* Do not use timer reserved for the HPET watchdog. */
+		if (IS_ENABLED(CONFIG_X86_HARDLOCKUP_DETECTOR_HPET) &&
+		    i == HPET_WD_TIMER_NR)
+			continue;
+
 		/* Only consider HPET timer with MSI support */
 		if (!(cfg & HPET_TN_FSB_CAP))
 			continue;
-- 
2.17.1

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [RFC PATCH v4 06/21] x86/hpet: Configure the timer used by the hardlockup detector
  2019-05-24  1:16 [RFC PATCH v4 00/21] Implement an HPET-based hardlockup detector Ricardo Neri
                   ` (4 preceding siblings ...)
  2019-05-24  1:16 ` [RFC PATCH v4 05/21] x86/hpet: Reserve timer for the HPET hardlockup detector Ricardo Neri
@ 2019-05-24  1:16 ` " Ricardo Neri
  2019-05-24  1:16 ` [RFC PATCH v4 07/21] watchdog/hardlockup: Define a generic function to detect hardlockups Ricardo Neri
                   ` (14 subsequent siblings)
  20 siblings, 0 replies; 54+ messages in thread
From: Ricardo Neri @ 2019-05-24  1:16 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: Kate Stewart, Ravi V. Shankar, x86, Ashok Raj, Arnd Bergmann,
	Peter Zijlstra, Ricardo Neri, Randy Dunlap, Clemens Ladisch,
	linux-kernel, Stephane Eranian, Ricardo Neri, Rafael J. Wysocki,
	iommu, Tony Luck, H. Peter Anvin, Andi Kleen,
	Philippe Ombredanne

Implement the initial configuration of the timer to be used by the
hardlockup detector. Return a data structure with a description of the
timer; this information is subsequently used by the hardlockup detector.

Only provide the timer if it supports Front Side Bus interrupt delivery.
This condition greatly simplifies the implementation of the detector.
Specifically, it helps to avoid the complexities of routing the interrupt
via the IO-APIC (e.g., potential race conditions that arise from re-
programming the IO-APIC in NMI context).

Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: x86@kernel.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
 arch/x86/include/asm/hpet.h | 13 +++++++++++++
 arch/x86/kernel/hpet.c      | 35 +++++++++++++++++++++++++++++++++++
 2 files changed, 48 insertions(+)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index 6f099e2781ce..20abdaa5372d 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -109,6 +109,19 @@ extern void hpet_set_comparator(int num, unsigned int cmp, unsigned int period);
 
 #endif /* CONFIG_HPET_EMULATE_RTC */
 
+#ifdef CONFIG_X86_HARDLOCKUP_DETECTOR_HPET
+struct hpet_hld_data {
+	bool		has_periodic;
+	u32		num;
+	u64		ticks_per_second;
+};
+
+extern struct hpet_hld_data *hpet_hardlockup_detector_assign_timer(void);
+#else
+static inline struct hpet_hld_data *hpet_hardlockup_detector_assign_timer(void)
+{ return NULL; }
+#endif /* CONFIG_X86_HARDLOCKUP_DETECTOR_HPET */
+
 #else /* CONFIG_HPET_TIMER */
 
 static inline int hpet_enable(void) { return 0; }
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index ff0250831786..5f9209949fc7 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -171,6 +171,41 @@ do {								\
 		_hpet_print_config(__func__, __LINE__);	\
 } while (0)
 
+#ifdef CONFIG_X86_HARDLOCKUP_DETECTOR_HPET
+struct hpet_hld_data *hpet_hardlockup_detector_assign_timer(void)
+{
+	struct hpet_hld_data *hdata;
+	u64 temp;
+	u32 cfg;
+
+	cfg = hpet_readl(HPET_Tn_CFG(HPET_WD_TIMER_NR));
+
+	if (!(cfg & HPET_TN_FSB_CAP))
+		return NULL;
+
+	hdata = kzalloc(sizeof(*hdata), GFP_KERNEL);
+	if (!hdata)
+		return NULL;
+
+	if (cfg & HPET_TN_PERIODIC_CAP)
+		hdata->has_periodic = true;
+
+	hdata->num = HPET_WD_TIMER_NR;
+
+	cfg = hpet_readl(HPET_PERIOD);
+
+	/*
+	 * hpet_get_ticks_per_sec() expects the contents of the general
+	 * capabilities register. The period is in the 32 most significant
+	 * bits.
+	 */
+	temp = (u64)cfg << HPET_COUNTER_CLK_PERIOD_SHIFT;
+	hdata->ticks_per_second = hpet_get_ticks_per_sec(temp);
+
+	return hdata;
+}
+#endif /* CONFIG_X86_HARDLOCKUP_DETECTOR_HPET */
+
 /*
  * When the hpet driver (/dev/hpet) is enabled, we need to reserve
  * timer 0 and timer 1 in case of RTC emulation. Timer 2 is reserved in case
-- 
2.17.1

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [RFC PATCH v4 07/21] watchdog/hardlockup: Define a generic function to detect hardlockups
  2019-05-24  1:16 [RFC PATCH v4 00/21] Implement an HPET-based hardlockup detector Ricardo Neri
                   ` (5 preceding siblings ...)
  2019-05-24  1:16 ` [RFC PATCH v4 06/21] x86/hpet: Configure the timer used by the " Ricardo Neri
@ 2019-05-24  1:16 ` Ricardo Neri
  2019-05-24  1:16 ` [RFC PATCH v4 08/21] watchdog/hardlockup: Decouple the hardlockup detector from perf Ricardo Neri
                   ` (13 subsequent siblings)
  20 siblings, 0 replies; 54+ messages in thread
From: Ricardo Neri @ 2019-05-24  1:16 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: Peter Zijlstra, Benjamin Herrenschmidt, Ricardo Neri,
	Stephane Eranian, Paul Mackerras, H. Peter Anvin, sparclinux,
	Ashok Raj, Michael Ellerman, x86, Luis R. Rodriguez, Andi Kleen,
	Don Zickus, Ravi V. Shankar, Ricardo Neri, Frederic Weisbecker,
	Nicholas Piggin, Babu Moger, Mathieu Desnoyers, Tony Luck,
	Randy Dunlap, linux-kernel, iommu, Masami Hiramatsu,
	Philippe Ombredanne, Colin Ian King, Andrew Morton, linuxppc-dev,
	David S. Miller

The procedure to detect hardlockups is independent of the underlying
mechanism that generates the non-maskable interrupt used to drive the
detector. Thus, it can be put in a separate, generic function. In this
manner, it can be invoked by various implementations of the NMI watchdog.

For this purpose, move the bulk of watchdog_overflow_callback() to the
new function inspect_for_hardlockups(). This function can then be called
from the applicable NMI handlers.

Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Babu Moger <Babu.Moger@amd.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: x86@kernel.org
Cc: sparclinux@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
 include/linux/nmi.h   |  1 +
 kernel/watchdog_hld.c | 18 +++++++++++-------
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index 9003e29cde46..5a8b19749769 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -212,6 +212,7 @@ extern int proc_watchdog_thresh(struct ctl_table *, int ,
 				void __user *, size_t *, loff_t *);
 extern int proc_watchdog_cpumask(struct ctl_table *, int,
 				 void __user *, size_t *, loff_t *);
+void inspect_for_hardlockups(struct pt_regs *regs);
 
 #ifdef CONFIG_HAVE_ACPI_APEI_NMI
 #include <asm/nmi.h>
diff --git a/kernel/watchdog_hld.c b/kernel/watchdog_hld.c
index 247bf0b1582c..b352e507b17f 100644
--- a/kernel/watchdog_hld.c
+++ b/kernel/watchdog_hld.c
@@ -106,14 +106,8 @@ static struct perf_event_attr wd_hw_attr = {
 	.disabled	= 1,
 };
 
-/* Callback function for perf event subsystem */
-static void watchdog_overflow_callback(struct perf_event *event,
-				       struct perf_sample_data *data,
-				       struct pt_regs *regs)
+void inspect_for_hardlockups(struct pt_regs *regs)
 {
-	/* Ensure the watchdog never gets throttled */
-	event->hw.interrupts = 0;
-
 	if (__this_cpu_read(watchdog_nmi_touch) == true) {
 		__this_cpu_write(watchdog_nmi_touch, false);
 		return;
@@ -163,6 +157,16 @@ static void watchdog_overflow_callback(struct perf_event *event,
 	return;
 }
 
+/* Callback function for perf event subsystem */
+static void watchdog_overflow_callback(struct perf_event *event,
+				       struct perf_sample_data *data,
+				       struct pt_regs *regs)
+{
+	/* Ensure the watchdog never gets throttled */
+	event->hw.interrupts = 0;
+	inspect_for_hardlockups(regs);
+}
+
 static int hardlockup_detector_event_create(void)
 {
 	unsigned int cpu = smp_processor_id();
-- 
2.17.1

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [RFC PATCH v4 08/21] watchdog/hardlockup: Decouple the hardlockup detector from perf
  2019-05-24  1:16 [RFC PATCH v4 00/21] Implement an HPET-based hardlockup detector Ricardo Neri
                   ` (6 preceding siblings ...)
  2019-05-24  1:16 ` [RFC PATCH v4 07/21] watchdog/hardlockup: Define a generic function to detect hardlockups Ricardo Neri
@ 2019-05-24  1:16 ` Ricardo Neri
  2019-05-24  1:16 ` [RFC PATCH v4 09/21] x86/nmi: Add a NMI_WATCHDOG NMI handler category Ricardo Neri
                   ` (12 subsequent siblings)
  20 siblings, 0 replies; 54+ messages in thread
From: Ricardo Neri @ 2019-05-24  1:16 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: Rafael J. Wysocki, Peter Zijlstra, Benjamin Herrenschmidt,
	Alexei Starovoitov, Stephane Eranian, Kai-Heng Feng,
	Paul Mackerras, H. Peter Anvin, sparclinux, Davidlohr Bueso,
	Ashok Raj, Michael Ellerman, x86, Luis R. Rodriguez,
	David Rientjes, Andi Kleen, Waiman Long, Paul E. McKenney,
	Don Zickus, Ravi V. Shankar, Konrad Rzeszutek Wilk, Marc Zyngier,
	Ricardo Neri, Frederic Weisbecker, Nicholas Piggin, Ricardo Neri,
	Byungchul Park, Babu Moger, Mathieu Desnoyers, Josh Poimboeuf,
	Tony Luck, Randy Dunlap, linux-kernel, iommu, Masami Hiramatsu,
	Philippe Ombredanne, Colin Ian King, Andrew Morton, linuxppc-dev,
	David S. Miller

The current default implementation of the hardlockup detector assumes that
it is implemented using perf events. However, the hardlockup detector can
be driven by other sources of non-maskable interrupts (e.g., a properly
configured timer).

Group and wrap in #ifdef CONFIG_HARDLOCKUP_DETECTOR_PERF all the code
specific to perf: create and manage perf events, stop and start the perf-
based detector.

The generic portion of the detector (monitor the timers' thresholds, check
timestamps and detect hardlockups as well as the implementation of
arch_touch_nmi_watchdog()) is now selected with the new intermediate config
symbol CONFIG_HARDLOCKUP_DETECTOR_CORE.

The perf-based implementation of the detector selects the new intermediate
symbol. Other implementations should do the same.

Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Babu Moger <Babu.Moger@amd.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Kai-Heng Feng <kai.heng.feng@canonical.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: x86@kernel.org
Cc: sparclinux@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
 include/linux/nmi.h   |  5 ++++-
 kernel/Makefile       |  2 +-
 kernel/watchdog_hld.c | 32 ++++++++++++++++++++------------
 lib/Kconfig.debug     |  4 ++++
 4 files changed, 29 insertions(+), 14 deletions(-)

diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index 5a8b19749769..e5f1a86e20b7 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -94,8 +94,11 @@ static inline void hardlockup_detector_disable(void) {}
 # define NMI_WATCHDOG_SYSCTL_PERM	0444
 #endif
 
-#if defined(CONFIG_HARDLOCKUP_DETECTOR_PERF)
+#if defined(CONFIG_HARDLOCKUP_DETECTOR_CORE)
 extern void arch_touch_nmi_watchdog(void);
+#endif
+
+#if defined(CONFIG_HARDLOCKUP_DETECTOR_PERF)
 extern void hardlockup_detector_perf_stop(void);
 extern void hardlockup_detector_perf_restart(void);
 extern void hardlockup_detector_perf_disable(void);
diff --git a/kernel/Makefile b/kernel/Makefile
index 33824f0385b3..d07d52a03cc9 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -83,7 +83,7 @@ obj-$(CONFIG_FAIL_FUNCTION) += fail_function.o
 obj-$(CONFIG_KGDB) += debug/
 obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
 obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
-obj-$(CONFIG_HARDLOCKUP_DETECTOR_PERF) += watchdog_hld.o
+obj-$(CONFIG_HARDLOCKUP_DETECTOR_CORE) += watchdog_hld.o
 obj-$(CONFIG_SECCOMP) += seccomp.o
 obj-$(CONFIG_RELAY) += relay.o
 obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
diff --git a/kernel/watchdog_hld.c b/kernel/watchdog_hld.c
index b352e507b17f..bb6435978c46 100644
--- a/kernel/watchdog_hld.c
+++ b/kernel/watchdog_hld.c
@@ -22,12 +22,8 @@
 
 static DEFINE_PER_CPU(bool, hard_watchdog_warn);
 static DEFINE_PER_CPU(bool, watchdog_nmi_touch);
-static DEFINE_PER_CPU(struct perf_event *, watchdog_ev);
-static DEFINE_PER_CPU(struct perf_event *, dead_event);
-static struct cpumask dead_events_mask;
 
 static unsigned long hardlockup_allcpu_dumped;
-static atomic_t watchdog_cpus = ATOMIC_INIT(0);
 
 notrace void arch_touch_nmi_watchdog(void)
 {
@@ -98,14 +94,6 @@ static inline bool watchdog_check_timestamp(void)
 }
 #endif
 
-static struct perf_event_attr wd_hw_attr = {
-	.type		= PERF_TYPE_HARDWARE,
-	.config		= PERF_COUNT_HW_CPU_CYCLES,
-	.size		= sizeof(struct perf_event_attr),
-	.pinned		= 1,
-	.disabled	= 1,
-};
-
 void inspect_for_hardlockups(struct pt_regs *regs)
 {
 	if (__this_cpu_read(watchdog_nmi_touch) == true) {
@@ -157,6 +145,24 @@ void inspect_for_hardlockups(struct pt_regs *regs)
 	return;
 }
 
+#ifdef CONFIG_HARDLOCKUP_DETECTOR_PERF
+#undef pr_fmt
+#define pr_fmt(fmt) "NMI perf watchdog: " fmt
+
+static DEFINE_PER_CPU(struct perf_event *, watchdog_ev);
+static DEFINE_PER_CPU(struct perf_event *, dead_event);
+static struct cpumask dead_events_mask;
+
+static atomic_t watchdog_cpus = ATOMIC_INIT(0);
+
+static struct perf_event_attr wd_hw_attr = {
+	.type		= PERF_TYPE_HARDWARE,
+	.config		= PERF_COUNT_HW_CPU_CYCLES,
+	.size		= sizeof(struct perf_event_attr),
+	.pinned		= 1,
+	.disabled	= 1,
+};
+
 /* Callback function for perf event subsystem */
 static void watchdog_overflow_callback(struct perf_event *event,
 				       struct perf_sample_data *data,
@@ -298,3 +304,5 @@ int __init hardlockup_detector_perf_init(void)
 	}
 	return ret;
 }
+
+#endif /* CONFIG_HARDLOCKUP_DETECTOR_PERF */
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index cbdfae379896..c31d7a6e284d 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -876,9 +876,13 @@ config BOOTPARAM_SOFTLOCKUP_PANIC_VALUE
 	default 0 if !BOOTPARAM_SOFTLOCKUP_PANIC
 	default 1 if BOOTPARAM_SOFTLOCKUP_PANIC
 
+config HARDLOCKUP_DETECTOR_CORE
+	bool
+
 config HARDLOCKUP_DETECTOR_PERF
 	bool
 	select SOFTLOCKUP_DETECTOR
+	select HARDLOCKUP_DETECTOR_CORE
 
 #
 # Enables a timestamp based low pass filter to compensate for perf based
-- 
2.17.1

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [RFC PATCH v4 09/21] x86/nmi: Add a NMI_WATCHDOG NMI handler category
  2019-05-24  1:16 [RFC PATCH v4 00/21] Implement an HPET-based hardlockup detector Ricardo Neri
                   ` (7 preceding siblings ...)
  2019-05-24  1:16 ` [RFC PATCH v4 08/21] watchdog/hardlockup: Decouple the hardlockup detector from perf Ricardo Neri
@ 2019-05-24  1:16 ` Ricardo Neri
  2019-05-24  1:16 ` [RFC PATCH v4 10/21] watchdog/hardlockup: Add function to enable NMI watchdog on all allowed CPUs at once Ricardo Neri
                   ` (11 subsequent siblings)
  20 siblings, 0 replies; 54+ messages in thread
From: Ricardo Neri @ 2019-05-24  1:16 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: Ravi V. Shankar, x86, Ashok Raj, Peter Zijlstra, Ricardo Neri,
	Randy Dunlap, linux-kernel, Stephane Eranian, Ricardo Neri,
	iommu, Andi Kleen

Add a NMI_WATCHDOG as a new category of NMI handler. This new category
is to be used with the HPET-based hardlockup detector. This detector
does not have a direct way of checking if the HPET timer is the source of
the NMI. Instead it indirectly estimate it using the time-stamp counter.

Therefore, we may have false-positives in case another NMI occurs within
the estimated time window. For this reason, we want the handler of the
detector to be called after all the NMI_LOCAL handlers. A simple way
of achieving this with a new NMI handler category.

Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
 arch/x86/include/asm/nmi.h |  1 +
 arch/x86/kernel/nmi.c      | 10 ++++++++++
 2 files changed, 11 insertions(+)

diff --git a/arch/x86/include/asm/nmi.h b/arch/x86/include/asm/nmi.h
index 75ded1d13d98..75aa98313cde 100644
--- a/arch/x86/include/asm/nmi.h
+++ b/arch/x86/include/asm/nmi.h
@@ -29,6 +29,7 @@ enum {
 	NMI_UNKNOWN,
 	NMI_SERR,
 	NMI_IO_CHECK,
+	NMI_WATCHDOG,
 	NMI_MAX
 };
 
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 4df7705022b9..43e96aedc6fe 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -64,6 +64,10 @@ static struct nmi_desc nmi_desc[NMI_MAX] =
 		.lock = __RAW_SPIN_LOCK_UNLOCKED(&nmi_desc[3].lock),
 		.head = LIST_HEAD_INIT(nmi_desc[3].head),
 	},
+	{
+		.lock = __RAW_SPIN_LOCK_UNLOCKED(&nmi_desc[4].lock),
+		.head = LIST_HEAD_INIT(nmi_desc[4].head),
+	},
 
 };
 
@@ -174,6 +178,8 @@ int __register_nmi_handler(unsigned int type, struct nmiaction *action)
 	 */
 	WARN_ON_ONCE(type == NMI_SERR && !list_empty(&desc->head));
 	WARN_ON_ONCE(type == NMI_IO_CHECK && !list_empty(&desc->head));
+	WARN_ON_ONCE(type == NMI_WATCHDOG && !list_empty(&desc->head));
+
 
 	/*
 	 * some handlers need to be executed first otherwise a fake
@@ -384,6 +390,10 @@ static void default_do_nmi(struct pt_regs *regs)
 	}
 	raw_spin_unlock(&nmi_reason_lock);
 
+	handled = nmi_handle(NMI_WATCHDOG, regs);
+	if (handled == NMI_HANDLED)
+		return;
+
 	/*
 	 * Only one NMI can be latched at a time.  To handle
 	 * this we may process multiple nmi handlers at once to
-- 
2.17.1

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [RFC PATCH v4 10/21] watchdog/hardlockup: Add function to enable NMI watchdog on all allowed CPUs at once
  2019-05-24  1:16 [RFC PATCH v4 00/21] Implement an HPET-based hardlockup detector Ricardo Neri
                   ` (8 preceding siblings ...)
  2019-05-24  1:16 ` [RFC PATCH v4 09/21] x86/nmi: Add a NMI_WATCHDOG NMI handler category Ricardo Neri
@ 2019-05-24  1:16 ` Ricardo Neri
  2019-05-24  1:16 ` [RFC PATCH v4 11/21] x86/watchdog/hardlockup: Add an HPET-based hardlockup detector Ricardo Neri
                   ` (10 subsequent siblings)
  20 siblings, 0 replies; 54+ messages in thread
From: Ricardo Neri @ 2019-05-24  1:16 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: Rafael J. Wysocki, Peter Zijlstra, Benjamin Herrenschmidt,
	Alexei Starovoitov, Stephane Eranian, Kai-Heng Feng,
	Paul Mackerras, H. Peter Anvin, sparclinux, Davidlohr Bueso,
	Ashok Raj, Michael Ellerman, x86, Luis R. Rodriguez,
	David Rientjes, Andi Kleen, Waiman Long, Paul E. McKenney,
	Don Zickus, Ravi V. Shankar, Konrad Rzeszutek Wilk, Marc Zyngier,
	Ricardo Neri, Frederic Weisbecker, Nicholas Piggin, Ricardo Neri,
	Byungchul Park, Babu Moger, Mathieu Desnoyers, Josh Poimboeuf,
	Tony Luck, Randy Dunlap, linux-kernel, iommu, Masami Hiramatsu,
	Philippe Ombredanne, Colin Ian King, Andrew Morton, linuxppc-dev,
	David S. Miller

When there are more than one implementation of the NMI watchdog, there may
be situations in which switching from one to another is needed (e.g., if
the time-stamp counter becomes unstable, the HPET-based NMI watchdog can
no longer be used.

The perf-based implementation of the hardlockup detector makes use of
various per-CPU variables which are accessed via this_cpu operations.
Hence, each CPU needs to enable its own NMI watchdog if using the perf
implementation.

Add functionality to switch from one NMI watchdog to another and do it
from each allowed CPU.

Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Babu Moger <Babu.Moger@amd.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Kai-Heng Feng <kai.heng.feng@canonical.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: x86@kernel.org
Cc: sparclinux@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
 include/linux/nmi.h |  2 ++
 kernel/watchdog.c   | 15 +++++++++++++++
 2 files changed, 17 insertions(+)

diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index e5f1a86e20b7..6d828334348b 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -83,9 +83,11 @@ static inline void reset_hung_task_detector(void) { }
 
 #if defined(CONFIG_HARDLOCKUP_DETECTOR)
 extern void hardlockup_detector_disable(void);
+extern void hardlockup_start_all(void);
 extern unsigned int hardlockup_panic;
 #else
 static inline void hardlockup_detector_disable(void) {}
+static inline void hardlockup_start_all(void) {}
 #endif
 
 #if defined(CONFIG_HAVE_NMI_WATCHDOG) || defined(CONFIG_HARDLOCKUP_DETECTOR)
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 7f9e7b9306fe..be589001200a 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -566,6 +566,21 @@ int lockup_detector_offline_cpu(unsigned int cpu)
 	return 0;
 }
 
+static int hardlockup_start_fn(void *data)
+{
+	watchdog_nmi_enable(smp_processor_id());
+	return 0;
+}
+
+void hardlockup_start_all(void)
+{
+	int cpu;
+
+	cpumask_copy(&watchdog_allowed_mask, &watchdog_cpumask);
+	for_each_cpu(cpu, &watchdog_allowed_mask)
+		smp_call_on_cpu(cpu, hardlockup_start_fn, NULL, false);
+}
+
 static void lockup_detector_reconfigure(void)
 {
 	cpus_read_lock();
-- 
2.17.1

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [RFC PATCH v4 11/21] x86/watchdog/hardlockup: Add an HPET-based hardlockup detector
  2019-05-24  1:16 [RFC PATCH v4 00/21] Implement an HPET-based hardlockup detector Ricardo Neri
                   ` (9 preceding siblings ...)
  2019-05-24  1:16 ` [RFC PATCH v4 10/21] watchdog/hardlockup: Add function to enable NMI watchdog on all allowed CPUs at once Ricardo Neri
@ 2019-05-24  1:16 ` Ricardo Neri
  2019-05-24  1:16 ` [RFC PATCH v4 12/21] watchdog/hardlockup/hpet: Adjust timer expiration on the number of monitored CPUs Ricardo Neri
                   ` (9 subsequent siblings)
  20 siblings, 0 replies; 54+ messages in thread
From: Ricardo Neri @ 2019-05-24  1:16 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: Kate Stewart, Rafael J. Wysocki, Peter Zijlstra, Jan Kiszka,
	Clemens Ladisch, Ricardo Neri, Stephane Eranian, Masahiro Yamada,
	H. Peter Anvin, Ashok Raj, x86, Andi Kleen, Ravi V. Shankar,
	Arnd Bergmann, Ricardo Neri, Mimi Zohar, Tony Luck, Randy Dunlap,
	Nick Desaulniers, linux-kernel, iommu, Philippe Ombredanne,
	Nayna Jain

This is the initial implementation of a hardlockup detector driven by an
HPET timer. This initial implementation includes functions to control the
timer via its registers. It also requests such timer, installs an NMI
interrupt handler and performs the initial configuration of the timer.

The detector is not functional at this stage. A subsequent changeset will
invoke the interfaces provides by this detector as well as functionality
to determine if the HPET timer caused the NMI.

In order to detect hardlockups in all the monitored CPUs, move the
interrupt to the next monitored CPU while handling the NMI interrupt; wrap
around when reaching the highest CPU in the mask. This rotation is
achieved by setting the affinity mask to only contain the next CPU to
monitor. A cpumask keeps track of all the CPUs that need to be monitored.
Such cpumask is updated when the watchdog is enabled or disabled in a
particular CPU.

This detector relies on an HPET timer that is capable of using Front Side
Bus interrupts. In order to avoid using the generic interrupt code,
program directly the MSI message register of the HPET timer.

HPET registers are only accessed to kick the timer after looking for
hardlockups. This happens every watchdog_thresh seconds. A subsequent
changeset will determine whether the HPET timer caused the interrupt based
on the value of the time-stamp counter. For now, just add a stub function.

Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Mimi Zohar <zohar@linux.ibm.com>
Cc: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Nayna Jain <nayna@linux.ibm.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
Cc: x86@kernel.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
 arch/x86/Kconfig.debug              |  11 +
 arch/x86/include/asm/hpet.h         |  13 ++
 arch/x86/kernel/Makefile            |   1 +
 arch/x86/kernel/hpet.c              |   3 +-
 arch/x86/kernel/watchdog_hld_hpet.c | 335 ++++++++++++++++++++++++++++
 5 files changed, 362 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/kernel/watchdog_hld_hpet.c

diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index f730680dc818..445bbb188f10 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -169,6 +169,17 @@ config IOMMU_LEAK
 config HAVE_MMIOTRACE_SUPPORT
 	def_bool y
 
+config X86_HARDLOCKUP_DETECTOR_HPET
+	bool "Use HPET Timer for Hard Lockup Detection"
+	select SOFTLOCKUP_DETECTOR
+	select HARDLOCKUP_DETECTOR
+	select HARDLOCKUP_DETECTOR_CORE
+	depends on HPET_TIMER && HPET && X86_64
+	help
+	  Say y to enable a hardlockup detector that is driven by a High-
+	  Precision Event Timer. This option is helpful to not use counters
+	  from the Performance Monitoring Unit to drive the detector.
+
 config X86_DECODER_SELFTEST
 	bool "x86 instruction decoder selftest"
 	depends on DEBUG_KERNEL && KPROBES
diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index 20abdaa5372d..31fc27508cf3 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -114,12 +114,25 @@ struct hpet_hld_data {
 	bool		has_periodic;
 	u32		num;
 	u64		ticks_per_second;
+	u32		handling_cpu;
+	u32		enabled_cpus;
+	struct msi_msg	msi_msg;
+	unsigned long	cpu_monitored_mask[0];
 };
 
 extern struct hpet_hld_data *hpet_hardlockup_detector_assign_timer(void);
+extern int hardlockup_detector_hpet_init(void);
+extern void hardlockup_detector_hpet_stop(void);
+extern void hardlockup_detector_hpet_enable(unsigned int cpu);
+extern void hardlockup_detector_hpet_disable(unsigned int cpu);
 #else
 static inline struct hpet_hld_data *hpet_hardlockup_detector_assign_timer(void)
 { return NULL; }
+static inline int hardlockup_detector_hpet_init(void)
+{ return -ENODEV; }
+static inline void hardlockup_detector_hpet_stop(void) {}
+static inline void hardlockup_detector_hpet_enable(unsigned int cpu) {}
+static inline void hardlockup_detector_hpet_disable(unsigned int cpu) {}
 #endif /* CONFIG_X86_HARDLOCKUP_DETECTOR_HPET */
 
 #else /* CONFIG_HPET_TIMER */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 3578ad248bc9..3ad55de67e8b 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -106,6 +106,7 @@ obj-$(CONFIG_VM86)		+= vm86_32.o
 obj-$(CONFIG_EARLY_PRINTK)	+= early_printk.o
 
 obj-$(CONFIG_HPET_TIMER) 	+= hpet.o
+obj-$(CONFIG_X86_HARDLOCKUP_DETECTOR_HPET) += watchdog_hld_hpet.o
 obj-$(CONFIG_APB_TIMER)		+= apb_timer.o
 
 obj-$(CONFIG_AMD_NB)		+= amd_nb.o
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index 5f9209949fc7..dd3bb664a188 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -183,7 +183,8 @@ struct hpet_hld_data *hpet_hardlockup_detector_assign_timer(void)
 	if (!(cfg & HPET_TN_FSB_CAP))
 		return NULL;
 
-	hdata = kzalloc(sizeof(*hdata), GFP_KERNEL);
+	hdata = kzalloc(sizeof(struct hpet_hld_data) + cpumask_size(),
+			GFP_KERNEL);
 	if (!hdata)
 		return NULL;
 
diff --git a/arch/x86/kernel/watchdog_hld_hpet.c b/arch/x86/kernel/watchdog_hld_hpet.c
new file mode 100644
index 000000000000..dff4dadabd4c
--- /dev/null
+++ b/arch/x86/kernel/watchdog_hld_hpet.c
@@ -0,0 +1,335 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * A hardlockup detector driven by an HPET timer.
+ *
+ * Copyright (C) Intel Corporation 2019
+ *
+ * A hardlockup detector driven by an HPET timer. It implements the same
+ * interfaces as the PERF-based hardlockup detector.
+ *
+ * A single HPET timer is used to monitor all the CPUs from the allowed_mask
+ * from kernel/watchdog.c. Thus, the timer is programmed to expire every
+ * watchdog_thresh/cpumask_weight(watchdog_allowed_cpumask). The timer targets
+ * CPUs in round robin manner. Thus, every cpu in watchdog_allowed_mask is
+ * monitored every watchdog_thresh seconds.
+ */
+
+#define pr_fmt(fmt) "NMI hpet watchdog: " fmt
+
+#include <linux/nmi.h>
+#include <linux/hpet.h>
+#include <linux/slab.h>
+#include <asm/msidef.h>
+#include <asm/hpet.h>
+
+static struct hpet_hld_data *hld_data;
+static bool hardlockup_use_hpet;
+
+/**
+ * kick_timer() - Reprogram timer to expire in the future
+ * @hdata:	A data structure with the timer instance to update
+ * @force:	Force reprogramming
+ *
+ * Reprogram the timer to expire within watchdog_thresh seconds in the future.
+ * If the timer supports periodic mode, it is not kicked unless @force is
+ * true.
+ */
+static void kick_timer(struct hpet_hld_data *hdata, bool force)
+{
+	bool kick_needed = force || !(hdata->has_periodic);
+	u64 new_compare, count, period = 0;
+
+	/*
+	 * Update the comparator in increments of watch_thresh seconds relative
+	 * to the current count. Since watch_thresh is given in seconds, we
+	 * are able to update the comparator before the counter reaches such new
+	 * value.
+	 *
+	 * Let it wrap around if needed.
+	 */
+
+	if (!kick_needed)
+		return;
+
+	if (hdata->has_periodic)
+		period = watchdog_thresh * hdata->ticks_per_second;
+
+	count = hpet_readl(HPET_COUNTER);
+	new_compare = count + watchdog_thresh * hdata->ticks_per_second;
+	hpet_set_comparator(hdata->num, (u32)new_compare, (u32)period);
+}
+
+static void disable_timer(struct hpet_hld_data *hdata)
+{
+	u32 v;
+
+	v = hpet_readl(HPET_Tn_CFG(hdata->num));
+	v &= ~HPET_TN_ENABLE;
+	hpet_writel(v, HPET_Tn_CFG(hdata->num));
+}
+
+static void enable_timer(struct hpet_hld_data *hdata)
+{
+	u32 v;
+
+	v = hpet_readl(HPET_Tn_CFG(hdata->num));
+	v |= HPET_TN_ENABLE;
+	hpet_writel(v, HPET_Tn_CFG(hdata->num));
+}
+
+/**
+ * is_hpet_wdt_interrupt() - Check if an HPET timer caused the interrupt
+ * @hdata:	A data structure with the timer instance to enable
+ *
+ * Returns:
+ * True if the HPET watchdog timer caused the interrupt. False otherwise.
+ */
+static bool is_hpet_wdt_interrupt(struct hpet_hld_data *hdata)
+{
+	return false;
+}
+
+/**
+ * compose_msi_msg() - Populate address and data fields of an MSI message
+ * @hdata:	A data strucure with the message to populate
+ *
+ * Initialize the fields of the MSI message to deliver an NMI interrupt. This
+ * function only initialize the files that don't change during the operation of
+ * of the detector. This function does not populate the Destination ID; which
+ * should be populated using update_msi_destid().
+ */
+static void compose_msi_msg(struct hpet_hld_data *hdata)
+{
+	struct msi_msg *msg = &hdata->msi_msg;
+
+	/*
+	 * The HPET FSB Interrupt Route register does not have an
+	 * address_hi part.
+	 */
+	msg->address_lo = MSI_ADDR_BASE_LO;
+
+	if (apic->irq_dest_mode == 0)
+		msg->address_lo |= MSI_ADDR_DEST_MODE_PHYSICAL;
+	else
+		msg->address_lo |= MSI_ADDR_DEST_MODE_LOGICAL;
+
+	msg->address_lo |= MSI_ADDR_REDIRECTION_CPU;
+
+	/*
+	 * On edge trigger, we don't care about assert level. Also,
+	 * since delivery mode is NMI, no irq vector is needed.
+	 */
+	msg->data = MSI_DATA_TRIGGER_EDGE | MSI_DATA_LEVEL_ASSERT |
+		    MSI_DATA_DELIVERY_NMI;
+}
+
+/** update_msi_destid() - Update APIC destid of handling CPU
+ * @hdata:	A data strucure with the MSI message to update
+ *
+ * Update the APIC destid of the MSI message generated by the HPET timer
+ * on expiration.
+ */
+static int update_msi_destid(struct hpet_hld_data *hdata)
+{
+	u32 destid;
+
+	hdata->msi_msg.address_lo &= ~MSI_ADDR_DEST_ID_MASK;
+	destid = apic->calc_dest_apicid(hdata->handling_cpu);
+	hdata->msi_msg.address_lo |= MSI_ADDR_DEST_ID(destid);
+
+	hpet_writel(hdata->msi_msg.address_lo, HPET_Tn_ROUTE(hdata->num) + 4);
+
+	return 0;
+}
+
+/**
+ * hardlockup_detector_nmi_handler() - NMI Interrupt handler
+ * @type:	Type of NMI handler; not used.
+ * @regs:	Register values as seen when the NMI was asserted
+ *
+ * Check if it was caused by the expiration of the HPET timer. If yes, inspect
+ * for lockups. Also, prepare the HPET timer to target the next monitored CPU
+ * and kick it.
+ *
+ * Returns:
+ * NMI_DONE if the HPET timer did not cause the interrupt. NMI_HANDLED
+ * otherwise.
+ */
+static int hardlockup_detector_nmi_handler(unsigned int type,
+					   struct pt_regs *regs)
+{
+	struct hpet_hld_data *hdata = hld_data;
+	u32 cpu = smp_processor_id();
+
+	if (!is_hpet_wdt_interrupt(hdata))
+		return NMI_DONE;
+
+	inspect_for_hardlockups(regs);
+
+	cpu = cpumask_next(cpu, to_cpumask(hdata->cpu_monitored_mask));
+	if (cpu >= nr_cpu_ids)
+		cpu = cpumask_first(to_cpumask(hdata->cpu_monitored_mask));
+
+	hdata->handling_cpu = cpu;
+	update_msi_destid(hdata);
+	kick_timer(hdata, !(hdata->has_periodic));
+
+	return NMI_HANDLED;
+}
+
+/**
+ * setup_irq_msi_mode() - Configure the timer to deliver an MSI interrupt
+ * @data:	Data associated with the instance of the HPET timer to configure
+ *
+ * Configure the HPET timer to deliver interrupts via the Front-
+ * Side Bus.
+ *
+ * Returns:
+ * 0 success. An error code in configuration was unsuccessful.
+ */
+static int setup_irq_msi_mode(struct hpet_hld_data *hdata)
+{
+	u32 v;
+
+	compose_msi_msg(hdata);
+	hpet_writel(hdata->msi_msg.data, HPET_Tn_ROUTE(hdata->num));
+	hpet_writel(hdata->msi_msg.address_lo, HPET_Tn_ROUTE(hdata->num) + 4);
+
+	/*
+	 * Since FSB interrupt delivery is used, configure as edge-triggered
+	 * interrupt.
+	 */
+	v = hpet_readl(HPET_Tn_CFG(hdata->num));
+	v &= ~HPET_TN_LEVEL;
+	v |= HPET_TN_FSB;
+
+	hpet_writel(v, HPET_Tn_CFG(hdata->num));
+
+	return 0;
+}
+
+/**
+ * setup_hpet_irq() - Configure the interrupt delivery of an HPET timer
+ * @data:	Data associated with the instance of the HPET timer to configure
+ *
+ * Configure the interrupt parameters of an HPET timer. If supported, configure
+ * interrupts to be delivered via the Front-Side Bus. Also, install an interrupt
+ * handler.
+ *
+ * Returns:
+ * 0 success. An error code in configuration was unsuccessful.
+ */
+static int setup_hpet_irq(struct hpet_hld_data *hdata)
+{
+	int ret;
+
+	ret = setup_irq_msi_mode(hdata);
+	if (ret)
+		return ret;
+
+	ret = register_nmi_handler(NMI_WATCHDOG,
+				   hardlockup_detector_nmi_handler, 0,
+				   "hpet_hld");
+
+	return ret;
+}
+
+/**
+ * hardlockup_detector_hpet_enable() - Enable the hardlockup detector
+ * @cpu:	CPU Index in which the watchdog will be enabled.
+ *
+ * Enable the hardlockup detector in @cpu. This means adding it to the
+ * cpumask of monitored CPUs. If @cpu is the first one for which the
+ * hardlockup detector is enabled, enable and kick the timer.
+ */
+void hardlockup_detector_hpet_enable(unsigned int cpu)
+{
+	cpumask_set_cpu(cpu, to_cpumask(hld_data->cpu_monitored_mask));
+
+	if (!hld_data->enabled_cpus++) {
+		hld_data->handling_cpu = cpu;
+		update_msi_destid(hld_data);
+		/* Force timer kick when detector is just enabled */
+		kick_timer(hld_data, true);
+		enable_timer(hld_data);
+	}
+}
+
+/**
+ * hardlockup_detector_hpet_disable() - Disable the hardlockup detector
+ * @cpu:	CPU index in which the watchdog will be disabled
+ *
+ * @cpu is removed from the cpumask of monitored CPUs. If @cpu is also the CPU
+ * handling the timer interrupt, update it to be the next available, monitored,
+ * CPU.
+ */
+void hardlockup_detector_hpet_disable(unsigned int cpu)
+{
+	cpumask_clear_cpu(cpu, to_cpumask(hld_data->cpu_monitored_mask));
+	hld_data->enabled_cpus--;
+
+	if (hld_data->handling_cpu != cpu)
+		return;
+
+	disable_timer(hld_data);
+	if (!hld_data->enabled_cpus)
+		return;
+
+	cpu = cpumask_first(to_cpumask(hld_data->cpu_monitored_mask));
+	hld_data->handling_cpu = cpu;
+	update_msi_destid(hld_data);
+	enable_timer(hld_data);
+}
+
+void hardlockup_detector_hpet_stop(void)
+{
+	disable_timer(hld_data);
+}
+
+/**
+ * hardlockup_detector_hpet_init() - Initialize the hardlockup detector
+ *
+ * Only initialize and configure the detector if an HPET is available on the
+ * system.
+ *
+ * Returns:
+ * 0 success. An error code if initialization was unsuccessful.
+ */
+int __init hardlockup_detector_hpet_init(void)
+{
+	int ret;
+	u32 v;
+
+	if (!hardlockup_use_hpet)
+		return -ENODEV;
+
+	if (!is_hpet_enabled())
+		return -ENODEV;
+
+	if (check_tsc_unstable())
+		return -ENODEV;
+
+	hld_data = hpet_hardlockup_detector_assign_timer();
+	if (!hld_data)
+		return -ENODEV;
+
+	disable_timer(hld_data);
+
+	ret = setup_hpet_irq(hld_data);
+	if (ret) {
+		kfree(hld_data);
+		hld_data = NULL;
+	}
+
+	v = hpet_readl(HPET_Tn_CFG(hld_data->num));
+	v |= HPET_TN_32BIT;
+
+	if (hld_data->has_periodic)
+		v |= HPET_TN_PERIODIC;
+	else
+		v &= ~HPET_TN_PERIODIC;
+
+	hpet_writel(v, HPET_Tn_CFG(hld_data->num));
+
+	return ret;
+}
-- 
2.17.1

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [RFC PATCH v4 12/21] watchdog/hardlockup/hpet: Adjust timer expiration on the number of monitored CPUs
  2019-05-24  1:16 [RFC PATCH v4 00/21] Implement an HPET-based hardlockup detector Ricardo Neri
                   ` (10 preceding siblings ...)
  2019-05-24  1:16 ` [RFC PATCH v4 11/21] x86/watchdog/hardlockup: Add an HPET-based hardlockup detector Ricardo Neri
@ 2019-05-24  1:16 ` Ricardo Neri
  2019-06-11 20:11   ` Thomas Gleixner
  2019-05-24  1:16 ` [RFC PATCH v4 13/21] x86/watchdog/hardlockup/hpet: Determine if HPET timer caused NMI Ricardo Neri
                   ` (8 subsequent siblings)
  20 siblings, 1 reply; 54+ messages in thread
From: Ricardo Neri @ 2019-05-24  1:16 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: Rafael J. Wysocki, Peter Zijlstra, Alexei Starovoitov,
	Stephane Eranian, Kai-Heng Feng, Davidlohr Bueso, Ashok Raj,
	Michael Ellerman, x86, Luis R. Rodriguez, David Rientjes,
	Andi Kleen, Waiman Long, Paul E. McKenney, Masami Hiramatsu,
	Don Zickus, Ravi V. Shankar, Konrad Rzeszutek Wilk, Marc Zyngier,
	Ricardo Neri, Frederic Weisbecker, Nicholas Piggin, Ricardo Neri,
	Byungchul Park, Babu Moger, Mathieu Desnoyers, Josh Poimboeuf,
	Tony Luck, Randy Dunlap, linux-kernel, iommu, Jacob Pan,
	Philippe Ombredanne, Colin Ian King, Andrew Morton

Each CPU should be monitored for hardlockups every watchdog_thresh seconds.
Since all the CPUs in the system are monitored by the same timer and the
timer interrupt is rotated among the monitored CPUs, the timer must expire
every watchdog_thresh/N seconds; where N is the number of monitored CPUs.
Use the new member of struct hld_data, ticks_per_cpu, to store the
aforementioned quantity.

The ticks-per-CPU quantity is updated every time the number of monitored
CPUs changes: when the watchdog is enabled or disabled for a specific CPU.
If the timer is used in periodic mode, it needs to be adjusted to reflect
the new expected expiration.

Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Jacob Pan <jacob.jun.pan@intel.com>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Babu Moger <Babu.Moger@amd.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: "Luis R. Rodriguez" <mcgrof@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Kai-Heng Feng <kai.heng.feng@canonical.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: x86@kernel.org
Cc: iommu@lists.linux-foundation.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
 arch/x86/include/asm/hpet.h         |  1 +
 arch/x86/kernel/watchdog_hld_hpet.c | 46 +++++++++++++++++++++++++++--
 2 files changed, 44 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index 31fc27508cf3..64acacce095d 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -114,6 +114,7 @@ struct hpet_hld_data {
 	bool		has_periodic;
 	u32		num;
 	u64		ticks_per_second;
+	u64		ticks_per_cpu;
 	u32		handling_cpu;
 	u32		enabled_cpus;
 	struct msi_msg	msi_msg;
diff --git a/arch/x86/kernel/watchdog_hld_hpet.c b/arch/x86/kernel/watchdog_hld_hpet.c
index dff4dadabd4c..74aeb0535d08 100644
--- a/arch/x86/kernel/watchdog_hld_hpet.c
+++ b/arch/x86/kernel/watchdog_hld_hpet.c
@@ -45,6 +45,13 @@ static void kick_timer(struct hpet_hld_data *hdata, bool force)
 	 * are able to update the comparator before the counter reaches such new
 	 * value.
 	 *
+	 * Each CPU must be monitored every watch_thresh seconds. Since the
+	 * timer targets one CPU at a time, it must expire every
+	 *
+	 *        ticks_per_cpu = watch_thresh * ticks_per_second /enabled_cpus
+	 *
+	 * as computed in update_ticks_per_cpu().
+	 *
 	 * Let it wrap around if needed.
 	 */
 
@@ -52,10 +59,10 @@ static void kick_timer(struct hpet_hld_data *hdata, bool force)
 		return;
 
 	if (hdata->has_periodic)
-		period = watchdog_thresh * hdata->ticks_per_second;
+		period = watchdog_thresh * hdata->ticks_per_cpu;
 
 	count = hpet_readl(HPET_COUNTER);
-	new_compare = count + watchdog_thresh * hdata->ticks_per_second;
+	new_compare = count + watchdog_thresh * hdata->ticks_per_cpu;
 	hpet_set_comparator(hdata->num, (u32)new_compare, (u32)period);
 }
 
@@ -234,6 +241,27 @@ static int setup_hpet_irq(struct hpet_hld_data *hdata)
 	return ret;
 }
 
+/**
+ * update_ticks_per_cpu() - Update the number of HPET ticks per CPU
+ * @hdata:     struct with the timer's the ticks-per-second and CPU mask
+ *
+ * From the overall ticks-per-second of the timer, compute the number of ticks
+ * after which the timer should expire to monitor each CPU every watch_thresh
+ * seconds. The ticks-per-cpu quantity is computed using the number of CPUs that
+ * the watchdog currently monitors.
+ */
+static void update_ticks_per_cpu(struct hpet_hld_data *hdata)
+{
+	u64 temp = hdata->ticks_per_second;
+
+	/* Only update if there are monitored CPUs. */
+	if (!hdata->enabled_cpus)
+		return;
+
+	do_div(temp, hdata->enabled_cpus);
+	hdata->ticks_per_cpu = temp;
+}
+
 /**
  * hardlockup_detector_hpet_enable() - Enable the hardlockup detector
  * @cpu:	CPU Index in which the watchdog will be enabled.
@@ -246,13 +274,23 @@ void hardlockup_detector_hpet_enable(unsigned int cpu)
 {
 	cpumask_set_cpu(cpu, to_cpumask(hld_data->cpu_monitored_mask));
 
-	if (!hld_data->enabled_cpus++) {
+	hld_data->enabled_cpus++;
+	update_ticks_per_cpu(hld_data);
+
+	if (hld_data->enabled_cpus == 1) {
 		hld_data->handling_cpu = cpu;
 		update_msi_destid(hld_data);
 		/* Force timer kick when detector is just enabled */
 		kick_timer(hld_data, true);
 		enable_timer(hld_data);
 	}
+
+	/*
+	 * When in periodic mode, we only kick the timer here. Hence,
+	 * as there are now more CPUs to monitor, we need to adjust the
+	 * periodic expiration.
+	 */
+	kick_timer(hld_data, hld_data->has_periodic);
 }
 
 /**
@@ -268,6 +306,8 @@ void hardlockup_detector_hpet_disable(unsigned int cpu)
 	cpumask_clear_cpu(cpu, to_cpumask(hld_data->cpu_monitored_mask));
 	hld_data->enabled_cpus--;
 
+	update_ticks_per_cpu(hld_data);
+
 	if (hld_data->handling_cpu != cpu)
 		return;
 
-- 
2.17.1

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [RFC PATCH v4 13/21] x86/watchdog/hardlockup/hpet: Determine if HPET timer caused NMI
  2019-05-24  1:16 [RFC PATCH v4 00/21] Implement an HPET-based hardlockup detector Ricardo Neri
                   ` (11 preceding siblings ...)
  2019-05-24  1:16 ` [RFC PATCH v4 12/21] watchdog/hardlockup/hpet: Adjust timer expiration on the number of monitored CPUs Ricardo Neri
@ 2019-05-24  1:16 ` Ricardo Neri
  2019-05-24  1:16 ` [RFC PATCH v4 14/21] watchdog/hardlockup: Use parse_option_str() to handle "nmi_watchdog" Ricardo Neri
                   ` (7 subsequent siblings)
  20 siblings, 0 replies; 54+ messages in thread
From: Ricardo Neri @ 2019-05-24  1:16 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: Kate Stewart, Rafael J. Wysocki, Peter Zijlstra, Jan Kiszka,
	Clemens Ladisch, Ricardo Neri, Stephane Eranian, Masahiro Yamada,
	H. Peter Anvin, Ashok Raj, x86, Andi Kleen, Ravi V. Shankar,
	Arnd Bergmann, Ricardo Neri, Mimi Zohar, Tony Luck, Randy Dunlap,
	Nick Desaulniers, linux-kernel, iommu, Philippe Ombredanne,
	Nayna Jain

The only direct method to determine whether an HPET timer caused an
interrupt is to read the Interrupt Status register. Unfortunately,
reading HPET registers is slow and, therefore, it is not recommended to
read them while in NMI context. Furthermore, status is not available if
the interrupt is generated vi the Front Side Bus.

An indirect manner to infer if the non-maskable interrupt we see was
caused by the HPET timer is to use the time-stamp counter. Compute the
value that the time-stamp counter should have at the next interrupt of the
HPET timer. Since the hardlockup detector operates in seconds, high
precision is not needed. This implementation considers that the HPET
caused the HMI if the time-stamp counter reads the expected value -/+ 1.5%.
This value is selected as it is equivalent to 1/64 and the division can be
performed using a bit shift operation. Experimentally, the error in the
estimation is consistently less than 1%.

The computation of the expected value of the time-stamp counter must be
performed in relation to watchdog_thresh divided by the number of
monitored CPUs. This quantity is stored in tsc_ticks_per_cpu and must be
updated whenever the number of monitored CPUs changes.

Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Mimi Zohar <zohar@linux.ibm.com>
Cc: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Nayna Jain <nayna@linux.ibm.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: x86@kernel.org
Suggested-by: Andi Kleen <andi.kleen@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
 arch/x86/include/asm/hpet.h         |  2 ++
 arch/x86/kernel/watchdog_hld_hpet.c | 27 ++++++++++++++++++++++++++-
 2 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index 64acacce095d..fd99f2390714 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -115,6 +115,8 @@ struct hpet_hld_data {
 	u32		num;
 	u64		ticks_per_second;
 	u64		ticks_per_cpu;
+	u64		tsc_next;
+	u64		tsc_ticks_per_cpu;
 	u32		handling_cpu;
 	u32		enabled_cpus;
 	struct msi_msg	msi_msg;
diff --git a/arch/x86/kernel/watchdog_hld_hpet.c b/arch/x86/kernel/watchdog_hld_hpet.c
index 74aeb0535d08..dcc50cd29374 100644
--- a/arch/x86/kernel/watchdog_hld_hpet.c
+++ b/arch/x86/kernel/watchdog_hld_hpet.c
@@ -24,6 +24,7 @@
 
 static struct hpet_hld_data *hld_data;
 static bool hardlockup_use_hpet;
+static u64 tsc_next_error;
 
 /**
  * kick_timer() - Reprogram timer to expire in the future
@@ -33,11 +34,22 @@ static bool hardlockup_use_hpet;
  * Reprogram the timer to expire within watchdog_thresh seconds in the future.
  * If the timer supports periodic mode, it is not kicked unless @force is
  * true.
+ *
+ * Also, compute the expected value of the time-stamp counter at the time of
+ * expiration as well as a deviation from the expected value. The maximum
+ * deviation is of ~1.5%. This deviation can be easily computed by shifting
+ * by 6 positions the delta between the current and expected time-stamp values.
  */
 static void kick_timer(struct hpet_hld_data *hdata, bool force)
 {
+	u64 tsc_curr, tsc_delta, new_compare, count, period = 0;
 	bool kick_needed = force || !(hdata->has_periodic);
-	u64 new_compare, count, period = 0;
+
+	tsc_curr = rdtsc();
+
+	tsc_delta = (unsigned long)watchdog_thresh * hdata->tsc_ticks_per_cpu;
+	hdata->tsc_next = tsc_curr + tsc_delta;
+	tsc_next_error = tsc_delta >> 6;
 
 	/*
 	 * Update the comparator in increments of watch_thresh seconds relative
@@ -93,6 +105,15 @@ static void enable_timer(struct hpet_hld_data *hdata)
  */
 static bool is_hpet_wdt_interrupt(struct hpet_hld_data *hdata)
 {
+	if (smp_processor_id() == hdata->handling_cpu) {
+		u64 tsc_curr;
+
+		tsc_curr = rdtsc();
+
+		return (tsc_curr - hdata->tsc_next) + tsc_next_error <
+		       2 * tsc_next_error;
+	}
+
 	return false;
 }
 
@@ -260,6 +281,10 @@ static void update_ticks_per_cpu(struct hpet_hld_data *hdata)
 
 	do_div(temp, hdata->enabled_cpus);
 	hdata->ticks_per_cpu = temp;
+
+	temp = (unsigned long)tsc_khz * 1000L;
+	do_div(temp, hdata->enabled_cpus);
+	hdata->tsc_ticks_per_cpu = temp;
 }
 
 /**
-- 
2.17.1

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [RFC PATCH v4 14/21] watchdog/hardlockup: Use parse_option_str() to handle "nmi_watchdog"
  2019-05-24  1:16 [RFC PATCH v4 00/21] Implement an HPET-based hardlockup detector Ricardo Neri
                   ` (12 preceding siblings ...)
  2019-05-24  1:16 ` [RFC PATCH v4 13/21] x86/watchdog/hardlockup/hpet: Determine if HPET timer caused NMI Ricardo Neri
@ 2019-05-24  1:16 ` Ricardo Neri
  2019-05-24  1:16 ` [RFC PATCH v4 15/21] watchdog/hardlockup/hpet: Only enable the HPET watchdog via a boot parameter Ricardo Neri
                   ` (6 subsequent siblings)
  20 siblings, 0 replies; 54+ messages in thread
From: Ricardo Neri @ 2019-05-24  1:16 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: Kate Stewart, Rafael J. Wysocki, Peter Zijlstra, Jan Kiszka,
	Clemens Ladisch, Ricardo Neri, Stephane Eranian, Masahiro Yamada,
	H. Peter Anvin, Ashok Raj, x86, Andi Kleen, Ravi V. Shankar,
	Arnd Bergmann, Ricardo Neri, Mimi Zohar, Tony Luck, Randy Dunlap,
	Nick Desaulniers, linux-kernel, iommu, Philippe Ombredanne,
	Nayna Jain

Prepare hardlockup_panic_setup() to handle a comma-separated list of
options. This is needed to pass options to specific implementations of the
hardlockup detector.

Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Mimi Zohar <zohar@linux.ibm.com>
Cc: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Nayna Jain <nayna@linux.ibm.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: x86@kernel.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
 kernel/watchdog.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index be589001200a..fd50049449ec 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -70,13 +70,13 @@ void __init hardlockup_detector_disable(void)
 
 static int __init hardlockup_panic_setup(char *str)
 {
-	if (!strncmp(str, "panic", 5))
+	if (parse_option_str(str, "panic"))
 		hardlockup_panic = 1;
-	else if (!strncmp(str, "nopanic", 7))
+	else if (parse_option_str(str, "nopanic"))
 		hardlockup_panic = 0;
-	else if (!strncmp(str, "0", 1))
+	else if (parse_option_str(str, "0"))
 		nmi_watchdog_user_enabled = 0;
-	else if (!strncmp(str, "1", 1))
+	else if (parse_option_str(str, "1"))
 		nmi_watchdog_user_enabled = 1;
 	return 1;
 }
-- 
2.17.1

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [RFC PATCH v4 15/21] watchdog/hardlockup/hpet: Only enable the HPET watchdog via a boot parameter
  2019-05-24  1:16 [RFC PATCH v4 00/21] Implement an HPET-based hardlockup detector Ricardo Neri
                   ` (13 preceding siblings ...)
  2019-05-24  1:16 ` [RFC PATCH v4 14/21] watchdog/hardlockup: Use parse_option_str() to handle "nmi_watchdog" Ricardo Neri
@ 2019-05-24  1:16 ` Ricardo Neri
  2019-05-24  1:16 ` [RFC PATCH v4 16/21] x86/watchdog: Add a shim hardlockup detector Ricardo Neri
                   ` (5 subsequent siblings)
  20 siblings, 0 replies; 54+ messages in thread
From: Ricardo Neri @ 2019-05-24  1:16 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: Kate Stewart, Rafael J. Wysocki, Peter Zijlstra, Jan Kiszka,
	Clemens Ladisch, Ricardo Neri, Stephane Eranian, Masahiro Yamada,
	H. Peter Anvin, Ashok Raj, x86, Andi Kleen, Ravi V. Shankar,
	Arnd Bergmann, Ricardo Neri, Mimi Zohar, Tony Luck, Randy Dunlap,
	Nick Desaulniers, linux-kernel, iommu, Philippe Ombredanne,
	Nayna Jain

Keep the HPET-based hardlockup detector disabled unless explicitly enabled
via a command-line argument. If such parameter is not given, the
initialization of the hpet-based hardlockup detector fails and the NMI
watchdog will fallback to use the perf-based implementation.

Given that __setup("nmi_watchdog=") is already used to control the behavior
of the NMI watchdog (via hardlockup_panic_setup()), it cannot be used to
control of the hpet-based implementation. Instead, use a new
early_param("nmi_watchdog").

Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Mimi Zohar <zohar@linux.ibm.com>
Cc: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Nayna Jain <nayna@linux.ibm.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: x86@kernel.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>

--
checkpatch gives the following warning:

CHECK: __setup appears un-documented -- check Documentation/admin-guide/kernel-parameters.rst
+__setup("nmi_watchdog=", hardlockup_detector_hpet_setup);

This is a false-positive as the option nmi_watchdog is already
documented. The option is re-evaluated in this file as well.
---
 .../admin-guide/kernel-parameters.txt         |  8 ++++++-
 arch/x86/kernel/watchdog_hld_hpet.c           | 22 +++++++++++++++++++
 2 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 138f6664b2e2..17ed3dcda13e 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2831,7 +2831,7 @@
 			Format: [state][,regs][,debounce][,die]
 
 	nmi_watchdog=	[KNL,BUGS=X86] Debugging features for SMP kernels
-			Format: [panic,][nopanic,][num]
+			Format: [panic,][nopanic,][num,][hpet]
 			Valid num: 0 or 1
 			0 - turn hardlockup detector in nmi_watchdog off
 			1 - turn hardlockup detector in nmi_watchdog on
@@ -2841,6 +2841,12 @@
 			please see 'nowatchdog'.
 			This is useful when you use a panic=... timeout and
 			need the box quickly up again.
+			When hpet is specified, the NMI watchdog will be driven
+			by an HPET timer, if available in the system. Otherwise,
+			it falls back to the default implementation (perf or
+			architecture-specific). Specifying hpet has no effect
+			if the NMI watchdog is not enabled (either at build time
+			or via the command line).
 
 			These settings can be accessed at runtime via
 			the nmi_watchdog and hardlockup_panic sysctls.
diff --git a/arch/x86/kernel/watchdog_hld_hpet.c b/arch/x86/kernel/watchdog_hld_hpet.c
index dcc50cd29374..76eed714a1cb 100644
--- a/arch/x86/kernel/watchdog_hld_hpet.c
+++ b/arch/x86/kernel/watchdog_hld_hpet.c
@@ -351,6 +351,28 @@ void hardlockup_detector_hpet_stop(void)
 	disable_timer(hld_data);
 }
 
+/**
+ * hardlockup_detector_hpet_setup() - Parse command-line parameters
+ * @str:	A string containing the kernel command line
+ *
+ * Parse the nmi_watchdog parameter from the kernel command line. If
+ * selected by the user, use this implementation to detect hardlockups.
+ */
+static int __init hardlockup_detector_hpet_setup(char *str)
+{
+	if (!str)
+		return -EINVAL;
+
+	if (parse_option_str(str, "hpet"))
+		hardlockup_use_hpet = true;
+
+	if (!nmi_watchdog_user_enabled && hardlockup_use_hpet)
+		pr_warn("Selecting HPET NMI watchdog has no effect with NMI watchdog disabled\n");
+
+	return 0;
+}
+early_param("nmi_watchdog", hardlockup_detector_hpet_setup);
+
 /**
  * hardlockup_detector_hpet_init() - Initialize the hardlockup detector
  *
-- 
2.17.1

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [RFC PATCH v4 16/21] x86/watchdog: Add a shim hardlockup detector
  2019-05-24  1:16 [RFC PATCH v4 00/21] Implement an HPET-based hardlockup detector Ricardo Neri
                   ` (14 preceding siblings ...)
  2019-05-24  1:16 ` [RFC PATCH v4 15/21] watchdog/hardlockup/hpet: Only enable the HPET watchdog via a boot parameter Ricardo Neri
@ 2019-05-24  1:16 ` Ricardo Neri
  2019-05-24  1:16 ` [RFC PATCH v4 17/21] x86/tsc: Switch to perf-based hardlockup detector if TSC become unstable Ricardo Neri
                   ` (4 subsequent siblings)
  20 siblings, 0 replies; 54+ messages in thread
From: Ricardo Neri @ 2019-05-24  1:16 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: Kate Stewart, Rafael J. Wysocki, Peter Zijlstra, Jan Kiszka,
	Clemens Ladisch, Ricardo Neri, Stephane Eranian, Masahiro Yamada,
	H. Peter Anvin, Ashok Raj, x86, Andi Kleen, Ravi V. Shankar,
	Arnd Bergmann, Ricardo Neri, Mimi Zohar, Tony Luck, Randy Dunlap,
	Nick Desaulniers, linux-kernel, iommu, Philippe Ombredanne,
	Nayna Jain

The generic hardlockup detector is based on perf. It also provides a set
of weak stubs that CPU architectures can override. Add a shim hardlockup
detector for x86 that selects between perf and hpet implementations.

Specifically, this shim implementation is needed for the HPET-based
hardlockup detector; it can also be used for future implementations.

Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Clemens Ladisch <clemens@ladisch.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Mimi Zohar <zohar@linux.ibm.com>
Cc: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Nayna Jain <nayna@linux.ibm.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: x86@kernel.org
Suggested-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
 arch/x86/Kconfig.debug         |  4 ++
 arch/x86/kernel/Makefile       |  1 +
 arch/x86/kernel/watchdog_hld.c | 78 ++++++++++++++++++++++++++++++++++
 3 files changed, 83 insertions(+)
 create mode 100644 arch/x86/kernel/watchdog_hld.c

diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index 445bbb188f10..52c77e2145c9 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -169,11 +169,15 @@ config IOMMU_LEAK
 config HAVE_MMIOTRACE_SUPPORT
 	def_bool y
 
+config X86_HARDLOCKUP_DETECTOR
+	bool
+
 config X86_HARDLOCKUP_DETECTOR_HPET
 	bool "Use HPET Timer for Hard Lockup Detection"
 	select SOFTLOCKUP_DETECTOR
 	select HARDLOCKUP_DETECTOR
 	select HARDLOCKUP_DETECTOR_CORE
+	select X86_HARDLOCKUP_DETECTOR
 	depends on HPET_TIMER && HPET && X86_64
 	help
 	  Say y to enable a hardlockup detector that is driven by a High-
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 3ad55de67e8b..e60244b8a8ec 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -106,6 +106,7 @@ obj-$(CONFIG_VM86)		+= vm86_32.o
 obj-$(CONFIG_EARLY_PRINTK)	+= early_printk.o
 
 obj-$(CONFIG_HPET_TIMER) 	+= hpet.o
+obj-$(CONFIG_X86_HARDLOCKUP_DETECTOR) += watchdog_hld.o
 obj-$(CONFIG_X86_HARDLOCKUP_DETECTOR_HPET) += watchdog_hld_hpet.o
 obj-$(CONFIG_APB_TIMER)		+= apb_timer.o
 
diff --git a/arch/x86/kernel/watchdog_hld.c b/arch/x86/kernel/watchdog_hld.c
new file mode 100644
index 000000000000..c2512d4c79c5
--- /dev/null
+++ b/arch/x86/kernel/watchdog_hld.c
@@ -0,0 +1,78 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * A shim hardlockup detector. It overrides the weak stubs of the generic
+ * implementation to select between the perf- or the hpet-based implementation.
+ *
+ * Copyright (C) Intel Corporation 2019
+ */
+
+#include <linux/nmi.h>
+#include <asm/hpet.h>
+
+enum x86_hardlockup_detector {
+	X86_HARDLOCKUP_DETECTOR_PERF,
+	X86_HARDLOCKUP_DETECTOR_HPET,
+};
+
+static enum __read_mostly x86_hardlockup_detector detector_type;
+
+int watchdog_nmi_enable(unsigned int cpu)
+{
+	if (detector_type == X86_HARDLOCKUP_DETECTOR_PERF) {
+		hardlockup_detector_perf_enable();
+		return 0;
+	}
+
+	if (detector_type == X86_HARDLOCKUP_DETECTOR_HPET) {
+		hardlockup_detector_hpet_enable(cpu);
+		return 0;
+	}
+
+	return -ENODEV;
+}
+
+void watchdog_nmi_disable(unsigned int cpu)
+{
+	if (detector_type == X86_HARDLOCKUP_DETECTOR_PERF) {
+		hardlockup_detector_perf_disable();
+		return;
+	}
+
+	if (detector_type == X86_HARDLOCKUP_DETECTOR_HPET) {
+		hardlockup_detector_hpet_disable(cpu);
+		return;
+	}
+}
+
+int __init watchdog_nmi_probe(void)
+{
+	int ret;
+
+	/*
+	 * Try first with the HPET hardlockup detector. It will only
+	 * succeed if selected at build time and the nmi_watchdog
+	 * command-line parameter is configured. This ensure that the
+	 * perf-based detector is used by default, if selected at
+	 * build time.
+	 */
+	ret = hardlockup_detector_hpet_init();
+	if (!ret) {
+		detector_type = X86_HARDLOCKUP_DETECTOR_HPET;
+		return ret;
+	}
+
+	ret = hardlockup_detector_perf_init();
+	if (!ret) {
+		detector_type = X86_HARDLOCKUP_DETECTOR_PERF;
+		return ret;
+	}
+
+	return ret;
+}
+
+void watchdog_nmi_stop(void)
+{
+	/* Only the HPET lockup detector defines a stop function. */
+	if (detector_type == X86_HARDLOCKUP_DETECTOR_HPET)
+		hardlockup_detector_hpet_stop();
+}
-- 
2.17.1

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [RFC PATCH v4 17/21] x86/tsc: Switch to perf-based hardlockup detector if TSC become unstable
  2019-05-24  1:16 [RFC PATCH v4 00/21] Implement an HPET-based hardlockup detector Ricardo Neri
                   ` (15 preceding siblings ...)
  2019-05-24  1:16 ` [RFC PATCH v4 16/21] x86/watchdog: Add a shim hardlockup detector Ricardo Neri
@ 2019-05-24  1:16 ` Ricardo Neri
  2019-06-07  0:35   ` Stephane Eranian via iommu
  2019-05-24  1:16 ` [RFC PATCH v4 18/21] x86/apic: Add a parameter for the APIC delivery mode Ricardo Neri
                   ` (3 subsequent siblings)
  20 siblings, 1 reply; 54+ messages in thread
From: Ricardo Neri @ 2019-05-24  1:16 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: Ravi V. Shankar, x86, Ashok Raj, Peter Zijlstra, Ricardo Neri,
	Randy Dunlap, linux-kernel, Stephane Eranian, Ricardo Neri,
	iommu, Andi Kleen

The HPET-based hardlockup detector relies on the TSC to determine if an
observed NMI interrupt was originated by HPET timer. Hence, this detector
can no longer be used with an unstable TSC.

In such case, permanently stop the HPET-based hardlockup detector and
start the perf-based detector.

Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
 arch/x86/include/asm/hpet.h    | 2 ++
 arch/x86/kernel/tsc.c          | 2 ++
 arch/x86/kernel/watchdog_hld.c | 7 +++++++
 3 files changed, 11 insertions(+)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index fd99f2390714..a82cbe17479d 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -128,6 +128,7 @@ extern int hardlockup_detector_hpet_init(void);
 extern void hardlockup_detector_hpet_stop(void);
 extern void hardlockup_detector_hpet_enable(unsigned int cpu);
 extern void hardlockup_detector_hpet_disable(unsigned int cpu);
+extern void hardlockup_detector_switch_to_perf(void);
 #else
 static inline struct hpet_hld_data *hpet_hardlockup_detector_assign_timer(void)
 { return NULL; }
@@ -136,6 +137,7 @@ static inline int hardlockup_detector_hpet_init(void)
 static inline void hardlockup_detector_hpet_stop(void) {}
 static inline void hardlockup_detector_hpet_enable(unsigned int cpu) {}
 static inline void hardlockup_detector_hpet_disable(unsigned int cpu) {}
+static void harrdlockup_detector_switch_to_perf(void) {}
 #endif /* CONFIG_X86_HARDLOCKUP_DETECTOR_HPET */
 
 #else /* CONFIG_HPET_TIMER */
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 59b57605e66c..b2210728ce3d 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1158,6 +1158,8 @@ void mark_tsc_unstable(char *reason)
 
 	clocksource_mark_unstable(&clocksource_tsc_early);
 	clocksource_mark_unstable(&clocksource_tsc);
+
+	hardlockup_detector_switch_to_perf();
 }
 
 EXPORT_SYMBOL_GPL(mark_tsc_unstable);
diff --git a/arch/x86/kernel/watchdog_hld.c b/arch/x86/kernel/watchdog_hld.c
index c2512d4c79c5..c8547c227a41 100644
--- a/arch/x86/kernel/watchdog_hld.c
+++ b/arch/x86/kernel/watchdog_hld.c
@@ -76,3 +76,10 @@ void watchdog_nmi_stop(void)
 	if (detector_type == X86_HARDLOCKUP_DETECTOR_HPET)
 		hardlockup_detector_hpet_stop();
 }
+
+void hardlockup_detector_switch_to_perf(void)
+{
+	detector_type = X86_HARDLOCKUP_DETECTOR_PERF;
+	hardlockup_detector_hpet_stop();
+	hardlockup_start_all();
+}
-- 
2.17.1

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [RFC PATCH v4 18/21] x86/apic: Add a parameter for the APIC delivery mode
  2019-05-24  1:16 [RFC PATCH v4 00/21] Implement an HPET-based hardlockup detector Ricardo Neri
                   ` (16 preceding siblings ...)
  2019-05-24  1:16 ` [RFC PATCH v4 17/21] x86/tsc: Switch to perf-based hardlockup detector if TSC become unstable Ricardo Neri
@ 2019-05-24  1:16 ` Ricardo Neri
  2019-06-16  9:55   ` Thomas Gleixner
  2019-05-24  1:16 ` [RFC PATCH v4 19/21] iommu/vt-d: Rework prepare_irte() to support per-irq " Ricardo Neri
                   ` (2 subsequent siblings)
  20 siblings, 1 reply; 54+ messages in thread
From: Ricardo Neri @ 2019-05-24  1:16 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: Kate Stewart, Peter Zijlstra, Jan Kiszka, Ricardo Neri,
	Stephane Eranian, Wincy Van, Ashok Raj, x86, Andi Kleen,
	Eric W. Biederman, Ravi V. Shankar, Ricardo Neri, Bjorn Helgaas,
	Juergen Gross, Tony Luck, Randy Dunlap, linux-kernel, iommu,
	Jacob Pan, Philippe Ombredanne

Until now, the delivery mode of APIC interrupts is set to the default
mode set in the APIC driver. However, there are no restrictions in hardware
to configure each interrupt with a different delivery mode. Specifying the
delivery mode per interrupt is useful when one is interested in changing
the delivery mode of a particular interrupt. For instance, this can be used
to deliver an interrupt as non-maskable.

Add a new member, delivery_mode, to struct irq_cfg. This new member, can
be used to update the configuration of the delivery mode in each interrupt
domain. Likewise, add equivalent macros to populate MSI messages.

Currently, all interrupt domains set the delivery mode of interrupts using
the APIC setting. Interrupt domains use an irq_cfg data structure to
configure their own data structures and hardware resources. Thus, in order
to keep the current behavior, set the delivery mode of the irq
configuration that as the APIC setting. In this manner, irq domains can
obtain the delivery mode from the irq configuration data instead of the
APIC setting, if needed.

Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Jacob Pan <jacob.jun.pan@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Wincy Van <fanwenyi0529@gmail.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: x86@kernel.org
Cc: iommu@lists.linux-foundation.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
 arch/x86/include/asm/hw_irq.h |  5 +++--
 arch/x86/include/asm/msidef.h |  3 +++
 arch/x86/kernel/apic/vector.c | 10 ++++++++++
 3 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
index 32e666e1231e..c024e5976b78 100644
--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -117,8 +117,9 @@ struct irq_alloc_info {
 };
 
 struct irq_cfg {
-	unsigned int		dest_apicid;
-	unsigned int		vector;
+	unsigned int				dest_apicid;
+	unsigned int				vector;
+	enum ioapic_irq_destination_types	delivery_mode;
 };
 
 extern struct irq_cfg *irq_cfg(unsigned int irq);
diff --git a/arch/x86/include/asm/msidef.h b/arch/x86/include/asm/msidef.h
index 38ccfdc2d96e..6d666c90f057 100644
--- a/arch/x86/include/asm/msidef.h
+++ b/arch/x86/include/asm/msidef.h
@@ -16,6 +16,9 @@
 					 MSI_DATA_VECTOR_MASK)
 
 #define MSI_DATA_DELIVERY_MODE_SHIFT	8
+#define MSI_DATA_DELIVERY_MODE_MASK	0x00000700
+#define MSI_DATA_DELIVERY_MODE(dm)	(((dm) << MSI_DATA_DELIVERY_MODE_SHIFT) & \
+					 MSI_DATA_DELIVERY_MODE_MASK)
 #define  MSI_DATA_DELIVERY_FIXED	(0 << MSI_DATA_DELIVERY_MODE_SHIFT)
 #define  MSI_DATA_DELIVERY_LOWPRI	(1 << MSI_DATA_DELIVERY_MODE_SHIFT)
 #define  MSI_DATA_DELIVERY_NMI		(4 << MSI_DATA_DELIVERY_MODE_SHIFT)
diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 3173e07d3791..99436fe7e932 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -548,6 +548,16 @@ static int x86_vector_alloc_irqs(struct irq_domain *domain, unsigned int virq,
 		irqd->chip_data = apicd;
 		irqd->hwirq = virq + i;
 		irqd_set_single_target(irqd);
+
+		/*
+		 * Initialize the delivery mode of this irq to match the
+		 * default delivery mode of the APIC. This is useful for
+		 * children irq domains which want to take the delivery
+		 * mode from the individual irq configuration rather
+		 * than from the APIC.
+		 */
+		 apicd->hw_irq_cfg.delivery_mode = apic->irq_delivery_mode;
+
 		/*
 		 * Legacy vectors are already assigned when the IOAPIC
 		 * takes them over. They stay on the same vector. This is
-- 
2.17.1

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [RFC PATCH v4 19/21] iommu/vt-d: Rework prepare_irte() to support per-irq delivery mode
  2019-05-24  1:16 [RFC PATCH v4 00/21] Implement an HPET-based hardlockup detector Ricardo Neri
                   ` (17 preceding siblings ...)
  2019-05-24  1:16 ` [RFC PATCH v4 18/21] x86/apic: Add a parameter for the APIC delivery mode Ricardo Neri
@ 2019-05-24  1:16 ` " Ricardo Neri
  2019-05-24  1:16 ` [RFC PATCH v4 20/21] iommu/vt-d: hpet: Reserve an interrupt remampping table entry for watchdog Ricardo Neri
  2019-05-24  1:16 ` [RFC PATCH v4 21/21] x86/watchdog/hardlockup/hpet: Support interrupt remapping Ricardo Neri
  20 siblings, 0 replies; 54+ messages in thread
From: Ricardo Neri @ 2019-05-24  1:16 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: Kate Stewart, Peter Zijlstra, Jan Kiszka, Ricardo Neri,
	Stephane Eranian, Wincy Van, Ashok Raj, x86, Andi Kleen,
	Eric W. Biederman, Ravi V. Shankar, Ricardo Neri, Bjorn Helgaas,
	Juergen Gross, Tony Luck, Randy Dunlap, linux-kernel, iommu,
	Jacob Pan, Philippe Ombredanne

A recent change introduced a new member to struct irq_cfg to specify the
delivery mode of an interrupt. Supporting the configuration of the
delivery mode would require adding a third argument to prepare_irte().
Instead, simply take a pointer to a irq_cfg data structure as a the only
argument.

Internally, configure the delivery mode of the Interrupt Remapping Table
Entry as specified in the irq_cfg data structure and not as the APIC
setting.

This change does not change the existing behavior, as the delivery mode
of the APIC is used to configure irq_cfg data structure.

Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Jacob Pan <jacob.jun.pan@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Wincy Van <fanwenyi0529@gmail.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: x86@kernel.org
Cc: iommu@lists.linux-foundation.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
 drivers/iommu/intel_irq_remapping.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
index 4160aa9f3f80..2e61eaca7d7e 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -1072,7 +1072,7 @@ static int reenable_irq_remapping(int eim)
 	return -1;
 }
 
-static void prepare_irte(struct irte *irte, int vector, unsigned int dest)
+static void prepare_irte(struct irte *irte, struct irq_cfg *irq_cfg)
 {
 	memset(irte, 0, sizeof(*irte));
 
@@ -1086,9 +1086,9 @@ static void prepare_irte(struct irte *irte, int vector, unsigned int dest)
 	 * irq migration in the presence of interrupt-remapping.
 	*/
 	irte->trigger_mode = 0;
-	irte->dlvry_mode = apic->irq_delivery_mode;
-	irte->vector = vector;
-	irte->dest_id = IRTE_DEST(dest);
+	irte->dlvry_mode = irq_cfg->delivery_mode;
+	irte->vector = irq_cfg->vector;
+	irte->dest_id = IRTE_DEST(irq_cfg->dest_apicid);
 	irte->redir_hint = 1;
 }
 
@@ -1265,7 +1265,7 @@ static void intel_irq_remapping_prepare_irte(struct intel_ir_data *data,
 	struct irte *irte = &data->irte_entry;
 	struct msi_msg *msg = &data->msi_entry;
 
-	prepare_irte(irte, irq_cfg->vector, irq_cfg->dest_apicid);
+	prepare_irte(irte, irq_cfg);
 	switch (info->type) {
 	case X86_IRQ_ALLOC_TYPE_IOAPIC:
 		/* Set source-id of interrupt request */
-- 
2.17.1

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [RFC PATCH v4 20/21] iommu/vt-d: hpet: Reserve an interrupt remampping table entry for watchdog
  2019-05-24  1:16 [RFC PATCH v4 00/21] Implement an HPET-based hardlockup detector Ricardo Neri
                   ` (18 preceding siblings ...)
  2019-05-24  1:16 ` [RFC PATCH v4 19/21] iommu/vt-d: Rework prepare_irte() to support per-irq " Ricardo Neri
@ 2019-05-24  1:16 ` Ricardo Neri
  2019-06-16 18:42   ` Thomas Gleixner
  2019-06-16 19:21   ` Thomas Gleixner
  2019-05-24  1:16 ` [RFC PATCH v4 21/21] x86/watchdog/hardlockup/hpet: Support interrupt remapping Ricardo Neri
  20 siblings, 2 replies; 54+ messages in thread
From: Ricardo Neri @ 2019-05-24  1:16 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: Kate Stewart, Peter Zijlstra, Jan Kiszka, Ricardo Neri,
	Stephane Eranian, Wincy Van, Ashok Raj, x86, Andi Kleen,
	Eric W. Biederman, Ravi V. Shankar, Ricardo Neri, Bjorn Helgaas,
	Juergen Gross, Tony Luck, Randy Dunlap, linux-kernel, iommu,
	Jacob Pan, Philippe Ombredanne

When interrupt remapping is enabled, MSI interrupt messages must follow a
special format that the IOMMU can understand. Hence, when the HPET hard
lockup detector is used with interrupt remapping, it must also follow this
special format.

The IOMMU, given the information about a particular interrupt, already
knows how to populate the MSI message with this special format and the
corresponding entry in the interrupt remapping table. Given that this is a
special interrupt case, we want to avoid the interrupt subsystem. Add two
functions to create an entry for the HPET hard lockup detector. Perform
this process in two steps as described below.

When initializing the lockup detector, the function
hld_hpet_intremap_alloc_irq() permanently allocates a new entry in the
interrupt remapping table and populates it with the information the
IOMMU driver needs. In order to populate the table, the IOMMU needs to
know the HPET block ID as described in the ACPI table. Hence, add such
ID to the data of the hardlockup detector.

When the hardlockup detector is enabled, the function
hld_hpet_intremapactivate_irq() activates the recently created entry
in the interrupt remapping table via the modify_irte() functions. While
doing this, it specifies which CPU the interrupt must target via its APIC
ID. This function can be called every time the destination iD of the
interrupt needs to be updated; there is no need to allocate or remove
entries in the interrupt remapping table.

Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Jacob Pan <jacob.jun.pan@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Wincy Van <fanwenyi0529@gmail.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: x86@kernel.org
Cc: iommu@lists.linux-foundation.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
 arch/x86/include/asm/hpet.h         | 11 +++++++
 arch/x86/kernel/hpet.c              |  1 +
 drivers/iommu/intel_irq_remapping.c | 49 +++++++++++++++++++++++++++++
 3 files changed, 61 insertions(+)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index a82cbe17479d..811051fa7ade 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -119,6 +119,8 @@ struct hpet_hld_data {
 	u64		tsc_ticks_per_cpu;
 	u32		handling_cpu;
 	u32		enabled_cpus;
+	u8		blockid;
+	void		*intremap_data;
 	struct msi_msg	msi_msg;
 	unsigned long	cpu_monitored_mask[0];
 };
@@ -129,6 +131,15 @@ extern void hardlockup_detector_hpet_stop(void);
 extern void hardlockup_detector_hpet_enable(unsigned int cpu);
 extern void hardlockup_detector_hpet_disable(unsigned int cpu);
 extern void hardlockup_detector_switch_to_perf(void);
+#ifdef CONFIG_IRQ_REMAP
+extern int hld_hpet_intremap_activate_irq(struct hpet_hld_data *hdata);
+extern int hld_hpet_intremap_alloc_irq(struct hpet_hld_data *hdata);
+#else
+static inline int hld_hpet_intremap_activate_irq(struct hpet_hld_data *hdata)
+{ return -ENODEV; }
+static inline int hld_hpet_intremap_alloc_irq(struct hpet_hld_data *hdata)
+{ return -ENODEV; }
+#endif /* CONFIG_IRQ_REMAP */
 #else
 static inline struct hpet_hld_data *hpet_hardlockup_detector_assign_timer(void)
 { return NULL; }
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index dd3bb664a188..ddc9be81a075 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -202,6 +202,7 @@ struct hpet_hld_data *hpet_hardlockup_detector_assign_timer(void)
 	 */
 	temp = (u64)cfg << HPET_COUNTER_CLK_PERIOD_SHIFT;
 	hdata->ticks_per_second = hpet_get_ticks_per_sec(temp);
+	hdata->blockid = hpet_blockid;
 
 	return hdata;
 }
diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
index 2e61eaca7d7e..256466dd30cb 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -20,6 +20,7 @@
 #include <asm/irq_remapping.h>
 #include <asm/pci-direct.h>
 #include <asm/msidef.h>
+#include <asm/hpet.h>
 
 #include "irq_remapping.h"
 
@@ -1516,3 +1517,51 @@ int dmar_ir_hotplug(struct dmar_drhd_unit *dmaru, bool insert)
 
 	return ret;
 }
+
+#ifdef CONFIG_X86_HARDLOCKUP_DETECTOR_HPET
+int hld_hpet_intremap_activate_irq(struct hpet_hld_data *hdata)
+{
+	u32 destid = apic->calc_dest_apicid(hdata->handling_cpu);
+	struct intel_ir_data *data;
+
+	data = (struct intel_ir_data *)hdata->intremap_data;
+	data->irte_entry.dest_id = IRTE_DEST(destid);
+	return modify_irte(&data->irq_2_iommu, &data->irte_entry);
+}
+
+int hld_hpet_intremap_alloc_irq(struct hpet_hld_data *hdata)
+{
+	struct intel_ir_data *data;
+	struct irq_alloc_info info;
+	struct intel_iommu *iommu;
+	struct irq_cfg irq_cfg;
+	int index;
+
+	iommu = map_hpet_to_ir(hdata->blockid);
+	if (!iommu)
+		return -ENODEV;
+
+	data = kzalloc(sizeof(*data), GFP_KERNEL);
+	if (!data)
+		return -ENOMEM;
+
+	down_read(&dmar_global_lock);
+	index =  alloc_irte(iommu, 0, &data->irq_2_iommu, 1);
+	up_read(&dmar_global_lock);
+	if (index < 0)
+		return index;
+
+	info.type = X86_IRQ_ALLOC_TYPE_HPET;
+	info.hpet_id = hdata->blockid;
+
+	/* Vector is not relevant if NMI is the delivery mode */
+	irq_cfg.vector = 0;
+	irq_cfg.delivery_mode = dest_NMI;
+	intel_irq_remapping_prepare_irte(data, &irq_cfg, &info, index, 0);
+
+	hdata->intremap_data = data;
+	memcpy(&hdata->msi_msg, &data->msi_entry, sizeof(hdata->msi_msg));
+
+	return 0;
+}
+#endif /* CONFIG_HARDLOCKUP_DETECTOR_HPET */
-- 
2.17.1

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [RFC PATCH v4 21/21] x86/watchdog/hardlockup/hpet: Support interrupt remapping
  2019-05-24  1:16 [RFC PATCH v4 00/21] Implement an HPET-based hardlockup detector Ricardo Neri
                   ` (19 preceding siblings ...)
  2019-05-24  1:16 ` [RFC PATCH v4 20/21] iommu/vt-d: hpet: Reserve an interrupt remampping table entry for watchdog Ricardo Neri
@ 2019-05-24  1:16 ` Ricardo Neri
  2019-06-16  8:44   ` Thomas Gleixner
  20 siblings, 1 reply; 54+ messages in thread
From: Ricardo Neri @ 2019-05-24  1:16 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Borislav Petkov
  Cc: Kate Stewart, Peter Zijlstra, Jan Kiszka, Ricardo Neri,
	Stephane Eranian, Wincy Van, Ashok Raj, x86, Andi Kleen,
	Eric W. Biederman, Ravi V. Shankar, Ricardo Neri, Bjorn Helgaas,
	Juergen Gross, Tony Luck, Randy Dunlap, linux-kernel, iommu,
	Jacob Pan, Philippe Ombredanne

When interrupt remapping is enabled in the system, the MSI interrupt
message must follow a special format the IOMMU can understand. Hence,
utilize the functionality provided by the IOMMU driver for such purpose.

The first step is to determine whether interrupt remapping is enabled
by looking for the existence of an interrupt remapping domain. If it
exists, let the IOMMU driver compose the MSI message for us. The hard-
lockup detector is still responsible of writing the message in the
HPET FSB route register.

Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Jacob Pan <jacob.jun.pan@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Wincy Van <fanwenyi0529@gmail.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: x86@kernel.org
Cc: iommu@lists.linux-foundation.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
 arch/x86/kernel/watchdog_hld_hpet.c | 33 ++++++++++++++++++++++++++++-
 1 file changed, 32 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/watchdog_hld_hpet.c b/arch/x86/kernel/watchdog_hld_hpet.c
index 76eed714a1cb..a266439fdb9e 100644
--- a/arch/x86/kernel/watchdog_hld_hpet.c
+++ b/arch/x86/kernel/watchdog_hld_hpet.c
@@ -20,6 +20,7 @@
 #include <linux/hpet.h>
 #include <linux/slab.h>
 #include <asm/msidef.h>
+#include <asm/irq_remapping.h>
 #include <asm/hpet.h>
 
 static struct hpet_hld_data *hld_data;
@@ -117,6 +118,25 @@ static bool is_hpet_wdt_interrupt(struct hpet_hld_data *hdata)
 	return false;
 }
 
+/** irq_remapping_enabled() - Detect if interrupt remapping is enabled
+ * @hdata:	A data structure with the HPET block id
+ *
+ * Determine if the HPET block that the hardlockup detector is under
+ * the remapped interrupt domain.
+ *
+ * Returns: True interrupt remapping is enabled. False otherwise.
+ */
+static bool irq_remapping_enabled(struct hpet_hld_data *hdata)
+{
+	struct irq_alloc_info info;
+
+	init_irq_alloc_info(&info, NULL);
+	info.type = X86_IRQ_ALLOC_TYPE_HPET;
+	info.hpet_id = hdata->blockid;
+
+	return !!irq_remapping_get_ir_irq_domain(&info);
+}
+
 /**
  * compose_msi_msg() - Populate address and data fields of an MSI message
  * @hdata:	A data strucure with the message to populate
@@ -161,6 +181,9 @@ static int update_msi_destid(struct hpet_hld_data *hdata)
 {
 	u32 destid;
 
+	if (irq_remapping_enabled(hdata))
+		return hld_hpet_intremap_activate_irq(hdata);
+
 	hdata->msi_msg.address_lo &= ~MSI_ADDR_DEST_ID_MASK;
 	destid = apic->calc_dest_apicid(hdata->handling_cpu);
 	hdata->msi_msg.address_lo |= MSI_ADDR_DEST_ID(destid);
@@ -217,9 +240,17 @@ static int hardlockup_detector_nmi_handler(unsigned int type,
  */
 static int setup_irq_msi_mode(struct hpet_hld_data *hdata)
 {
+	s32 ret;
 	u32 v;
 
-	compose_msi_msg(hdata);
+	if (irq_remapping_enabled(hdata)) {
+		ret = hld_hpet_intremap_alloc_irq(hdata);
+		if (ret)
+			return ret;
+	} else {
+		compose_msi_msg(hdata);
+	}
+
 	hpet_writel(hdata->msi_msg.data, HPET_Tn_ROUTE(hdata->num));
 	hpet_writel(hdata->msi_msg.address_lo, HPET_Tn_ROUTE(hdata->num) + 4);
 
-- 
2.17.1

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 17/21] x86/tsc: Switch to perf-based hardlockup detector if TSC become unstable
  2019-05-24  1:16 ` [RFC PATCH v4 17/21] x86/tsc: Switch to perf-based hardlockup detector if TSC become unstable Ricardo Neri
@ 2019-06-07  0:35   ` Stephane Eranian via iommu
  2019-06-07 14:14     ` Ricardo Neri
  0 siblings, 1 reply; 54+ messages in thread
From: Stephane Eranian via iommu @ 2019-06-07  0:35 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Ravi V. Shankar, x86, Ashok Raj, Peter Zijlstra, Randy Dunlap,
	LKML, Ricardo Neri, iommu, Andi Kleen, Thomas Gleixner,
	Borislav Petkov, Ingo Molnar

Hi Ricardo,
Thanks for your contribution here. It is very important to move the
watchdog out of the PMU wherever possible.

On Thu, May 23, 2019 at 6:17 PM Ricardo Neri
<ricardo.neri-calderon@linux.intel.com> wrote:
>
> The HPET-based hardlockup detector relies on the TSC to determine if an
> observed NMI interrupt was originated by HPET timer. Hence, this detector
> can no longer be used with an unstable TSC.
>
> In such case, permanently stop the HPET-based hardlockup detector and
> start the perf-based detector.
>
> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> ---
>  arch/x86/include/asm/hpet.h    | 2 ++
>  arch/x86/kernel/tsc.c          | 2 ++
>  arch/x86/kernel/watchdog_hld.c | 7 +++++++
>  3 files changed, 11 insertions(+)
>
> diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
> index fd99f2390714..a82cbe17479d 100644
> --- a/arch/x86/include/asm/hpet.h
> +++ b/arch/x86/include/asm/hpet.h
> @@ -128,6 +128,7 @@ extern int hardlockup_detector_hpet_init(void);
>  extern void hardlockup_detector_hpet_stop(void);
>  extern void hardlockup_detector_hpet_enable(unsigned int cpu);
>  extern void hardlockup_detector_hpet_disable(unsigned int cpu);
> +extern void hardlockup_detector_switch_to_perf(void);
>  #else
>  static inline struct hpet_hld_data *hpet_hardlockup_detector_assign_timer(void)
>  { return NULL; }
> @@ -136,6 +137,7 @@ static inline int hardlockup_detector_hpet_init(void)
>  static inline void hardlockup_detector_hpet_stop(void) {}
>  static inline void hardlockup_detector_hpet_enable(unsigned int cpu) {}
>  static inline void hardlockup_detector_hpet_disable(unsigned int cpu) {}
> +static void harrdlockup_detector_switch_to_perf(void) {}
>  #endif /* CONFIG_X86_HARDLOCKUP_DETECTOR_HPET */
>
This does not compile for me when CONFIG_X86_HARDLOCKUP_DETECTOR_HPET
is not enabled.
because:
   1- you have a typo on the function name
    2- you are missing the inline keyword


>  #else /* CONFIG_HPET_TIMER */
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index 59b57605e66c..b2210728ce3d 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -1158,6 +1158,8 @@ void mark_tsc_unstable(char *reason)
>
>         clocksource_mark_unstable(&clocksource_tsc_early);
>         clocksource_mark_unstable(&clocksource_tsc);
> +
> +       hardlockup_detector_switch_to_perf();
>  }
>
>  EXPORT_SYMBOL_GPL(mark_tsc_unstable);
> diff --git a/arch/x86/kernel/watchdog_hld.c b/arch/x86/kernel/watchdog_hld.c
> index c2512d4c79c5..c8547c227a41 100644
> --- a/arch/x86/kernel/watchdog_hld.c
> +++ b/arch/x86/kernel/watchdog_hld.c
> @@ -76,3 +76,10 @@ void watchdog_nmi_stop(void)
>         if (detector_type == X86_HARDLOCKUP_DETECTOR_HPET)
>                 hardlockup_detector_hpet_stop();
>  }
> +
> +void hardlockup_detector_switch_to_perf(void)
> +{
> +       detector_type = X86_HARDLOCKUP_DETECTOR_PERF;
> +       hardlockup_detector_hpet_stop();
> +       hardlockup_start_all();
> +}
> --
> 2.17.1
>
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 17/21] x86/tsc: Switch to perf-based hardlockup detector if TSC become unstable
  2019-06-07  0:35   ` Stephane Eranian via iommu
@ 2019-06-07 14:14     ` Ricardo Neri
  0 siblings, 0 replies; 54+ messages in thread
From: Ricardo Neri @ 2019-06-07 14:14 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Ravi V. Shankar, x86, Ashok Raj, Peter Zijlstra, Randy Dunlap,
	LKML, Ricardo Neri, iommu, Andi Kleen, Thomas Gleixner,
	Borislav Petkov, Ingo Molnar

On Thu, Jun 06, 2019 at 05:35:51PM -0700, Stephane Eranian wrote:
> Hi Ricardo,

Hi Stephane,

> Thanks for your contribution here. It is very important to move the
> watchdog out of the PMU wherever possible.

Indeed, using the PMU for the hardlockup detector is still the default
option. This patch series proposes a new kernel command line to switch
to use the HPET.

> 
> On Thu, May 23, 2019 at 6:17 PM Ricardo Neri
> <ricardo.neri-calderon@linux.intel.com> wrote:
> >
> > The HPET-based hardlockup detector relies on the TSC to determine if an
> > observed NMI interrupt was originated by HPET timer. Hence, this detector
> > can no longer be used with an unstable TSC.
> >
> > In such case, permanently stop the HPET-based hardlockup detector and
> > start the perf-based detector.
> >
> > Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> > ---
> >  arch/x86/include/asm/hpet.h    | 2 ++
> >  arch/x86/kernel/tsc.c          | 2 ++
> >  arch/x86/kernel/watchdog_hld.c | 7 +++++++
> >  3 files changed, 11 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
> > index fd99f2390714..a82cbe17479d 100644
> > --- a/arch/x86/include/asm/hpet.h
> > +++ b/arch/x86/include/asm/hpet.h
> > @@ -128,6 +128,7 @@ extern int hardlockup_detector_hpet_init(void);
> >  extern void hardlockup_detector_hpet_stop(void);
> >  extern void hardlockup_detector_hpet_enable(unsigned int cpu);
> >  extern void hardlockup_detector_hpet_disable(unsigned int cpu);
> > +extern void hardlockup_detector_switch_to_perf(void);
> >  #else
> >  static inline struct hpet_hld_data *hpet_hardlockup_detector_assign_timer(void)
> >  { return NULL; }
> > @@ -136,6 +137,7 @@ static inline int hardlockup_detector_hpet_init(void)
> >  static inline void hardlockup_detector_hpet_stop(void) {}
> >  static inline void hardlockup_detector_hpet_enable(unsigned int cpu) {}
> >  static inline void hardlockup_detector_hpet_disable(unsigned int cpu) {}
> > +static void harrdlockup_detector_switch_to_perf(void) {}
> >  #endif /* CONFIG_X86_HARDLOCKUP_DETECTOR_HPET */
> >
> This does not compile for me when CONFIG_X86_HARDLOCKUP_DETECTOR_HPET
> is not enabled.
> because:
>    1- you have a typo on the function name
>     2- you are missing the inline keyword

I am sorry. This was an oversight on my side. I have corrected this in
preparation for a v5.

Thanks and BR,
Ricardo
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 05/21] x86/hpet: Reserve timer for the HPET hardlockup detector
  2019-05-24  1:16 ` [RFC PATCH v4 05/21] x86/hpet: Reserve timer for the HPET hardlockup detector Ricardo Neri
@ 2019-06-11 19:54   ` Thomas Gleixner
  2019-06-14  1:14     ` Ricardo Neri
  0 siblings, 1 reply; 54+ messages in thread
From: Thomas Gleixner @ 2019-06-11 19:54 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Kate Stewart, Ravi V. Shankar, x86, Ashok Raj, Arnd Bergmann,
	Peter Zijlstra, Philippe Ombredanne, Randy Dunlap,
	Clemens Ladisch, linux-kernel, Stephane Eranian, Ricardo Neri,
	Rafael J. Wysocki, iommu, Tony Luck, H. Peter Anvin, Andi Kleen,
	Borislav Petkov, Ingo Molnar

On Thu, 23 May 2019, Ricardo Neri wrote:

> HPET timer 2 will be used to drive the HPET-based hardlockup detector.
> Reserve such timer to ensure it cannot be used by user space programs or
> for clock events.
> 
> When looking for MSI-capable timers for clock events, skip timer 2 if
> the HPET hardlockup detector is selected.

Why? Both the changelog and the code change lack an explanation why this
timer is actually touched after it got reserved for the platform. The
reservation should make it inaccessible for other things.

Thanks,

	tglx
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 12/21] watchdog/hardlockup/hpet: Adjust timer expiration on the number of monitored CPUs
  2019-05-24  1:16 ` [RFC PATCH v4 12/21] watchdog/hardlockup/hpet: Adjust timer expiration on the number of monitored CPUs Ricardo Neri
@ 2019-06-11 20:11   ` Thomas Gleixner
  2019-06-18 22:46     ` Ricardo Neri
  0 siblings, 1 reply; 54+ messages in thread
From: Thomas Gleixner @ 2019-06-11 20:11 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Rafael J. Wysocki, Peter Zijlstra, Ricardo Neri,
	Stephane Eranian, Kai-Heng Feng, Ingo Molnar, Davidlohr Bueso,
	Ashok Raj, Michael Ellerman, x86, Luis R. Rodriguez,
	David Rientjes, Andi Kleen, Waiman Long, Borislav Petkov,
	Masami Hiramatsu, Don Zickus, Ravi V. Shankar,
	Konrad Rzeszutek Wilk, Marc Zyngier, Frederic Weisbecker,
	Nicholas Piggin, Alexei Starovoitov, Byungchul Park, Babu Moger,
	Mathieu Desnoyers, Josh Poimboeuf, Paul E. McKenney, Tony Luck,
	Randy Dunlap, linux-kernel, iommu, Jacob Pan,
	Philippe Ombredanne, Colin Ian King, Andrew Morton

On Thu, 23 May 2019, Ricardo Neri wrote:
> @@ -52,10 +59,10 @@ static void kick_timer(struct hpet_hld_data *hdata, bool force)
>  		return;
>  
>  	if (hdata->has_periodic)
> -		period = watchdog_thresh * hdata->ticks_per_second;
> +		period = watchdog_thresh * hdata->ticks_per_cpu;
>  
>  	count = hpet_readl(HPET_COUNTER);
> -	new_compare = count + watchdog_thresh * hdata->ticks_per_second;
> +	new_compare = count + watchdog_thresh * hdata->ticks_per_cpu;
>  	hpet_set_comparator(hdata->num, (u32)new_compare, (u32)period);

So with this you might get close to the point where you trip over the SMI
induced madness where CPUs vanish for several milliseconds in some value
add code. You really want to do a read back of the hpet to detect that. See
the comment in the hpet code. RHEL 7/8 allow up to 768 logical CPUs....

Thanks,

	tglx
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 05/21] x86/hpet: Reserve timer for the HPET hardlockup detector
  2019-06-11 19:54   ` Thomas Gleixner
@ 2019-06-14  1:14     ` Ricardo Neri
  2019-06-14 16:10       ` Thomas Gleixner
  0 siblings, 1 reply; 54+ messages in thread
From: Ricardo Neri @ 2019-06-14  1:14 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Kate Stewart, Ravi V. Shankar, x86, Ashok Raj, Arnd Bergmann,
	Peter Zijlstra, Philippe Ombredanne, Randy Dunlap,
	Clemens Ladisch, linux-kernel, Stephane Eranian, Ricardo Neri,
	Rafael J. Wysocki, iommu, Tony Luck, H. Peter Anvin, Andi Kleen,
	Borislav Petkov, Ingo Molnar

On Tue, Jun 11, 2019 at 09:54:25PM +0200, Thomas Gleixner wrote:
> On Thu, 23 May 2019, Ricardo Neri wrote:
> 
> > HPET timer 2 will be used to drive the HPET-based hardlockup detector.
> > Reserve such timer to ensure it cannot be used by user space programs or
> > for clock events.
> > 
> > When looking for MSI-capable timers for clock events, skip timer 2 if
> > the HPET hardlockup detector is selected.
> 
> Why? Both the changelog and the code change lack an explanation why this
> timer is actually touched after it got reserved for the platform. The
> reservation should make it inaccessible for other things.

hpet_reserve_platform_timers() will give the HPET char driver a data
structure which specifies which drivers are reserved. In this manner,
they cannot be used by applications via file opens. The timer used by
the hardlockup detector should be marked as reserved.

Also, hpet_msi_capability_lookup() populates another data structure
which is used when obtaining an unused timer for a HPET clock event.
The timer used by the hardlockup detector should not be included in such
data structure.

Is this the explanation you would like to see? If yes, I will include it
in the changelog.

Thanks and BR,
Ricardo

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 03/21] x86/hpet: Calculate ticks-per-second in a separate function
  2019-05-24  1:16 ` [RFC PATCH v4 03/21] x86/hpet: Calculate ticks-per-second in a separate function Ricardo Neri
@ 2019-06-14 15:54   ` Thomas Gleixner
  2019-06-14 15:59     ` Thomas Gleixner
  2019-06-18 22:48     ` Ricardo Neri
  0 siblings, 2 replies; 54+ messages in thread
From: Thomas Gleixner @ 2019-06-14 15:54 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Kate Stewart, Ravi V. Shankar, x86, Ashok Raj, Arnd Bergmann,
	Peter Zijlstra, Philippe Ombredanne, Randy Dunlap,
	Clemens Ladisch, linux-kernel, Stephane Eranian, Ricardo Neri,
	Rafael J. Wysocki, iommu, Tony Luck, H. Peter Anvin, Andi Kleen,
	Borislav Petkov, Ingo Molnar

On Thu, 23 May 2019, Ricardo Neri wrote:
>  
> +u64 hpet_get_ticks_per_sec(u64 hpet_caps)
> +{
> +	u64 ticks_per_sec, period;
> +
> +	period = (hpet_caps & HPET_COUNTER_CLK_PERIOD_MASK) >>
> +		 HPET_COUNTER_CLK_PERIOD_SHIFT; /* fs, 10^-15 */
> +
> +	/*
> +	 * The frequency is the reciprocal of the period. The period is given
> +	 * in femtoseconds per second. Thus, prepare a dividend to obtain the
> +	 * frequency in ticks per second.
> +	 */
> +
> +	/* 10^15 femtoseconds per second */
> +	ticks_per_sec = 1000000000000000ULL;
> +	ticks_per_sec += period >> 1; /* round */
> +
> +	/* The quotient is put in the dividend. We drop the remainder. */
> +	do_div(ticks_per_sec, period);
> +
> +	return ticks_per_sec;
> +}
> +
>  int hpet_alloc(struct hpet_data *hdp)
>  {
>  	u64 cap, mcfg;
> @@ -844,7 +867,6 @@ int hpet_alloc(struct hpet_data *hdp)
>  	struct hpets *hpetp;
>  	struct hpet __iomem *hpet;
>  	static struct hpets *last;
> -	unsigned long period;
>  	unsigned long long temp;
>  	u32 remainder;
>  
> @@ -894,12 +916,7 @@ int hpet_alloc(struct hpet_data *hdp)
>  
>  	last = hpetp;
>  
> -	period = (cap & HPET_COUNTER_CLK_PERIOD_MASK) >>
> -		HPET_COUNTER_CLK_PERIOD_SHIFT; /* fs, 10^-15 */
> -	temp = 1000000000000000uLL; /* 10^15 femtoseconds per second */
> -	temp += period >> 1; /* round */
> -	do_div(temp, period);
> -	hpetp->hp_tick_freq = temp; /* ticks per second */
> +	hpetp->hp_tick_freq = hpet_get_ticks_per_sec(cap);

Why are we actually computing this over and over?

In hpet_enable() which is the first function invoked we have:

        /*
         * The period is a femto seconds value. Convert it to a
         * frequency.
         */
        freq = FSEC_PER_SEC;
        do_div(freq, hpet_period);
        hpet_freq = freq;

So we already have ticks per second, aka frequency, right? So why do we
need yet another function instead of using the value which is computed
once? The frequency of the HPET channels has to be identical no matter
what. If it's not HPET is broken beyond repair.

Thanks,

	tglx


_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 03/21] x86/hpet: Calculate ticks-per-second in a separate function
  2019-06-14 15:54   ` Thomas Gleixner
@ 2019-06-14 15:59     ` Thomas Gleixner
  2019-06-18 22:48     ` Ricardo Neri
  1 sibling, 0 replies; 54+ messages in thread
From: Thomas Gleixner @ 2019-06-14 15:59 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Kate Stewart, Ravi V. Shankar, x86, Ashok Raj, Arnd Bergmann,
	Peter Zijlstra, Philippe Ombredanne, Randy Dunlap,
	Clemens Ladisch, linux-kernel, Stephane Eranian, Ricardo Neri,
	Rafael J. Wysocki, iommu, Tony Luck, H. Peter Anvin, Andi Kleen,
	Borislav Petkov, Ingo Molnar

On Fri, 14 Jun 2019, Thomas Gleixner wrote:
> On Thu, 23 May 2019, Ricardo Neri wrote:
> >  int hpet_alloc(struct hpet_data *hdp)
> >  {
> >  	u64 cap, mcfg;
> > @@ -844,7 +867,6 @@ int hpet_alloc(struct hpet_data *hdp)
> >  	struct hpets *hpetp;
> >  	struct hpet __iomem *hpet;
> >  	static struct hpets *last;
> > -	unsigned long period;
> >  	unsigned long long temp;
> >  	u32 remainder;
> >  
> > @@ -894,12 +916,7 @@ int hpet_alloc(struct hpet_data *hdp)
> >  
> >  	last = hpetp;
> >  
> > -	period = (cap & HPET_COUNTER_CLK_PERIOD_MASK) >>
> > -		HPET_COUNTER_CLK_PERIOD_SHIFT; /* fs, 10^-15 */
> > -	temp = 1000000000000000uLL; /* 10^15 femtoseconds per second */
> > -	temp += period >> 1; /* round */
> > -	do_div(temp, period);
> > -	hpetp->hp_tick_freq = temp; /* ticks per second */
> > +	hpetp->hp_tick_freq = hpet_get_ticks_per_sec(cap);
> 
> Why are we actually computing this over and over?
> 
> In hpet_enable() which is the first function invoked we have:
> 
>         /*
>          * The period is a femto seconds value. Convert it to a
>          * frequency.
>          */
>         freq = FSEC_PER_SEC;
>         do_div(freq, hpet_period);
>         hpet_freq = freq;
> 
> So we already have ticks per second, aka frequency, right? So why do we
> need yet another function instead of using the value which is computed
> once? The frequency of the HPET channels has to be identical no matter
> what. If it's not HPET is broken beyond repair.

Aside of that this change breaks the IA64 support for /dev/hpet.

Thanks,

	tglx
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 05/21] x86/hpet: Reserve timer for the HPET hardlockup detector
  2019-06-14  1:14     ` Ricardo Neri
@ 2019-06-14 16:10       ` Thomas Gleixner
  2019-06-18 22:48         ` Ricardo Neri
  0 siblings, 1 reply; 54+ messages in thread
From: Thomas Gleixner @ 2019-06-14 16:10 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Kate Stewart, Ravi V. Shankar, x86, Ashok Raj, Arnd Bergmann,
	Peter Zijlstra, Philippe Ombredanne, Randy Dunlap,
	Clemens Ladisch, linux-kernel, Stephane Eranian, Ricardo Neri,
	Rafael J. Wysocki, iommu, Tony Luck, H. Peter Anvin, Andi Kleen,
	Borislav Petkov, Ingo Molnar

On Thu, 13 Jun 2019, Ricardo Neri wrote:

> On Tue, Jun 11, 2019 at 09:54:25PM +0200, Thomas Gleixner wrote:
> > On Thu, 23 May 2019, Ricardo Neri wrote:
> > 
> > > HPET timer 2 will be used to drive the HPET-based hardlockup detector.
> > > Reserve such timer to ensure it cannot be used by user space programs or
> > > for clock events.
> > > 
> > > When looking for MSI-capable timers for clock events, skip timer 2 if
> > > the HPET hardlockup detector is selected.
> > 
> > Why? Both the changelog and the code change lack an explanation why this
> > timer is actually touched after it got reserved for the platform. The
> > reservation should make it inaccessible for other things.
> 
> hpet_reserve_platform_timers() will give the HPET char driver a data
> structure which specifies which drivers are reserved. In this manner,
> they cannot be used by applications via file opens. The timer used by
> the hardlockup detector should be marked as reserved.
> 
> Also, hpet_msi_capability_lookup() populates another data structure
> which is used when obtaining an unused timer for a HPET clock event.
> The timer used by the hardlockup detector should not be included in such
> data structure.
> 
> Is this the explanation you would like to see? If yes, I will include it
> in the changelog.

Yes, the explanation makes sense. The code still sucks. Not really your
fault, but this is not making it any better.

What bothers me most is the fact that CONFIG_X86_HARDLOCKUP_DETECTOR_HPET
removes one HPET timer unconditionally. It neither checks whether the hpet
watchdog is actually enabled on the command line, nor does it validate
upfront whether the HPET supports FSB delivery.

That wastes an HPET timer unconditionally for no value. Not that I
personally care much about /dev/hpet, but some older laptops depend on HPET
per cpu timers as the local APIC timer stops in C2/3. So this unconditional
reservation will cause regressions for no reason.

The proper approach here is to:

 1) Evaluate the command line _before_ hpet_enable() is invoked

 2) Check the availability of FSB delivery in hpet_enable()

Reserve an HPET channel for the watchdog only when #1 and #2 are true.

Thanks,

	tglx


_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 04/21] x86/hpet: Add hpet_set_comparator() for periodic and one-shot modes
  2019-05-24  1:16 ` [RFC PATCH v4 04/21] x86/hpet: Add hpet_set_comparator() for periodic and one-shot modes Ricardo Neri
@ 2019-06-14 18:17   ` Thomas Gleixner
  2019-06-18 22:48     ` Ricardo Neri
  0 siblings, 1 reply; 54+ messages in thread
From: Thomas Gleixner @ 2019-06-14 18:17 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Kate Stewart, Ravi V. Shankar, x86, Ashok Raj, Peter Zijlstra,
	Philippe Ombredanne, Randy Dunlap, linux-kernel,
	Stephane Eranian, Ricardo Neri, Rafael J. Wysocki, iommu,
	Tony Luck, H. Peter Anvin, Andi Kleen, Borislav Petkov,
	Ingo Molnar

On Thu, 23 May 2019, Ricardo Neri wrote:
> +/**
> + * hpet_set_comparator() - Helper function for setting comparator register
> + * @num:	The timer ID
> + * @cmp:	The value to be written to the comparator/accumulator
> + * @period:	The value to be written to the period (0 = oneshot mode)
> + *
> + * Helper function for updating comparator, accumulator and period values.
> + *
> + * In periodic mode, HPET needs HPET_TN_SETVAL to be set before writing
> + * to the Tn_CMP to update the accumulator. Then, HPET needs a second
> + * write (with HPET_TN_SETVAL cleared) to Tn_CMP to set the period.
> + * The HPET_TN_SETVAL bit is automatically cleared after the first write.
> + *
> + * For one-shot mode, HPET_TN_SETVAL does not need to be set.
> + *
> + * See the following documents:
> + *   - Intel IA-PC HPET (High Precision Event Timers) Specification
> + *   - AMD-8111 HyperTransport I/O Hub Data Sheet, Publication # 24674
> + */
> +void hpet_set_comparator(int num, unsigned int cmp, unsigned int period)
> +{
> +	if (period) {
> +		unsigned int v = hpet_readl(HPET_Tn_CFG(num));
> +
> +		hpet_writel(v | HPET_TN_SETVAL, HPET_Tn_CFG(num));
> +	}
> +
> +	hpet_writel(cmp, HPET_Tn_CMP(num));
> +
> +	if (!period)
> +		return;

TBH, I hate this conditional handling. What's wrong with two functions?

> +
> +	/*
> +	 * This delay is seldom used: never in one-shot mode and in periodic
> +	 * only when reprogramming the timer.
> +	 */
> +	udelay(1);
> +	hpet_writel(period, HPET_Tn_CMP(num));
> +}
> +EXPORT_SYMBOL_GPL(hpet_set_comparator);

Why is this exported? Which module user needs this?

Thanks,

	tglx
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 21/21] x86/watchdog/hardlockup/hpet: Support interrupt remapping
  2019-05-24  1:16 ` [RFC PATCH v4 21/21] x86/watchdog/hardlockup/hpet: Support interrupt remapping Ricardo Neri
@ 2019-06-16  8:44   ` Thomas Gleixner
  2019-06-16  8:53     ` Thomas Gleixner
  0 siblings, 1 reply; 54+ messages in thread
From: Thomas Gleixner @ 2019-06-16  8:44 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Kate Stewart, Peter Zijlstra, Jan Kiszka, Ricardo Neri,
	Stephane Eranian, Ingo Molnar, Wincy Van, Ashok Raj, x86,
	Andi Kleen, Borislav Petkov, Eric W. Biederman, Ravi V. Shankar,
	Bjorn Helgaas, Juergen Gross, Tony Luck, Randy Dunlap,
	linux-kernel, iommu, Jacob Pan, Philippe Ombredanne

On Thu, 23 May 2019, Ricardo Neri wrote:
> +/** irq_remapping_enabled() - Detect if interrupt remapping is enabled
> + * @hdata:	A data structure with the HPET block id
> + *
> + * Determine if the HPET block that the hardlockup detector is under
> + * the remapped interrupt domain.
> + *
> + * Returns: True interrupt remapping is enabled. False otherwise.
> + */
> +static bool irq_remapping_enabled(struct hpet_hld_data *hdata)
> +{
> +	struct irq_alloc_info info;
> +
> +	init_irq_alloc_info(&info, NULL);
> +	info.type = X86_IRQ_ALLOC_TYPE_HPET;
> +	info.hpet_id = hdata->blockid;
> +
> +	return !!irq_remapping_get_ir_irq_domain(&info);
> +}
> +
>  /**
>   * compose_msi_msg() - Populate address and data fields of an MSI message
>   * @hdata:	A data strucure with the message to populate
> @@ -161,6 +181,9 @@ static int update_msi_destid(struct hpet_hld_data *hdata)
>  {
>  	u32 destid;
>  
> +	if (irq_remapping_enabled(hdata))
> +		return hld_hpet_intremap_activate_irq(hdata);

No. This is horrible hackery violating all the layering which we carefully
put into place to avoid exactly this kind of sprinkling conditionals into
all code pathes.

With some thought the existing irqdomain hierarchy can be used to achieve
the same thing without tons of extra functions and conditionals.

Thanks,

	tglx
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 21/21] x86/watchdog/hardlockup/hpet: Support interrupt remapping
  2019-06-16  8:44   ` Thomas Gleixner
@ 2019-06-16  8:53     ` Thomas Gleixner
  0 siblings, 0 replies; 54+ messages in thread
From: Thomas Gleixner @ 2019-06-16  8:53 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Kate Stewart, Peter Zijlstra, Jan Kiszka, Ricardo Neri,
	Stephane Eranian, Ingo Molnar, Wincy Van, Ashok Raj, x86,
	Andi Kleen, Borislav Petkov, Eric W. Biederman, Ravi V. Shankar,
	Bjorn Helgaas, Juergen Gross, Tony Luck, Randy Dunlap,
	linux-kernel, iommu, Jacob Pan, Philippe Ombredanne

On Sun, 16 Jun 2019, Thomas Gleixner wrote:

> On Thu, 23 May 2019, Ricardo Neri wrote:
> > +/** irq_remapping_enabled() - Detect if interrupt remapping is enabled
> > + * @hdata:	A data structure with the HPET block id
> > + *
> > + * Determine if the HPET block that the hardlockup detector is under
> > + * the remapped interrupt domain.
> > + *
> > + * Returns: True interrupt remapping is enabled. False otherwise.
> > + */
> > +static bool irq_remapping_enabled(struct hpet_hld_data *hdata)
> > +{
> > +	struct irq_alloc_info info;
> > +
> > +	init_irq_alloc_info(&info, NULL);
> > +	info.type = X86_IRQ_ALLOC_TYPE_HPET;
> > +	info.hpet_id = hdata->blockid;
> > +
> > +	return !!irq_remapping_get_ir_irq_domain(&info);
> > +}
> > +
> >  /**
> >   * compose_msi_msg() - Populate address and data fields of an MSI message
> >   * @hdata:	A data strucure with the message to populate
> > @@ -161,6 +181,9 @@ static int update_msi_destid(struct hpet_hld_data *hdata)
> >  {
> >  	u32 destid;
> >  
> > +	if (irq_remapping_enabled(hdata))
> > +		return hld_hpet_intremap_activate_irq(hdata);
> 
> No. This is horrible hackery violating all the layering which we carefully
> put into place to avoid exactly this kind of sprinkling conditionals into
> all code pathes.
> 
> With some thought the existing irqdomain hierarchy can be used to achieve
> the same thing without tons of extra functions and conditionals.

And of course this whole thing falls completely apart when someone enables
the hpet watchdog on AMD.

Can you folks please stop this works for me tinkering and finally grasp
that there is a world outside of Intel and outside of big enterprise boxes?

Thanks,

	tglx
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 18/21] x86/apic: Add a parameter for the APIC delivery mode
  2019-05-24  1:16 ` [RFC PATCH v4 18/21] x86/apic: Add a parameter for the APIC delivery mode Ricardo Neri
@ 2019-06-16  9:55   ` Thomas Gleixner
  2019-06-18 22:47     ` Ricardo Neri
  0 siblings, 1 reply; 54+ messages in thread
From: Thomas Gleixner @ 2019-06-16  9:55 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Kate Stewart, Peter Zijlstra, Jan Kiszka, Ricardo Neri,
	Stephane Eranian, Ingo Molnar, Wincy Van, Ashok Raj, x86,
	Andi Kleen, Borislav Petkov, Eric W. Biederman, Ravi V. Shankar,
	Bjorn Helgaas, Juergen Gross, Tony Luck, Randy Dunlap,
	linux-kernel, iommu, Jacob Pan, Philippe Ombredanne

On Thu, 23 May 2019, Ricardo Neri wrote:
>  
>  struct irq_cfg {
> -	unsigned int		dest_apicid;
> -	unsigned int		vector;
> +	unsigned int				dest_apicid;
> +	unsigned int				vector;
> +	enum ioapic_irq_destination_types	delivery_mode;

And how is this related to IOAPIC? I know this enum exists already, but in
connection with MSI this does not make any sense at all.

> +
> +		/*
> +		 * Initialize the delivery mode of this irq to match the
> +		 * default delivery mode of the APIC. This is useful for
> +		 * children irq domains which want to take the delivery
> +		 * mode from the individual irq configuration rather
> +		 * than from the APIC.
> +		 */
> +		 apicd->hw_irq_cfg.delivery_mode = apic->irq_delivery_mode;

And here it's initialized from apic->irq_delivery_mode, which is an
u32. Intuitive and consistent - NOT!

Thanks,

	tglx
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 20/21] iommu/vt-d: hpet: Reserve an interrupt remampping table entry for watchdog
  2019-05-24  1:16 ` [RFC PATCH v4 20/21] iommu/vt-d: hpet: Reserve an interrupt remampping table entry for watchdog Ricardo Neri
@ 2019-06-16 18:42   ` Thomas Gleixner
  2019-06-16 19:21   ` Thomas Gleixner
  1 sibling, 0 replies; 54+ messages in thread
From: Thomas Gleixner @ 2019-06-16 18:42 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Kate Stewart, Peter Zijlstra, Jan Kiszka, Ricardo Neri,
	Stephane Eranian, Ingo Molnar, Wincy Van, Ashok Raj, x86,
	Andi Kleen, Borislav Petkov, Eric W. Biederman, Ravi V. Shankar,
	Bjorn Helgaas, Juergen Gross, Tony Luck, Randy Dunlap,
	linux-kernel, iommu, Jacob Pan, Philippe Ombredanne

On Thu, 23 May 2019, Ricardo Neri wrote:

> When interrupt remapping is enabled, MSI interrupt messages must follow a
> special format that the IOMMU can understand. Hence, when the HPET hard
> lockup detector is used with interrupt remapping, it must also follow this
> special format.
> 
> The IOMMU, given the information about a particular interrupt, already
> knows how to populate the MSI message with this special format and the
> corresponding entry in the interrupt remapping table. Given that this is a
> special interrupt case, we want to avoid the interrupt subsystem. Add two
> functions to create an entry for the HPET hard lockup detector. Perform
> this process in two steps as described below.
> 
> When initializing the lockup detector, the function
> hld_hpet_intremap_alloc_irq() permanently allocates a new entry in the
> interrupt remapping table and populates it with the information the
> IOMMU driver needs. In order to populate the table, the IOMMU needs to
> know the HPET block ID as described in the ACPI table. Hence, add such
> ID to the data of the hardlockup detector.
> 
> When the hardlockup detector is enabled, the function
> hld_hpet_intremapactivate_irq() activates the recently created entry
> in the interrupt remapping table via the modify_irte() functions. While
> doing this, it specifies which CPU the interrupt must target via its APIC
> ID. This function can be called every time the destination iD of the
> interrupt needs to be updated; there is no need to allocate or remove
> entries in the interrupt remapping table.

And except for a few details all of this functionality exists today. There
is no need to hack HPET NMI specific knowledge into the irq remapping
driver. And of course to unbreak AMD you'd have to copy the same hackery
into the AMD interrupt remapping code.

More lines of duplicated duct tape code are better to maintain, right?

Sigh.

	tglx
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 20/21] iommu/vt-d: hpet: Reserve an interrupt remampping table entry for watchdog
  2019-05-24  1:16 ` [RFC PATCH v4 20/21] iommu/vt-d: hpet: Reserve an interrupt remampping table entry for watchdog Ricardo Neri
  2019-06-16 18:42   ` Thomas Gleixner
@ 2019-06-16 19:21   ` Thomas Gleixner
  2019-06-17  8:25     ` Thomas Gleixner
  1 sibling, 1 reply; 54+ messages in thread
From: Thomas Gleixner @ 2019-06-16 19:21 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Kate Stewart, Peter Zijlstra, Jan Kiszka, Ricardo Neri,
	Stephane Eranian, Ingo Molnar, Wincy Van, Ashok Raj, x86,
	Andi Kleen, Borislav Petkov, Eric W. Biederman, Ravi V. Shankar,
	Bjorn Helgaas, Juergen Gross, Tony Luck, Randy Dunlap,
	linux-kernel, iommu, Jacob Pan, Philippe Ombredanne

On Thu, 23 May 2019, Ricardo Neri wrote:
> When the hardlockup detector is enabled, the function
> hld_hpet_intremapactivate_irq() activates the recently created entry
> in the interrupt remapping table via the modify_irte() functions. While
> doing this, it specifies which CPU the interrupt must target via its APIC
> ID. This function can be called every time the destination iD of the
> interrupt needs to be updated; there is no need to allocate or remove
> entries in the interrupt remapping table.

Brilliant.

> +int hld_hpet_intremap_activate_irq(struct hpet_hld_data *hdata)
> +{
> +	u32 destid = apic->calc_dest_apicid(hdata->handling_cpu);
> +	struct intel_ir_data *data;
> +
> +	data = (struct intel_ir_data *)hdata->intremap_data;
> +	data->irte_entry.dest_id = IRTE_DEST(destid);
> +	return modify_irte(&data->irq_2_iommu, &data->irte_entry);

This calls modify_irte() which does at the very beginning:

   raw_spin_lock_irqsave(&irq_2_ir_lock, flags);

How is that supposed to work from NMI context? Not to talk about the
other spinlocks which are taken in the subsequent call chain.

You cannot call in any of that code from NMI context.

The only reason why this never deadlocked in your testing is that nothing
else touched that particular iommu where the HPET hangs off concurrently.

But that's just pure luck and not design. 

Thanks,

	tglx
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 20/21] iommu/vt-d: hpet: Reserve an interrupt remampping table entry for watchdog
  2019-06-16 19:21   ` Thomas Gleixner
@ 2019-06-17  8:25     ` Thomas Gleixner
  2019-06-17 21:38       ` Stephane Eranian via iommu
  2019-06-18 22:45       ` Ricardo Neri
  0 siblings, 2 replies; 54+ messages in thread
From: Thomas Gleixner @ 2019-06-17  8:25 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Kate Stewart, Peter Zijlstra, Jan Kiszka, Ricardo Neri,
	Stephane Eranian, Ingo Molnar, Wincy Van, Ashok Raj, x86,
	Andi Kleen, Borislav Petkov, Eric W. Biederman, Ravi V. Shankar,
	Bjorn Helgaas, Juergen Gross, Tony Luck, Randy Dunlap,
	linux-kernel, iommu, Jacob Pan, Philippe Ombredanne

On Sun, 16 Jun 2019, Thomas Gleixner wrote:
> On Thu, 23 May 2019, Ricardo Neri wrote:
> > When the hardlockup detector is enabled, the function
> > hld_hpet_intremapactivate_irq() activates the recently created entry
> > in the interrupt remapping table via the modify_irte() functions. While
> > doing this, it specifies which CPU the interrupt must target via its APIC
> > ID. This function can be called every time the destination iD of the
> > interrupt needs to be updated; there is no need to allocate or remove
> > entries in the interrupt remapping table.
> 
> Brilliant.
> 
> > +int hld_hpet_intremap_activate_irq(struct hpet_hld_data *hdata)
> > +{
> > +	u32 destid = apic->calc_dest_apicid(hdata->handling_cpu);
> > +	struct intel_ir_data *data;
> > +
> > +	data = (struct intel_ir_data *)hdata->intremap_data;
> > +	data->irte_entry.dest_id = IRTE_DEST(destid);
> > +	return modify_irte(&data->irq_2_iommu, &data->irte_entry);
> 
> This calls modify_irte() which does at the very beginning:
> 
>    raw_spin_lock_irqsave(&irq_2_ir_lock, flags);
> 
> How is that supposed to work from NMI context? Not to talk about the
> other spinlocks which are taken in the subsequent call chain.
> 
> You cannot call in any of that code from NMI context.
> 
> The only reason why this never deadlocked in your testing is that nothing
> else touched that particular iommu where the HPET hangs off concurrently.
> 
> But that's just pure luck and not design. 

And just for the record. I warned you about that problem during the review
of an earlier version and told you to talk to IOMMU folks whether there is
a way to update the entry w/o running into that lock problem.

Can you tell my why am I actually reviewing patches and spending time on
this when the result is ignored anyway?

I also tried to figure out why you went away from the IPI broadcast
design. The only information I found is:

Changes vs. v1:

 * Brought back the round-robin mechanism proposed in v1 (this time not
   using the interrupt subsystem). This also requires to compute
   expiration times as in v1 (Andi Kleen, Stephane Eranian).

Great that there is no trace of any mail from Andi or Stephane about this
on LKML. There is no problem with talking offlist about this stuff, but
then you should at least provide a rationale for those who were not part of
the private conversation.

Thanks,

	tglcx

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 20/21] iommu/vt-d: hpet: Reserve an interrupt remampping table entry for watchdog
  2019-06-17  8:25     ` Thomas Gleixner
@ 2019-06-17 21:38       ` Stephane Eranian via iommu
  2019-06-17 23:08         ` Thomas Gleixner
  2019-06-18 22:45       ` Ricardo Neri
  1 sibling, 1 reply; 54+ messages in thread
From: Stephane Eranian via iommu @ 2019-06-17 21:38 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Kate Stewart, Peter Zijlstra, Jan Kiszka, Ricardo Neri,
	Ingo Molnar, Wincy Van, Ashok Raj, x86, Andi Kleen,
	Borislav Petkov, Eric W. Biederman, Ravi V. Shankar,
	Ricardo Neri, Bjorn Helgaas, Juergen Gross, Tony Luck,
	Randy Dunlap, LKML, iommu, Jacob Pan, Philippe Ombredanne

Hi,

On Mon, Jun 17, 2019 at 1:25 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Sun, 16 Jun 2019, Thomas Gleixner wrote:
> > On Thu, 23 May 2019, Ricardo Neri wrote:
> > > When the hardlockup detector is enabled, the function
> > > hld_hpet_intremapactivate_irq() activates the recently created entry
> > > in the interrupt remapping table via the modify_irte() functions. While
> > > doing this, it specifies which CPU the interrupt must target via its APIC
> > > ID. This function can be called every time the destination iD of the
> > > interrupt needs to be updated; there is no need to allocate or remove
> > > entries in the interrupt remapping table.
> >
> > Brilliant.
> >
> > > +int hld_hpet_intremap_activate_irq(struct hpet_hld_data *hdata)
> > > +{
> > > +   u32 destid = apic->calc_dest_apicid(hdata->handling_cpu);
> > > +   struct intel_ir_data *data;
> > > +
> > > +   data = (struct intel_ir_data *)hdata->intremap_data;
> > > +   data->irte_entry.dest_id = IRTE_DEST(destid);
> > > +   return modify_irte(&data->irq_2_iommu, &data->irte_entry);
> >
> > This calls modify_irte() which does at the very beginning:
> >
> >    raw_spin_lock_irqsave(&irq_2_ir_lock, flags);
> >
> > How is that supposed to work from NMI context? Not to talk about the
> > other spinlocks which are taken in the subsequent call chain.
> >
> > You cannot call in any of that code from NMI context.
> >
> > The only reason why this never deadlocked in your testing is that nothing
> > else touched that particular iommu where the HPET hangs off concurrently.
> >
> > But that's just pure luck and not design.
>
> And just for the record. I warned you about that problem during the review
> of an earlier version and told you to talk to IOMMU folks whether there is
> a way to update the entry w/o running into that lock problem.
>
> Can you tell my why am I actually reviewing patches and spending time on
> this when the result is ignored anyway?
>
> I also tried to figure out why you went away from the IPI broadcast
> design. The only information I found is:
>
> Changes vs. v1:
>
>  * Brought back the round-robin mechanism proposed in v1 (this time not
>    using the interrupt subsystem). This also requires to compute
>    expiration times as in v1 (Andi Kleen, Stephane Eranian).
>
> Great that there is no trace of any mail from Andi or Stephane about this
> on LKML. There is no problem with talking offlist about this stuff, but
> then you should at least provide a rationale for those who were not part of
> the private conversation.
>
Let me add some context to this whole patch series. The pressure on
the core PMU counters
is increasing as more people want to use them to measure always more
events. When the PMU
is overcommitted, i.e., more events than counters for them, there is
multiplexing. It comes
with an overhead that is too high for certain applications. One way to
avoid this is to lower the
multiplexing frequency, which is by default 1ms, but that comes with
loss of accuracy. Another approach is
to measure only a small number of events at a time and use multiple
runs, but then you lose consistent event
view. Another approach is to push for increasing the number of
counters. But getting new hardware
counters takes time. Short term, we can investigate what it would take
to free one cycle-capable
counter which is commandeered by the hard lockup detector on all X86
processors today. The functionality
of the watchdog, being able to get a crash dump on kernel deadlocks,
is important and we cannot simply
disable it. At scale, many bugs are exposed and thus machines
deadlock. Therefore, we want to investigate
what it would take to move the detector to another NMI-capable source,
such as the HPET because the
detector does not need high low granularity timer and interrupts only every 2s.

Furthermore, recent Intel erratum, e.g., the TSX issue forcing the TFA
code in perf_events, have increased the pressure
even more with only 3 generic counters left. Thus, it is time to look
at alternative ways of  getting a hard lockup detector
(NMI watchdog) from another NMI source than the PMU. To that extent, I
have been discussing about alternatives.
Intel suggested using the HPET and Ricardo has been working on
producing this patch series. It is clear from your review
that the patches have issues, but I am hoping that they can be
resolved with constructive feedback knowing what the end goal is.

As for the round-robin changes, yes, we discussed this as an
alternative to avoid overloading CPU0 with handling
all of the work to broadcasting IPI to 100+ other CPUs.

Thanks.
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 20/21] iommu/vt-d: hpet: Reserve an interrupt remampping table entry for watchdog
  2019-06-17 21:38       ` Stephane Eranian via iommu
@ 2019-06-17 23:08         ` Thomas Gleixner
  2019-06-19 15:43           ` Jacob Pan
  0 siblings, 1 reply; 54+ messages in thread
From: Thomas Gleixner @ 2019-06-17 23:08 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Kate Stewart, Peter Zijlstra, Jan Kiszka, Ricardo Neri,
	Ingo Molnar, Wincy Van, Ashok Raj, x86, Andi Kleen,
	Borislav Petkov, Eric W. Biederman, Ravi V. Shankar,
	Ricardo Neri, Bjorn Helgaas, Juergen Gross, Tony Luck,
	Randy Dunlap, LKML, iommu, Jacob Pan, Philippe Ombredanne

Stephane,

On Mon, 17 Jun 2019, Stephane Eranian wrote:
> On Mon, Jun 17, 2019 at 1:25 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> > Great that there is no trace of any mail from Andi or Stephane about this
> > on LKML. There is no problem with talking offlist about this stuff, but
> > then you should at least provide a rationale for those who were not part of
> > the private conversation.
> >
> Let me add some context to this whole patch series. The pressure on the
> core PMU counters is increasing as more people want to use them to
> measure always more events. When the PMU is overcommitted, i.e., more
> events than counters for them, there is multiplexing. It comes with an
> overhead that is too high for certain applications. One way to avoid this
> is to lower the multiplexing frequency, which is by default 1ms, but that
> comes with loss of accuracy. Another approach is to measure only a small
> number of events at a time and use multiple runs, but then you lose
> consistent event view. Another approach is to push for increasing the
> number of counters. But getting new hardware counters takes time. Short
> term, we can investigate what it would take to free one cycle-capable
> counter which is commandeered by the hard lockup detector on all X86
> processors today. The functionality of the watchdog, being able to get a
> crash dump on kernel deadlocks, is important and we cannot simply disable
> it. At scale, many bugs are exposed and thus machines
> deadlock. Therefore, we want to investigate what it would take to move
> the detector to another NMI-capable source, such as the HPET because the
> detector does not need high low granularity timer and interrupts only
> every 2s.

I'm well aware about the reasons for this.

> Furthermore, recent Intel erratum, e.g., the TSX issue forcing the TFA
> code in perf_events, have increased the pressure even more with only 3
> generic counters left. Thus, it is time to look at alternative ways of
> getting a hard lockup detector (NMI watchdog) from another NMI source
> than the PMU. To that extent, I have been discussing about alternatives.
>
> Intel suggested using the HPET and Ricardo has been working on
> producing this patch series. It is clear from your review
> that the patches have issues, but I am hoping that they can be
> resolved with constructive feedback knowing what the end goal is.

Well, I gave constructive feedback from the very first version on. But
essential parts of that feedback have been ignored for whatever reasons.

> As for the round-robin changes, yes, we discussed this as an alternative
> to avoid overloading CPU0 with handling all of the work to broadcasting
> IPI to 100+ other CPUs.

I can understand the reason why you don't want to do that, but again, I
said way before this was tried that changing affinity from NMI context with
the IOMMU cannot work by just calling into the iommu code and it needs some
deep investigation with the IOMMU wizards whether a preallocated entry can
be used lockless (including the subsequently required flush).

The outcome is that the change was implemented by simply calling into
functions which I told that they cannot be called from NMI context.

Unless this problem is not solved and I doubt it can be solved after
talking to IOMMU people and studying manuals, the round robin mechanics in
the current form are not going to happen. We'd need a SMI based lockup
detector to debug the resulting livelock wreckage.

There are two possible options:

  1) Back to the IPI approach

     The probem with broadcast is that it sends IPIs one by one to each
     online CPU, which sums up with a large number of CPUs.

     The interesting question is why the kernel does not utilize the all
     excluding self destination shorthand for this. The SDM is not giving
     any information.

     But there is a historic commit which is related and gives a hint:

        commit e77deacb7b078156fcadf27b838a4ce1a65eda04
        Author: Keith Owens <kaos@sgi.com>
        Date:   Mon Jun 26 13:59:56 2006 +0200

        [PATCH] x86_64: Avoid broadcasting NMI IPIs
    
        On some i386/x86_64 systems, sending an NMI IPI as a broadcast will
    	reset the system.  This seems to be a BIOS bug which affects
    	machines where one or more cpus are not under OS control.  It
    	occurs on HT systems with a version of the OS that is not compiled
    	without HT support.  It also occurs when a system is booted with
    	max_cpus=n where 2 <= n < cpus known to the BIOS.  The fix is to
    	always send NMI IPI as a mask instead of as a broadcast.

    I can see the issue with max_cpus and that'd be trivial to solve by
    disabling the HPET watchdog when maxcpus < num_present_cpus is on the
    command line (That's broken anyway with respect to MCEs. See the stupid
    dance we need to do for 'nosmt').

    Though the HT part of the changelog is unparseable garbage but might be
    a cryptic hint to the 'nosmt' mess. Though back then we did not have
    a way to disable the siblings (or did we?). Weird...

    It definitely would be worthwhile to experiment with that. if we could
    use shorthands (also for regular IPIs) that would be a great
    improvement in general and would nicely solve that NMI issue. Beware of
    the dragons though.

  2) Delegate round robin to irq_work

    Use the same mechanism as perf for stuff which needs to be done outside
    of NMI context.

    That can solve the issue, but the drawback is that in case the NMI hits
    a locked up interrupt disabled region the affinity stays on the same
    CPU as the regular IPI which kicks the irq work is not going to be
    handled.  Might not be a big issue as we could detect the situation
    that the IPI comes back to the same CPU. Not pretty and lots of nasty
    corner case and race condition handling.

    There is another issue with that round robin scheme as I pointed out to
    Ricardo:

      With a small watchdog threshold and tons of CPUs the time to switch
      the affinity becomes short. That brings the HPET reprogramming (in
      case of oneshot) into the SMI endagered zone and aside of that it
      will eat performance as well because with lets say 1 second threshold
      and 1000 CPUs we are going to flush the interrupt remapping
      table/entry once per millisecond. No idea how big the penalty is, but
      it's certainly not free.

    One possible way out would be to use a combined approach of building
    CPU groups (lets say 8) where one of the CPUs gets the NMI and IPIs the
    other 7 and then round robins to the next group. Whether that's any
    better, I can't tell.

Sorry that I can't come up with the magic cure and just can provide more
questions than answers.

Thanks,

	tglx



_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 20/21] iommu/vt-d: hpet: Reserve an interrupt remampping table entry for watchdog
  2019-06-17  8:25     ` Thomas Gleixner
  2019-06-17 21:38       ` Stephane Eranian via iommu
@ 2019-06-18 22:45       ` Ricardo Neri
  1 sibling, 0 replies; 54+ messages in thread
From: Ricardo Neri @ 2019-06-18 22:45 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Kate Stewart, Peter Zijlstra, Jan Kiszka, Ricardo Neri,
	Stephane Eranian, Ingo Molnar, Wincy Van, Ashok Raj, x86,
	Andi Kleen, Borislav Petkov, Eric W. Biederman, Ravi V. Shankar,
	Bjorn Helgaas, Juergen Gross, Tony Luck, Randy Dunlap,
	linux-kernel, iommu, Jacob Pan, Philippe Ombredanne

On Mon, Jun 17, 2019 at 10:25:35AM +0200, Thomas Gleixner wrote:
> On Sun, 16 Jun 2019, Thomas Gleixner wrote:
> > On Thu, 23 May 2019, Ricardo Neri wrote:
> > > When the hardlockup detector is enabled, the function
> > > hld_hpet_intremapactivate_irq() activates the recently created entry
> > > in the interrupt remapping table via the modify_irte() functions. While
> > > doing this, it specifies which CPU the interrupt must target via its APIC
> > > ID. This function can be called every time the destination iD of the
> > > interrupt needs to be updated; there is no need to allocate or remove
> > > entries in the interrupt remapping table.
> > 
> > Brilliant.
> > 
> > > +int hld_hpet_intremap_activate_irq(struct hpet_hld_data *hdata)
> > > +{
> > > +	u32 destid = apic->calc_dest_apicid(hdata->handling_cpu);
> > > +	struct intel_ir_data *data;
> > > +
> > > +	data = (struct intel_ir_data *)hdata->intremap_data;
> > > +	data->irte_entry.dest_id = IRTE_DEST(destid);
> > > +	return modify_irte(&data->irq_2_iommu, &data->irte_entry);
> > 
> > This calls modify_irte() which does at the very beginning:
> > 
> >    raw_spin_lock_irqsave(&irq_2_ir_lock, flags);
> > 
> > How is that supposed to work from NMI context? Not to talk about the
> > other spinlocks which are taken in the subsequent call chain.
> > 
> > You cannot call in any of that code from NMI context.
> > 
> > The only reason why this never deadlocked in your testing is that nothing
> > else touched that particular iommu where the HPET hangs off concurrently.
> > 
> > But that's just pure luck and not design. 
> 
> And just for the record. I warned you about that problem during the review
> of an earlier version and told you to talk to IOMMU folks whether there is
> a way to update the entry w/o running into that lock problem.

I think I misunderstood your feedback. You did mention issues on locking
between NMI and !NMI contexts. However, that was in the context of using the
generic irq code to do things such as set the affinity of the interrupt and
requesting an irq. I understood that I should instead program things directly.
I extrapolated this to the IOMMU driver in which I also added code directly
instead of using the existing layering.

Also, at the time, the question regarding the IOMMU, as I understood, was
whether it was posible to reserve a IOMMU remapping entry upfront. I believe
my patches achieve that, even if they are hacky and ugly, and have locking
issues. I see now that the locking issues are also part of the IOMMU
discussion. Perhaps that was also implicit.
> 
> Can you tell my why am I actually reviewing patches and spending time on
> this when the result is ignored anyway?

Yes, Thomas, I should have checked first with the IOMMU maintainers
first on the issues in the paragraph above. It is not my intention to
waste your time; your feedback has been valuable and has contributed to
improve the code.

> 
> I also tried to figure out why you went away from the IPI broadcast
> design. The only information I found is:
> 
> Changes vs. v1:
> 
>  * Brought back the round-robin mechanism proposed in v1 (this time not
>    using the interrupt subsystem). This also requires to compute
>    expiration times as in v1 (Andi Kleen, Stephane Eranian).
> 
> Great that there is no trace of any mail from Andi or Stephane about this
> on LKML. There is no problem with talking offlist about this stuff, but
> then you should at least provide a rationale for those who were not part of
> the private conversation.

Stephane has already commented the rationale.

Thanks and BR,

Ricardo
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 12/21] watchdog/hardlockup/hpet: Adjust timer expiration on the number of monitored CPUs
  2019-06-11 20:11   ` Thomas Gleixner
@ 2019-06-18 22:46     ` Ricardo Neri
  0 siblings, 0 replies; 54+ messages in thread
From: Ricardo Neri @ 2019-06-18 22:46 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Rafael J. Wysocki, Peter Zijlstra, Ricardo Neri,
	Stephane Eranian, Kai-Heng Feng, Ingo Molnar, Davidlohr Bueso,
	Ashok Raj, Michael Ellerman, x86, Luis R. Rodriguez,
	David Rientjes, Andi Kleen, Waiman Long, Borislav Petkov,
	Masami Hiramatsu, Don Zickus, Ravi V. Shankar,
	Konrad Rzeszutek Wilk, Marc Zyngier, Frederic Weisbecker,
	Nicholas Piggin, Alexei Starovoitov, Byungchul Park, Babu Moger,
	Mathieu Desnoyers, Josh Poimboeuf, Paul E. McKenney, Tony Luck,
	Randy Dunlap, linux-kernel, iommu, Jacob Pan,
	Philippe Ombredanne, Colin Ian King, Andrew Morton

On Tue, Jun 11, 2019 at 10:11:04PM +0200, Thomas Gleixner wrote:
> On Thu, 23 May 2019, Ricardo Neri wrote:
> > @@ -52,10 +59,10 @@ static void kick_timer(struct hpet_hld_data *hdata, bool force)
> >  		return;
> >  
> >  	if (hdata->has_periodic)
> > -		period = watchdog_thresh * hdata->ticks_per_second;
> > +		period = watchdog_thresh * hdata->ticks_per_cpu;
> >  
> >  	count = hpet_readl(HPET_COUNTER);
> > -	new_compare = count + watchdog_thresh * hdata->ticks_per_second;
> > +	new_compare = count + watchdog_thresh * hdata->ticks_per_cpu;
> >  	hpet_set_comparator(hdata->num, (u32)new_compare, (u32)period);
> 
> So with this you might get close to the point where you trip over the SMI
> induced madness where CPUs vanish for several milliseconds in some value
> add code. You really want to do a read back of the hpet to detect that. See
> the comment in the hpet code. RHEL 7/8 allow up to 768 logical CPUs....

Do you mean adding a readback to check if the new compare value is
greater than the current count? Similar to the check at the end of
hpet_next_event():

	return res < HPET_MIN_CYCLES ? -ETIME : 0;

In such a case, should it try to set the comparator again? I think it
should, as otherwise the hardlockup detector would stop working.

Thanks and BR,
Ricardo
> 
> Thanks,
> 
> 	tglx
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 18/21] x86/apic: Add a parameter for the APIC delivery mode
  2019-06-16  9:55   ` Thomas Gleixner
@ 2019-06-18 22:47     ` Ricardo Neri
  2019-06-18 23:15       ` Thomas Gleixner
  0 siblings, 1 reply; 54+ messages in thread
From: Ricardo Neri @ 2019-06-18 22:47 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Kate Stewart, Peter Zijlstra, Jan Kiszka, Ricardo Neri,
	Stephane Eranian, Ingo Molnar, Wincy Van, Ashok Raj, x86,
	Andi Kleen, Borislav Petkov, Eric W. Biederman, Ravi V. Shankar,
	Bjorn Helgaas, Juergen Gross, Tony Luck, Randy Dunlap,
	linux-kernel, iommu, Jacob Pan, Philippe Ombredanne

On Sun, Jun 16, 2019 at 11:55:03AM +0200, Thomas Gleixner wrote:
> On Thu, 23 May 2019, Ricardo Neri wrote:
> >  
> >  struct irq_cfg {
> > -	unsigned int		dest_apicid;
> > -	unsigned int		vector;
> > +	unsigned int				dest_apicid;
> > +	unsigned int				vector;
> > +	enum ioapic_irq_destination_types	delivery_mode;
> 
> And how is this related to IOAPIC?

In my view, IOAPICs can also be programmed with a delivery mode. Mode
values are the same for MSI interrupts.

> I know this enum exists already, but in
> connection with MSI this does not make any sense at all.

Is the issue here the name of the enumeration?

> 
> > +
> > +		/*
> > +		 * Initialize the delivery mode of this irq to match the
> > +		 * default delivery mode of the APIC. This is useful for
> > +		 * children irq domains which want to take the delivery
> > +		 * mode from the individual irq configuration rather
> > +		 * than from the APIC.
> > +		 */
> > +		 apicd->hw_irq_cfg.delivery_mode = apic->irq_delivery_mode;
> 
> And here it's initialized from apic->irq_delivery_mode, which is an
> u32. Intuitive and consistent - NOT!

Yes, this is wrong. Then should the member in the structure above be an
u32 instead of enum ioapic_irq_destination_types?

Thanks and BR,
Ricardo
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 03/21] x86/hpet: Calculate ticks-per-second in a separate function
  2019-06-14 15:54   ` Thomas Gleixner
  2019-06-14 15:59     ` Thomas Gleixner
@ 2019-06-18 22:48     ` Ricardo Neri
  2019-06-18 23:13       ` Thomas Gleixner
  1 sibling, 1 reply; 54+ messages in thread
From: Ricardo Neri @ 2019-06-18 22:48 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Kate Stewart, Ravi V. Shankar, x86, Ashok Raj, Arnd Bergmann,
	Peter Zijlstra, Philippe Ombredanne, Randy Dunlap,
	Clemens Ladisch, linux-kernel, Stephane Eranian, Ricardo Neri,
	Rafael J. Wysocki, iommu, Tony Luck, H. Peter Anvin, Andi Kleen,
	Borislav Petkov, Ingo Molnar

On Fri, Jun 14, 2019 at 05:54:05PM +0200, Thomas Gleixner wrote:
> On Thu, 23 May 2019, Ricardo Neri wrote:
> >  
> > +u64 hpet_get_ticks_per_sec(u64 hpet_caps)
> > +{
> > +	u64 ticks_per_sec, period;
> > +
> > +	period = (hpet_caps & HPET_COUNTER_CLK_PERIOD_MASK) >>
> > +		 HPET_COUNTER_CLK_PERIOD_SHIFT; /* fs, 10^-15 */
> > +
> > +	/*
> > +	 * The frequency is the reciprocal of the period. The period is given
> > +	 * in femtoseconds per second. Thus, prepare a dividend to obtain the
> > +	 * frequency in ticks per second.
> > +	 */
> > +
> > +	/* 10^15 femtoseconds per second */
> > +	ticks_per_sec = 1000000000000000ULL;
> > +	ticks_per_sec += period >> 1; /* round */
> > +
> > +	/* The quotient is put in the dividend. We drop the remainder. */
> > +	do_div(ticks_per_sec, period);
> > +
> > +	return ticks_per_sec;
> > +}
> > +
> >  int hpet_alloc(struct hpet_data *hdp)
> >  {
> >  	u64 cap, mcfg;
> > @@ -844,7 +867,6 @@ int hpet_alloc(struct hpet_data *hdp)
> >  	struct hpets *hpetp;
> >  	struct hpet __iomem *hpet;
> >  	static struct hpets *last;
> > -	unsigned long period;
> >  	unsigned long long temp;
> >  	u32 remainder;
> >  
> > @@ -894,12 +916,7 @@ int hpet_alloc(struct hpet_data *hdp)
> >  
> >  	last = hpetp;
> >  
> > -	period = (cap & HPET_COUNTER_CLK_PERIOD_MASK) >>
> > -		HPET_COUNTER_CLK_PERIOD_SHIFT; /* fs, 10^-15 */
> > -	temp = 1000000000000000uLL; /* 10^15 femtoseconds per second */
> > -	temp += period >> 1; /* round */
> > -	do_div(temp, period);
> > -	hpetp->hp_tick_freq = temp; /* ticks per second */
> > +	hpetp->hp_tick_freq = hpet_get_ticks_per_sec(cap);
> 
> Why are we actually computing this over and over?
> 
> In hpet_enable() which is the first function invoked we have:
> 
>         /*
>          * The period is a femto seconds value. Convert it to a
>          * frequency.
>          */
>         freq = FSEC_PER_SEC;
>         do_div(freq, hpet_period);
>         hpet_freq = freq;
> 
> So we already have ticks per second, aka frequency, right? So why do we
> need yet another function instead of using the value which is computed
> once? The frequency of the HPET channels has to be identical no matter
> what. If it's not HPET is broken beyond repair.

I don't think it needs to be recomputed again. I missed the fact that
the frequency was already computed here.

Also, the hpet char driver has its own frequency computation. Perhaps it
could also obtain it from here, right?

Thanks and BR,
Ricardo
> 
> Thanks,
> 
> 	tglx
> 
> 
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 05/21] x86/hpet: Reserve timer for the HPET hardlockup detector
  2019-06-14 16:10       ` Thomas Gleixner
@ 2019-06-18 22:48         ` Ricardo Neri
  0 siblings, 0 replies; 54+ messages in thread
From: Ricardo Neri @ 2019-06-18 22:48 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Kate Stewart, Ravi V. Shankar, x86, Ashok Raj, Arnd Bergmann,
	Peter Zijlstra, Philippe Ombredanne, Randy Dunlap,
	Clemens Ladisch, linux-kernel, Stephane Eranian, Ricardo Neri,
	Rafael J. Wysocki, iommu, Tony Luck, H. Peter Anvin, Andi Kleen,
	Borislav Petkov, Ingo Molnar

On Fri, Jun 14, 2019 at 06:10:18PM +0200, Thomas Gleixner wrote:
> On Thu, 13 Jun 2019, Ricardo Neri wrote:
> 
> > On Tue, Jun 11, 2019 at 09:54:25PM +0200, Thomas Gleixner wrote:
> > > On Thu, 23 May 2019, Ricardo Neri wrote:
> > > 
> > > > HPET timer 2 will be used to drive the HPET-based hardlockup detector.
> > > > Reserve such timer to ensure it cannot be used by user space programs or
> > > > for clock events.
> > > > 
> > > > When looking for MSI-capable timers for clock events, skip timer 2 if
> > > > the HPET hardlockup detector is selected.
> > > 
> > > Why? Both the changelog and the code change lack an explanation why this
> > > timer is actually touched after it got reserved for the platform. The
> > > reservation should make it inaccessible for other things.
> > 
> > hpet_reserve_platform_timers() will give the HPET char driver a data
> > structure which specifies which drivers are reserved. In this manner,
> > they cannot be used by applications via file opens. The timer used by
> > the hardlockup detector should be marked as reserved.
> > 
> > Also, hpet_msi_capability_lookup() populates another data structure
> > which is used when obtaining an unused timer for a HPET clock event.
> > The timer used by the hardlockup detector should not be included in such
> > data structure.
> > 
> > Is this the explanation you would like to see? If yes, I will include it
> > in the changelog.
> 
> Yes, the explanation makes sense. The code still sucks. Not really your
> fault, but this is not making it any better.
> 
> What bothers me most is the fact that CONFIG_X86_HARDLOCKUP_DETECTOR_HPET
> removes one HPET timer unconditionally. It neither checks whether the hpet
> watchdog is actually enabled on the command line, nor does it validate
> upfront whether the HPET supports FSB delivery.
> 
> That wastes an HPET timer unconditionally for no value. Not that I
> personally care much about /dev/hpet, but some older laptops depend on HPET
> per cpu timers as the local APIC timer stops in C2/3. So this unconditional
> reservation will cause regressions for no reason.
> 
> The proper approach here is to:
> 
>  1) Evaluate the command line _before_ hpet_enable() is invoked
> 
>  2) Check the availability of FSB delivery in hpet_enable()
> 
> Reserve an HPET channel for the watchdog only when #1 and #2 are true.

Sure. I will add the explanation in the message commit and only reserve
the timer if both of the conditions above are met.

Thanks and BR,
Ricardo
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 04/21] x86/hpet: Add hpet_set_comparator() for periodic and one-shot modes
  2019-06-14 18:17   ` Thomas Gleixner
@ 2019-06-18 22:48     ` Ricardo Neri
  0 siblings, 0 replies; 54+ messages in thread
From: Ricardo Neri @ 2019-06-18 22:48 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Kate Stewart, Ravi V. Shankar, x86, Ashok Raj, Peter Zijlstra,
	Philippe Ombredanne, Randy Dunlap, linux-kernel,
	Stephane Eranian, Ricardo Neri, Rafael J. Wysocki, iommu,
	Tony Luck, H. Peter Anvin, Andi Kleen, Borislav Petkov,
	Ingo Molnar

On Fri, Jun 14, 2019 at 08:17:14PM +0200, Thomas Gleixner wrote:
> On Thu, 23 May 2019, Ricardo Neri wrote:
> > +/**
> > + * hpet_set_comparator() - Helper function for setting comparator register
> > + * @num:	The timer ID
> > + * @cmp:	The value to be written to the comparator/accumulator
> > + * @period:	The value to be written to the period (0 = oneshot mode)
> > + *
> > + * Helper function for updating comparator, accumulator and period values.
> > + *
> > + * In periodic mode, HPET needs HPET_TN_SETVAL to be set before writing
> > + * to the Tn_CMP to update the accumulator. Then, HPET needs a second
> > + * write (with HPET_TN_SETVAL cleared) to Tn_CMP to set the period.
> > + * The HPET_TN_SETVAL bit is automatically cleared after the first write.
> > + *
> > + * For one-shot mode, HPET_TN_SETVAL does not need to be set.
> > + *
> > + * See the following documents:
> > + *   - Intel IA-PC HPET (High Precision Event Timers) Specification
> > + *   - AMD-8111 HyperTransport I/O Hub Data Sheet, Publication # 24674
> > + */
> > +void hpet_set_comparator(int num, unsigned int cmp, unsigned int period)
> > +{
> > +	if (period) {
> > +		unsigned int v = hpet_readl(HPET_Tn_CFG(num));
> > +
> > +		hpet_writel(v | HPET_TN_SETVAL, HPET_Tn_CFG(num));
> > +	}
> > +
> > +	hpet_writel(cmp, HPET_Tn_CMP(num));
> > +
> > +	if (!period)
> > +		return;
> 
> TBH, I hate this conditional handling. What's wrong with two functions?

There is probably nothing wrong with two functions. I can split it into
hpet_set_comparator_periodic() and hpet_set_comparator(). Perhaps the
latter is not needed as it would be a one-line function; you have
suggested earlier to avoid such small functions.
> 
> > +
> > +	/*
> > +	 * This delay is seldom used: never in one-shot mode and in periodic
> > +	 * only when reprogramming the timer.
> > +	 */
> > +	udelay(1);
> > +	hpet_writel(period, HPET_Tn_CMP(num));
> > +}
> > +EXPORT_SYMBOL_GPL(hpet_set_comparator);
> 
> Why is this exported? Which module user needs this?

It is not used anywhere else. I will remove this export.

Thanks and BR,

Ricardo
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 03/21] x86/hpet: Calculate ticks-per-second in a separate function
  2019-06-18 22:48     ` Ricardo Neri
@ 2019-06-18 23:13       ` Thomas Gleixner
  0 siblings, 0 replies; 54+ messages in thread
From: Thomas Gleixner @ 2019-06-18 23:13 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Kate Stewart, Ravi V. Shankar, x86, Ashok Raj, Arnd Bergmann,
	Peter Zijlstra, Philippe Ombredanne, Randy Dunlap,
	Clemens Ladisch, linux-kernel, Stephane Eranian, Ricardo Neri,
	Rafael J. Wysocki, iommu, Tony Luck, H. Peter Anvin, Andi Kleen,
	Borislav Petkov, Ingo Molnar

On Tue, 18 Jun 2019, Ricardo Neri wrote:
> On Fri, Jun 14, 2019 at 05:54:05PM +0200, Thomas Gleixner wrote:
> > 
> > So we already have ticks per second, aka frequency, right? So why do we
> > need yet another function instead of using the value which is computed
> > once? The frequency of the HPET channels has to be identical no matter
> > what. If it's not HPET is broken beyond repair.
> 
> I don't think it needs to be recomputed again. I missed the fact that
> the frequency was already computed here.
> 
> Also, the hpet char driver has its own frequency computation. Perhaps it
> could also obtain it from here, right?

No. It's separate on purpose. Hint: IA64, aka Itanic

Thanks,

	tglx
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 18/21] x86/apic: Add a parameter for the APIC delivery mode
  2019-06-18 22:47     ` Ricardo Neri
@ 2019-06-18 23:15       ` Thomas Gleixner
  0 siblings, 0 replies; 54+ messages in thread
From: Thomas Gleixner @ 2019-06-18 23:15 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Kate Stewart, Peter Zijlstra, Jan Kiszka, Ricardo Neri,
	Stephane Eranian, Ingo Molnar, Wincy Van, Ashok Raj, x86,
	Andi Kleen, Borislav Petkov, Eric W. Biederman, Ravi V. Shankar,
	Bjorn Helgaas, Juergen Gross, Tony Luck, Randy Dunlap,
	linux-kernel, iommu, Jacob Pan, Philippe Ombredanne

On Tue, 18 Jun 2019, Ricardo Neri wrote:

> On Sun, Jun 16, 2019 at 11:55:03AM +0200, Thomas Gleixner wrote:
> > On Thu, 23 May 2019, Ricardo Neri wrote:
> > >  
> > >  struct irq_cfg {
> > > -	unsigned int		dest_apicid;
> > > -	unsigned int		vector;
> > > +	unsigned int				dest_apicid;
> > > +	unsigned int				vector;
> > > +	enum ioapic_irq_destination_types	delivery_mode;
> > 
> > And how is this related to IOAPIC?
> 
> In my view, IOAPICs can also be programmed with a delivery mode. Mode
> values are the same for MSI interrupts.

> > I know this enum exists already, but in
> > connection with MSI this does not make any sense at all.
> 
> Is the issue here the name of the enumeration?
> 
> > 
> > > +
> > > +		/*
> > > +		 * Initialize the delivery mode of this irq to match the
> > > +		 * default delivery mode of the APIC. This is useful for
> > > +		 * children irq domains which want to take the delivery
> > > +		 * mode from the individual irq configuration rather
> > > +		 * than from the APIC.
> > > +		 */
> > > +		 apicd->hw_irq_cfg.delivery_mode = apic->irq_delivery_mode;
> > 
> > And here it's initialized from apic->irq_delivery_mode, which is an
> > u32. Intuitive and consistent - NOT!
> 
> Yes, this is wrong. Then should the member in the structure above be an
> u32 instead of enum ioapic_irq_destination_types?
> 
> Thanks and BR,
> Ricardo
> 
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 20/21] iommu/vt-d: hpet: Reserve an interrupt remampping table entry for watchdog
  2019-06-17 23:08         ` Thomas Gleixner
@ 2019-06-19 15:43           ` Jacob Pan
  2019-06-21 15:33             ` Thomas Gleixner
  0 siblings, 1 reply; 54+ messages in thread
From: Jacob Pan @ 2019-06-19 15:43 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Kate Stewart, Peter Zijlstra, Jan Kiszka, Ricardo Neri,
	Stephane Eranian, Ingo Molnar, Wincy Van, Ashok Raj, x86,
	Andi Kleen, Borislav Petkov, jacob.jun.pan, Ravi V. Shankar,
	Ricardo Neri, Bjorn Helgaas, Juergen Gross, Tony Luck,
	Randy Dunlap, LKML, iommu, Eric W. Biederman,
	Philippe Ombredanne

On Tue, 18 Jun 2019 01:08:06 +0200 (CEST)
Thomas Gleixner <tglx@linutronix.de> wrote:

> Stephane,
> 
> On Mon, 17 Jun 2019, Stephane Eranian wrote:
> > On Mon, Jun 17, 2019 at 1:25 AM Thomas Gleixner
> > <tglx@linutronix.de> wrote:  
> > > Great that there is no trace of any mail from Andi or Stephane
> > > about this on LKML. There is no problem with talking offlist
> > > about this stuff, but then you should at least provide a
> > > rationale for those who were not part of the private conversation.
> > >  
> > Let me add some context to this whole patch series. The pressure on
> > the core PMU counters is increasing as more people want to use them
> > to measure always more events. When the PMU is overcommitted, i.e.,
> > more events than counters for them, there is multiplexing. It comes
> > with an overhead that is too high for certain applications. One way
> > to avoid this is to lower the multiplexing frequency, which is by
> > default 1ms, but that comes with loss of accuracy. Another approach
> > is to measure only a small number of events at a time and use
> > multiple runs, but then you lose consistent event view. Another
> > approach is to push for increasing the number of counters. But
> > getting new hardware counters takes time. Short term, we can
> > investigate what it would take to free one cycle-capable counter
> > which is commandeered by the hard lockup detector on all X86
> > processors today. The functionality of the watchdog, being able to
> > get a crash dump on kernel deadlocks, is important and we cannot
> > simply disable it. At scale, many bugs are exposed and thus
> > machines deadlock. Therefore, we want to investigate what it would
> > take to move the detector to another NMI-capable source, such as
> > the HPET because the detector does not need high low granularity
> > timer and interrupts only every 2s.  
> 
> I'm well aware about the reasons for this.
> 
> > Furthermore, recent Intel erratum, e.g., the TSX issue forcing the
> > TFA code in perf_events, have increased the pressure even more with
> > only 3 generic counters left. Thus, it is time to look at
> > alternative ways of getting a hard lockup detector (NMI watchdog)
> > from another NMI source than the PMU. To that extent, I have been
> > discussing about alternatives.
> >
> > Intel suggested using the HPET and Ricardo has been working on
> > producing this patch series. It is clear from your review
> > that the patches have issues, but I am hoping that they can be
> > resolved with constructive feedback knowing what the end goal is.  
> 
> Well, I gave constructive feedback from the very first version on. But
> essential parts of that feedback have been ignored for whatever
> reasons.
> 
> > As for the round-robin changes, yes, we discussed this as an
> > alternative to avoid overloading CPU0 with handling all of the work
> > to broadcasting IPI to 100+ other CPUs.  
> 
> I can understand the reason why you don't want to do that, but again,
> I said way before this was tried that changing affinity from NMI
> context with the IOMMU cannot work by just calling into the iommu
> code and it needs some deep investigation with the IOMMU wizards
> whether a preallocated entry can be used lockless (including the
> subsequently required flush).
> 
> The outcome is that the change was implemented by simply calling into
> functions which I told that they cannot be called from NMI context.
> 
> Unless this problem is not solved and I doubt it can be solved after
> talking to IOMMU people and studying manuals, 
I agree. modify irte might be done with cmpxchg_double() but the queued
invalidation interface for IRTE cache flush is shared with DMA and
requires holding a spinlock for enque descriptors, QI tail update etc.

Also, reserving & manipulating IRTE slot for hpet via backdoor might not
be needed if the HPET PCI BDF (found in ACPI) can be utilized. But it
might need more work to add a fake PCI device for HPET.

> the round robin
> mechanics in the current form are not going to happen. We'd need a
> SMI based lockup detector to debug the resulting livelock wreckage.
> 
> There are two possible options:
> 
>   1) Back to the IPI approach
> 
>      The probem with broadcast is that it sends IPIs one by one to
> each online CPU, which sums up with a large number of CPUs.
> 
>      The interesting question is why the kernel does not utilize the
> all excluding self destination shorthand for this. The SDM is not
> giving any information.
> 
>      But there is a historic commit which is related and gives a hint:
> 
>         commit e77deacb7b078156fcadf27b838a4ce1a65eda04
>         Author: Keith Owens <kaos@sgi.com>
>         Date:   Mon Jun 26 13:59:56 2006 +0200
> 
>         [PATCH] x86_64: Avoid broadcasting NMI IPIs
>     
>         On some i386/x86_64 systems, sending an NMI IPI as a
> broadcast will reset the system.  This seems to be a BIOS bug which
> affects machines where one or more cpus are not under OS control.  It
>     	occurs on HT systems with a version of the OS that is not
> compiled without HT support.  It also occurs when a system is booted
> with max_cpus=n where 2 <= n < cpus known to the BIOS.  The fix is to
>     	always send NMI IPI as a mask instead of as a broadcast.
> 
>     I can see the issue with max_cpus and that'd be trivial to solve
> by disabling the HPET watchdog when maxcpus < num_present_cpus is on
> the command line (That's broken anyway with respect to MCEs. See the
> stupid dance we need to do for 'nosmt').
> 
>     Though the HT part of the changelog is unparseable garbage but
> might be a cryptic hint to the 'nosmt' mess. Though back then we did
> not have a way to disable the siblings (or did we?). Weird...
> 
>     It definitely would be worthwhile to experiment with that. if we
> could use shorthands (also for regular IPIs) that would be a great
>     improvement in general and would nicely solve that NMI issue.
> Beware of the dragons though.
> 
>   2) Delegate round robin to irq_work
> 
>     Use the same mechanism as perf for stuff which needs to be done
> outside of NMI context.
> 
>     That can solve the issue, but the drawback is that in case the
> NMI hits a locked up interrupt disabled region the affinity stays on
> the same CPU as the regular IPI which kicks the irq work is not going
> to be handled.  Might not be a big issue as we could detect the
> situation that the IPI comes back to the same CPU. Not pretty and
> lots of nasty corner case and race condition handling.
> 
>     There is another issue with that round robin scheme as I pointed
> out to Ricardo:
> 
>       With a small watchdog threshold and tons of CPUs the time to
> switch the affinity becomes short. That brings the HPET reprogramming
> (in case of oneshot) into the SMI endagered zone and aside of that it
>       will eat performance as well because with lets say 1 second
> threshold and 1000 CPUs we are going to flush the interrupt remapping
>       table/entry once per millisecond. No idea how big the penalty
> is, but it's certainly not free.
> 
>     One possible way out would be to use a combined approach of
> building CPU groups (lets say 8) where one of the CPUs gets the NMI
> and IPIs the other 7 and then round robins to the next group. Whether
> that's any better, I can't tell.
> 
> Sorry that I can't come up with the magic cure and just can provide
> more questions than answers.
> 
> Thanks,
> 
> 	tglx
> 
> 
> 
> _______________________________________________
> iommu mailing list
> iommu@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/iommu

[Jacob Pan]
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 20/21] iommu/vt-d: hpet: Reserve an interrupt remampping table entry for watchdog
  2019-06-19 15:43           ` Jacob Pan
@ 2019-06-21 15:33             ` Thomas Gleixner
  2019-06-21 17:31               ` Jacob Pan
  0 siblings, 1 reply; 54+ messages in thread
From: Thomas Gleixner @ 2019-06-21 15:33 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Kate Stewart, Ravi V. Shankar, Tony Luck, Ashok Raj,
	Peter Zijlstra, Jan Kiszka, x86, Ricardo Neri, Stephane Eranian,
	LKML, Juergen Gross, Bjorn Helgaas, iommu, Philippe Ombredanne,
	Randy Dunlap, Eric W. Biederman, Ricardo Neri, Andi Kleen,
	Borislav Petkov, Ingo Molnar, Wincy Van

On Wed, 19 Jun 2019, Jacob Pan wrote:
> On Tue, 18 Jun 2019 01:08:06 +0200 (CEST)
> Thomas Gleixner <tglx@linutronix.de> wrote:
> > 
> > Unless this problem is not solved and I doubt it can be solved after
> > talking to IOMMU people and studying manuals,
>
> I agree. modify irte might be done with cmpxchg_double() but the queued
> invalidation interface for IRTE cache flush is shared with DMA and
> requires holding a spinlock for enque descriptors, QI tail update etc.
> 
> Also, reserving & manipulating IRTE slot for hpet via backdoor might not
> be needed if the HPET PCI BDF (found in ACPI) can be utilized. But it
> might need more work to add a fake PCI device for HPET.

What would PCI/BDF solve?

Thanks,

	tglx
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 20/21] iommu/vt-d: hpet: Reserve an interrupt remampping table entry for watchdog
  2019-06-21 15:33             ` Thomas Gleixner
@ 2019-06-21 17:31               ` Jacob Pan
  2019-06-21 18:39                 ` Jacob Pan
  0 siblings, 1 reply; 54+ messages in thread
From: Jacob Pan @ 2019-06-21 17:31 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Kate Stewart, Peter Zijlstra, Jan Kiszka, Ricardo Neri,
	Stephane Eranian, Ingo Molnar, Wincy Van, Ashok Raj, x86,
	Andi Kleen, Borislav Petkov, Eric W. Biederman, Ravi V. Shankar,
	Ricardo Neri, Bjorn Helgaas, Juergen Gross, Tony Luck,
	Randy Dunlap, LKML, iommu, jacob.jun.pan, Philippe Ombredanne

On Fri, 21 Jun 2019 17:33:28 +0200 (CEST)
Thomas Gleixner <tglx@linutronix.de> wrote:

> On Wed, 19 Jun 2019, Jacob Pan wrote:
> > On Tue, 18 Jun 2019 01:08:06 +0200 (CEST)
> > Thomas Gleixner <tglx@linutronix.de> wrote:  
> > > 
> > > Unless this problem is not solved and I doubt it can be solved
> > > after talking to IOMMU people and studying manuals,  
> >
> > I agree. modify irte might be done with cmpxchg_double() but the
> > queued invalidation interface for IRTE cache flush is shared with
> > DMA and requires holding a spinlock for enque descriptors, QI tail
> > update etc.
> > 
> > Also, reserving & manipulating IRTE slot for hpet via backdoor
> > might not be needed if the HPET PCI BDF (found in ACPI) can be
> > utilized. But it might need more work to add a fake PCI device for
> > HPET.  
> 
> What would PCI/BDF solve?
I was thinking if HPET is a PCI device then it can naturally
gain slots in IOMMU remapping table IRTEs via PCI MSI code. Then perhaps
it can use the IRQ subsystem to set affinity etc. w/o directly adding
additional helper functions in IRQ remapping code. I have not followed
all the discussions, just a thought.

_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 20/21] iommu/vt-d: hpet: Reserve an interrupt remampping table entry for watchdog
  2019-06-21 17:31               ` Jacob Pan
@ 2019-06-21 18:39                 ` Jacob Pan
  2019-06-21 20:05                   ` Thomas Gleixner
  0 siblings, 1 reply; 54+ messages in thread
From: Jacob Pan @ 2019-06-21 18:39 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Kate Stewart, Peter Zijlstra, Jan Kiszka, Ricardo Neri,
	Stephane Eranian, Ingo Molnar, Wincy Van, Ashok Raj, x86,
	Andi Kleen, Borislav Petkov, jacob.jun.pan, Ravi V. Shankar,
	Ricardo Neri, Bjorn Helgaas, Juergen Gross, Tony Luck,
	Randy Dunlap, LKML, iommu, Eric W. Biederman,
	Philippe Ombredanne

On Fri, 21 Jun 2019 10:31:26 -0700
Jacob Pan <jacob.jun.pan@intel.com> wrote:

> On Fri, 21 Jun 2019 17:33:28 +0200 (CEST)
> Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> > On Wed, 19 Jun 2019, Jacob Pan wrote:  
> > > On Tue, 18 Jun 2019 01:08:06 +0200 (CEST)
> > > Thomas Gleixner <tglx@linutronix.de> wrote:    
> > > > 
> > > > Unless this problem is not solved and I doubt it can be solved
> > > > after talking to IOMMU people and studying manuals,    
> > >
> > > I agree. modify irte might be done with cmpxchg_double() but the
> > > queued invalidation interface for IRTE cache flush is shared with
> > > DMA and requires holding a spinlock for enque descriptors, QI tail
> > > update etc.
> > > 
> > > Also, reserving & manipulating IRTE slot for hpet via backdoor
> > > might not be needed if the HPET PCI BDF (found in ACPI) can be
> > > utilized. But it might need more work to add a fake PCI device for
> > > HPET.    
> > 
> > What would PCI/BDF solve?  
> I was thinking if HPET is a PCI device then it can naturally
> gain slots in IOMMU remapping table IRTEs via PCI MSI code. Then
> perhaps it can use the IRQ subsystem to set affinity etc. w/o
> directly adding additional helper functions in IRQ remapping code. I
> have not followed all the discussions, just a thought.
> 
I looked at the code again, seems the per cpu HPET code already taken
care of HPET MSI management. Why can't we use IR-HPET-MSI chip and
domain to allocate and set affinity etc.?
Most APIC timer has ARAT not enough per cpu HPET, so per cpu HPET is
not used mostly.


Jacob
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 20/21] iommu/vt-d: hpet: Reserve an interrupt remampping table entry for watchdog
  2019-06-21 18:39                 ` Jacob Pan
@ 2019-06-21 20:05                   ` Thomas Gleixner
  2019-06-21 23:55                     ` Ricardo Neri
  0 siblings, 1 reply; 54+ messages in thread
From: Thomas Gleixner @ 2019-06-21 20:05 UTC (permalink / raw)
  To: Jacob Pan
  Cc: Kate Stewart, Ravi V. Shankar, Tony Luck, Ashok Raj,
	Peter Zijlstra, Jan Kiszka, x86, Ricardo Neri, Stephane Eranian,
	LKML, Juergen Gross, Bjorn Helgaas, iommu, Philippe Ombredanne,
	Randy Dunlap, Eric W. Biederman, Ricardo Neri, Andi Kleen,
	Borislav Petkov, Ingo Molnar, Wincy Van

On Fri, 21 Jun 2019, Jacob Pan wrote:
> On Fri, 21 Jun 2019 10:31:26 -0700
> Jacob Pan <jacob.jun.pan@intel.com> wrote:
> 
> > On Fri, 21 Jun 2019 17:33:28 +0200 (CEST)
> > Thomas Gleixner <tglx@linutronix.de> wrote:
> > 
> > > On Wed, 19 Jun 2019, Jacob Pan wrote:  
> > > > On Tue, 18 Jun 2019 01:08:06 +0200 (CEST)
> > > > Thomas Gleixner <tglx@linutronix.de> wrote:    
> > > > > 
> > > > > Unless this problem is not solved and I doubt it can be solved
> > > > > after talking to IOMMU people and studying manuals,    
> > > >
> > > > I agree. modify irte might be done with cmpxchg_double() but the
> > > > queued invalidation interface for IRTE cache flush is shared with
> > > > DMA and requires holding a spinlock for enque descriptors, QI tail
> > > > update etc.
> > > > 
> > > > Also, reserving & manipulating IRTE slot for hpet via backdoor
> > > > might not be needed if the HPET PCI BDF (found in ACPI) can be
> > > > utilized. But it might need more work to add a fake PCI device for
> > > > HPET.    
> > > 
> > > What would PCI/BDF solve?  
> > I was thinking if HPET is a PCI device then it can naturally
> > gain slots in IOMMU remapping table IRTEs via PCI MSI code. Then
> > perhaps it can use the IRQ subsystem to set affinity etc. w/o
> > directly adding additional helper functions in IRQ remapping code. I
> > have not followed all the discussions, just a thought.
> > 
> I looked at the code again, seems the per cpu HPET code already taken
> care of HPET MSI management. Why can't we use IR-HPET-MSI chip and
> domain to allocate and set affinity etc.?
> Most APIC timer has ARAT not enough per cpu HPET, so per cpu HPET is
> not used mostly.

Sure, we can use that, but that does not allow to move the affinity from
NMI context either. Same issue with the IOMMU as with the other hack.

Thanks,

	tglx
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 20/21] iommu/vt-d: hpet: Reserve an interrupt remampping table entry for watchdog
  2019-06-21 20:05                   ` Thomas Gleixner
@ 2019-06-21 23:55                     ` Ricardo Neri
  2019-06-22  7:21                       ` Thomas Gleixner
  0 siblings, 1 reply; 54+ messages in thread
From: Ricardo Neri @ 2019-06-21 23:55 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Kate Stewart, Ravi V. Shankar, Tony Luck, Ashok Raj,
	Peter Zijlstra, Jan Kiszka, x86, Ricardo Neri, Stephane Eranian,
	LKML, Juergen Gross, Bjorn Helgaas, iommu, Randy Dunlap,
	Jacob Pan, Philippe Ombredanne, Andi Kleen, Borislav Petkov,
	Ingo Molnar, Wincy Van, Eric W. Biederman

On Fri, Jun 21, 2019 at 10:05:01PM +0200, Thomas Gleixner wrote:
> On Fri, 21 Jun 2019, Jacob Pan wrote:
> > On Fri, 21 Jun 2019 10:31:26 -0700
> > Jacob Pan <jacob.jun.pan@intel.com> wrote:
> > 
> > > On Fri, 21 Jun 2019 17:33:28 +0200 (CEST)
> > > Thomas Gleixner <tglx@linutronix.de> wrote:
> > > 
> > > > On Wed, 19 Jun 2019, Jacob Pan wrote:  
> > > > > On Tue, 18 Jun 2019 01:08:06 +0200 (CEST)
> > > > > Thomas Gleixner <tglx@linutronix.de> wrote:    
> > > > > > 
> > > > > > Unless this problem is not solved and I doubt it can be solved
> > > > > > after talking to IOMMU people and studying manuals,    
> > > > >
> > > > > I agree. modify irte might be done with cmpxchg_double() but the
> > > > > queued invalidation interface for IRTE cache flush is shared with
> > > > > DMA and requires holding a spinlock for enque descriptors, QI tail
> > > > > update etc.
> > > > > 
> > > > > Also, reserving & manipulating IRTE slot for hpet via backdoor
> > > > > might not be needed if the HPET PCI BDF (found in ACPI) can be
> > > > > utilized. But it might need more work to add a fake PCI device for
> > > > > HPET.    
> > > > 
> > > > What would PCI/BDF solve?  
> > > I was thinking if HPET is a PCI device then it can naturally
> > > gain slots in IOMMU remapping table IRTEs via PCI MSI code. Then
> > > perhaps it can use the IRQ subsystem to set affinity etc. w/o
> > > directly adding additional helper functions in IRQ remapping code. I
> > > have not followed all the discussions, just a thought.
> > > 
> > I looked at the code again, seems the per cpu HPET code already taken
> > care of HPET MSI management. Why can't we use IR-HPET-MSI chip and
> > domain to allocate and set affinity etc.?
> > Most APIC timer has ARAT not enough per cpu HPET, so per cpu HPET is
> > not used mostly.
> 
> Sure, we can use that, but that does not allow to move the affinity from
> NMI context either. Same issue with the IOMMU as with the other hack.

If I understand Thomas' point correctly, the problem is having to take
lock in NMI context to update the IRTE for the HPET; both as in my hack
and in the generic irq code. The problem is worse when using the generic
irq code as there are several layers and several locks that need to be
handled.

Thanks and BR,
Ricardo
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [RFC PATCH v4 20/21] iommu/vt-d: hpet: Reserve an interrupt remampping table entry for watchdog
  2019-06-21 23:55                     ` Ricardo Neri
@ 2019-06-22  7:21                       ` Thomas Gleixner
  0 siblings, 0 replies; 54+ messages in thread
From: Thomas Gleixner @ 2019-06-22  7:21 UTC (permalink / raw)
  To: Ricardo Neri
  Cc: Kate Stewart, Ravi V. Shankar, Tony Luck, Ashok Raj,
	Peter Zijlstra, Jan Kiszka, x86, Ricardo Neri, Stephane Eranian,
	LKML, Juergen Gross, Bjorn Helgaas, iommu, Randy Dunlap,
	Jacob Pan, Philippe Ombredanne, Andi Kleen, Borislav Petkov,
	Ingo Molnar, Wincy Van, Eric W. Biederman

On Fri, 21 Jun 2019, Ricardo Neri wrote:
> On Fri, Jun 21, 2019 at 10:05:01PM +0200, Thomas Gleixner wrote:
> > On Fri, 21 Jun 2019, Jacob Pan wrote:
> > > > 
> > > I looked at the code again, seems the per cpu HPET code already taken
> > > care of HPET MSI management. Why can't we use IR-HPET-MSI chip and
> > > domain to allocate and set affinity etc.?
> > > Most APIC timer has ARAT not enough per cpu HPET, so per cpu HPET is
> > > not used mostly.
> > 
> > Sure, we can use that, but that does not allow to move the affinity from
> > NMI context either. Same issue with the IOMMU as with the other hack.
> 
> If I understand Thomas' point correctly, the problem is having to take
> lock in NMI context to update the IRTE for the HPET; both as in my hack
> and in the generic irq code. The problem is worse when using the generic
> irq code as there are several layers and several locks that need to be
> handled.

It does not matter how many locks are involved. One is enough to wedge the
machine.

Thanks,

	tglx


_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 54+ messages in thread

end of thread, back to index

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-24  1:16 [RFC PATCH v4 00/21] Implement an HPET-based hardlockup detector Ricardo Neri
2019-05-24  1:16 ` [RFC PATCH v4 01/21] x86/msi: Add definition for NMI delivery mode Ricardo Neri
2019-05-24  1:16 ` [RFC PATCH v4 02/21] x86/hpet: Expose hpet_writel() in header Ricardo Neri
2019-05-24  1:16 ` [RFC PATCH v4 03/21] x86/hpet: Calculate ticks-per-second in a separate function Ricardo Neri
2019-06-14 15:54   ` Thomas Gleixner
2019-06-14 15:59     ` Thomas Gleixner
2019-06-18 22:48     ` Ricardo Neri
2019-06-18 23:13       ` Thomas Gleixner
2019-05-24  1:16 ` [RFC PATCH v4 04/21] x86/hpet: Add hpet_set_comparator() for periodic and one-shot modes Ricardo Neri
2019-06-14 18:17   ` Thomas Gleixner
2019-06-18 22:48     ` Ricardo Neri
2019-05-24  1:16 ` [RFC PATCH v4 05/21] x86/hpet: Reserve timer for the HPET hardlockup detector Ricardo Neri
2019-06-11 19:54   ` Thomas Gleixner
2019-06-14  1:14     ` Ricardo Neri
2019-06-14 16:10       ` Thomas Gleixner
2019-06-18 22:48         ` Ricardo Neri
2019-05-24  1:16 ` [RFC PATCH v4 06/21] x86/hpet: Configure the timer used by the " Ricardo Neri
2019-05-24  1:16 ` [RFC PATCH v4 07/21] watchdog/hardlockup: Define a generic function to detect hardlockups Ricardo Neri
2019-05-24  1:16 ` [RFC PATCH v4 08/21] watchdog/hardlockup: Decouple the hardlockup detector from perf Ricardo Neri
2019-05-24  1:16 ` [RFC PATCH v4 09/21] x86/nmi: Add a NMI_WATCHDOG NMI handler category Ricardo Neri
2019-05-24  1:16 ` [RFC PATCH v4 10/21] watchdog/hardlockup: Add function to enable NMI watchdog on all allowed CPUs at once Ricardo Neri
2019-05-24  1:16 ` [RFC PATCH v4 11/21] x86/watchdog/hardlockup: Add an HPET-based hardlockup detector Ricardo Neri
2019-05-24  1:16 ` [RFC PATCH v4 12/21] watchdog/hardlockup/hpet: Adjust timer expiration on the number of monitored CPUs Ricardo Neri
2019-06-11 20:11   ` Thomas Gleixner
2019-06-18 22:46     ` Ricardo Neri
2019-05-24  1:16 ` [RFC PATCH v4 13/21] x86/watchdog/hardlockup/hpet: Determine if HPET timer caused NMI Ricardo Neri
2019-05-24  1:16 ` [RFC PATCH v4 14/21] watchdog/hardlockup: Use parse_option_str() to handle "nmi_watchdog" Ricardo Neri
2019-05-24  1:16 ` [RFC PATCH v4 15/21] watchdog/hardlockup/hpet: Only enable the HPET watchdog via a boot parameter Ricardo Neri
2019-05-24  1:16 ` [RFC PATCH v4 16/21] x86/watchdog: Add a shim hardlockup detector Ricardo Neri
2019-05-24  1:16 ` [RFC PATCH v4 17/21] x86/tsc: Switch to perf-based hardlockup detector if TSC become unstable Ricardo Neri
2019-06-07  0:35   ` Stephane Eranian via iommu
2019-06-07 14:14     ` Ricardo Neri
2019-05-24  1:16 ` [RFC PATCH v4 18/21] x86/apic: Add a parameter for the APIC delivery mode Ricardo Neri
2019-06-16  9:55   ` Thomas Gleixner
2019-06-18 22:47     ` Ricardo Neri
2019-06-18 23:15       ` Thomas Gleixner
2019-05-24  1:16 ` [RFC PATCH v4 19/21] iommu/vt-d: Rework prepare_irte() to support per-irq " Ricardo Neri
2019-05-24  1:16 ` [RFC PATCH v4 20/21] iommu/vt-d: hpet: Reserve an interrupt remampping table entry for watchdog Ricardo Neri
2019-06-16 18:42   ` Thomas Gleixner
2019-06-16 19:21   ` Thomas Gleixner
2019-06-17  8:25     ` Thomas Gleixner
2019-06-17 21:38       ` Stephane Eranian via iommu
2019-06-17 23:08         ` Thomas Gleixner
2019-06-19 15:43           ` Jacob Pan
2019-06-21 15:33             ` Thomas Gleixner
2019-06-21 17:31               ` Jacob Pan
2019-06-21 18:39                 ` Jacob Pan
2019-06-21 20:05                   ` Thomas Gleixner
2019-06-21 23:55                     ` Ricardo Neri
2019-06-22  7:21                       ` Thomas Gleixner
2019-06-18 22:45       ` Ricardo Neri
2019-05-24  1:16 ` [RFC PATCH v4 21/21] x86/watchdog/hardlockup/hpet: Support interrupt remapping Ricardo Neri
2019-06-16  8:44   ` Thomas Gleixner
2019-06-16  8:53     ` Thomas Gleixner

IOMMU Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-iommu/0 linux-iommu/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-iommu linux-iommu/ https://lore.kernel.org/linux-iommu \
		iommu@lists.linux-foundation.org iommu@archiver.kernel.org
	public-inbox-index linux-iommu


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.linux-foundation.lists.iommu


AGPL code for this site: git clone https://public-inbox.org/ public-inbox