* [PATCH v7 00/24] x86: Implement an HPET-based hardlockup detector
@ 2023-03-01 23:47 ` Ricardo Neri
  0 siblings, 0 replies; 52+ messages in thread
From: Ricardo Neri @ 2023-03-01 23:47 UTC (permalink / raw)
  To: Tony Luck, Dave Hansen, Rafael J. Wysocki, Reinette Chatre,
	Dan Williams, Len Brown
  Cc: Andi Kleen, Stephane Eranian, Ravi V. Shankar, Ricardo Neri,
	linuxppc-dev, iommu, linux-kernel, Ricardo Neri

Hi x86 trusted reviewers,

This is the seventh version of this patchset. I acknowledge that it took me
a long time to post a new version. Sorry! I will commit time to continue
working on this series with high priority and I will post a new series soon
after receiving your new feedback.

Although this series touches several subsystems, I plan to send it to the
x86 maintainers because a) the series does not make much sense if split
into subsystems, b) Thomas Gleixner has reviewed previous versions, and c)
he has contributed to all the subsystems I modify.

Tony Luck has kindly reviewed previous versions of the series and I carried
his Reviewed-by tags. This version, however, has new patches that also need
review.

I seek to collect the Reviewed-by tags from the x86 trusted reviewers for
the following patches:
   + arch/x86: 4, 5
   + Intel IOMMU: 6,
   + AMD IOMMU: 9, 10, 11,
   + NMI watchdog: 23 and 24.

Previous versions of the patches can be found in [1], [2], [3], [4], [5],
and [6].

Thanks and BR,
Ricardo

== Problem statement ==

In x86, the NMI watchdog is driven using a counter (one per CPU) of the
Performance Monitoring Unit (PMU). PMU counters, however, are scarce and
it would be better to keep them available to profile performance.

Moreover, since the counter that the NMI watchdog uses cannot be shared
with any other perf users, the current perf-event subsystem may return a
false positive when validating certain metric groups. Certain metric groups
may never get a chance to be scheduled [7], [8].

== Solution ==

Find a different source of NMIs to drive the watchdog. The HPET timer can
also generate NMIs at regular intervals.

== Operation of the watchdog ==

One HPET channel is reserved for the NMI watchdog. One of the CPUs that the
watchdog monitors is designated to handle the NMI from the HPET channel.
In addition to checking itself for hardlockups, it sends an inter-processor
NMI to the rest of the CPUs in the system. They, in turn, check for
hardlockups if required: all CPUs will get the NMI (including offline
CPUs), but only those being monitored will look for hardlockups.
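
To illustrate the flow described above, here is a minimal user-space model,
not code from the series. The names hld_model, model_cpu_nmi() and
model_hpet_nmi() are hypothetical. A plain bitmask stands in for the cpumask
of monitored CPUs, and the all-but-self IPI is modeled as a loop:

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NR_CPUS 8

/* Hypothetical model state; not the data structures used in the patches. */
struct hld_model {
	uint32_t monitored_mask;		/* bit n set: CPU n is monitored */
	unsigned int handling_cpu;		/* CPU targeted by the HPET channel */
	unsigned int checks[MODEL_NR_CPUS];	/* hardlockup checks done per CPU */
};

/* Every CPU receives the NMI, but only monitored CPUs look for hardlockups. */
static void model_cpu_nmi(struct hld_model *m, unsigned int cpu)
{
	if (m->monitored_mask & (UINT32_C(1) << cpu))
		m->checks[cpu]++;
}

/* The HPET channel fired: the handling CPU checks itself, then kicks the rest. */
static void model_hpet_nmi(struct hld_model *m)
{
	unsigned int cpu;

	assert(m->handling_cpu < MODEL_NR_CPUS);

	model_cpu_nmi(m, m->handling_cpu);

	/* Stand-in for an inter-processor NMI sent to all CPUs but self. */
	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		if (cpu != m->handling_cpu)
			model_cpu_nmi(m, cpu);
	}
}
```

With CPUs 0, 1 and 3 monitored and CPU 1 handling the HPET NMI, one call to
model_hpet_nmi() performs one check on each of those three CPUs and none on
the others, even though all of them received the NMI.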

The time-stamp counter (TSC) is used to determine whether the HPET caused
the NMI. The detector computes the value the TSC is expected to have the
next time the HPET channel generates an NMI. Once it does, we read the
actual TSC value. If the difference between the expected and the actual
values is less than a certain value, assume that the source of the NMI is
the HPET channel. I have found experimentally that the difference is less
than 0.2% of the watchdog period (expressed in TSC counts, see patch 20 for
details).
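
The check can be sketched as follows. This is a user-space illustration
under stated assumptions, not the code in patch 20: the function names are
hypothetical, the window of 0.2% of the period is only the experimental
bound quoted above (the window the patch actually uses may differ), and
TSC wraparound is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumption for illustration: use the observed bound, 0.2% of the
 * watchdog period in TSC counts, as the error window (period / 500).
 */
static uint64_t tsc_error_window(uint64_t period_tsc)
{
	assert(period_tsc > 0);
	return period_tsc / 500;
}

/*
 * Return true if the TSC value read in the NMI handler is close enough
 * to the value expected at the HPET channel expiration.
 */
static bool hpet_caused_nmi(uint64_t tsc_now, uint64_t tsc_expected,
			    uint64_t window)
{
	uint64_t diff = tsc_now > tsc_expected ? tsc_now - tsc_expected
						: tsc_expected - tsc_now;

	return diff < window;
}
```

For example, with a 10-second period on a 2 GHz TSC the period is
20,000,000,000 counts and the window in this sketch is 40,000,000 counts.
An NMI that arrives 1,000 counts after the expected value is attributed to
the HPET channel; one that arrives 50,000,000 counts late is not.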

== Limitations ==

Since all CPUs get the HPET NMI, any isolated CPUs would be disrupted. If
isolated CPUs are specified in the kernel command line, this detector is
not enabled.

The detector may not be available because the HPET is not enabled in all
systems (see commit 6e3cd95234dc ("x86/hpet: Use another crystalball to
evaluate HPET usability")).

The detector depends on having IPI shorthands enabled.

On AMD systems with interrupt remapping enabled, the detector can only be
used in APIC physical mode (see patch 10).

Unlike the perf-based implementation, all monitored CPUs will look for
hardlockups every watchdog_thresh seconds, regardless of their idle
state.

Thus, I envision this detector as an opt-in feature for those interested
in claiming the PMU counters.
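
The conditions above can be summarized as a single predicate. This is only
a sketch: struct hld_env, its fields and hpet_hld_usable() are hypothetical
names for illustration, and the patches perform these checks in different
places.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of the system state relevant to the detector. */
struct hld_env {
	bool opted_in;			/* requested on the kernel command line */
	bool hpet_enabled;		/* the HPET is usable on this system */
	bool ipi_shorthands_enabled;
	bool isolated_cpus_requested;	/* isolated CPUs on the command line */
	bool amd_interrupt_remapping;	/* AMD system with interrupt remapping */
	bool apic_physical_mode;
};

/* Return true only if none of the limitations listed above applies. */
static bool hpet_hld_usable(const struct hld_env *e)
{
	assert(e != NULL);

	if (!e->opted_in || !e->hpet_enabled || !e->ipi_shorthands_enabled)
		return false;

	/* The HPET NMI reaches all CPUs and would disrupt isolated ones. */
	if (e->isolated_cpus_requested)
		return false;

	/* With AMD interrupt remapping, only APIC physical mode works. */
	if (e->amd_interrupt_remapping && !e->apic_physical_mode)
		return false;

	return true;
}
```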

== Parts of this series ==

For clarity, patches are grouped as follows:

 1) IRQ updates: patches 1-11 refactor parts of the interrupt subsystem
    to support per-interrupt delivery mode and configure the delivery mode
    as NMI.

 2) HPET updates: patches 12-15 are prework in the HPET code to reserve a
    timer that can deliver NMIs.

 3) NMI watchdog: patches 16-18 update the existing hardlockup detector
    to uncouple it from perf.

 4) New HPET-based hardlockup detector: patches 19-21

 5) Hardlockup detector management: patches 22-24 are a collection of
    miscellaneous patches to determine when to use the HPET hardlockup
    detector and stop it if necessary. It also includes an x86-specific
    shim hardlockup detector that selects between the perf- and HPET-based
    implementations.

== Testing ==

Tests were conducted on the master branch of the tip tree. I have put my
patches here:

https://github.com/ricardon/tip.git rneri/hpet-wdt-v7

+++ Tests for functional correctness

I tested this series in a variety of server parts: Intel's Sapphire Rapids,
Cooperlake, Icelake, Cascadelake, Snowridge, Skylake, Haswell, Broadwell,
and IvyTown as well as AMD's Rome.

I also tested the series in desktop parts such as Alderlake and Haswell,
but had to use hpet=force in the command line.

On these systems, the detector works with and without interrupt remapping,
in both APIC physical and logical destination modes.

I used the test_lockup kernel module to ensure that the hardlockups were
detected:

$ echo 10 > /proc/sys/kernel/watchdog_thresh
$ modprobe test_lockup time_secs=20 iterations=1 state=R disable_irq

Warnings in dmesg indicate that the hardlockup was detected.

I verified that the detector can be stopped, started, and reconfigured.
CPUs can also be added and removed from monitoring. All of this was done
using the interfaces in /proc/sys/kernel.

Also, please see patch 20 for further details on the difference between
the expected and actual TSC values.

== Changelog ==

Changes since v6:
 + Unchanged patches: 3, 6, 7-9, 12, 14, 16-18, 22
 + Updated patches: 1, 2, 10, 11, 13, 15, 19-21, 24
 + New patches: 4, 5, 23
 
 * Dropped patch to expose irq_matrix_find_best_cpu(). (Thomas)
 * Implemented a separate local APIC NMI controller. (Thomas)
 * Removed superfluous checks for X86_IRQ_ALLOC_AS_NMI && nr_irqs in the
   interrupt remapping drivers. (Thomas)
 * Skip vector cleanup for NMI.
 * Dropped patch that created a new NMI_WATCHDOG NMI category.
 * Do not use the HPET NMI watchdog with isolated CPUs.
 * Added a new `hpet_nmi_watchdog` kernel parameter. Do not reuse
   the existing nmi_watchdog. (Nicholas)
 * Fixed a bug in which I incorrectly used an error window of 0.2% of
   tsc_delta.
 * If the TSC becomes unstable, simply stop the HPET NMI watchdog, do not
   start the perf NMI watchdog. (Thomas, Nicholas)

Changes since v5:
 + Unchanged patches: 14, 15, 18, 19, 24, 25, 28
 + Updated patches: 2, 8, 17, 21-23, 26, 29
 + New patches: 1, 3-7, 9-16, 20, 27

 * Added support in the interrupt subsystem to allocate IRQs with NMI mode.
   (patches 1-13)
 * Only enable the detector when IPI shorthands are enabled in the system.
   (patch 22)
 * Delayed the initialization of the hardlockup detector until after
   calling smp_init(). Only then, we know whether IPI shorthands have been
   enabled. (patch 20)
 * Removed code to periodically update the affinity of the HPET NMI to
   rotate among packages or groups of packages.
 * Removed logic to group the monitored CPUs by groups. All CPUs in all
   packages receive IPIs.
 * Removed irq_work to change the affinity of the HPET channel interrupt.
 * Updated the redirection hint in the Intel IOMMU IRTE to match the
   destination mode. (patch 7)
 * Correctly added support for NMI delivery mode in the AMD IOMMU.
   (patches 11-13)
 * Restart the NMI watchdog after refining tsc_khz. (patch 27)
 * Added a check for the allowed maximum frequency of the HPET. (patch 17)
 * Added hpet_hld_free_timer() to properly free the reserved HPET channel
   if the initialization is not completed. (patch 17)
 * Reworked the periodic setting of the HPET channel. Rather than changing
   it every time the channel is disabled or enabled, do it only once. While
   at it, wrap the code in an initial setup function. (patch 22)
 * Implemented hardlockup_detector_hpet_start() to be called when tsc_khz
   is refined. (patch 22)
 * Reduced the error window of the expected TSC value at the time of the
   HPET channel expiration. (patch 23)
 * Described better the heuristics used to determine if the HPET channel
   caused the NMI. (patch 23)
 * Added a table to characterize the error in the expected TSC value when
   the HPET channel fires. (patch 23)
 * Added watchdog_nmi_start() to be used when tsc_khz is recalibrated.
   (patch 26)
 * Always build the x86-specific hardlockup detector shim; not only
   when the HPET-based detector is selected. (patch 26)
 * Relocated the declaration of hardlockup_detector_switch_to_perf() to
   x86/nmi.h. It does not depend on HPET.
 * Use time_in_range64() to compare the actual TSC value vs the expected
   value. This makes it more readable. (patch 22)
 * Dropped pointless X86_64 || X86_32 check in Kconfig. (patch 26)

Changes since v4:
 * Added commentary on the tests performed on this feature. (Andi)
 * Added a stub version of hardlockup_detector_switch_to_perf() for
   !CONFIG_HPET_TIMER. (lkp)
 * Use switch to select the type of x86 hardlockup detector. (Andi)
 * Renamed a local variable in update_ticks_per_group(). (Andi)
 * Made this hardlockup detector available to X86_32.
 * Reworked logic to kick the HPET timer to remove a local variable.
   (Andi)
 * Added a comment on what type of timer channel will be assigned to the
   detector. (Andi)
 * Reworded help comment in the X86_HARDLOCKUP_DETECTOR_HPET Kconfig
   option. (Andi)
 * Removed unnecessary switch to level interrupt mode when disabling the
   timer. (Andi)
 * Disabled the HPET timer to avoid a race between an incoming interrupt
   and an update of the MSI destination ID. (Ashok)
 * Renamed hpet_hardlockup_detector_get_timer() as hpet_hld_get_timer()
 * Added commentary on an undocumented udelay() when programming an
   HPET channel in periodic mode. (Ashok)
 * Reworked code to use new enumeration apic_delivery_modes and reworked
   MSI message composition fields [9].
 * Partitioned monitored CPUs into groups. Each CPU in the group is
   inspected for hardlockups using an IPI.
 * Use a round-robin mechanism to update the affinity of the HPET timer.
   Affinity is updated every watchdog_thresh seconds to target the
   handling CPU of the group.
 * Moved update of the HPET interrupt affinity to an irq_work. (Thomas
   Gleixner).
 * Updated expiration of the HPET timer and the expected value of the
   TSC based on the number of groups of monitored CPUs.
 * Renamed hpet_set_comparator() to hpet_set_comparator_periodic() to
   remove decision logic for periodic case. (Thomas Gleixner)
 * Reworked timer reservation to use Thomas' rework on HPET channel
   management [10].
 * Removed hard-coded channel number for the hardlockup detector.
 * Provided more details on the sequence of HPET channel reservation.
   (Thomas Gleixner)
 * Only reserve a channel for the hardlockup detector if enabled via
   kernel command line. The function reserving the channel is called from
   hardlockup detector. (Thomas Gleixner)
 * Dropped hpet_hld_data::enabled_cpus and instead use cpumask_weight().
 * Renamed hpet_hld_data::cpu_monitored_mask to
   hld_data_data.cpu_monitored_mask and converted it to cpumask_var_t.
 * Flushed out any outstanding interrupt before enabling the HPET channel.
 * Removed unnecessary MSI_DATA_LEVEL_ASSERT from the MSI message.
 * Added comments in hardlockup_detector_nmi_handler() to explain how
   CPUs are targeted for an IPI.
 * Updated code to only issue an IPI when needed (i.e., there are CPUs in
   the group other than the handling CPU).
 * Reworked hardlockup_detector_hpet_init() for readability.
 * Now reserve the cpumasks in the hardlockup detector code and not in the
   generic HPET code.
 * Handle the case of watchdog_thresh = 0 when disabling the detector.

Changes since v3:
 * Fixed yet another bug in periodic programming of the HPET timer that
   prevented the system from booting.
 * Fixed computation of HPET frequency to use hpet_readl() only.
 * Added a missing #include in watchdog_hld_hpet.c
 * Fixed various typos and grammar errors (Randy Dunlap)

Changes since v2:
 * Added functionality to switch to the perf-based hardlockup
   detector if the TSC becomes unstable (Thomas Gleixner).
 * Brought back the round-robin mechanism proposed in v1 (this time not
   using the interrupt subsystem). This also requires computing
   expiration times as in v1 (Andi Kleen, Stephane Eranian).
 * Fixed a bug in which using a periodic timer was not working (thanks
   to Suravee Suthikulpanit!).
 * In this version, I incorporate support for interrupt remapping in the
   last 4 patches so that they can be reviewed separately if needed.
 * Removed redundant documentation of functions (Thomas Gleixner).
 * Added a new category of NMI handler, NMI_WATCHDOG, which executes after
   NMI_LOCAL handlers (Andi Kleen).
 * Updated handling of "nmi_watchdog" to support comma-separated
   arguments.
 * Undid split of the generic hardlockup detector into a separate file
   (Thomas Gleixner).
 * Added a new intermediate symbol CONFIG_HARDLOCKUP_DETECTOR_CORE to
   select generic parts of the detector (Paul E. McKenney,
   Thomas Gleixner).
 * Removed use of struct cpumask in favor of a variable length array in
   conjunction with kzalloc (Peter Zijlstra).
 * Added CPU as argument to hardlockup_detector_hpet_enable()/disable()
   (Thomas Gleixner).
 * Removed unnecessary export of function declarations, flags, and bit
   fields (Thomas Gleixner).
 * Removed unnecessary check for FSB support when reserving timer for the
   detector (Thomas Gleixner).
 * Separated TSC code from HPET code in kick_timer() (Thomas Gleixner).
 * Reworked condition to check if the expected TSC value is within the
   error margin to avoid a conditional (Peter Zijlstra).
 * Removed TSC error margin from struct hld_data; use global variable
   instead (Peter Zijlstra).
 * Removed previously introduced watchdog_get_allowed_cpumask*() and
   reworked hardlockup_detector_hpet_enable()/disable() to not need
   access to watchdog_allowed_mask (Thomas Gleixner).

Changes since v1:
 * Removed reads to HPET registers at every NMI. Instead use the time-stamp
   counter to infer the interrupt source (Thomas Gleixner, Andi Kleen).
 * Do not target CPUs in a round-robin manner. Instead, the HPET timer
   always targets the same CPU; other CPUs are monitored via an
   interprocessor interrupt.
 * Removed use of generic irq code to set interrupt affinity and NMI
   delivery. Instead, configure the interrupt directly in HPET registers
   (Thomas Gleixner).
 * Removed the proposed ops structure for NMI watchdogs. Instead, split
   the existing implementation into a generic library and perf-specific
   infrastructure (Thomas Gleixner, Nicholas Piggin).
 * Added an x86-specific shim hardlockup detector that selects between
   HPET and perf infrastructures as needed (Nicholas Piggin).
 * Removed locks taken in NMI and !NMI context. This was wrong and is no
   longer needed (Thomas Gleixner).
 * Fixed unconditional return NMI_HANDLED when the HPET timer is programmed
   for FSB/MSI delivery (Peter Zijlstra).

[1]. https://lore.kernel.org/lkml/1528851463-21140-1-git-send-email-ricardo.neri-calderon@linux.intel.com/
[2]. https://lore.kernel.org/lkml/1551283518-18922-1-git-send-email-ricardo.neri-calderon@linux.intel.com/
[3]. https://lore.kernel.org/lkml/1557842534-4266-1-git-send-email-ricardo.neri-calderon@linux.intel.com/
[4]. https://lore.kernel.org/lkml/1558660583-28561-1-git-send-email-ricardo.neri-calderon@linux.intel.com/
[5]. https://lore.kernel.org/lkml/20210504190526.22347-1-ricardo.neri-calderon@linux.intel.com/T/
[6]. https://lore.kernel.org/all/20220506000008.30892-1-ricardo.neri-calderon@linux.intel.com/
[7]. https://lore.kernel.org/lkml/20200117091341.GX2827@hirez.programming.kicks-ass.net/
[8]. https://lore.kernel.org/lkml/1582581564-184429-1-git-send-email-kan.liang@linux.intel.com/
[9]. https://lore.kernel.org/all/20201024213535.443185-6-dwmw2@infradead.org/
[10]. https://lore.kernel.org/lkml/20190623132340.463097504@linutronix.de/

Ricardo Neri (24):
  x86/apic: Add irq_cfg::delivery_mode
  x86/apic/msi: Use the delivery mode from irq_cfg for message
    composition
  x86/apic: Add the X86_IRQ_ALLOC_AS_NMI interrupt allocation flag
  x86/apic/vector: Implement a local APIC NMI controller
  x86/apic/vector: Skip cleanup for the NMI vector
  iommu/vt-d: Clear the redirection hint when the destination mode is
    physical
  iommu/vt-d: Rework prepare_irte() to support per-interrupt delivery
    mode
  iommu/vt-d: Set the IRTE delivery mode individually for each interrupt
  iommu/amd: Expose [set|get]_dev_entry_bit()
  iommu/amd: Enable NMIPass when allocating an NMI
  iommu/amd: Compose MSI messages for NMIs in non-IR format
  x86/hpet: Expose hpet_writel() in header
  x86/hpet: Add helper function hpet_set_comparator_periodic()
  x86/hpet: Prepare IRQ assignments to use the X86_ALLOC_AS_NMI flag
  x86/hpet: Reserve an HPET channel for the hardlockup detector
  watchdog/hardlockup: Define a generic function to detect hardlockups
  watchdog/hardlockup: Decouple the hardlockup detector from perf
  init/main: Delay initialization of the lockup detector after
    smp_init()
  x86/watchdog/hardlockup: Add an HPET-based hardlockup detector
  x86/watchdog/hardlockup/hpet: Determine if HPET timer caused NMI
  watchdog/hardlockup/hpet: Only enable the HPET watchdog via a boot
    parameter
  x86/watchdog: Add a shim hardlockup detector
  watchdog: Introduce hardlockup_detector_mark_unavailable()
  x86/tsc: Stop the HPET hardlockup detector if TSC become unstable

 .../admin-guide/kernel-parameters.txt         |   8 +
 arch/x86/Kconfig.debug                        |  11 +
 arch/x86/include/asm/hpet.h                   |  49 ++
 arch/x86/include/asm/hw_irq.h                 |   5 +-
 arch/x86/include/asm/irqdomain.h              |   1 +
 arch/x86/include/asm/nmi.h                    |   6 +
 arch/x86/kernel/Makefile                      |   3 +
 arch/x86/kernel/apic/apic.c                   |   2 +-
 arch/x86/kernel/apic/vector.c                 |  67 +++
 arch/x86/kernel/hpet.c                        | 157 +++++-
 arch/x86/kernel/tsc.c                         |   3 +
 arch/x86/kernel/watchdog_hld.c                |  97 ++++
 arch/x86/kernel/watchdog_hld_hpet.c           | 449 ++++++++++++++++++
 drivers/iommu/amd/amd_iommu.h                 |   3 +
 drivers/iommu/amd/init.c                      |   4 +-
 drivers/iommu/amd/iommu.c                     |  31 +-
 drivers/iommu/intel/irq_remapping.c           |  23 +-
 include/linux/irq.h                           |   5 +
 include/linux/nmi.h                           |   8 +-
 init/main.c                                   |   4 +-
 kernel/Makefile                               |   2 +-
 kernel/watchdog.c                             |  20 +
 kernel/watchdog_hld.c                         |  50 +-
 lib/Kconfig.debug                             |   4 +
 24 files changed, 962 insertions(+), 50 deletions(-)
 create mode 100644 arch/x86/kernel/watchdog_hld.c
 create mode 100644 arch/x86/kernel/watchdog_hld_hpet.c

-- 
2.25.1


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v7 00/24] x86: Implement an HPET-based hardlockup detector
@ 2023-03-01 23:47 ` Ricardo Neri
  0 siblings, 0 replies; 52+ messages in thread
From: Ricardo Neri @ 2023-03-01 23:47 UTC (permalink / raw)
  To: Tony Luck, Dave Hansen, Rafael J. Wysocki, Reinette Chatre,
	Dan Williams, Len Brown
  Cc: Ravi V. Shankar, Andi Kleen, Ricardo Neri, Ricardo Neri,
	Stephane Eranian, linux-kernel, iommu, linuxppc-dev

Hi x86 trusted reviewers,

This is the seventh version of this patchset. I acknowledge that it took me
a long time to post a new version. Sorry! I will commit time to continue
working on this series with high priority and I will post a new series soon
after receiving your new feedback.

Although this series touches several subsystems, I plan to send it to the
x86 maintainers because a) the series does not make much sense if split
into subsystems, b) Thomas Gleixner has reviewed previous versions, and c)
he has contributed to all the subsystems I modify.

Tony Luck has kindly reviewed previous versions of the series and I carried
his Reviewed-by tags. This version, however, has new patches that also need
review.

I seek to collect the Reviewed-by tags from the x86 trusted reviewers for
the following patches:
   + arch/x86: 4, 5
   + Intel IOMMU: 6,
   + AMD IOMMU: 9, 10, 11,
   + NMI watchdog: 23 and 24.

Previous version of the patches can be found in [1], [2], [3], [4], [5],
and [6].

Thanks and BR,
Ricardo

== Problem statement

In x86, the NMI watchdog is driven using a counter (one per CPU) of the
Performance Monitoring Unit (PMU). PMU counters, however, are scarce and
it would be better to keep them available to profile performance.

Moreover, since the counter that the NMI watchdog uses cannot be shared
with any other perf users, the current perf-event subsystem may return a
false positive when validating certain metric groups. Certain metric groups
may never get a chance to be scheduled [7], [8].

== Solution

Find a different source of NMIs to drive the watchdog. The HPET timer can
also generate NMIs at regular intervals.

== Operation of the watchdog

One HPET channel is reserved for the NMI watchdog. One of the CPUs that the
watchdog monitors is designated to handle the NMI from the HPET channel.
In addition to checking itself for hardlockups, it sends an inter-processor
NMI to the rest of the CPUs in the system. They, in turn, check for
hardlockups if required: all CPUs will get the NMI (including offline
CPUs), but only those being monitored will look for hardlockups.

The time-stamp counter (TSC) is used to determine whether the HPET caused
the NMI. The detector computes the value the TSC is expected to have the
next time the HPET channel generates an NMI. Once it does, we read the
actual TSC value. If the difference between the expected and the actual
values is less than a certain value, assume that the source of the NMI is
the HPET channel. I have found experimentally that difference is of less
than 0.2% of the watchdog period (expressed in TSC counts, see patch 20 for
details).

== Limitations

Since all CPUs get the HPET NMI, any isolated CPUs would be disrupted. If
isolated CPUs are specified in the kernel command line, this detector is
not enabled.

The detector may not be available because the HPET is not enabled in all
systems (see commit 6e3cd95234dc ("x86/hpet: Use another crystalballto
evaluate HPET usability")).

The detector depends on having IPI shorthands enabled.

On AMD systems with interrupt remapping enabled, the detector can only be
used in APIC physical mode (see patch 10).

Unlike the perf-based implementation, all monitored CPUs will look for
hardlockups every watchdog_thresh seconds, regardless of their idle
state.

Thus, I envision this detector an opt-in feature for those interested in
claiming the PMU counters.

== Parts of this series ==

For clarity, patches are grouped as follows:

 1) IRQ updates: patches 1-11 refactor parts of the interrupt subsystem
    to support per-interrupt delivery mode and configure the delivery mode
    as NMI.

 2) HPET updates: patches 12-15 are prework in the HPET code to reserve a
    timer that can deliver NMIs.

 3) NMI watchdog: patches 16-18 update the existing hardlockup detector
    to uncouple it from perf.

 4) New HPET-based hardlockup detector: patches 19-21

 5) Hardlockup detector management: patches 22-24 are a collection of
    miscellaneous patches to determine when to use the HPET hardlockup
    detector and stop it if necessary. It also includes an x86-specific
    shim hardlockup detector that selects between the perf- and HPET-based
    implementations.

== Testing ==

Tests were conducted on the master branch of the tip tree. I have put my
patches here here:

https://github.com/ricardon/tip.git rneri/hpet-wdt-v7

+++ Tests for functional correctness

I tested this series in a variety of server parts: Intel's Sapphire Rapids,
Cooperlake, Icelake, Cascadelake, Snowridge, Skylake, Haswell, Broadwell,
and IvyTown as well as AMD's Rome.

I also tested the series in desktop parts such as Alderlake and Haswell,
but had to use hpet=force in the command line.

On these systems, the detector works with and without interrupt remapping,
in both APIC physical and logical destination modes.

I used the test_lockup kernel module to ensure that the hardlockups were
detected:

$ echo 10 > /proc/sys/kernel/watchdog_thresh
$ modprobe test_lockup time_secs=20 iterations=1 state=R disable_irq

Warnings in dmesg indicate that the hardlockup was detected.

I verified that the detector can be stopped, started, and reconfigured.
CPUs can also be added and removed from monitoring. All these using the
interfaces from /proc/sys/kernel.

Also, please see patch 20 for further details on the difference between
the expected and actual TSC values.

== Changelog ==

Changes since v6:
 + Unchanged patches: 3, 6, 7-9, 12, 14, 16-18, 22, 
 + Updated patches: 1, 2, 10, 11, 13, 15, 19,-21, 24
 + New patches: 4, 5, 23
 
 * Dropped patch to expose irq_matrix_find_best_cpu(). (Thomas)
 * Implemented a separate local APIC NMI controller. (Thomas)
 * Removed superfluous checks for X86_IRQ_ALLOC_AS_NMI && nr_irqs in the
   interrupt remapping drivers. (Thomas)
 * Skip vector cleanup for NMI.
 * Dropped patch that created a new NMI_WATCHDOG NMI category.
 * Do not use the HPET NMI watchdog with isolated CPUs.
 * Added a new `hpet_nmi_watchdog` kernel parameter. Do not reuse
   the existing nmi_watchdog. (Nicholas)
 * Fixed a bug in which I incorrectly used an error window of 0.2% of
   tsc_delta.
 * If the TSC becomes unstable, simply stop the HPET NMI watchdog, do not
   start the perf NMI watchdog. (Thomas, Nicholas)

Changes since v5:
 + Unchanged patches: 14, 15, 18, 19, 24, 25, 28
 + Updated patches: 2, 8, 17, 21-23, 26, 29
 + New patches: 1, 3-7, 9-16, 20, 27

 * Added support in the interrupt subsystem to allocate IRQs with NMI mode.
  (patches 1-13)
 * Only enable the detector when IPI shorthands are enabled in the system.
  (patch 22)
 * Delayed the initialization of the hardlockup detector until after
   calling smp_init(). Only then, we know whether IPI shorthands have been
   enabled. (patch 20)
 * Removed code to periodically update the affinity of the HPET NMI to
   rotate among packages or groups of packages.
 * Removed logic to group the monitored CPUs by groups. All CPUs in all
   packages receive IPIs.
 * Removed irq_work to change the affinity of the HPET channel interrupt.
 * Updated the redirection hint in the Intel IOMMU IRTE to match the
   destination mode. (patch 7)
 * Correctly added support for NMI delivery mode in the AMD IOMMU.
   (patches 11-13)
 * Restart the NMI watchdog after refining tsc_khz. (patch 27)
 * Added a check for the allowed maximum frequency of the HPET. (patch 17)
 * Added hpet_hld_free_timer() to properly free the reserved HPET channel
   if the initialization is not completed. (patch 17)
 * Reworked the periodic setting the HPET channel. Rather than changing it
   every time the channel is disabled or enabled, do it only once. While
   at here, wrap the code in an initial setup function. (patch 22)
 * Implemented hardlockup_detector_hpet_start() to be called when tsc_khz is
   is refined. (patch 22)
 * Reduced the error window of the expected TSC value at the time of the
   HPET channel expiration. (patch 23)
 * Described better the heuristics used to determine if the HPET channel
   caused the NMI. (patch 23)
 * Added a table to characterize the error in the expected TSC value when
   the HPET channel fires. (patch 23)
 * Added watchdog_nmi_start() to be used when tsc_khz is recalibrated.
   (patch 26)
 * Always build the x86-specific hardlockup detector shim; not only
   when the HPET-based detector is selected. (patch 26)
 * Relocated the declaration of hardlockup_detector_switch_to_perf() to
   x86/nmi.h It does not depend on HPET.
 * Use time_in_range64() to compare the actual TSC value vs the expected
   value. This makes it more readable. (patch 22)
 * Dropped pointless X86_64 || X86_32 check in Kconfig. (patch 26)

Changes since v4:
 * Added commentary on the tests performed on this feature. (Andi)
 * Added a stub version of hardlockup_detector_switch_to_perf() for
   !CONFIG_HPET_TIMER. (lkp)
 * Use switch to select the type of x86 hardlockup detector. (Andi)
 * Renamed a local variable in update_ticks_per_group(). (Andi)
 * Made this hardlockup detector available to X86_32.
 * Reworked logic to kick the HPET timer to remove a local variable.
   (Andi)
 * Added a comment on what type of timer channel will be assigned to the
   detector. (Andi)
 * Reworded help comment in the X86_HARDLOCKUP_DETECTOR_HPET Kconfig
   option. (Andi)
 * Removed unnecessary switch to level interrupt mode when disabling the
   timer. (Andi)
 * Disabled the HPET timer to avoid a race between an incoming interrupt
   and an update of the MSI destination ID. (Ashok)
 * Renamed hpet_hardlockup_detector_get_timer() as hpet_hld_get_timer()
 * Added commentary on an undocumented udelay() when programming an
   HPET channel in periodic mode. (Ashok)
 * Reworked code to use new enumeration apic_delivery_modes and reworked
   MSI message composition fields [9].
 * Partitioned monitored CPUs into groups. Each CPU in the group is
   inspected for hardlockups using an IPI.
 * Use a round-robin mechanism to update the affinity of the HPET timer.
   Affinity is updated every watchdog_thresh seconds to target the
   handling CPU of the group.
 * Moved update of the HPET interrupt affinity to an irq_work. (Thomas
   Gleixner).
 * Updated expiration of the HPET timer and the expected value of the
   TSC based on the number of groups of monitored CPUs.
 * Renamed hpet_set_comparator() to hpet_set_comparator_periodic() to
   remove decision logic for periodic case. (Thomas Gleixner)
 * Reworked timer reservation to use Thomas' rework on HPET channel
   management [10].
 * Removed hard-coded channel number for the hardlockup detector.
 * Provided more details on the sequence of HPET channel reservation.
   (Thomas Gleixner)
 * Only reserve a channel for the hardlockup detector if enabled via
   kernel command line. The function reserving the channel is called from
   hardlockup detector. (Thomas Gleixner)
 * Dropped hpet_hld_data::enabled_cpus and instead use cpumask_weight().
 * Renamed hpet_hld_data::cpu_monitored_mask to
   hld_data_data.cpu_monitored_mask and converted it to cpumask_var_t.
 * Flushed out any outstanding interrupt before enabling the HPET channel.
 * Removed unnecessary MSI_DATA_LEVEL_ASSERT from the MSI message.
 * Added comments in hardlockup_detector_nmi_handler() to explain how
   CPUs are targeted for an IPI.
 * Updated code to only issue an IPI when needed (i.e., there are CPUs in
   the group other than the handling CPU).
 * Reworked hardlockup_detector_hpet_init() for readability.
 * Now reserve the cpumasks in the hardlockup detector code and not in the
   generic HPET code.
 * Handle the case of watchdog_thresh = 0 when disabling the detector.

Change since v3:
 * Fixed yet another bug in periodic programming of the HPET timer that
   prevented the system from booting.
 * Fixed computation of HPET frequency to use hpet_readl() only.
 * Added a missing #include in the watchdog_hld_hpet.c
 * Fixed various typos and grammar errors (Randy Dunlap)

Changes since v2:
 * Added functionality to switch to the perf-based hardlockup
   detector if the TSC becomes unstable (Thomas Gleixner).
 * Brought back the round-robin mechanism proposed in v1 (this time not
   using the interrupt subsystem). This also requires computing
   expiration times as in v1 (Andi Kleen, Stephane Eranian).
 * Fixed a bug in which using a periodic timer was not working (thanks
   to Suravee Suthikulpanit!).
 * In this version, I incorporate support for interrupt remapping in the
   last 4 patches so that they can be reviewed separately if needed.
 * Removed redundant documentation of functions (Thomas Gleixner).
 * Added a new category of NMI handler, NMI_WATCHDOG, which executes after
   NMI_LOCAL handlers (Andi Kleen).
 * Updated handling of "nmi_watchdog" to support comma-separated
   arguments.
 * Undid split of the generic hardlockup detector into a separate file
   (Thomas Gleixner).
 * Added a new intermediate symbol CONFIG_HARDLOCKUP_DETECTOR_CORE to
   select generic parts of the detector (Paul E. McKenney,
   Thomas Gleixner).
 * Removed use of struct cpumask in favor of a variable length array in
   conjunction with kzalloc (Peter Zijlstra).
 * Added CPU as an argument to
   hardlockup_detector_hpet_enable()/disable() (Thomas Gleixner).
 * Removed unnecessary export of function declarations, flags, and bit
   fields (Thomas Gleixner).
 * Removed unnecessary check for FSB support when reserving timer for the
   detector (Thomas Gleixner).
 * Separated TSC code from HPET code in kick_timer() (Thomas Gleixner).
 * Reworked condition to check if the expected TSC value is within the
   error margin to avoid a conditional (Peter Zijlstra).
 * Removed TSC error margin from struct hld_data; use global variable
   instead (Peter Zijlstra).
 * Removed previously introduced watchdog_get_allowed_cpumask*() and
   reworked hardlockup_detector_hpet_enable()/disable() to not need
   access to watchdog_allowed_mask (Thomas Gleixner).

Changes since v1:
 * Removed reads of HPET registers at every NMI. Instead, use the time-stamp
   counter to infer the interrupt source (Thomas Gleixner, Andi Kleen).
 * Do not target CPUs in a round-robin manner. Instead, the HPET timer
   always targets the same CPU; other CPUs are monitored via an
   interprocessor interrupt.
 * Removed use of generic irq code to set interrupt affinity and NMI
   delivery. Instead, configure the interrupt directly in HPET registers
   (Thomas Gleixner).
 * Removed the proposed ops structure for NMI watchdogs. Instead, split
   the existing implementation into a generic library and perf-specific
   infrastructure (Thomas Gleixner, Nicholas Piggin).
 * Added an x86-specific shim hardlockup detector that selects between
   HPET and perf infrastructures as needed (Nicholas Piggin).
 * Removed locks taken in NMI and !NMI context. This was wrong and is no
   longer needed (Thomas Gleixner).
 * Fixed the unconditional return of NMI_HANDLED when the HPET timer is
   programmed for FSB/MSI delivery (Peter Zijlstra).

[1]. https://lore.kernel.org/lkml/1528851463-21140-1-git-send-email-ricardo.neri-calderon@linux.intel.com/
[2]. https://lore.kernel.org/lkml/1551283518-18922-1-git-send-email-ricardo.neri-calderon@linux.intel.com/
[3]. https://lore.kernel.org/lkml/1557842534-4266-1-git-send-email-ricardo.neri-calderon@linux.intel.com/
[4]. https://lore.kernel.org/lkml/1558660583-28561-1-git-send-email-ricardo.neri-calderon@linux.intel.com/
[5]. https://lore.kernel.org/lkml/20210504190526.22347-1-ricardo.neri-calderon@linux.intel.com/T/
[6]. https://lore.kernel.org/all/20220506000008.30892-1-ricardo.neri-calderon@linux.intel.com/
[7]. https://lore.kernel.org/lkml/20200117091341.GX2827@hirez.programming.kicks-ass.net/
[8]. https://lore.kernel.org/lkml/1582581564-184429-1-git-send-email-kan.liang@linux.intel.com/
[9]. https://lore.kernel.org/all/20201024213535.443185-6-dwmw2@infradead.org/
[10]. https://lore.kernel.org/lkml/20190623132340.463097504@linutronix.de/

Ricardo Neri (24):
  x86/apic: Add irq_cfg::delivery_mode
  x86/apic/msi: Use the delivery mode from irq_cfg for message
    composition
  x86/apic: Add the X86_IRQ_ALLOC_AS_NMI interrupt allocation flag
  x86/apic/vector: Implement a local APIC NMI controller
  x86/apic/vector: Skip cleanup for the NMI vector
  iommu/vt-d: Clear the redirection hint when the destination mode is
    physical
  iommu/vt-d: Rework prepare_irte() to support per-interrupt delivery
    mode
  iommu/vt-d: Set the IRTE delivery mode individually for each interrupt
  iommu/amd: Expose [set|get]_dev_entry_bit()
  iommu/amd: Enable NMIPass when allocating an NMI
  iommu/amd: Compose MSI messages for NMIs in non-IR format
  x86/hpet: Expose hpet_writel() in header
  x86/hpet: Add helper function hpet_set_comparator_periodic()
  x86/hpet: Prepare IRQ assignments to use the X86_ALLOC_AS_NMI flag
  x86/hpet: Reserve an HPET channel for the hardlockup detector
  watchdog/hardlockup: Define a generic function to detect hardlockups
  watchdog/hardlockup: Decouple the hardlockup detector from perf
  init/main: Delay initialization of the lockup detector after
    smp_init()
  x86/watchdog/hardlockup: Add an HPET-based hardlockup detector
  x86/watchdog/hardlockup/hpet: Determine if HPET timer caused NMI
  watchdog/hardlockup/hpet: Only enable the HPET watchdog via a boot
    parameter
  x86/watchdog: Add a shim hardlockup detector
  watchdog: Introduce hardlockup_detector_mark_unavailable()
  x86/tsc: Stop the HPET hardlockup detector if TSC become unstable

 .../admin-guide/kernel-parameters.txt         |   8 +
 arch/x86/Kconfig.debug                        |  11 +
 arch/x86/include/asm/hpet.h                   |  49 ++
 arch/x86/include/asm/hw_irq.h                 |   5 +-
 arch/x86/include/asm/irqdomain.h              |   1 +
 arch/x86/include/asm/nmi.h                    |   6 +
 arch/x86/kernel/Makefile                      |   3 +
 arch/x86/kernel/apic/apic.c                   |   2 +-
 arch/x86/kernel/apic/vector.c                 |  67 +++
 arch/x86/kernel/hpet.c                        | 157 +++++-
 arch/x86/kernel/tsc.c                         |   3 +
 arch/x86/kernel/watchdog_hld.c                |  97 ++++
 arch/x86/kernel/watchdog_hld_hpet.c           | 449 ++++++++++++++++++
 drivers/iommu/amd/amd_iommu.h                 |   3 +
 drivers/iommu/amd/init.c                      |   4 +-
 drivers/iommu/amd/iommu.c                     |  31 +-
 drivers/iommu/intel/irq_remapping.c           |  23 +-
 include/linux/irq.h                           |   5 +
 include/linux/nmi.h                           |   8 +-
 init/main.c                                   |   4 +-
 kernel/Makefile                               |   2 +-
 kernel/watchdog.c                             |  20 +
 kernel/watchdog_hld.c                         |  50 +-
 lib/Kconfig.debug                             |   4 +
 24 files changed, 962 insertions(+), 50 deletions(-)
 create mode 100644 arch/x86/kernel/watchdog_hld.c
 create mode 100644 arch/x86/kernel/watchdog_hld_hpet.c

-- 
2.25.1


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v7 01/24] x86/apic: Add irq_cfg::delivery_mode
  2023-03-01 23:47 ` Ricardo Neri
@ 2023-03-01 23:47   ` Ricardo Neri
  -1 siblings, 0 replies; 52+ messages in thread
From: Ricardo Neri @ 2023-03-01 23:47 UTC (permalink / raw)
  To: Tony Luck, Dave Hansen, Rafael J. Wysocki, Reinette Chatre,
	Dan Williams, Len Brown
  Cc: Andi Kleen, Stephane Eranian, Ravi V. Shankar, Ricardo Neri,
	linuxppc-dev, iommu, linux-kernel, Ricardo Neri

Hardware places no restrictions on configuring the delivery mode of each
interrupt individually. Also, certain interrupts need to be configured with
a specific delivery mode (e.g., non-maskable interrupts). Add a new member,
delivery_mode, to struct irq_cfg for this purpose.

To keep the current behavior, use the delivery mode of the APIC driver when
allocating a vector for an interrupt in the root domain (i.e.,
x86_vector_domain).

Cc: Andi Kleen <ak@linux.intel.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-dev@lists.ozlabs.org
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v6:
 * Reworded the commit message to accurately state that the root domain
   allocates a vector for an interrupt, not an interrupt. (Thomas)
 * Removed stray newline. (Thomas)
 * Replaced 'irq' with 'interrupt' in the changelog and in the code.
   (Thomas)

Changes since v5:
 * Updated indentation of the existing members of struct irq_cfg.
 * Reworded the commit message.

Changes since v4:
 * Rebased to use new enumeration apic_delivery_modes.

Changes since v3:
 * None

Changes since v2:
 * Reduced scope to only add the interrupt delivery mode in
   struct irq_alloc_info.

Changes since v1:
 * Introduced this patch.
---
 arch/x86/include/asm/hw_irq.h | 5 +++--
 arch/x86/kernel/apic/vector.c | 6 ++++++
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h
index d465ece58151..5ac5e6c603ee 100644
--- a/arch/x86/include/asm/hw_irq.h
+++ b/arch/x86/include/asm/hw_irq.h
@@ -88,8 +88,9 @@ struct irq_alloc_info {
 };
 
 struct irq_cfg {
-	unsigned int		dest_apicid;
-	unsigned int		vector;
+	unsigned int			dest_apicid;
+	unsigned int			vector;
+	enum apic_delivery_modes	delivery_mode;
 };
 
 extern struct irq_cfg *irq_cfg(unsigned int irq);
diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index c1efebd27e6c..633b442c8f84 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -573,6 +573,12 @@ static int x86_vector_alloc_irqs(struct irq_domain *domain, unsigned int virq,
 		/* Don't invoke affinity setter on deactivated interrupts */
 		irqd_set_affinity_on_activate(irqd);
 
+		/*
+		 * A delivery mode may be specified in the interrupt allocation
+		 * info. If not, use the delivery mode of the APIC.
+		 */
+		apicd->hw_irq_cfg.delivery_mode = apic->delivery_mode;
+
 		/*
 		 * Legacy vectors are already assigned when the IOAPIC
 		 * takes them over. They stay on the same vector. This is
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread
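
As a reading aid for patch 01 above, here is a minimal userspace sketch of
the default it introduces: a newly allocated interrupt inherits the delivery
mode of the APIC driver. The enum values match the kernel's
apic_delivery_modes, but struct model_apic and model_init_irq_cfg() are
hypothetical names invented for this illustration; they are not part of the
patch or of the kernel.

```c
/* Userspace stand-ins for the kernel types touched by the patch. */
enum apic_delivery_modes {
	APIC_DELIVERY_MODE_FIXED = 0,
	APIC_DELIVERY_MODE_NMI   = 4,
};

struct irq_cfg {
	unsigned int			dest_apicid;
	unsigned int			vector;
	enum apic_delivery_modes	delivery_mode;
};

/* Hypothetical model of the APIC driver: only the field used here. */
struct model_apic {
	enum apic_delivery_modes	delivery_mode;
};

/*
 * Hypothetical helper mirroring the assignment added to
 * x86_vector_alloc_irqs(): the per-interrupt configuration takes the
 * delivery mode of the APIC driver unless something overrides it later.
 */
void model_init_irq_cfg(struct irq_cfg *cfg, const struct model_apic *apic)
{
	cfg->dest_apicid = 0;
	cfg->vector = 0;
	cfg->delivery_mode = apic->delivery_mode;
}
```

With a driver whose mode is APIC_DELIVERY_MODE_FIXED, every interrupt keeps
the behavior it had before the patch, which is the point of the default.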

* [PATCH v7 02/24] x86/apic/msi: Use the delivery mode from irq_cfg for message composition
  2023-03-01 23:47 ` Ricardo Neri
@ 2023-03-01 23:47   ` Ricardo Neri
  -1 siblings, 0 replies; 52+ messages in thread
From: Ricardo Neri @ 2023-03-01 23:47 UTC (permalink / raw)
  To: Tony Luck, Dave Hansen, Rafael J. Wysocki, Reinette Chatre,
	Dan Williams, Len Brown
  Cc: Andi Kleen, Stephane Eranian, Ravi V. Shankar, Ricardo Neri,
	linuxppc-dev, iommu, linux-kernel, Ricardo Neri

irq_cfg provides a delivery mode for each interrupt. Use it instead
of the hardcoded APIC_DELIVERY_MODE_FIXED. This allows composing
messages with the NMI delivery mode, which is required to implement an
HPET-based NMI watchdog.

No functional change as the default delivery mode is set to
APIC_DELIVERY_MODE_FIXED.

Cc: Andi Kleen <ak@linux.intel.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-dev@lists.ozlabs.org
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v6:
 * Reworded changelog as per suggestion from Thomas.

Changes since v5:
 * Introduced this patch

Changes since v4:
 * N/A

Changes since v3:
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 arch/x86/kernel/apic/apic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 20d9a604da7c..352738238e52 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -2562,7 +2562,7 @@ void __irq_msi_compose_msg(struct irq_cfg *cfg, struct msi_msg *msg,
 	msg->arch_addr_lo.dest_mode_logical = apic->dest_mode_logical;
 	msg->arch_addr_lo.destid_0_7 = cfg->dest_apicid & 0xFF;
 
-	msg->arch_data.delivery_mode = APIC_DELIVERY_MODE_FIXED;
+	msg->arch_data.delivery_mode = cfg->delivery_mode;
 	msg->arch_data.vector = cfg->vector;
 
 	msg->address_hi = X86_MSI_BASE_ADDRESS_HIGH;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread
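
To make the effect of patch 02 above concrete, the sketch below packs the low
bits of an x86 MSI data word from a per-interrupt configuration: bits 7:0
carry the vector and bits 10:8 carry the delivery mode. It is a userspace
model, not the kernel's __irq_msi_compose_msg(); model_msi_data() is a
hypothetical helper, and the vectors used in the usage note are example
values only.

```c
#include <stdint.h>

/* Userspace stand-ins; the values match the kernel's enumeration. */
enum apic_delivery_modes {
	APIC_DELIVERY_MODE_FIXED = 0,
	APIC_DELIVERY_MODE_NMI   = 4,
};

struct irq_cfg {
	unsigned int			dest_apicid;
	unsigned int			vector;
	enum apic_delivery_modes	delivery_mode;
};

/*
 * Hypothetical helper: build the low bits of the MSI data register from
 * the per-interrupt configuration instead of a hardcoded fixed mode.
 */
uint32_t model_msi_data(const struct irq_cfg *cfg)
{
	uint32_t data = 0;

	data |= cfg->vector & 0xFFu;				/* bits 7:0  */
	data |= ((uint32_t)cfg->delivery_mode & 0x7u) << 8;	/* bits 10:8 */

	return data;
}
```

For a fixed-mode interrupt on vector 0x31 the helper yields 0x031; for an NMI
with vector 0x02 it yields 0x402, the only difference being the delivery-mode
field taken from irq_cfg.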

* [PATCH v7 03/24] x86/apic: Add the X86_IRQ_ALLOC_AS_NMI interrupt allocation flag
  2023-03-01 23:47 ` Ricardo Neri
@ 2023-03-01 23:47   ` Ricardo Neri
  -1 siblings, 0 replies; 52+ messages in thread
From: Ricardo Neri @ 2023-03-01 23:47 UTC (permalink / raw)
  To: Tony Luck, Dave Hansen, Rafael J. Wysocki, Reinette Chatre,
	Dan Williams, Len Brown
  Cc: Andi Kleen, Stephane Eranian, Ravi V. Shankar, Ricardo Neri,
	linuxppc-dev, iommu, linux-kernel, Ricardo Neri

There are cases in which the delivery mode of an interrupt must be set to
NMI. Add a new flag that callers can specify when allocating an interrupt.

Cc: Andi Kleen <ak@linux.intel.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-dev@lists.ozlabs.org
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v6:
 * None

Changes since v5:
 * Introduced this patch.

Changes since v4:
 * N/A

Changes since v3:
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 arch/x86/include/asm/irqdomain.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/include/asm/irqdomain.h b/arch/x86/include/asm/irqdomain.h
index 30c325c235c0..e13f02c6fe95 100644
--- a/arch/x86/include/asm/irqdomain.h
+++ b/arch/x86/include/asm/irqdomain.h
@@ -8,6 +8,7 @@
 #ifdef CONFIG_X86_LOCAL_APIC
 enum {
 	X86_IRQ_ALLOC_LEGACY				= 0x1,
+	X86_IRQ_ALLOC_AS_NMI				= 0x2,
 };
 
 extern int x86_fwspec_is_ioapic(struct irq_fwspec *fwspec);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v7 04/24] x86/apic/vector: Implement a local APIC NMI controller
  2023-03-01 23:47 ` Ricardo Neri
@ 2023-03-01 23:47   ` Ricardo Neri
  -1 siblings, 0 replies; 52+ messages in thread
From: Ricardo Neri @ 2023-03-01 23:47 UTC (permalink / raw)
  To: Tony Luck, Dave Hansen, Rafael J. Wysocki, Reinette Chatre,
	Dan Williams, Len Brown
  Cc: Andi Kleen, Stephane Eranian, Ravi V. Shankar, Ricardo Neri,
	linuxppc-dev, iommu, linux-kernel, Ricardo Neri

Add a separate local APIC NMI controller to handle NMIs apart from the
regular APIC management.

This controller will be used to handle the NMI vector of the HPET NMI
watchdog.

Cc: Andi Kleen <ak@linux.intel.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-dev@lists.ozlabs.org
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v6:
 * Reworked patch "x86/apic/vector: Implement support for NMI delivery
   mode" into a separate local APIC NMI controller. (Thomas)

Changes since v5:
 * Introduced this patch.

Changes since v4:
 * N/A

Changes since v3:
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 arch/x86/kernel/apic/vector.c | 57 +++++++++++++++++++++++++++++++++++
 include/linux/irq.h           |  5 +++
 2 files changed, 62 insertions(+)

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 633b442c8f84..a4cf041427cb 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -42,6 +42,7 @@ EXPORT_SYMBOL_GPL(x86_vector_domain);
 static DEFINE_RAW_SPINLOCK(vector_lock);
 static cpumask_var_t vector_searchmask;
 static struct irq_chip lapic_controller;
+static struct irq_chip lapic_nmi_controller;
 static struct irq_matrix *vector_matrix;
 #ifdef CONFIG_SMP
 static DEFINE_PER_CPU(struct hlist_head, cleanup_list);
@@ -451,6 +452,10 @@ static int x86_vector_activate(struct irq_domain *dom, struct irq_data *irqd,
 	trace_vector_activate(irqd->irq, apicd->is_managed,
 			      apicd->can_reserve, reserve);
 
+	/* NMI has a fixed vector. No vector management required */
+	if (apicd->hw_irq_cfg.delivery_mode == APIC_DELIVERY_MODE_NMI)
+		return 0;
+
 	raw_spin_lock_irqsave(&vector_lock, flags);
 	if (!apicd->can_reserve && !apicd->is_managed)
 		assign_irq_vector_any_locked(irqd);
@@ -472,6 +477,10 @@ static void vector_free_reserved_and_managed(struct irq_data *irqd)
 	trace_vector_teardown(irqd->irq, apicd->is_managed,
 			      apicd->has_reserved);
 
+	/* NMI has a fixed vector. No vector management required */
+	if (apicd->hw_irq_cfg.delivery_mode == APIC_DELIVERY_MODE_NMI)
+		return;
+
 	if (apicd->has_reserved)
 		irq_matrix_remove_reserved(vector_matrix);
 	if (apicd->is_managed)
@@ -539,6 +548,10 @@ static int x86_vector_alloc_irqs(struct irq_domain *domain, unsigned int virq,
 	if (disable_apic)
 		return -ENXIO;
 
+	/* Only one IRQ per NMI */
+	if ((info->flags & X86_IRQ_ALLOC_AS_NMI) && nr_irqs != 1)
+		return -EINVAL;
+
 	/*
 	 * Catch any attempt to touch the cascade interrupt on a PIC
 	 * equipped system.
@@ -573,6 +586,25 @@ static int x86_vector_alloc_irqs(struct irq_domain *domain, unsigned int virq,
 		/* Don't invoke affinity setter on deactivated interrupts */
 		irqd_set_affinity_on_activate(irqd);
 
+		if (info->flags & X86_IRQ_ALLOC_AS_NMI) {
+			/*
+			 * NMIs have a fixed vector and need their own
+			 * interrupt chip so nothing can end up in the
+			 * regular local APIC management code except the
+			 * MSI message composing callback.
+			 */
+			apicd->hw_irq_cfg.delivery_mode = APIC_DELIVERY_MODE_NMI;
+			irqd->chip = &lapic_nmi_controller;
+			/*
+			 * Exclude NMIs from balancing. This cannot work with
+			 * the regular affinity mechanisms. The local APIC NMI
+			 * controller provides a set_affinity() callback for the
+			 * intended HPET NMI watchdog use case.
+			 */
+			irqd_set_no_balance(irqd);
+			return 0;
+		}
+
 		/*
 		 * A delivery mode may be specified in the interrupt allocation
 		 * info. If not, use the delivery mode of the APIC.
@@ -872,8 +904,27 @@ static int apic_set_affinity(struct irq_data *irqd,
 	return err ? err : IRQ_SET_MASK_OK;
 }
 
+static int apic_nmi_set_affinity(struct irq_data *irqd,
+				 const struct cpumask *dest, bool force)
+{
+	struct apic_chip_data *apicd = apic_chip_data(irqd);
+	static struct cpumask tmp_mask;
+	int cpu;
+
+	cpumask_and(&tmp_mask, dest, cpu_online_mask);
+	if (cpumask_empty(&tmp_mask))
+		return -ENODEV;
+
+	cpu = cpumask_first(&tmp_mask);
+	apicd->hw_irq_cfg.dest_apicid = apic->calc_dest_apicid(cpu);
+	irq_data_update_effective_affinity(irqd, cpumask_of(cpu));
+
+	return IRQ_SET_MASK_OK;
+}
+
 #else
 # define apic_set_affinity	NULL
+# define apic_nmi_set_affinity	NULL
 #endif
 
 static int apic_retrigger_irq(struct irq_data *irqd)
@@ -914,6 +965,12 @@ static struct irq_chip lapic_controller = {
 	.irq_retrigger		= apic_retrigger_irq,
 };
 
+static struct irq_chip lapic_nmi_controller = {
+	.name			= "APIC-NMI",
+	.irq_set_affinity	= apic_nmi_set_affinity,
+	.irq_compose_msi_msg	= x86_vector_msi_compose_msg,
+};
+
 #ifdef CONFIG_SMP
 
 static void free_moved_vector(struct apic_chip_data *apicd)
diff --git a/include/linux/irq.h b/include/linux/irq.h
index b1b28affb32a..c8738b36e316 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -263,6 +263,11 @@ static inline bool irqd_is_per_cpu(struct irq_data *d)
 	return __irqd_to_state(d) & IRQD_PER_CPU;
 }
 
+static inline void irqd_set_no_balance(struct irq_data *d)
+{
+	__irqd_to_state(d) |= IRQD_NO_BALANCING;
+}
+
 static inline bool irqd_can_balance(struct irq_data *d)
 {
 	return !(__irqd_to_state(d) & (IRQD_PER_CPU | IRQD_NO_BALANCING));
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread
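
The two decisions made by patch 04 above can be modeled in userspace as
follows: an allocation flagged as NMI must request exactly one interrupt, and
the affinity callback targets the first online CPU of the requested mask. This
is only a sketch under simplifying assumptions. model_cpumask_t (a 64-bit
bitmap), model_check_nmi_alloc() and model_nmi_pick_cpu() are hypothetical
names, and unlike apic_nmi_set_affinity(), which returns IRQ_SET_MASK_OK and
updates dest_apicid, the model simply returns the chosen CPU.

```c
#include <errno.h>
#include <stdint.h>

/* Same value as the flag added in patch 03. */
#define X86_IRQ_ALLOC_AS_NMI	0x2

/* Hypothetical stand-in for a cpumask: bit N set means CPU N is included. */
typedef uint64_t model_cpumask_t;

/*
 * Model of the allocation-time check in x86_vector_alloc_irqs(): an NMI
 * allocation must request exactly one interrupt.
 */
int model_check_nmi_alloc(unsigned int flags, unsigned int nr_irqs)
{
	if ((flags & X86_IRQ_ALLOC_AS_NMI) && nr_irqs != 1)
		return -EINVAL;

	return 0;
}

/*
 * Model of the selection done in apic_nmi_set_affinity(): intersect the
 * requested mask with the online mask and pick the first CPU. Returns the
 * chosen CPU number, or -ENODEV if no requested CPU is online.
 */
int model_nmi_pick_cpu(model_cpumask_t dest, model_cpumask_t online)
{
	model_cpumask_t candidates = dest & online;
	int cpu;

	if (!candidates)
		return -ENODEV;

	for (cpu = 0; cpu < 64; cpu++) {
		if (candidates & ((model_cpumask_t)1 << cpu))
			return cpu;
	}

	return -ENODEV;	/* not reached: candidates is non-zero */
}
```

For example, requesting CPUs 2 and 3 while only CPUs 1 and 3 are online
selects CPU 3, and requesting only an offline CPU fails with -ENODEV.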

* [PATCH v7 05/24] x86/apic/vector: Skip cleanup for the NMI vector
  2023-03-01 23:47 ` Ricardo Neri
@ 2023-03-01 23:47   ` Ricardo Neri
  -1 siblings, 0 replies; 52+ messages in thread
From: Ricardo Neri @ 2023-03-01 23:47 UTC (permalink / raw)
  To: Tony Luck, Dave Hansen, Rafael J. Wysocki, Reinette Chatre,
	Dan Williams, Len Brown
  Cc: Andi Kleen, Stephane Eranian, Ravi V. Shankar, Ricardo Neri,
	linuxppc-dev, iommu, linux-kernel, Ricardo Neri

The NMI vector is fixed. No cleanup is needed after updating affinity.
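
As an illustration only, the guard can be modelled in user space as below.
This is a minimal sketch: struct demo_irq_cfg, the DEMO_MODE_* values and
demo_needs_vector_cleanup() are hypothetical stand-ins, not the kernel
definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
enum demo_delivery_mode { DEMO_MODE_FIXED, DEMO_MODE_NMI };

struct demo_irq_cfg {
	enum demo_delivery_mode delivery_mode;
	bool move_in_progress;	/* a vector move is awaiting cleanup */
};

/* Returns true if a cleanup request would be sent for @cfg. */
bool demo_needs_vector_cleanup(const struct demo_irq_cfg *cfg)
{
	assert(cfg);

	/* NMI uses a fixed vector: nothing was moved, nothing to clean up. */
	if (cfg->delivery_mode == DEMO_MODE_NMI)
		return false;

	return cfg->move_in_progress;
}
```

The early return comes first so that the NMI case never reaches the
move-tracking state at all.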

Cc: Andi Kleen <ak@linux.intel.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>

---
Changes since v6:
 * Introduced this patch.

Changes since v5:
 * N/A

Changes since v4:
 * N/A

Changes since v3:
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 arch/x86/kernel/apic/vector.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index a4cf041427cb..3045823ecc1b 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -1050,6 +1050,10 @@ void send_cleanup_vector(struct irq_cfg *cfg)
 {
 	struct apic_chip_data *apicd;
 
+	/* NMI has a fixed vector. No vector management required. */
+	if (cfg->delivery_mode == APIC_DELIVERY_MODE_NMI)
+		return;
+
 	apicd = container_of(cfg, struct apic_chip_data, hw_irq_cfg);
 	if (apicd->move_in_progress)
 		__send_cleanup_vector(apicd);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v7 06/24] iommu/vt-d: Clear the redirection hint when the destination mode is physical
  2023-03-01 23:47 ` Ricardo Neri
@ 2023-03-01 23:47   ` Ricardo Neri
  -1 siblings, 0 replies; 52+ messages in thread
From: Ricardo Neri @ 2023-03-01 23:47 UTC (permalink / raw)
  To: Tony Luck, Dave Hansen, Rafael J. Wysocki, Reinette Chatre,
	Dan Williams, Len Brown
  Cc: Andi Kleen, Stephane Eranian, Ravi V. Shankar, Ricardo Neri,
	linuxppc-dev, iommu, linux-kernel, Ricardo Neri, David Woodhouse,
	Lu Baolu

When the destination mode of an interrupt is physical APICID, the interrupt
is delivered only to the single CPU whose physical APICID is specified in
the destination ID field. The redirection hint is therefore meaningless.

Furthermore, on certain processors, the IOMMU does not deliver the
interrupt when the delivery mode is NMI, the redirection hint is set, and
the destination mode is physical. Clearing the redirection hint ensures
that the NMI is delivered.
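
The rule can be sketched in user space as follows. The layout of
struct demo_hint_irte is a simplification I made up for illustration (the
real IRTE is a packed bit-field), and demo_fill_hint_irte() is a
hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified IRTE; the real entry is a packed bit-field. */
struct demo_hint_irte {
	uint8_t  vector;
	uint8_t  dst_mode;	/* 0 = physical, 1 = logical */
	uint8_t  redir_hint;
	uint32_t dest_id;
};

void demo_fill_hint_irte(struct demo_hint_irte *irte, uint8_t vector,
			 uint32_t dest, bool dest_mode_logical)
{
	assert(irte);
	memset(irte, 0, sizeof(*irte));

	irte->vector = vector;
	irte->dest_id = dest;
	irte->dst_mode = dest_mode_logical;
	/* Physical mode targets exactly one CPU, so the hint stays clear. */
	irte->redir_hint = dest_mode_logical;
}
```

Tying the hint to the destination mode keeps the logical-mode behavior as
it was and only changes the physical-mode case.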

Cc: Andi Kleen <ak@linux.intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-dev@lists.ozlabs.org
Suggested-by: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v6:
 * None

Changes since v5:
 * Introduced this patch.

Changes since v4:
 * N/A

Changes since v3:
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 drivers/iommu/intel/irq_remapping.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/irq_remapping.c b/drivers/iommu/intel/irq_remapping.c
index 6d01fa078c36..2d68f94ae0ee 100644
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -1123,7 +1123,17 @@ static void prepare_irte(struct irte *irte, int vector, unsigned int dest)
 	irte->dlvry_mode = apic->delivery_mode;
 	irte->vector = vector;
 	irte->dest_id = IRTE_DEST(dest);
-	irte->redir_hint = 1;
+
+	/*
+	 * When using the destination mode of physical APICID, only the
+	 * processor specified in @dest receives the interrupt. The redirection
+	 * hint is meaningless.
+	 *
+	 * Furthermore, on some processors, NMIs with physical destination
+	 * mode and the redirection hint set are delivered as regular
+	 * interrupts or not delivered at all.
+	 */
+	irte->redir_hint = apic->dest_mode_logical;
 }
 
 struct irq_remap_ops intel_irq_remap_ops = {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v7 07/24] iommu/vt-d: Rework prepare_irte() to support per-interrupt delivery mode
  2023-03-01 23:47 ` Ricardo Neri
@ 2023-03-01 23:47   ` Ricardo Neri
  -1 siblings, 0 replies; 52+ messages in thread
From: Ricardo Neri @ 2023-03-01 23:47 UTC (permalink / raw)
  To: Tony Luck, Dave Hansen, Rafael J. Wysocki, Reinette Chatre,
	Dan Williams, Len Brown
  Cc: Andi Kleen, Stephane Eranian, Ravi V. Shankar, Ricardo Neri,
	linuxppc-dev, iommu, linux-kernel, Ricardo Neri, David Woodhouse,
	Lu Baolu, x86

struct irq_cfg::delivery_mode specifies the delivery mode of each interrupt
separately. Configuring the delivery mode of an IRTE would require adding
a third argument to prepare_irte(). Instead, take a pointer to the irq_cfg
for which an IRTE is being configured. No functional changes.

Cc: Andi Kleen <ak@linux.intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: x86@kernel.org
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v6:
 * None

Changes since v5:
 * Only change the signature of prepare_irte(). A separate patch changes
   the setting of the delivery_mode.

Changes since v4:
 * None

Changes since v3:
 * None

Changes since v2:
 * None

Changes since v1:
 * Introduced this patch.
---
 drivers/iommu/intel/irq_remapping.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/intel/irq_remapping.c b/drivers/iommu/intel/irq_remapping.c
index 2d68f94ae0ee..1fe30c31fcbe 100644
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -1106,7 +1106,7 @@ void intel_irq_remap_add_device(struct dmar_pci_notify_info *info)
 	dev_set_msi_domain(&info->dev->dev, map_dev_to_ir(info->dev));
 }
 
-static void prepare_irte(struct irte *irte, int vector, unsigned int dest)
+static void prepare_irte(struct irte *irte, struct irq_cfg *irq_cfg)
 {
 	memset(irte, 0, sizeof(*irte));
 
@@ -1121,8 +1121,8 @@ static void prepare_irte(struct irte *irte, int vector, unsigned int dest)
 	*/
 	irte->trigger_mode = 0;
 	irte->dlvry_mode = apic->delivery_mode;
-	irte->vector = vector;
-	irte->dest_id = IRTE_DEST(dest);
+	irte->vector = irq_cfg->vector;
+	irte->dest_id = IRTE_DEST(irq_cfg->dest_apicid);
 
 	/*
 	 * When using the destination mode of physical APICID, only the
@@ -1273,8 +1273,7 @@ static void intel_irq_remapping_prepare_irte(struct intel_ir_data *data,
 {
 	struct irte *irte = &data->irte_entry;
 
-	prepare_irte(irte, irq_cfg->vector, irq_cfg->dest_apicid);
-
+	prepare_irte(irte, irq_cfg);
 	switch (info->type) {
 	case X86_IRQ_ALLOC_TYPE_IOAPIC:
 		/* Set source-id of interrupt request */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v7 08/24] iommu/vt-d: Set the IRTE delivery mode individually for each interrupt
  2023-03-01 23:47 ` Ricardo Neri
@ 2023-03-01 23:47   ` Ricardo Neri
  -1 siblings, 0 replies; 52+ messages in thread
From: Ricardo Neri @ 2023-03-01 23:47 UTC (permalink / raw)
  To: Tony Luck, Dave Hansen, Rafael J. Wysocki, Reinette Chatre,
	Dan Williams, Len Brown
  Cc: Andi Kleen, Stephane Eranian, Ravi V. Shankar, Ricardo Neri,
	linuxppc-dev, iommu, linux-kernel, Ricardo Neri, David Woodhouse,
	Lu Baolu

Use the mode specified in the provided interrupt hardware configuration
data to set the delivery mode.

Since most interrupts are configured to use the delivery mode of the APIC
driver, there are no functional changes for them. The only exceptions are
interrupts that specify a different delivery mode.
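
A rough user-space model of the idea is shown below. All demo_* names and
the struct layouts are hypothetical; the two enum values mirror the APIC
delivery mode encoding (fixed = 0, NMI = 4).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins; values mirror the APIC delivery mode encoding. */
enum demo_irte_delivery {
	DEMO_DELIVERY_FIXED = 0,
	DEMO_DELIVERY_NMI   = 4,
};

struct demo_hw_irq_cfg {
	unsigned int dest_apicid;
	unsigned int vector;
	enum demo_irte_delivery delivery_mode;
};

struct demo_mode_irte {
	uint8_t  dlvry_mode;
	uint8_t  vector;
	uint32_t dest_id;
};

void demo_prepare_mode_irte(struct demo_mode_irte *irte,
			    const struct demo_hw_irq_cfg *cfg)
{
	assert(irte && cfg);
	memset(irte, 0, sizeof(*irte));

	/* Per-interrupt value, not a global default of the APIC driver. */
	irte->dlvry_mode = (uint8_t)cfg->delivery_mode;
	irte->vector = (uint8_t)cfg->vector;
	irte->dest_id = cfg->dest_apicid;
}
```

An interrupt whose configuration was left at the driver default produces
the same entry as before; only one that carries DEMO_DELIVERY_NMI differs.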

Cc: Andi Kleen <ak@linux.intel.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Lu Baolu <baolu.lu@linux.intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-dev@lists.ozlabs.org
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v6:
 * None

Changes since v5:
 * Introduced this patch.

Changes since v4:
 * N/A

Changes since v3:
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 drivers/iommu/intel/irq_remapping.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/irq_remapping.c b/drivers/iommu/intel/irq_remapping.c
index 1fe30c31fcbe..7b58406ea8d2 100644
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -1120,7 +1120,7 @@ static void prepare_irte(struct irte *irte, struct irq_cfg *irq_cfg)
 	 * irq migration in the presence of interrupt-remapping.
 	*/
 	irte->trigger_mode = 0;
-	irte->dlvry_mode = apic->delivery_mode;
+	irte->dlvry_mode = irq_cfg->delivery_mode;
 	irte->vector = irq_cfg->vector;
 	irte->dest_id = IRTE_DEST(irq_cfg->dest_apicid);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v7 09/24] iommu/amd: Expose [set|get]_dev_entry_bit()
  2023-03-01 23:47 ` Ricardo Neri
@ 2023-03-01 23:47   ` Ricardo Neri
  -1 siblings, 0 replies; 52+ messages in thread
From: Ricardo Neri @ 2023-03-01 23:47 UTC (permalink / raw)
  To: Tony Luck, Dave Hansen, Rafael J. Wysocki, Reinette Chatre,
	Dan Williams, Len Brown
  Cc: Andi Kleen, Stephane Eranian, Ravi V. Shankar, Ricardo Neri,
	linuxppc-dev, iommu, linux-kernel, Ricardo Neri, Joerg Roedel,
	Suravee Suthikulpanit

If an interrupt is allocated with NMI as delivery mode, the Device Table
Entry needs to be modified accordingly in irq_remapping_alloc(). Expose
set_dev_entry_bit() and get_dev_entry_bit() so that code outside of init.c
can do so.

No functional changes.
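
For readers unfamiliar with the helpers, here is a minimal user-space
sketch of the bit addressing they perform. It assumes a device table entry
made of four 64-bit words; the demo_* names are hypothetical and the
IOMMU lookup done by the real helpers is left out.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_DTE_WORDS 4	/* assumption: four 64-bit words per entry */

struct demo_dev_table_entry {
	uint64_t data[DEMO_DTE_WORDS];
};

void demo_set_dev_entry_bit(struct demo_dev_table_entry *table,
			    uint16_t devid, uint8_t bit)
{
	unsigned int word = (bit >> 6) & 0x03;	/* which 64-bit word */
	unsigned int shift = bit & 0x3f;	/* bit within that word */

	assert(table);
	table[devid].data[word] |= UINT64_C(1) << shift;
}

int demo_get_dev_entry_bit(const struct demo_dev_table_entry *table,
			   uint16_t devid, uint8_t bit)
{
	unsigned int word = (bit >> 6) & 0x03;
	unsigned int shift = bit & 0x3f;

	assert(table);
	return (int)((table[devid].data[word] >> shift) & 1);
}
```

For example, bit 70 of an entry lands in word 1 at position 6.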

Cc: Andi Kleen <ak@linux.intel.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v6:
 * None

Changes since v5:
 * Introduced this patch

Changes since v4:
 * N/A

Changes since v3:
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 drivers/iommu/amd/amd_iommu.h | 3 +++
 drivers/iommu/amd/init.c      | 4 ++--
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/amd/amd_iommu.h b/drivers/iommu/amd/amd_iommu.h
index c160a332ce33..b9b87a8cd48e 100644
--- a/drivers/iommu/amd/amd_iommu.h
+++ b/drivers/iommu/amd/amd_iommu.h
@@ -141,4 +141,7 @@ extern u64 amd_iommu_efr;
 extern u64 amd_iommu_efr2;
 
 extern bool amd_iommu_snp_en;
+
+extern void set_dev_entry_bit(struct amd_iommu *iommu, u16 devid, u8 bit);
+extern int get_dev_entry_bit(struct amd_iommu *iommu, u16 devid, u8 bit);
 #endif
diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
index 19a46b9f7357..559a9ecb785f 100644
--- a/drivers/iommu/amd/init.c
+++ b/drivers/iommu/amd/init.c
@@ -999,7 +999,7 @@ static void __set_dev_entry_bit(struct dev_table_entry *dev_table,
 	dev_table[devid].data[i] |= (1UL << _bit);
 }
 
-static void set_dev_entry_bit(struct amd_iommu *iommu, u16 devid, u8 bit)
+void set_dev_entry_bit(struct amd_iommu *iommu, u16 devid, u8 bit)
 {
 	struct dev_table_entry *dev_table = get_dev_table(iommu);
 
@@ -1015,7 +1015,7 @@ static int __get_dev_entry_bit(struct dev_table_entry *dev_table,
 	return (dev_table[devid].data[i] & (1UL << _bit)) >> _bit;
 }
 
-static int get_dev_entry_bit(struct amd_iommu *iommu, u16 devid, u8 bit)
+int get_dev_entry_bit(struct amd_iommu *iommu, u16 devid, u8 bit)
 {
 	struct dev_table_entry *dev_table = get_dev_table(iommu);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v7 10/24] iommu/amd: Enable NMIPass when allocating an NMI
  2023-03-01 23:47 ` Ricardo Neri
@ 2023-03-01 23:47   ` Ricardo Neri
  -1 siblings, 0 replies; 52+ messages in thread
From: Ricardo Neri @ 2023-03-01 23:47 UTC (permalink / raw)
  To: Tony Luck, Dave Hansen, Rafael J. Wysocki, Reinette Chatre,
	Dan Williams, Len Brown
  Cc: Andi Kleen, Stephane Eranian, Ravi V. Shankar, Ricardo Neri,
	linuxppc-dev, iommu, linux-kernel, Ricardo Neri, Joerg Roedel,
	Suravee Suthikulpanit

As per the AMD I/O Virtualization Technology (IOMMU) Specification, the
AMD IOMMU only remaps fixed and arbitrated MSIs. NMIs are controlled
by the NMIPass bit of a Device Table Entry. When set, the IOMMU passes
through NMI interrupt messages unmapped. Otherwise, they are aborted.

Also, Section 2.2.5 Table 19 states that the IOMMU will abort NMIs when the
destination mode is logical.

Update the NMIPass setting of a device's DTE when an NMI is being
allocated. Only do so when the destination mode of the APIC is not logical.
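
The intended flow can be sketched in user space as below. struct demo_dte
and demo_alloc_nmi() are hypothetical; the flush of the device table entry
is modelled as a simple counter.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of one device table entry. */
struct demo_dte {
	bool nmi_pass;		/* models the NMIPass bit */
	unsigned int flushes;	/* counts modelled entry flushes */
};

int demo_alloc_nmi(struct demo_dte *dte, bool dest_mode_logical)
{
	assert(dte);

	/* The IOMMU would abort the NMI in logical mode, so refuse early. */
	if (dest_mode_logical)
		return -EPERM;

	/* Touch and flush the entry only when the bit actually changes. */
	if (!dte->nmi_pass) {
		dte->nmi_pass = true;
		dte->flushes++;
	}

	return 0;
}
```

A second allocation for the same device finds the bit already set and
skips the flush.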

Cc: Andi Kleen <ak@linux.intel.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v6:
 * Removed check for nr_irqs in irq_remapping_alloc(). Allocation had
   been rejected already in the root domain. (Thomas)

Changes since v5:
 * Introduced this patch

Changes since v4:
 * N/A

Changes since v3:
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 drivers/iommu/amd/iommu.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 5a505ba5467e..9bf71e7335f5 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3299,6 +3299,10 @@ static int irq_remapping_alloc(struct irq_domain *domain, unsigned int virq,
 	if (nr_irqs > 1 && info->type != X86_IRQ_ALLOC_TYPE_PCI_MSI)
 		return -EINVAL;
 
+	/* NMIs are aborted when the destination mode is logical. */
+	if (info->flags & X86_IRQ_ALLOC_AS_NMI && apic->dest_mode_logical)
+		return -EPERM;
+
 	sbdf = get_devid(info);
 	if (sbdf < 0)
 		return -EINVAL;
@@ -3348,6 +3352,13 @@ static int irq_remapping_alloc(struct irq_domain *domain, unsigned int virq,
 		goto out_free_parent;
 	}
 
+	if (info->flags & X86_IRQ_ALLOC_AS_NMI) {
+		if (!get_dev_entry_bit(iommu, devid, DEV_ENTRY_NMI_PASS)) {
+			set_dev_entry_bit(iommu, devid, DEV_ENTRY_NMI_PASS);
+			iommu_flush_dte(iommu, devid);
+		}
+	}
+
 	for (i = 0; i < nr_irqs; i++) {
 		irq_data = irq_domain_get_irq_data(domain, virq + i);
 		cfg = irq_data ? irqd_cfg(irq_data) : NULL;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v7 11/24] iommu/amd: Compose MSI messages for NMIs in non-IR format
  2023-03-01 23:47 ` Ricardo Neri
@ 2023-03-01 23:47   ` Ricardo Neri
  -1 siblings, 0 replies; 52+ messages in thread
From: Ricardo Neri @ 2023-03-01 23:47 UTC (permalink / raw)
  To: Tony Luck, Dave Hansen, Rafael J. Wysocki, Reinette Chatre,
	Dan Williams, Len Brown
  Cc: Andi Kleen, Stephane Eranian, Ravi V. Shankar, Ricardo Neri,
	linuxppc-dev, iommu, linux-kernel, Ricardo Neri, Joerg Roedel,
	Suravee Suthikulpanit

If NMIPass is enabled in a Device Table Entry, the IOMMU lets NMI interrupt
messages pass through unmapped. The contents of the MSI message, not an
IRTE, determine how and where the NMI is delivered.

The IOMMU driver owns the MSI message of the NMI. Compose it using the
non-interrupt-remapping format. Let descendant irqchips write the composed
message.
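
To make the difference between the two formats concrete, here is a
simplified user-space sketch. It assumes the usual x86 MSI layout
(destination APIC ID in address bits 19:12, delivery mode in data bits
10:8, vector in data bits 7:0) with physical destination mode. The demo_*
names are hypothetical and the remapped variant is reduced to "the data
carries a table index".

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_MSI_BASE_ADDRESS	0xfee00000u
#define DEMO_DELIVERY_MODE_NMI	4u	/* APIC encoding for NMI delivery */

struct demo_msi_msg {
	uint32_t address_lo;
	uint32_t data;
};

/* Non-remapped format: destination and delivery mode live in the message. */
void demo_compose_direct(struct demo_msi_msg *msg, uint8_t dest_apicid,
			 uint8_t vector, unsigned int delivery_mode)
{
	assert(msg);
	msg->address_lo = DEMO_MSI_BASE_ADDRESS | ((uint32_t)dest_apicid << 12);
	msg->data = ((uint32_t)(delivery_mode & 0x7u) << 8) | vector;
}

/* Remapped format, simplified: the message only names a table index. */
void demo_compose_remapped(struct demo_msi_msg *msg, uint16_t irte_index)
{
	assert(msg);
	msg->address_lo = DEMO_MSI_BASE_ADDRESS;
	msg->data = irte_index;
}

void demo_prepare_msi(struct demo_msi_msg *msg, uint8_t dest_apicid,
		      uint8_t vector, unsigned int delivery_mode,
		      uint16_t irte_index)
{
	if (delivery_mode == DEMO_DELIVERY_MODE_NMI)
		demo_compose_direct(msg, dest_apicid, vector, delivery_mode);
	else
		demo_compose_remapped(msg, irte_index);
}
```

Because the NMI message encodes its own destination, a later affinity
change has to rewrite the message rather than a table entry.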

Cc: Andi Kleen <ak@linux.intel.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v6:
 * Reworded changelog to remove acronyms. (Thomas)
 * Removed confusing comment regarding interrupt vector cleanup after
   changing the affinity of an interrupt. (Thomas)

Changes since v5:
 * Introduced this patch

Changes since v4:
 * N/A

Changes since v3:
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 drivers/iommu/amd/iommu.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 9bf71e7335f5..c6b0c365bf33 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3254,7 +3254,16 @@ static void irq_remapping_prepare_irte(struct amd_ir_data *data,
 	case X86_IRQ_ALLOC_TYPE_HPET:
 	case X86_IRQ_ALLOC_TYPE_PCI_MSI:
 	case X86_IRQ_ALLOC_TYPE_PCI_MSIX:
-		fill_msi_msg(&data->msi_entry, irte_info->index);
+		if (irq_cfg->delivery_mode == APIC_DELIVERY_MODE_NMI)
+			/*
+			 * The IOMMU lets NMIs pass through unmapped. Thus, the
+			 * MSI message, not the IRTE, determines the interrupt
+			 * configuration. Since we own the MSI message,
+			 * compose it.
+			 */
+			__irq_msi_compose_msg(irq_cfg, &data->msi_entry, true);
+		else
+			fill_msi_msg(&data->msi_entry, irte_info->index);
 		break;
 
 	default:
@@ -3643,6 +3652,15 @@ static int amd_ir_set_affinity(struct irq_data *data,
 	 */
 	send_cleanup_vector(cfg);
 
+	/*
+	 * When the delivery mode of an interrupt is NMI, the IOMMU lets the NMI
+	 * interrupt messages pass through unmapped. Changes in the destination
+	 * must be reflected in the MSI message, not the IRTE. Descendant
+	 * irqchips must set the affinity and write the MSI message.
+	 */
+	if (cfg->delivery_mode == APIC_DELIVERY_MODE_NMI)
+		return IRQ_SET_MASK_OK;
+
 	return IRQ_SET_MASK_OK_DONE;
 }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v7 12/24] x86/hpet: Expose hpet_writel() in header
  2023-03-01 23:47 ` Ricardo Neri
@ 2023-03-01 23:47   ` Ricardo Neri
  -1 siblings, 0 replies; 52+ messages in thread
From: Ricardo Neri @ 2023-03-01 23:47 UTC (permalink / raw)
  To: Tony Luck, Dave Hansen, Rafael J. Wysocki, Reinette Chatre,
	Dan Williams, Len Brown
  Cc: Andi Kleen, Stephane Eranian, Ravi V. Shankar, Ricardo Neri,
	linuxppc-dev, iommu, linux-kernel, Ricardo Neri

In order to allow hpet_writel() to be used by other components (e.g.,
the HPET-based hardlockup detector), expose it in the HPET header file.

Cc: Andi Kleen <ak@linux.intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-dev@lists.ozlabs.org
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v6:
 * None

Changes since v5:
 * None

Changes since v4:
 * Dropped exposing hpet_readq() as it is not needed.

Changes since v3:
 * None

Changes since v2:
 * None

Changes since v1:
 * None
---
 arch/x86/include/asm/hpet.h | 1 +
 arch/x86/kernel/hpet.c      | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index ab9f3dd87c80..be9848f0883f 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -72,6 +72,7 @@ extern int is_hpet_enabled(void);
 extern int hpet_enable(void);
 extern void hpet_disable(void);
 extern unsigned int hpet_readl(unsigned int a);
+extern void hpet_writel(unsigned int d, unsigned int a);
 extern void force_hpet_resume(void);
 
 #ifdef CONFIG_HPET_EMULATE_RTC
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index c8eb1ac5125a..8303fb1b63a9 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -79,7 +79,7 @@ inline unsigned int hpet_readl(unsigned int a)
 	return readl(hpet_virt_address + a);
 }
 
-static inline void hpet_writel(unsigned int d, unsigned int a)
+inline void hpet_writel(unsigned int d, unsigned int a)
 {
 	writel(d, hpet_virt_address + a);
 }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v7 13/24] x86/hpet: Add helper function hpet_set_comparator_periodic()
  2023-03-01 23:47 ` Ricardo Neri
@ 2023-03-01 23:47   ` Ricardo Neri
  -1 siblings, 0 replies; 52+ messages in thread
From: Ricardo Neri @ 2023-03-01 23:47 UTC (permalink / raw)
  To: Tony Luck, Dave Hansen, Rafael J. Wysocki, Reinette Chatre,
	Dan Williams, Len Brown
  Cc: Andi Kleen, Stephane Eranian, Ravi V. Shankar, Ricardo Neri,
	linuxppc-dev, iommu, linux-kernel, Ricardo Neri

Programming an HPET channel in periodic mode involves several steps.
Besides the HPET clocksource, the HPET-based hardlockup detector may
need to program its HPET channel in periodic mode.

To avoid code duplication, wrap the programming of the HPET timer in
a helper function.
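
As a rough illustration of the two-write sequence, here is a small user-space
model of one HPET channel. It is a simplified sketch, not kernel code: the
struct and the sketch_* functions are hypothetical, and the register behavior
is modeled only as far as the kernel-doc comment in this patch describes it.
HPET_TN_SETVAL uses the same bit value as the kernel definition.

```c
#include <assert.h>
#include <stdint.h>

#define HPET_TN_SETVAL	0x040u	/* bit 6 of Tn_CFG */

/* Hypothetical software model of one HPET channel in periodic mode. */
struct sketch_hpet_channel {
	uint32_t cfg;		/* models Tn_CFG */
	uint32_t comparator;	/* value compared against the main counter */
	uint32_t period;	/* ticks added after each match */
};

/* Model a write to Tn_CMP while the channel is in periodic mode. */
static void sketch_write_cmp(struct sketch_hpet_channel *ch, uint32_t val)
{
	if (ch->cfg & HPET_TN_SETVAL) {
		/* First write: load the comparator/accumulator. */
		ch->comparator = val;
		/* The bit clears itself after that write. */
		ch->cfg &= ~HPET_TN_SETVAL;
	} else {
		/* Second write, with SETVAL cleared: set the period. */
		ch->period = val;
	}
}

/* Mirrors the order of operations in hpet_set_comparator_periodic(). */
static void sketch_set_comparator_periodic(struct sketch_hpet_channel *ch,
					   uint32_t cmp, uint32_t period)
{
	ch->cfg |= HPET_TN_SETVAL;
	sketch_write_cmp(ch, cmp);
	/* The kernel helper calls udelay(1) here; the model has no timing. */
	sketch_write_cmp(ch, period);
}
```

Starting from a zeroed channel, calling
sketch_set_comparator_periodic(&ch, 1000, 250) leaves the comparator at 1000,
the period at 250, and HPET_TN_SETVAL cleared.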

Cc: Andi Kleen <ak@linux.intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-dev@lists.ozlabs.org
Originally-by: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
When programming the HPET channel in periodic mode, a udelay(1) between
the two successive writes to HPET_Tn_CMP was introduced in commit
e9e2cdb41241 ("[PATCH] clockevents: i386 drivers"). The commit message
does not give any reason for the delay, and the hardware specification
does not seem to require it. The refactoring in this patch simply carries
the delay over.
---
Changes since v6:
 * Reworded the commit message for clarity.

Changes since v5:
 * None

Changes since v4:
 * Implement function only for periodic mode. This removed extra logic to
   use a non-zero period value as a proxy for periodic mode
   programming. (Thomas)
 * Added a comment on the history of the udelay() when programming the
   channel in periodic mode. (Ashok)

Changes since v3:
 * Added back a missing hpet_writel() for time configuration.

Changes since v2:
 * Introduced this patch.

Changes since v1:
 * N/A
---
 arch/x86/include/asm/hpet.h |  2 ++
 arch/x86/kernel/hpet.c      | 49 ++++++++++++++++++++++++++++---------
 2 files changed, 39 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index be9848f0883f..486e001413c7 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -74,6 +74,8 @@ extern void hpet_disable(void);
 extern unsigned int hpet_readl(unsigned int a);
 extern void hpet_writel(unsigned int d, unsigned int a);
 extern void force_hpet_resume(void);
+extern void hpet_set_comparator_periodic(int channel, unsigned int cmp,
+					 unsigned int period);
 
 #ifdef CONFIG_HPET_EMULATE_RTC
 
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index 8303fb1b63a9..3563849c2290 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -294,6 +294,39 @@ static void hpet_enable_legacy_int(void)
 	hpet_legacy_int_enabled = true;
 }
 
+/**
+ * hpet_set_comparator_periodic() - Helper function to set periodic channel
+ * @channel:	The HPET channel
+ * @cmp:	The value to be written to the comparator/accumulator
+ * @period:	Number of ticks per period
+ *
+ * Helper function for updating comparator, accumulator and period values.
+ *
+ * In periodic mode, HPET needs HPET_TN_SETVAL to be set before writing
+ * to the Tn_CMP to update the accumulator. Then, HPET needs a second
+ * write (with HPET_TN_SETVAL cleared) to Tn_CMP to set the period.
+ * The HPET_TN_SETVAL bit is automatically cleared after the first write.
+ *
+ * This function incurs a 1-microsecond delay. However, it is expected to be
+ * called only once (or when reprogramming the timer), as it deals with a
+ * periodic timer channel.
+ *
+ * See the following documents:
+ *   - Intel IA-PC HPET (High Precision Event Timers) Specification
+ *   - AMD-8111 HyperTransport I/O Hub Data Sheet, Publication # 24674
+ */
+void hpet_set_comparator_periodic(int channel, unsigned int cmp, unsigned int period)
+{
+	unsigned int v = hpet_readl(HPET_Tn_CFG(channel));
+
+	hpet_writel(v | HPET_TN_SETVAL, HPET_Tn_CFG(channel));
+
+	hpet_writel(cmp, HPET_Tn_CMP(channel));
+
+	udelay(1);
+	hpet_writel(period, HPET_Tn_CMP(channel));
+}
+
 static int hpet_clkevt_set_state_periodic(struct clock_event_device *evt)
 {
 	unsigned int channel = clockevent_to_channel(evt)->num;
@@ -306,19 +339,11 @@ static int hpet_clkevt_set_state_periodic(struct clock_event_device *evt)
 	now = hpet_readl(HPET_COUNTER);
 	cmp = now + (unsigned int)delta;
 	cfg = hpet_readl(HPET_Tn_CFG(channel));
-	cfg |= HPET_TN_ENABLE | HPET_TN_PERIODIC | HPET_TN_SETVAL |
-	       HPET_TN_32BIT;
+	cfg |= HPET_TN_ENABLE | HPET_TN_PERIODIC | HPET_TN_32BIT;
 	hpet_writel(cfg, HPET_Tn_CFG(channel));
-	hpet_writel(cmp, HPET_Tn_CMP(channel));
-	udelay(1);
-	/*
-	 * HPET on AMD 81xx needs a second write (with HPET_TN_SETVAL
-	 * cleared) to T0_CMP to set the period. The HPET_TN_SETVAL
-	 * bit is automatically cleared after the first write.
-	 * (See AMD-8111 HyperTransport I/O Hub Data Sheet,
-	 * Publication # 24674)
-	 */
-	hpet_writel((unsigned int)delta, HPET_Tn_CMP(channel));
+
+	hpet_set_comparator_periodic(channel, cmp, (unsigned int)delta);
+
 	hpet_start_counter();
 	hpet_print_config();
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v7 14/24] x86/hpet: Prepare IRQ assignments to use the X86_ALLOC_AS_NMI flag
  2023-03-01 23:47 ` Ricardo Neri
@ 2023-03-01 23:47   ` Ricardo Neri
  -1 siblings, 0 replies; 52+ messages in thread
From: Ricardo Neri @ 2023-03-01 23:47 UTC (permalink / raw)
  To: Tony Luck, Dave Hansen, Rafael J. Wysocki, Reinette Chatre,
	Dan Williams, Len Brown
  Cc: Andi Kleen, Stephane Eranian, Ravi V. Shankar, Ricardo Neri,
	linuxppc-dev, iommu, linux-kernel, Ricardo Neri

The flag X86_IRQ_ALLOC_AS_NMI indicates that the interrupts to be allocated
in an interrupt domain need to be configured as NMIs. Add an as_nmi argument
to hpet_assign_irq(). The HPET clock events do not need NMIs, but the HPET
hardlockup detector does.

Cc: Andi Kleen <ak@linux.intel.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-dev@lists.ozlabs.org
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v6:
 * None

Changes since v5:
 * Introduced this patch.

Changes since v4:
 * N/A

Changes since v3:
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 arch/x86/kernel/hpet.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index 3563849c2290..f42ce3fc4528 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -618,7 +618,7 @@ static inline int hpet_dev_id(struct irq_domain *domain)
 }
 
 static int hpet_assign_irq(struct irq_domain *domain, struct hpet_channel *hc,
-			   int dev_num)
+			   int dev_num, bool as_nmi)
 {
 	struct irq_alloc_info info;
 
@@ -627,6 +627,8 @@ static int hpet_assign_irq(struct irq_domain *domain, struct hpet_channel *hc,
 	info.data = hc;
 	info.devid = hpet_dev_id(domain);
 	info.hwirq = dev_num;
+	if (as_nmi)
+		info.flags |= X86_IRQ_ALLOC_AS_NMI;
 
 	return irq_domain_alloc_irqs(domain, 1, NUMA_NO_NODE, &info);
 }
@@ -755,7 +757,7 @@ static void __init hpet_select_clockevents(void)
 
 		sprintf(hc->name, "hpet%d", i);
 
-		irq = hpet_assign_irq(hpet_domain, hc, hc->num);
+		irq = hpet_assign_irq(hpet_domain, hc, hc->num, false);
 		if (irq <= 0)
 			continue;
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v7 15/24] x86/hpet: Reserve an HPET channel for the hardlockup detector
  2023-03-01 23:47 ` Ricardo Neri
@ 2023-03-01 23:47   ` Ricardo Neri
  -1 siblings, 0 replies; 52+ messages in thread
From: Ricardo Neri @ 2023-03-01 23:47 UTC (permalink / raw)
  To: Tony Luck, Dave Hansen, Rafael J. Wysocki, Reinette Chatre,
	Dan Williams, Len Brown
  Cc: Andi Kleen, Stephane Eranian, Ravi V. Shankar, Ricardo Neri,
	linuxppc-dev, iommu, linux-kernel, Ricardo Neri

Create a new HPET_MODE_NMI_WATCHDOG mode category to reserve an HPET
channel for the hard lockup detector.

Only reserve the channel if the HPET frequency is sufficiently low to allow
32-bit register accesses and if Front Side Bus interrupt delivery (i.e.,
MSI interrupts) is supported.
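
The frequency limit can be checked with a little arithmetic. The sketch below
is illustrative and uses hypothetical sketch_* helpers. It assumes, as the
comment in the patch states, a maximum watch_thresh of 60 seconds and a 15%
safety margin: a 32-bit counter must not wrap more often than once per
threshold period.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Highest frequency (in Hz) at which a 32-bit counter takes at least
 * max_thresh_sec seconds to wrap around.
 */
static uint64_t sketch_max_hpet_freq(uint32_t max_thresh_sec)
{
	assert(max_thresh_sec > 0);
	return (1ULL << 32) / max_thresh_sec;
}

/* Scale a frequency down to the given percentage (integer arithmetic). */
static uint64_t sketch_derated_freq(uint64_t freq_hz, unsigned int percent)
{
	return freq_hz * percent / 100;
}
```

With a 60-second threshold, sketch_max_hpet_freq(60) yields 71582788 Hz, and
85% of that is 60845369 Hz. HPET_HLD_MAX_FREQ in the patch (60845000) is that
value rounded down.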

Cc: Andi Kleen <ak@linux.intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-dev@lists.ozlabs.org
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v6:
 * Reworded the commit message for clarity.
 * Removed pointless global variable hld_data.

Changes since v5:
 * Added a check for the allowed maximum frequency of the HPET.
 * Added hpet_hld_free_timer() to properly free the reserved HPET channel
   if the initialization is not completed.
 * Call hpet_assign_irq() with as_nmi = true.
 * Relocated declarations of functions and data structures of the detector
   to not depend on CONFIG_HPET_TIMER.

Changes since v4:
 * Reworked timer reservation to use Thomas' rework on HPET channel
   management.
 * Removed hard-coded channel number for the hardlockup detector.
 * Provided more details on the sequence of HPET channel reservations.
   (Thomas Gleixner)
 * Only reserve a channel for the hardlockup detector if enabled via
   kernel command line. The function reserving the channel is called from
   hardlockup detector. (Thomas Gleixner)
 * Shorten the name of hpet_hardlockup_detector_get_timer() to
   hpet_hld_get_timer(). (Andi)
 * Simplify error handling when a channel is not found. (Tony)

Changes since v3:
 * None

Changes since v2:
 * None

Changes since v1:
 * None
---
 arch/x86/include/asm/hpet.h |  22 ++++++++
 arch/x86/kernel/hpet.c      | 100 ++++++++++++++++++++++++++++++++++++
 2 files changed, 122 insertions(+)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index 486e001413c7..5762bd0169a1 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -103,4 +103,26 @@ static inline int is_hpet_enabled(void) { return 0; }
 #define default_setup_hpet_msi	NULL
 
 #endif
+
+#ifdef CONFIG_X86_HARDLOCKUP_DETECTOR_HPET
+/**
+ * struct hpet_hld_data - Data needed to operate the detector
+ * @has_periodic:		The HPET channel supports periodic mode
+ * @channel:			HPET channel assigned to the detector
+ * @channel_priv:		Private data of the assigned channel
+ * @ticks_per_second:		Frequency of the HPET timer
+ * @irq:			IRQ number assigned to the HPET channel
+ */
+struct hpet_hld_data {
+	bool			has_periodic;
+	u32			channel;
+	struct hpet_channel	*channel_priv;
+	u64			ticks_per_second;
+	int			irq;
+};
+
+extern struct hpet_hld_data *hpet_hld_get_timer(void);
+extern void hpet_hld_free_timer(struct hpet_hld_data *hdata);
+#endif /* CONFIG_X86_HARDLOCKUP_DETECTOR_HPET */
+
 #endif /* _ASM_X86_HPET_H */
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
index f42ce3fc4528..97570426f324 100644
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -20,6 +20,7 @@ enum hpet_mode {
 	HPET_MODE_LEGACY,
 	HPET_MODE_CLOCKEVT,
 	HPET_MODE_DEVICE,
+	HPET_MODE_NMI_WATCHDOG,
 };
 
 struct hpet_channel {
@@ -216,6 +217,7 @@ static void __init hpet_reserve_platform_timers(void)
 			break;
 		case HPET_MODE_CLOCKEVT:
 		case HPET_MODE_LEGACY:
+		case HPET_MODE_NMI_WATCHDOG:
 			hpet_reserve_timer(&hd, hc->num);
 			break;
 		}
@@ -1498,3 +1500,101 @@ irqreturn_t hpet_rtc_interrupt(int irq, void *dev_id)
 }
 EXPORT_SYMBOL_GPL(hpet_rtc_interrupt);
 #endif
+
+#ifdef CONFIG_X86_HARDLOCKUP_DETECTOR_HPET
+
+/*
+ * We program the channel in 32-bit mode to reduce the number of register
+ * accesses. The maximum value of watch_thresh is 60 seconds. The HPET counter
+ * should not wrap around more frequently than that: its frequency must be less
+ * than 71.582788 MHz. For safety, limit the frequency to 85% of the maximum
+ * permitted frequency.
+ *
+ * The frequency of the HPET in most systems in the field is less than 24MHz.
+ */
+#define HPET_HLD_MAX_FREQ 60845000ULL
+
+/**
+ * hpet_hld_free_timer - Free the reserved channel for the hardlockup detector
+ * @hld_data:	Data structure representing the reserved channel.
+ *
+ * Returns: none
+ */
+void hpet_hld_free_timer(struct hpet_hld_data *hld_data)
+{
+	hld_data->channel_priv->mode = HPET_MODE_UNUSED;
+	hld_data->channel_priv->in_use = 0;
+	kfree(hld_data);
+}
+
+/**
+ * hpet_hld_get_timer - Get an HPET channel for the hardlockup detector
+ *
+ * Reserve an HPET channel if one is available, it supports FSB mode, and the
+ * HPET frequency is sufficiently low. This function is called by the
+ * hardlockup detector if enabled in the kernel command line.
+ *
+ * Returns: a pointer to the reserved channel's properties, or NULL on failure.
+ */
+struct hpet_hld_data *hpet_hld_get_timer(void)
+{
+	struct hpet_channel *hc = hpet_base.channels;
+	struct hpet_hld_data *hld_data;
+	int i, irq;
+
+	if (hpet_freq > HPET_HLD_MAX_FREQ)
+		return NULL;
+
+	for (i = 0; i < hpet_base.nr_channels; i++) {
+		hc = hpet_base.channels + i;
+
+		/*
+		 * Associate the first unused channel with the hardlockup
+		 * detector. Bail out if we cannot find one. This may happen if
+		 * the HPET clocksource has taken all the timers. The HPET
+		 * driver (/dev/hpet) has not taken any channels at this point.
+		 */
+		if (hc->mode == HPET_MODE_UNUSED)
+			break;
+	}
+
+	if (i == hpet_base.nr_channels)
+		return NULL;
+
+	if (!(hc->boot_cfg & HPET_TN_FSB_CAP))
+		return NULL;
+
+	hld_data = kzalloc(sizeof(*hld_data), GFP_KERNEL);
+	if (!hld_data)
+		return NULL;
+
+	hc->mode = HPET_MODE_NMI_WATCHDOG;
+	hc->in_use = 1;
+	hld_data->channel_priv = hc;
+
+	if (hc->boot_cfg & HPET_TN_PERIODIC_CAP)
+		hld_data->has_periodic = true;
+
+	if (!hpet_domain)
+		hpet_domain = hpet_create_irq_domain(hpet_blockid);
+
+	if (!hpet_domain)
+		goto err;
+
+	/* Assign an IRQ with NMI delivery mode. */
+	irq = hpet_assign_irq(hpet_domain, hc, hc->num, true);
+	if (irq <= 0)
+		goto err;
+
+	hc->irq = irq;
+	hld_data->irq = irq;
+	hld_data->channel = i;
+	hld_data->ticks_per_second = hpet_freq;
+
+	return hld_data;
+
+err:
+	hpet_hld_free_timer(hld_data);
+	return NULL;
+}
+#endif /* CONFIG_X86_HARDLOCKUP_DETECTOR_HPET */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v7 16/24] watchdog/hardlockup: Define a generic function to detect hardlockups
  2023-03-01 23:47 ` Ricardo Neri
@ 2023-03-01 23:47   ` Ricardo Neri
  -1 siblings, 0 replies; 52+ messages in thread
From: Ricardo Neri @ 2023-03-01 23:47 UTC (permalink / raw)
  To: Tony Luck, Dave Hansen, Rafael J. Wysocki, Reinette Chatre,
	Dan Williams, Len Brown
  Cc: Andi Kleen, Stephane Eranian, Ravi V. Shankar, Ricardo Neri,
	linuxppc-dev, iommu, linux-kernel, Ricardo Neri, Nicholas Piggin,
	Andrew Morton

The procedure to detect hardlockups is independent of the source of the
non-maskable interrupt that drives it. Place it in a separate, generic
function to be invoked by various implementations of the NMI watchdog.

Move the bulk of watchdog_overflow_callback() to the new function
inspect_for_hardlockups(). This function can then be called from the
applicable NMI handlers. No functional changes.
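
The split can be pictured with a small userspace model: one source-independent
check and two callers that do their own source-specific work before
delegating to it. This is only a sketch. struct pt_regs and
struct perf_event_stub are stand-ins so the code builds outside the kernel,
and timer_nmi_handler() is a hypothetical second NMI source, not something
this patch adds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the kernel's struct pt_regs. */
struct pt_regs {
	unsigned long ip;
};

/* Stand-in for the part of struct perf_event the callback touches. */
struct perf_event_stub {
	int interrupts;
};

static int inspections;	/* number of checks run, for demonstration only */

/* Source-independent check; the real one lives in kernel/watchdog_hld.c. */
static void inspect_for_hardlockups(struct pt_regs *regs)
{
	assert(regs != NULL);
	inspections++;
}

/* perf-style caller: perf-specific bookkeeping, then the generic check. */
static void watchdog_overflow_callback(struct perf_event_stub *event,
				       struct pt_regs *regs)
{
	event->interrupts = 0;	/* keep the event from being throttled */
	inspect_for_hardlockups(regs);
}

/* Hypothetical timer-driven caller: claim the NMI, then the same check. */
static bool timer_nmi_handler(struct pt_regs *regs, bool nmi_is_mine)
{
	if (!nmi_is_mine)
		return false;

	inspect_for_hardlockups(regs);
	return true;
}
```

Both callers end up in the same function, so the detection logic does not need
to know which interrupt source fired.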

Cc: Andi Kleen <ak@linux.intel.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-dev@lists.ozlabs.org
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v6:
 * None

Changes since v5:
 * None

Changes since v4:
 * None

Changes since v3:
 * None

Changes since v2:
 * None

Changes since v1:
 * None
---
 include/linux/nmi.h   |  1 +
 kernel/watchdog_hld.c | 18 +++++++++++-------
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index 048c0b9aa623..75038cb2710e 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -209,6 +209,7 @@ int proc_nmi_watchdog(struct ctl_table *, int , void *, size_t *, loff_t *);
 int proc_soft_watchdog(struct ctl_table *, int , void *, size_t *, loff_t *);
 int proc_watchdog_thresh(struct ctl_table *, int , void *, size_t *, loff_t *);
 int proc_watchdog_cpumask(struct ctl_table *, int, void *, size_t *, loff_t *);
+void inspect_for_hardlockups(struct pt_regs *regs);
 
 #ifdef CONFIG_HAVE_ACPI_APEI_NMI
 #include <asm/nmi.h>
diff --git a/kernel/watchdog_hld.c b/kernel/watchdog_hld.c
index 247bf0b1582c..b352e507b17f 100644
--- a/kernel/watchdog_hld.c
+++ b/kernel/watchdog_hld.c
@@ -106,14 +106,8 @@ static struct perf_event_attr wd_hw_attr = {
 	.disabled	= 1,
 };
 
-/* Callback function for perf event subsystem */
-static void watchdog_overflow_callback(struct perf_event *event,
-				       struct perf_sample_data *data,
-				       struct pt_regs *regs)
+void inspect_for_hardlockups(struct pt_regs *regs)
 {
-	/* Ensure the watchdog never gets throttled */
-	event->hw.interrupts = 0;
-
 	if (__this_cpu_read(watchdog_nmi_touch) == true) {
 		__this_cpu_write(watchdog_nmi_touch, false);
 		return;
@@ -163,6 +157,16 @@ static void watchdog_overflow_callback(struct perf_event *event,
 	return;
 }
 
+/* Callback function for perf event subsystem */
+static void watchdog_overflow_callback(struct perf_event *event,
+				       struct perf_sample_data *data,
+				       struct pt_regs *regs)
+{
+	/* Ensure the watchdog never gets throttled */
+	event->hw.interrupts = 0;
+	inspect_for_hardlockups(regs);
+}
+
 static int hardlockup_detector_event_create(void)
 {
 	unsigned int cpu = smp_processor_id();
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v7 17/24] watchdog/hardlockup: Decouple the hardlockup detector from perf
  2023-03-01 23:47 ` Ricardo Neri
@ 2023-03-01 23:47   ` Ricardo Neri
  -1 siblings, 0 replies; 52+ messages in thread
From: Ricardo Neri @ 2023-03-01 23:47 UTC (permalink / raw)
  To: Tony Luck, Dave Hansen, Rafael J. Wysocki, Reinette Chatre,
	Dan Williams, Len Brown
  Cc: Andi Kleen, Stephane Eranian, Ravi V. Shankar, Ricardo Neri,
	linuxppc-dev, iommu, linux-kernel, Ricardo Neri, Nicholas Piggin,
	Andrew Morton

The current default implementation of the hardlockup detector assumes that
it is implemented using perf events. However, the hardlockup detector can
be driven by other sources of non-maskable interrupts (e.g., a properly
configured timer).

Group and wrap in #ifdef CONFIG_HARDLOCKUP_DETECTOR_PERF all the code
specific to perf: creating and managing perf events, and stopping and
starting the perf-based detector.

The generic portion of the detector (monitor the timers' thresholds, check
timestamps and detect hardlockups as well as the implementation of
arch_touch_nmi_watchdog()) is now selected with the new intermediate config
symbol CONFIG_HARDLOCKUP_DETECTOR_CORE.

The perf-based implementation of the detector selects the new intermediate
symbol. Other implementations should do the same.

Cc: Andi Kleen <ak@linux.intel.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-dev@lists.ozlabs.org
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v6:
 * None

Changes since v5:
 * None

Changes since v4:
 * None

Changes since v3:
 * Squashed into this patch a previous patch to make
   arch_touch_nmi_watchdog() part of the core detector code.

Changes since v2:
 * Undid split of the generic hardlockup detector into a separate file.
   (Thomas Gleixner)
 * Added a new intermediate symbol CONFIG_HARDLOCKUP_DETECTOR_CORE to
   select generic parts of the detector (Paul E. McKenney,
   Thomas Gleixner).

Changes since v1:
 * Make the generic detector code with CONFIG_HARDLOCKUP_DETECTOR.
---
 include/linux/nmi.h   |  5 ++++-
 kernel/Makefile       |  2 +-
 kernel/watchdog_hld.c | 32 ++++++++++++++++++++------------
 lib/Kconfig.debug     |  4 ++++
 4 files changed, 29 insertions(+), 14 deletions(-)

diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index 75038cb2710e..a38c4509f9eb 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -94,8 +94,11 @@ static inline void hardlockup_detector_disable(void) {}
 # define NMI_WATCHDOG_SYSCTL_PERM	0444
 #endif
 
-#if defined(CONFIG_HARDLOCKUP_DETECTOR_PERF)
+#if defined(CONFIG_HARDLOCKUP_DETECTOR_CORE)
 extern void arch_touch_nmi_watchdog(void);
+#endif
+
+#if defined(CONFIG_HARDLOCKUP_DETECTOR_PERF)
 extern void hardlockup_detector_perf_stop(void);
 extern void hardlockup_detector_perf_restart(void);
 extern void hardlockup_detector_perf_disable(void);
diff --git a/kernel/Makefile b/kernel/Makefile
index 10ef068f598d..f35fad36cf81 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -91,7 +91,7 @@ obj-$(CONFIG_FAIL_FUNCTION) += fail_function.o
 obj-$(CONFIG_KGDB) += debug/
 obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
 obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
-obj-$(CONFIG_HARDLOCKUP_DETECTOR_PERF) += watchdog_hld.o
+obj-$(CONFIG_HARDLOCKUP_DETECTOR_CORE) += watchdog_hld.o
 obj-$(CONFIG_SECCOMP) += seccomp.o
 obj-$(CONFIG_RELAY) += relay.o
 obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
diff --git a/kernel/watchdog_hld.c b/kernel/watchdog_hld.c
index b352e507b17f..bb6435978c46 100644
--- a/kernel/watchdog_hld.c
+++ b/kernel/watchdog_hld.c
@@ -22,12 +22,8 @@
 
 static DEFINE_PER_CPU(bool, hard_watchdog_warn);
 static DEFINE_PER_CPU(bool, watchdog_nmi_touch);
-static DEFINE_PER_CPU(struct perf_event *, watchdog_ev);
-static DEFINE_PER_CPU(struct perf_event *, dead_event);
-static struct cpumask dead_events_mask;
 
 static unsigned long hardlockup_allcpu_dumped;
-static atomic_t watchdog_cpus = ATOMIC_INIT(0);
 
 notrace void arch_touch_nmi_watchdog(void)
 {
@@ -98,14 +94,6 @@ static inline bool watchdog_check_timestamp(void)
 }
 #endif
 
-static struct perf_event_attr wd_hw_attr = {
-	.type		= PERF_TYPE_HARDWARE,
-	.config		= PERF_COUNT_HW_CPU_CYCLES,
-	.size		= sizeof(struct perf_event_attr),
-	.pinned		= 1,
-	.disabled	= 1,
-};
-
 void inspect_for_hardlockups(struct pt_regs *regs)
 {
 	if (__this_cpu_read(watchdog_nmi_touch) == true) {
@@ -157,6 +145,24 @@ void inspect_for_hardlockups(struct pt_regs *regs)
 	return;
 }
 
+#ifdef CONFIG_HARDLOCKUP_DETECTOR_PERF
+#undef pr_fmt
+#define pr_fmt(fmt) "NMI perf watchdog: " fmt
+
+static DEFINE_PER_CPU(struct perf_event *, watchdog_ev);
+static DEFINE_PER_CPU(struct perf_event *, dead_event);
+static struct cpumask dead_events_mask;
+
+static atomic_t watchdog_cpus = ATOMIC_INIT(0);
+
+static struct perf_event_attr wd_hw_attr = {
+	.type		= PERF_TYPE_HARDWARE,
+	.config		= PERF_COUNT_HW_CPU_CYCLES,
+	.size		= sizeof(struct perf_event_attr),
+	.pinned		= 1,
+	.disabled	= 1,
+};
+
 /* Callback function for perf event subsystem */
 static void watchdog_overflow_callback(struct perf_event *event,
 				       struct perf_sample_data *data,
@@ -298,3 +304,5 @@ int __init hardlockup_detector_perf_init(void)
 	}
 	return ret;
 }
+
+#endif /* CONFIG_HARDLOCKUP_DETECTOR_PERF */
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index c8b379e2e9ad..1ff53c5995b1 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1025,9 +1025,13 @@ config BOOTPARAM_SOFTLOCKUP_PANIC
 
 	  Say N if unsure.
 
+config HARDLOCKUP_DETECTOR_CORE
+	bool
+
 config HARDLOCKUP_DETECTOR_PERF
 	bool
 	select SOFTLOCKUP_DETECTOR
+	select HARDLOCKUP_DETECTOR_CORE
 
 #
 # Enables a timestamp based low pass filter to compensate for perf based
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v7 18/24] init/main: Delay initialization of the lockup detector after smp_init()
  2023-03-01 23:47 ` Ricardo Neri
@ 2023-03-01 23:47   ` Ricardo Neri
  -1 siblings, 0 replies; 52+ messages in thread
From: Ricardo Neri @ 2023-03-01 23:47 UTC (permalink / raw)
  To: Tony Luck, Dave Hansen, Rafael J. Wysocki, Reinette Chatre,
	Dan Williams, Len Brown
  Cc: Andi Kleen, Stephane Eranian, Ravi V. Shankar, Ricardo Neri,
	linuxppc-dev, iommu, linux-kernel, Ricardo Neri, Nicholas Piggin,
	Andrew Morton

Certain implementations of the hardlockup detector require support for
Inter-Processor Interrupt shorthands. On x86, support for these can only
be determined after all the possible CPUs have booted once (in smp_init()).
Other architectures may not need such a check.

lockup_detector_init() only initializes the data structures of the lockup
detector. Hence, nothing in smp_init() depends on it having run first.

Cc: Andi Kleen <ak@linux.intel.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-dev@lists.ozlabs.org
Reviewed-by: Tony Luck <tony.luck@intel.com>
Acked-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v6:
 * None

Changes since v5:
 * Introduced this patch

Changes since v4:
 * N/A

Changes since v3:
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 init/main.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/init/main.c b/init/main.c
index 4425d1783d5c..d0642a49e2e8 100644
--- a/init/main.c
+++ b/init/main.c
@@ -1620,9 +1620,11 @@ static noinline void __init kernel_init_freeable(void)
 
 	rcu_init_tasks_generic();
 	do_pre_smp_initcalls();
-	lockup_detector_init();
 
 	smp_init();
+
+	lockup_detector_init();
+
 	sched_init_smp();
 
 	padata_init();
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v7 19/24] x86/watchdog/hardlockup: Add an HPET-based hardlockup detector
  2023-03-01 23:47 ` Ricardo Neri
@ 2023-03-01 23:47   ` Ricardo Neri
  -1 siblings, 0 replies; 52+ messages in thread
From: Ricardo Neri @ 2023-03-01 23:47 UTC (permalink / raw)
  To: Tony Luck, Dave Hansen, Rafael J. Wysocki, Reinette Chatre,
	Dan Williams, Len Brown
  Cc: Andi Kleen, Stephane Eranian, Ravi V. Shankar, Ricardo Neri,
	linuxppc-dev, iommu, linux-kernel, Ricardo Neri

Implement a hardlockup detector that uses an HPET channel as the source
of the non-maskable interrupt. Implement the basic functionality to
start, stop, and configure the timer.

Designate as the handling CPU one of the CPUs that the detector monitors.
Use it to service the NMI from the HPET channel. When servicing the HPET
NMI, issue an inter-processor interrupt to the rest of the monitored CPUs
to look for hardlockups. Only enable the detector if IPI shorthands are
enabled in the system.
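
The fan-out described above can be modeled in userspace with two bitmasks
standing in for cpumasks. This is a sketch under assumptions, not the code of
this patch: all names are invented for the model, the shorthand IPI is assumed
to reach every CPU except the sender, and the handling CPU is assumed to check
itself before notifying the others.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 8	/* arbitrary size for the model */

static uint32_t monitored_mask;	/* CPUs the detector watches */
static uint32_t inspect_mask;	/* CPUs that must check in this round */
static unsigned int checks_run[NR_CPUS];

/* Models the NMI IPI handler; runs on every CPU that received the IPI. */
static void ipi_nmi_handler(int cpu)
{
	uint32_t bit;

	assert(cpu >= 0 && cpu < NR_CPUS);
	bit = UINT32_C(1) << cpu;

	/* Not monitored, or the NMI came from some other source. */
	if (!(inspect_mask & bit))
		return;

	inspect_mask &= ~bit;
	checks_run[cpu]++;	/* stand-in for the hardlockup check */
}

/* Models the HPET NMI handler; runs only on the handling CPU. */
static void hpet_nmi_handler(int handling_cpu)
{
	int cpu;

	inspect_mask = monitored_mask;
	ipi_nmi_handler(handling_cpu);	/* the handling CPU checks itself */

	/* All-but-self shorthand: every other CPU gets the NMI. */
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpu != handling_cpu)
			ipi_nmi_handler(cpu);
}
```

In this model, unmonitored CPUs still receive the NMI but return immediately
because their bit is not set, which is the disturbance the nohz_full
restriction in this patch is meant to avoid.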

During operation, the HPET registers are only accessed to kick the timer.
This operation can be avoided if the detector gets a periodic HPET
channel.
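
For a channel without periodic mode, kicking the timer amounts to writing a
new comparator value one period ahead of the current counter. The sketch
below shows that arithmetic for a 32-bit comparator. It is an illustration
only: next_compare() is a hypothetical helper, and computing the period as
ticks_per_second times the threshold in seconds is an assumption about how
the detector derives it.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Next value for a 32-bit one-shot comparator. The sum is truncated to
 * 32 bits on purpose: the hardware counter wraps the same way, so the
 * comparator still matches one period after @counter_now.
 */
static uint32_t next_compare(uint32_t counter_now, uint64_t ticks_per_second,
			     unsigned int thresh_secs)
{
	uint64_t period = ticks_per_second * thresh_secs;

	/* One period must fit in the 32-bit counter range. */
	assert(period <= UINT32_MAX);

	return (uint32_t)(counter_now + period);
}
```

The assertion is the reason for limiting the HPET frequency elsewhere in this
series: if a period exceeded 2^32 ticks, the truncated comparator would match
too early.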

Since we use IPI shorthands, all CPUs get the IPI NMI. This would disturb
the isolated CPUs specified in the nohz_full command-line parameter. In
such a case, do not enable this hardlockup detector implementation.

The detector is not functional at this stage. A subsequent changeset will
invoke the interfaces implemented in this changeset to operate the
detector. Another subsequent changeset implements logic to determine if
the HPET timer caused the NMI. For now, implement a stub function.

Cc: Andi Kleen <ak@linux.intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-dev@lists.ozlabs.org
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v6:
 * Added missing header asm/nmi.h. linux/nmi.h #includes it only if
   CONFIG_HAVE_NMI_WATCHDOG is selected. Such option is not selected for
   ARCH=x86.
 * Removed IRQF_NOBALANCING from request_irq(). The NMI vector already
   is configured to prevent balancing.
 * Added check to not use the detector when the nohz_full= command-line
   parameter is present.
 * Since I dropped the patch that added a new NMI_WATCHDOG list of
   handlers, I added my handler to the NMI_LOCAL list. This makes
   sense since the watchdog NMIs are now delivered via IPIs. The
   only exception is the HPET NMI, but that interrupt can only be handled
   by the monitoring CPU.

Changes since v5:
 * Squashed a previously separate patch to support interrupt remapping into
   this patch. There is no need to handle interrupt remapping separately.
   All the necessary plumbing is done in the interrupt subsystem. Now it
   uses request_irq().
 * Use IPI shorthands to send an NMI to the CPUs being monitored. (Thomas)
 * Added extra check to only use the HPET hardlockup detector if the IPI
   shorthands are enabled. (Thomas)
 * Relocated flushing of outstanding interrupts from enable_timer() to
   disable_timer(). On some systems, making any change in the
   configuration of the HPET channel causes it to issue an interrupt.
 * Added a new cpumask to function as a per-cpu test bit to determine if
   a CPU should check for hardlockups.
 * Dropped pointless X86_64 || X86_32 check in Kconfig. (Tony)
 * Dropped pointless dependency on CONFIG_HPET.
 * Added dependency on CONFIG_GENERIC_MSI_IRQ, needed to build the [|IR]-
   HPET-MSI irq_chip.
 * Added hardlockup_detector_hpet_start() to be used when tsc_khz is
   recalibrated.
 * Reworked the periodic setting of the HPET channel. Rather than changing
   it every time the channel is disabled or enabled, do it only once. While
   at it, wrap the code in an initial setup function.
 * Implemented hardlockup_detector_hpet_start() to be called when tsc_khz
   is refined.
 * Enhanced inline comments for clarity.
 * Added missing #include files.
 * Relocated function declarations to not depend on CONFIG_HPET_TIMER.

Changes since v4:
 * Dropped hpet_hld_data.enabled_cpus and instead use cpumask_weight().
 * Renamed hpet_hld_data.cpu_monitored_mask to
   hld_data_data.cpu_monitored_mask and converted it to cpumask_var_t.
 * Flushed out any outstanding interrupt before enabling the HPET channel.
 * Removed unnecessary MSI_DATA_LEVEL_ASSERT from the MSI message.
 * Added comments in hardlockup_detector_nmi_handler() to explain how
   CPUs are targeted for an IPI.
 * Updated code to only issue an IPI when needed (i.e., there are monitored
   CPUs to be inspected via an IPI).
 * Reworked hardlockup_detector_hpet_init() for readability.
 * Now reserve the cpumasks in the hardlockup detector code and not in the
   generic HPET code.
 * Handled the case of watchdog_thresh = 0 when disabling the detector.
 * Made this detector available to i386.
 * Reworked logic to kick the timer to remove a local variable. (Andi)
 * Added a comment on what type of timer channel will be assigned to the
   detector. (Andi)
 * Reworded prompt comment in Kconfig. (Andi)
 * Removed unneeded switch to level interrupt mode when disabling the
   timer. (Andi)
 * Disabled the HPET timer to avoid a race between an incoming interrupt
   and an update of the MSI destination ID. (Ashok)
 * Corrected a typo in an inline comment. (Tony)
 * Made the HPET hardlockup detector depend on HARDLOCKUP_DETECTOR instead
   of selecting it.

Changes since v3:
 * Fixed typo in Kconfig.debug. (Randy Dunlap)
 * Added missing slab.h to include the definition of kfree to fix a build
   break.

Changes since v2:
 * Removed use of struct cpumask in favor of a variable length array in
   conjunction with kzalloc. (Peter Zijlstra)
 * Removed redundant documentation of functions. (Thomas Gleixner)
 * Added CPU as argument to hardlockup_detector_hpet_enable()/disable().
   (Thomas Gleixner)

Changes since v1:
 * Do not target CPUs in a round-robin manner. Instead, the HPET timer
   always targets the same CPU; other CPUs are monitored via an
   interprocessor interrupt.
 * Dropped support for IO APIC interrupts and instead use only MSI
   interrupts.
 * Removed use of generic irq code to set interrupt affinity and NMI
   delivery. Instead, configure the interrupt directly in HPET registers.
   (Thomas Gleixner)
 * Fixed unconditional return NMI_HANDLED when the HPET timer is
   programmed for FSB/MSI delivery. (Peter Zijlstra)
---
 arch/x86/Kconfig.debug              |   8 +
 arch/x86/include/asm/hpet.h         |  21 ++
 arch/x86/kernel/Makefile            |   1 +
 arch/x86/kernel/watchdog_hld_hpet.c | 380 ++++++++++++++++++++++++++++
 4 files changed, 410 insertions(+)
 create mode 100644 arch/x86/kernel/watchdog_hld_hpet.c

diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index bdfe08f1a930..b4dced142116 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -110,6 +110,14 @@ config IOMMU_LEAK
 config HAVE_MMIOTRACE_SUPPORT
 	def_bool y
 
+config X86_HARDLOCKUP_DETECTOR_HPET
+	bool "HPET-based hardlockup detector"
+	select HARDLOCKUP_DETECTOR_CORE
+	depends on HARDLOCKUP_DETECTOR && HPET_TIMER && GENERIC_MSI_IRQ
+	help
+	  Say y to drive the hardlockup detector using the High-Precision Event
+	  Timer instead of performance counters.
+
 config X86_DECODER_SELFTEST
 	bool "x86 instruction decoder selftest"
 	depends on DEBUG_KERNEL && INSTRUCTION_DECODER
diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index 5762bd0169a1..c88901744848 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -105,6 +105,8 @@ static inline int is_hpet_enabled(void) { return 0; }
 #endif
 
 #ifdef CONFIG_X86_HARDLOCKUP_DETECTOR_HPET
+#include <linux/cpumask.h>
+
 /**
  * struct hpet_hld_data - Data needed to operate the detector
  * @has_periodic:		The HPET channel supports periodic mode
@@ -112,6 +114,10 @@ static inline int is_hpet_enabled(void) { return 0; }
  * @channe_priv:		Private data of the assigned channel
  * @ticks_per_second:		Frequency of the HPET timer
  * @irq:			IRQ number assigned to the HPET channel
+ * @handling_cpu:		CPU handling the HPET interrupt
+ * @monitored_cpumask:		CPUs monitored by the hardlockup detector
+ * @inspect_cpumask:		CPUs that will be inspected at a given time.
+ *				Each CPU clears itself upon inspection.
  */
 struct hpet_hld_data {
 	bool			has_periodic;
@@ -119,10 +125,25 @@ struct hpet_hld_data {
 	struct hpet_channel	*channel_priv;
 	u64			ticks_per_second;
 	int			irq;
+	u32			handling_cpu;
+	cpumask_var_t		monitored_cpumask;
+	cpumask_var_t		inspect_cpumask;
 };
 
 extern struct hpet_hld_data *hpet_hld_get_timer(void);
 extern void hpet_hld_free_timer(struct hpet_hld_data *hdata);
+int hardlockup_detector_hpet_init(void);
+void hardlockup_detector_hpet_start(void);
+void hardlockup_detector_hpet_stop(void);
+void hardlockup_detector_hpet_enable(unsigned int cpu);
+void hardlockup_detector_hpet_disable(unsigned int cpu);
+#else
+static inline int hardlockup_detector_hpet_init(void)
+{ return -ENODEV; }
+static inline void hardlockup_detector_hpet_start(void) {}
+static inline void hardlockup_detector_hpet_stop(void) {}
+static inline void hardlockup_detector_hpet_enable(unsigned int cpu) {}
+static inline void hardlockup_detector_hpet_disable(unsigned int cpu) {}
 #endif /* CONFIG_X86_HARDLOCKUP_DETECTOR_HPET */
 
 #endif /* _ASM_X86_HPET_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index dd61752f4c96..58eb858f33ff 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -109,6 +109,7 @@ obj-$(CONFIG_VM86)		+= vm86_32.o
 obj-$(CONFIG_EARLY_PRINTK)	+= early_printk.o
 
 obj-$(CONFIG_HPET_TIMER) 	+= hpet.o
+obj-$(CONFIG_X86_HARDLOCKUP_DETECTOR_HPET) += watchdog_hld_hpet.o
 
 obj-$(CONFIG_AMD_NB)		+= amd_nb.o
 obj-$(CONFIG_DEBUG_NMI_SELFTEST) += nmi_selftest.o
diff --git a/arch/x86/kernel/watchdog_hld_hpet.c b/arch/x86/kernel/watchdog_hld_hpet.c
new file mode 100644
index 000000000000..b583d3180ae0
--- /dev/null
+++ b/arch/x86/kernel/watchdog_hld_hpet.c
@@ -0,0 +1,380 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * A hardlockup detector driven by an HPET channel.
+ *
+ * Copyright (C) Intel Corporation 2023
+ *
+ * An HPET channel is reserved for the detector. The channel issues an NMI to
+ * one of the CPUs in @watchdog_allowed_mask. This CPU monitors itself for
+ * hardlockups and sends an NMI IPI to the rest of the CPUs in the system.
+ *
+ * The detector uses IPI shorthands. Thus, all CPUs in the system get the NMI
+ * (offline CPUs also get the NMI but they "ignore" it). A cpumask is used to
+ * specify whether a CPU must check for hardlockups.
+ *
+ * The NMI also disturbs isolated CPUs. The detector fails to initialize if
+ * tick_nohz_full is enabled.
+ */
+
+#define pr_fmt(fmt) "NMI hpet watchdog: " fmt
+
+#include <linux/cpumask.h>
+#include <linux/interrupt.h>
+#include <linux/jump_label.h>
+#include <linux/nmi.h>
+#include <linux/printk.h>
+#include <linux/slab.h>
+#include <linux/tick.h>
+
+#include <asm/apic.h>
+#include <asm/hpet.h>
+#include <asm/nmi.h>
+#include <asm/tsc.h>
+
+#include "apic/local.h"
+
+static struct hpet_hld_data *hld_data;
+
+static void __init setup_hpet_channel(struct hpet_hld_data *hdata)
+{
+	u32 v;
+
+	v = hpet_readl(HPET_Tn_CFG(hdata->channel));
+	if (hdata->has_periodic)
+		v |= HPET_TN_PERIODIC;
+	else
+		v &= ~HPET_TN_PERIODIC;
+
+	/*
+	 * Use 32-bit mode to limit the number of register accesses. If we are
+	 * here the HPET frequency is sufficiently low to accommodate this mode.
+	 */
+	v |= HPET_TN_32BIT;
+
+	/* If we are here, FSB mode is supported. */
+	v |= HPET_TN_FSB;
+
+	hpet_writel(v, HPET_Tn_CFG(hdata->channel));
+}
+
+/**
+ * kick_timer() - Reprogram timer to expire in the future
+ * @hdata:	A data structure describing the HPET channel
+ * @force:	Force reprogramming
+ *
+ * Reprogram the timer to expire in watchdog_thresh seconds in the future.
+ * If the timer supports periodic mode, it is not kicked unless @force is
+ * true.
+ */
+static void kick_timer(struct hpet_hld_data *hdata, bool force)
+{
+	u64 new_compare, count, period = 0;
+
+	/* Kick the timer only when needed. */
+	if (!force && hdata->has_periodic)
+		return;
+
+	/*
+	 * Update the comparator in increments of watchdog_thresh seconds
+	 * relative to the current count. Since watchdog_thresh is given in
+	 * seconds, we can safely update the comparator before the counter
+	 * reaches such new value.
+	 *
+	 * Let it wrap around if needed.
+	 */
+
+	count = hpet_readl(HPET_COUNTER);
+	new_compare = count + watchdog_thresh * hdata->ticks_per_second;
+
+	if (!hdata->has_periodic) {
+		hpet_writel(new_compare, HPET_Tn_CMP(hdata->channel));
+		return;
+	}
+
+	period = watchdog_thresh * hdata->ticks_per_second;
+	hpet_set_comparator_periodic(hdata->channel, (u32)new_compare,
+				     (u32)period);
+}
+
+static void disable_timer(struct hpet_hld_data *hdata)
+{
+	u32 v;
+
+	v = hpet_readl(HPET_Tn_CFG(hdata->channel));
+	v &= ~HPET_TN_ENABLE;
+
+	/*
+	 * Prepare to flush out any outstanding interrupt. This can only be
+	 * done in level-triggered mode.
+	 */
+	v |= HPET_TN_LEVEL;
+	hpet_writel(v, HPET_Tn_CFG(hdata->channel));
+
+	/*
+	 * Even though we use the HPET channel in edge-triggered mode, hardware
+	 * seems to keep an outstanding interrupt and posts an MSI message when
+	 * making any change to it (e.g., enabling or setting to FSB mode).
+	 * Flush out the interrupt status bit of our channel.
+	 */
+	hpet_writel(1 << hdata->channel, HPET_STATUS);
+}
+
+static void enable_timer(struct hpet_hld_data *hdata)
+{
+	u32 v;
+
+	v = hpet_readl(HPET_Tn_CFG(hdata->channel));
+	v &= ~HPET_TN_LEVEL;
+	v |= HPET_TN_ENABLE;
+	hpet_writel(v, HPET_Tn_CFG(hdata->channel));
+}
+
+/**
+ * is_hpet_hld_interrupt() - Check if the HPET channel caused the interrupt
+ * @hdata:	A data structure describing the HPET channel
+ *
+ * Returns:
+ * True if the HPET watchdog timer caused the interrupt. False otherwise.
+ */
+static bool is_hpet_hld_interrupt(struct hpet_hld_data *hdata)
+{
+	return false;
+}
+
+/**
+ * hardlockup_detector_nmi_handler() - NMI Interrupt handler
+ * @type:	Type of NMI handler; not used.
+ * @regs:	Register values as seen when the NMI was asserted
+ *
+ * Check if our HPET channel caused the NMI. If yes, inspect for lockups by
+ * issuing an IPI to the rest of the CPUs. Also, kick the timer if it is
+ * non-periodic.
+ *
+ * Returns:
+ * NMI_DONE if the HPET timer did not cause the interrupt. NMI_HANDLED
+ * otherwise.
+ */
+static int hardlockup_detector_nmi_handler(unsigned int type,
+					   struct pt_regs *regs)
+{
+	struct hpet_hld_data *hdata = hld_data;
+	int cpu;
+
+	/*
+	 * The CPU handling the HPET NMI will land here and trigger the
+	 * inspection of hardlockups in the rest of the monitored
+	 * CPUs.
+	 */
+	if (is_hpet_hld_interrupt(hdata)) {
+		/*
+		 * Kick the timer first. If the HPET channel is periodic, it
+		 * helps to reduce the delta between the expected TSC value and
+		 * its actual value the next time the HPET channel fires.
+		 */
+		kick_timer(hdata, !(hdata->has_periodic));
+
+		if (cpumask_weight(hld_data->monitored_cpumask) > 1) {
+			/*
+			 * Since we cannot know the source of an NMI, the best
+			 * we can do is to use a flag to indicate to all online
+			 * CPUs that they will get an NMI and that the source of
+			 * that NMI is the hardlockup detector. Offline CPUs
+			 * also receive the NMI but they ignore it.
+			 */
+			cpumask_copy(hld_data->inspect_cpumask,
+				     cpu_online_mask);
+
+			/* If we are here, IPI shorthands are enabled. */
+			apic->send_IPI_allbutself(NMI_VECTOR);
+		}
+
+		inspect_for_hardlockups(regs);
+		return NMI_HANDLED;
+	}
+
+	/* The rest of the CPUs will land here after receiving the IPI. */
+	cpu = smp_processor_id();
+	if (cpumask_test_and_clear_cpu(cpu, hld_data->inspect_cpumask)) {
+		if (cpumask_test_cpu(cpu, hld_data->monitored_cpumask))
+			inspect_for_hardlockups(regs);
+
+		return NMI_HANDLED;
+	}
+
+	return NMI_DONE;
+}
+
+/**
+ * setup_hpet_irq() - Install the interrupt handler of the detector
+ * @hdata:	Data associated with the instance of the HPET channel
+ *
+ * Returns:
+ * 0 on success. An error code if setup was unsuccessful.
+ */
+static int setup_hpet_irq(struct hpet_hld_data *hdata)
+{
+	int ret;
+
+	/*
+	 * hld_data::irq was configured to deliver the interrupt as
+	 * NMI. Thus, there is no need for a regular interrupt handler.
+	 */
+	ret = request_irq(hld_data->irq, no_action, IRQF_TIMER,
+			  "hpet_hld", hld_data);
+	if (ret)
+		return ret;
+
+	ret = register_nmi_handler(NMI_LOCAL,
+				   hardlockup_detector_nmi_handler, 0,
+				   "hpet_hld");
+	if (ret)
+		free_irq(hld_data->irq, hld_data);
+
+	return ret;
+}
+
+/**
+ * hardlockup_detector_hpet_enable() - Enable the hardlockup detector
+ * @cpu:	CPU Index in which the watchdog will be enabled.
+ *
+ * Enable the hardlockup detector in @cpu. Also, start the detector if not done
+ * before.
+ */
+void hardlockup_detector_hpet_enable(unsigned int cpu)
+{
+	cpumask_set_cpu(cpu, hld_data->monitored_cpumask);
+
+	/*
+	 * If this is the first CPU on which the detector is enabled, designate
+	 * @cpu as the handling CPU and start everything. The HPET channel is
+	 * disabled at this point.
+	 */
+	if (cpumask_weight(hld_data->monitored_cpumask) == 1) {
+		hld_data->handling_cpu = cpu;
+
+		if (irq_set_affinity(hld_data->irq,
+				     cpumask_of(hld_data->handling_cpu))) {
+			pr_warn_once("Failed to set affinity. Hardlockup detector not started\n");
+			return;
+		}
+
+		kick_timer(hld_data, true);
+		enable_timer(hld_data);
+	}
+}
+
+/**
+ * hardlockup_detector_hpet_disable() - Disable the hardlockup detector
+ * @cpu:	CPU index in which the watchdog will be disabled
+ *
+ * Disable the hardlockup detector in @cpu. If @cpu is also handling the NMI
+ * from the HPET channel, update the affinity of the interrupt.
+ */
+void hardlockup_detector_hpet_disable(unsigned int cpu)
+{
+	cpumask_clear_cpu(cpu, hld_data->monitored_cpumask);
+
+	if (hld_data->handling_cpu != cpu)
+		return;
+
+	disable_timer(hld_data);
+	if (!cpumask_weight(hld_data->monitored_cpumask))
+		return;
+
+	/*
+	 * If watchdog_thresh is zero, then the hardlockup detector is being
+	 * disabled.
+	 */
+	if (!watchdog_thresh)
+		return;
+
+	hld_data->handling_cpu = cpumask_any_but(hld_data->monitored_cpumask,
+						 cpu);
+	/*
+	 * Only update the affinity of the HPET channel interrupt when
+	 * disabled.
+	 */
+	if (irq_set_affinity(hld_data->irq,
+			     cpumask_of(hld_data->handling_cpu))) {
+		pr_warn_once("Failed to set affinity. Hardlockup detector stopped\n");
+		return;
+	}
+
+	enable_timer(hld_data);
+}
+
+void hardlockup_detector_hpet_stop(void)
+{
+	disable_timer(hld_data);
+}
+
+void hardlockup_detector_hpet_start(void)
+{
+	kick_timer(hld_data, true);
+	enable_timer(hld_data);
+}
+
+static const char hpet_hld_init_failed[] = "Initialization failed:";
+
+/**
+ * hardlockup_detector_hpet_init() - Initialize the hardlockup detector
+ *
+ * Only initialize and configure the detector if an HPET is available on the
+ * system, the TSC is stable, IPI shorthands are enabled, and there are no
+ * isolated CPUs.
+ *
+ * Returns:
+ * 0 on success. An error code if initialization was unsuccessful.
+ */
+int __init hardlockup_detector_hpet_init(void)
+{
+	int ret;
+
+	if (!is_hpet_enabled()) {
+		pr_info("%s HPET unavailable\n", hpet_hld_init_failed);
+		return -ENODEV;
+	}
+
+	if (tick_nohz_full_enabled()) {
+		pr_info("%s nohz_full in use\n", hpet_hld_init_failed);
+		return -EPERM;
+	}
+
+	if (!static_branch_likely(&apic_use_ipi_shorthand)) {
+		pr_info("%s APIC IPI shorthands disabled\n", hpet_hld_init_failed);
+		return -ENODEV;
+	}
+
+	if (check_tsc_unstable())
+		return -ENODEV;
+
+	hld_data = hpet_hld_get_timer();
+	if (!hld_data)
+		return -ENODEV;
+
+	disable_timer(hld_data);
+
+	setup_hpet_channel(hld_data);
+
+	ret = setup_hpet_irq(hld_data);
+	if (ret)
+		goto err_no_irq;
+
+	if (!zalloc_cpumask_var(&hld_data->monitored_cpumask, GFP_KERNEL))
+		goto err_no_monitored_cpumask;
+
+	if (!zalloc_cpumask_var(&hld_data->inspect_cpumask, GFP_KERNEL))
+		goto err_no_inspect_cpumask;
+
+	return 0;
+
+err_no_inspect_cpumask:
+	free_cpumask_var(hld_data->monitored_cpumask);
+err_no_monitored_cpumask:
+	ret = -ENOMEM;
+err_no_irq:
+	hpet_hld_free_timer(hld_data);
+	hld_data = NULL;
+
+	return ret;
+}
-- 
2.25.1



* [PATCH v7 19/24] x86/watchdog/hardlockup: Add an HPET-based hardlockup detector
@ 2023-03-01 23:47   ` Ricardo Neri
  0 siblings, 0 replies; 52+ messages in thread
From: Ricardo Neri @ 2023-03-01 23:47 UTC (permalink / raw)
  To: Tony Luck, Dave Hansen, Rafael J. Wysocki, Reinette Chatre,
	Dan Williams, Len Brown
  Cc: Ravi V. Shankar, Andi Kleen, Ricardo Neri, Ricardo Neri,
	Stephane Eranian, linux-kernel, iommu, linuxppc-dev

Implement a hardlockup detector that uses an HPET channel as the source
of the non-maskable interrupt. Implement the basic functionality to
start, stop, and configure the timer.

Designate as the handling CPU one of the CPUs that the detector monitors.
Use it to service the NMI from the HPET channel. When servicing the HPET
NMI, issue an inter-processor interrupt to the rest of the monitored CPUs
to look for hardlockups. Only enable the detector if IPI shorthands are
enabled in the system.

During operation, the HPET registers are only accessed to kick the timer.
This operation can be avoided if the detector gets a periodic HPET
channel.

Since we use IPI shorthands, all CPUs get the IPI NMI. This would disturb
the isolated CPUs specified in the nohz_full command-line parameter. In
such a case, do not enable this hardlockup detector implementation.

The detector is not functional at this stage. A subsequent changeset will
invoke the interfaces implemented in this changeset to operate the
detector. Another subsequent changeset implements logic to determine if
the HPET timer caused the NMI. For now, implement a stub function.

Cc: Andi Kleen <ak@linux.intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-dev@lists.ozlabs.org
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v6:
 * Added missing header asm/nmi.h. linux/nmi.h #includes it only if
   CONFIG_HAVE_NMI_WATCHDOG is selected. That option is not selected for
   ARCH=x86.
 * Removed IRQF_NOBALANCING from request_irq(). The NMI vector already
   is configured to prevent balancing.
 * Added check to not use the detector when the nohz_full= command-line
   parameter is present.
 * Since I dropped the patch that added a new NMI_WATCHDOG list of
   handlers, I added my handler to the NMI_LOCAL list. This makes sense
   since the watchdog NMIs are now delivered via IPIs. The only exception
   is the HPET NMI, but that interrupt can only be handled by the
   monitoring CPU.

Changes since v5:
 * Squashed a previously separate patch to support interrupt remapping into
   this patch. There is no need to handle interrupt remapping separately.
   All the necessary plumbing is done in the interrupt subsystem. Now it
   uses request_irq().
 * Use IPI shorthands to send an NMI to the CPUs being monitored. (Thomas)
 * Added extra check to only use the HPET hardlockup detector if the IPI
   shorthands are enabled. (Thomas)
 * Relocated flushing of outstanding interrupts from enable_timer() to
   disable_timer(). On some systems, making any change in the
   configuration of the HPET channel causes it to issue an interrupt.
 * Added a new cpumask to function as a per-cpu test bit to determine if
   a CPU should check for hardlockups.
 * Dropped pointless X86_64 || X86_32 check in Kconfig. (Tony)
 * Dropped pointless dependency on CONFIG_HPET.
 * Added dependency on CONFIG_GENERIC_MSI_IRQ, needed to build the [|IR]-
   HPET-MSI irq_chip.
 * Added hardlockup_detector_hpet_start() to be used when tsc_khz is
   recalibrated.
 * Reworked the periodic setting of the HPET channel. Rather than changing
   it every time the channel is disabled or enabled, do it only once. While
   at it, wrap the code in an initial setup function.
 * Implemented hardlockup_detector_hpet_start() to be called when tsc_khz
   is refined.
 * Enhanced inline comments for clarity.
 * Added missing #include files.
 * Relocated function declarations to not depend on CONFIG_HPET_TIMER.

Changes since v4:
 * Dropped hpet_hld_data.enabled_cpus and instead use cpumask_weight().
 * Renamed hpet_hld_data.cpu_monitored_mask to
   hld_data_data.cpu_monitored_mask and converted it to cpumask_var_t.
 * Flushed out any outstanding interrupt before enabling the HPET channel.
 * Removed unnecessary MSI_DATA_LEVEL_ASSERT from the MSI message.
 * Added comments in hardlockup_detector_nmi_handler() to explain how
   CPUs are targeted for an IPI.
 * Updated code to only issue an IPI when needed (i.e., there are monitored
   CPUs to be inspected via an IPI).
 * Reworked hardlockup_detector_hpet_init() for readability.
 * Now reserve the cpumasks in the hardlockup detector code and not in the
   generic HPET code.
 * Handled the case of watchdog_thresh = 0 when disabling the detector.
 * Made this detector available to i386.
 * Reworked logic to kick the timer to remove a local variable. (Andi)
 * Added a comment on what type of timer channel will be assigned to the
   detector. (Andi)
 * Reworded prompt comment in Kconfig. (Andi)
 * Removed unneeded switch to level interrupt mode when disabling the
   timer. (Andi)
 * Disabled the HPET timer to avoid a race between an incoming interrupt
   and an update of the MSI destination ID. (Ashok)
 * Corrected a typo in an inline comment. (Tony)
 * Made the HPET hardlockup detector depend on HARDLOCKUP_DETECTOR instead
   of selecting it.

Changes since v3:
 * Fixed typo in Kconfig.debug. (Randy Dunlap)
 * Added missing slab.h to include the definition of kfree to fix a build
   break.

Changes since v2:
 * Removed use of struct cpumask in favor of a variable length array in
   conjunction with kzalloc. (Peter Zijlstra)
 * Removed redundant documentation of functions. (Thomas Gleixner)
 * Added CPU as argument to hardlockup_detector_hpet_enable()/disable().
   (Thomas Gleixner)

Changes since v1:
 * Do not target CPUs in a round-robin manner. Instead, the HPET timer
   always targets the same CPU; other CPUs are monitored via an
   interprocessor interrupt.
 * Dropped support for IO APIC interrupts and instead use only MSI
   interrupts.
 * Removed use of generic irq code to set interrupt affinity and NMI
   delivery. Instead, configure the interrupt directly in HPET registers.
   (Thomas Gleixner)
 * Fixed unconditional return NMI_HANDLED when the HPET timer is
   programmed for FSB/MSI delivery. (Peter Zijlstra)
---
 arch/x86/Kconfig.debug              |   8 +
 arch/x86/include/asm/hpet.h         |  21 ++
 arch/x86/kernel/Makefile            |   1 +
 arch/x86/kernel/watchdog_hld_hpet.c | 380 ++++++++++++++++++++++++++++
 4 files changed, 410 insertions(+)
 create mode 100644 arch/x86/kernel/watchdog_hld_hpet.c

diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index bdfe08f1a930..b4dced142116 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -110,6 +110,14 @@ config IOMMU_LEAK
 config HAVE_MMIOTRACE_SUPPORT
 	def_bool y
 
+config X86_HARDLOCKUP_DETECTOR_HPET
+	bool "HPET-based hardlockup detector"
+	select HARDLOCKUP_DETECTOR_CORE
+	depends on HARDLOCKUP_DETECTOR && HPET_TIMER && GENERIC_MSI_IRQ
+	help
+	  Say y to drive the hardlockup detector using the High-Precision Event
+	  Timer instead of performance counters.
+
 config X86_DECODER_SELFTEST
 	bool "x86 instruction decoder selftest"
 	depends on DEBUG_KERNEL && INSTRUCTION_DECODER
diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index 5762bd0169a1..c88901744848 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -105,6 +105,8 @@ static inline int is_hpet_enabled(void) { return 0; }
 #endif
 
 #ifdef CONFIG_X86_HARDLOCKUP_DETECTOR_HPET
+#include <linux/cpumask.h>
+
 /**
  * struct hpet_hld_data - Data needed to operate the detector
  * @has_periodic:		The HPET channel supports periodic mode
@@ -112,6 +114,10 @@ static inline int is_hpet_enabled(void) { return 0; }
  * @channe_priv:		Private data of the assigned channel
  * @ticks_per_second:		Frequency of the HPET timer
  * @irq:			IRQ number assigned to the HPET channel
+ * @handling_cpu:		CPU handling the HPET interrupt
+ * @monitored_cpumask:		CPUs monitored by the hardlockup detector
+ * @inspect_cpumask:		CPUs that will be inspected at a given time.
+ *				Each CPU clears itself upon inspection.
  */
 struct hpet_hld_data {
 	bool			has_periodic;
@@ -119,10 +125,25 @@ struct hpet_hld_data {
 	struct hpet_channel	*channel_priv;
 	u64			ticks_per_second;
 	int			irq;
+	u32			handling_cpu;
+	cpumask_var_t		monitored_cpumask;
+	cpumask_var_t		inspect_cpumask;
 };
 
 extern struct hpet_hld_data *hpet_hld_get_timer(void);
 extern void hpet_hld_free_timer(struct hpet_hld_data *hdata);
+int hardlockup_detector_hpet_init(void);
+void hardlockup_detector_hpet_start(void);
+void hardlockup_detector_hpet_stop(void);
+void hardlockup_detector_hpet_enable(unsigned int cpu);
+void hardlockup_detector_hpet_disable(unsigned int cpu);
+#else
+static inline int hardlockup_detector_hpet_init(void)
+{ return -ENODEV; }
+static inline void hardlockup_detector_hpet_start(void) {}
+static inline void hardlockup_detector_hpet_stop(void) {}
+static inline void hardlockup_detector_hpet_enable(unsigned int cpu) {}
+static inline void hardlockup_detector_hpet_disable(unsigned int cpu) {}
 #endif /* CONFIG_X86_HARDLOCKUP_DETECTOR_HPET */
 
 #endif /* _ASM_X86_HPET_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index dd61752f4c96..58eb858f33ff 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -109,6 +109,7 @@ obj-$(CONFIG_VM86)		+= vm86_32.o
 obj-$(CONFIG_EARLY_PRINTK)	+= early_printk.o
 
 obj-$(CONFIG_HPET_TIMER) 	+= hpet.o
+obj-$(CONFIG_X86_HARDLOCKUP_DETECTOR_HPET) += watchdog_hld_hpet.o
 
 obj-$(CONFIG_AMD_NB)		+= amd_nb.o
 obj-$(CONFIG_DEBUG_NMI_SELFTEST) += nmi_selftest.o
diff --git a/arch/x86/kernel/watchdog_hld_hpet.c b/arch/x86/kernel/watchdog_hld_hpet.c
new file mode 100644
index 000000000000..b583d3180ae0
--- /dev/null
+++ b/arch/x86/kernel/watchdog_hld_hpet.c
@@ -0,0 +1,380 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * A hardlockup detector driven by an HPET channel.
+ *
+ * Copyright (C) Intel Corporation 2023
+ *
+ * An HPET channel is reserved for the detector. The channel issues an NMI to
+ * one of the CPUs in @watchdog_allowed_mask. This CPU monitors itself for
+ * hardlockups and sends an NMI IPI to the rest of the CPUs in the system.
+ *
+ * The detector uses IPI shorthands. Thus, all CPUs in the system get the NMI
+ * (offline CPUs also get the NMI but they "ignore" it). A cpumask is used to
+ * specify whether a CPU must check for hardlockups.
+ *
+ * The NMI also disturbs isolated CPUs. The detector fails to initialize if
+ * tick_nohz_full is enabled.
+ */
+
+#define pr_fmt(fmt) "NMI hpet watchdog: " fmt
+
+#include <linux/cpumask.h>
+#include <linux/interrupt.h>
+#include <linux/jump_label.h>
+#include <linux/nmi.h>
+#include <linux/printk.h>
+#include <linux/slab.h>
+#include <linux/tick.h>
+
+#include <asm/apic.h>
+#include <asm/hpet.h>
+#include <asm/nmi.h>
+#include <asm/tsc.h>
+
+#include "apic/local.h"
+
+static struct hpet_hld_data *hld_data;
+
+static void __init setup_hpet_channel(struct hpet_hld_data *hdata)
+{
+	u32 v;
+
+	v = hpet_readl(HPET_Tn_CFG(hdata->channel));
+	if (hdata->has_periodic)
+		v |= HPET_TN_PERIODIC;
+	else
+		v &= ~HPET_TN_PERIODIC;
+
+	/*
+	 * Use 32-bit mode to limit the number of register accesses. If we are
+	 * here the HPET frequency is sufficiently low to accommodate this mode.
+	 */
+	v |= HPET_TN_32BIT;
+
+	/* If we are here, FSB mode is supported. */
+	v |= HPET_TN_FSB;
+
+	hpet_writel(v, HPET_Tn_CFG(hdata->channel));
+}
+
+/**
+ * kick_timer() - Reprogram timer to expire in the future
+ * @hdata:	A data structure describing the HPET channel
+ * @force:	Force reprogramming
+ *
+ * Reprogram the timer to expire in watchdog_thresh seconds in the future.
+ * If the timer supports periodic mode, it is not kicked unless @force is
+ * true.
+ */
+static void kick_timer(struct hpet_hld_data *hdata, bool force)
+{
+	u64 new_compare, count, period = 0;
+
+	/* Kick the timer only when needed. */
+	if (!force && hdata->has_periodic)
+		return;
+
+	/*
+	 * Update the comparator in increments of watchdog_thresh seconds
+	 * relative to the current count. Since watchdog_thresh is given in
+	 * seconds, we can safely update the comparator before the counter
+	 * reaches such new value.
+	 *
+	 * Let it wrap around if needed.
+	 */
+
+	count = hpet_readl(HPET_COUNTER);
+	new_compare = count + watchdog_thresh * hdata->ticks_per_second;
+
+	if (!hdata->has_periodic) {
+		hpet_writel(new_compare, HPET_Tn_CMP(hdata->channel));
+		return;
+	}
+
+	period = watchdog_thresh * hdata->ticks_per_second;
+	hpet_set_comparator_periodic(hdata->channel, (u32)new_compare,
+				     (u32)period);
+}
+
+static void disable_timer(struct hpet_hld_data *hdata)
+{
+	u32 v;
+
+	v = hpet_readl(HPET_Tn_CFG(hdata->channel));
+	v &= ~HPET_TN_ENABLE;
+
+	/*
+	 * Prepare to flush out any outstanding interrupt. This can only be
+	 * done in level-triggered mode.
+	 */
+	v |= HPET_TN_LEVEL;
+	hpet_writel(v, HPET_Tn_CFG(hdata->channel));
+
+	/*
+	 * Even though we use the HPET channel in edge-triggered mode, hardware
+	 * seems to keep an outstanding interrupt and posts an MSI message when
+	 * making any change to it (e.g., enabling or setting to FSB mode).
+	 * Flush out the interrupt status bit of our channel.
+	 */
+	hpet_writel(1 << hdata->channel, HPET_STATUS);
+}
+
+static void enable_timer(struct hpet_hld_data *hdata)
+{
+	u32 v;
+
+	v = hpet_readl(HPET_Tn_CFG(hdata->channel));
+	v &= ~HPET_TN_LEVEL;
+	v |= HPET_TN_ENABLE;
+	hpet_writel(v, HPET_Tn_CFG(hdata->channel));
+}
+
+/**
+ * is_hpet_hld_interrupt() - Check if the HPET channel caused the interrupt
+ * @hdata:	A data structure describing the HPET channel
+ *
+ * Returns:
+ * True if the HPET watchdog timer caused the interrupt. False otherwise.
+ */
+static bool is_hpet_hld_interrupt(struct hpet_hld_data *hdata)
+{
+	return false;
+}
+
+/**
+ * hardlockup_detector_nmi_handler() - NMI Interrupt handler
+ * @type:	Type of NMI handler; not used.
+ * @regs:	Register values as seen when the NMI was asserted
+ *
+ * Check if our HPET channel caused the NMI. If yes, inspect for lockups by
+ * issuing an IPI to the rest of the CPUs. Also, kick the timer if it is
+ * non-periodic.
+ *
+ * Returns:
+ * NMI_DONE if the HPET timer did not cause the interrupt. NMI_HANDLED
+ * otherwise.
+ */
+static int hardlockup_detector_nmi_handler(unsigned int type,
+					   struct pt_regs *regs)
+{
+	struct hpet_hld_data *hdata = hld_data;
+	int cpu;
+
+	/*
+	 * The CPU handling the HPET NMI will land here and trigger the
+	 * inspection of hardlockups in the rest of the monitored
+	 * CPUs.
+	 */
+	if (is_hpet_hld_interrupt(hdata)) {
+		/*
+		 * Kick the timer first. If the HPET channel is periodic, it
+		 * helps to reduce the delta between the expected TSC value and
+		 * its actual value the next time the HPET channel fires.
+		 */
+		kick_timer(hdata, !(hdata->has_periodic));
+
+		if (cpumask_weight(hld_data->monitored_cpumask) > 1) {
+			/*
+			 * Since we cannot know the source of an NMI, the best
+			 * we can do is to use a flag to indicate to all online
+			 * CPUs that they will get an NMI and that the source of
+			 * that NMI is the hardlockup detector. Offline CPUs
+			 * also receive the NMI but they ignore it.
+			 */
+			cpumask_copy(hld_data->inspect_cpumask,
+				     cpu_online_mask);
+
+			/* If we are here, IPI shorthands are enabled. */
+			apic->send_IPI_allbutself(NMI_VECTOR);
+		}
+
+		inspect_for_hardlockups(regs);
+		return NMI_HANDLED;
+	}
+
+	/* The rest of the CPUs will land here after receiving the IPI. */
+	cpu = smp_processor_id();
+	if (cpumask_test_and_clear_cpu(cpu, hld_data->inspect_cpumask)) {
+		if (cpumask_test_cpu(cpu, hld_data->monitored_cpumask))
+			inspect_for_hardlockups(regs);
+
+		return NMI_HANDLED;
+	}
+
+	return NMI_DONE;
+}
+
+/**
+ * setup_hpet_irq() - Install the interrupt handler of the detector
+ * @hdata:	Data associated with the instance of the HPET channel
+ *
+ * Returns:
+ * 0 on success. An error code if setup was unsuccessful.
+ */
+static int setup_hpet_irq(struct hpet_hld_data *hdata)
+{
+	int ret;
+
+	/*
+	 * hld_data::irq was configured to deliver the interrupt as
+	 * NMI. Thus, there is no need for a regular interrupt handler.
+	 */
+	ret = request_irq(hld_data->irq, no_action, IRQF_TIMER,
+			  "hpet_hld", hld_data);
+	if (ret)
+		return ret;
+
+	ret = register_nmi_handler(NMI_LOCAL,
+				   hardlockup_detector_nmi_handler, 0,
+				   "hpet_hld");
+	if (ret)
+		free_irq(hld_data->irq, hld_data);
+
+	return ret;
+}
+
+/**
+ * hardlockup_detector_hpet_enable() - Enable the hardlockup detector
+ * @cpu:	CPU Index in which the watchdog will be enabled.
+ *
+ * Enable the hardlockup detector in @cpu. Also, start the detector if not done
+ * before.
+ */
+void hardlockup_detector_hpet_enable(unsigned int cpu)
+{
+	cpumask_set_cpu(cpu, hld_data->monitored_cpumask);
+
+	/*
+	 * If this is the first CPU on which the detector is enabled, designate
+	 * @cpu as the handling CPU and start everything. The HPET channel is
+	 * disabled at this point.
+	 */
+	if (cpumask_weight(hld_data->monitored_cpumask) == 1) {
+		hld_data->handling_cpu = cpu;
+
+		if (irq_set_affinity(hld_data->irq,
+				     cpumask_of(hld_data->handling_cpu))) {
+			pr_warn_once("Failed to set affinity. Hardlockup detector not started\n");
+			return;
+		}
+
+		kick_timer(hld_data, true);
+		enable_timer(hld_data);
+	}
+}
+
+/**
+ * hardlockup_detector_hpet_disable() - Disable the hardlockup detector
+ * @cpu:	CPU index in which the watchdog will be disabled
+ *
+ * Disable the hardlockup detector in @cpu. If @cpu is also handling the NMI
+ * from the HPET channel, update the affinity of the interrupt.
+ */
+void hardlockup_detector_hpet_disable(unsigned int cpu)
+{
+	cpumask_clear_cpu(cpu, hld_data->monitored_cpumask);
+
+	if (hld_data->handling_cpu != cpu)
+		return;
+
+	disable_timer(hld_data);
+	if (!cpumask_weight(hld_data->monitored_cpumask))
+		return;
+
+	/*
+	 * If watchdog_thresh is zero, then the hardlockup detector is being
+	 * disabled.
+	 */
+	if (!watchdog_thresh)
+		return;
+
+	hld_data->handling_cpu = cpumask_any_but(hld_data->monitored_cpumask,
+						 cpu);
+	/*
+	 * Only update the affinity of the HPET channel interrupt when
+	 * disabled.
+	 */
+	if (irq_set_affinity(hld_data->irq,
+			     cpumask_of(hld_data->handling_cpu))) {
+		pr_warn_once("Failed to set affinity. Hardlockup detector stopped\n");
+		return;
+	}
+
+	enable_timer(hld_data);
+}
+
+void hardlockup_detector_hpet_stop(void)
+{
+	disable_timer(hld_data);
+}
+
+void hardlockup_detector_hpet_start(void)
+{
+	kick_timer(hld_data, true);
+	enable_timer(hld_data);
+}
+
+static const char hpet_hld_init_failed[] = "Initialization failed:";
+
+/**
+ * hardlockup_detector_hpet_init() - Initialize the hardlockup detector
+ *
+ * Only initialize and configure the detector if an HPET is available on the
+ * system, the TSC is stable, IPI shorthands are enabled, and there are no
+ * isolated CPUs.
+ *
+ * Returns:
+ * 0 on success. An error code if initialization was unsuccessful.
+ */
+int __init hardlockup_detector_hpet_init(void)
+{
+	int ret;
+
+	if (!is_hpet_enabled()) {
+		pr_info("%s HPET unavailable\n", hpet_hld_init_failed);
+		return -ENODEV;
+	}
+
+	if (tick_nohz_full_enabled()) {
+		pr_info("%s nohz_full in use\n", hpet_hld_init_failed);
+		return -EPERM;
+	}
+
+	if (!static_branch_likely(&apic_use_ipi_shorthand)) {
+		pr_info("%s APIC IPI shorthands disabled\n", hpet_hld_init_failed);
+		return -ENODEV;
+	}
+
+	if (check_tsc_unstable())
+		return -ENODEV;
+
+	hld_data = hpet_hld_get_timer();
+	if (!hld_data)
+		return -ENODEV;
+
+	disable_timer(hld_data);
+
+	setup_hpet_channel(hld_data);
+
+	ret = setup_hpet_irq(hld_data);
+	if (ret)
+		goto err_no_irq;
+
+	if (!zalloc_cpumask_var(&hld_data->monitored_cpumask, GFP_KERNEL))
+		goto err_no_monitored_cpumask;
+
+	if (!zalloc_cpumask_var(&hld_data->inspect_cpumask, GFP_KERNEL))
+		goto err_no_inspect_cpumask;
+
+	return 0;
+
+err_no_inspect_cpumask:
+	free_cpumask_var(hld_data->monitored_cpumask);
+err_no_monitored_cpumask:
+	ret = -ENOMEM;
+err_no_irq:
+	hpet_hld_free_timer(hld_data);
+	hld_data = NULL;
+
+	return ret;
+}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v7 20/24] x86/watchdog/hardlockup/hpet: Determine if HPET timer caused NMI
  2023-03-01 23:47 ` Ricardo Neri
@ 2023-03-01 23:47   ` Ricardo Neri
  -1 siblings, 0 replies; 52+ messages in thread
From: Ricardo Neri @ 2023-03-01 23:47 UTC (permalink / raw)
  To: Tony Luck, Dave Hansen, Rafael J. Wysocki, Reinette Chatre,
	Dan Williams, Len Brown
  Cc: Andi Kleen, Stephane Eranian, Ravi V. Shankar, Ricardo Neri,
	linuxppc-dev, iommu, linux-kernel, Ricardo Neri

It is not possible to determine the source of a non-maskable interrupt
(NMI) in x86. When dealing with an HPET channel, the only direct method to
determine whether it caused an NMI would be to read the Interrupt Status
register.

Reading HPET registers is slow and, therefore, not to be done while in NMI
context. Also, the interrupt status bit is not available if the HPET
channel is programmed to deliver an MSI interrupt.

An indirect way to infer whether the HPET channel is the source of an NMI is
to use the time-stamp counter (TSC). Compute the value that the TSC is
expected to have at the next interrupt of the HPET channel and compare it
with the value it has when the interrupt does happen. Let this error be
tsc_next_error. If tsc_next_error is less than a certain value, assume that
the HPET channel of the detector is the source of the NMI.
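
As an illustration only, the heuristic above can be sketched in user-space C.
This is not the code of the patch: struct tsc_window, tsc_window_arm() and
tsc_window_hit() are hypothetical names, and the TSC value is passed in as a
plain integer instead of being read with rdtsc(). The tolerance of
tsc_delta / 256 (about 0.4%) is the one the patch uses.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the detector state; not the kernel structure. */
struct tsc_window {
	uint64_t tsc_next;		/* TSC value expected at the next HPET NMI */
	uint64_t tsc_next_error;	/* tolerated deviation, in TSC counts */
};

/* Arm the window. Call this whenever the timer is (re)programmed. */
static void tsc_window_arm(struct tsc_window *w, uint64_t tsc_now,
			   uint64_t tsc_khz, uint64_t thresh_secs)
{
	/* TSC counts between two consecutive expirations of the timer. */
	uint64_t tsc_delta = thresh_secs * tsc_khz * 1000;

	w->tsc_next = tsc_now + tsc_delta;
	w->tsc_next_error = tsc_delta >> 8;	/* 1/256, about 0.4% */
}

/* True if an NMI observed at tsc_now falls inside the armed window. */
static bool tsc_window_hit(const struct tsc_window *w, uint64_t tsc_now)
{
	/* Unsigned distance between observed and expected values. */
	uint64_t dist = tsc_now - w->tsc_next;

	/* A "huge" distance means tsc_now is before tsc_next; negate it. */
	if (dist > UINT64_MAX / 2)
		dist = 0 - dist;

	return dist <= w->tsc_next_error;
}
```

With tsc_khz = 2000000 and a threshold of 10 seconds, tsc_delta is 2 * 10^10
counts and the window extends 78125000 counts on either side of tsc_next. An
NMI that arrives outside of that window is attributed to some other source.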

Below is a table that characterizes tsc_next_error in a collection of
systems. The error is expressed in microseconds as well as a percentage of
tsc_delta: the computed number of TSC counts between two consecutive
interrupts of the HPET channel.

The table summarizes the error of 4096 interrupts of the HPET channel in
two experiments: a) since the system booted and b) ignoring the first 5
minutes after boot.

The maximum observed error in a) is 0.198%. For b) the maximum error is
0.045%.

Allow a maximum tsc_next_error that is twice as big as the maximum error
observed in these experiments: 0.4% of tsc_delta.

watchdog_thresh      1s                  10s                60s
tsc_next_error   %        us         %        us        %        us

AMD EPYC 7742 64-Core Processor
max(abs(a))   0.04517   451.74    0.00171   171.04   0.00034   201.89
max(abs(b))   0.04517   451.74    0.00171   171.04   0.00034   201.89

Intel(R) Xeon(R) CPU E7-8890 - INTEL_FAM6_HASWELL_X
max(abs(a))   0.00811    81.15    0.00462   462.40   0.00014    81.65
max(abs(b))   0.00811    81.15    0.00084    84.31   0.00014    81.65

Intel(R) Xeon(R) Platinum 8170M - INTEL_FAM6_SKYLAKE_X
max(abs(a))   0.10530  1053.04    0.01324  1324.27   0.00407  2443.25
max(abs(b))   0.01166   116.59    0.00114   114.11   0.00024   143.47

Intel(R) Xeon(R) CPU E5-2699A v4 - INTEL_FAM6_BROADWELL_X
max(abs(a))   0.00010    99.34    0.00099    98.83   0.00016    97.50
max(abs(b))   0.00010    99.34    0.00099    98.83   0.00016    97.50

Intel(R) Xeon(R) Gold 5318H - INTEL_FAM6_COOPERLAKE_X
max(abs(a))   0.11262  1126.17    0.01109  1109.17   0.00409  2455.73
max(abs(b))   0.01073   107.31    0.00109   109.02   0.00019   115.34

Intel(R) Xeon(R) Platinum 8360Y - INTEL_FAM6_ICELAKE_X
max(abs(a))   0.19853  1985.30    0.00784   783.53  -0.00017  -104.77
max(abs(b))   0.01550   155.02    0.00158   157.56   0.00020   117.74
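
The two columns of the table are related through watchdog_thresh, and the
chosen tolerance can be compared against them. A small sketch of that
arithmetic follows. The helper names pct_to_us() and window_us() are
hypothetical and do not appear in the patch.

```c
#include <stdint.h>

/*
 * Tolerance in microseconds for a given watchdog_thresh. The patch uses
 * tsc_delta >> 8, i.e. 1/256 of the period; the same ratio is applied
 * here to the period expressed in microseconds.
 */
static uint64_t window_us(uint64_t thresh_secs)
{
	uint64_t period_us = thresh_secs * 1000000ULL;

	return period_us >> 8;
}

/* Convert an error given as a percentage of the period to microseconds. */
static double pct_to_us(double pct, uint64_t thresh_secs)
{
	return pct / 100.0 * (double)thresh_secs * 1e6;
}
```

For watchdog_thresh = 1s the tolerance is about 3906 us, roughly twice the
largest error in the table (0.19853%, or 1985.30 us). For 60s it is 234375 us.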

Cc: Andi Kleen <ak@linux.intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-dev@lists.ozlabs.org
Suggested-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
NOTE: The error characterization data is repeated here from the cover
letter.
---
Changes since v6:
 * Fixed bug when checking the error window. Now check for an error
   which is +/-0.4% of tsc_delta around the actual TSC value, not +/-0.2%.

Changes since v5:
 * Reworked is_hpet_hld_interrupt() to reduce indentation.
 * Use time_in_range64() to compare the actual TSC value vs the expected
   value. This makes it more readable. (Tony)
 * Reduced the error window of the expected TSC value at the time of the
   HPET channel expiration.
 * Described better the heuristics used to determine if the HPET channel
   caused the NMI. (Tony)
 * Added a table to characterize the error in the expected TSC value when
   the HPET channel fires.
 * Removed references to groups of monitored CPUs. Instead, use tsc_khz
   directly.

Changes since v4:
 * Compute the TSC expected value at the next HPET interrupt based on the
   number of monitored packages and not the number of monitored CPUs.

Changes since v3:
 * None

Changes since v2:
 * Reworked condition to check if the expected TSC value is within the
   error margin to avoid an unnecessary conditional. (Peter Zijlstra)
 * Removed TSC error margin from struct hld_data; use a global variable
   instead. (Peter Zijlstra)

Changes since v1:
 * Introduced this patch.
---
 arch/x86/include/asm/hpet.h         |  3 ++
 arch/x86/kernel/watchdog_hld_hpet.c | 58 +++++++++++++++++++++++++++--
 2 files changed, 58 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h
index c88901744848..af0a504b5cff 100644
--- a/arch/x86/include/asm/hpet.h
+++ b/arch/x86/include/asm/hpet.h
@@ -113,6 +113,8 @@ static inline int is_hpet_enabled(void) { return 0; }
  * @channel:			HPET channel assigned to the detector
  * @channe_priv:		Private data of the assigned channel
  * @ticks_per_second:		Frequency of the HPET timer
+ * @tsc_next:			Estimated value of the TSC at the next
+ *				HPET timer interrupt
  * @irq:			IRQ number assigned to the HPET channel
  * @handling_cpu:		CPU handling the HPET interrupt
  * @monitored_cpumask:		CPUs monitored by the hardlockup detector
@@ -124,6 +126,7 @@ struct hpet_hld_data {
 	u32			channel;
 	struct hpet_channel	*channel_priv;
 	u64			ticks_per_second;
+	u64			tsc_next;
 	int			irq;
 	u32			handling_cpu;
 	cpumask_var_t		monitored_cpumask;
diff --git a/arch/x86/kernel/watchdog_hld_hpet.c b/arch/x86/kernel/watchdog_hld_hpet.c
index b583d3180ae0..a03126e02eda 100644
--- a/arch/x86/kernel/watchdog_hld_hpet.c
+++ b/arch/x86/kernel/watchdog_hld_hpet.c
@@ -12,6 +12,11 @@
  * (offline CPUs also get the NMI but they "ignore" it). A cpumask is used to
  * specify whether a CPU must check for hardlockups.
  *
+ * It is not possible to determine the source of an NMI. Instead, we calculate
+ * the value that the TSC counter should have when the next HPET NMI occurs. If
+ * it has the calculated value +/- 0.4%, we conclude that the HPET channel is the
+ * source of the NMI.
+ *
  * The NMI also disturbs isolated CPUs. The detector fails to initialize if
  * tick_nohz_full is enabled.
  */
@@ -34,6 +39,7 @@
 #include "apic/local.h"
 
 static struct hpet_hld_data *hld_data;
+static u64 tsc_next_error;
 
 static void __init setup_hpet_channel(struct hpet_hld_data *hdata)
 {
@@ -65,12 +71,39 @@ static void __init setup_hpet_channel(struct hpet_hld_data *hdata)
  * Reprogram the timer to expire in watchdog_thresh seconds in the future.
  * If the timer supports periodic mode, it is not kicked unless @force is
  * true.
+ *
+ * Also, compute the expected value of the time-stamp counter at the time of
+ * expiration as well as a deviation from the expected value.
  */
 static void kick_timer(struct hpet_hld_data *hdata, bool force)
 {
-	u64 new_compare, count, period = 0;
+	u64 tsc_curr, tsc_delta, new_compare, count, period = 0;
+
+	tsc_curr = rdtsc();
+
+	/*
+	 * Compute the delta between the value of the TSC now and the value
+	 * it will have the next time the HPET channel fires.
+	 */
+	tsc_delta = watchdog_thresh * tsc_khz * 1000L;
+	hdata->tsc_next = tsc_curr + tsc_delta;
+
+	/*
+	 * Define an error window between the expected TSC value and the actual
+	 * value it will have the next time the HPET channel fires. Define this
+	 * error as percentage of tsc_delta.
+	 *
+	 * The systems that have been tested so far exhibit an error of 0.05%
+	 * of the expected TSC value once the system is up and running. Systems
+	 * that refine tsc_khz exhibit a larger initial error up to 0.2%. To be
+	 * safe, allow a maximum error of ~0.4% (i.e., tsc_delta / 256).
+	 */
+	tsc_next_error = tsc_delta >> 8;
 
-	/* Kick the timer only when needed. */
+	/*
+	 * We must always compute the expected TSC value. Kick the timer only
+	 * when needed.
+	 */
 	if (!force && hdata->has_periodic)
 		return;
 
@@ -133,12 +166,31 @@ static void enable_timer(struct hpet_hld_data *hdata)
  * is_hpet_hld_interrupt() - Check if the HPET channel caused the interrupt
  * @hdata:	A data structure describing the HPET channel
  *
+ * Determining the sources of NMIs is not possible. Furthermore, we have
+ * programmed the HPET channel for MSI delivery, which does not have a
+ * status bit. Also, reading HPET registers is slow.
+ *
+ * Instead, we just assume that an NMI delivered within a time window
+ * of when the HPET was expected to fire probably came from the HPET.
+ *
+ * The window is estimated using the TSC counter. Check the comments in
+ * kick_timer() for details on the size of the time window.
+ *
  * Returns:
  * True if the HPET watchdog timer caused the interrupt. False otherwise.
  */
 static bool is_hpet_hld_interrupt(struct hpet_hld_data *hdata)
 {
-	return false;
+	u64 tsc_curr, tsc_curr_min, tsc_curr_max;
+
+	if (smp_processor_id() != hdata->handling_cpu)
+		return false;
+
+	tsc_curr = rdtsc();
+	tsc_curr_min = tsc_curr - tsc_next_error;
+	tsc_curr_max = tsc_curr + tsc_next_error;
+
+	return time_in_range64(hdata->tsc_next, tsc_curr_min, tsc_curr_max);
 }
 
 /**
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v7 21/24] watchdog/hardlockup/hpet: Only enable the HPET watchdog via a boot parameter
  2023-03-01 23:47 ` Ricardo Neri
@ 2023-03-01 23:47   ` Ricardo Neri
  -1 siblings, 0 replies; 52+ messages in thread
From: Ricardo Neri @ 2023-03-01 23:47 UTC (permalink / raw)
  To: Tony Luck, Dave Hansen, Rafael J. Wysocki, Reinette Chatre,
	Dan Williams, Len Brown
  Cc: Andi Kleen, Stephane Eranian, Ravi V. Shankar, Ricardo Neri,
	linuxppc-dev, iommu, linux-kernel, Ricardo Neri

Keep the HPET-based hardlockup detector disabled unless explicitly enabled
via a command-line argument. If the parameter is not given, the
initialization of the HPET-based hardlockup detector fails and the NMI
watchdog falls back to the perf-based implementation.
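
The resulting selection logic can be sketched in user-space C as follows. This
is a simplified model, not the kernel's parser: cmdline_has_flag() and
pick_backend() are hypothetical helpers, the token match is exact and
whitespace-separated, and hpet_usable stands in for the remaining
initialization checks (HPET available, no nohz_full, IPI shorthands enabled,
stable TSC).

```c
#include <stdbool.h>
#include <string.h>

enum hld_backend { HLD_PERF, HLD_HPET };

/* Hypothetical helper: true if the bare token is on the command line. */
static bool cmdline_has_flag(const char *cmdline, const char *flag)
{
	size_t n = strlen(flag);
	const char *p = cmdline;

	while (*p) {
		size_t len;

		p += strspn(p, " ");	/* skip separators */
		len = strcspn(p, " ");	/* length of the current token */
		if (len == n && !strncmp(p, flag, n))
			return true;
		p += len;
	}

	return false;
}

/* Pick the NMI watchdog backend the way the commit message describes. */
static enum hld_backend pick_backend(const char *cmdline, bool hpet_usable)
{
	/* Not requested: HPET init fails and perf is used instead. */
	if (!cmdline_has_flag(cmdline, "hpet_nmi_watchdog"))
		return HLD_PERF;

	/* Requested, but other init checks may still reject the HPET. */
	return hpet_usable ? HLD_HPET : HLD_PERF;
}
```

A command line such as "ro hpet_nmi_watchdog quiet" selects the HPET backend
in this model, provided the other checks pass. Without the parameter, the
perf-based detector is always used.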

Cc: Andi Kleen <ak@linux.intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-dev@lists.ozlabs.org
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v6:
 * Do not reuse the nmi_watchdog command line option. Instead, use a
   separate command line option. (Nicholas Piggin)
 * Document the conflict between `hpet_nmi_watchdog` and `nohz_full`,
   and the dependency on `no_ipi_broadcast`.

Changes since v5:
 * None

Changes since v4:
 * None

Changes since v3:
 * None

Changes since v2:
 * Do not imply that using nmi_watchdog=hpet means the detector is
   enabled. Instead, print a warning in such case.

Changes since v1:
 * Added documentation to the function handling the nmi_watchdog
   kernel command-line argument.
---
 Documentation/admin-guide/kernel-parameters.txt |  8 ++++++++
 arch/x86/kernel/watchdog_hld_hpet.c             | 17 +++++++++++++++++
 2 files changed, 25 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 46268d6baa43..2d1262bb99c7 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1708,6 +1708,14 @@
 	hpet_mmap=	[X86, HPET_MMAP] Allow userspace to mmap HPET
 			registers.  Default set by CONFIG_HPET_MMAP_DEFAULT.
 
+	hpet_nmi_watchdog [X86, KNL]
+			Drive the NMI watchdog with an HPET channel. This option
+			has no effect if the NMI watchdog is not enabled.
+			The HPET NMI watchdog conflicts with the parameters
+			nohz_full, no_ipi_broadcast, and hpet=disable. If any
+			of these parameters is present the NMI watchdog will
+			fall back to the perf-driven implementation.
+
 	hugepages=	[HW] Number of HugeTLB pages to allocate at boot.
 			If this follows hugepagesz (below), it specifies
 			the number of pages of hugepagesz to be allocated.
diff --git a/arch/x86/kernel/watchdog_hld_hpet.c b/arch/x86/kernel/watchdog_hld_hpet.c
index a03126e02eda..0fc728ad6f15 100644
--- a/arch/x86/kernel/watchdog_hld_hpet.c
+++ b/arch/x86/kernel/watchdog_hld_hpet.c
@@ -39,6 +39,7 @@
 #include "apic/local.h"
 
 static struct hpet_hld_data *hld_data;
+static bool hardlockup_use_hpet;
 static u64 tsc_next_error;
 
 static void __init setup_hpet_channel(struct hpet_hld_data *hdata)
@@ -366,6 +367,19 @@ void hardlockup_detector_hpet_start(void)
 	enable_timer(hld_data);
 }
 
+/**
+ * hardlockup_detector_hpet_setup() - Parse command-line parameters
+ * @str:	A string containing the kernel command line
+ *
+ * If selected by the user, enable this hardlockup detector.
+ */
+static int __init hardlockup_detector_hpet_setup(char *str)
+{
+	hardlockup_use_hpet = true;
+	return 1;
+}
+__setup("hpet_nmi_watchdog", hardlockup_detector_hpet_setup);
+
 static const char hpet_hld_init_failed[] = "Initialization failed:";
 
 /**
@@ -382,6 +396,9 @@ int __init hardlockup_detector_hpet_init(void)
 {
 	int ret;
 
+	if (!hardlockup_use_hpet)
+		return -ENODEV;
+
 	if (!is_hpet_enabled()) {
 		pr_info("%s HPET unavailable\n", hpet_hld_init_failed);
 		return -ENODEV;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 52+ messages in thread

-- 
2.25.1
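
For readers who want to see the opt-in behaviour in isolation, here is a
minimal userspace sketch. It is not the kernel code: parse_cmdline() is a
hypothetical stand-in for __setup() that uses exact token matching, and
hpet_init() and pick_detector() are illustrative names. Only the flag
semantics (disabled unless hpet_nmi_watchdog is present, -ENODEV otherwise,
then fall back to perf) are taken from the patch above.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Mirrors hardlockup_use_hpet: false unless the flag is on the command line. */
static bool use_hpet;

/* Hypothetical stand-in for __setup(): look for a bare token in a command line. */
static void parse_cmdline(const char *cmdline)
{
	char buf[256];
	char *tok;

	use_hpet = false;
	snprintf(buf, sizeof(buf), "%s", cmdline);

	for (tok = strtok(buf, " "); tok; tok = strtok(NULL, " ")) {
		if (!strcmp(tok, "hpet_nmi_watchdog"))
			use_hpet = true;
	}
}

/* Models the early return added to hardlockup_detector_hpet_init(). */
static int hpet_init(void)
{
	if (!use_hpet)
		return -ENODEV;

	return 0;
}

/* Returns "hpet" when the user opted in, otherwise falls back to "perf". */
static const char *pick_detector(const char *cmdline)
{
	parse_cmdline(cmdline);

	return hpet_init() == 0 ? "hpet" : "perf";
}
```

Calling pick_detector("quiet hpet_nmi_watchdog") selects the HPET path in
this model, while a command line without the flag selects perf.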


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v7 22/24] x86/watchdog: Add a shim hardlockup detector
  2023-03-01 23:47 ` Ricardo Neri
@ 2023-03-01 23:47   ` Ricardo Neri
  -1 siblings, 0 replies; 52+ messages in thread
From: Ricardo Neri @ 2023-03-01 23:47 UTC (permalink / raw)
  To: Tony Luck, Dave Hansen, Rafael J. Wysocki, Reinette Chatre,
	Dan Williams, Len Brown
  Cc: Andi Kleen, Stephane Eranian, Ravi V. Shankar, Ricardo Neri,
	linuxppc-dev, iommu, linux-kernel, Ricardo Neri

Add a shim hardlockup detector that selects between the perf- and
HPET-driven implementations available for x86.

Override the interfaces of the default hardlockup detector.

Cc: Andi Kleen <ak@linux.intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-dev@lists.ozlabs.org
Suggested-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v6:
 * None

Changes since v5:
 * Added watchdog_nmi_start() to be used when the watchdog is
   reconfigured.
 * Always build the x86-specific hardlockup detector shim; not only
   when the HPET-based detector is selected.
 * Corrected a typo in comment in watchdog_nmi_probe() (Ani)
 * Removed useless local ret variable in watchdog_nmi_enable(). (Ani)

Changes since v4:
 * Use a switch to enable and disable the various available detectors.
   (Andi)

Changes since v3:
 * Fixed style in multi-line comment. (Randy Dunlap)

Changes since v2:
 * Pass cpu number as argument to hardlockup_detector_[enable|disable].
   (Thomas Gleixner)

Changes since v1:
 * Introduced this patch: Added an x86-specific shim hardlockup
   detector. (Nicholas Piggin)
---
 arch/x86/Kconfig.debug         |  3 ++
 arch/x86/kernel/Makefile       |  2 +
 arch/x86/kernel/watchdog_hld.c | 86 ++++++++++++++++++++++++++++++++++
 3 files changed, 91 insertions(+)
 create mode 100644 arch/x86/kernel/watchdog_hld.c

diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index b4dced142116..41ae46314307 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -3,6 +3,9 @@
 config EARLY_PRINTK_USB
 	bool
 
+config X86_HARDLOCKUP_DETECTOR
+	def_bool y if HARDLOCKUP_DETECTOR_CORE
+
 config X86_VERBOSE_BOOTUP
 	bool "Enable verbose x86 bootup info messages"
 	default y
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 58eb858f33ff..706294fd5e46 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -108,6 +108,8 @@ obj-$(CONFIG_KGDB)		+= kgdb.o
 obj-$(CONFIG_VM86)		+= vm86_32.o
 obj-$(CONFIG_EARLY_PRINTK)	+= early_printk.o
 
+obj-$(CONFIG_X86_HARDLOCKUP_DETECTOR) += watchdog_hld.o
+
 obj-$(CONFIG_HPET_TIMER) 	+= hpet.o
 obj-$(CONFIG_X86_HARDLOCKUP_DETECTOR_HPET) += watchdog_hld_hpet.o
 
diff --git a/arch/x86/kernel/watchdog_hld.c b/arch/x86/kernel/watchdog_hld.c
new file mode 100644
index 000000000000..33c22f6456a3
--- /dev/null
+++ b/arch/x86/kernel/watchdog_hld.c
@@ -0,0 +1,86 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * A shim hardlockup detector for x86 to select between the perf- and HPET-
+ * driven implementations.
+ *
+ * Copyright (C) Intel Corporation 2023
+ */
+
+#include <linux/nmi.h>
+#include <asm/hpet.h>
+
+enum x86_hardlockup_detector {
+	X86_HARDLOCKUP_DETECTOR_PERF,
+	X86_HARDLOCKUP_DETECTOR_HPET,
+};
+
+static enum x86_hardlockup_detector detector_type __ro_after_init;
+
+int watchdog_nmi_enable(unsigned int cpu)
+{
+	switch (detector_type) {
+	case X86_HARDLOCKUP_DETECTOR_PERF:
+		hardlockup_detector_perf_enable();
+		break;
+	case X86_HARDLOCKUP_DETECTOR_HPET:
+		hardlockup_detector_hpet_enable(cpu);
+		break;
+	default:
+		return -ENODEV;
+	}
+
+	return 0;
+}
+
+void watchdog_nmi_disable(unsigned int cpu)
+{
+	switch (detector_type) {
+	case X86_HARDLOCKUP_DETECTOR_PERF:
+		hardlockup_detector_perf_disable();
+		break;
+	case X86_HARDLOCKUP_DETECTOR_HPET:
+		hardlockup_detector_hpet_disable(cpu);
+		break;
+	}
+}
+
+int __init watchdog_nmi_probe(void)
+{
+	int ret;
+
+	/*
+	 * Try first with the HPET hardlockup detector. It will only succeed if
+	 * requested via the kernel command line. The perf-based detector is
+	 * used by default.
+	 */
+	ret = hardlockup_detector_hpet_init();
+	if (!ret) {
+		detector_type = X86_HARDLOCKUP_DETECTOR_HPET;
+		return ret;
+	}
+
+	ret = hardlockup_detector_perf_init();
+	if (!ret) {
+		detector_type = X86_HARDLOCKUP_DETECTOR_PERF;
+		return ret;
+	}
+
+	return 0;
+}
+
+void watchdog_nmi_stop(void)
+{
+	/* Only the HPET lockup detector defines a stop function. */
+	if (detector_type == X86_HARDLOCKUP_DETECTOR_HPET)
+		hardlockup_detector_hpet_stop();
+}
+
+void watchdog_nmi_start(void)
+{
+	if (!(watchdog_enabled & NMI_WATCHDOG_ENABLED))
+		return;
+
+	/* Only the HPET lockup detector defines a start function. */
+	if (detector_type == X86_HARDLOCKUP_DETECTOR_HPET)
+		hardlockup_detector_hpet_start();
+}
-- 
2.25.1
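
The probe order and the switch-based dispatch can be exercised outside the
kernel with a small model. This is only a sketch under assumptions: the
hpet_requested and perf_usable knobs, the per-backend counters, and the
shim_*() names are hypothetical. Reporting the perf error code when neither
backend initializes is a choice of this sketch.

```c
#include <errno.h>
#include <stdbool.h>

enum detector {
	DETECTOR_PERF,
	DETECTOR_HPET,
};

/* Hypothetical knobs standing in for the command line and the PMU. */
static bool hpet_requested;
static bool perf_usable;

static enum detector detector_type;
static unsigned int perf_cpus, hpet_cpus;	/* CPUs enabled per backend */

static int hpet_init(void) { return hpet_requested ? 0 : -ENODEV; }
static int perf_init(void) { return perf_usable ? 0 : -ENODEV; }

/* Try HPET first; it only succeeds when requested. Otherwise try perf. */
static int shim_probe(void)
{
	int ret;

	ret = hpet_init();
	if (!ret) {
		detector_type = DETECTOR_HPET;
		return 0;
	}

	ret = perf_init();
	if (!ret)
		detector_type = DETECTOR_PERF;

	/* Choice of this sketch: report the perf error if both fail. */
	return ret;
}

/* Dispatch to whichever backend the probe selected. */
static int shim_enable(unsigned int cpu)
{
	(void)cpu;	/* the model only counts CPUs */

	switch (detector_type) {
	case DETECTOR_PERF:
		perf_cpus++;
		break;
	case DETECTOR_HPET:
		hpet_cpus++;
		break;
	default:
		return -ENODEV;
	}

	return 0;
}
```

With hpet_requested left false, shim_probe() settles on the perf backend,
which matches the "perf-based detector is used by default" comment in
watchdog_nmi_probe().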


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v7 23/24] watchdog: Introduce hardlockup_detector_mark_unavailable()
  2023-03-01 23:47 ` Ricardo Neri
@ 2023-03-01 23:47   ` Ricardo Neri
  -1 siblings, 0 replies; 52+ messages in thread
From: Ricardo Neri @ 2023-03-01 23:47 UTC (permalink / raw)
  To: Tony Luck, Dave Hansen, Rafael J. Wysocki, Reinette Chatre,
	Dan Williams, Len Brown
  Cc: Andi Kleen, Stephane Eranian, Ravi V. Shankar, Ricardo Neri,
	linuxppc-dev, iommu, linux-kernel, Ricardo Neri

The NMI watchdog may become unreliable at runtime. This is the case on
x86 if, for instance, the HPET-based hardlockup detector is in use and
the TSC becomes unstable.

Introduce a new interface to mark the hardlockup detector as unavailable
in such cases. When doing this, update the state of
/proc/sys/kernel/nmi_watchdog to keep it consistent.

Cc: Andi Kleen <ak@linux.intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v6:
 * Introduced this patch

Changes since v5:
 * N/A

Changes since v4:
 * N/A

Changes since v3:
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 include/linux/nmi.h |  2 ++
 kernel/watchdog.c   | 20 ++++++++++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index a38c4509f9eb..40a97139ec65 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -83,9 +83,11 @@ static inline void reset_hung_task_detector(void) { }
 
 #if defined(CONFIG_HARDLOCKUP_DETECTOR)
 extern void hardlockup_detector_disable(void);
+extern void hardlockup_detector_mark_unavailable(void);
 extern unsigned int hardlockup_panic;
 #else
 static inline void hardlockup_detector_disable(void) {}
+static inline void hardlockup_detector_mark_unavailable(void) {}
 #endif
 
 #if defined(CONFIG_HAVE_NMI_WATCHDOG) || defined(CONFIG_HARDLOCKUP_DETECTOR)
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 8e61f21e7e33..0e4fed6d95b9 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -47,6 +47,8 @@ static int __read_mostly nmi_watchdog_available;
 struct cpumask watchdog_cpumask __read_mostly;
 unsigned long *watchdog_cpumask_bits = cpumask_bits(&watchdog_cpumask);
 
+static void __lockup_detector_reconfigure(void);
+
 #ifdef CONFIG_HARDLOCKUP_DETECTOR
 
 # ifdef CONFIG_SMP
@@ -85,6 +87,24 @@ static int __init hardlockup_panic_setup(char *str)
 }
 __setup("nmi_watchdog=", hardlockup_panic_setup);
 
+/**
+ * hardlockup_detector_mark_unavailable - Mark the NMI watchdog as unavailable
+ *
+ * Indicate that the hardlockup detector has become unavailable. This may
+ * happen if the hardware resources that the detector uses have become
+ * unreliable.
+ */
+void hardlockup_detector_mark_unavailable(void)
+{
+	mutex_lock(&watchdog_mutex);
+
+	/* These variables can be updated without stopping the detector. */
+	nmi_watchdog_user_enabled = 0;
+	nmi_watchdog_available = false;
+
+	__lockup_detector_reconfigure();
+	mutex_unlock(&watchdog_mutex);
+}
 #endif /* CONFIG_HARDLOCKUP_DETECTOR */
 
 /*
-- 
2.25.1
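
To illustrate why clearing both variables keeps the reported state
consistent, here is a simplified userspace model. It assumes that the
enabled mask is recomputed from the user setting and the availability flag,
roughly as lockup_detector_update_enable() does; the bit values and the
model_*() names are assumptions of this sketch, and the watchdog_mutex
locking is omitted.

```c
#include <stdbool.h>

/* Assumed bit layout for the sketch. */
#define NMI_WATCHDOG_ENABLED	(1u << 0)
#define SOFT_WATCHDOG_ENABLED	(1u << 1)

static int nmi_watchdog_user_enabled = 1;	/* what the sysctl would report */
static int soft_watchdog_user_enabled = 1;
static bool nmi_watchdog_available = true;
static unsigned int watchdog_enabled;

/* Recompute the effective mask from user settings and availability. */
static void model_update_enabled(void)
{
	watchdog_enabled = 0;

	if (nmi_watchdog_available && nmi_watchdog_user_enabled)
		watchdog_enabled |= NMI_WATCHDOG_ENABLED;
	if (soft_watchdog_user_enabled)
		watchdog_enabled |= SOFT_WATCHDOG_ENABLED;
}

/* Models hardlockup_detector_mark_unavailable() without the locking. */
static void model_mark_unavailable(void)
{
	nmi_watchdog_user_enabled = 0;
	nmi_watchdog_available = false;

	/* Stands in for __lockup_detector_reconfigure(). */
	model_update_enabled();
}
```

In this model the soft lockup detector keeps running after
model_mark_unavailable(), and the user-visible NMI setting reads 0 instead
of claiming that a detector that no longer works is still enabled.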


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v7 24/24] x86/tsc: Stop the HPET hardlockup detector if TSC becomes unstable
  2023-03-01 23:47 ` Ricardo Neri
@ 2023-03-01 23:47   ` Ricardo Neri
  -1 siblings, 0 replies; 52+ messages in thread
From: Ricardo Neri @ 2023-03-01 23:47 UTC (permalink / raw)
  To: Tony Luck, Dave Hansen, Rafael J. Wysocki, Reinette Chatre,
	Dan Williams, Len Brown
  Cc: Andi Kleen, Stephane Eranian, Ravi V. Shankar, Ricardo Neri,
	linuxppc-dev, iommu, linux-kernel, Ricardo Neri

The HPET-based hardlockup detector relies on the TSC to determine whether
an observed NMI originated from the HPET timer. Hence, this detector
cannot be used with an unstable TSC. Once marked as unstable, the TSC is
never considered stable again. In such a case, permanently stop the
HPET-based hardlockup detector.

Cc: Andi Kleen <ak@linux.intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: "Ravi V. Shankar" <ravi.v.shankar@intel.com>
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-dev@lists.ozlabs.org
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
---
Changes since v6:
 * Do not switch to the perf-based NMI watchdog. Instead, only stop
   the HPET-based NMI watchdog if the TSC counter becomes unstable.

Changes since v5:
 * Relocated the declaration of hardlockup_detector_switch_to_perf() to
   x86/nmi.h. It does not depend on HPET.
 * Removed function stub. The shim hardlockup detector is always built
   for x86.

Changes since v4:
 * Added a stub version of hardlockup_detector_switch_to_perf() for
   !CONFIG_HPET_TIMER. (lkp)
 * Reconfigure the whole lockup detector instead of unconditionally
   starting the perf-based hardlockup detector.

Changes since v3:
 * None

Changes since v2:
 * Introduced this patch.

Changes since v1:
 * N/A
---
 arch/x86/include/asm/nmi.h     |  6 ++++++
 arch/x86/kernel/tsc.c          |  3 +++
 arch/x86/kernel/watchdog_hld.c | 11 +++++++++++
 3 files changed, 20 insertions(+)

diff --git a/arch/x86/include/asm/nmi.h b/arch/x86/include/asm/nmi.h
index 5c5f1e56c404..4d0687a2b4ea 100644
--- a/arch/x86/include/asm/nmi.h
+++ b/arch/x86/include/asm/nmi.h
@@ -63,4 +63,10 @@ void stop_nmi(void);
 void restart_nmi(void);
 void local_touch_nmi(void);
 
+#ifdef CONFIG_HARDLOCKUP_DETECTOR
+extern void hardlockup_detector_mark_hpet_hld_unavailable(void);
+#else
+static inline void hardlockup_detector_mark_hpet_hld_unavailable(void) {}
+#endif
+
 #endif /* _ASM_X86_NMI_H */
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 344698852146..24f77efea569 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1191,6 +1191,9 @@ void mark_tsc_unstable(char *reason)
 
 	clocksource_mark_unstable(&clocksource_tsc_early);
 	clocksource_mark_unstable(&clocksource_tsc);
+
+	/* The HPET hardlockup detector depends on a stable TSC. */
+	hardlockup_detector_mark_hpet_hld_unavailable();
 }
 
 EXPORT_SYMBOL_GPL(mark_tsc_unstable);
diff --git a/arch/x86/kernel/watchdog_hld.c b/arch/x86/kernel/watchdog_hld.c
index 33c22f6456a3..f5d79ce0e7a2 100644
--- a/arch/x86/kernel/watchdog_hld.c
+++ b/arch/x86/kernel/watchdog_hld.c
@@ -6,6 +6,8 @@
  * Copyright (C) Intel Corporation 2023
  */
 
+#define pr_fmt(fmt) "watchdog: " fmt
+
 #include <linux/nmi.h>
 #include <asm/hpet.h>
 
@@ -84,3 +86,12 @@ void watchdog_nmi_start(void)
 	if (detector_type == X86_HARDLOCKUP_DETECTOR_HPET)
 		hardlockup_detector_hpet_start();
 }
+
+void hardlockup_detector_mark_hpet_hld_unavailable(void)
+{
+	if (detector_type != X86_HARDLOCKUP_DETECTOR_HPET)
+		return;
+
+	pr_warn("TSC is unstable. Stopping the HPET NMI watchdog.\n");
+	hardlockup_detector_mark_unavailable();
+}
-- 
2.25.1
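
The dependency on a stable TSC can be pictured with a small model. This is
a sketch under assumptions, not the detector itself: it supposes that an
NMI is attributed to the HPET timer when the TSC value read in the handler
falls within an error window around the expected value. The struct
hld_model type, its fields, and the model_*() names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state for the model. */
struct hld_model {
	uint64_t tsc_next;		/* TSC value expected at the next HPET NMI */
	uint64_t tsc_next_error;	/* tolerated deviation around tsc_next */
	bool running;			/* false once the detector is stopped */
};

/* Attribute an NMI to the HPET timer only if the TSC is inside the window. */
static bool model_is_hpet_nmi(const struct hld_model *m, uint64_t tsc_now)
{
	uint64_t delta;

	if (!m->running)
		return false;

	delta = tsc_now > m->tsc_next ? tsc_now - m->tsc_next
				      : m->tsc_next - tsc_now;

	return delta <= m->tsc_next_error;
}

/* Models the hook called from mark_tsc_unstable(): stop for good. */
static void model_mark_tsc_unstable(struct hld_model *m)
{
	m->running = false;
}
```

Once the TSC is unstable, the window comparison is meaningless, so the
model never attributes another NMI to the HPET timer after
model_mark_tsc_unstable().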


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH v7 00/24] x86: Implement an HPET-based hardlockup detector
  2023-03-01 23:47 ` Ricardo Neri
@ 2023-04-13  3:58   ` Ricardo Neri
  -1 siblings, 0 replies; 52+ messages in thread
From: Ricardo Neri @ 2023-04-13  3:58 UTC (permalink / raw)
  To: Tony Luck, Dave Hansen, Rafael J. Wysocki, Reinette Chatre,
	Dan Williams, Len Brown
  Cc: Andi Kleen, Stephane Eranian, Ravi V. Shankar, Ricardo Neri,
	linuxppc-dev, iommu, linux-kernel

On Wed, Mar 01, 2023 at 03:47:29PM -0800, Ricardo Neri wrote:
> Hi x86 trusted reviewers,
> 
> This is the seventh version of this patchset. I acknowledge that it took me
> a long time to post a new version. Sorry! I will commit time to continue
> working on this series with high priority and I will post a new series soon
> after receiving your new feedback.
> 
> Although this series touches several subsystems, I plan to send it to the
> x86 maintainers because a) the series does not make much sense if split
> into subsystems, b) Thomas Gleixner has reviewed previous versions, and c)
> he has contributed to all the subsystems I modify.
> 
> Tony Luck has kindly reviewed previous versions of the series and I carried
> his Reviewed-by tags. This version, however, has new patches that also need
> review.
> 
> I seek to collect the Reviewed-by tags from the x86 trusted reviewers for
> the following patches:
>    + arch/x86: 4, 5
>    + Intel IOMMU: 6,
>    + AMD IOMMU: 9, 10, 11,
>    + NMI watchdog: 23 and 24.

Hello, checking if there is any feedback on these patches that I plan to
send to the x86 maintainer.

I am still seeking to collect the Reviewed-by: tags from the x86 trusted
reviewers.

Thanks in advance!

BR,
Ricardo

^ permalink raw reply	[flat|nested] 52+ messages in thread

Thread overview: 52+ messages
2023-03-01 23:47 [PATCH v7 00/24] x86: Implement an HPET-based hardlockup detector Ricardo Neri
2023-03-01 23:47 ` Ricardo Neri
2023-03-01 23:47 ` [PATCH v7 01/24] x86/apic: Add irq_cfg::delivery_mode Ricardo Neri
2023-03-01 23:47   ` Ricardo Neri
2023-03-01 23:47 ` [PATCH v7 02/24] x86/apic/msi: Use the delivery mode from irq_cfg for message composition Ricardo Neri
2023-03-01 23:47   ` Ricardo Neri
2023-03-01 23:47 ` [PATCH v7 03/24] x86/apic: Add the X86_IRQ_ALLOC_AS_NMI interrupt allocation flag Ricardo Neri
2023-03-01 23:47   ` Ricardo Neri
2023-03-01 23:47 ` [PATCH v7 04/24] x86/apic/vector: Implement a local APIC NMI controller Ricardo Neri
2023-03-01 23:47   ` Ricardo Neri
2023-03-01 23:47 ` [PATCH v7 05/24] x86/apic/vector: Skip cleanup for the NMI vector Ricardo Neri
2023-03-01 23:47   ` Ricardo Neri
2023-03-01 23:47 ` [PATCH v7 06/24] iommu/vt-d: Clear the redirection hint when the destination mode is physical Ricardo Neri
2023-03-01 23:47   ` Ricardo Neri
2023-03-01 23:47 ` [PATCH v7 07/24] iommu/vt-d: Rework prepare_irte() to support per-interrupt delivery mode Ricardo Neri
2023-03-01 23:47   ` Ricardo Neri
2023-03-01 23:47 ` [PATCH v7 08/24] iommu/vt-d: Set the IRTE delivery mode individually for each interrupt Ricardo Neri
2023-03-01 23:47   ` Ricardo Neri
2023-03-01 23:47 ` [PATCH v7 09/24] iommu/amd: Expose [set|get]_dev_entry_bit() Ricardo Neri
2023-03-01 23:47   ` Ricardo Neri
2023-03-01 23:47 ` [PATCH v7 10/24] iommu/amd: Enable NMIPass when allocating an NMI Ricardo Neri
2023-03-01 23:47   ` Ricardo Neri
2023-03-01 23:47 ` [PATCH v7 11/24] iommu/amd: Compose MSI messages for NMIs in non-IR format Ricardo Neri
2023-03-01 23:47   ` Ricardo Neri
2023-03-01 23:47 ` [PATCH v7 12/24] x86/hpet: Expose hpet_writel() in header Ricardo Neri
2023-03-01 23:47   ` Ricardo Neri
2023-03-01 23:47 ` [PATCH v7 13/24] x86/hpet: Add helper function hpet_set_comparator_periodic() Ricardo Neri
2023-03-01 23:47   ` Ricardo Neri
2023-03-01 23:47 ` [PATCH v7 14/24] x86/hpet: Prepare IRQ assignments to use the X86_ALLOC_AS_NMI flag Ricardo Neri
2023-03-01 23:47   ` Ricardo Neri
2023-03-01 23:47 ` [PATCH v7 15/24] x86/hpet: Reserve an HPET channel for the hardlockup detector Ricardo Neri
2023-03-01 23:47   ` Ricardo Neri
2023-03-01 23:47 ` [PATCH v7 16/24] watchdog/hardlockup: Define a generic function to detect hardlockups Ricardo Neri
2023-03-01 23:47   ` Ricardo Neri
2023-03-01 23:47 ` [PATCH v7 17/24] watchdog/hardlockup: Decouple the hardlockup detector from perf Ricardo Neri
2023-03-01 23:47   ` Ricardo Neri
2023-03-01 23:47 ` [PATCH v7 18/24] init/main: Delay initialization of the lockup detector after smp_init() Ricardo Neri
2023-03-01 23:47   ` Ricardo Neri
2023-03-01 23:47 ` [PATCH v7 19/24] x86/watchdog/hardlockup: Add an HPET-based hardlockup detector Ricardo Neri
2023-03-01 23:47   ` Ricardo Neri
2023-03-01 23:47 ` [PATCH v7 20/24] x86/watchdog/hardlockup/hpet: Determine if HPET timer caused NMI Ricardo Neri
2023-03-01 23:47   ` Ricardo Neri
2023-03-01 23:47 ` [PATCH v7 21/24] watchdog/hardlockup/hpet: Only enable the HPET watchdog via a boot parameter Ricardo Neri
2023-03-01 23:47   ` Ricardo Neri
2023-03-01 23:47 ` [PATCH v7 22/24] x86/watchdog: Add a shim hardlockup detector Ricardo Neri
2023-03-01 23:47   ` Ricardo Neri
2023-03-01 23:47 ` [PATCH v7 23/24] watchdog: Introduce hardlockup_detector_mark_unavailable() Ricardo Neri
2023-03-01 23:47   ` Ricardo Neri
2023-03-01 23:47 ` [PATCH v7 24/24] x86/tsc: Stop the HPET hardlockup detector if TSC becomes unstable Ricardo Neri
2023-03-01 23:47   ` Ricardo Neri
2023-04-13  3:58 ` [PATCH v7 00/24] x86: Implement an HPET-based hardlockup detector Ricardo Neri
2023-04-13  3:58   ` Ricardo Neri
