archive mirror
 help / color / mirror / Atom feed
From: Thomas Gleixner <>
To: Stephane Eranian <>
Cc: Ricardo Neri <>,
	Ingo Molnar <>, Borislav Petkov <>,
	Ashok Raj <>, Joerg Roedel <>,
	Andi Kleen <>,
	Peter Zijlstra <>,
	Suravee Suthikulpanit <>,
	"Ravi V. Shankar" <>,
	Randy Dunlap <>, x86 <>,
	LKML <>,,
	Ricardo Neri <>,
	Tony Luck <>,
	Jacob Pan <>,
	Juergen Gross <>,
	Bjorn Helgaas <>,
	Wincy Van <>,
	Kate Stewart <>,
	Philippe Ombredanne <>,
	"Eric W. Biederman" <>,
	Baoquan He <>, Jan Kiszka <>,
	Lu Baolu <>
Subject: Re: [RFC PATCH v4 20/21] iommu/vt-d: hpet: Reserve an interrupt remampping table entry for watchdog
Date: Tue, 18 Jun 2019 01:08:06 +0200 (CEST)	[thread overview]
Message-ID: <> (raw)
In-Reply-To: <>


On Mon, 17 Jun 2019, Stephane Eranian wrote:
> On Mon, Jun 17, 2019 at 1:25 AM Thomas Gleixner <> wrote:
> > Great that there is no trace of any mail from Andi or Stephane about this
> > on LKML. There is no problem with talking offlist about this stuff, but
> > then you should at least provide a rationale for those who were not part of
> > the private conversation.
> >
> Let me add some context to this whole patch series. The pressure on the
> core PMU counters is increasing as more people want to use them to
> measure always more events. When the PMU is overcommitted, i.e., more
> events than counters for them, there is multiplexing. It comes with an
> overhead that is too high for certain applications. One way to avoid this
> is to lower the multiplexing frequency, which is by default 1ms, but that
> comes with loss of accuracy. Another approach is to measure only a small
> number of events at a time and use multiple runs, but then you lose
> consistent event view. Another approach is to push for increasing the
> number of counters. But getting new hardware counters takes time. Short
> term, we can investigate what it would take to free one cycle-capable
> counter which is commandeered by the hard lockup detector on all X86
> processors today. The functionality of the watchdog, being able to get a
> crash dump on kernel deadlocks, is important and we cannot simply disable
> it. At scale, many bugs are exposed and thus machines
> deadlock. Therefore, we want to investigate what it would take to move
> the detector to another NMI-capable source, such as the HPET because the
> detector does not need high low granularity timer and interrupts only
> every 2s.

I'm well aware about the reasons for this.

> Furthermore, recent Intel erratum, e.g., the TSX issue forcing the TFA
> code in perf_events, have increased the pressure even more with only 3
> generic counters left. Thus, it is time to look at alternative ways of
> getting a hard lockup detector (NMI watchdog) from another NMI source
> than the PMU. To that extent, I have been discussing about alternatives.
> Intel suggested using the HPET and Ricardo has been working on
> producing this patch series. It is clear from your review
> that the patches have issues, but I am hoping that they can be
> resolved with constructive feedback knowing what the end goal is.

Well, I gave constructive feedback from the very first version on. But
essential parts of that feedback have been ignored for whatever reasons.

> As for the round-robin changes, yes, we discussed this as an alternative
> to avoid overloading CPU0 with handling all of the work to broadcasting
> IPI to 100+ other CPUs.

I can understand the reason why you don't want to do that, but again, I
said way before this was tried that changing affinity from NMI context with
the IOMMU cannot work by just calling into the iommu code and it needs some
deep investigation with the IOMMU wizards whether a preallocated entry can
be used lockless (including the subsequently required flush).

The outcome is that the change was implemented by simply calling into
functions which I told that they cannot be called from NMI context.

Unless this problem is not solved and I doubt it can be solved after
talking to IOMMU people and studying manuals, the round robin mechanics in
the current form are not going to happen. We'd need a SMI based lockup
detector to debug the resulting livelock wreckage.

There are two possible options:

  1) Back to the IPI approach

     The probem with broadcast is that it sends IPIs one by one to each
     online CPU, which sums up with a large number of CPUs.

     The interesting question is why the kernel does not utilize the all
     excluding self destination shorthand for this. The SDM is not giving
     any information.

     But there is a historic commit which is related and gives a hint:

        commit e77deacb7b078156fcadf27b838a4ce1a65eda04
        Author: Keith Owens <>
        Date:   Mon Jun 26 13:59:56 2006 +0200

        [PATCH] x86_64: Avoid broadcasting NMI IPIs
        On some i386/x86_64 systems, sending an NMI IPI as a broadcast will
    	reset the system.  This seems to be a BIOS bug which affects
    	machines where one or more cpus are not under OS control.  It
    	occurs on HT systems with a version of the OS that is not compiled
    	without HT support.  It also occurs when a system is booted with
    	max_cpus=n where 2 <= n < cpus known to the BIOS.  The fix is to
    	always send NMI IPI as a mask instead of as a broadcast.

    I can see the issue with max_cpus and that'd be trivial to solve by
    disabling the HPET watchdog when maxcpus < num_present_cpus is on the
    command line (That's broken anyway with respect to MCEs. See the stupid
    dance we need to do for 'nosmt').

    Though the HT part of the changelog is unparseable garbage but might be
    a cryptic hint to the 'nosmt' mess. Though back then we did not have
    a way to disable the siblings (or did we?). Weird...

    It definitely would be worthwhile to experiment with that. if we could
    use shorthands (also for regular IPIs) that would be a great
    improvement in general and would nicely solve that NMI issue. Beware of
    the dragons though.

  2) Delegate round robin to irq_work

    Use the same mechanism as perf for stuff which needs to be done outside
    of NMI context.

    That can solve the issue, but the drawback is that in case the NMI hits
    a locked up interrupt disabled region the affinity stays on the same
    CPU as the regular IPI which kicks the irq work is not going to be
    handled.  Might not be a big issue as we could detect the situation
    that the IPI comes back to the same CPU. Not pretty and lots of nasty
    corner case and race condition handling.

    There is another issue with that round robin scheme as I pointed out to

      With a small watchdog threshold and tons of CPUs the time to switch
      the affinity becomes short. That brings the HPET reprogramming (in
      case of oneshot) into the SMI endagered zone and aside of that it
      will eat performance as well because with lets say 1 second threshold
      and 1000 CPUs we are going to flush the interrupt remapping
      table/entry once per millisecond. No idea how big the penalty is, but
      it's certainly not free.

    One possible way out would be to use a combined approach of building
    CPU groups (lets say 8) where one of the CPUs gets the NMI and IPIs the
    other 7 and then round robins to the next group. Whether that's any
    better, I can't tell.

Sorry that I can't come up with the magic cure and just can provide more
questions than answers.



  reply	other threads:[~2019-06-17 23:08 UTC|newest]

Thread overview: 55+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-05-24  1:16 [RFC PATCH v4 00/21] Implement an HPET-based hardlockup detector Ricardo Neri
2019-05-24  1:16 ` [RFC PATCH v4 01/21] x86/msi: Add definition for NMI delivery mode Ricardo Neri
2019-05-24  1:16 ` [RFC PATCH v4 02/21] x86/hpet: Expose hpet_writel() in header Ricardo Neri
2019-05-24  1:16 ` [RFC PATCH v4 03/21] x86/hpet: Calculate ticks-per-second in a separate function Ricardo Neri
2019-06-14 15:54   ` Thomas Gleixner
2019-06-14 15:59     ` Thomas Gleixner
2019-06-18 22:48     ` Ricardo Neri
2019-06-18 23:13       ` Thomas Gleixner
2019-05-24  1:16 ` [RFC PATCH v4 04/21] x86/hpet: Add hpet_set_comparator() for periodic and one-shot modes Ricardo Neri
2019-06-14 18:17   ` Thomas Gleixner
2019-06-18 22:48     ` Ricardo Neri
2019-05-24  1:16 ` [RFC PATCH v4 05/21] x86/hpet: Reserve timer for the HPET hardlockup detector Ricardo Neri
2019-06-11 19:54   ` Thomas Gleixner
2019-06-14  1:14     ` Ricardo Neri
2019-06-14 16:10       ` Thomas Gleixner
2019-06-18 22:48         ` Ricardo Neri
2019-05-24  1:16 ` [RFC PATCH v4 06/21] x86/hpet: Configure the timer used by the " Ricardo Neri
2019-05-24  1:16 ` [RFC PATCH v4 07/21] watchdog/hardlockup: Define a generic function to detect hardlockups Ricardo Neri
2019-05-24  1:16 ` [RFC PATCH v4 08/21] watchdog/hardlockup: Decouple the hardlockup detector from perf Ricardo Neri
2019-05-24  1:16 ` [RFC PATCH v4 09/21] x86/nmi: Add a NMI_WATCHDOG NMI handler category Ricardo Neri
2019-05-24  1:16 ` [RFC PATCH v4 10/21] watchdog/hardlockup: Add function to enable NMI watchdog on all allowed CPUs at once Ricardo Neri
2019-05-24  1:16 ` [RFC PATCH v4 11/21] x86/watchdog/hardlockup: Add an HPET-based hardlockup detector Ricardo Neri
2019-05-24  1:16 ` [RFC PATCH v4 12/21] watchdog/hardlockup/hpet: Adjust timer expiration on the number of monitored CPUs Ricardo Neri
2019-06-11 20:11   ` Thomas Gleixner
2019-06-18 22:46     ` Ricardo Neri
2019-05-24  1:16 ` [RFC PATCH v4 13/21] x86/watchdog/hardlockup/hpet: Determine if HPET timer caused NMI Ricardo Neri
2019-05-24  1:16 ` [RFC PATCH v4 14/21] watchdog/hardlockup: Use parse_option_str() to handle "nmi_watchdog" Ricardo Neri
2019-05-24  1:16 ` [RFC PATCH v4 15/21] watchdog/hardlockup/hpet: Only enable the HPET watchdog via a boot parameter Ricardo Neri
2019-05-24  1:16 ` [RFC PATCH v4 16/21] x86/watchdog: Add a shim hardlockup detector Ricardo Neri
2019-05-24  1:16 ` [RFC PATCH v4 17/21] x86/tsc: Switch to perf-based hardlockup detector if TSC become unstable Ricardo Neri
2019-06-07  0:35   ` Stephane Eranian
2019-06-07 14:14     ` Ricardo Neri
2019-05-24  1:16 ` [RFC PATCH v4 18/21] x86/apic: Add a parameter for the APIC delivery mode Ricardo Neri
2019-06-16  9:55   ` Thomas Gleixner
2019-06-18 22:47     ` Ricardo Neri
2019-06-18 23:15       ` Thomas Gleixner
2019-05-24  1:16 ` [RFC PATCH v4 19/21] iommu/vt-d: Rework prepare_irte() to support per-irq " Ricardo Neri
2019-05-24  1:16 ` [RFC PATCH v4 20/21] iommu/vt-d: hpet: Reserve an interrupt remampping table entry for watchdog Ricardo Neri
2019-06-16 18:42   ` Thomas Gleixner
2019-06-16 19:21   ` Thomas Gleixner
2019-06-17  8:25     ` Thomas Gleixner
2019-06-17 21:38       ` Stephane Eranian
2019-06-17 23:08         ` Thomas Gleixner [this message]
2019-06-19 15:43           ` Jacob Pan
2019-06-21 15:33             ` Thomas Gleixner
2019-06-21 17:31               ` Jacob Pan
2019-06-21 18:39                 ` Jacob Pan
2019-06-21 20:05                   ` Thomas Gleixner
2019-06-21 23:55                     ` Ricardo Neri
2019-06-22  7:21                       ` Thomas Gleixner
2019-10-18  2:48           ` Ricardo Neri
2019-06-18 22:45       ` Ricardo Neri
2019-05-24  1:16 ` [RFC PATCH v4 21/21] x86/watchdog/hardlockup/hpet: Support interrupt remapping Ricardo Neri
2019-06-16  8:44   ` Thomas Gleixner
2019-06-16  8:53     ` Thomas Gleixner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).