linux-edac.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Tony Luck <tony.luck@intel.com>
To: Borislav Petkov <bp@alien8.de>
Cc: Yazen Ghannam <yazen.ghannam@amd.com>,
	Smita.KoralahalliChannabasappa@amd.com,
	dave.hansen@linux.intel.com, x86@kernel.org,
	linux-edac@vger.kernel.org, linux-kernel@vger.kernel.org,
	patches@lists.linux.dev, Tony Luck <tony.luck@intel.com>
Subject: [PATCH v9 0/3] Handle corrected machine check interrupt storms
Date: Wed,  4 Oct 2023 11:36:20 -0700	[thread overview]
Message-ID: <20231004183623.17067-1-tony.luck@intel.com> (raw)
In-Reply-To: <20230929181626.210782-1-tony.luck@intel.com>

Linux CMCI storm mitigation is a big hammer that just disables the CMCI
interrupt globally and switches to polling all banks.

There are two problems with this:
1) It really is a big hammer. It means that errors reported in other
banks from different functional units are all subject to the same
polling delay before being processed.
2) Intel systems signal some uncorrected errors using CMCI (e.g.
memory controller patrol scrub on Icelake Xeon and newer). Delaying
processing these error reports negates some of the benefit of the patrol
scrubber providing early notice of errors before they are consumed and
cause a machine check.

This series throws away the old storm implementation and replaces it
with one that keeps track of the weather on each separate machine check
bank. When a storm is detected from a bank. On Intel the storm is
mitigated by setting a very high threshold for corrected errors to
signal CMCI. This threshold does not affect signaling CMCI for
uncorrected errors.

Signed-off-by: Tony Luck <tony.luck@intel.com>

---

Changes since v8:

Fixed issue reported by lkp with randconfig build with neither
CONFIG_X86_MCE_INTEL not CONFIG_X86_MCE_AMD set by making a
cleaner division between the storm tracking code in threshold.c
with the restof the code using more function accessors that can
be stubbed out.

Tony Luck (3):
  x86/mce: Remove old CMCI storm mitigation code
  x86/mce: Add per-bank CMCI storm mitigation
  x86/mce: Handle Intel threshold interrupt storms

 arch/x86/kernel/cpu/mce/internal.h  |  48 ++++-
 arch/x86/kernel/cpu/mce/core.c      |  45 ++---
 arch/x86/kernel/cpu/mce/intel.c     | 303 ++++++++++++----------------
 arch/x86/kernel/cpu/mce/threshold.c | 115 +++++++++++
 4 files changed, 304 insertions(+), 207 deletions(-)


base-commit: 6465e260f48790807eef06b583b38ca9789b6072
-- 
2.41.0


  parent reply	other threads:[~2023-10-04 18:37 UTC|newest]

Thread overview: 99+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-04-06  6:35 [RFC PATCH 0/5] Handle corrected machine check interrupt storms Smita Koralahalli
2022-04-06  6:35 ` [PATCH 1/5] x86/mce: Remove old CMCI storm mitigation code Smita Koralahalli
2022-04-06  6:35 ` [PATCH 2/5] x86/mce: Add per-bank CMCI storm mitigation Smita Koralahalli
2022-04-06  6:35 ` [RFC PATCH 3/5] x86/mce: Introduce a function pointer mce_handle_storm Smita Koralahalli
2022-04-06 22:38   ` Luck, Tony
2022-04-06  6:35 ` [RFC PATCH 4/5] x86/mce: Move storm handling to core Smita Koralahalli
2022-06-21  5:08   ` Luck, Tony
2022-06-27 17:36     ` [PATCH v2 0/5] Handle corrected machine check interrupt storms Tony Luck
2022-06-27 17:36       ` [PATCH v2 1/5] x86/mce: Remove old CMCI storm mitigation code Tony Luck
2022-06-27 17:36       ` [PATCH v2 2/5] x86/mce: Add per-bank CMCI storm mitigation Tony Luck
2022-06-27 17:36       ` [PATCH v2 3/5] x86/mce: Introduce mce_handle_storm() to deal with begin/end of storms Tony Luck
2022-06-27 17:36       ` [PATCH v2 4/5] x86/mce: Move storm handling to core Tony Luck
2022-06-27 17:36       ` [PATCH v2 5/5] x86/mce: Handle AMD threshold interrupt storms Tony Luck
2023-03-17 14:50       ` [PATCH v2 0/5] Handle corrected machine check " Yazen Ghannam
2023-03-17 17:20         ` [PATCH v3 " Tony Luck
2023-03-17 17:20           ` [PATCH v3 1/5] x86/mce: Remove old CMCI storm mitigation code Tony Luck
2023-03-17 17:20           ` [PATCH v3 2/5] x86/mce: Add per-bank CMCI storm mitigation Tony Luck
2023-03-17 17:20           ` [PATCH v3 3/5] x86/mce: Introduce mce_handle_storm() to deal with begin/end of storms Tony Luck
2023-03-23 15:22             ` Yazen Ghannam
2023-03-23 18:00               ` Tony Luck
2023-03-17 17:20           ` [PATCH v3 4/5] x86/mce: Move storm handling to core Tony Luck
2023-03-23 15:27             ` Yazen Ghannam
2023-03-23 18:10               ` Luck, Tony
2023-03-23 20:26                 ` Luck, Tony
2023-03-24 20:44                   ` Yazen Ghannam
2023-03-29 15:26                   ` Yazen Ghannam
2023-04-03 19:03                     ` Luck, Tony
2023-04-03 21:07                     ` [PATCH v4 0/5] Handle corrected machine check interrupt storms Tony Luck
2023-04-03 21:07                       ` [PATCH v4 1/5] x86/mce: Remove old CMCI storm mitigation code Tony Luck
2023-04-03 21:07                       ` [PATCH v4 2/5] x86/mce: Add per-bank CMCI storm mitigation Tony Luck
2023-04-11 12:32                         ` Borislav Petkov
2023-04-11 14:06                           ` Yazen Ghannam
2023-04-11 16:06                             ` Luck, Tony
2023-04-11 17:17                               ` Borislav Petkov
2023-04-03 21:07                       ` [PATCH v4 3/5] x86/mce: Introduce mce_handle_storm() to deal with begin/end of storms Tony Luck
2023-04-03 21:07                       ` [PATCH v4 4/5] x86/mce: Move storm handling to core Tony Luck
2023-04-03 21:07                       ` [PATCH v4 5/5] x86/mce: Handle AMD threshold interrupt storms Tony Luck
2023-04-11 17:38                       ` [PATCH v5 0/5] Handle corrected machine check " Tony Luck
2023-04-11 17:38                         ` [PATCH v5 1/5] x86/mce: Remove old CMCI storm mitigation code Tony Luck
2023-04-11 17:38                         ` [PATCH v5 2/5] x86/mce: Add per-bank CMCI storm mitigation Tony Luck
2023-06-13 17:45                           ` Borislav Petkov
2023-06-16 18:15                             ` Tony Luck
2023-04-11 17:38                         ` [PATCH v5 3/5] x86/mce: Introduce mce_handle_storm() to deal with begin/end of storms Tony Luck
2023-04-11 17:38                         ` [PATCH v5 4/5] x86/mce: Move storm handling to core Tony Luck
2023-04-11 17:38                         ` [PATCH v5 5/5] x86/mce: Handle AMD threshold interrupt storms Tony Luck
2023-06-16 18:27                         ` [PATCH v6 0/4] Handle corrected machine check " Tony Luck
2023-06-16 18:27                           ` [PATCH v6 1/4] x86/mce: Remove old CMCI storm mitigation code Tony Luck
2023-06-16 18:27                           ` [PATCH v6 2/4] x86/mce: Add per-bank CMCI storm mitigation Tony Luck
2023-06-23 12:09                             ` Borislav Petkov
2023-06-23 15:40                               ` Luck, Tony
2023-07-17  8:58                                 ` Borislav Petkov
2023-06-16 18:27                           ` [PATCH v6 3/4] x86/mce: Handle AMD threshold interrupt storms Tony Luck
2023-06-23 14:45                             ` Borislav Petkov
2023-06-23 15:54                               ` Yazen Ghannam
2023-06-16 18:27                           ` [PATCH v6 4/4] x86/mce: Handle Intel " Tony Luck
2023-07-18 21:08                           ` [PATCH v7 0/3] Handle corrected machine check " Tony Luck
2023-07-18 21:08                             ` [PATCH v7 1/3] x86/mce: Remove old CMCI storm mitigation code Tony Luck
2023-07-18 21:08                             ` [PATCH v7 2/3] x86/mce: Add per-bank CMCI storm mitigation Tony Luck
2023-09-19 17:44                               ` Yazen Ghannam
2023-09-20 15:56                               ` Yazen Ghannam
2023-09-20 16:09                                 ` Luck, Tony
2023-07-18 21:08                             ` [PATCH v7 3/3] x86/mce: Handle Intel threshold interrupt storms Tony Luck
2023-09-19 17:59                               ` Yazen Ghannam
2023-09-29 18:16                             ` [PATCH v8 0/3] Handle corrected machine check " Tony Luck
2023-09-29 18:16                               ` [PATCH v8 1/3] x86/mce: Remove old CMCI storm mitigation code Tony Luck
2023-09-29 18:16                               ` [PATCH v8 2/3] x86/mce: Add per-bank CMCI storm mitigation Tony Luck
2023-09-29 18:16                               ` [PATCH v8 3/3] x86/mce: Handle Intel threshold interrupt storms Tony Luck
2023-10-02 17:57                               ` [PATCH v8 0/3] Handle corrected machine check " Luck, Tony
2023-10-04 18:36                               ` Tony Luck [this message]
2023-10-04 18:36                                 ` [PATCH v9 1/3] x86/mce: Remove old CMCI storm mitigation code Tony Luck
2023-10-04 18:36                                 ` [PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation Tony Luck
2023-10-11  9:11                                   ` kernel test robot
2023-10-11 15:16                                     ` Luck, Tony
2023-10-11 15:42                                       ` Feng Tang
2023-10-11 17:23                                         ` Luck, Tony
2023-10-12  5:36                                           ` Feng Tang
2023-10-12  5:56                                             ` Feng Tang
2023-10-12  2:35                                         ` Philip Li
2023-10-19 15:12                                   ` Borislav Petkov
2023-10-23 18:14                                     ` Tony Luck
2023-11-14 19:23                                       ` Borislav Petkov
2023-11-14 22:04                                         ` Tony Luck
2023-11-21 11:54                                           ` Borislav Petkov
2023-11-27 19:50                                             ` Tony Luck
2023-11-27 20:14                                               ` Tony Luck
2023-11-28  0:42                                                 ` Tony Luck
2023-11-28 15:32                                                   ` Yazen Ghannam
2023-12-14 16:58                                                   ` Borislav Petkov
2023-12-14 18:03                                                     ` Luck, Tony
2023-10-04 18:36                                 ` [PATCH v9 3/3] x86/mce: Handle Intel threshold interrupt storms Tony Luck
2023-11-15 19:54                                 ` [PATCH v10 0/3] Handle corrected machine check " Tony Luck
2023-11-15 19:54                                   ` [PATCH v10 1/3] x86/mce: Remove old CMCI storm mitigation code Tony Luck
2023-11-15 19:54                                   ` [PATCH v10 2/3] x86/mce: Add per-bank CMCI storm mitigation Tony Luck
2023-11-15 19:54                                   ` [PATCH v10 3/3] x86/mce: Handle Intel threshold interrupt storms Tony Luck
2023-03-17 17:20           ` [PATCH v3 5/5] x86/mce: Handle AMD " Tony Luck
2022-04-06  6:35 ` [RFC PATCH " Smita Koralahalli
2022-04-06 22:44   ` Luck, Tony
2022-04-08  7:48     ` Koralahalli Channabasappa, Smita
2022-04-08 19:29       ` Luck, Tony

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20231004183623.17067-1-tony.luck@intel.com \
    --to=tony.luck@intel.com \
    --cc=Smita.KoralahalliChannabasappa@amd.com \
    --cc=bp@alien8.de \
    --cc=dave.hansen@linux.intel.com \
    --cc=linux-edac@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=patches@lists.linux.dev \
    --cc=x86@kernel.org \
    --cc=yazen.ghannam@amd.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).