All of lore.kernel.org
 help / color / mirror / Atom feed
From: Borislav Petkov <bp@alien8.de>
To: Tony Luck <tony.luck@intel.com>
Cc: Yazen Ghannam <yazen.ghannam@amd.com>,
	Smita.KoralahalliChannabasappa@amd.com,
	dave.hansen@linux.intel.com, x86@kernel.org,
	linux-edac@vger.kernel.org, linux-kernel@vger.kernel.org,
	patches@lists.linux.dev
Subject: Re: [PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation
Date: Tue, 14 Nov 2023 20:23:24 +0100	[thread overview]
Message-ID: <20231114192324.GAZVPJLGZmfJBS181/@fat_crate.local> (raw)
In-Reply-To: <ZTa37L2nlnbok8dz@agluck-desk3>

On Mon, Oct 23, 2023 at 11:14:04AM -0700, Tony Luck wrote:
> I want to track whether each bank is in storm mode, or not. But there's
> no indication when a CMCI is delivered which bank was the source. Code
> just has to scan all the banks, and might find more than one with an
> error. While no bank is in polling mode, there isn't a set time interval
> between scanning banks.

Well, if no bank is in *storm* polling mode - I presume you mean that
when you say "polling mode" - then we have the default polling interval
of machine_check_poll() of INITIAL_CHECK_INTERVAL, i.e., 5 mins, I'd
say.

> A scan is just triggered when a CMCI happens. So it's non-trivial to
> compute a rate. Would require saving a timestamp for every logged
> error.

So what I'm trying to establish first is, what our entry vectors into
the storm code are.

1. You can enter storm tracking when you poll normally. I.e.,
   machine_check_poll() each 5 mins once.

   mce_track_storm() tracks history only for MCI_STATUS_VAL=1b CEs, which
   is what I was wondering. It is hidden a bit down in that function.

2. As a result of a CMCI interrupt. That will call machine_check_poll()
   too and go down the same path.

So that flow should be called out in the commit message so that the big
picture is clear - how we're doing that storm tracking.

Now, what is protecting this against concurrent runs of
machine_check_poll()? Imagine the timer fires and a CMCI happens at the
same time and on the same core.

> In a simple case there's just one bank responsible for a ton of CMCI.
> No need for complexity here, the count of interrupts from that bank will
> hit a threshold and a storm is declared.

Right.

> But more complex scenarois are possible. Other banks may trigger small
> numbers of CMCI. Not enough to call it a storm.  Or multiple banks may
> be screaming together.
> 
> By tracking both the hits and misses in each bank, I end up with a
> bitmap history for the past 64 polls. If there are enough "1" bits in
> that bitmap to meet the threshold, then declare a storm for that
> bank.

Yap, I only want to be crystal clear on the flow and the entry points.

> I need to stare at this again to refresh my memory of what's going on
> here. This code may need pulling apart into a routine that is used for
> systems with no CMCI (or have CMCI disabled). Then the whole "divide the
> poll interval by two" when you see an error and double the interval
> when you don't see an error makes sense.
> 
> For systems with CMCI ... I think just polling a one second interval
> until the storm is over makes sense.

Ok.

> These are only used in threshold.c now. What's the point of them
> being in internal.h. That's for defintiones shared by multiple
> mcs/*.c files. Isn't it? But will move there if you still want this.

Structs hidden in .c files looks weird but ok.

> Ideally the new CPU would inherit the precise state of the previous
> owner of this bank. But there's no easy way to track that as the bank
> is abanoned by the CPU going offline, and there is a free-for-all with
> remaining CPUs racing to claim ownership. It is known that this bank
> was in storm mode (because the threshold in the CTL2 bank register is
> set to CMCI_STORM_THRESHOLD).
> 
> I went with "worst case" to make sure the new CPU didn't prematurely
> declare an end to the storm.
> 
> I'll add a comment in mce_inherit_storm() to explain this.

Yap, exactly.

> Like this?
> 
> #define NUM_HISTORY_BITS (sizeof(u64) * BITS_PER_BYTE)
> 
> 	if (shift < NUM_HISTORY_BITS)

Yap.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

  reply	other threads:[~2023-11-14 19:23 UTC|newest]

Thread overview: 105+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-04-06  6:35 [RFC PATCH 0/5] Handle corrected machine check interrupt storms Smita Koralahalli
2022-04-06  6:35 ` [PATCH 1/5] x86/mce: Remove old CMCI storm mitigation code Smita Koralahalli
2022-04-06  6:35 ` [PATCH 2/5] x86/mce: Add per-bank CMCI storm mitigation Smita Koralahalli
2022-04-06  6:35 ` [RFC PATCH 3/5] x86/mce: Introduce a function pointer mce_handle_storm Smita Koralahalli
2022-04-06 22:38   ` Luck, Tony
2022-04-06  6:35 ` [RFC PATCH 4/5] x86/mce: Move storm handling to core Smita Koralahalli
2022-06-21  5:08   ` Luck, Tony
2022-06-27 17:36     ` [PATCH v2 0/5] Handle corrected machine check interrupt storms Tony Luck
2022-06-27 17:36       ` [PATCH v2 1/5] x86/mce: Remove old CMCI storm mitigation code Tony Luck
2022-06-27 17:36       ` [PATCH v2 2/5] x86/mce: Add per-bank CMCI storm mitigation Tony Luck
2022-06-27 17:36       ` [PATCH v2 3/5] x86/mce: Introduce mce_handle_storm() to deal with begin/end of storms Tony Luck
2022-06-27 17:36       ` [PATCH v2 4/5] x86/mce: Move storm handling to core Tony Luck
2022-06-27 17:36       ` [PATCH v2 5/5] x86/mce: Handle AMD threshold interrupt storms Tony Luck
2023-03-17 14:50       ` [PATCH v2 0/5] Handle corrected machine check " Yazen Ghannam
2023-03-17 17:20         ` [PATCH v3 " Tony Luck
2023-03-17 17:20           ` [PATCH v3 1/5] x86/mce: Remove old CMCI storm mitigation code Tony Luck
2023-03-17 17:20           ` [PATCH v3 2/5] x86/mce: Add per-bank CMCI storm mitigation Tony Luck
2023-03-17 17:20           ` [PATCH v3 3/5] x86/mce: Introduce mce_handle_storm() to deal with begin/end of storms Tony Luck
2023-03-23 15:22             ` Yazen Ghannam
2023-03-23 18:00               ` Tony Luck
2023-03-17 17:20           ` [PATCH v3 4/5] x86/mce: Move storm handling to core Tony Luck
2023-03-23 15:27             ` Yazen Ghannam
2023-03-23 18:10               ` Luck, Tony
2023-03-23 20:26                 ` Luck, Tony
2023-03-24 20:44                   ` Yazen Ghannam
2023-03-29 15:26                   ` Yazen Ghannam
2023-04-03 19:03                     ` Luck, Tony
2023-04-03 21:07                     ` [PATCH v4 0/5] Handle corrected machine check interrupt storms Tony Luck
2023-04-03 21:07                       ` [PATCH v4 1/5] x86/mce: Remove old CMCI storm mitigation code Tony Luck
2023-04-03 21:07                       ` [PATCH v4 2/5] x86/mce: Add per-bank CMCI storm mitigation Tony Luck
2023-04-11 12:32                         ` Borislav Petkov
2023-04-11 14:06                           ` Yazen Ghannam
2023-04-11 16:06                             ` Luck, Tony
2023-04-11 17:17                               ` Borislav Petkov
2023-04-03 21:07                       ` [PATCH v4 3/5] x86/mce: Introduce mce_handle_storm() to deal with begin/end of storms Tony Luck
2023-04-03 21:07                       ` [PATCH v4 4/5] x86/mce: Move storm handling to core Tony Luck
2023-04-03 21:07                       ` [PATCH v4 5/5] x86/mce: Handle AMD threshold interrupt storms Tony Luck
2023-04-11 17:38                       ` [PATCH v5 0/5] Handle corrected machine check " Tony Luck
2023-04-11 17:38                         ` [PATCH v5 1/5] x86/mce: Remove old CMCI storm mitigation code Tony Luck
2023-04-11 17:38                         ` [PATCH v5 2/5] x86/mce: Add per-bank CMCI storm mitigation Tony Luck
2023-06-13 17:45                           ` Borislav Petkov
2023-06-16 18:15                             ` Tony Luck
2023-04-11 17:38                         ` [PATCH v5 3/5] x86/mce: Introduce mce_handle_storm() to deal with begin/end of storms Tony Luck
2023-04-11 17:38                         ` [PATCH v5 4/5] x86/mce: Move storm handling to core Tony Luck
2023-04-11 17:38                         ` [PATCH v5 5/5] x86/mce: Handle AMD threshold interrupt storms Tony Luck
2023-06-16 18:27                         ` [PATCH v6 0/4] Handle corrected machine check " Tony Luck
2023-06-16 18:27                           ` [PATCH v6 1/4] x86/mce: Remove old CMCI storm mitigation code Tony Luck
2023-06-16 18:27                           ` [PATCH v6 2/4] x86/mce: Add per-bank CMCI storm mitigation Tony Luck
2023-06-23 12:09                             ` Borislav Petkov
2023-06-23 15:40                               ` Luck, Tony
2023-07-17  8:58                                 ` Borislav Petkov
2023-06-16 18:27                           ` [PATCH v6 3/4] x86/mce: Handle AMD threshold interrupt storms Tony Luck
2023-06-23 14:45                             ` Borislav Petkov
2023-06-23 15:54                               ` Yazen Ghannam
2023-06-16 18:27                           ` [PATCH v6 4/4] x86/mce: Handle Intel " Tony Luck
2023-07-18 21:08                           ` [PATCH v7 0/3] Handle corrected machine check " Tony Luck
2023-07-18 21:08                             ` [PATCH v7 1/3] x86/mce: Remove old CMCI storm mitigation code Tony Luck
2023-07-18 21:08                             ` [PATCH v7 2/3] x86/mce: Add per-bank CMCI storm mitigation Tony Luck
2023-09-19 17:44                               ` Yazen Ghannam
2023-09-20 15:56                               ` Yazen Ghannam
2023-09-20 16:09                                 ` Luck, Tony
2023-07-18 21:08                             ` [PATCH v7 3/3] x86/mce: Handle Intel threshold interrupt storms Tony Luck
2023-09-19 17:59                               ` Yazen Ghannam
2023-09-29 18:16                             ` [PATCH v8 0/3] Handle corrected machine check " Tony Luck
2023-09-29 18:16                               ` [PATCH v8 1/3] x86/mce: Remove old CMCI storm mitigation code Tony Luck
2023-09-29 18:16                               ` [PATCH v8 2/3] x86/mce: Add per-bank CMCI storm mitigation Tony Luck
2023-09-30  6:02                                 ` kernel test robot
2023-09-29 18:16                               ` [PATCH v8 3/3] x86/mce: Handle Intel threshold interrupt storms Tony Luck
2023-10-02 17:57                               ` [PATCH v8 0/3] Handle corrected machine check " Luck, Tony
2023-10-04 18:36                               ` [PATCH v9 " Tony Luck
2023-10-04 18:36                                 ` [PATCH v9 1/3] x86/mce: Remove old CMCI storm mitigation code Tony Luck
2023-10-04 18:36                                 ` [PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation Tony Luck
2023-10-11  9:11                                   ` kernel test robot
2023-10-11 15:16                                     ` Luck, Tony
2023-10-11 15:42                                       ` Feng Tang
2023-10-11 17:23                                         ` Luck, Tony
2023-10-12  5:36                                           ` Feng Tang
2023-10-12  5:56                                             ` Feng Tang
2023-10-12  2:35                                         ` Philip Li
2023-10-19 15:12                                   ` Borislav Petkov
2023-10-23 18:14                                     ` Tony Luck
2023-11-14 19:23                                       ` Borislav Petkov [this message]
2023-11-14 22:04                                         ` Tony Luck
2023-11-21 11:54                                           ` Borislav Petkov
2023-11-27 19:50                                             ` Tony Luck
2023-11-27 20:14                                               ` Tony Luck
2023-11-28  0:42                                                 ` Tony Luck
2023-11-28 15:32                                                   ` Yazen Ghannam
2023-12-14 16:58                                                   ` Borislav Petkov
2023-12-14 18:03                                                     ` Luck, Tony
2023-10-04 18:36                                 ` [PATCH v9 3/3] x86/mce: Handle Intel threshold interrupt storms Tony Luck
2023-11-15 19:54                                 ` [PATCH v10 0/3] Handle corrected machine check " Tony Luck
2023-11-15 19:54                                   ` [PATCH v10 1/3] x86/mce: Remove old CMCI storm mitigation code Tony Luck
2023-12-15 16:10                                     ` [tip: ras/core] " tip-bot2 for Tony Luck
2023-11-15 19:54                                   ` [PATCH v10 2/3] x86/mce: Add per-bank CMCI storm mitigation Tony Luck
2023-12-15 16:10                                     ` [tip: ras/core] " tip-bot2 for Tony Luck
2023-11-15 19:54                                   ` [PATCH v10 3/3] x86/mce: Handle Intel threshold interrupt storms Tony Luck
2023-12-15 16:10                                     ` [tip: ras/core] " tip-bot2 for Tony Luck
2023-12-15 17:21                                       ` Luck, Tony
2023-12-15 17:37                                         ` Borislav Petkov
2023-03-17 17:20           ` [PATCH v3 5/5] x86/mce: Handle AMD " Tony Luck
2022-04-06  6:35 ` [RFC PATCH " Smita Koralahalli
2022-04-06 22:44   ` Luck, Tony
2022-04-08  7:48     ` Koralahalli Channabasappa, Smita
2022-04-08 19:29       ` Luck, Tony

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20231114192324.GAZVPJLGZmfJBS181/@fat_crate.local \
    --to=bp@alien8.de \
    --cc=Smita.KoralahalliChannabasappa@amd.com \
    --cc=dave.hansen@linux.intel.com \
    --cc=linux-edac@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=patches@lists.linux.dev \
    --cc=tony.luck@intel.com \
    --cc=x86@kernel.org \
    --cc=yazen.ghannam@amd.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.