From: Borislav Petkov <bp@alien8.de>
To: Tony Luck <tony.luck@intel.com>
Cc: Yazen Ghannam <yazen.ghannam@amd.com>,
Smita.KoralahalliChannabasappa@amd.com,
dave.hansen@linux.intel.com, x86@kernel.org,
linux-edac@vger.kernel.org, linux-kernel@vger.kernel.org,
patches@lists.linux.dev
Subject: Re: [PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation
Date: Thu, 19 Oct 2023 17:12:11 +0200 [thread overview]
Message-ID: <20231019151211.GHZTFHS3osBIL1IJbF@fat_crate.local> (raw)
In-Reply-To: <20231004183623.17067-3-tony.luck@intel.com>
On Wed, Oct 04, 2023 at 11:36:22AM -0700, Tony Luck wrote:
> +/*
> + * history: bitmask tracking whether errors were seen or not seen in
> + * the most recent polls of a bank.
each bit in that bitmask represents an error seen.
> + * timestamp: last time (in jiffies) that the bank was polled
> + * storm: Is this bank in storm mode?
> + */
> +struct storm_bank {
> + u64 history;
> + u64 timestamp;
> + bool storm;
I guess "in_storm_mode" is even more descriptive:
storm->banks[bank].in_storm_mode = false;
etc.
> +};
> +
> +/* How many errors within the history buffer mark the start of a storm. */
> +#define STORM_BEGIN_THRESHOLD 5
> +
> +/*
> + * How many polls of machine check bank without an error before declaring
> + * the storm is over. Since it is tracked by the bitmaks in the history
> + * field of struct storm_bank the mask is 30 bits [0 ... 29].
> + */
> +#define STORM_END_POLL_THRESHOLD 29
>
> #ifdef CONFIG_ACPI_APEI
> int apei_write_mce(struct mce *m);
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index f6e87443b37a..7c931f0c9251 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -680,6 +680,8 @@ bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
> barrier();
> m.status = mce_rdmsrl(mca_msr_reg(i, MCA_STATUS));
>
> + mce_track_storm(&m);
Lemme see if I understand the idea here:
the default polling interval is 5 mins. So the storm tracking is called
every 5 mins once to see how are banks "doing", so to speak and to get
them in or out of storm mode. So far so good...
> +
> /* If this entry is not valid, ignore it */
> if (!(m.status & MCI_STATUS_VAL))
> continue;
Btw, you're tracking storm even if the error is not valid - conditional
above here. Why?
> @@ -1652,22 +1654,29 @@ static void mce_timer_fn(struct timer_list *t)
> else
> iv = min(iv * 2, round_jiffies_relative(check_interval * HZ));
>
> - __this_cpu_write(mce_next_interval, iv);
> - __start_timer(t, iv);
> + if (mce_is_storm()) {
> + __start_timer(t, HZ);
> + } else {
> + __this_cpu_write(mce_next_interval, iv);
> + __start_timer(t, iv);
> + }
... this is where it becomes, hm, interesting: the check interval will
be halved if an error has been seen during this round but then if we're
in storm mode, that check interval doesn't matter - you'll run the timer
each second.
Then you need to restructure this to check the storm condition and not
do anything to iv if storm.
Or, am I missing something?
> diff --git a/arch/x86/kernel/cpu/mce/threshold.c b/arch/x86/kernel/cpu/mce/threshold.c
> index ef4e7bb5fd88..ecdf13f1bb7d 100644
> --- a/arch/x86/kernel/cpu/mce/threshold.c
> +++ b/arch/x86/kernel/cpu/mce/threshold.c
> @@ -29,3 +29,115 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_threshold)
> trace_threshold_apic_exit(THRESHOLD_APIC_VECTOR);
> apic_eoi();
> }
> +
> +/*
> + * banks: per-cpu, per-bank details
> + * stormy_bank_count: count of MC banks in storm state
> + * poll_mode: CPU is in poll mode
> + */
> +struct mca_storm_desc {
> + struct storm_bank banks[MAX_NR_BANKS];
> + u8 stormy_bank_count;
> + bool poll_mode;
> +};
Yeah, put the struct definition into internal.h pls.
> +static DEFINE_PER_CPU(struct mca_storm_desc, storm_desc);
> +
> +void mce_inherit_storm(unsigned int bank)
> +{
> + struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc);
> +
> + storm->banks[bank].history = ~0ull;
So upon inheriting a bank, you set its history that it has seen errors
each time?
That's weird.
> + storm->banks[bank].timestamp = jiffies;
> +}
> +
> +bool mce_is_storm(void)
That's a weird name. mce_get_storm_mode() perhaps?
> +{
> + return __this_cpu_read(storm_desc.poll_mode);
> +}
> +
> +void mce_set_storm(bool storm)
mce_set_storm_mode()
> +{
> + __this_cpu_write(storm_desc.poll_mode, storm);
> +}
> +
> +static void mce_handle_storm(unsigned int bank, bool on)
> +{
> + switch (boot_cpu_data.x86_vendor) {
> + }
> +}
...
> +void mce_track_storm(struct mce *mce)
> +{
> + struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc);
> + unsigned long now = jiffies, delta;
> + unsigned int shift = 1;
> + u64 history = 0;
> +
> + /*
> + * When a bank is in storm mode it is polled once per second and
> + * the history mask will record about the last minute of poll results.
> + * If it is not in storm mode, then the bank is only checked when
> + * there is a CMCI interrupt. Check how long it has been since
> + * this bank was last checked, and adjust the amount of "shift"
> + * to apply to history.
> + */
> + if (!storm->banks[mce->bank].storm) {
Yeah, if this were
if (!storm->banks[mce->bank].in_storm_mode)
it would've been perfectly clear what the condition tests.
> + delta = now - storm->banks[mce->bank].timestamp;
> + shift = (delta + HZ) / HZ;
> + }
> +
> + /* If it has been a long time since the last poll, clear history. */
> + if (shift < 64)
Use a properly named define instead of a naked number.
...
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
next prev parent reply other threads:[~2023-10-19 15:12 UTC|newest]
Thread overview: 99+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-04-06 6:35 [RFC PATCH 0/5] Handle corrected machine check interrupt storms Smita Koralahalli
2022-04-06 6:35 ` [PATCH 1/5] x86/mce: Remove old CMCI storm mitigation code Smita Koralahalli
2022-04-06 6:35 ` [PATCH 2/5] x86/mce: Add per-bank CMCI storm mitigation Smita Koralahalli
2022-04-06 6:35 ` [RFC PATCH 3/5] x86/mce: Introduce a function pointer mce_handle_storm Smita Koralahalli
2022-04-06 22:38 ` Luck, Tony
2022-04-06 6:35 ` [RFC PATCH 4/5] x86/mce: Move storm handling to core Smita Koralahalli
2022-06-21 5:08 ` Luck, Tony
2022-06-27 17:36 ` [PATCH v2 0/5] Handle corrected machine check interrupt storms Tony Luck
2022-06-27 17:36 ` [PATCH v2 1/5] x86/mce: Remove old CMCI storm mitigation code Tony Luck
2022-06-27 17:36 ` [PATCH v2 2/5] x86/mce: Add per-bank CMCI storm mitigation Tony Luck
2022-06-27 17:36 ` [PATCH v2 3/5] x86/mce: Introduce mce_handle_storm() to deal with begin/end of storms Tony Luck
2022-06-27 17:36 ` [PATCH v2 4/5] x86/mce: Move storm handling to core Tony Luck
2022-06-27 17:36 ` [PATCH v2 5/5] x86/mce: Handle AMD threshold interrupt storms Tony Luck
2023-03-17 14:50 ` [PATCH v2 0/5] Handle corrected machine check " Yazen Ghannam
2023-03-17 17:20 ` [PATCH v3 " Tony Luck
2023-03-17 17:20 ` [PATCH v3 1/5] x86/mce: Remove old CMCI storm mitigation code Tony Luck
2023-03-17 17:20 ` [PATCH v3 2/5] x86/mce: Add per-bank CMCI storm mitigation Tony Luck
2023-03-17 17:20 ` [PATCH v3 3/5] x86/mce: Introduce mce_handle_storm() to deal with begin/end of storms Tony Luck
2023-03-23 15:22 ` Yazen Ghannam
2023-03-23 18:00 ` Tony Luck
2023-03-17 17:20 ` [PATCH v3 4/5] x86/mce: Move storm handling to core Tony Luck
2023-03-23 15:27 ` Yazen Ghannam
2023-03-23 18:10 ` Luck, Tony
2023-03-23 20:26 ` Luck, Tony
2023-03-24 20:44 ` Yazen Ghannam
2023-03-29 15:26 ` Yazen Ghannam
2023-04-03 19:03 ` Luck, Tony
2023-04-03 21:07 ` [PATCH v4 0/5] Handle corrected machine check interrupt storms Tony Luck
2023-04-03 21:07 ` [PATCH v4 1/5] x86/mce: Remove old CMCI storm mitigation code Tony Luck
2023-04-03 21:07 ` [PATCH v4 2/5] x86/mce: Add per-bank CMCI storm mitigation Tony Luck
2023-04-11 12:32 ` Borislav Petkov
2023-04-11 14:06 ` Yazen Ghannam
2023-04-11 16:06 ` Luck, Tony
2023-04-11 17:17 ` Borislav Petkov
2023-04-03 21:07 ` [PATCH v4 3/5] x86/mce: Introduce mce_handle_storm() to deal with begin/end of storms Tony Luck
2023-04-03 21:07 ` [PATCH v4 4/5] x86/mce: Move storm handling to core Tony Luck
2023-04-03 21:07 ` [PATCH v4 5/5] x86/mce: Handle AMD threshold interrupt storms Tony Luck
2023-04-11 17:38 ` [PATCH v5 0/5] Handle corrected machine check " Tony Luck
2023-04-11 17:38 ` [PATCH v5 1/5] x86/mce: Remove old CMCI storm mitigation code Tony Luck
2023-04-11 17:38 ` [PATCH v5 2/5] x86/mce: Add per-bank CMCI storm mitigation Tony Luck
2023-06-13 17:45 ` Borislav Petkov
2023-06-16 18:15 ` Tony Luck
2023-04-11 17:38 ` [PATCH v5 3/5] x86/mce: Introduce mce_handle_storm() to deal with begin/end of storms Tony Luck
2023-04-11 17:38 ` [PATCH v5 4/5] x86/mce: Move storm handling to core Tony Luck
2023-04-11 17:38 ` [PATCH v5 5/5] x86/mce: Handle AMD threshold interrupt storms Tony Luck
2023-06-16 18:27 ` [PATCH v6 0/4] Handle corrected machine check " Tony Luck
2023-06-16 18:27 ` [PATCH v6 1/4] x86/mce: Remove old CMCI storm mitigation code Tony Luck
2023-06-16 18:27 ` [PATCH v6 2/4] x86/mce: Add per-bank CMCI storm mitigation Tony Luck
2023-06-23 12:09 ` Borislav Petkov
2023-06-23 15:40 ` Luck, Tony
2023-07-17 8:58 ` Borislav Petkov
2023-06-16 18:27 ` [PATCH v6 3/4] x86/mce: Handle AMD threshold interrupt storms Tony Luck
2023-06-23 14:45 ` Borislav Petkov
2023-06-23 15:54 ` Yazen Ghannam
2023-06-16 18:27 ` [PATCH v6 4/4] x86/mce: Handle Intel " Tony Luck
2023-07-18 21:08 ` [PATCH v7 0/3] Handle corrected machine check " Tony Luck
2023-07-18 21:08 ` [PATCH v7 1/3] x86/mce: Remove old CMCI storm mitigation code Tony Luck
2023-07-18 21:08 ` [PATCH v7 2/3] x86/mce: Add per-bank CMCI storm mitigation Tony Luck
2023-09-19 17:44 ` Yazen Ghannam
2023-09-20 15:56 ` Yazen Ghannam
2023-09-20 16:09 ` Luck, Tony
2023-07-18 21:08 ` [PATCH v7 3/3] x86/mce: Handle Intel threshold interrupt storms Tony Luck
2023-09-19 17:59 ` Yazen Ghannam
2023-09-29 18:16 ` [PATCH v8 0/3] Handle corrected machine check " Tony Luck
2023-09-29 18:16 ` [PATCH v8 1/3] x86/mce: Remove old CMCI storm mitigation code Tony Luck
2023-09-29 18:16 ` [PATCH v8 2/3] x86/mce: Add per-bank CMCI storm mitigation Tony Luck
2023-09-29 18:16 ` [PATCH v8 3/3] x86/mce: Handle Intel threshold interrupt storms Tony Luck
2023-10-02 17:57 ` [PATCH v8 0/3] Handle corrected machine check " Luck, Tony
2023-10-04 18:36 ` [PATCH v9 " Tony Luck
2023-10-04 18:36 ` [PATCH v9 1/3] x86/mce: Remove old CMCI storm mitigation code Tony Luck
2023-10-04 18:36 ` [PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation Tony Luck
2023-10-11 9:11 ` kernel test robot
2023-10-11 15:16 ` Luck, Tony
2023-10-11 15:42 ` Feng Tang
2023-10-11 17:23 ` Luck, Tony
2023-10-12 5:36 ` Feng Tang
2023-10-12 5:56 ` Feng Tang
2023-10-12 2:35 ` Philip Li
2023-10-19 15:12 ` Borislav Petkov [this message]
2023-10-23 18:14 ` Tony Luck
2023-11-14 19:23 ` Borislav Petkov
2023-11-14 22:04 ` Tony Luck
2023-11-21 11:54 ` Borislav Petkov
2023-11-27 19:50 ` Tony Luck
2023-11-27 20:14 ` Tony Luck
2023-11-28 0:42 ` Tony Luck
2023-11-28 15:32 ` Yazen Ghannam
2023-12-14 16:58 ` Borislav Petkov
2023-12-14 18:03 ` Luck, Tony
2023-10-04 18:36 ` [PATCH v9 3/3] x86/mce: Handle Intel threshold interrupt storms Tony Luck
2023-11-15 19:54 ` [PATCH v10 0/3] Handle corrected machine check " Tony Luck
2023-11-15 19:54 ` [PATCH v10 1/3] x86/mce: Remove old CMCI storm mitigation code Tony Luck
2023-11-15 19:54 ` [PATCH v10 2/3] x86/mce: Add per-bank CMCI storm mitigation Tony Luck
2023-11-15 19:54 ` [PATCH v10 3/3] x86/mce: Handle Intel threshold interrupt storms Tony Luck
2023-03-17 17:20 ` [PATCH v3 5/5] x86/mce: Handle AMD " Tony Luck
2022-04-06 6:35 ` [RFC PATCH " Smita Koralahalli
2022-04-06 22:44 ` Luck, Tony
2022-04-08 7:48 ` Koralahalli Channabasappa, Smita
2022-04-08 19:29 ` Luck, Tony
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20231019151211.GHZTFHS3osBIL1IJbF@fat_crate.local \
--to=bp@alien8.de \
--cc=Smita.KoralahalliChannabasappa@amd.com \
--cc=dave.hansen@linux.intel.com \
--cc=linux-edac@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=patches@lists.linux.dev \
--cc=tony.luck@intel.com \
--cc=x86@kernel.org \
--cc=yazen.ghannam@amd.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).