From: Borislav Petkov <bp@alien8.de>
To: Tony Luck <tony.luck@intel.com>
Cc: Yazen Ghannam <yazen.ghannam@amd.com>,
Smita.KoralahalliChannabasappa@amd.com,
dave.hansen@linux.intel.com, x86@kernel.org,
linux-edac@vger.kernel.org, linux-kernel@vger.kernel.org,
patches@lists.linux.dev
Subject: Re: [PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation
Date: Tue, 21 Nov 2023 12:54:48 +0100
Message-ID: <20231121115448.GCZVyaiNkNvb4t2NxB@fat_crate.local>
In-Reply-To: <ZVPu/hX9b7lUkrBY@agluck-desk3>
On Tue, Nov 14, 2023 at 02:04:46PM -0800, Tony Luck wrote:
> Before any storm happens, machine_check_poll() may only be called once
> a month, or less, when errors occur.
Err:
[ 317.825546] mce: mc_poll_banks_default: CPU2 irq ctxt level: 1
[ 317.825585] mce: mc_poll_banks_default: CPU0 irq ctxt level: 1
[ 317.825585] mce: mc_poll_banks_default: CPU1 irq ctxt level: 1
[ 317.825586] mce: mc_poll_banks_default: CPU3 irq ctxt level: 1
[ 317.825586] mce: mc_poll_banks_default: CPU4 irq ctxt level: 1
[ 317.825586] mce: mc_poll_banks_default: CPU5 irq ctxt level: 1
[ 629.121536] mce: mc_poll_banks_default: CPU1 irq ctxt level: 1
[ 629.121536] mce: mc_poll_banks_default: CPU4 irq ctxt level: 1
[ 629.121560] mce: mc_poll_banks_default: CPU2 irq ctxt level: 1
[ 629.121561] mce: mc_poll_banks_default: CPU0 irq ctxt level: 1
[ 629.121561] mce: mc_poll_banks_default: CPU5 irq ctxt level: 1
[ 629.121569] mce: mc_poll_banks_default: CPU3 irq ctxt level: 1
[ 940.417507] mce: mc_poll_banks_default: CPU2 irq ctxt level: 1
[ 940.417508] mce: mc_poll_banks_default: CPU3 irq ctxt level: 1
[ 940.417508] mce: mc_poll_banks_default: CPU1 irq ctxt level: 1
[ 940.417508] mce: mc_poll_banks_default: CPU4 irq ctxt level: 1
[ 940.417509] mce: mc_poll_banks_default: CPU5 irq ctxt level: 1
[ 940.417508] mce: mc_poll_banks_default: CPU0 irq ctxt level: 1
...
That's from my coffeelake test box.
The irq context level thing says we're in softirq context when the
polling happens.
> When a storm is detected for a bank, that bank (and any others in storm
> mode) will be checked once per second.
Ok.
> For a bank that doesn't support CMCI, then polling is the only way
> to find errors. You are right, these will feed into the history
> tracker, but while at 5-minute interval will not be able to trigger
> a storm.
Yes. But you need to call into the storm handling code somehow. So you
do that from the polling code.
And if the machine supports CMCI, you do the same - call the polling
code which then does the storm check.
> Since that 5-minute interval is halved every time an error is
> found consecutively, it is possible at the 1-second poll interval to
> fill out enough bits to indicate a storm. I think I need to add some
> code to handle that case as it makes no sense to mess with the CMCI
> threshold in IA32_MCi_CTL2 for a bank that doesn't support CMCI.
> Probably will just skip tracking for any such banks.
Ok.
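
To make the interval-halving argument above concrete, here is a rough
userspace sketch, not the code from the patch. It assumes a 64-bit
per-bank history with one bit per elapsed second, a storm threshold of
5 set bits, and a 300 s to 1 s interval range; every name and constant
in it (bank_track, track_poll(), STORM_BEGIN_BITS, ...) is made up for
illustration.

```c
#include <stdint.h>

/* Illustrative assumptions only; none of these names come from the patch. */
#define HISTORY_BITS     64
#define STORM_BEGIN_BITS 5    /* assumed number of error bits that means "storm" */
#define MAX_INTERVAL_S   300  /* 5-minute default poll interval */
#define MIN_INTERVAL_S   1

struct bank_track {
	uint64_t history;   /* one bit per second, bit 0 = most recent poll */
	unsigned interval;  /* current poll interval in seconds */
};

static int popcount64(uint64_t v)
{
	int n = 0;

	for (; v; v &= v - 1)
		n++;
	return n;
}

/* Halve the interval after a poll that found an error, double it otherwise. */
static unsigned next_interval(unsigned cur, int error_seen)
{
	if (error_seen)
		return cur / 2 > MIN_INTERVAL_S ? cur / 2 : MIN_INTERVAL_S;
	return cur * 2 < MAX_INTERVAL_S ? cur * 2 : MAX_INTERVAL_S;
}

/*
 * Record one poll that ran elapsed_s seconds after the previous one.
 * Returns 1 if the history now holds enough error bits to count as a storm.
 */
static int track_poll(struct bank_track *b, unsigned elapsed_s, int error_seen)
{
	/* Age the history; anything older than the window is forgotten. */
	b->history = elapsed_s < HISTORY_BITS ? b->history << elapsed_s : 0;
	if (error_seen)
		b->history |= 1;
	b->interval = next_interval(b->interval, error_seen);
	return popcount64(b->history) >= STORM_BEGIN_BITS;
}

/* Count polls until a storm is flagged when every single poll finds an error. */
static int polls_until_storm(void)
{
	struct bank_track b = { .history = 0, .interval = MAX_INTERVAL_S };
	int polls = 0;

	for (;;) {
		unsigned elapsed = b.interval;

		polls++;
		if (track_poll(&b, elapsed, 1))
			return polls;
	}
}
```

Under those assumptions, polls spaced 300 s apart wipe the history each
time, so they can never accumulate enough bits. Only once the interval
has collapsed below the width of the history window do consecutive
errors start to add up, which is the case the quoted text wants to
handle for banks without CMCI.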
> Aren't interrupts disabled while running the code after the timer fires?
No, see above.
> Whichever of the timer and the CMCI happens first will run. Second to
> arrive will pend the interrupt and be handled when interrupts are
> enabled as the first completes.
So I still don't like the timer calling machine_check_poll() and
cmci_mc_poll_banks() doing the same without any proper synchronization
between the two.
Yes, when you get a CMCI interrupt, you poll and call the storm
code. Now what happens if the polling runs from softirq context and you
get a CMCI interrupt at exactly the same time? I.e., is
machine_check_poll() reentrant and audited properly?
I hope I'm making more sense.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette