* [RFC PATCH] x86/mce: Provide sysfs interface to show CMCI storm state
@ 2021-06-01 20:05 Tony Luck
  2021-06-01 20:35 ` Borislav Petkov
  0 siblings, 1 reply; 5+ messages in thread
From: Tony Luck @ 2021-06-01 20:05 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Tony Luck, Christopher BeSerra, x86, linux-edac, linux-kernel

Scripts that process error logs can do better if they know whether
Linux is executing in CMCI storm mode (only polling and reporting
some errors instead of trying to report them all). While it is possible
to parse the console log for:

	CMCI storm detected: switching to poll mode
	CMCI storm subsided: switching to interrupt mode

messages, that is error prone.

Add a new file to sysfs to report the current storm count.

Reported-by: Christopher BeSerra <beserra@amazon.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
---

RFC questions:
1) Is there a better way to do this?
2) Maybe just return 0 or 1 instead of the count?
3) Is there a cleaner way to handle the CONFIG_X86_MCE_INTEL dependency?

 arch/x86/kernel/cpu/mce/core.c  | 20 ++++++++++++++++++++
 arch/x86/kernel/cpu/mce/intel.c |  5 +++++
 2 files changed, 25 insertions(+)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index bf7fe87a7e88..4c4d6b1ec120 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -2431,6 +2431,20 @@ static ssize_t store_int_with_restart(struct device *s,
 	return ret;
 }
 
+#ifndef CONFIG_X86_MCE_INTEL
+static int cmci_storm_value(void)
+{
+	return 0;
+}
+#else
+int cmci_storm_value(void);
+#endif
+
+static ssize_t show_storm(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%d\n", cmci_storm_value());
+}
+
 static DEVICE_INT_ATTR(tolerant, 0644, mca_cfg.tolerant);
 static DEVICE_INT_ATTR(monarch_timeout, 0644, mca_cfg.monarch_timeout);
 static DEVICE_BOOL_ATTR(dont_log_ce, 0644, mca_cfg.dont_log_ce);
@@ -2451,6 +2465,11 @@ static struct dev_ext_attribute dev_attr_cmci_disabled = {
 	&mca_cfg.cmci_disabled
 };
 
+static struct dev_ext_attribute dev_attr_show_storm = {
+	__ATTR(show_storm, 0444, show_storm, NULL),
+	NULL
+};
+
 static struct device_attribute *mce_device_attrs[] = {
 	&dev_attr_tolerant.attr,
 	&dev_attr_check_interval.attr,
@@ -2462,6 +2481,7 @@ static struct device_attribute *mce_device_attrs[] = {
 	&dev_attr_print_all.attr,
 	&dev_attr_ignore_ce.attr,
 	&dev_attr_cmci_disabled.attr,
+	&dev_attr_show_storm.attr,
 	NULL
 };
 
diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
index acfd5d9f93c6..4edaa0608de3 100644
--- a/arch/x86/kernel/cpu/mce/intel.c
+++ b/arch/x86/kernel/cpu/mce/intel.c
@@ -73,6 +73,11 @@ enum {
 
 static atomic_t cmci_storm_on_cpus;
 
+int cmci_storm_value(void)
+{
+	return atomic_read(&cmci_storm_on_cpus);
+}
+
 static int cmci_supported(int *banks)
 {
 	u64 cap;
-- 
2.29.2
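
For illustration only, here is a minimal sketch of how a userspace consumer
might read the proposed file. It assumes the attribute shows up as
"show_storm" in each per-CPU machinecheck directory (for example
/sys/devices/system/machinecheck/machinecheck0/show_storm), which follows
from the __ATTR() name in this RFC but is not a merged interface. The helper
names parse_storm_count() and read_storm_count() are hypothetical.

```c
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Parse text in the "%d\n" format that the RFC's show_storm() emits.
 * Returns 0 and stores the value in *count on success, -1 on bad input.
 */
static int parse_storm_count(const char *text, int *count)
{
	char *end;
	long val;

	errno = 0;
	val = strtol(text, &end, 10);
	if (end == text || errno == ERANGE || val < INT_MIN || val > INT_MAX)
		return -1;
	/* Accept one trailing newline; reject any other trailing bytes. */
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -1;
	*count = (int)val;
	return 0;
}

/*
 * Read a storm count from 'path'. The path is a parameter because the
 * sysfs file name used in this sketch is an assumption, not a stable ABI.
 * Returns 0 on success, -1 if the file is missing or malformed.
 */
static int read_storm_count(const char *path, int *count)
{
	char buf[32];
	FILE *f = fopen(path, "r");
	int ret = -1;

	if (!f)
		return -1;
	if (fgets(buf, sizeof(buf), f))
		ret = parse_storm_count(buf, count);
	fclose(f);
	return ret;
}
```

With the patch as posted, a non-zero result would presumably mean that at
least one CPU is currently in storm (poll) mode, since the value comes from
the cmci_storm_on_cpus counter. A missing file is reported as an error so
that callers can fall back to some other method on kernels without it.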



* Re: [RFC PATCH] x86/mce: Provide sysfs interface to show CMCI storm state
  2021-06-01 20:05 [RFC PATCH] x86/mce: Provide sysfs interface to show CMCI storm state Tony Luck
@ 2021-06-01 20:35 ` Borislav Petkov
  2021-06-01 20:40   ` Luck, Tony
  0 siblings, 1 reply; 5+ messages in thread
From: Borislav Petkov @ 2021-06-01 20:35 UTC (permalink / raw)
  To: Tony Luck; +Cc: Christopher BeSerra, x86, linux-edac, linux-kernel

On Tue, Jun 01, 2021 at 01:05:05PM -0700, Tony Luck wrote:
> Scripts that process error logs can do better if they know whether
> Linux is executing in CMCI storm mode (only polling and reporting
> some errors instead of trying to report them all). While it is possible
> to parse the console log for:
> 
> 	CMCI storm detected: switching to poll mode
> 	CMCI storm subsided: switching to interrupt mode
> 
> messages, that is error prone.
> 
> Add a new file to sysfs to report the current storm count.
> 
> Reported-by: Christopher BeSerra <beserra@amazon.com>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> 
> RFC questions:
> 1) Is there a better way to do this?

Probably.

But I'm unclear as to what this whole use case is. The very first
"Scripts that process error logs" already sounds like a bad idea - I'd
expect userspace consumers to open the trace_mce_record() and get the
MCE records from there. And in that case CMCI storm shouldn't matter...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* RE: [RFC PATCH] x86/mce: Provide sysfs interface to show CMCI storm state
  2021-06-01 20:35 ` Borislav Petkov
@ 2021-06-01 20:40   ` Luck, Tony
  2021-06-03 22:48     ` BeSerra, Christopher
  0 siblings, 1 reply; 5+ messages in thread
From: Luck, Tony @ 2021-06-01 20:40 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Christopher BeSerra, x86, linux-edac, linux-kernel

> But I'm unclear as to what this whole use case is. The very first
> "Scripts that process error logs" already sounds like a bad idea - I'd
> expect userspace consumers to open the trace_mce_record() and get the
> MCE records from there. And in that case CMCI storm shouldn't matter...

I think the problem is knowing that many errors are being missed because
of the switch to poll mode. All methods to track errors, including the
trace_mce_record() technique, are equally affected by missed errors.

But maybe Chris can better describe what the problem is ...

-Tony


* Re: [RFC PATCH] x86/mce: Provide sysfs interface to show CMCI storm state
  2021-06-01 20:40   ` Luck, Tony
@ 2021-06-03 22:48     ` BeSerra, Christopher
  2021-06-04  9:16       ` Borislav Petkov
  0 siblings, 1 reply; 5+ messages in thread
From: BeSerra, Christopher @ 2021-06-03 22:48 UTC (permalink / raw)
  To: Luck, Tony, Borislav Petkov; +Cc: x86, linux-edac, linux-kernel

There are corner cases where the CE count is 0 when a storm occurs.  EDAC completely missed logging CEs.

On 6/1/21, 1:41 PM, "Luck, Tony" <tony.luck@intel.com> wrote:

    > But I'm unclear as to what this whole use case is. The very first
    > "Scripts that process error logs" already sounds like a bad idea - I'd
    > expect userspace consumers to open the trace_mce_record() and get the
    > MCE records from there. And in that case CMCI storm shouldn't matter...

    I think the problem is knowing that many errors are being missed because
    of the switch to poll mode. All methods to track errors, including the
    trace_mce_record() technique, are equally affected by missed errors.

    But maybe Chris can better describe what the problem is ...

    -Tony



* Re: [RFC PATCH] x86/mce: Provide sysfs interface to show CMCI storm state
  2021-06-03 22:48     ` BeSerra, Christopher
@ 2021-06-04  9:16       ` Borislav Petkov
  0 siblings, 0 replies; 5+ messages in thread
From: Borislav Petkov @ 2021-06-04  9:16 UTC (permalink / raw)
  To: BeSerra, Christopher; +Cc: Luck, Tony, x86, linux-edac, linux-kernel

On Thu, Jun 03, 2021 at 10:48:12PM +0000, BeSerra, Christopher wrote:
> There are corner cases where the CE count is 0 when a storm occurs.
> EDAC completely missed logging CEs.

-ENOPARSE.

I'm sorry but you'll have to try again and be a lot more specific and
detailed when describing your use case and what exactly you're trying to
achieve.

Oh, and btw, please do not top-post.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

