* [PATCH] RAS/CEC: Add debugfs switch to disable at run time
@ 2019-04-18 22:02 Tony Luck
  2019-04-18 22:51 ` Cong Wang
  2019-04-20 19:50 ` Borislav Petkov
  0 siblings, 2 replies; 25+ messages in thread
From: Tony Luck @ 2019-04-18 22:02 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Tony Luck, linux-kernel

Useful when running error injection tests that want to
see all of the MCi_(STATUS|ADDR|MISC) data via /dev/mcelog.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 drivers/ras/cec.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/drivers/ras/cec.c b/drivers/ras/cec.c
index 2d9ec378a8bc..a2ceedcd8516 100644
--- a/drivers/ras/cec.c
+++ b/drivers/ras/cec.c
@@ -123,6 +123,9 @@ static u64 dfs_pfn;
 /* Amount of errors after which we offline */
 static unsigned int count_threshold = COUNT_MASK;
 
+/* debugfs switch to enable/disable CEC */
+static u64 cec_enabled = 1;
+
 /*
  * The timer "decays" element count each timer_interval which is 24hrs by
  * default.
@@ -400,6 +403,14 @@ static int count_threshold_set(void *data, u64 val)
 }
 DEFINE_DEBUGFS_ATTRIBUTE(count_threshold_ops, u64_get, count_threshold_set, "%lld\n");
 
+static int enable_set(void *data, u64 val)
+{
+	ce_arr.disabled = !val;
+	return 0;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(enable_ops, u64_get, enable_set, "%lld\n");
+
 static int array_dump(struct seq_file *m, void *v)
 {
 	struct ce_array *ca = &ce_arr;
@@ -451,7 +462,7 @@ static const struct file_operations array_ops = {
 
 static int __init create_debugfs_nodes(void)
 {
-	struct dentry *d, *pfn, *decay, *count, *array;
+	struct dentry *d, *pfn, *decay, *count, *array, *enable;
 
 	d = debugfs_create_dir("cec", ras_debugfs_dir);
 	if (!d) {
@@ -485,6 +496,13 @@ static int __init create_debugfs_nodes(void)
 		goto err;
 	}
 
+	enable = debugfs_create_file("enable", S_IRUSR | S_IWUSR, d,
+				    &cec_enabled, &enable_ops);
+	if (!enable) {
+		pr_warn("Error creating enable debugfs node!\n");
+		goto err;
+	}
+
 
 	return 0;
 
-- 
2.19.1



* Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time
  2019-04-18 22:02 [PATCH] RAS/CEC: Add debugfs switch to disable at run time Tony Luck
@ 2019-04-18 22:51 ` Cong Wang
  2019-04-18 23:29   ` Borislav Petkov
  2019-04-20 19:50 ` Borislav Petkov
  1 sibling, 1 reply; 25+ messages in thread
From: Cong Wang @ 2019-04-18 22:51 UTC (permalink / raw)
  To: Tony Luck; +Cc: Borislav Petkov, LKML

On Thu, Apr 18, 2019 at 3:02 PM Tony Luck <tony.luck@intel.com> wrote:
>
> Useful when running error injection tests that want to
> see all of the MCi_(STATUS|ADDR|MISC) data via /dev/mcelog.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>

We saw the same problem, CONFIG_RAS hijacks all the
correctable memory errors, which leaves mcelog "broken"
silently. I know it is arguable, but until we can switch from
mcelog to rasdaemon, mcelog should still work as before.

I like this idea of disabling it at runtime, so:

Acked-by: Cong Wang <xiyou.wangcong@gmail.com>

Thanks.


* Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time
  2019-04-18 22:51 ` Cong Wang
@ 2019-04-18 23:29   ` Borislav Petkov
  2019-04-18 23:58     ` Cong Wang
  2019-04-19  0:07     ` Luck, Tony
  0 siblings, 2 replies; 25+ messages in thread
From: Borislav Petkov @ 2019-04-18 23:29 UTC (permalink / raw)
  To: Cong Wang, Tony Luck; +Cc: LKML

On Thu, Apr 18, 2019 at 03:51:07PM -0700, Cong Wang wrote:
> On Thu, Apr 18, 2019 at 3:02 PM Tony Luck <tony.luck@intel.com> wrote:
> >
> > Useful when running error injection tests that want to
> > see all of the MCi_(STATUS|ADDR|MISC) data via /dev/mcelog.
> >
> > Signed-off-by: Tony Luck <tony.luck@intel.com>
> 
> We saw the same problem, CONFIG_RAS hijacks all the
> correctable memory errors, which leaves mcelog "broken"
> silently. I know it is arguable, but until we can switch from
> mcelog to rasdaemon, mcelog should still work as before.

It is "arguable" because this is not how the CEC is supposed to be used.

If you want to collect errors with mcelog, you don't use the CEC at all.
And there's ras=cec_disable for that or you simply don't enable it in
your .config.

As Tony says in the commit message, the enable switch should be used only
for injection tests. Which is the only thing it should be used for -
debugging the CEC itself.

Which reminds me, Tony, I think all those debugging files "pfn"
and "array" and the one you add now, should all be under a
CONFIG_RAS_CEC_DEBUG which is default off and used only for development.
Mind adding that too pls?

Thx.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


* Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time
  2019-04-18 23:29   ` Borislav Petkov
@ 2019-04-18 23:58     ` Cong Wang
  2019-04-19  0:26       ` Borislav Petkov
  2019-04-19  0:07     ` Luck, Tony
  1 sibling, 1 reply; 25+ messages in thread
From: Cong Wang @ 2019-04-18 23:58 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Tony Luck, LKML

On Thu, Apr 18, 2019 at 4:29 PM Borislav Petkov <bp@alien8.de> wrote:
>
> On Thu, Apr 18, 2019 at 03:51:07PM -0700, Cong Wang wrote:
> > On Thu, Apr 18, 2019 at 3:02 PM Tony Luck <tony.luck@intel.com> wrote:
> > >
> > > Useful when running error injection tests that want to
> > > see all of the MCi_(STATUS|ADDR|MISC) data via /dev/mcelog.
> > >
> > > Signed-off-by: Tony Luck <tony.luck@intel.com>
> >
> > We saw the same problem, CONFIG_RAS hijacks all the
> > correctable memory errors, which leaves mcelog "broken"
> > silently. I know it is arguable, but until we can switch from
> > mcelog to rasdaemon, mcelog should still work as before.
>
> It is "arguable" because this is not how the CEC is supposed to be used.

No, it is all about whether we should break users' expectation.

>
> If you want to collect errors with mcelog, you don't use the CEC at all.
> And there's ras=cec_disable for that or you simply don't enable it in
> your .config.
>
> As Tony says in the commit message, the enable should be used only for
> injection tests. Which is where that thing should only be used for -
> debugging the CEC itself.

This doesn't sound like a valid reason for us to break users'
expectations.

Prior to CONFIG_RAS, mcelog just works fine for users (at least Intel
users). Suddenly after enabling CONFIG_RAS in kernel, mcelog will
no longer receive any correctable memory errors _silently_. What's
more, we don't even have rasdaemon running in our system, so there is
no consumer of RAS CEC, these errors just simply disappear from
users' expected place.

I know CONFIG_RAS is a new feature supposed to replace MCELOG,
but they can co-exist in kernel config, which means mcelog should
continue to work as before until it gets fully replaced.

Even the following PoC change could make this situation better,
because with this change when we enable CONFIG_RAS, mcelog
will break _loudly_ rather than just silently, users will notice mcelog
is no longer supported and will look for its alternative choice.

diff --git a/drivers/ras/Kconfig b/drivers/ras/Kconfig
index b834ff555188..f2e2b75fffbe 100644
--- a/drivers/ras/Kconfig
+++ b/drivers/ras/Kconfig
@@ -1,5 +1,6 @@
 menuconfig RAS
        bool "Reliability, Availability and Serviceability (RAS) features"
+       depends on !X86_MCELOG_LEGACY
        help
          Reliability, availability and serviceability (RAS) is a computer
          hardware engineering term. Computers designed with higher levels


Just my 2 cents.

Thanks.


* Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time
  2019-04-18 23:29   ` Borislav Petkov
  2019-04-18 23:58     ` Cong Wang
@ 2019-04-19  0:07     ` Luck, Tony
  2019-04-19  0:29       ` Borislav Petkov
  2019-04-20  5:50       ` Cong Wang
  1 sibling, 2 replies; 25+ messages in thread
From: Luck, Tony @ 2019-04-19  0:07 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Cong Wang, LKML

On Fri, Apr 19, 2019 at 01:29:10AM +0200, Borislav Petkov wrote:
> Which reminds me, Tony, I think all those debugging files "pfn"
> and "array" and the one you add now, should all be under a
> CONFIG_RAS_CEC_DEBUG which is default off and used only for development.
> Mind adding that too pls?

Patch below, on top of previous patch.  Note that I didn't move "enable"
into the RAS_CEC_DEBUG code. I think it has some value even on
production systems.  It is still in debugfs (which many production
systems don't mount) so I don't see that people are going to be randomly
using it to disable the CEC.

-Tony


From ac9d8c9bf7b38e18dcffdd41f8fcf0f07c632cd3 Mon Sep 17 00:00:00 2001
From: Tony Luck <tony.luck@intel.com>
Date: Thu, 18 Apr 2019 16:46:55 -0700
Subject: [PATCH] RAS/CEC: Move CEC debug features under a CONFIG_RAS_CEC_DEBUG
 option

The pfn and array files in /sys/kernel/debug/ras/cec are intended
for debugging the CEC code itself. They are not needed on production
systems, so the default setting for this CONFIG option is "n".

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 arch/x86/ras/Kconfig | 10 ++++++++++
 drivers/ras/cec.c    | 13 ++++++++++++-
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/arch/x86/ras/Kconfig b/arch/x86/ras/Kconfig
index a9c3db125222..7fde8d55e394 100644
--- a/arch/x86/ras/Kconfig
+++ b/arch/x86/ras/Kconfig
@@ -11,3 +11,13 @@ config RAS_CEC
 
 	  Bear in mind that this is absolutely useless if your platform doesn't
 	  have ECC DIMMs and doesn't have DRAM ECC checking enabled in the BIOS.
+
+config RAS_CEC_DEBUG
+	bool "Debugging"
+	default n
+	depends on RAS_CEC
+	---help---
+	  Add extra files to /sys/kernel/debug/ras/cec to test the correctable
+	  memory error feature. "pfn" is a writable file that allows user to
+	  simulate an error in a particular page frame. "array" is a read-only
+	  file that dumps out the current state of all pages logged so far.
diff --git a/drivers/ras/cec.c b/drivers/ras/cec.c
index a2ceedcd8516..ff880a5c289a 100644
--- a/drivers/ras/cec.c
+++ b/drivers/ras/cec.c
@@ -118,7 +118,9 @@ static struct ce_array {
 } ce_arr;
 
 static DEFINE_MUTEX(ce_mutex);
+#ifdef CONFIG_RAS_CEC_DEBUG
 static u64 dfs_pfn;
+#endif
 
 /* Amount of errors after which we offline */
 static unsigned int count_threshold = COUNT_MASK;
@@ -364,6 +366,7 @@ static int u64_get(void *data, u64 *val)
 	return 0;
 }
 
+#ifdef CONFIG_RAS_CEC_DEBUG
 static int pfn_set(void *data, u64 val)
 {
 	*(u64 *)data = val;
@@ -372,6 +375,7 @@ static int pfn_set(void *data, u64 val)
 }
 
 DEFINE_DEBUGFS_ATTRIBUTE(pfn_ops, u64_get, pfn_set, "0x%llx\n");
+#endif
 
 static int decay_interval_set(void *data, u64 val)
 {
@@ -411,6 +415,7 @@ static int enable_set(void *data, u64 val)
 
 DEFINE_DEBUGFS_ATTRIBUTE(enable_ops, u64_get, enable_set, "%lld\n");
 
+#ifdef CONFIG_RAS_CEC_DEBUG
 static int array_dump(struct seq_file *m, void *v)
 {
 	struct ce_array *ca = &ce_arr;
@@ -459,10 +464,14 @@ static const struct file_operations array_ops = {
 	.llseek	 = seq_lseek,
 	.release = single_release,
 };
+#endif
 
 static int __init create_debugfs_nodes(void)
 {
-	struct dentry *d, *pfn, *decay, *count, *array, *enable;
+	struct dentry *d, *decay, *count, *enable;
+#ifdef CONFIG_RAS_CEC_DEBUG
+	struct dentry *pfn, *array;
+#endif
 
 	d = debugfs_create_dir("cec", ras_debugfs_dir);
 	if (!d) {
@@ -470,6 +479,7 @@ static int __init create_debugfs_nodes(void)
 		return -1;
 	}
 
+#ifdef CONFIG_RAS_CEC_DEBUG
 	pfn = debugfs_create_file("pfn", S_IRUSR | S_IWUSR, d, &dfs_pfn, &pfn_ops);
 	if (!pfn) {
 		pr_warn("Error creating pfn debugfs node!\n");
@@ -481,6 +491,7 @@ static int __init create_debugfs_nodes(void)
 		pr_warn("Error creating array debugfs node!\n");
 		goto err;
 	}
+#endif
 
 	decay = debugfs_create_file("decay_interval", S_IRUSR | S_IWUSR, d,
 				    &timer_interval, &decay_interval_ops);
-- 
2.19.1



* Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time
  2019-04-18 23:58     ` Cong Wang
@ 2019-04-19  0:26       ` Borislav Petkov
  2019-04-20  5:43         ` Cong Wang
  0 siblings, 1 reply; 25+ messages in thread
From: Borislav Petkov @ 2019-04-19  0:26 UTC (permalink / raw)
  To: Cong Wang; +Cc: Tony Luck, LKML

On Thu, Apr 18, 2019 at 04:58:22PM -0700, Cong Wang wrote:
> No, it is all about whether we should break users' expectation.

What user expectation?

> This doesn't sound like a valid reason for us to break users'
> expectations.

I think it is *you* who has some sort of "expectation" but that
"expectation" is wrong.

> Prior to CONFIG_RAS, mcelog just works fine for users (at least Intel
> users). Suddenly after enabling CONFIG_RAS in kernel, mcelog will
> no longer receive any correctable memory errors _silently_.

That is, of course, wrong too.

> What's more, we don't even have rasdaemon running in our system, so

Are you saying "we" to mean "we the users" or some company "we"?

And that is wrong too, there's at least one rasdaemon:

  http://git.infradead.org/users/mchehab/rasdaemon.git

> there is no consumer of RAS CEC,

RAS CEC doesn't need a consumer. You're misunderstanding the whole
concept of the error collector.

> these errors just simply disappear from users' expected place.

They "disappear" because you have CONFIG_RAS_CEC enabled. But they
don't really disappear - they're collected by the thing to filter out
only the pages which keep generating errors constantly and those get
soft-offlined.

The sporadic ones simply get ignored because they don't happen again
and are only result of alpha particles or overheating conditions or
whatever.
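
In (over)simplified terms, the idea is something like the sketch below.
This is not the actual drivers/ras/cec.c code - just the concept, with
cec_sketch_count() and cec_sketch_offline() as made-up placeholders for
the real array handling:

static bool cec_sketch_add(u64 pfn)
{
	/* Bump the per-PFN count kept in the collector's array. */
	unsigned int count = cec_sketch_count(pfn);	/* made-up helper */

	/* Only a repeat offender gets acted upon: soft-offline its page. */
	if (count >= count_threshold)
		cec_sketch_offline(pfn);		/* made-up helper */

	/*
	 * The error is consumed either way - sporadic single hits never
	 * make it to /dev/mcelog or the trace event.
	 */
	return true;
}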

Now here's the CEC help text:

config RAS_CEC
        bool "Correctable Errors Collector"
        depends on X86_MCE && MEMORY_FAILURE && DEBUG_FS
        ---help---
          This is a small cache which collects correctable memory errors per 4K
          page PFN and counts their repeated occurrence. Once the counter for a
          PFN overflows, we try to soft-offline that page as we take it to mean
          that it has reached a relatively high error count and would probably
          be best if we don't use it anymore.

          Bear in mind that this is absolutely useless if your platform doesn't
          have ECC DIMMs and doesn't have DRAM ECC checking enabled in the BIOS.

you can tell me what in that text is not clear so that I can make it
more clear and obvious what that thing is.

> I know CONFIG_RAS is a new feature supposed to replace MCELOG,

No, it isn't. CONFIG_RAS is supposed to collect all the RAS-related
functionality in the kernel and it looks like you have some
misconceptions about it.

> but they can co-exist in kernel config, which means mcelog should
> continue to work as before until it gets fully replaced.

For that you need to enable X86_MCELOG_LEGACY.

And let me repeat it again - if you want to collect errors in userspace,
do not enable RAS_CEC at all.

> Even the following PoC change could make this situation better,
> because with this change when we enable CONFIG_RAS, mcelog
> will break _loudly_ rather than just silently, users will notice mcelog
> is no longer supported and will look for its alternative choice.

You have somehow put in your head that CONFIG_RAS is the counterpart of
CONFIG_X86_MCELOG_LEGACY. Which is *simply* *not* *true*.

And the moment you realize that, then you'll be a step further in the
right direction.

So enable X86_MCELOG_LEGACY and you can collect all the errors you wish.
And there's a rasdaemon which you can use too, as I pointed above, if
you don't want mcelog.

CEC is something *completely* different and its purpose is to run in the
kernel and prevent users and admins from getting upset unnecessarily with
every sporadic correctable error and just because an alpha particle flew
through their DIMMs, they all start running in headless chicken mode,
trying to RMA perfectly good hardware.

Now, if any of that above still doesn't make it clear, please state what
you're trying to achieve and I'll try to help.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


* Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time
  2019-04-19  0:07     ` Luck, Tony
@ 2019-04-19  0:29       ` Borislav Petkov
  2019-04-19 15:04         ` Luck, Tony
  2019-04-20  5:50       ` Cong Wang
  1 sibling, 1 reply; 25+ messages in thread
From: Borislav Petkov @ 2019-04-19  0:29 UTC (permalink / raw)
  To: Luck, Tony; +Cc: Cong Wang, LKML

On Thu, Apr 18, 2019 at 05:07:45PM -0700, Luck, Tony wrote:
> On Fri, Apr 19, 2019 at 01:29:10AM +0200, Borislav Petkov wrote:
> > Which reminds me, Tony, I think all those debugging files "pfn"
> > and "array" and the one you add now, should all be under a
> > CONFIG_RAS_CEC_DEBUG which is default off and used only for development.
> > Mind adding that too pls?
> 
> Patch below, on top of previous patch.  Note that I didn't move "enable"
> into the RAS_CEC_DEBUG code. I think it has some value even on
> production systems.

And that value is? Usecase?

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


* Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time
  2019-04-19  0:29       ` Borislav Petkov
@ 2019-04-19 15:04         ` Luck, Tony
  2019-04-20  9:41           ` Borislav Petkov
  0 siblings, 1 reply; 25+ messages in thread
From: Luck, Tony @ 2019-04-19 15:04 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Cong Wang, LKML

On Fri, Apr 19, 2019 at 02:29:11AM +0200, Borislav Petkov wrote:
> On Thu, Apr 18, 2019 at 05:07:45PM -0700, Luck, Tony wrote:
> > On Fri, Apr 19, 2019 at 01:29:10AM +0200, Borislav Petkov wrote:
> > > Which reminds me, Tony, I think all those debugging files "pfn"
> > > and "array" and the one you add now, should all be under a
> > > CONFIG_RAS_CEC_DEBUG which is default off and used only for development.
> > > Mind adding that too pls?
> > 
> > Patch below, on top of previous patch.  Note that I didn't move "enable"
> > into the RAS_CEC_DEBUG code. I think it has some value even on
> > production systems.
> 
> And that value is? Usecase?

Suppose that an entire device on a DIMM fails. Systems with the
right type of DIMM (X4) and a memory controller that implements
https://en.wikipedia.org/wiki/Chipkill (Intel calls this "SDDC")
can continue running ... but there will be a lot of corrected
errors from a vast range of different pages.

After fifteen or so errors Linux will trigger storm mode and
the user will see:

 mce: CMCI storm detected: switching to poll mode

on the console.  As we poll we'll find errors and hand them
to CEC. But because the errors come from far more than 512
distinct pages CEC will never manage to get a count above 1
before it drops the entry to make space for a new log.

So the only indication the user sees that something is wrong is that
storm warning - and the lack of a following "storm subsided" message
tells them that errors are still happening.

This amounts to a serviceability failure ... lack of useful
diagnostics about a problem.

Now there isn't really anything better that CEC can do in
this situation. It won't help to have a bigger array. Taking
pages offline wouldn't solve the problem (though if that
did happen at least it would break the silence).

Same situation for other DRAM failure modes that affect a
wide range of pages (rank, bank, perhaps row ... though all
the errors from a single row failure might fit in the CEC array).

Allowing the user to bypass CEC (without a reboot ... cloud folks
hate to reboot their systems) would allow the sysadmin to see
what is happening (either via /dev/mcelog, or via EDAC driver).

-Tony


* Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time
  2019-04-19  0:26       ` Borislav Petkov
@ 2019-04-20  5:43         ` Cong Wang
  2019-04-20  9:13           ` Borislav Petkov
  0 siblings, 1 reply; 25+ messages in thread
From: Cong Wang @ 2019-04-20  5:43 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Tony Luck, LKML

On Thu, Apr 18, 2019 at 5:26 PM Borislav Petkov <bp@alien8.de> wrote:
>
> Now, if any of that above still doesn't make it clear, please state what
> you're trying to achieve and I'll try to help.

Sorry that I misled you to believe we don't even enable
CONFIG_X86_MCELOG_LEGACY. Here is what we have and
what we have tried:

1. We have CONFIG_X86_MCELOG_LEGACY=y

2. We also have CONFIG_RAS=y and CONFIG_RAS_CEC=y

3. mcelog started as a daemon successfully, like before

4. Some real correctable memory errors happened, as logged in
dmesg

5. mcelog couldn't receive any of them, reported 0 errors

6. Admin's complained to us as they believe this is a kernel bug

7. We dug into kernel source code and found out CONFIG_RAS
hijacks all these errors, by stopping there in the notification chain:

static int mce_first_notifier(struct notifier_block *nb, unsigned long val,
                              void *data)
{
        struct mce *m = (struct mce *)data;

        if (!m)
                return NOTIFY_DONE;

        if (cec_add_mce(m))
                return NOTIFY_STOP; // <=== Returns and stops here

        /* Emit the trace record: */
        trace_mce_record(m);

        set_bit(0, &mce_need_notify);

        mce_notify_irq(); // <=== There is where MCELOG receives

        return NOTIFY_DONE;
}


8. I noticed rasdaemon, and tried to start it instead of mcelog.

9. I injected some memory error and could successfully read them
via ras-mc-ctl.

To demonstrate what I think we should have, here is the PoC code
ONLY to show the idea (please don't judge it):


@@ -567,12 +567,12 @@ static int mce_first_notifier(struct
notifier_block *nb, unsigned long val,
                              void *data)
 {
        struct mce *m = (struct mce *)data;
+       bool consumed;

        if (!m)
                return NOTIFY_DONE;

-       if (cec_add_mce(m))
-               return NOTIFY_STOP;
+       consumed = cec_add_mce(m);

        /* Emit the trace record: */
        trace_mce_record(m);
@@ -581,7 +581,7 @@ static int mce_first_notifier(struct
notifier_block *nb, unsigned long val,

        mce_notify_irq();

-       return NOTIFY_DONE;
+       return consumed ? NOTIFY_STOP : NOTIFY_DONE;
 }

With this change, although not even compiled, mcelog should still
receive correctable memory errors like before, even when we have
CONFIG_RAS_CEC=y.

Does this make any sense to you?

Thanks!


* Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time
  2019-04-19  0:07     ` Luck, Tony
  2019-04-19  0:29       ` Borislav Petkov
@ 2019-04-20  5:50       ` Cong Wang
  1 sibling, 0 replies; 25+ messages in thread
From: Cong Wang @ 2019-04-20  5:50 UTC (permalink / raw)
  To: Luck, Tony; +Cc: Borislav Petkov, LKML

On Thu, Apr 18, 2019 at 5:07 PM Luck, Tony <tony.luck@intel.com> wrote:
>
> On Fri, Apr 19, 2019 at 01:29:10AM +0200, Borislav Petkov wrote:
> > Which reminds me, Tony, I think all those debugging files "pfn"
> > and "array" and the one you add now, should all be under a
> > CONFIG_RAS_CEC_DEBUG which is default off and used only for development.
> > Mind adding that too pls?
>
> Patch below, on top of previous patch.  Note that I didn't move "enable"
> into the RAS_CEC_DEBUG code. I think it has some value even on
> production systems.  It is still in debugfs (which many production
> systems don't mount) so I don't see that people are going to be randomly
> using it to disable the CEC.

For me, "enable" is useful to make mcelog work like before. Please see the
other email from me for all the details.

For debugfs, I believe many production systems mount it, as tracing still uses debugfs
rather than tracefs (at least on CentOS7), and rasdaemon also uses trace
events to collect different errors.

Therefore, I believe your patch is fine as it is.

Thanks.


* Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time
  2019-04-20  5:43         ` Cong Wang
@ 2019-04-20  9:13           ` Borislav Petkov
  2019-04-20 18:18             ` Cong Wang
  0 siblings, 1 reply; 25+ messages in thread
From: Borislav Petkov @ 2019-04-20  9:13 UTC (permalink / raw)
  To: Cong Wang; +Cc: Tony Luck, LKML

On Fri, Apr 19, 2019 at 10:43:03PM -0700, Cong Wang wrote:
> With this change, although not even compiled, mcelog should still
> receive correctable memory errors like before, even when we have
> CONFIG_RAS_CEC=y.
> 
> Does this make any sense to you?

Yes, the answer is in the mail you snipped. Did you read it?

Hint: disable CONFIG_RAS_CEC.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


* Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time
  2019-04-19 15:04         ` Luck, Tony
@ 2019-04-20  9:41           ` Borislav Petkov
  2019-04-22 15:59             ` Luck, Tony
  0 siblings, 1 reply; 25+ messages in thread
From: Borislav Petkov @ 2019-04-20  9:41 UTC (permalink / raw)
  To: Luck, Tony; +Cc: Cong Wang, LKML

On Fri, Apr 19, 2019 at 08:04:01AM -0700, Luck, Tony wrote:
> Now there isn't really anything better that CEC can do in
> this situation. It won't help to have a bigger array. Taking
> pages offline wouldn't solve the problem (though if that
> did happen at least it would break the silence).
> 
> Same situation for other DRAM failure modes that affect a
> wide range of pages (rank, bank, perhaps row ... though all
> the errors from a single row failure might fit in the CEC array).
> 
> Allowing the user to bypass CEC (without a reboot ... cloud folks
> hate to reboot their systems) would allow the sysadmin to see
> what is happening (either via /dev/mcelog, or via EDAC driver).

Err, this all sounds to me like the storm detection code should
*automatically* disable the CEC in such cases, I'd say. Because I
don't see a cloud admin going into the debugfs and turning it off.
Rather, if the detection heuristic we use is smart enough, disabling it
automatically should be a lot better serviceability action.

Hmmm?

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


* Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time
  2019-04-20  9:13           ` Borislav Petkov
@ 2019-04-20 18:18             ` Cong Wang
  2019-04-20 18:47               ` Borislav Petkov
  0 siblings, 1 reply; 25+ messages in thread
From: Cong Wang @ 2019-04-20 18:18 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Tony Luck, LKML

On Sat, Apr 20, 2019 at 2:13 AM Borislav Petkov <bp@alien8.de> wrote:
>
> On Fri, Apr 19, 2019 at 10:43:03PM -0700, Cong Wang wrote:
> > With this change, although not even compiled, mcelog should still
> > receive correctable memory errors like before, even when we have
> > CONFIG_RAS_CEC=y.
> >
> > Does this make any sense to you?
>
> Yes, the answer is in the mail you snipped. Did you read it?

I read it, all of your response is based on your speculation that I
don't have CONFIG_X86_MCELOG_LEGACY=y, which is clearly
a misunderstanding.

You didn't answer my question here, because I asked you whether
the following change (PoC only) makes sense:

@@ -567,12 +567,12 @@ static int mce_first_notifier(struct
notifier_block *nb, unsigned long val,
                              void *data)
 {
        struct mce *m = (struct mce *)data;
+       bool consumed;

        if (!m)
                return NOTIFY_DONE;

-       if (cec_add_mce(m))
-               return NOTIFY_STOP;
+       consumed = cec_add_mce(m);

        /* Emit the trace record: */
        trace_mce_record(m);
@@ -581,7 +581,7 @@ static int mce_first_notifier(struct
notifier_block *nb, unsigned long val,

        mce_notify_irq();

-       return NOTIFY_DONE;
+       return consumed ? NOTIFY_STOP : NOTIFY_DONE;
 }


>
> Hint: disable CONFIG_RAS_CEC.

I knew disabling it could cure the problem from the beginning, please
save your own time by not repeating things we both already knew. :)

Once again, I still don't think it is the right answer, which is also why I
keep finding different solutions. I know you disagree, but you never explain
why you disagree; you speculated about CONFIG_X86_MCELOG_LEGACY,
which is completely a misunderstanding.

I brought up CONFIG_X86_MCELOG_LEGACY simply to show you
how we could break mcelog _LOUDLY_ if we really decide to break it;
currently it just breaks silently. You misinterpreted it as if I see
CONFIG_RAS as a replacement for CONFIG_X86_MCELOG_LEGACY,
which is a very sad misunderstanding.

Thanks.


* Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time
  2019-04-20 18:18             ` Cong Wang
@ 2019-04-20 18:47               ` Borislav Petkov
  2019-04-20 19:08                 ` Cong Wang
  2019-04-22 16:29                 ` Luck, Tony
  0 siblings, 2 replies; 25+ messages in thread
From: Borislav Petkov @ 2019-04-20 18:47 UTC (permalink / raw)
  To: Cong Wang; +Cc: Tony Luck, LKML

On Sat, Apr 20, 2019 at 11:18:46AM -0700, Cong Wang wrote:
> You didn't answer my question here, because I asked you whether
> the following change (PoC only) makes sense:

I answered it - the answer is to disable CONFIG_RAS_CEC. But let me do a
more detailed answer, maybe that'll help.

The PoC doesn't make sense.

Why?

Because if you don't return early from the notifier when the CEC has
consumed the error, you don't need the CEC at all. Ergo, you can just as
well disable it.

Because, let me paste from a couple of mails ago what the CEC is:

"CEC is something *completely* different and its purpose is to run in
the kernel and prevent users and admins from getting upset unnecessarily
with every sporadic correctable error and just because an alpha particle
flew through their DIMMs, they all start running in headless chicken
mode, trying to RMA perfectly good hardware."

IOW, when you have the CEC enabled, you don't need to log memory errors
with a userspace agent. The CEC collects them and discards them if they
don't repeat.

If they do repeat, then it offlines the page.

Without user intervention and interference.

Now, if you still want to know how many errors and where they happened
and when they happened and yadda yadda, you *disable* the CEC.

I hope this makes more sense now.

> I knew disabling it could cure the problem from the beginning, please
> save your own time by not repeating things we both already knew. :)
> 
> Once again, I still don't think it is the right answer, which is also why I
> keep finding different solutions.

This is where you come in and say "it is not the right answer
because..." and give your arguments why. I gave mine a couple of times
already. I never said this functionality is cast in stone the way it is
but there has to be a *good* *reason* why it needs to be changed. I.e.,
basic kernel deveopment. People come with ideas and they *justify* those
ideas with arguments why they're better.

> I know you disagree, but you never explain why you disagree,

You're kidding, right?

https://lkml.kernel.org/r/20190419002645.GA559@zn.tnic

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


* Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time
  2019-04-20 18:47               ` Borislav Petkov
@ 2019-04-20 19:08                 ` Cong Wang
  2019-04-22 16:29                 ` Luck, Tony
  1 sibling, 0 replies; 25+ messages in thread
From: Cong Wang @ 2019-04-20 19:08 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Tony Luck, LKML

On Sat, Apr 20, 2019 at 11:47 AM Borislav Petkov <bp@alien8.de> wrote:
> IOW, when you have the CEC enabled, you don't need to log memory errors
> with a userspace agent. The CEC collects them and discards them if they
> don't repeat.

So, you mean breaking mcelog is intentional? If so, why not break it
loudly?

That is, for example, preventing mcelog from starting by disabling
CONFIG_X86_MCELOG_LEGACY in Kconfig _automatically_ when
CONFIG_RAS is enabled? (Like what I showed in my PoC change.)

Or, for another example, print a kernel warning and let users know this
behavior is intentional?


>
> If they do repeat, then it offlines the page.
>
> Without user intervention and interference.
>
> Now, if you still want to know how many errors and where they happened
> and when they happened and yadda yadda, you *disable* the CEC.

Well, I believe rasdaemon has the counters too; it is not hard to count
the trace events at all. I don't worry about this. What I worry about is
how we treat mcelog when we have CONFIG_RAS=y.

>
> I hope this makes more sense now.

Yes, thanks for the information. It is kinda what I expected. As I keep
saying, I believe we can improve this situation to avoid users' confusion,
rather than just saying CONFIG_RAS=n is the answer.

Thanks.


* Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time
  2019-04-18 22:02 [PATCH] RAS/CEC: Add debugfs switch to disable at run time Tony Luck
  2019-04-18 22:51 ` Cong Wang
@ 2019-04-20 19:50 ` Borislav Petkov
  1 sibling, 0 replies; 25+ messages in thread
From: Borislav Petkov @ 2019-04-20 19:50 UTC (permalink / raw)
  To: Tony Luck; +Cc: linux-kernel

On Thu, Apr 18, 2019 at 03:02:29PM -0700, Tony Luck wrote:
> Useful when running error injection tests that want to
> see all of the MCi_(STATUS|ADDR|MISC) data via /dev/mcelog.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  drivers/ras/cec.c | 20 +++++++++++++++++++-
>  1 file changed, 19 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/ras/cec.c b/drivers/ras/cec.c
> index 2d9ec378a8bc..a2ceedcd8516 100644
> --- a/drivers/ras/cec.c
> +++ b/drivers/ras/cec.c
> @@ -123,6 +123,9 @@ static u64 dfs_pfn;
>  /* Amount of errors after which we offline */
>  static unsigned int count_threshold = COUNT_MASK;
>  
> +/* debugfs switch to enable/disable CEC */
> +static u64 cec_enabled = 1;
> +
>  /*
>   * The timer "decays" element count each timer_interval which is 24hrs by
>   * default.
> @@ -400,6 +403,14 @@ static int count_threshold_set(void *data, u64 val)
>  }
>  DEFINE_DEBUGFS_ATTRIBUTE(count_threshold_ops, u64_get, count_threshold_set, "%lld\n");
>  
> +static int enable_set(void *data, u64 val)
> +{
> +	ce_arr.disabled = !val;

Btw, regardless of what we end up doing wrt hiding this switch under
RAS_CEC_DEBUG or exposing it, that ce_arr.disabled flag is not
needed anymore and you can remove it along with the flags union in
ce_array with the next version of this. Because one of the cec_enabled
or ce_arr.disabled is now redundant and I think you want to have
cec_enabled. :)

Thx.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


* RE: [PATCH] RAS/CEC: Add debugfs switch to disable at run time
  2019-04-20  9:41           ` Borislav Petkov
@ 2019-04-22 15:59             ` Luck, Tony
  2019-04-22 17:15               ` Borislav Petkov
  0 siblings, 1 reply; 25+ messages in thread
From: Luck, Tony @ 2019-04-22 15:59 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Cong Wang, LKML

> Err, this all sounds to me like the storm detection code should
> *automatically* disable the CEC in such cases, I'd say.

Sounds good. But we should distinguish storms that have many different
addresses from storms that just ping a few addresses.  CEC will see counts
hit the threshold in the latter case, but it might not be able to take the pages
offline (because they are locked, or in-use by kernel).

So I think the change might be to the return value from NOTIFY_STOP to NOTIFY_DONE
... but only if we are in the middle of a storm AND the CEC array is full.
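
Roughly along these lines in mce_first_notifier() - an untested sketch,
where mce_in_storm() and cec_is_full() are helpers that would still have
to be written:

	if (cec_add_mce(m)) {
		/*
		 * Keep consuming the error as today, unless a CMCI storm
		 * is active and the CEC array is already full - in that
		 * case fall through so /dev/mcelog and EDAC still see it.
		 */
		if (!(mce_in_storm() && cec_is_full()))
			return NOTIFY_STOP;
	}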

-Tony



* RE: [PATCH] RAS/CEC: Add debugfs switch to disable at run time
  2019-04-20 18:47               ` Borislav Petkov
  2019-04-20 19:08                 ` Cong Wang
@ 2019-04-22 16:29                 ` Luck, Tony
  2019-04-22 16:31                   ` Borislav Petkov
  1 sibling, 1 reply; 25+ messages in thread
From: Luck, Tony @ 2019-04-22 16:29 UTC (permalink / raw)
  To: Borislav Petkov, Cong Wang; +Cc: LKML

> Now, if you still want to know how many errors and where they happened
> and when they happened and yadda yadda, you *disable* the CEC.

Rebooting isn't popular in many end user situations. Many CSP (cloud
service providers) vehemently hate the idea of rebooting.

-Tony


* Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time
  2019-04-22 16:29                 ` Luck, Tony
@ 2019-04-22 16:31                   ` Borislav Petkov
  2019-04-22 16:43                     ` Luck, Tony
  0 siblings, 1 reply; 25+ messages in thread
From: Borislav Petkov @ 2019-04-22 16:31 UTC (permalink / raw)
  To: Luck, Tony; +Cc: Cong Wang, LKML

On Mon, Apr 22, 2019 at 04:29:35PM +0000, Luck, Tony wrote:
> > Now, if you still want to know how many errors and where they happened
> > and when they happened and yadda yadda, you *disable* the CEC.
> 
> Rebooting isn't popular in many end user situations. Many CSP (cloud
> service providers) vehemently hate the idea of rebooting.

I meant disable in Kconfig - not build it in at all.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


* RE: [PATCH] RAS/CEC: Add debugfs switch to disable at run time
  2019-04-22 16:31                   ` Borislav Petkov
@ 2019-04-22 16:43                     ` Luck, Tony
  2019-04-22 17:05                       ` Borislav Petkov
  0 siblings, 1 reply; 25+ messages in thread
From: Luck, Tony @ 2019-04-22 16:43 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Cong Wang, LKML

>> Rebooting isn't popular in many end user situations. Many CSP (cloud
>> service providers) vehemently hate the idea of rebooting.
>
> I meant disable in Kconfig - not build it in at all.

If rebooting is bad, then re-compiling and rebooting is 100x worse. :-)

-Tony


* Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time
  2019-04-22 16:43                     ` Luck, Tony
@ 2019-04-22 17:05                       ` Borislav Petkov
  2019-04-22 17:23                         ` Luck, Tony
  0 siblings, 1 reply; 25+ messages in thread
From: Borislav Petkov @ 2019-04-22 17:05 UTC (permalink / raw)
  To: Luck, Tony; +Cc: Cong Wang, LKML

On Mon, Apr 22, 2019 at 04:43:58PM +0000, Luck, Tony wrote:
> >> Rebooting isn't popular in many end user situations. Many CSP (cloud
> >> service providers) vehemently hate the idea of rebooting.
> >
> > I meant disable in Kconfig - not build it in at all.
> 
> If rebooting is bad, then re-compiling and rebooting is 100x worse. :-)

I think we're talking past each other here: I mean disable the CEC
*forever* and *never* use it. Use only a userspace agent and log errors
with it.

Makes sense?

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


* Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time
  2019-04-22 15:59             ` Luck, Tony
@ 2019-04-22 17:15               ` Borislav Petkov
  2019-04-22 17:44                 ` Luck, Tony
  0 siblings, 1 reply; 25+ messages in thread
From: Borislav Petkov @ 2019-04-22 17:15 UTC (permalink / raw)
  To: Luck, Tony; +Cc: Cong Wang, LKML

On Mon, Apr 22, 2019 at 03:59:16PM +0000, Luck, Tony wrote:
> > Err, this all sounds to me like the storm detection code should
> > *automatically* disable the CEC in such cases, I'd say.
> 
> Sounds good. But we should distinguish storms that have many different
> addresses from storms that just ping a few addresses.  CEC will see counts
> hit the threshold in the latter case, but it might not be able to take the pages
> offline (because they are locked, or in-use by kernel).
> 
> So I think the change might be to the return value from NOTIFY_STOP to NOTIFY_DONE
> ... but only if we are in the middle of a storm AND the CEC array is full.

Well, regardless of this specific use case, isn't that a generic enough
action that we should do always? I mean, the aspect of falling back to
logging to external agent.

However, currently we don't signal that the CEC is full - we simply
remove the LRU element in cec_add_elem() before we insert the new one.

We can either return a specific retval to say, CEC is full and we had to
delete an elem or we can add a cec_is_full() accessor...
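
Something like this minimal sketch, say - assuming ce_arr.n and MAX_ELEMS
keep their current meaning in drivers/ras/cec.c:

bool cec_is_full(void)
{
	/* Full means every slot of the fixed-size array is occupied. */
	return ce_arr.n == MAX_ELEMS;
}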

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


* RE: [PATCH] RAS/CEC: Add debugfs switch to disable at run time
  2019-04-22 17:05                       ` Borislav Petkov
@ 2019-04-22 17:23                         ` Luck, Tony
  0 siblings, 0 replies; 25+ messages in thread
From: Luck, Tony @ 2019-04-22 17:23 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Cong Wang, LKML

> I think we're talking past each other here: I mean disable the CEC
> *forever* and *never* use it. Use only a userspace agent and log errors
> with it.
>
> Makes sense?

Not really. We want pretty much everyone to enable and use CEC. That way
people don't bother use about the occasional neutron strike flipping a bit.

So we need to make sure there are no excuses for disabling it. People
may come up with other cases where they say they would need to disable
(perhaps genuine, perhaps bogus) ... if our only answers are reboot, or
recompile and reboot ... then they'll disable forever.

If they do that, why do we even have it in the kernel?

-Tony


* Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time
  2019-04-22 17:15               ` Borislav Petkov
@ 2019-04-22 17:44                 ` Luck, Tony
  2019-04-22 18:08                   ` Borislav Petkov
  0 siblings, 1 reply; 25+ messages in thread
From: Luck, Tony @ 2019-04-22 17:44 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Cong Wang, LKML

On Mon, Apr 22, 2019 at 07:15:32PM +0200, Borislav Petkov wrote:
> On Mon, Apr 22, 2019 at 03:59:16PM +0000, Luck, Tony wrote:
> > > Err, this all sounds to me like the storm detection code should
> > > *automatically* disable the CEC in such cases, I'd say.
> > 
> > Sounds good. But we should distinguish storms that have many different
> > addresses from storms that just ping a few addresses.  CEC will see counts
> > hit the threshold in the latter case, but it might not be able to take the pages
> > offline (because they are locked, or in-use by kernel).
> > 
> > So I think the change might be to the return value from NOTIFY_STOP to NOTIFY_DONE
> > ... but only if we are in the middle of a storm AND the CEC array is full.
> 
> Well, regardless of this specific use case, isn't that a generic enough
> action that we should do always? I mean, the aspect of falling back to
> logging to external agent.

Yes. Automating this would be a very good idea.

> However, currently we don't signal that the CEC is full - we simply
> remove the LRU element in cec_add_elem() before we insert the new one.
> 
> We can either return a specific retval to say, CEC is full and we had to
> delete an elem or we can add a cec_is_full() accessor...

A lot depends on why the CEC is full, and which entry is being
deleted to make room.

In the case of many errors at different addresses we are deleting
the entry with the lowest count. But all of the entries have low
counts because we are just thrashing the array with many different
addresses. In this situation a warning would be helpful.

But in the case where the system has been up for months and
we very slowly accumulated logs of bit flips, the periodic
spring cleaning means they all have generation "00", but
we never actually drop an old entry because of age. In this
case dropping one entry to make space for a new one is fine
and doesn't need any action.

Perhaps we can distinguish the cases by the generation? If
we are dropping an entry that was recently added, then it
will still have generation "11" (or at least not "00").
Use that to trigger an action?
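
E.g. something like this in the path that drops the LRU element - only a
sketch, assuming the DECAY() accessor keeps its current meaning, with
"dropped" standing for whatever the to-be-deleted element ends up being
called:

	/*
	 * A non-zero generation on the element we are about to drop means
	 * we are thrashing the array with fresh errors rather than aging
	 * out an old entry - a good point to warn and/or fall back to the
	 * external logging agent.
	 */
	if (DECAY(dropped))
		pr_warn("CEC: dropping a recently added element, array thrashing\n");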

-Tony


* Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time
  2019-04-22 17:44                 ` Luck, Tony
@ 2019-04-22 18:08                   ` Borislav Petkov
  0 siblings, 0 replies; 25+ messages in thread
From: Borislav Petkov @ 2019-04-22 18:08 UTC (permalink / raw)
  To: Luck, Tony; +Cc: Cong Wang, LKML

On Mon, Apr 22, 2019 at 10:44:15AM -0700, Luck, Tony wrote:
> Yes. Automating this would be a very good idea.

Yeah, in general integrating the CEC better with the rest of the error
chain is something we still need to discuss and do.

> In the case of many errors at different addresses we are deleting
> the entry with the lowest count. But all of the entries have low
> counts because we are just thrashing the array with many different
> addresses. In this situation a warning would be helpful.

Can we detect that situation reliably even? You can have many errors at
different addresses which have accumulated over time, due to a slow but
constant stream of errors. Dunno if that is possible though... someone
needs to analyze error occurrence patterns :-\

> But in the case where the system has been up for months and
> we very slowly accumulated logs of bit flips, the periodic
> spring cleaning means they all have generation "00", but
> we never actually drop an old entry because of age.

Yes, we drop only on insertion and when the array is full or when we
soft-offline.

> In this case dropping one entry to make space for a new one is fine
> and doesn't need any action.
>
> Perhaps we can distinguish the cases by the generation? If
> we are dropping an entry that was recently added, then it
> will still have generation "11" (or at least not "00").
> Use that to trigger an action?

That and the fact that we're in an error storm is probably a good enough
heuristic. And then when the storm subsides, we reenable it? We basically
say, error storm is over, the error rate should go back to normal so we
can stick the CEC in front of it again.

Hmmm.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

