linux-edac.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] EDAC: update edac printk wrappers to use printk_ratelimited.
@ 2021-05-05 17:30 Lei Wang
  2021-05-05 18:01 ` Borislav Petkov
  0 siblings, 1 reply; 14+ messages in thread
From: Lei Wang @ 2021-05-05 17:30 UTC (permalink / raw)
  To: bp, mchehab, tony.luck, james.morse, rric
  Cc: linux-edac, linux-kernel, lewan, hangl, tyhicks, bwaller, Lei Wang

Update printk to the ratelimited version, so that in some corner cases when
vast of CE errors show up, the kernel logging can be suppressed.

Signed-off-by: Lei Wang <wangglei@gmail.com>
---
 drivers/edac/edac_mc.h | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/edac/edac_mc.h b/drivers/edac/edac_mc.h
index 881b00eadf7a..355529317980 100644
--- a/drivers/edac/edac_mc.h
+++ b/drivers/edac/edac_mc.h
@@ -46,19 +46,19 @@
 #endif
 
 #define edac_printk(level, prefix, fmt, arg...) \
-	printk(level "EDAC " prefix ": " fmt, ##arg)
+	printk_ratelimited(level "EDAC " prefix ": " fmt, ##arg)
 
 #define edac_mc_printk(mci, level, fmt, arg...) \
-	printk(level "EDAC MC%d: " fmt, mci->mc_idx, ##arg)
+	printk_ratelimited(level "EDAC MC%d: " fmt, mci->mc_idx, ##arg)
 
 #define edac_mc_chipset_printk(mci, level, prefix, fmt, arg...) \
-	printk(level "EDAC " prefix " MC%d: " fmt, mci->mc_idx, ##arg)
+	printk_ratelimited(level "EDAC " prefix " MC%d: " fmt, mci->mc_idx, ##arg)
 
 #define edac_device_printk(ctl, level, fmt, arg...) \
-	printk(level "EDAC DEVICE%d: " fmt, ctl->dev_idx, ##arg)
+	printk_ratelimited(level "EDAC DEVICE%d: " fmt, ctl->dev_idx, ##arg)
 
 #define edac_pci_printk(ctl, level, fmt, arg...) \
-	printk(level "EDAC PCI%d: " fmt, ctl->pci_idx, ##arg)
+	printk_ratelimited(level "EDAC PCI%d: " fmt, ctl->pci_idx, ##arg)
 
 /* prefixes for edac_printk() and edac_mc_printk() */
 #define EDAC_MC "MC"
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] EDAC: update edac printk wrappers to use printk_ratelimited.
  2021-05-05 17:30 [PATCH] EDAC: update edac printk wrappers to use printk_ratelimited Lei Wang
@ 2021-05-05 18:01 ` Borislav Petkov
  2021-05-05 19:02   ` [EXTERNAL] " Lei Wang (DPLAT)
  0 siblings, 1 reply; 14+ messages in thread
From: Borislav Petkov @ 2021-05-05 18:01 UTC (permalink / raw)
  To: Lei Wang
  Cc: mchehab, tony.luck, james.morse, rric, linux-edac, linux-kernel,
	lewan, hangl, tyhicks, bwaller

On Wed, May 05, 2021 at 10:30:27AM -0700, Lei Wang wrote:
> Update printk to the ratelimited version, so that in some corner cases when
> vast of CE errors show up, the kernel logging can be suppressed.

Err, why?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [EXTERNAL] Re: [PATCH] EDAC: update edac printk wrappers to use printk_ratelimited.
  2021-05-05 18:01 ` Borislav Petkov
@ 2021-05-05 19:02   ` Lei Wang (DPLAT)
  2021-05-05 19:45     ` Borislav Petkov
  0 siblings, 1 reply; 14+ messages in thread
From: Lei Wang (DPLAT) @ 2021-05-05 19:02 UTC (permalink / raw)
  To: Borislav Petkov, wangglei
  Cc: mchehab, tony.luck, james.morse, rric, linux-edac, linux-kernel,
	Hang Li, tyhicks, Brandon Waller

Hi Boris,

We found a corner case in production environment that there are ~500 CE errors per second. The SoC otherwise functions just fine. Making printk ratelimited reduced CE error logging to < 20 per second. Though this is just one case so far, we think moving to printk_ratelimited could benefit broader use as well, by helping control the amount of kernel logging. In most running condition, the error rate is way below the limit. And in an error case like this one, vast error logging would not provide much valuable details, rather it's storming the kernel logging.

Thanks,
-Lei

-----Original Message-----
From: Borislav Petkov <bp@alien8.de> 
Sent: Wednesday, May 5, 2021 11:01 AM
To: wangglei <wangglei@gmail.com>
Cc: mchehab@kernel.org; tony.luck@intel.com; james.morse@arm.com; rric@kernel.org; linux-edac@vger.kernel.org; linux-kernel@vger.kernel.org; Lei Wang (DPLAT) <Wang.Lei@microsoft.com>; Hang Li <hangl@microsoft.com>; tyhicks@linux.microsoft.com; Brandon Waller <bwaller@microsoft.com>
Subject: [EXTERNAL] Re: [PATCH] EDAC: update edac printk wrappers to use printk_ratelimited.

On Wed, May 05, 2021 at 10:30:27AM -0700, Lei Wang wrote:
> Update printk to the ratelimited version, so that in some corner cases 
> when vast of CE errors show up, the kernel logging can be suppressed.

Err, why?

--
Regards/Gruss,
    Boris.

https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpeople.kernel.org%2Ftglx%2Fnotes-about-netiquette&amp;data=04%7C01%7CWang.Lei%40microsoft.com%7C71421584bc2a43951df908d90fefc1b9%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637558344708605379%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=fJG0%2Fdk8VCVNIGS0kM2BZDHAXLVcq4CLHEajhND0rzg%3D&amp;reserved=0

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [EXTERNAL] Re: [PATCH] EDAC: update edac printk wrappers to use printk_ratelimited.
  2021-05-05 19:02   ` [EXTERNAL] " Lei Wang (DPLAT)
@ 2021-05-05 19:45     ` Borislav Petkov
  2021-05-05 20:23       ` Tyler Hicks
  0 siblings, 1 reply; 14+ messages in thread
From: Borislav Petkov @ 2021-05-05 19:45 UTC (permalink / raw)
  To: Lei Wang (DPLAT)
  Cc: wangglei, mchehab, tony.luck, james.morse, rric, linux-edac,
	linux-kernel, Hang Li, tyhicks, Brandon Waller

Hi Lei,

On Wed, May 05, 2021 at 07:02:14PM +0000, Lei Wang (DPLAT) wrote:
> Hi Boris,

first of all, please do not top-post.

> We found a corner case in production environment that there are ~500
> CE errors per second. The SoC otherwise functions just fine. Making
> printk ratelimited reduced CE error logging to < 20 per second.

If you want to avoid CE logs flooding dmesg, there's a couple of things
you can do:

1. Use drivers/ras/cec.c

2. Do not load EDAC drivers at all since you don't care about the error
reports, apparently.

3. Fix the CE source: replace the DIMMs, etc.

> Though this is just one case so far, we think moving to
> printk_ratelimited could benefit broader use as well, by helping
> control the amount of kernel logging.

No, this will make EDAC driver loading output incomplete when some of
the messages are omitted due to the ratelimiting. And no, this is not
going to happen.

HTH.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [EXTERNAL] Re: [PATCH] EDAC: update edac printk wrappers to use printk_ratelimited.
  2021-05-05 19:45     ` Borislav Petkov
@ 2021-05-05 20:23       ` Tyler Hicks
  2021-05-05 21:04         ` Borislav Petkov
  0 siblings, 1 reply; 14+ messages in thread
From: Tyler Hicks @ 2021-05-05 20:23 UTC (permalink / raw)
  To: Borislav Petkov, wangglei
  Cc: Lei Wang (DPLAT),
	mchehab, tony.luck, james.morse, rric, linux-edac, linux-kernel,
	Hang Li, Brandon Waller

On 2021-05-05 21:45:01, Borislav Petkov wrote:
> Hi Lei,
> 
> On Wed, May 05, 2021 at 07:02:14PM +0000, Lei Wang (DPLAT) wrote:
> > Hi Boris,
> 
> first of all, please do not top-post.
> 
> > We found a corner case in production environment that there are ~500
> > CE errors per second. The SoC otherwise functions just fine. Making
> > printk ratelimited reduced CE error logging to < 20 per second.
> 
> If you want to avoid CE logs flooding dmesg, there's a couple of things
> you can do:
> 
> 1. Use drivers/ras/cec.c
> 
> 2. Do not load EDAC drivers at all since you don't care about the error
> reports, apparently.

Lei, if you don't care about the CE error messages at all, there's
also an edac_mc_log_ce module parameter that can be used to quiet the
message emitted from edac_ce_error():

 https://www.kernel.org/doc/html/latest/admin-guide/ras.html#module-parameters

> 3. Fix the CE source: replace the DIMMs, etc.
> 
> > Though this is just one case so far, we think moving to
> > printk_ratelimited could benefit broader use as well, by helping
> > control the amount of kernel logging.
> 
> No, this will make EDAC driver loading output incomplete when some of
> the messages are omitted due to the ratelimiting. And no, this is not
> going to happen.

Boris, I agree that a more surgical approach is needed than this if Lei
still needs some traces of the CE error messages in the logs. Would it
be any more acceptable to add an edac_mc_printk_ratelimited() macro,
which uses printk_ratelimited(), and then call that new macro from
edac_ce_error()?

If you still don't want those CE errors ratelimited by default, perhaps
a new, non-default mode (2) could be added to the edac_mc_log_ce module
parameter that uses the ratelimited variant?

Tyler

> 
> HTH.
> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [EXTERNAL] Re: [PATCH] EDAC: update edac printk wrappers to use printk_ratelimited.
  2021-05-05 20:23       ` Tyler Hicks
@ 2021-05-05 21:04         ` Borislav Petkov
  2021-05-05 21:48           ` Tyler Hicks
  0 siblings, 1 reply; 14+ messages in thread
From: Borislav Petkov @ 2021-05-05 21:04 UTC (permalink / raw)
  To: Tyler Hicks
  Cc: wangglei, Lei Wang (DPLAT),
	mchehab, tony.luck, james.morse, rric, linux-edac, linux-kernel,
	Hang Li, Brandon Waller

On Wed, May 05, 2021 at 03:23:57PM -0500, Tyler Hicks wrote:
>  Would it be any more acceptable to add an
> edac_mc_printk_ratelimited() macro, which uses printk_ratelimited(),
> and then call that new macro from edac_ce_error()?

You guys are way off here: the intent of EDAC drivers is to accurately
report errors for purposes of counting them and doing analysis on
that collected data as to whether components are going wrong - not to
ratelimit them as some nuisance output.

With breaking the EDAC reporting, you're barking up the wrong tree - if
you don't want to see those errors, do not load the drivers. It is that
simple.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [EXTERNAL] Re: [PATCH] EDAC: update edac printk wrappers to use printk_ratelimited.
  2021-05-05 21:04         ` Borislav Petkov
@ 2021-05-05 21:48           ` Tyler Hicks
  2021-05-05 22:02             ` Borislav Petkov
  0 siblings, 1 reply; 14+ messages in thread
From: Tyler Hicks @ 2021-05-05 21:48 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: wangglei, Lei Wang (DPLAT),
	mchehab, tony.luck, james.morse, rric, linux-edac, linux-kernel,
	Hang Li, Brandon Waller

On 2021-05-05 23:04:44, Borislav Petkov wrote:
> On Wed, May 05, 2021 at 03:23:57PM -0500, Tyler Hicks wrote:
> >  Would it be any more acceptable to add an
> > edac_mc_printk_ratelimited() macro, which uses printk_ratelimited(),
> > and then call that new macro from edac_ce_error()?
> 
> You guys are way off here: the intent of EDAC drivers is to accurately
> report errors for purposes of counting them and doing analysis on
> that collected data as to whether components are going wrong - not to
> ratelimit them as some nuisance output.
> 
> With breaking the EDAC reporting, you're barking up the wrong tree - if
> you don't want to see those errors, do not load the drivers. It is that
> simple.

As I understand it, the idea here wasn't to treat the log messages as a
nuisance that should be completely squelched. The counters are monitored
and provide the definitive way to detect large scale problems but the CE
log messages can be an easier-to-discover way for humans to identify
potential problems when, for example, centralized log aggregation and
indexing is in place.

The thought was that the full stream of log messages isn't necessary to
notice that there's a problem when they are being emitted at such a high
rate (500 per second). They're just filling up disk space and/or wasting
networking bandwidth at that point. Of course, the best course of action
here is to service the machine but there's still a period of time
between the CE errors popping up and the machine being serviced.

Tyler

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [EXTERNAL] Re: [PATCH] EDAC: update edac printk wrappers to use printk_ratelimited.
  2021-05-05 21:48           ` Tyler Hicks
@ 2021-05-05 22:02             ` Borislav Petkov
  2021-05-05 22:16               ` Tyler Hicks
  0 siblings, 1 reply; 14+ messages in thread
From: Borislav Petkov @ 2021-05-05 22:02 UTC (permalink / raw)
  To: Tyler Hicks
  Cc: wangglei, Lei Wang (DPLAT),
	mchehab, tony.luck, james.morse, rric, linux-edac, linux-kernel,
	Hang Li, Brandon Waller

On Wed, May 05, 2021 at 04:48:46PM -0500, Tyler Hicks wrote:
> The thought was that the full stream of log messages isn't necessary to
> notice that there's a problem when they are being emitted at such a high
> rate (500 per second). They're just filling up disk space and/or wasting
> networking bandwidth at that point.

I already asked about this but lemme point it out again: have you guys
looked at drivers/ras/cec.c ?

With that there won't be *any* error reports in dmesg and it will even
poison and offline pages which generate excessive errors so that ...

> Of course, the best course of action here is to service the machine
> but there's still a period of time between the CE errors popping up
> and the machine being serviced.

... you'll have ample time to service the machine.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [EXTERNAL] Re: [PATCH] EDAC: update edac printk wrappers to use printk_ratelimited.
  2021-05-05 22:02             ` Borislav Petkov
@ 2021-05-05 22:16               ` Tyler Hicks
  2021-05-05 22:43                 ` Tyler Hicks
  0 siblings, 1 reply; 14+ messages in thread
From: Tyler Hicks @ 2021-05-05 22:16 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: wangglei, Lei Wang (DPLAT),
	mchehab, tony.luck, james.morse, rric, linux-edac, linux-kernel,
	Hang Li, Brandon Waller

On 2021-05-06 00:02:44, Borislav Petkov wrote:
> On Wed, May 05, 2021 at 04:48:46PM -0500, Tyler Hicks wrote:
> > The thought was that the full stream of log messages isn't necessary to
> > notice that there's a problem when they are being emitted at such a high
> > rate (500 per second). They're just filling up disk space and/or wasting
> > networking bandwidth at that point.
> 
> I already asked about this but lemme point it out again: have you guys
> looked at drivers/ras/cec.c ?

We'll have a closer look. Thanks for the pointer!

Tyler

> 
> With that there won't be *any* error reports in dmesg and it will even
> poison and offline pages which generate excessive errors so that ...
> 
> > Of course, the best course of action here is to service the machine
> > but there's still a period of time between the CE errors popping up
> > and the machine being serviced.
> 
> ... you'll have ample time to service the machine.
> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [EXTERNAL] Re: [PATCH] EDAC: update edac printk wrappers to use printk_ratelimited.
  2021-05-05 22:16               ` Tyler Hicks
@ 2021-05-05 22:43                 ` Tyler Hicks
  2021-05-05 22:55                   ` Borislav Petkov
  0 siblings, 1 reply; 14+ messages in thread
From: Tyler Hicks @ 2021-05-05 22:43 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: wangglei, Lei Wang (DPLAT),
	mchehab, tony.luck, james.morse, rric, linux-edac, linux-kernel,
	Hang Li, Brandon Waller

On 2021-05-05 17:16:11, Tyler Hicks wrote:
> On 2021-05-06 00:02:44, Borislav Petkov wrote:
> > On Wed, May 05, 2021 at 04:48:46PM -0500, Tyler Hicks wrote:
> > > The thought was that the full stream of log messages isn't necessary to
> > > notice that there's a problem when they are being emitted at such a high
> > > rate (500 per second). They're just filling up disk space and/or wasting
> > > networking bandwidth at that point.
> > 
> > I already asked about this but lemme point it out again: have you guys
> > looked at drivers/ras/cec.c ?
> 
> We'll have a closer look. Thanks for the pointer!

This is x86-specific and not applicable in our situation.

Tyler

> 
> Tyler
> 
> > 
> > With that there won't be *any* error reports in dmesg and it will even
> > poison and offline pages which generate excessive errors so that ...
> > 
> > > Of course, the best course of action here is to service the machine
> > > but there's still a period of time between the CE errors popping up
> > > and the machine being serviced.
> > 
> > ... you'll have ample time to service the machine.
> > 
> > -- 
> > Regards/Gruss,
> >     Boris.
> > 
> > https://people.kernel.org/tglx/notes-about-netiquette
> > 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [EXTERNAL] Re: [PATCH] EDAC: update edac printk wrappers to use printk_ratelimited.
  2021-05-05 22:43                 ` Tyler Hicks
@ 2021-05-05 22:55                   ` Borislav Petkov
  2021-05-05 23:01                     ` Tyler Hicks
  0 siblings, 1 reply; 14+ messages in thread
From: Borislav Petkov @ 2021-05-05 22:55 UTC (permalink / raw)
  To: Tyler Hicks
  Cc: wangglei, Lei Wang (DPLAT),
	mchehab, tony.luck, james.morse, rric, linux-edac, linux-kernel,
	Hang Li, Brandon Waller

On Wed, May 05, 2021 at 05:43:57PM -0500, Tyler Hicks wrote:
> This is x86-specific 

That's because it is used by x86 currently. It shouldn't be hard to use
it on another arch though as the machinery is pretty generic.

> and not applicable in our situation.

What is your situation? ARM?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [EXTERNAL] Re: [PATCH] EDAC: update edac printk wrappers to use printk_ratelimited.
  2021-05-05 22:55                   ` Borislav Petkov
@ 2021-05-05 23:01                     ` Tyler Hicks
  2021-05-05 23:13                       ` Luck, Tony
  2021-05-06  7:16                       ` Mauro Carvalho Chehab
  0 siblings, 2 replies; 14+ messages in thread
From: Tyler Hicks @ 2021-05-05 23:01 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: wangglei, Lei Wang (DPLAT),
	mchehab, tony.luck, james.morse, rric, linux-edac, linux-kernel,
	Hang Li, Brandon Waller

On 2021-05-06 00:55:00, Borislav Petkov wrote:
> On Wed, May 05, 2021 at 05:43:57PM -0500, Tyler Hicks wrote:
> > This is x86-specific 
> 
> That's because it is used by x86 currently. It shouldn't be hard to use
> it on another arch though as the machinery is pretty generic.
> 
> > and not applicable in our situation.
> 
> What is your situation? ARM?

Yes, though I'm not sure if those additional features are
important/useful enough for us to generalize that driver. The main
motivation here was just to prevent storage/network from being flooded
by obviously-bad nodes that haven't been offlined yet. :) 

Lei and others on cc will need to evaluate porting cec.c and what it
will gain them. Thanks again.

Tyler

> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [EXTERNAL] Re: [PATCH] EDAC: update edac printk wrappers to use printk_ratelimited.
  2021-05-05 23:01                     ` Tyler Hicks
@ 2021-05-05 23:13                       ` Luck, Tony
  2021-05-06  7:16                       ` Mauro Carvalho Chehab
  1 sibling, 0 replies; 14+ messages in thread
From: Luck, Tony @ 2021-05-05 23:13 UTC (permalink / raw)
  To: Tyler Hicks, Borislav Petkov
  Cc: wangglei, Lei Wang (DPLAT),
	mchehab, james.morse, rric, linux-edac, linux-kernel, Hang Li,
	Brandon Waller

>> What is your situation? ARM?
>
> Yes, though I'm not sure if those additional features are
> important/useful enough for us to generalize that driver. The main
> motivation here was just to prevent storage/network from being flooded
> by obviously-bad nodes that haven't been offlined yet. :) 
>
> Lei and others on cc will need to evaluate porting cec.c and what it
> will gain them. Thanks again.

Tyler,

You might also look at the x86 "storm" detection code (tl;dr version
"If error interrupts are coming too fast, turn off the interrupts and poll").

-Tony 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [EXTERNAL] Re: [PATCH] EDAC: update edac printk wrappers to use printk_ratelimited.
  2021-05-05 23:01                     ` Tyler Hicks
  2021-05-05 23:13                       ` Luck, Tony
@ 2021-05-06  7:16                       ` Mauro Carvalho Chehab
  1 sibling, 0 replies; 14+ messages in thread
From: Mauro Carvalho Chehab @ 2021-05-06  7:16 UTC (permalink / raw)
  To: Tyler Hicks
  Cc: Borislav Petkov, wangglei, Lei Wang (DPLAT),
	tony.luck, james.morse, rric, linux-edac, linux-kernel, Hang Li,
	Brandon Waller

Em Wed, 5 May 2021 18:01:52 -0500
Tyler Hicks <tyhicks@linux.microsoft.com> escreveu:

> On 2021-05-06 00:55:00, Borislav Petkov wrote:
> > On Wed, May 05, 2021 at 05:43:57PM -0500, Tyler Hicks wrote:  
> > > This is x86-specific   
> > 
> > That's because it is used by x86 currently. It shouldn't be hard to use
> > it on another arch though as the machinery is pretty generic.
> >   
> > > and not applicable in our situation.  
> > 
> > What is your situation? ARM?  
> 
> Yes, though I'm not sure if those additional features are
> important/useful enough for us to generalize that driver. The main
> motivation here was just to prevent storage/network from being flooded
> by obviously-bad nodes that haven't been offlined yet. :) 

Well, if a machine starts to produce 500+ errors per second,
then it should be offlined as soon as possible, as otherwise bad results
will be produced ;-)

See, the CE error reporting mechanism is meant to be used together
with some error correction code algorithm like the ones used on ECC
memories. Such algorithms are designed to detect a single errored bit 
with a change usually at the ~10⁻4 to 10^-7 order (the precision
depends on how many bits are used and what algorithm is used), but 
if there are two wrong bits at the same word, the chance to detect 
is a lot lower.

So, keeping the server enabled up to the point that it would consume
enough resources at the storage/network to bother someone sounds a 
terrible idea, as sooner or later it will miss an error or produce
an uncorrected event ;-)

Besides that, if you're running rasdaemon to capture the hardware errors, 
the storage will also be flooded by something like that, even if you
disable them from going to syslog via 
sys/module/edac_core/parameters/edac_mc_log_ce.

Now, the question is: are those 500+ errors per second a real hardware
problem, or is it due to some broken error report mechanism?

In the latter case, the driver or the hardware that it is producing 
invalid errors should be fixed.

> 
> Lei and others on cc will need to evaluate porting cec.c and what it
> will gain them. Thanks again.

Regards,
Mauro

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2021-05-06  7:16 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-05 17:30 [PATCH] EDAC: update edac printk wrappers to use printk_ratelimited Lei Wang
2021-05-05 18:01 ` Borislav Petkov
2021-05-05 19:02   ` [EXTERNAL] " Lei Wang (DPLAT)
2021-05-05 19:45     ` Borislav Petkov
2021-05-05 20:23       ` Tyler Hicks
2021-05-05 21:04         ` Borislav Petkov
2021-05-05 21:48           ` Tyler Hicks
2021-05-05 22:02             ` Borislav Petkov
2021-05-05 22:16               ` Tyler Hicks
2021-05-05 22:43                 ` Tyler Hicks
2021-05-05 22:55                   ` Borislav Petkov
2021-05-05 23:01                     ` Tyler Hicks
2021-05-05 23:13                       ` Luck, Tony
2021-05-06  7:16                       ` Mauro Carvalho Chehab

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).