EDAC: update edac printk wrappers to use printk_ratelimited.
diff mbox series

Message ID 20210505173027.78428-1-wangglei@gmail.com
State New, archived
Headers show
Series
  • EDAC: update edac printk wrappers to use printk_ratelimited.
Related show

Commit Message

Lei Wang May 5, 2021, 5:30 p.m. UTC
Update printk to the ratelimited version, so that in some corner cases when
vast of CE errors show up, the kernel logging can be suppressed.

Signed-off-by: Lei Wang <wangglei@gmail.com>
---
 drivers/edac/edac_mc.h | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

Comments

Borislav Petkov May 5, 2021, 6:01 p.m. UTC | #1
On Wed, May 05, 2021 at 10:30:27AM -0700, Lei Wang wrote:
> Update printk to the ratelimited version, so that in some corner cases when
> vast of CE errors show up, the kernel logging can be suppressed.

Err, why?
Lei Wang (DPLAT) May 5, 2021, 7:02 p.m. UTC | #2
Hi Boris,

We found a corner case in production environment that there are ~500 CE errors per second. The SoC otherwise functions just fine. Making printk ratelimited reduced CE error logging to < 20 per second. Though this is just one case so far, we think moving to printk_ratelimited could benefit broader use as well, by helping control the amount of kernel logging. In most running condition, the error rate is way below the limit. And in an error case like this one, vast error logging would not provide much valuable details, rather it's storming the kernel logging.

Thanks,
-Lei

-----Original Message-----
From: Borislav Petkov <bp@alien8.de> 
Sent: Wednesday, May 5, 2021 11:01 AM
To: wangglei <wangglei@gmail.com>
Cc: mchehab@kernel.org; tony.luck@intel.com; james.morse@arm.com; rric@kernel.org; linux-edac@vger.kernel.org; linux-kernel@vger.kernel.org; Lei Wang (DPLAT) <Wang.Lei@microsoft.com>; Hang Li <hangl@microsoft.com>; tyhicks@linux.microsoft.com; Brandon Waller <bwaller@microsoft.com>
Subject: [EXTERNAL] Re: [PATCH] EDAC: update edac printk wrappers to use printk_ratelimited.

On Wed, May 05, 2021 at 10:30:27AM -0700, Lei Wang wrote:
> Update printk to the ratelimited version, so that in some corner cases 
> when vast of CE errors show up, the kernel logging can be suppressed.

Err, why?

--
Regards/Gruss,
    Boris.

https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpeople.kernel.org%2Ftglx%2Fnotes-about-netiquette&amp;data=04%7C01%7CWang.Lei%40microsoft.com%7C71421584bc2a43951df908d90fefc1b9%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637558344708605379%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=fJG0%2Fdk8VCVNIGS0kM2BZDHAXLVcq4CLHEajhND0rzg%3D&amp;reserved=0
Borislav Petkov May 5, 2021, 7:45 p.m. UTC | #3
Hi Lei,

On Wed, May 05, 2021 at 07:02:14PM +0000, Lei Wang (DPLAT) wrote:
> Hi Boris,

first of all, please do not top-post.

> We found a corner case in production environment that there are ~500
> CE errors per second. The SoC otherwise functions just fine. Making
> printk ratelimited reduced CE error logging to < 20 per second.

If you want to avoid CE logs flooding dmesg, there's a couple of things
you can do:

1. Use drivers/ras/cec.c

2. Do not load EDAC drivers at all since you don't care about the error
reports, apparently.

3. Fix the CE source: replace the DIMMs, etc.

> Though this is just one case so far, we think moving to
> printk_ratelimited could benefit broader use as well, by helping
> control the amount of kernel logging.

No, this will make EDAC driver loading output incomplete when some of
the messages are omitted due to the ratelimiting. And no, this is not
going to happen.

HTH.
Tyler Hicks May 5, 2021, 8:23 p.m. UTC | #4
On 2021-05-05 21:45:01, Borislav Petkov wrote:
> Hi Lei,
> 
> On Wed, May 05, 2021 at 07:02:14PM +0000, Lei Wang (DPLAT) wrote:
> > Hi Boris,
> 
> first of all, please do not top-post.
> 
> > We found a corner case in production environment that there are ~500
> > CE errors per second. The SoC otherwise functions just fine. Making
> > printk ratelimited reduced CE error logging to < 20 per second.
> 
> If you want to avoid CE logs flooding dmesg, there's a couple of things
> you can do:
> 
> 1. Use drivers/ras/cec.c
> 
> 2. Do not load EDAC drivers at all since you don't care about the error
> reports, apparently.

Lei, if you don't care about the CE error messages at all, there's
also an edac_mc_log_ce module parameter that can be used to quiet the
message emitted from edac_ce_error():

 https://www.kernel.org/doc/html/latest/admin-guide/ras.html#module-parameters

> 3. Fix the CE source: replace the DIMMs, etc.
> 
> > Though this is just one case so far, we think moving to
> > printk_ratelimited could benefit broader use as well, by helping
> > control the amount of kernel logging.
> 
> No, this will make EDAC driver loading output incomplete when some of
> the messages are omitted due to the ratelimiting. And no, this is not
> going to happen.

Boris, I agree that a more surgical approach is needed than this if Lei
still needs some traces of the CE error messages in the logs. Would it
be any more acceptable to add an edac_mc_printk_ratelimited() macro,
which uses printk_ratelimited(), and then call that new macro from
edac_ce_error()?

If you still don't want those CE errors ratelimited by default, perhaps
a new, non-default mode (2) could be added to the edac_mc_log_ce module
parameter that uses the ratelimited variant?

Tyler

> 
> HTH.
> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette
>
Borislav Petkov May 5, 2021, 9:04 p.m. UTC | #5
On Wed, May 05, 2021 at 03:23:57PM -0500, Tyler Hicks wrote:
>  Would it be any more acceptable to add an
> edac_mc_printk_ratelimited() macro, which uses printk_ratelimited(),
> and then call that new macro from edac_ce_error()?

You guys are way off here: the intent of EDAC drivers is to accurately
report errors for purposes of counting them and doing analysis on
that collected data as to whether components are going wrong - not to
ratelimit them as some nuisance output.

With breaking the EDAC reporting, you're barking up the wrong tree - if
you don't want to see those errors, do not load the drivers. It is that
simple.
Tyler Hicks May 5, 2021, 9:48 p.m. UTC | #6
On 2021-05-05 23:04:44, Borislav Petkov wrote:
> On Wed, May 05, 2021 at 03:23:57PM -0500, Tyler Hicks wrote:
> >  Would it be any more acceptable to add an
> > edac_mc_printk_ratelimited() macro, which uses printk_ratelimited(),
> > and then call that new macro from edac_ce_error()?
> 
> You guys are way off here: the intent of EDAC drivers is to accurately
> report errors for purposes of counting them and doing analysis on
> that collected data as to whether components are going wrong - not to
> ratelimit them as some nuisance output.
> 
> With breaking the EDAC reporting, you're barking up the wrong tree - if
> you don't want to see those errors, do not load the drivers. It is that
> simple.

As I understand it, the idea here wasn't to treat the log messages as a
nuisance that should be completely squelched. The counters are monitored
and provide the definitive way to detect large scale problems but the CE
log messages can be an easier-to-discover way for humans to identify
potential problems when, for example, centralized log aggregation and
indexing is in place.

The thought was that the full stream of log messages isn't necessary to
notice that there's a problem when they are being emitted at such a high
rate (500 per second). They're just filling up disk space and/or wasting
networking bandwidth at that point. Of course, the best course of action
here is to service the machine but there's still a period of time
between the CE errors popping up and the machine being serviced.

Tyler
Borislav Petkov May 5, 2021, 10:02 p.m. UTC | #7
On Wed, May 05, 2021 at 04:48:46PM -0500, Tyler Hicks wrote:
> The thought was that the full stream of log messages isn't necessary to
> notice that there's a problem when they are being emitted at such a high
> rate (500 per second). They're just filling up disk space and/or wasting
> networking bandwidth at that point.

I already asked about this but lemme point it out again: have you guys
looked at drivers/ras/cec.c ?

With that there won't be *any* error reports in dmesg and it will even
poison and offline pages which generate excessive errors so that ...

> Of course, the best course of action here is to service the machine
> but there's still a period of time between the CE errors popping up
> and the machine being serviced.

... you'll have ample time to service the machine.
Tyler Hicks May 5, 2021, 10:16 p.m. UTC | #8
On 2021-05-06 00:02:44, Borislav Petkov wrote:
> On Wed, May 05, 2021 at 04:48:46PM -0500, Tyler Hicks wrote:
> > The thought was that the full stream of log messages isn't necessary to
> > notice that there's a problem when they are being emitted at such a high
> > rate (500 per second). They're just filling up disk space and/or wasting
> > networking bandwidth at that point.
> 
> I already asked about this but lemme point it out again: have you guys
> looked at drivers/ras/cec.c ?

We'll have a closer look. Thanks for the pointer!

Tyler

> 
> With that there won't be *any* error reports in dmesg and it will even
> poison and offline pages which generate excessive errors so that ...
> 
> > Of course, the best course of action here is to service the machine
> > but there's still a period of time between the CE errors popping up
> > and the machine being serviced.
> 
> ... you'll have ample time to service the machine.
> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette
>
Tyler Hicks May 5, 2021, 10:43 p.m. UTC | #9
On 2021-05-05 17:16:11, Tyler Hicks wrote:
> On 2021-05-06 00:02:44, Borislav Petkov wrote:
> > On Wed, May 05, 2021 at 04:48:46PM -0500, Tyler Hicks wrote:
> > > The thought was that the full stream of log messages isn't necessary to
> > > notice that there's a problem when they are being emitted at such a high
> > > rate (500 per second). They're just filling up disk space and/or wasting
> > > networking bandwidth at that point.
> > 
> > I already asked about this but lemme point it out again: have you guys
> > looked at drivers/ras/cec.c ?
> 
> We'll have a closer look. Thanks for the pointer!

This is x86-specific and not applicable in our situation.

Tyler

> 
> Tyler
> 
> > 
> > With that there won't be *any* error reports in dmesg and it will even
> > poison and offline pages which generate excessive errors so that ...
> > 
> > > Of course, the best course of action here is to service the machine
> > > but there's still a period of time between the CE errors popping up
> > > and the machine being serviced.
> > 
> > ... you'll have ample time to service the machine.
> > 
> > -- 
> > Regards/Gruss,
> >     Boris.
> > 
> > https://people.kernel.org/tglx/notes-about-netiquette
> >
Borislav Petkov May 5, 2021, 10:55 p.m. UTC | #10
On Wed, May 05, 2021 at 05:43:57PM -0500, Tyler Hicks wrote:
> This is x86-specific 

That's because it is used by x86 currently. It shouldn't be hard to use
it on another arch though as the machinery is pretty generic.

> and not applicable in our situation.

What is your situation? ARM?
Tyler Hicks May 5, 2021, 11:01 p.m. UTC | #11
On 2021-05-06 00:55:00, Borislav Petkov wrote:
> On Wed, May 05, 2021 at 05:43:57PM -0500, Tyler Hicks wrote:
> > This is x86-specific 
> 
> That's because it is used by x86 currently. It shouldn't be hard to use
> it on another arch though as the machinery is pretty generic.
> 
> > and not applicable in our situation.
> 
> What is your situation? ARM?

Yes, though I'm not sure if those additional features are
important/useful enough for us to generalize that driver. The main
motivation here was just to prevent storage/network from being flooded
by obviously-bad nodes that haven't been offlined yet. :) 

Lei and others on cc will need to evaluate porting cec.c and what it
will gain them. Thanks again.

Tyler

> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette
>
Luck, Tony May 5, 2021, 11:13 p.m. UTC | #12
>> What is your situation? ARM?
>
> Yes, though I'm not sure if those additional features are
> important/useful enough for us to generalize that driver. The main
> motivation here was just to prevent storage/network from being flooded
> by obviously-bad nodes that haven't been offlined yet. :) 
>
> Lei and others on cc will need to evaluate porting cec.c and what it
> will gain them. Thanks again.

Tyler,

You might also look at the x86 "storm" detection code (tl;dr version
"If error interrupts are coming too fast, turn off the interrupts and poll").

-Tony
Mauro Carvalho Chehab May 6, 2021, 7:16 a.m. UTC | #13
Em Wed, 5 May 2021 18:01:52 -0500
Tyler Hicks <tyhicks@linux.microsoft.com> escreveu:

> On 2021-05-06 00:55:00, Borislav Petkov wrote:
> > On Wed, May 05, 2021 at 05:43:57PM -0500, Tyler Hicks wrote:  
> > > This is x86-specific   
> > 
> > That's because it is used by x86 currently. It shouldn't be hard to use
> > it on another arch though as the machinery is pretty generic.
> >   
> > > and not applicable in our situation.  
> > 
> > What is your situation? ARM?  
> 
> Yes, though I'm not sure if those additional features are
> important/useful enough for us to generalize that driver. The main
> motivation here was just to prevent storage/network from being flooded
> by obviously-bad nodes that haven't been offlined yet. :) 

Well, if a machine starts to produce 500+ errors per second,
then it should be offlined as soon as possible, as otherwise bad results
will be produced ;-)

See, the CE error reporting mechanism is meant to be used together
with some error correction code algorithm like the ones used on ECC
memories. Such algorithms are designed to detect a single errored bit 
with a change usually at the ~10⁻4 to 10^-7 order (the precision
depends on how many bits are used and what algorithm is used), but 
if there are two wrong bits at the same word, the chance to detect 
is a lot lower.

So, keeping the server enabled up to the point that it would consume
enough resources at the storage/network to bother someone sounds a 
terrible idea, as sooner or later it will miss an error or produce
an uncorrected event ;-)

Besides that, if you're running rasdaemon to capture the hardware errors, 
the storage will also be flooded by something like that, even if you
disable them from going to syslog via 
sys/module/edac_core/parameters/edac_mc_log_ce.

Now, the question is: are those 500+ errors per second a real hardware
problem, or is it due to some broken error report mechanism?

In the latter case, the driver or the hardware that it is producing 
invalid errors should be fixed.

> 
> Lei and others on cc will need to evaluate porting cec.c and what it
> will gain them. Thanks again.

Regards,
Mauro

Patch
diff mbox series

diff --git a/drivers/edac/edac_mc.h b/drivers/edac/edac_mc.h
index 881b00eadf7a..355529317980 100644
--- a/drivers/edac/edac_mc.h
+++ b/drivers/edac/edac_mc.h
@@ -46,19 +46,19 @@ 
 #endif
 
 #define edac_printk(level, prefix, fmt, arg...) \
-	printk(level "EDAC " prefix ": " fmt, ##arg)
+	printk_ratelimited(level "EDAC " prefix ": " fmt, ##arg)
 
 #define edac_mc_printk(mci, level, fmt, arg...) \
-	printk(level "EDAC MC%d: " fmt, mci->mc_idx, ##arg)
+	printk_ratelimited(level "EDAC MC%d: " fmt, mci->mc_idx, ##arg)
 
 #define edac_mc_chipset_printk(mci, level, prefix, fmt, arg...) \
-	printk(level "EDAC " prefix " MC%d: " fmt, mci->mc_idx, ##arg)
+	printk_ratelimited(level "EDAC " prefix " MC%d: " fmt, mci->mc_idx, ##arg)
 
 #define edac_device_printk(ctl, level, fmt, arg...) \
-	printk(level "EDAC DEVICE%d: " fmt, ctl->dev_idx, ##arg)
+	printk_ratelimited(level "EDAC DEVICE%d: " fmt, ctl->dev_idx, ##arg)
 
 #define edac_pci_printk(ctl, level, fmt, arg...) \
-	printk(level "EDAC PCI%d: " fmt, ctl->pci_idx, ##arg)
+	printk_ratelimited(level "EDAC PCI%d: " fmt, ctl->pci_idx, ##arg)
 
 /* prefixes for edac_printk() and edac_mc_printk() */
 #define EDAC_MC "MC"