linux-edac.vger.kernel.org archive mirror
From: Mauro Carvalho Chehab <mchehab@kernel.org>
To: Tyler Hicks <tyhicks@linux.microsoft.com>
Cc: Borislav Petkov <bp@alien8.de>, wangglei <wangglei@gmail.com>,
	"Lei Wang (DPLAT)" <Wang.Lei@microsoft.com>,
	"tony.luck@intel.com" <tony.luck@intel.com>,
	"james.morse@arm.com" <james.morse@arm.com>,
	"rric@kernel.org" <rric@kernel.org>,
	"linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Hang Li <hangl@microsoft.com>,
	Brandon Waller <bwaller@microsoft.com>
Subject: Re: [EXTERNAL] Re: [PATCH] EDAC: update edac printk wrappers to use printk_ratelimited.
Date: Thu, 6 May 2021 09:16:30 +0200
Message-ID: <20210506091630.168c7887@coco.lan>
In-Reply-To: <20210505230152.GH4967@sequoia>

On Wed, 5 May 2021 18:01:52 -0500
Tyler Hicks <tyhicks@linux.microsoft.com> wrote:

> On 2021-05-06 00:55:00, Borislav Petkov wrote:
> > On Wed, May 05, 2021 at 05:43:57PM -0500, Tyler Hicks wrote:  
> > > This is x86-specific   
> > 
> > That's because it is used by x86 currently. It shouldn't be hard to use
> > it on another arch though as the machinery is pretty generic.
> >   
> > > and not applicable in our situation.  
> > 
> > What is your situation? ARM?  
> 
> Yes, though I'm not sure if those additional features are
> important/useful enough for us to generalize that driver. The main
> motivation here was just to prevent storage/network from being flooded
> by obviously-bad nodes that haven't been offlined yet. :) 

Well, if a machine starts to produce 500+ errors per second,
then it should be offlined as soon as possible, as otherwise bad results
will be produced ;-)

See, the CE error reporting mechanism is meant to be used together
with some error correction code algorithm, like the ones used on ECC
memories. Such algorithms are designed to detect a single errored bit
with a chance usually on the order of ~10^-4 to 10^-7 (the precision
depends on how many bits are used and what algorithm is used), but
if there are two wrong bits in the same word, the chance to detect
them is a lot lower.
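
To make the point above concrete, here is a toy sketch in Python. It
assumes an extended Hamming (8,4) SEC-DED code, chosen only because it is
small; real ECC memories use wider codes, and this is not what any
particular memory controller implements. The function names are
hypothetical. It shows that one flipped bit is corrected, two flipped bits
are only flagged as uncorrectable, and three flipped bits can be silently
"corrected" into wrong data.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming (SEC-DED) codeword.

    Layout (list index): 0 = overall parity, 1/2/4 = Hamming parity bits,
    3/5/6/7 = data bits d0..d3.
    """
    code = [0] * 8
    code[3] = (nibble >> 0) & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) & 1             # make the total parity even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                 # XOR of the positions of set bits
    odd_parity = sum(code) & 1

    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        # Odd number of flips: the decoder assumes exactly one and fixes it.
        # A syndrome of 0 here points at the overall parity bit (index 0).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped.
        return "uncorrectable", None

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


def flip(code, *positions):
    """Return a copy of the codeword with the given bit positions inverted."""
    bad = list(code)
    for pos in positions:
        bad[pos] ^= 1
    return bad
```

For example, decode(flip(encode(0b1011), 1, 2, 4)) reports "corrected" but
returns a value different from 0b1011, which is the kind of missed error a
node with a very high CE rate will eventually hit.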

So, keeping the server enabled up to the point that it would consume
enough resources at the storage/network to bother someone sounds like a
terrible idea, as sooner or later it will miss an error or produce
an uncorrected event ;-)

Besides that, if you're running rasdaemon to capture the hardware errors,
the storage will also be flooded by those events, even if you
disable them from going to syslog via
/sys/module/edac_core/parameters/edac_mc_log_ce.
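
As a minimal sketch of inspecting that knob, assuming the parameter is
exposed as a plain integer file at the path mentioned above (whether it
exists depends on the running kernel and loaded modules), something like
the following would do. The helper names read_param and write_param are
hypothetical, and writing to the real sysfs file needs root.

```python
from pathlib import Path

# Path as given in the message; treated here as an assumption about the system.
EDAC_MC_LOG_CE = Path("/sys/module/edac_core/parameters/edac_mc_log_ce")


def read_param(path=EDAC_MC_LOG_CE):
    """Return the parameter as an int, or None if it is absent or unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def write_param(value, path=EDAC_MC_LOG_CE):
    """Write an integer value to the parameter file (root needed on sysfs)."""
    Path(path).write_text(f"{int(value)}\n")


if __name__ == "__main__":
    current = read_param()
    if current is None:
        print("edac_mc_log_ce not available on this system")
    else:
        print(f"edac_mc_log_ce = {current}")
```

Note that this only changes what goes to the kernel log; it does not reduce
what rasdaemon records.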

Now, the question is: are those 500+ errors per second a real hardware
problem, or are they due to some broken error reporting mechanism?

In the latter case, the driver or the hardware that is producing
invalid errors should be fixed.

> 
> Lei and others on cc will need to evaluate porting cec.c and what it
> will gain them. Thanks again.

Regards,
Mauro

Thread overview: 14+ messages
2021-05-05 17:30 [PATCH] EDAC: update edac printk wrappers to use printk_ratelimited Lei Wang
2021-05-05 18:01 ` Borislav Petkov
2021-05-05 19:02   ` [EXTERNAL] " Lei Wang (DPLAT)
2021-05-05 19:45     ` Borislav Petkov
2021-05-05 20:23       ` Tyler Hicks
2021-05-05 21:04         ` Borislav Petkov
2021-05-05 21:48           ` Tyler Hicks
2021-05-05 22:02             ` Borislav Petkov
2021-05-05 22:16               ` Tyler Hicks
2021-05-05 22:43                 ` Tyler Hicks
2021-05-05 22:55                   ` Borislav Petkov
2021-05-05 23:01                     ` Tyler Hicks
2021-05-05 23:13                       ` Luck, Tony
2021-05-06  7:16                       ` Mauro Carvalho Chehab [this message]
