From mboxrd@z Thu Jan  1 00:00:00 1970
From: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Subject: Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
Date: Fri, 21 Jul 2017 14:01:31 -0300
Message-ID: <20170721140131.40079805@vento.lan>
References: <20170718060007.GB8736@nazgul.tnic>
 <1500407379.2042.21.camel@hpe.com>
 <20170718181545.32bd9181@vento.lan>
 <1500481869.2042.29.camel@hpe.com>
 <20170720043344.GC14367@nazgul.tnic>
 <1500579646.2042.37.camel@hpe.com>
 <20170721133441.GB5036@nazgul.tnic>
 <20170721104001.3cd2b884@vento.lan>
 <20170721134715.GC5036@nazgul.tnic>
 <1500649162.2042.43.camel@hpe.com>
 <20170721151317.GA13424@nazgul.tnic>
 <1500650732.2042.45.camel@hpe.com>
 <20170721124401.5f94aba9@vento.lan>
 <1500654661.2042.49.camel@hpe.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8BIT
Return-path: <linux-kernel-owner@vger.kernel.org>
In-Reply-To: <1500654661.2042.49.camel@hpe.com>
Sender: linux-kernel-owner@vger.kernel.org
To: "Kani, Toshimitsu" <toshi.kani@hpe.com>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>, "tglx@linutronix.de" <tglx@linutronix.de>, "mchehab@kernel.org" <mchehab@kernel.org>, "rjw@rjwysocki.net" <rjw@rjwysocki.net>, "srinivas.pandruvada@linux.intel.com" <srinivas.pandruvada@linux.intel.com>, "bp@alien8.de" <bp@alien8.de>, "tony.luck@intel.com" <tony.luck@intel.com>, "lenb@kernel.org" <lenb@kernel.org>, "linux-acpi@vger.kernel.org" <linux-acpi@vger.kernel.org>, "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
List-Id: linux-acpi@vger.kernel.org

On Fri, 21 Jul 2017 16:40:20 +0000
"Kani, Toshimitsu" <toshi.kani@hpe.com> wrote:

> On Fri, 2017-07-21 at 12:44 -0300, Mauro Carvalho Chehab wrote:
> > On Fri, 21 Jul 2017 15:34:50 +0000
> > "Kani, Toshimitsu" <toshi.kani@hpe.com> wrote:
> >   
> > > On Fri, 2017-07-21 at 17:13 +0200, Borislav Petkov wrote:  
> > > > On Fri, Jul 21, 2017 at 03:08:41PM +0000, Kani, Toshimitsu
> > > > wrote:    
> > > > > Yes, that is correct.  Corrected errors are reported to the OS
> > > > > when they exceeded the platform's threshold.    
> > > > 
> > > > Are those thresholds user-configurable?    
> > > 
> > > I suppose it'd depend on vendors, but I do not think users can do
> > > it properly unless they have depth knowledge about the hardware.
> > >   
> > > > If not, what are you telling users who want to see *every*
> > > > corrected error for measuring DIMM wear and so on...?    
> > > 
> > > Corrected errors are normal and expected to occur on healthy
> > > hardware.  They do not need user's attention until they repeatedly
> > > occurred at a same place.  
> > 
> > Yes, they're expected to happen. Still, some sys admins have their
> > own measurements about what's "normal" for their scenario, and want
> > to monitor every single corrected error, running their own
> > algorithm to warn if the number of corrected errors is above their
> > "normal" rate.  
> 
> I suppose these admins had to do it because their platforms reported
> all corrected errors.  It addresses such administrators' burden.

I see the value of having a threshold in BIOS, provided that it is
well documented and that its value can be adjusted, if needed.

One of the things I wanted to implement in ras-daemon was an
algorithm that would do such thresholding in software.
The problem is that it would require field experience. So,
I talked with a few vendors to see if they could help with
it, but, at that time, none raised their hands :-)
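
A minimal sketch of what such a software threshold could look like,
in Python. This is an illustration only, not ras-daemon code: the
class name, the parameters and the DIMM labels are all hypothetical,
and the limit and window values would have to come from field
experience. It counts corrected errors per DIMM inside a sliding
time window and flags the DIMM once the count exceeds the limit.

```python
from collections import defaultdict, deque


class CorrectedErrorMonitor:
    """Flag a DIMM whose corrected-error count in a time window exceeds a limit.

    All names here are hypothetical; this is not part of ras-daemon.
    """

    def __init__(self, max_errors, window_seconds):
        self.max_errors = max_errors
        self.window_seconds = window_seconds
        # Maps a DIMM label to the timestamps of its recent corrected errors.
        self._events = defaultdict(deque)

    def record(self, dimm, timestamp):
        """Record one corrected error; return True if the DIMM is over the limit."""
        events = self._events[dimm]
        events.append(timestamp)
        # Drop events that have fallen out of the sliding window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        return len(events) > self.max_errors


monitor = CorrectedErrorMonitor(max_errors=2, window_seconds=60)
for t in (0, 10, 20):
    over_limit = monitor.record("DIMM_A", t)
# Three errors within 60 seconds exceed a limit of two.
print(over_limit)
```

Because every event is seen by the monitor, the limit and the window
can be audited and tuned by the administrator, which is not possible
when the threshold lives in the BIOS.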

The thing with a BIOS threshold is that the user has no way to
audit the algorithm. So, when the BIOS starts reporting such errors,
it may already be too late: the systems may be on the verge of
losing data (or some data may already have been lost).

That's critical on cluster systems with thousands of machines:
while the impact of disabling a cluster node to do some maintenance
is marginal, the impact of an uncorrected error on a single
machine may compromise weeks of expensive processing.

That's why some users prefer to monitor every single corrected
error and compare the observed rate against a probability
distribution for which they know the risk of uncorrected errors
is acceptable.
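
One way such a comparison could be done is sketched below in Python.
It assumes, purely for illustration, that corrected errors on a
healthy machine follow a Poisson distribution with a known expected
count per observation period. The function names, the expected
counts and the alpha cut-off are hypothetical and would have to be
derived from the site's own measurements.

```python
import math


def poisson_tail(observed, expected):
    """Return P(X >= observed) for X ~ Poisson(expected)."""
    if observed <= 0:
        return 1.0
    # Sum P(X = k) for k = 0 .. observed - 1, then take the complement.
    term = math.exp(-expected)
    cdf = term
    for k in range(1, observed):
        term *= expected / k
        cdf += term
    return max(0.0, 1.0 - cdf)


def is_abnormal(observed, expected, alpha=0.001):
    """True if a count this high is unlikely under the 'normal' rate."""
    return poisson_tail(observed, expected) < alpha


# Hypothetical numbers: 1 corrected error per day is "normal" here.
print(is_abnormal(observed=2, expected=1.0))
print(is_abnormal(observed=10, expected=1.0))
```

A machine whose count is flagged as abnormal could then be taken out
of the cluster for maintenance before an uncorrected error occurs.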

Thanks,
Mauro
