From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mauro Carvalho Chehab
Subject: Re: [PATCH 3/3] ghes_edac: add platform check to enable ghes_edac
Date: Fri, 21 Jul 2017 14:01:31 -0300
Message-ID: <20170721140131.40079805@vento.lan>
References: <20170718060007.GB8736@nazgul.tnic>
 <1500407379.2042.21.camel@hpe.com>
 <20170718181545.32bd9181@vento.lan>
 <1500481869.2042.29.camel@hpe.com>
 <20170720043344.GC14367@nazgul.tnic>
 <1500579646.2042.37.camel@hpe.com>
 <20170721133441.GB5036@nazgul.tnic>
 <20170721104001.3cd2b884@vento.lan>
 <20170721134715.GC5036@nazgul.tnic>
 <1500649162.2042.43.camel@hpe.com>
 <20170721151317.GA13424@nazgul.tnic>
 <1500650732.2042.45.camel@hpe.com>
 <20170721124401.5f94aba9@vento.lan>
 <1500654661.2042.49.camel@hpe.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8BIT
In-Reply-To: <1500654661.2042.49.camel@hpe.com>
Sender: linux-kernel-owner@vger.kernel.org
To: "Kani, Toshimitsu"
Cc: "linux-kernel@vger.kernel.org",
 "tglx@linutronix.de",
 "mchehab@kernel.org",
 "rjw@rjwysocki.net",
 "srinivas.pandruvada@linux.intel.com",
 "bp@alien8.de",
 "tony.luck@intel.com",
 "lenb@kernel.org",
 "linux-acpi@vger.kernel.org",
 "linux-edac@vger.kernel.org"
List-Id: linux-acpi@vger.kernel.org

On Fri, 21 Jul 2017 16:40:20 +0000
"Kani, Toshimitsu" wrote:

> On Fri, 2017-07-21 at 12:44 -0300, Mauro Carvalho Chehab wrote:
> > On Fri, 21 Jul 2017 15:34:50 +0000
> > "Kani, Toshimitsu" wrote:
> >
> > > On Fri, 2017-07-21 at 17:13 +0200, Borislav Petkov wrote:
> > > > On Fri, Jul 21, 2017 at 03:08:41PM +0000, Kani, Toshimitsu
> > > > wrote:
> > > > > Yes, that is correct.  Corrected errors are reported to the OS
> > > > > when they exceeded the platform's threshold.
> > > >
> > > > Are those thresholds user-configurable?
> > >
> > > I suppose it'd depend on vendors, but I do not think users can do
> > > it properly unless they have depth knowledge about the hardware.
> > >
> > > > If not, what are you telling users who want to see *every*
> > > > corrected error for measuring DIMM wear and so on...?
> > >
> > > Corrected errors are normal and expected to occur on healthy
> > > hardware.  They do not need user's attention until they repeatedly
> > > occurred at a same place.
> >
> > Yes, they're expected to happen. Still, some sys admins have their
> > own measurements about what's "normal" for their scenario, and want
> > to monitor every single corrected error, running their own
> > algorithm to warn if the number of corrected errors is above their
> > "normal" rate.
>
> I suppose these admins had to do it because their platforms reported
> all corrected errors.  It addresses such administrators' burden.

I see the value of having a threshold in BIOS, provided that it is
well documented and that its value can be adjusted if needed.

One of the things I wanted to implement in ras-daemon was an
algorithm that would do such thresholding in software.
The problem is that it would require field experience. So,
I talked with a few vendors to see if they could help with
it, but, at that time, none raised their hands :-)

The thing with a BIOS threshold is that the user has no way to
audit the algorithm. So, when the BIOS starts reporting such errors,
it may already be too late: the system may be on the verge of
losing data (or some data may already have been lost).

That's critical on cluster systems with thousands of machines:
while the impact of disabling a cluster node to do some maintenance
is marginal, the impact of an uncorrected error on a single
machine may compromise weeks of expensive processing.

That's why some users prefer to monitor every single corrected
error and compare the observed rate with the probability
distribution under which they know the risk of uncorrected
errors is acceptable.
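
To make the software-threshold idea more concrete, here is a minimal
sketch of what such a check could look like. It is only an illustration
under stated assumptions: corrected errors are assumed to arrive as
(location, timestamp) events, and healthy hardware is assumed to produce
them as a Poisson process with a known baseline rate. The class name, the
location string and the default parameters are all hypothetical; nothing
like this exists in ras-daemon today.

```python
import math
from collections import defaultdict, deque


def poisson_tail(k, mean):
    """Return P(X >= k) for X ~ Poisson(mean)."""
    if k <= 0:
        return 1.0
    term = math.exp(-mean)      # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= mean / i        # P(X = i) derived from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


class CorrectedErrorMonitor:
    """Hypothetical per-location monitor; not part of ras-daemon."""

    def __init__(self, baseline_per_hour, window_seconds=3600.0, alpha=1e-3):
        # baseline_per_hour: corrected-error rate the admin considers normal
        # (an assumption supplied by the user, e.g. from field data).
        self.window = window_seconds
        self.alpha = alpha
        # Expected number of corrected errors per window on healthy hardware.
        self.expected = baseline_per_hour * window_seconds / 3600.0
        self.events = defaultdict(deque)

    def record(self, location, timestamp):
        """Log one corrected error; return True if the rate looks abnormal."""
        q = self.events[location]
        q.append(timestamp)
        # Drop events that have fallen out of the sliding window.
        while q and q[0] <= timestamp - self.window:
            q.popleft()
        # Flag when a count this high would be very unlikely under the baseline.
        return poisson_tail(len(q), self.expected) < self.alpha


if __name__ == "__main__":
    mon = CorrectedErrorMonitor(baseline_per_hour=0.1)
    for t in (0.0, 10.0, 20.0):
        print(t, mon.record("mc0/dimm3", t))
```

With a baseline of 0.1 errors per hour, the third error at the same
location within one hour trips the check, while the same three errors
spread over several hours do not. Unlike a BIOS threshold, both the
baseline and the significance level are visible to the user and can be
audited or tuned.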

Thanks,
Mauro