All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Alex G." <mr.nuke.me@gmail.com>
To: Pavel Machek <pavel@ucw.cz>, Borislav Petkov <bp@alien8.de>
Cc: linux-acpi@vger.kernel.org, linux-edac@vger.kernel.org,
	"Rafael J. Wysocki" <rjw@rjwysocki.net>,
	Len Brown <lenb@kernel.org>, Tony Luck <tony.luck@intel.com>,
	Mauro Carvalho Chehab <mchehab@kernel.org>,
	Robert Moore <robert.moore@intel.com>,
	Erik Schmauss <erik.schmauss@intel.com>,
	Tyler Baicar <tbaicar@codeaurora.org>,
	Will Deacon <will.deacon@arm.com>,
	James Morse <james.morse@arm.com>,
	Shiju Jose <shiju.jose@huawei.com>,
	"Jonathan (Zhixiong) Zhang" <zjzhang@codeaurora.org>,
	Dongjiu Geng <gengdongjiu@huawei.com>,
	linux-kernel@vger.kernel.org, devel@acpica.org
Subject: Re: [RFC PATCH v3 3/3] acpi: apei: Warn when GHES marks correctable errors as "fatal"
Date: Wed, 2 May 2018 14:29:40 -0500	[thread overview]
Message-ID: <2e1b1d18-08b5-fa85-be7b-ab29ac2195b3@gmail.com> (raw)
In-Reply-To: <20180502191029.hcvf56xbdna7oi4k@devuan>

On 05/02/2018 02:10 PM, Pavel Machek wrote:
> On Thu 2018-04-26 13:20:57, Borislav Petkov wrote:
>> On Wed, Apr 25, 2018 at 03:39:51PM -0500, Alexandru Gagniuc wrote:
>>> There seems to be a culture amongst BIOS teams to want to crash the
>>> OS when an error can't be handled in firmware. Marking GHES errors as
>>> "fatal" is a very common way to do this.
>>>
>>> However, a number of errors reported by GHES may be fatal in the sense
>>> a device or link is lost, but are not fatal to the system. When there
>>> is a disagreement with firmware about the handleability of an error,
>>> print a warning message.
> 
> 
>>> +
>>> +	if ((sev >= GHES_SEV_PANIC) && (ghes_actual_severity(ghes) < sev)) {
>>> +		pr_warn("FIRMWARE BUG: Firmware sent fatal error that we were able to correct");
>>> +		pr_warn("BROKEN FIRMWARE: Complain to your hardware vendor");
>>
>> Pasting the same comment from last time since you missed it:
>>
>> "No, I don't want any of that crap issuing stuff in dmesg and then people
>> opening bugs and running around and trying to replace hardware.
> 
> We want to see warnings. Maybe they can be toned done. We even have
> dedicated distros for firmware testing.

I'm told that had we had this warning when the r740 BIOS was in
development, we would have solved a lot of the issues that I'm currently
working on. That would, in turn, have exposed bigger issues, and we
would have had a platform to fix and test those bigger issues.

Hardware vendors who test on linux might be scratching their heads at
this error, though they tend to figure out what they're doing wrong, and
fix it.

One argument against was "expensive support calls", on which I call BS.
The firmware resources are expensive, but those are there whether or not
the customers call to complain.

Alex

>> Good mailing practices for 400: avoid top-posting and trim the reply.
> 
> Good mailing practices -- limit use of four letter words on public lists.

Then can't show word 'four'.

WARNING: multiple messages have this Message-ID (diff)
From: Alexandru Gagniuc <mr.nuke.me@gmail.com>
To: Pavel Machek <pavel@ucw.cz>, Borislav Petkov <bp@alien8.de>
Cc: linux-acpi@vger.kernel.org, linux-edac@vger.kernel.org,
	"Rafael J. Wysocki" <rjw@rjwysocki.net>,
	Len Brown <lenb@kernel.org>, Tony Luck <tony.luck@intel.com>,
	Mauro Carvalho Chehab <mchehab@kernel.org>,
	Robert Moore <robert.moore@intel.com>,
	Erik Schmauss <erik.schmauss@intel.com>,
	Tyler Baicar <tbaicar@codeaurora.org>,
	Will Deacon <will.deacon@arm.com>,
	James Morse <james.morse@arm.com>,
	Shiju Jose <shiju.jose@huawei.com>,
	"Jonathan (Zhixiong) Zhang" <zjzhang@codeaurora.org>,
	Dongjiu Geng <gengdongjiu@huawei.com>,
	linux-kernel@vger.kernel.org, devel@acpica.org
Subject: [RFC,v3,3/3] acpi: apei: Warn when GHES marks correctable errors as "fatal"
Date: Wed, 2 May 2018 14:29:40 -0500	[thread overview]
Message-ID: <2e1b1d18-08b5-fa85-be7b-ab29ac2195b3@gmail.com> (raw)

On 05/02/2018 02:10 PM, Pavel Machek wrote:
> On Thu 2018-04-26 13:20:57, Borislav Petkov wrote:
>> On Wed, Apr 25, 2018 at 03:39:51PM -0500, Alexandru Gagniuc wrote:
>>> There seems to be a culture amongst BIOS teams to want to crash the
>>> OS when an error can't be handled in firmware. Marking GHES errors as
>>> "fatal" is a very common way to do this.
>>>
>>> However, a number of errors reported by GHES may be fatal in the sense
>>> a device or link is lost, but are not fatal to the system. When there
>>> is a disagreement with firmware about the handleability of an error,
>>> print a warning message.
> 
> 
>>> +
>>> +	if ((sev >= GHES_SEV_PANIC) && (ghes_actual_severity(ghes) < sev)) {
>>> +		pr_warn("FIRMWARE BUG: Firmware sent fatal error that we were able to correct");
>>> +		pr_warn("BROKEN FIRMWARE: Complain to your hardware vendor");
>>
>> Pasting the same comment from last time since you missed it:
>>
>> "No, I don't want any of that crap issuing stuff in dmesg and then people
>> opening bugs and running around and trying to replace hardware.
> 
> We want to see warnings. Maybe they can be toned done. We even have
> dedicated distros for firmware testing.

I'm told that had we had this warning when the r740 BIOS was in
development, we would have solved a lot of the issues that I'm currently
working on. That would, in turn, have exposed bigger issues, and we
would have had a platform to fix and test those bigger issues.

Hardware vendors who test on linux might be scratching their heads at
this error, though they tend to figure out what they're doing wrong, and
fix it.

One argument against was "expensive support calls", on which I call BS.
The firmware resources are expensive, but those are there whether or not
the customers call to complain.

Alex

>> Good mailing practices for 400: avoid top-posting and trim the reply.
> 
> Good mailing practices -- limit use of four letter words on public lists.

Then can't show word 'four'.
---
To unsubscribe from this list: send the line "unsubscribe linux-edac" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

  reply	other threads:[~2018-05-02 19:29 UTC|newest]

Thread overview: 89+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-04-16 21:58 [RFC PATCH v2 0/4] acpi: apei: Improve error handling with firmware-first Alexandru Gagniuc
2018-04-16 21:59 ` [RFC PATCH v2 1/4] EDAC, GHES: Remove unused argument to ghes_edac_report_mem_error Alexandru Gagniuc
2018-04-16 21:59   ` [RFC,v2,1/4] " Alexandru Gagniuc
2018-04-17  9:36   ` [RFC PATCH v2 1/4] " Borislav Petkov
2018-04-17  9:36     ` [RFC,v2,1/4] " Borislav Petkov
2018-04-17 16:43     ` [RFC PATCH v2 1/4] " Alex G.
2018-04-17 16:43       ` [RFC,v2,1/4] " Alexandru Gagniuc
2018-04-16 21:59 ` [RFC PATCH v2 2/4] acpi: apei: Split GHES handlers outside of ghes_do_proc Alexandru Gagniuc
2018-04-16 21:59   ` [RFC,v2,2/4] " Alexandru Gagniuc
2018-04-18 17:52   ` [RFC PATCH v2 2/4] " Borislav Petkov
2018-04-18 17:52     ` [RFC,v2,2/4] " Borislav Petkov
2018-04-19 14:19     ` [RFC PATCH v2 2/4] " Alex G.
2018-04-19 14:19       ` [RFC,v2,2/4] " Alexandru Gagniuc
2018-04-19 14:30       ` [RFC PATCH v2 2/4] " Borislav Petkov
2018-04-19 14:30         ` [RFC,v2,2/4] " Borislav Petkov
2018-04-19 14:57         ` [RFC PATCH v2 2/4] " Alex G.
2018-04-19 14:57           ` [RFC,v2,2/4] " Alexandru Gagniuc
2018-04-19 15:29           ` [RFC PATCH v2 2/4] " Borislav Petkov
2018-04-19 15:29             ` [RFC,v2,2/4] " Borislav Petkov
2018-04-19 15:46             ` [RFC PATCH v2 2/4] " Alex G.
2018-04-19 15:46               ` [RFC,v2,2/4] " Alexandru Gagniuc
2018-04-19 16:40               ` [RFC PATCH v2 2/4] " Borislav Petkov
2018-04-19 16:40                 ` [RFC,v2,2/4] " Borislav Petkov
2018-04-16 21:59 ` [RFC PATCH v2 3/4] acpi: apei: Do not panic() when correctable errors are marked as fatal Alexandru Gagniuc
2018-04-16 21:59   ` [RFC,v2,3/4] " Alexandru Gagniuc
2018-04-18 17:54   ` [RFC PATCH v2 3/4] " Borislav Petkov
2018-04-18 17:54     ` [RFC,v2,3/4] " Borislav Petkov
2018-04-19 14:57     ` [RFC PATCH v2 3/4] " Alex G.
2018-04-19 14:57       ` [RFC,v2,3/4] " Alexandru Gagniuc
2018-04-19 15:35       ` [RFC PATCH v2 3/4] " James Morse
2018-04-19 15:35         ` [Devel] " James Morse
2018-04-19 15:35         ` [RFC,v2,3/4] " James Morse
2018-04-19 16:27         ` [RFC PATCH v2 3/4] " Alex G.
2018-04-19 16:27           ` [RFC,v2,3/4] " Alexandru Gagniuc
2018-04-19 15:40       ` [RFC PATCH v2 3/4] " Borislav Petkov
2018-04-19 15:40         ` [RFC,v2,3/4] " Borislav Petkov
2018-04-19 16:26         ` [RFC PATCH v2 3/4] " Alex G.
2018-04-19 16:26           ` [RFC,v2,3/4] " Alexandru Gagniuc
2018-04-19 16:45           ` [RFC PATCH v2 3/4] " Borislav Petkov
2018-04-19 16:45             ` [RFC,v2,3/4] " Borislav Petkov
2018-04-19 17:40             ` [RFC PATCH v2 3/4] " Alex G.
2018-04-19 17:40               ` [RFC,v2,3/4] " Alexandru Gagniuc
2018-04-19 19:03               ` [RFC PATCH v2 3/4] " Borislav Petkov
2018-04-19 19:03                 ` [RFC,v2,3/4] " Borislav Petkov
2018-04-19 22:55                 ` [RFC PATCH v2 3/4] " Alex G.
2018-04-19 22:55                   ` [RFC,v2,3/4] " Alexandru Gagniuc
2018-04-22 10:48                   ` [RFC PATCH v2 3/4] " Borislav Petkov
2018-04-22 10:48                     ` [RFC,v2,3/4] " Borislav Petkov
2018-04-24  4:19                     ` [RFC PATCH v2 3/4] " Alex G.
2018-04-24  4:19                       ` [RFC,v2,3/4] " Alexandru Gagniuc
2018-04-25 14:01                       ` [RFC PATCH v2 3/4] " Borislav Petkov
2018-04-25 14:01                         ` [RFC,v2,3/4] " Borislav Petkov
2018-04-25 15:00                         ` [RFC PATCH v2 3/4] " Alex G.
2018-04-25 15:00                           ` [RFC,v2,3/4] " Alexandru Gagniuc
2018-04-25 17:15                           ` [RFC PATCH v2 3/4] " Borislav Petkov
2018-04-25 17:15                             ` [RFC,v2,3/4] " Borislav Petkov
2018-04-25 17:27                             ` [RFC PATCH v2 3/4] " Alex G.
2018-04-25 17:27                               ` [RFC,v2,3/4] " Alexandru Gagniuc
2018-04-25 17:39                               ` [RFC PATCH v2 3/4] " Borislav Petkov
2018-04-25 17:39                                 ` [RFC,v2,3/4] " Borislav Petkov
2018-04-16 21:59 ` [RFC PATCH v2 4/4] acpi: apei: Warn when GHES marks correctable errors as "fatal" Alexandru Gagniuc
2018-04-16 21:59   ` [RFC,v2,4/4] " Alexandru Gagniuc
2018-04-18 17:54   ` [RFC PATCH v2 4/4] " Borislav Petkov
2018-04-18 17:54     ` [RFC,v2,4/4] " Borislav Petkov
2018-04-19 15:11     ` [RFC PATCH v2 4/4] " Alex G.
2018-04-19 15:11       ` [RFC,v2,4/4] " Alexandru Gagniuc
2018-04-19 15:46       ` [RFC PATCH v2 4/4] " Borislav Petkov
2018-04-19 15:46         ` [RFC,v2,4/4] " Borislav Petkov
2018-04-25 20:39 ` [RFC PATCH v3 0/3] acpi: apei: Improve PCIe error handling with firmware-first Alexandru Gagniuc
2018-04-25 20:39   ` [RFC PATCH v3 1/3] EDAC, GHES: Remove unused argument to ghes_edac_report_mem_error Alexandru Gagniuc
2018-04-25 20:39     ` [RFC,v3,1/3] " Alexandru Gagniuc
2018-04-25 20:39   ` [RFC PATCH v3 2/3] acpi: apei: Do not panic() on PCIe errors reported through GHES Alexandru Gagniuc
2018-04-25 20:39     ` [RFC,v3,2/3] " Alexandru Gagniuc
2018-04-26 11:19     ` [RFC PATCH v3 2/3] " Borislav Petkov
2018-04-26 11:19       ` [RFC,v3,2/3] " Borislav Petkov
2018-04-26 17:44       ` [RFC PATCH v3 2/3] " Alex G.
2018-04-26 17:44         ` [RFC,v3,2/3] " Alexandru Gagniuc
2018-04-25 20:39   ` [RFC PATCH v3 3/3] acpi: apei: Warn when GHES marks correctable errors as "fatal" Alexandru Gagniuc
2018-04-25 20:39     ` [RFC,v3,3/3] " Alexandru Gagniuc
2018-04-26 11:20     ` [RFC PATCH v3 3/3] " Borislav Petkov
2018-04-26 11:20       ` [RFC,v3,3/3] " Borislav Petkov
2018-04-26 17:47       ` [RFC PATCH v3 3/3] " Alex G.
2018-04-26 17:47         ` [RFC,v3,3/3] " Alexandru Gagniuc
2018-04-26 18:03         ` [RFC PATCH v3 3/3] " Borislav Petkov
2018-04-26 18:03           ` [RFC,v3,3/3] " Borislav Petkov
2018-05-02 19:10       ` [RFC PATCH v3 3/3] " Pavel Machek
2018-05-02 19:10         ` [RFC,v3,3/3] " Pavel Machek
2018-05-02 19:29         ` Alex G. [this message]
2018-05-02 19:29           ` Alexandru Gagniuc

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2e1b1d18-08b5-fa85-be7b-ab29ac2195b3@gmail.com \
    --to=mr.nuke.me@gmail.com \
    --cc=bp@alien8.de \
    --cc=devel@acpica.org \
    --cc=erik.schmauss@intel.com \
    --cc=gengdongjiu@huawei.com \
    --cc=james.morse@arm.com \
    --cc=lenb@kernel.org \
    --cc=linux-acpi@vger.kernel.org \
    --cc=linux-edac@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mchehab@kernel.org \
    --cc=pavel@ucw.cz \
    --cc=rjw@rjwysocki.net \
    --cc=robert.moore@intel.com \
    --cc=shiju.jose@huawei.com \
    --cc=tbaicar@codeaurora.org \
    --cc=tony.luck@intel.com \
    --cc=will.deacon@arm.com \
    --cc=zjzhang@codeaurora.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.