From: Borislav Petkov <bp@alien8.de>
To: Alexandru Gagniuc <mr.nuke.me@gmail.com>
Cc: linux-acpi@vger.kernel.org, linux-edac@vger.kernel.org,
	rjw@rjwysocki.net, lenb@kernel.org, tony.luck@intel.com,
	tbaicar@codeaurora.org, will.deacon@arm.com, james.morse@arm.com,
	shiju.jose@huawei.com, zjzhang@codeaurora.org,
	gengdongjiu@huawei.com, linux-kernel@vger.kernel.org,
	alex_gagniuc@dellteam.com, austin_bolen@dell.com,
	shyam_iyer@dell.com, devel@acpica.org, mchehab@kernel.org,
	robert.moore@intel.com, erik.schmauss@intel.com
Subject: Re: [RFC PATCH v2 3/4] acpi: apei: Do not panic() when correctable errors are marked as fatal.
Date: Wed, 18 Apr 2018 19:54:15 +0200
Message-ID: <20180418175415.GJ4795@pd.tnic>
In-Reply-To: <20180416215903.7318-4-mr.nuke.me@gmail.com>

On Mon, Apr 16, 2018 at 04:59:02PM -0500, Alexandru Gagniuc wrote:
> Firmware is evil:
>  - ACPI was created to "try and make the 'ACPI' extensions somehow
>  Windows specific" in order to "work well with NT and not the others
>  even if they are open"
>  - EFI was created to hide "secret" registers from the OS.
>  - UEFI was created to allow compromising an otherwise secure OS.
> 
> Never has firmware been created to solve a problem or simplify an
> otherwise cumbersome process. It is of no surprise then, that
> firmware nowadays intentionally crashes an OS.

I can't believe I'm saying this, but get rid of that rant. Even though I
agree, it doesn't belong in a commit message.

> 
> One simple way to do that is to mark GHES errors as fatal. Firmware
> knows and even expects that an OS will crash in this case. And most
> OSes do.
> 
> PCIe errors are notorious for having different definitions of "fatal".
> In ACPI, and other firmware standards, 'fatal' means the machine is
> about to explode and needs to be reset. In PCIe, on the other hand,
> fatal means that the link to a device has died. In the hotplug world
> of PCIe, this is akin to a USB disconnect. From that view, the "fatal"
> loss of a link is a normal event. To allow a machine to crash in this
> case is downright idiotic.
> 
> To solve this, implement an IRQ safe handler for AER. This makes sure
> we have enough information to invoke the full AER handler later down
> the road, and tells ghes_notify_nmi that "It's all cool".
> ghes_notify_nmi() then gets calmed down a little, and doesn't panic().
> 
> Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
> ---
>  drivers/acpi/apei/ghes.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 42 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 2119c51b4a9e..e0528da4e8f8 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -481,12 +481,26 @@ static int ghes_handle_aer(struct acpi_hest_generic_data *gdata, int sev)
>  	return ghes_severity(gdata->error_severity);
>  }
>  
> +static int ghes_handle_aer_irqsafe(struct acpi_hest_generic_data *gdata,
> +				   int sev)
> +{
> +	struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);
> +
> +	/* The system can always recover from AER errors. */
> +	if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
> +		pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO)
> +		return CPER_SEV_RECOVERABLE;
> +
> +	return ghes_severity(gdata->error_severity);
> +}

Well, Tyler touched that AER error severity handling recently and we had
it all nicely documented in the comment above ghes_handle_aer().

Your ghes_handle_aer_irqsafe() graft basically bypasses
ghes_handle_aer() instead of being incorporated into it.

If all you wanna say is, the severity computation should go through all
the sections and look at each error's severity before making a decision,
then add that to ghes_severity() instead of doing that "deferrable"
severity dance.
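
To make that suggestion concrete, here is a rough userspace sketch, not the
kernel code. Every name in it (the sketch_* types and functions) is
hypothetical and merely stands in for the GHES/CPER structures. It assumes
one possible policy, loosely following the quoted patch: a PCIe AER section
with valid device ID and AER info is capped at "recoverable", and the
overall severity is the worst one found across all sections.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical severities, ordered from least to most severe. */
enum sketch_sev {
	SKETCH_SEV_NO,
	SKETCH_SEV_CORRECTED,
	SKETCH_SEV_RECOVERABLE,
	SKETCH_SEV_PANIC,
};

/* Hypothetical section types; only AER gets special treatment here. */
enum sketch_sec_type {
	SKETCH_SEC_OTHER,
	SKETCH_SEC_PCIE_AER,
};

/* Hypothetical, simplified view of one error section. */
struct sketch_section {
	enum sketch_sec_type type;
	enum sketch_sev reported;	/* severity as reported by firmware */
	int aer_info_valid;		/* device ID and AER info both valid */
};

/* Per-section policy (an assumption): a usable AER section never panics. */
static enum sketch_sev sketch_section_severity(const struct sketch_section *sec)
{
	if (sec->type == SKETCH_SEC_PCIE_AER && sec->aer_info_valid &&
	    sec->reported == SKETCH_SEV_PANIC)
		return SKETCH_SEV_RECOVERABLE;

	return sec->reported;
}

/* Walk all sections and return the worst per-section severity. */
static enum sketch_sev sketch_status_severity(const struct sketch_section *secs,
					      size_t n)
{
	enum sketch_sev worst = SKETCH_SEV_NO;
	size_t i;

	assert(secs != NULL || n == 0);

	for (i = 0; i < n; i++) {
		enum sketch_sev sev = sketch_section_severity(&secs[i]);

		if (sev > worst)
			worst = sev;
	}

	return worst;
}
```

With something along these lines, the caller decides whether to panic from a
single computed severity, and no separate "deferrable" state is needed.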

And add the changes to the policy to the comment above
ghes_handle_aer(). I don't want any changes from people coming and going
and leaving us scratching our heads over why we did it this way.

And no need for those handlers and so on - make it simple first - then we
can talk about more complex handling.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


