linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: James Morse <james.morse@arm.com>
To: Alexandru Gagniuc <mr.nuke.me@gmail.com>
Cc: linux-acpi@vger.kernel.org, rjw@rjwysocki.net, lenb@kernel.org,
	tony.luck@intel.com, bp@alien8.de, tbaicar@codeaurora.org,
	will.deacon@arm.com, shiju.jose@huawei.com,
	zjzhang@codeaurora.org, gengdongjiu@huawei.com,
	linux-kernel@vger.kernel.org, alex_gagniuc@dellteam.com,
	austin_bolen@dell.com, shyam_iyer@dell.com
Subject: Re: [RFC PATCH 3/4] acpi: apei: Do not panic() in NMI because of GHES messages
Date: Wed, 4 Apr 2018 08:18:21 +0100	[thread overview]
Message-ID: <338e9bb4-a837-69f9-36e5-5ee2ddcaaa38@arm.com> (raw)
In-Reply-To: <20180403170830.29282-4-mr.nuke.me@gmail.com>

Hi Alexandru,

On 03/04/18 18:08, Alexandru Gagniuc wrote:
> BIOSes like to send NMIs for a number of silly reasons often deemed
> to be "fatal". For example pin bounce during a PCIE hotplug/unplug
> might cause the link to go down and retrain, with fatal PCI errors
> being generated while the link is retraining.

Sounds fun!


> Instead of panic()ing in NMI context, pass fatal errors down to IRQ
> context to see if they can be resolved.

How do we know we will survive this trip?

On arm64 systems it may not be possible to return to the context we took the NMI
notification from: we may bounce back into firmware with the same "world is on
fire" error. Firmware can rightly assume the OS has made no attempt to handle
the error. Your 'not re-arming the error' example makes this worrying.


> With these change, PCIe error are handled by AER. Other far less
> common errors, such as machine check exceptions, still cause a panic()
> in their respective handlers.

I agree AER is always going to be different. Could we take a look at the CPER
records while still in_nmi() to decide whether linux knows better than firmware?

For non-standard or processor-errors I think we should always panic() if they're
marked as fatal.
For memory-errors we could split memory_failure() up to have
{NMI,IRQ,process}-context helpers, all we need to know at NMI-time is whether
the affected memory is kernel memory.

For the PCI*/AER errors we should be able to try and handle it ... if we can get
back to process/IRQ context:
What happens if a PCIe driver has interrupts masked and is doing something to
cause this error? We can take the NMI and setup the irq-work, but it may never
run as we return to interrupts-masked context.


> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 2c998125b1d5..7243a99ea57e 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -955,7 +962,7 @@ static void __process_error(struct ghes *ghes)
>  static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
>  {
>  	struct ghes *ghes;
> -	int sev, ret = NMI_DONE;
> +	int ret = NMI_DONE;
>  
>  	if (!atomic_add_unless(&ghes_in_nmi, 1, 1))
>  		return ret;
> @@ -968,13 +975,6 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
>  			ret = NMI_HANDLED;
>  		}
>  
> -		sev = ghes_severity(ghes->estatus->error_severity);
> -		if (sev >= GHES_SEV_PANIC) {
> -			oops_begin();
> -			ghes_print_queued_estatus();
> -			__ghes_panic(ghes);
> -		}
> -
>  		if (!(ghes->flags & GHES_TO_CLEAR))
>  			continue;

For Processor-errors I think this is the wrong thing to do, but we should be
able to poke around in the CPER records and find out what we are dealing with.


Thanks,

James

  reply	other threads:[~2018-04-04  7:21 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-04-03 17:08 [RFC PATCH 0/4] acpi: apei: Improve error handling with firmware-first Alexandru Gagniuc
2018-04-03 17:08 ` [RFC PATCH 1/4] acpi: apei: Return severity of GHES messages after handling Alexandru Gagniuc
2018-04-03 17:08 ` [RFC PATCH 2/4] acpi: apei: Swap ghes_print_queued_estatus and ghes_proc_in_irq Alexandru Gagniuc
2018-04-03 17:08 ` [RFC PATCH 3/4] acpi: apei: Do not panic() in NMI because of GHES messages Alexandru Gagniuc
2018-04-04  7:18   ` James Morse [this message]
2018-04-04 15:33     ` Alex G.
2018-04-04 16:53       ` James Morse
2018-04-04 19:49         ` Alex G.
2018-04-06 18:24           ` James Morse
2018-04-09 18:11             ` Alex G.
2018-04-13 16:38               ` James Morse
2018-04-16 21:59                 ` Alex G.
2018-04-20  7:27                   ` James Morse
2018-04-20 22:04                     ` Alex G.
2018-04-03 17:08 ` [RFC PATCH 4/4] acpi: apei: Warn when GHES marks correctable errors as "fatal" Alexandru Gagniuc

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=338e9bb4-a837-69f9-36e5-5ee2ddcaaa38@arm.com \
    --to=james.morse@arm.com \
    --cc=alex_gagniuc@dellteam.com \
    --cc=austin_bolen@dell.com \
    --cc=bp@alien8.de \
    --cc=gengdongjiu@huawei.com \
    --cc=lenb@kernel.org \
    --cc=linux-acpi@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mr.nuke.me@gmail.com \
    --cc=rjw@rjwysocki.net \
    --cc=shiju.jose@huawei.com \
    --cc=shyam_iyer@dell.com \
    --cc=tbaicar@codeaurora.org \
    --cc=tony.luck@intel.com \
    --cc=will.deacon@arm.com \
    --cc=zjzhang@codeaurora.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).