* [RFC PATCH 0/4]  acpi: apei: Improve error handling with firmware-first
From: Alexandru Gagniuc @ 2018-04-03 17:08 UTC
  To: linux-acpi
  Cc: rjw, lenb, tony.luck, bp, tbaicar, will.deacon, james.morse,
	shiju.jose, zjzhang, gengdongjiu, linux-kernel, alex_gagniuc,
	austin_bolen, shyam_iyer, Alexandru Gagniuc

Hi,

I'm helping Dell work through the issues related to PCIe and NVMe hotplug.
Although hotplug generally works, corner cases such as pin bounce, failing
drives, and surprise removal are not fully worked out. Because of this, NVMe
is not yet at feature parity with SCSI and SAS.

One of the interesting issues is that most server vendors like to use
firmware-first (FFS), for various reasons that I won't go into. The side
effect of that is that we oftentimes don't even get a stab at correcting the
problem.

This is especially troublesome for NVMe, which needs PCIe hotplug to work
correctly. When we do get a stab, it's after FFS can't handle a fatal error,
and we're told of it through ACPI tables. On x86, this happens through an
NMI, and as soon as we see a "fatal" error, we panic().

One problem with this FFS approach is that AER never even gets notified of
the issue. And even if a PCIe drive were to stop responding, nvme or higher
block drivers would notice something is wrong even without AER. Unless there
is a physical defect or silicon bug, AER can recover the link.
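
To make the contrast concrete, here is a rough user-space sketch in C of the
two policies discussed above. It is not the code from these patches: the enum
names, the function names, and the has_handler flag are all hypothetical and
exist only for illustration. The first function models the behavior described
above (a "fatal" report leads straight to a panic); the second models one
possible alternative, assuming that a report can be handed to a recovery
handler such as AER when one exists.

```c
#include <stdbool.h>

/* Illustrative severity levels; hypothetical names, not kernel symbols. */
enum sev { SEV_NONE, SEV_CORRECTED, SEV_RECOVERABLE, SEV_FATAL };

/* Illustrative outcomes of looking at one firmware-reported error. */
enum action { ACT_IGNORE, ACT_LOG, ACT_DEFER_TO_HANDLER, ACT_PANIC };

/* Models the behavior described above: any "fatal" report means panic. */
enum action action_panic_on_fatal(enum sev reported)
{
	switch (reported) {
	case SEV_FATAL:
		return ACT_PANIC;
	case SEV_RECOVERABLE:
		return ACT_DEFER_TO_HANDLER;
	case SEV_CORRECTED:
		return ACT_LOG;
	default:
		return ACT_IGNORE;
	}
}

/*
 * Assumed alternative: when a recovery handler exists for the error
 * source (for example AER for a PCIe error), defer to it even if the
 * firmware labeled the error "fatal". Without a handler, fall back to
 * the policy above.
 */
enum action action_defer_when_handled(enum sev reported, bool has_handler)
{
	if (reported == SEV_FATAL && has_handler)
		return ACT_DEFER_TO_HANDLER;
	return action_panic_on_fatal(reported);
}
```

Under these assumptions, a "fatal" PCIe report with a handler available is
deferred rather than turned into a panic, while a "fatal" report with no
handler still panics.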

Another issue we're seeing with FFS is that BIOSes assume that an OS will crash
on a fatal error reported through ACPI. Sometimes they will leave hardware in
a "kind of" working state, or will fail to re-arm the errors. From what I've
observed, this happens on hardware with silicon bugs. For example, PCIe root
ports are unaffected, but certain PCIe switches may stop issuing hotplug
interrupts. It's just another headache with FFS.

While I don't expect server vendors to drop FFS in favor of native AER control,
I do think we can harden Linux against the idiosyncrasies of such systems. The
scope of these patches is to protect against poorly designed firmware, and
perform proper error handling when possible. It is not to make FFS a first
class citizen in error handling.

Alexandru Gagniuc (4):
  acpi: apei: Return severity of GHES messages after handling
  acpi: apei: Swap ghes_print_queued_estatus and ghes_proc_in_irq
  acpi: apei: Do not panic() in NMI because of GHES messages
  acpi: apei: Warn when GHES marks correctable errors as "fatal"

 drivers/acpi/apei/ghes.c | 100 ++++++++++++++++++++++++++++++-----------------
 1 file changed, 64 insertions(+), 36 deletions(-)

--
2.14.3


Thread overview: 15+ messages
2018-04-03 17:08 [RFC PATCH 0/4] acpi: apei: Improve error handling with firmware-first Alexandru Gagniuc
2018-04-03 17:08 ` [RFC PATCH 1/4] acpi: apei: Return severity of GHES messages after handling Alexandru Gagniuc
2018-04-03 17:08 ` [RFC PATCH 2/4] acpi: apei: Swap ghes_print_queued_estatus and ghes_proc_in_irq Alexandru Gagniuc
2018-04-03 17:08 ` [RFC PATCH 3/4] acpi: apei: Do not panic() in NMI because of GHES messages Alexandru Gagniuc
2018-04-04  7:18   ` James Morse
2018-04-04 15:33     ` Alex G.
2018-04-04 16:53       ` James Morse
2018-04-04 19:49         ` Alex G.
2018-04-06 18:24           ` James Morse
2018-04-09 18:11             ` Alex G.
2018-04-13 16:38               ` James Morse
2018-04-16 21:59                 ` Alex G.
2018-04-20  7:27                   ` James Morse
2018-04-20 22:04                     ` Alex G.
2018-04-03 17:08 ` [RFC PATCH 4/4] acpi: apei: Warn when GHES marks correctable errors as "fatal" Alexandru Gagniuc
