linux-kernel.vger.kernel.org archive mirror
From: Alexandru Gagniuc <mr.nuke.me@gmail.com>
To: linux-acpi@vger.kernel.org
Cc: alex_gagniuc@dellteam.com, austin_bolen@dell.com,
	shyam_iyer@dell.com, Alexandru Gagniuc <mr.nuke.me@gmail.com>,
	Tony Luck <tony.luck@intel.com>, Borislav Petkov <bp@alien8.de>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>,
	x86@kernel.org, "Rafael J. Wysocki" <rjw@rjwysocki.net>,
	Len Brown <lenb@kernel.org>,
	Mauro Carvalho Chehab <mchehab@kernel.org>,
	Robert Moore <robert.moore@intel.com>,
	Erik Schmauss <erik.schmauss@intel.com>,
	Tyler Baicar <tbaicar@codeaurora.org>,
	Will Deacon <will.deacon@arm.com>,
	James Morse <james.morse@arm.com>,
	"Jonathan (Zhixiong) Zhang" <zjzhang@codeaurora.org>,
	Dongjiu Geng <gengdongjiu@huawei.com>,
	linux-edac@vger.kernel.org, linux-kernel@vger.kernel.org,
	devel@acpica.org
Subject: [PATCH v7 3/3] acpi: apei: Do not panic() on PCIe errors reported through GHES
Date: Fri, 25 May 2018 10:53:48 -0500
Message-ID: <20180525155352.22350-4-mr.nuke.me@gmail.com>
In-Reply-To: <20180525155352.22350-1-mr.nuke.me@gmail.com>

As previously noted, the policy to panic on any "Fatal" GHES error is
not suitable for several classes of errors, most notably errors that
have already been contained. The correct policy is to behave exactly
as native error handling does, i.e. as if the error had not been
reported through GHES. In special cases this may not be possible,
because the handler can only run after we exit the NMI, and getting
there safely requires special consideration.

PCIe AER errors are contained and reported at the root port. On
DPC-capable hardware, containment can be done by all downstream ports.
DPC also has the added advantage of preventing future errors. Since
these errors stop at the root port, we can safely do all the work
needed to exit the NMI and reach the error handler.

This patch does away with the mindless crashing of the system, and
correctly invokes the AER handler. When AER is not enabled, or the
firmware doesn't provide sufficient information to identify the source
of the error, the original panic() behavior is maintained.

Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
---
 drivers/acpi/apei/ghes.c | 43 +++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 41 insertions(+), 2 deletions(-)
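
For readers who want to try the decision logic outside the kernel, here
is a minimal user-space sketch of the policy described above. It is not
the kernel code: the model_* names, the MODEL_* bit values and the
aer_enabled parameter (standing in for IS_ENABLED(CONFIG_ACPI_APEI_PCIEAER))
are assumptions made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the CPER_PCIE_VALID_* bits; values are illustrative only. */
#define MODEL_VALID_DEVICE_ID (1u << 0)
#define MODEL_VALID_AER_INFO  (1u << 1)

enum model_section_type { MODEL_SEC_PCIE, MODEL_SEC_OTHER };

struct model_section {
	enum model_section_type type;
	uint32_t validation_bits;
};

/* A PCIe section is handleable only if it names the device and carries AER data. */
static bool model_pcie_has_safe_handler(const struct model_section *sec,
					bool aer_enabled)
{
	const uint32_t needed = MODEL_VALID_DEVICE_ID | MODEL_VALID_AER_INFO;

	return aer_enabled && (sec->validation_bits & needed) == needed;
}

/* Same shape as the loop in the patch: only PCIe sections can veto. */
static bool model_has_fatal_handler(const struct model_section *secs,
				    size_t n, bool aer_enabled)
{
	for (size_t i = 0; i < n; i++) {
		if (secs[i].type == MODEL_SEC_PCIE &&
		    !model_pcie_has_safe_handler(&secs[i], aer_enabled))
			return false;
	}
	return true;
}

/* Panic only for fatal severity with no handler we can safely reach. */
static bool model_should_panic(bool fatal, const struct model_section *secs,
			       size_t n, bool aer_enabled)
{
	return fatal && !model_has_fatal_handler(secs, n, aer_enabled);
}
```

In this model a fatal record whose PCIe section has both bits set does
not panic when AER is enabled, while one missing either bit, or any PCIe
record with AER disabled, still does. Note that, mirroring the loop in
the diff below, a record containing no PCIe section at all is also
reported as having a handler, because the verdict starts out as true.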

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 1b22e18168f5..f7126f6d8d52 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -425,7 +425,7 @@ static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int
  * GHES_SEV_RECOVERABLE -> AER_NONFATAL
  * GHES_SEV_RECOVERABLE && CPER_SEC_RESET -> AER_FATAL
  *     These both need to be reported and recovered from by the AER driver.
- * GHES_SEV_FATAL does not make it to this handler
+ * GHES_SEV_FATAL -> AER_FATAL
  */
 static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
 {
@@ -837,6 +837,45 @@ static inline void ghes_sea_remove(struct ghes *ghes) { }
 static struct llist_head ghes_estatus_llist;
 static struct irq_work ghes_proc_irq_work;
 
+/* PCIe AER errors are safe if AER section contains enough info. */
+static bool ghes_pcie_has_safe_handler(struct acpi_hest_generic_data *gdata)
+{
+	struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);
+
+	if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
+		pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO &&
+		IS_ENABLED(CONFIG_ACPI_APEI_PCIEAER))
+		return true;
+
+	return false;
+}
+
+/*
+ * Do we have an error handler that we can safely reach? We're concerned with
+ * being able to notify an error handler by crossing the NMI/IRQ boundary,
+ * being able to schedule_work, and so forth.
+ */
+static bool ghes_has_fatal_handler(struct ghes *ghes)
+{
+	int worst_sev, sec_sev;
+	bool safe = true;
+	struct acpi_hest_generic_data *gdata;
+	const guid_t *section_type;
+	const struct acpi_hest_generic_status *estatus = ghes->estatus;
+
+	apei_estatus_for_each_section(estatus, gdata) {
+		section_type = (guid_t *)gdata->section_type;
+
+		if (guid_equal(section_type, &CPER_SEC_PCIE))
+			safe = ghes_pcie_has_safe_handler(gdata);
+
+		if (!safe)
+			break;
+	}
+
+	return safe;
+}
+
 /*
  * NMI may be triggered on any CPU, so ghes_in_nmi is used for
  * having only one concurrent reader.
@@ -944,7 +983,7 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
 		}
 
 		sev = ghes_cper_severity(ghes->estatus->error_severity);
-		if (sev >= GHES_SEV_FATAL) {
+		if ((sev >= GHES_SEV_FATAL) && !ghes_has_fatal_handler(ghes)) {
 			oops_begin();
 			ghes_print_queued_estatus();
 			__ghes_panic(ghes);
-- 
2.14.3
