linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v11 0/7] Address error and recovery for AER and DPC
@ 2018-02-23  8:23 Oza Pawandeep
  2018-02-23  8:23 ` [PATCH v11 1/7] PCI/AER: Rename error recovery to generic pci naming Oza Pawandeep
                   ` (7 more replies)
  0 siblings, 8 replies; 22+ messages in thread
From: Oza Pawandeep @ 2018-02-23  8:23 UTC (permalink / raw)
  To: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi
  Cc: Oza Pawandeep

This patch set brings in error handling support for DPC

The current implementation of AER and error message broadcasting to the
EP driver is tightly coupled and limited to AER service driver.
It is important to factor out broadcasting and other link handling
callbacks. So that not only when AER gets triggered, but also when DPC get
triggered (for e.g. ERR_FATAL), callbacks are handled appropriately.

DPC should enumerate the devices after recovering the link, which is
achieved by implementing error_resume callback.

Changes since v10:
    Christoph Hellwig's, David Laight's and Randy Dunlap's
    comments addressed.
	> renamed pci_do_recovery to pcie_do_recovery
	> removed inner braces in conditional statements.
	> restrctured the code in pci_wait_for_link 
	> EXPORT_SYMBOL_GPL
Changes since v9:
    Sinan's comments addressed.
	> bool active = true; unnecessary variable removed.
Changes since v8:
    Fixed Kbuild errors.
Changes since v7:
    Rebased the code on pci master
	> https://kernel.googlesource.com/pub/scm/linux/kernel/git/helgaas/pci
Changes since v6:
    Sinan's and Stefan's comments implemented.
	> reordered patch 6 and 7
	> cleaned up
Changes since v5:
    Sinan's and Keith's comments incorporated.
        > made separate patch for mutex
        > unified error repotting codes into driver/pci/pci.h
        > got rid of wait link active/inactive and
          made generic function in driver/pci/pci.c
Changes since v4:
    Bjorn's comments incorporated.
        > Renamed only do_recovery.
        > moved the things more locally to drivers/pci/pci.h
Changes since v3:
    Bjorn's comments incorporated.
        > Made separate patch renaming generic pci_err.c
        > Introduce pci_err.h to contain all the error types and recovery
        > removed all the dependencies on pci.h
Changes since v2:
    Based on feedback from Keith:
    "
    When DPC is triggered due to receipt of an uncorrectable error Message,
    the Requester ID from the Message is recorded in the DPC Error
    Source ID register and that Message is discarded and not forwarded Upstream.
    "
    Removed the patch where AER checks if DPC service is active
Changes since v1:
    Kbuild errors fixed:
        > pci_find_dpc_dev made static
        > ras_event.h updated
        > pci_find_aer_service call with CONFIG check
        > pci_find_dpc_service call with CONFIG check

Oza Pawandeep (7):
  PCI/AER: Rename error recovery to generic pci naming
  PCI/AER: factor out error reporting from AER
  PCI/ERR: add mutex to synchronize recovery
  PCI/DPC: Unify and plumb error handling into DPC
  PCI/AER: Unify aer error defines at single space
  PCI/DPC: Enumerate the devices after DPC trigger event
  PCI: Unify wait for link active into generic pci

 drivers/acpi/apei/ghes.c               |   1 +
 drivers/pci/hotplug/pciehp_hpc.c       |  21 +-
 drivers/pci/pci.c                      |  39 +++-
 drivers/pci/pci.h                      |  11 +
 drivers/pci/pcie/Makefile              |   2 +-
 drivers/pci/pcie/aer/aerdrv.h          |  30 ---
 drivers/pci/pcie/aer/aerdrv_core.c     | 293 +-------------------------
 drivers/pci/pcie/aer/aerdrv_errprint.c |   1 +
 drivers/pci/pcie/pcie-dpc.c            | 115 ++++++++++-
 drivers/pci/pcie/pcie-err.c            | 366 +++++++++++++++++++++++++++++++++
 drivers/pci/pcie/portdrv.h             |   2 +
 include/linux/aer.h                    |   4 -
 include/linux/pci.h                    |   1 +
 13 files changed, 534 insertions(+), 352 deletions(-)
 create mode 100644 drivers/pci/pcie/pcie-err.c

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.,
a Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v11 1/7] PCI/AER: Rename error recovery to generic pci naming
  2018-02-23  8:23 [PATCH v11 0/7] Address error and recovery for AER and DPC Oza Pawandeep
@ 2018-02-23  8:23 ` Oza Pawandeep
  2018-02-23  8:23 ` [PATCH v11 2/7] PCI/AER: factor out error reporting from AER Oza Pawandeep
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 22+ messages in thread
From: Oza Pawandeep @ 2018-02-23  8:23 UTC (permalink / raw)
  To: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi
  Cc: Oza Pawandeep

This patch renames error recovery to generic name with pci prefix

Signed-off-by: Oza Pawandeep <poza@codeaurora.org>

diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
index a4bfea5..aeb83a0 100644
--- a/drivers/pci/pcie/aer/aerdrv_core.c
+++ b/drivers/pci/pcie/aer/aerdrv_core.c
@@ -478,7 +478,7 @@ static pci_ers_result_t reset_link(struct pci_dev *dev)
 }
 
 /**
- * do_recovery - handle nonfatal/fatal error recovery process
+ * pcie_do_recovery - handle nonfatal/fatal error recovery process
  * @dev: pointer to a pci_dev data structure of agent detecting an error
  * @severity: error severity type
  *
@@ -486,7 +486,7 @@ static pci_ers_result_t reset_link(struct pci_dev *dev)
  * error detected message to all downstream drivers within a hierarchy in
  * question and return the returned code.
  */
-static void do_recovery(struct pci_dev *dev, int severity)
+static void pcie_do_recovery(struct pci_dev *dev, int severity)
 {
 	pci_ers_result_t status, result = PCI_ERS_RESULT_RECOVERED;
 	enum pci_channel_state state;
@@ -566,7 +566,7 @@ static void handle_error_source(struct pcie_device *aerdev,
 			pci_write_config_dword(dev, pos + PCI_ERR_COR_STATUS,
 					info->status);
 	} else
-		do_recovery(dev, info->severity);
+		pcie_do_recovery(dev, info->severity);
 }
 
 #ifdef CONFIG_ACPI_APEI_PCIEAER
@@ -631,7 +631,7 @@ static void aer_recover_work_func(struct work_struct *work)
 		}
 		cper_print_aer(pdev, entry.severity, entry.regs);
 		if (entry.severity != AER_CORRECTABLE)
-			do_recovery(pdev, entry.severity);
+			pcie_do_recovery(pdev, entry.severity);
 		pci_dev_put(pdev);
 	}
 }
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.,
a Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v11 2/7] PCI/AER: factor out error reporting from AER
  2018-02-23  8:23 [PATCH v11 0/7] Address error and recovery for AER and DPC Oza Pawandeep
  2018-02-23  8:23 ` [PATCH v11 1/7] PCI/AER: Rename error recovery to generic pci naming Oza Pawandeep
@ 2018-02-23  8:23 ` Oza Pawandeep
  2018-02-23 23:42   ` Bjorn Helgaas
  2018-02-23  8:24 ` [PATCH v11 3/7] PCI/ERR: add mutex to synchronize recovery Oza Pawandeep
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Oza Pawandeep @ 2018-02-23  8:23 UTC (permalink / raw)
  To: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi
  Cc: Oza Pawandeep

This patch factors out error reporting callbacks, which are currently
tightly coupled with AER.
DPC should be able to register callbacks and attmept recovery when DPC
trigger event occurs.

Signed-off-by: Oza Pawandeep <poza@codeaurora.org>

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index fcd8191..abc514e 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -342,6 +342,9 @@ static inline resource_size_t pci_resource_alignment(struct pci_dev *dev,
 
 void pci_enable_acs(struct pci_dev *dev);
 
+/* PCI error reporting and recovery */
+void pcie_do_recovery(struct pci_dev *dev, int severity);
+
 #ifdef CONFIG_PCIEASPM
 void pcie_aspm_init_link_state(struct pci_dev *pdev);
 void pcie_aspm_exit_link_state(struct pci_dev *pdev);
diff --git a/drivers/pci/pcie/Makefile b/drivers/pci/pcie/Makefile
index 223e4c3..d669497 100644
--- a/drivers/pci/pcie/Makefile
+++ b/drivers/pci/pcie/Makefile
@@ -6,7 +6,7 @@
 # Build PCI Express ASPM if needed
 obj-$(CONFIG_PCIEASPM)		+= aspm.o
 
-pcieportdrv-y			:= portdrv_core.o portdrv_pci.o portdrv_bus.o
+pcieportdrv-y			:= portdrv_core.o portdrv_pci.o portdrv_bus.o pcie-err.o
 pcieportdrv-$(CONFIG_ACPI)	+= portdrv_acpi.o
 
 obj-$(CONFIG_PCIEPORTBUS)	+= pcieportdrv.o
diff --git a/drivers/pci/pcie/aer/aerdrv.h b/drivers/pci/pcie/aer/aerdrv.h
index 5449e5c..bc9db53 100644
--- a/drivers/pci/pcie/aer/aerdrv.h
+++ b/drivers/pci/pcie/aer/aerdrv.h
@@ -76,36 +76,6 @@ struct aer_rpc {
 					 */
 };
 
-struct aer_broadcast_data {
-	enum pci_channel_state state;
-	enum pci_ers_result result;
-};
-
-static inline pci_ers_result_t merge_result(enum pci_ers_result orig,
-		enum pci_ers_result new)
-{
-	if (new == PCI_ERS_RESULT_NO_AER_DRIVER)
-		return PCI_ERS_RESULT_NO_AER_DRIVER;
-
-	if (new == PCI_ERS_RESULT_NONE)
-		return orig;
-
-	switch (orig) {
-	case PCI_ERS_RESULT_CAN_RECOVER:
-	case PCI_ERS_RESULT_RECOVERED:
-		orig = new;
-		break;
-	case PCI_ERS_RESULT_DISCONNECT:
-		if (new == PCI_ERS_RESULT_NEED_RESET)
-			orig = PCI_ERS_RESULT_NEED_RESET;
-		break;
-	default:
-		break;
-	}
-
-	return orig;
-}
-
 extern struct bus_type pcie_port_bus_type;
 void aer_isr(struct work_struct *work);
 void aer_print_error(struct pci_dev *dev, struct aer_err_info *info);
diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
index aeb83a0..f60b1bb 100644
--- a/drivers/pci/pcie/aer/aerdrv_core.c
+++ b/drivers/pci/pcie/aer/aerdrv_core.c
@@ -23,6 +23,7 @@
 #include <linux/slab.h>
 #include <linux/kfifo.h>
 #include "aerdrv.h"
+#include "../../pci.h"
 
 #define	PCI_EXP_AER_FLAGS	(PCI_EXP_DEVCTL_CERE | PCI_EXP_DEVCTL_NFERE | \
 				 PCI_EXP_DEVCTL_FERE | PCI_EXP_DEVCTL_URRE)
@@ -230,191 +231,6 @@ static bool find_source_device(struct pci_dev *parent,
 	return true;
 }
 
-static int report_error_detected(struct pci_dev *dev, void *data)
-{
-	pci_ers_result_t vote;
-	const struct pci_error_handlers *err_handler;
-	struct aer_broadcast_data *result_data;
-	result_data = (struct aer_broadcast_data *) data;
-
-	device_lock(&dev->dev);
-	dev->error_state = result_data->state;
-
-	if (!dev->driver ||
-		!dev->driver->err_handler ||
-		!dev->driver->err_handler->error_detected) {
-		if (result_data->state == pci_channel_io_frozen &&
-			dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
-			/*
-			 * In case of fatal recovery, if one of down-
-			 * stream device has no driver. We might be
-			 * unable to recover because a later insmod
-			 * of a driver for this device is unaware of
-			 * its hw state.
-			 */
-			pci_printk(KERN_DEBUG, dev, "device has %s\n",
-				   dev->driver ?
-				   "no AER-aware driver" : "no driver");
-		}
-
-		/*
-		 * If there's any device in the subtree that does not
-		 * have an error_detected callback, returning
-		 * PCI_ERS_RESULT_NO_AER_DRIVER prevents calling of
-		 * the subsequent mmio_enabled/slot_reset/resume
-		 * callbacks of "any" device in the subtree. All the
-		 * devices in the subtree are left in the error state
-		 * without recovery.
-		 */
-
-		if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE)
-			vote = PCI_ERS_RESULT_NO_AER_DRIVER;
-		else
-			vote = PCI_ERS_RESULT_NONE;
-	} else {
-		err_handler = dev->driver->err_handler;
-		vote = err_handler->error_detected(dev, result_data->state);
-		pci_uevent_ers(dev, PCI_ERS_RESULT_NONE);
-	}
-
-	result_data->result = merge_result(result_data->result, vote);
-	device_unlock(&dev->dev);
-	return 0;
-}
-
-static int report_mmio_enabled(struct pci_dev *dev, void *data)
-{
-	pci_ers_result_t vote;
-	const struct pci_error_handlers *err_handler;
-	struct aer_broadcast_data *result_data;
-	result_data = (struct aer_broadcast_data *) data;
-
-	device_lock(&dev->dev);
-	if (!dev->driver ||
-		!dev->driver->err_handler ||
-		!dev->driver->err_handler->mmio_enabled)
-		goto out;
-
-	err_handler = dev->driver->err_handler;
-	vote = err_handler->mmio_enabled(dev);
-	result_data->result = merge_result(result_data->result, vote);
-out:
-	device_unlock(&dev->dev);
-	return 0;
-}
-
-static int report_slot_reset(struct pci_dev *dev, void *data)
-{
-	pci_ers_result_t vote;
-	const struct pci_error_handlers *err_handler;
-	struct aer_broadcast_data *result_data;
-	result_data = (struct aer_broadcast_data *) data;
-
-	device_lock(&dev->dev);
-	if (!dev->driver ||
-		!dev->driver->err_handler ||
-		!dev->driver->err_handler->slot_reset)
-		goto out;
-
-	err_handler = dev->driver->err_handler;
-	vote = err_handler->slot_reset(dev);
-	result_data->result = merge_result(result_data->result, vote);
-out:
-	device_unlock(&dev->dev);
-	return 0;
-}
-
-static int report_resume(struct pci_dev *dev, void *data)
-{
-	const struct pci_error_handlers *err_handler;
-
-	device_lock(&dev->dev);
-	dev->error_state = pci_channel_io_normal;
-
-	if (!dev->driver ||
-		!dev->driver->err_handler ||
-		!dev->driver->err_handler->resume)
-		goto out;
-
-	err_handler = dev->driver->err_handler;
-	err_handler->resume(dev);
-	pci_uevent_ers(dev, PCI_ERS_RESULT_RECOVERED);
-out:
-	device_unlock(&dev->dev);
-	return 0;
-}
-
-/**
- * broadcast_error_message - handle message broadcast to downstream drivers
- * @dev: pointer to from where in a hierarchy message is broadcasted down
- * @state: error state
- * @error_mesg: message to print
- * @cb: callback to be broadcasted
- *
- * Invoked during error recovery process. Once being invoked, the content
- * of error severity will be broadcasted to all downstream drivers in a
- * hierarchy in question.
- */
-static pci_ers_result_t broadcast_error_message(struct pci_dev *dev,
-	enum pci_channel_state state,
-	char *error_mesg,
-	int (*cb)(struct pci_dev *, void *))
-{
-	struct aer_broadcast_data result_data;
-
-	pci_printk(KERN_DEBUG, dev, "broadcast %s message\n", error_mesg);
-	result_data.state = state;
-	if (cb == report_error_detected)
-		result_data.result = PCI_ERS_RESULT_CAN_RECOVER;
-	else
-		result_data.result = PCI_ERS_RESULT_RECOVERED;
-
-	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
-		/*
-		 * If the error is reported by a bridge, we think this error
-		 * is related to the downstream link of the bridge, so we
-		 * do error recovery on all subordinates of the bridge instead
-		 * of the bridge and clear the error status of the bridge.
-		 */
-		if (cb == report_error_detected)
-			dev->error_state = state;
-		pci_walk_bus(dev->subordinate, cb, &result_data);
-		if (cb == report_resume) {
-			pci_cleanup_aer_uncorrect_error_status(dev);
-			dev->error_state = pci_channel_io_normal;
-		}
-	} else {
-		/*
-		 * If the error is reported by an end point, we think this
-		 * error is related to the upstream link of the end point.
-		 */
-		if (state == pci_channel_io_normal)
-			/*
-			 * the error is non fatal so the bus is ok, just invoke
-			 * the callback for the function that logged the error.
-			 */
-			cb(dev, &result_data);
-		else
-			pci_walk_bus(dev->bus, cb, &result_data);
-	}
-
-	return result_data.result;
-}
-
-/**
- * default_reset_link - default reset function
- * @dev: pointer to pci_dev data structure
- *
- * Invoked when performing link reset on a Downstream Port or a
- * Root Port with no aer driver.
- */
-static pci_ers_result_t default_reset_link(struct pci_dev *dev)
-{
-	pci_reset_bridge_secondary_bus(dev);
-	pci_printk(KERN_DEBUG, dev, "downstream link has been reset\n");
-	return PCI_ERS_RESULT_RECOVERED;
-}
-
 static int find_aer_service_iter(struct device *device, void *data)
 {
 	struct pcie_port_service_driver *service_driver, **drv;
@@ -432,7 +248,7 @@ static int find_aer_service_iter(struct device *device, void *data)
 	return 0;
 }
 
-static struct pcie_port_service_driver *find_aer_service(struct pci_dev *dev)
+struct pcie_port_service_driver *pci_find_aer_service(struct pci_dev *dev)
 {
 	struct pcie_port_service_driver *drv = NULL;
 
@@ -440,107 +256,7 @@ static struct pcie_port_service_driver *find_aer_service(struct pci_dev *dev)
 
 	return drv;
 }
-
-static pci_ers_result_t reset_link(struct pci_dev *dev)
-{
-	struct pci_dev *udev;
-	pci_ers_result_t status;
-	struct pcie_port_service_driver *driver;
-
-	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
-		/* Reset this port for all subordinates */
-		udev = dev;
-	} else {
-		/* Reset the upstream component (likely downstream port) */
-		udev = dev->bus->self;
-	}
-
-	/* Use the aer driver of the component firstly */
-	driver = find_aer_service(udev);
-
-	if (driver && driver->reset_link) {
-		status = driver->reset_link(udev);
-	} else if (udev->has_secondary_link) {
-		status = default_reset_link(udev);
-	} else {
-		pci_printk(KERN_DEBUG, dev, "no link-reset support at upstream device %s\n",
-			pci_name(udev));
-		return PCI_ERS_RESULT_DISCONNECT;
-	}
-
-	if (status != PCI_ERS_RESULT_RECOVERED) {
-		pci_printk(KERN_DEBUG, dev, "link reset at upstream device %s failed\n",
-			pci_name(udev));
-		return PCI_ERS_RESULT_DISCONNECT;
-	}
-
-	return status;
-}
-
-/**
- * pcie_do_recovery - handle nonfatal/fatal error recovery process
- * @dev: pointer to a pci_dev data structure of agent detecting an error
- * @severity: error severity type
- *
- * Invoked when an error is nonfatal/fatal. Once being invoked, broadcast
- * error detected message to all downstream drivers within a hierarchy in
- * question and return the returned code.
- */
-static void pcie_do_recovery(struct pci_dev *dev, int severity)
-{
-	pci_ers_result_t status, result = PCI_ERS_RESULT_RECOVERED;
-	enum pci_channel_state state;
-
-	if (severity == AER_FATAL)
-		state = pci_channel_io_frozen;
-	else
-		state = pci_channel_io_normal;
-
-	status = broadcast_error_message(dev,
-			state,
-			"error_detected",
-			report_error_detected);
-
-	if (severity == AER_FATAL) {
-		result = reset_link(dev);
-		if (result != PCI_ERS_RESULT_RECOVERED)
-			goto failed;
-	}
-
-	if (status == PCI_ERS_RESULT_CAN_RECOVER)
-		status = broadcast_error_message(dev,
-				state,
-				"mmio_enabled",
-				report_mmio_enabled);
-
-	if (status == PCI_ERS_RESULT_NEED_RESET) {
-		/*
-		 * TODO: Should call platform-specific
-		 * functions to reset slot before calling
-		 * drivers' slot_reset callbacks?
-		 */
-		status = broadcast_error_message(dev,
-				state,
-				"slot_reset",
-				report_slot_reset);
-	}
-
-	if (status != PCI_ERS_RESULT_RECOVERED)
-		goto failed;
-
-	broadcast_error_message(dev,
-				state,
-				"resume",
-				report_resume);
-
-	pci_info(dev, "AER: Device recovery successful\n");
-	return;
-
-failed:
-	pci_uevent_ers(dev, PCI_ERS_RESULT_DISCONNECT);
-	/* TODO: Should kernel panic here? */
-	pci_info(dev, "AER: Device recovery failed\n");
-}
+EXPORT_SYMBOL_GPL(pci_find_aer_service);
 
 /**
  * handle_error_source - handle logging error into an event log
diff --git a/drivers/pci/pcie/pcie-err.c b/drivers/pci/pcie/pcie-err.c
new file mode 100644
index 0000000..fcd5add
--- /dev/null
+++ b/drivers/pci/pcie/pcie-err.c
@@ -0,0 +1,334 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * This file implements the error recovery as a core part of PCIe error reporting.
+ * When a PCIe error is delivered, an error message will be collected and printed
+ * to console, then, an error recovery procedure will be executed by following
+ * the PCI error recovery rules.
+ *
+ * Copyright (C) 2006 Intel Corp.
+ *	Tom Long Nguyen (tom.l.nguyen@intel.com)
+ *	Zhang Yanmin (yanmin.zhang@intel.com)
+ *
+ */
+
+#include <linux/pci.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/aer.h>
+#include <linux/pcieport_if.h>
+#include "portdrv.h"
+
+struct aer_broadcast_data {
+	enum pci_channel_state state;
+	enum pci_ers_result result;
+};
+
+static pci_ers_result_t merge_result(enum pci_ers_result orig,
+				  enum pci_ers_result new)
+{
+	if (new == PCI_ERS_RESULT_NO_AER_DRIVER)
+		return PCI_ERS_RESULT_NO_AER_DRIVER;
+
+	if (new == PCI_ERS_RESULT_NONE)
+		return orig;
+
+	switch (orig) {
+	case PCI_ERS_RESULT_CAN_RECOVER:
+	case PCI_ERS_RESULT_RECOVERED:
+		orig = new;
+		break;
+	case PCI_ERS_RESULT_DISCONNECT:
+		if (new == PCI_ERS_RESULT_NEED_RESET)
+			orig = PCI_ERS_RESULT_NEED_RESET;
+		break;
+	default:
+		break;
+	}
+
+	return orig;
+}
+
+static int report_mmio_enabled(struct pci_dev *dev, void *data)
+{
+	pci_ers_result_t vote;
+	const struct pci_error_handlers *err_handler;
+	struct aer_broadcast_data *result_data;
+
+	result_data = (struct aer_broadcast_data *) data;
+
+	device_lock(&dev->dev);
+	if (!dev->driver ||
+		!dev->driver->err_handler ||
+		!dev->driver->err_handler->mmio_enabled)
+		goto out;
+
+	err_handler = dev->driver->err_handler;
+	vote = err_handler->mmio_enabled(dev);
+	result_data->result = merge_result(result_data->result, vote);
+out:
+	device_unlock(&dev->dev);
+	return 0;
+}
+
+static int report_slot_reset(struct pci_dev *dev, void *data)
+{
+	pci_ers_result_t vote;
+	const struct pci_error_handlers *err_handler;
+	struct aer_broadcast_data *result_data;
+
+	result_data = (struct aer_broadcast_data *) data;
+
+	device_lock(&dev->dev);
+	if (!dev->driver ||
+		!dev->driver->err_handler ||
+		!dev->driver->err_handler->slot_reset)
+		goto out;
+
+	err_handler = dev->driver->err_handler;
+	vote = err_handler->slot_reset(dev);
+	result_data->result = merge_result(result_data->result, vote);
+out:
+	device_unlock(&dev->dev);
+	return 0;
+}
+
+static int report_resume(struct pci_dev *dev, void *data)
+{
+	const struct pci_error_handlers *err_handler;
+
+	device_lock(&dev->dev);
+	dev->error_state = pci_channel_io_normal;
+
+	if (!dev->driver ||
+		!dev->driver->err_handler ||
+		!dev->driver->err_handler->resume)
+		goto out;
+
+	err_handler = dev->driver->err_handler;
+	err_handler->resume(dev);
+out:
+	device_unlock(&dev->dev);
+	return 0;
+}
+
+static int report_error_detected(struct pci_dev *dev, void *data)
+{
+	pci_ers_result_t vote;
+	const struct pci_error_handlers *err_handler;
+	struct aer_broadcast_data *result_data;
+
+	result_data = (struct aer_broadcast_data *) data;
+
+	device_lock(&dev->dev);
+	dev->error_state = result_data->state;
+
+	if (!dev->driver ||
+		!dev->driver->err_handler ||
+		!dev->driver->err_handler->error_detected) {
+		if (result_data->state == pci_channel_io_frozen &&
+			dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
+			/*
+			 * In case of fatal recovery, if one of down-
+			 * stream device has no driver. We might be
+			 * unable to recover because a later insmod
+			 * of a driver for this device is unaware of
+			 * its hw state.
+			 */
+			dev_printk(KERN_DEBUG, &dev->dev, "device has %s\n",
+				   dev->driver ?
+				   "no error-aware driver" : "no driver");
+		}
+
+		/*
+		 * If there's any device in the subtree that does not
+		 * have an error_detected callback, returning
+		 * PCI_ERS_RESULT_NO_AER_DRIVER prevents calling of
+		 * the subsequent mmio_enabled/slot_reset/resume
+		 * callbacks of "any" device in the subtree. All the
+		 * devices in the subtree are left in the error state
+		 * without recovery.
+		 */
+
+		if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE)
+			vote = PCI_ERS_RESULT_NO_AER_DRIVER;
+		else
+			vote = PCI_ERS_RESULT_NONE;
+	} else {
+		err_handler = dev->driver->err_handler;
+		vote = err_handler->error_detected(dev, result_data->state);
+	}
+
+	result_data->result = merge_result(result_data->result, vote);
+	device_unlock(&dev->dev);
+	return 0;
+}
+
+/**
+ * default_reset_link - default reset function
+ * @dev: pointer to pci_dev data structure
+ *
+ * Invoked when performing link reset on a Downstream Port or a
+ * Root Port with no aer driver.
+ */
+static pci_ers_result_t default_reset_link(struct pci_dev *dev)
+{
+	pci_reset_bridge_secondary_bus(dev);
+	dev_printk(KERN_DEBUG, &dev->dev, "downstream link has been reset\n");
+	return PCI_ERS_RESULT_RECOVERED;
+}
+
+static pci_ers_result_t reset_link(struct pci_dev *dev)
+{
+	struct pci_dev *udev;
+	pci_ers_result_t status;
+	struct pcie_port_service_driver *driver = NULL;
+
+	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
+		/* Reset this port for all subordinates */
+		udev = dev;
+	} else {
+		/* Reset the upstream component (likely downstream port) */
+		udev = dev->bus->self;
+	}
+
+#if IS_ENABLED(CONFIG_PCIEAER)
+	/* Use the aer driver of the component firstly */
+	driver = pci_find_aer_service(udev);
+#endif
+
+	if (driver && driver->reset_link) {
+		status = driver->reset_link(udev);
+	} else if (udev->has_secondary_link) {
+		status = default_reset_link(udev);
+	} else {
+		dev_printk(KERN_DEBUG, &dev->dev,
+			"no link-reset support at upstream device %s\n",
+			pci_name(udev));
+		return PCI_ERS_RESULT_DISCONNECT;
+	}
+
+	if (status != PCI_ERS_RESULT_RECOVERED) {
+		dev_printk(KERN_DEBUG, &dev->dev,
+			"link reset at upstream device %s failed\n",
+			pci_name(udev));
+		return PCI_ERS_RESULT_DISCONNECT;
+	}
+
+	return status;
+}
+
+/**
+ * broadcast_error_message - handle message broadcast to downstream drivers
+ * @dev: pointer to where in a hierarchy message is broadcasted down
+ * @state: error state
+ * @error_mesg: message to print
+ * @cb: callback to be broadcast
+ *
+ * Invoked during error recovery process. Once being invoked, the content
+ * of error severity will be broadcast to all downstream drivers in a
+ * hierarchy in question.
+ */
+static pci_ers_result_t broadcast_error_message(struct pci_dev *dev,
+	enum pci_channel_state state,
+	char *error_mesg,
+	int (*cb)(struct pci_dev *, void *))
+{
+	struct aer_broadcast_data result_data;
+
+	dev_printk(KERN_DEBUG, &dev->dev, "broadcast %s message\n", error_mesg);
+	result_data.state = state;
+	if (cb == report_error_detected)
+		result_data.result = PCI_ERS_RESULT_CAN_RECOVER;
+	else
+		result_data.result = PCI_ERS_RESULT_RECOVERED;
+
+	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
+		/*
+		 * If the error is reported by a bridge, we think this error
+		 * is related to the downstream link of the bridge, so we
+		 * do error recovery on all subordinates of the bridge instead
+		 * of the bridge and clear the error status of the bridge.
+		 */
+		if (cb == report_error_detected)
+			dev->error_state = state;
+		pci_walk_bus(dev->subordinate, cb, &result_data);
+		if (cb == report_resume) {
+			pci_cleanup_aer_uncorrect_error_status(dev);
+			dev->error_state = pci_channel_io_normal;
+		}
+	} else {
+		/*
+		 * If the error is reported by an end point, we think this
+		 * error is related to the upstream link of the end point.
+		 */
+		pci_walk_bus(dev->bus, cb, &result_data);
+	}
+
+	return result_data.result;
+}
+
+/**
+ * pcie_do_recovery - handle nonfatal/fatal error recovery process
+ * @dev: pointer to a pci_dev data structure of agent detecting an error
+ * @severity: error severity type
+ *
+ * Invoked when an error is nonfatal/fatal. Once being invoked, broadcast
+ * error detected message to all downstream drivers within a hierarchy in
+ * question and return the returned code.
+ */
+void pcie_do_recovery(struct pci_dev *dev, int severity)
+{
+	pci_ers_result_t status, result = PCI_ERS_RESULT_RECOVERED;
+	enum pci_channel_state state;
+
+	if (severity == AER_FATAL)
+		state = pci_channel_io_frozen;
+	else
+		state = pci_channel_io_normal;
+
+	status = broadcast_error_message(dev,
+			state,
+			"error_detected",
+			report_error_detected);
+
+	if (severity == AER_FATAL) {
+		result = reset_link(dev);
+		if (result != PCI_ERS_RESULT_RECOVERED)
+			goto failed;
+	}
+
+	if (status == PCI_ERS_RESULT_CAN_RECOVER)
+		status = broadcast_error_message(dev,
+				state,
+				"mmio_enabled",
+				report_mmio_enabled);
+
+	if (status == PCI_ERS_RESULT_NEED_RESET) {
+		/*
+		 * TODO: Should call platform-specific
+		 * functions to reset slot before calling
+		 * drivers' slot_reset callbacks?
+		 */
+		status = broadcast_error_message(dev,
+				state,
+				"slot_reset",
+				report_slot_reset);
+	}
+
+	if (status != PCI_ERS_RESULT_RECOVERED)
+		goto failed;
+
+	broadcast_error_message(dev,
+				state,
+				"resume",
+				report_resume);
+
+	dev_info(&dev->dev, "Device recovery successful\n");
+	return;
+
+failed:
+	/* TODO: Should kernel panic here? */
+	dev_info(&dev->dev, "Device recovery failed\n");
+}
diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
index a854bc5..4f1992d 100644
--- a/drivers/pci/pcie/portdrv.h
+++ b/drivers/pci/pcie/portdrv.h
@@ -79,4 +79,5 @@ static inline void pcie_port_platform_notify(struct pci_dev *port, int *mask)
 static inline void pcie_port_platform_notify(struct pci_dev *port, int *mask){}
 #endif /* !CONFIG_ACPI */
 
+struct pcie_port_service_driver *pci_find_aer_service(struct pci_dev *dev);
 #endif /* _PORTDRV_H_ */
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.,
a Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v11 3/7] PCI/ERR: add mutex to synchronize recovery
  2018-02-23  8:23 [PATCH v11 0/7] Address error and recovery for AER and DPC Oza Pawandeep
  2018-02-23  8:23 ` [PATCH v11 1/7] PCI/AER: Rename error recovery to generic pci naming Oza Pawandeep
  2018-02-23  8:23 ` [PATCH v11 2/7] PCI/AER: factor out error reporting from AER Oza Pawandeep
@ 2018-02-23  8:24 ` Oza Pawandeep
  2018-02-23 23:45   ` Bjorn Helgaas
  2018-02-23  8:24 ` [PATCH v11 4/7] PCI/DPC: Unify and plumb error handling into DPC Oza Pawandeep
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Oza Pawandeep @ 2018-02-23  8:24 UTC (permalink / raw)
  To: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi
  Cc: Oza Pawandeep

This patch protects pci_do_recovery with mutex.

Signed-off-by: Oza Pawandeep <poza@codeaurora.org>

diff --git a/drivers/pci/pcie/pcie-err.c b/drivers/pci/pcie/pcie-err.c
index fcd5add..f830975 100644
--- a/drivers/pci/pcie/pcie-err.c
+++ b/drivers/pci/pcie/pcie-err.c
@@ -20,6 +20,8 @@
 #include <linux/pcieport_if.h>
 #include "portdrv.h"
 
+static DEFINE_MUTEX(pci_err_recovery_lock);
+
 struct aer_broadcast_data {
 	enum pci_channel_state state;
 	enum pci_ers_result result;
@@ -283,6 +285,8 @@ void pcie_do_recovery(struct pci_dev *dev, int severity)
 	pci_ers_result_t status, result = PCI_ERS_RESULT_RECOVERED;
 	enum pci_channel_state state;
 
+	mutex_lock(&pci_err_recovery_lock);
+
 	if (severity == AER_FATAL)
 		state = pci_channel_io_frozen;
 	else
@@ -326,9 +330,11 @@ void pcie_do_recovery(struct pci_dev *dev, int severity)
 				report_resume);
 
 	dev_info(&dev->dev, "Device recovery successful\n");
+	mutex_unlock(&pci_err_recovery_lock);
 	return;
 
 failed:
 	/* TODO: Should kernel panic here? */
 	dev_info(&dev->dev, "Device recovery failed\n");
+	mutex_unlock(&pci_err_recovery_lock);
 }
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.,
a Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v11 4/7] PCI/DPC: Unify and plumb error handling into DPC
  2018-02-23  8:23 [PATCH v11 0/7] Address error and recovery for AER and DPC Oza Pawandeep
                   ` (2 preceding siblings ...)
  2018-02-23  8:24 ` [PATCH v11 3/7] PCI/ERR: add mutex to synchronize recovery Oza Pawandeep
@ 2018-02-23  8:24 ` Oza Pawandeep
  2018-02-24  0:07   ` Bjorn Helgaas
  2018-02-23  8:24 ` [PATCH v11 5/7] PCI/AER: Unify aer error defines at single space Oza Pawandeep
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Oza Pawandeep @ 2018-02-23  8:24 UTC (permalink / raw)
  To: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi
  Cc: Oza Pawandeep

Current DPC driver does not do recovery, e.g. calling end-point's driver's
callbacks, which sanitize the sw.

DPC driver implements link_reset callback, and calls pcie_do_recovery.

Signed-off-by: Oza Pawandeep <poza@codeaurora.org>

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index abc514e..f8575da 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -343,6 +343,8 @@ static inline resource_size_t pci_resource_alignment(struct pci_dev *dev,
 void pci_enable_acs(struct pci_dev *dev);
 
 /* PCI error reporting and recovery */
+#define DPC_FATAL	4
+
 void pcie_do_recovery(struct pci_dev *dev, int severity);
 
 #ifdef CONFIG_PCIEASPM
diff --git a/drivers/pci/pcie/pcie-dpc.c b/drivers/pci/pcie/pcie-dpc.c
index 38e40c6..5c01c63 100644
--- a/drivers/pci/pcie/pcie-dpc.c
+++ b/drivers/pci/pcie/pcie-dpc.c
@@ -13,6 +13,7 @@
 #include <linux/pcieport_if.h>
 #include "../pci.h"
 #include "aer/aerdrv.h"
+#include "portdrv.h"
 
 struct dpc_dev {
 	struct pcie_device	*dev;
@@ -45,6 +46,58 @@ struct dpc_dev {
 	"Memory Request Completion Timeout",		 /* Bit Position 18 */
 };
 
+static int find_dpc_dev_iter(struct device *device, void *data)
+{
+	struct pcie_port_service_driver *service_driver;
+	struct device **dev = (struct device **) data;;
+
+	if (device->bus == &pcie_port_bus_type && device->driver) {
+		service_driver = to_service_driver(device->driver);
+		if (service_driver->service == PCIE_PORT_SERVICE_DPC) {
+			*dev = device;
+			return 1;
+		}
+	}
+
+	return 0;
+}
+
+static struct device *pci_find_dpc_dev(struct pci_dev *pdev)
+{
+	struct device *dev = NULL;
+
+	device_for_each_child(&pdev->dev, &dev, find_dpc_dev_iter);
+
+	return dev;
+}
+
+static int find_dpc_service_iter(struct device *device, void *data)
+{
+	struct pcie_port_service_driver *service_driver;
+	struct pcie_port_service_driver **drv =
+		(struct pcie_port_service_driver **) data;
+
+	if (device->bus == &pcie_port_bus_type && device->driver) {
+		service_driver = to_service_driver(device->driver);
+		if (service_driver->service == PCIE_PORT_SERVICE_DPC) {
+			*drv = service_driver;
+			return 1;
+		}
+	}
+
+	return 0;
+}
+
+struct pcie_port_service_driver *pci_find_dpc_service(struct pci_dev *dev)
+{
+	struct pcie_port_service_driver *drv = NULL;
+
+	device_for_each_child(&dev->dev, &drv, find_dpc_service_iter);
+
+	return drv;
+}
+EXPORT_SYMBOL_GPL(pci_find_dpc_service);
+
 static int dpc_wait_rp_inactive(struct dpc_dev *dpc)
 {
 	unsigned long timeout = jiffies + HZ;
@@ -82,12 +135,25 @@ static void dpc_wait_link_inactive(struct dpc_dev *dpc)
 		dev_warn(dev, "Link state not disabled for DPC event\n");
 }
 
-static void dpc_work(struct work_struct *work)
+/**
+ * dpc_reset_link - reset link DPC  routine
+ * @dev: pointer to Root Port's pci_dev data structure
+ *
+ * Invoked by Port Bus driver when performing link reset at Root Port.
+ */
+static pci_ers_result_t dpc_reset_link(struct pci_dev *pdev)
 {
-	struct dpc_dev *dpc = container_of(work, struct dpc_dev, work);
-	struct pci_dev *dev, *temp, *pdev = dpc->dev->port;
 	struct pci_bus *parent = pdev->subordinate;
-	u16 cap = dpc->cap_pos, ctl;
+	struct pci_dev *dev, *temp;
+	struct dpc_dev *dpc;
+	struct pcie_device *pciedev;
+	struct device *devdpc;
+	u16 cap, ctl;
+
+	devdpc = pci_find_dpc_dev(pdev);
+	pciedev = to_pcie_device(devdpc);
+	dpc = get_service_data(pciedev);
+	cap = dpc->cap_pos;
 
 	pci_lock_rescan_remove();
 	list_for_each_entry_safe_reverse(dev, temp, &parent->devices,
@@ -104,21 +170,31 @@ static void dpc_work(struct work_struct *work)
 
 	dpc_wait_link_inactive(dpc);
 	if (dpc->rp_extensions && dpc_wait_rp_inactive(dpc))
-		return;
+		return PCI_ERS_RESULT_DISCONNECT;
 	if (dpc->rp_extensions && dpc->rp_pio_status) {
 		pci_write_config_dword(pdev, cap + PCI_EXP_DPC_RP_PIO_STATUS,
 				       dpc->rp_pio_status);
 		dpc->rp_pio_status = 0;
 	}
 
-	pci_write_config_word(pdev, cap + PCI_EXP_DPC_STATUS,
+	pci_write_config_word(pdev, dpc->cap_pos + PCI_EXP_DPC_STATUS,
 		PCI_EXP_DPC_STATUS_TRIGGER | PCI_EXP_DPC_STATUS_INTERRUPT);
 
 	pci_read_config_word(pdev, cap + PCI_EXP_DPC_CTL, &ctl);
 	pci_write_config_word(pdev, cap + PCI_EXP_DPC_CTL,
-			      ctl | PCI_EXP_DPC_CTL_INT_EN);
+				ctl | PCI_EXP_DPC_CTL_INT_EN);
+
+	return PCI_ERS_RESULT_RECOVERED;
 }
 
+static void dpc_work(struct work_struct *work)
+{
+	struct dpc_dev *dpc = container_of(work, struct dpc_dev, work);
+	struct pci_dev *pdev = dpc->dev->port;
+
+	/* From DPC point of view error is always FATAL. */
+	pcie_do_recovery(pdev, DPC_FATAL);
+}
 static void dpc_process_rp_pio_error(struct dpc_dev *dpc)
 {
 	struct device *dev = &dpc->dev->device;
@@ -297,6 +373,7 @@ static void dpc_remove(struct pcie_device *dev)
 	.service	= PCIE_PORT_SERVICE_DPC,
 	.probe		= dpc_probe,
 	.remove		= dpc_remove,
+	.reset_link     = dpc_reset_link,
 };
 
 static int __init dpc_service_init(void)
diff --git a/drivers/pci/pcie/pcie-err.c b/drivers/pci/pcie/pcie-err.c
index f830975..1ea4b9a 100644
--- a/drivers/pci/pcie/pcie-err.c
+++ b/drivers/pci/pcie/pcie-err.c
@@ -19,6 +19,7 @@
 #include <linux/aer.h>
 #include <linux/pcieport_if.h>
 #include "portdrv.h"
+#include "./../pci.h"
 
 static DEFINE_MUTEX(pci_err_recovery_lock);
 
@@ -181,7 +182,7 @@ static pci_ers_result_t default_reset_link(struct pci_dev *dev)
 	return PCI_ERS_RESULT_RECOVERED;
 }
 
-static pci_ers_result_t reset_link(struct pci_dev *dev)
+static pci_ers_result_t reset_link(struct pci_dev *dev, int severity)
 {
 	struct pci_dev *udev;
 	pci_ers_result_t status;
@@ -195,9 +196,17 @@ static pci_ers_result_t reset_link(struct pci_dev *dev)
 		udev = dev->bus->self;
 	}
 
+
+	/* Use the service driver of the component firstly */
+#if IS_ENABLED(CONFIG_PCIE_DPC)
+	if (severity == DPC_FATAL)
+		driver = pci_find_dpc_service(udev);
+#endif
 #if IS_ENABLED(CONFIG_PCIEAER)
-	/* Use the aer driver of the component firstly */
-	driver = pci_find_aer_service(udev);
+	if (severity == AER_FATAL ||
+	    severity == AER_NONFATAL ||
+	    severity == AER_CORRECTABLE)
+		driver = pci_find_aer_service(udev);
 #endif
 
 	if (driver && driver->reset_link) {
@@ -287,7 +296,8 @@ void pcie_do_recovery(struct pci_dev *dev, int severity)
 
 	mutex_lock(&pci_err_recovery_lock);
 
-	if (severity == AER_FATAL)
+	if (severity == AER_FATAL ||
+	    severity == DPC_FATAL)
 		state = pci_channel_io_frozen;
 	else
 		state = pci_channel_io_normal;
@@ -297,10 +307,14 @@ void pcie_do_recovery(struct pci_dev *dev, int severity)
 			"error_detected",
 			report_error_detected);
 
-	if (severity == AER_FATAL) {
-		result = reset_link(dev);
+	if (severity == AER_FATAL ||
+	    severity == DPC_FATAL) {
+		result = reset_link(dev, severity);
 		if (result != PCI_ERS_RESULT_RECOVERED)
 			goto failed;
+		else if (severity == DPC_FATAL)
+			goto resume;
+
 	}
 
 	if (status == PCI_ERS_RESULT_CAN_RECOVER)
@@ -324,6 +338,7 @@ void pcie_do_recovery(struct pci_dev *dev, int severity)
 	if (status != PCI_ERS_RESULT_RECOVERED)
 		goto failed;
 
+resume:
 	broadcast_error_message(dev,
 				state,
 				"resume",
diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
index 4f1992d..b013e24 100644
--- a/drivers/pci/pcie/portdrv.h
+++ b/drivers/pci/pcie/portdrv.h
@@ -80,4 +80,5 @@ static inline void pcie_port_platform_notify(struct pci_dev *port, int *mask){}
 #endif /* !CONFIG_ACPI */
 
 struct pcie_port_service_driver *pci_find_aer_service(struct pci_dev *dev);
+struct pcie_port_service_driver *pci_find_dpc_service(struct pci_dev *dev);
 #endif /* _PORTDRV_H_ */
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.,
a Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v11 5/7] PCI/AER: Unify aer error defines at single space
  2018-02-23  8:23 [PATCH v11 0/7] Address error and recovery for AER and DPC Oza Pawandeep
                   ` (3 preceding siblings ...)
  2018-02-23  8:24 ` [PATCH v11 4/7] PCI/DPC: Unify and plumb error handling into DPC Oza Pawandeep
@ 2018-02-23  8:24 ` Oza Pawandeep
  2018-02-24 15:36   ` Bjorn Helgaas
  2018-02-23  8:24 ` [PATCH v11 6/7] PCI: Unify wait for link active into generic pci Oza Pawandeep
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 22+ messages in thread
From: Oza Pawandeep @ 2018-02-23  8:24 UTC (permalink / raw)
  To: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi
  Cc: Oza Pawandeep

This patch moves AER error defines to drivers/pci/pci.h.
So that it unifies the error repoting codes at single place along with dpc

Signed-off-by: Oza Pawandeep <poza@codeaurora.org>

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 1efefe9..7ae9bb3 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -56,6 +56,7 @@
 #include <ras/ras_event.h>
 
 #include "apei-internal.h"
+#include "../../pci/pci.h"
 
 #define GHES_PFX	"GHES: "
 
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index f8575da..c8394ec 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -343,7 +343,11 @@ static inline resource_size_t pci_resource_alignment(struct pci_dev *dev,
 void pci_enable_acs(struct pci_dev *dev);
 
 /* PCI error reporting and recovery */
-#define DPC_FATAL	4
+#define AER_NONFATAL		0
+#define AER_FATAL		1
+#define AER_CORRECTABLE		2
+
+#define DPC_FATAL		4
 
 void pcie_do_recovery(struct pci_dev *dev, int severity);
 
diff --git a/drivers/pci/pcie/aer/aerdrv_errprint.c b/drivers/pci/pcie/aer/aerdrv_errprint.c
index 6a352e6..4c59f37 100644
--- a/drivers/pci/pcie/aer/aerdrv_errprint.c
+++ b/drivers/pci/pcie/aer/aerdrv_errprint.c
@@ -19,6 +19,7 @@
 #include <linux/cper.h>
 
 #include "aerdrv.h"
+#include "../../pci.h"
 #include <ras/ras_event.h>
 
 #define AER_AGENT_RECEIVER		0
diff --git a/include/linux/aer.h b/include/linux/aer.h
index 8f87bbe..3eac8ed 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -11,10 +11,6 @@
 #include <linux/errno.h>
 #include <linux/types.h>
 
-#define AER_NONFATAL			0
-#define AER_FATAL			1
-#define AER_CORRECTABLE			2
-
 struct pci_dev;
 
 struct aer_header_log_regs {
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index 9c68986..d75c75b 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -13,6 +13,7 @@
 #include <linux/aer.h>
 #include <linux/cper.h>
 #include <linux/mm.h>
+#include "../../../drivers/pci/pci.h"
 
 /*
  * MCE Extended Error Log trace event
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.,
a Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v11 6/7] PCI: Unify wait for link active into generic pci
  2018-02-23  8:23 [PATCH v11 0/7] Address error and recovery for AER and DPC Oza Pawandeep
                   ` (4 preceding siblings ...)
  2018-02-23  8:24 ` [PATCH v11 5/7] PCI/AER: Unify aer error defines at single space Oza Pawandeep
@ 2018-02-23  8:24 ` Oza Pawandeep
  2018-02-24 15:41   ` Bjorn Helgaas
  2018-02-23  8:24 ` [PATCH v11 7/7] PCI/DPC: Enumerate the devices after DPC trigger event Oza Pawandeep
  2018-02-23 23:12 ` [PATCH v11 0/7] Address error and recovery for AER and DPC Bjorn Helgaas
  7 siblings, 1 reply; 22+ messages in thread
From: Oza Pawandeep @ 2018-02-23  8:24 UTC (permalink / raw)
  To: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi
  Cc: Oza Pawandeep

Clients such as pciehp, dpc are using pcie_wait_link_active, which waits
till the link becomes active or inactive.

Made generic function and moved it to drivers/pci/pci.c

Signed-off-by: Oza Pawandeep <poza@codeaurora.org>

diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 18a42f8..de9b0ea 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -231,25 +231,11 @@ bool pciehp_check_link_active(struct controller *ctrl)
 	return ret;
 }
 
-static void __pcie_wait_link_active(struct controller *ctrl, bool active)
+static bool pcie_wait_link_active(struct controller *ctrl)
 {
-	int timeout = 1000;
-
-	if (pciehp_check_link_active(ctrl) == active)
-		return;
-	while (timeout > 0) {
-		msleep(10);
-		timeout -= 10;
-		if (pciehp_check_link_active(ctrl) == active)
-			return;
-	}
-	ctrl_dbg(ctrl, "Data Link Layer Link Active not %s in 1000 msec\n",
-			active ? "set" : "cleared");
-}
+	struct pci_dev *pdev = ctrl_dev(ctrl);
 
-static void pcie_wait_link_active(struct controller *ctrl)
-{
-	__pcie_wait_link_active(ctrl, true);
+	return pci_wait_for_link(pdev, true);
 }
 
 static bool pci_bus_check_dev(struct pci_bus *bus, int devfn)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index f6a4dd1..f8d44b4 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4176,6 +4176,37 @@ static int pci_pm_reset(struct pci_dev *dev, int probe)
 	return 0;
 }
 
+/**
+ * pci_wait_for_link - Wait for link till its active/inactive
+ * @pdev: Bridge device
+ * @active: waiting for active or inactive ?
+ *
+ * Use this to wait till link becomes active or inactive.
+ */
+bool pci_wait_for_link(struct pci_dev *pdev, bool active)
+{
+	int timeout = 1000;
+	bool ret;
+	u16 lnk_status;
+
+	for (;;) {
+		pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &lnk_status);
+		ret = !!(lnk_status & PCI_EXP_LNKSTA_DLLLA);
+		if (ret == active)
+			return true;
+		if (timeout <= 0)
+			break;
+		timeout -= 10;
+	}
+
+	dev_printk(KERN_DEBUG, &pdev->dev,
+		   "Data Link Layer Link Active not %s in 1000 msec\n",
+		   active ? "set" : "cleared");
+
+	return false;
+}
+EXPORT_SYMBOL(pci_wait_for_link);
+
 void pci_reset_secondary_bus(struct pci_dev *dev)
 {
 	u16 ctrl;
diff --git a/drivers/pci/pcie/pcie-dpc.c b/drivers/pci/pcie/pcie-dpc.c
index 5c01c63..e15bcda 100644
--- a/drivers/pci/pcie/pcie-dpc.c
+++ b/drivers/pci/pcie/pcie-dpc.c
@@ -120,19 +120,9 @@ static int dpc_wait_rp_inactive(struct dpc_dev *dpc)
 
 static void dpc_wait_link_inactive(struct dpc_dev *dpc)
 {
-	unsigned long timeout = jiffies + HZ;
 	struct pci_dev *pdev = dpc->dev->port;
-	struct device *dev = &dpc->dev->device;
-	u16 lnk_status;
 
-	pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &lnk_status);
-	while (lnk_status & PCI_EXP_LNKSTA_DLLLA &&
-					!time_after(jiffies, timeout)) {
-		msleep(10);
-		pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &lnk_status);
-	}
-	if (lnk_status & PCI_EXP_LNKSTA_DLLLA)
-		dev_warn(dev, "Link state not disabled for DPC event\n");
+	pci_wait_for_link(pdev, false);
 }
 
 /**
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 024a1be..cb674c3 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1195,6 +1195,7 @@ int pci_add_ext_cap_save_buffer(struct pci_dev *dev,
 int pci_request_selected_regions(struct pci_dev *, int, const char *);
 int pci_request_selected_regions_exclusive(struct pci_dev *, int, const char *);
 void pci_release_selected_regions(struct pci_dev *, int);
+bool pci_wait_for_link(struct pci_dev *pdev, bool active);
 
 /* drivers/pci/bus.c */
 struct pci_bus *pci_bus_get(struct pci_bus *bus);
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.,
a Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v11 7/7] PCI/DPC: Enumerate the devices after DPC trigger event
  2018-02-23  8:23 [PATCH v11 0/7] Address error and recovery for AER and DPC Oza Pawandeep
                   ` (5 preceding siblings ...)
  2018-02-23  8:24 ` [PATCH v11 6/7] PCI: Unify wait for link active into generic pci Oza Pawandeep
@ 2018-02-23  8:24 ` Oza Pawandeep
  2018-02-23 23:12 ` [PATCH v11 0/7] Address error and recovery for AER and DPC Bjorn Helgaas
  7 siblings, 0 replies; 22+ messages in thread
From: Oza Pawandeep @ 2018-02-23  8:24 UTC (permalink / raw)
  To: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi
  Cc: Oza Pawandeep

Implement error_resume callback in DPC so, after DPC trigger event
enumerates the devices beneath.

Signed-off-by: Oza Pawandeep <poza@codeaurora.org>

diff --git a/drivers/pci/pcie/pcie-dpc.c b/drivers/pci/pcie/pcie-dpc.c
index e15bcda..751efc0 100644
--- a/drivers/pci/pcie/pcie-dpc.c
+++ b/drivers/pci/pcie/pcie-dpc.c
@@ -126,6 +126,21 @@ static void dpc_wait_link_inactive(struct dpc_dev *dpc)
 }
 
 /**
+ * dpc_error_resume - enumerate the devices beneath
+ * @dev: pointer to Root Port's pci_dev data structure
+ *
+ * Invoked by Port Bus driver during fatal recovery.
+ */
+static void dpc_error_resume(struct pci_dev *pdev)
+{
+	if (pci_wait_for_link(pdev, true)) {
+		pci_lock_rescan_remove();
+		pci_rescan_bus(pdev->bus);
+		pci_unlock_rescan_remove();
+	}
+}
+
+/**
  * dpc_reset_link - reset link DPC  routine
  * @dev: pointer to Root Port's pci_dev data structure
  *
@@ -364,6 +379,7 @@ static void dpc_remove(struct pcie_device *dev)
 	.probe		= dpc_probe,
 	.remove		= dpc_remove,
 	.reset_link     = dpc_reset_link,
+	.error_resume	= dpc_error_resume,
 };
 
 static int __init dpc_service_init(void)
diff --git a/drivers/pci/pcie/pcie-err.c b/drivers/pci/pcie/pcie-err.c
index 1ea4b9a..1e1bf98 100644
--- a/drivers/pci/pcie/pcie-err.c
+++ b/drivers/pci/pcie/pcie-err.c
@@ -236,6 +236,7 @@ static pci_ers_result_t reset_link(struct pci_dev *dev, int severity)
  * @state: error state
  * @error_mesg: message to print
  * @cb: callback to be broadcast
+ * @severity: error severity
  *
  * Invoked during error recovery process. Once being invoked, the content
  * of error severity will be broadcast to all downstream drivers in a
@@ -244,7 +245,8 @@ static pci_ers_result_t reset_link(struct pci_dev *dev, int severity)
 static pci_ers_result_t broadcast_error_message(struct pci_dev *dev,
 	enum pci_channel_state state,
 	char *error_mesg,
-	int (*cb)(struct pci_dev *, void *))
+	int (*cb)(struct pci_dev *, void *),
+	int severity)
 {
 	struct aer_broadcast_data result_data;
 
@@ -256,6 +258,17 @@ static pci_ers_result_t broadcast_error_message(struct pci_dev *dev,
 		result_data.result = PCI_ERS_RESULT_RECOVERED;
 
 	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
+		/* If DPC is triggered, call resume error handler
+		 * because, at this point we can safely assume that
+		 * link recovery has happened, this is only handled if
+		 * callback is resume, as this function can be called
+		 * with multiple callbacks.
+		 */
+		if ((severity == DPC_FATAL) &&
+		    (cb == report_resume)) {
+			cb(dev, NULL);
+			return PCI_ERS_RESULT_RECOVERED;
+		}
 		/*
 		 * If the error is reported by a bridge, we think this error
 		 * is related to the downstream link of the bridge, so we
@@ -305,7 +318,8 @@ void pcie_do_recovery(struct pci_dev *dev, int severity)
 	status = broadcast_error_message(dev,
 			state,
 			"error_detected",
-			report_error_detected);
+			report_error_detected,
+			severity);
 
 	if (severity == AER_FATAL ||
 	    severity == DPC_FATAL) {
@@ -321,7 +335,8 @@ void pcie_do_recovery(struct pci_dev *dev, int severity)
 		status = broadcast_error_message(dev,
 				state,
 				"mmio_enabled",
-				report_mmio_enabled);
+				report_mmio_enabled,
+				severity);
 
 	if (status == PCI_ERS_RESULT_NEED_RESET) {
 		/*
@@ -332,7 +347,8 @@ void pcie_do_recovery(struct pci_dev *dev, int severity)
 		status = broadcast_error_message(dev,
 				state,
 				"slot_reset",
-				report_slot_reset);
+				report_slot_reset,
+				severity);
 	}
 
 	if (status != PCI_ERS_RESULT_RECOVERED)
@@ -342,7 +358,8 @@ void pcie_do_recovery(struct pci_dev *dev, int severity)
 	broadcast_error_message(dev,
 				state,
 				"resume",
-				report_resume);
+				report_resume,
+				severity);
 
 	dev_info(&dev->dev, "Device recovery successful\n");
 	mutex_unlock(&pci_err_recovery_lock);
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.,
a Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v11 0/7] Address error and recovery for AER and DPC
  2018-02-23  8:23 [PATCH v11 0/7] Address error and recovery for AER and DPC Oza Pawandeep
                   ` (6 preceding siblings ...)
  2018-02-23  8:24 ` [PATCH v11 7/7] PCI/DPC: Enumerate the devices after DPC trigger event Oza Pawandeep
@ 2018-02-23 23:12 ` Bjorn Helgaas
  7 siblings, 0 replies; 22+ messages in thread
From: Bjorn Helgaas @ 2018-02-23 23:12 UTC (permalink / raw)
  To: Oza Pawandeep
  Cc: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi

On Fri, Feb 23, 2018 at 01:53:57PM +0530, Oza Pawandeep wrote:
> This patch set brings in error handling support for DPC
> 
> The current implementation of AER and error message broadcasting to the
> EP driver is tightly coupled and limited to AER service driver.
> It is important to factor out broadcasting and other link handling
> callbacks. So that not only when AER gets triggered, but also when DPC get
> triggered (for e.g. ERR_FATAL), callbacks are handled appropriately.
> 
> DPC should enumerate the devices after recovering the link, which is
> achieved by implementing error_resume callback.
> 
> Changes since v10:
>     Christoph Hellwig's, David Laight's and Randy Dunlap's
>     comments addressed.
> 	> renamed pci_do_recovery to pcie_do_recovery
> 	> removed inner braces in conditional statements.
> 	> restrctured the code in pci_wait_for_link 
> 	> EXPORT_SYMBOL_GPL
> Changes since v9:
>     Sinan's comments addressed.
> 	> bool active = true; unnecessary variable removed.
> Changes since v8:
>     Fixed Kbuild errors.
> Changes since v7:
>     Rebased the code on pci master
> 	> https://kernel.googlesource.com/pub/scm/linux/kernel/git/helgaas/pci
> Changes since v6:
>     Sinan's and Stefan's comments implemented.
> 	> reordered patch 6 and 7
> 	> cleaned up
> Changes since v5:
>     Sinan's and Keith's comments incorporated.
>         > made separate patch for mutex
>         > unified error repotting codes into driver/pci/pci.h
>         > got rid of wait link active/inactive and
>           made generic function in driver/pci/pci.c
> Changes since v4:
>     Bjorn's comments incorporated.
>         > Renamed only do_recovery.
>         > moved the things more locally to drivers/pci/pci.h
> Changes since v3:
>     Bjorn's comments incorporated.
>         > Made separate patch renaming generic pci_err.c
>         > Introduce pci_err.h to contain all the error types and recovery
>         > removed all the dependencies on pci.h
> Changes since v2:
>     Based on feedback from Keith:
>     "
>     When DPC is triggered due to receipt of an uncorrectable error Message,
>     the Requester ID from the Message is recorded in the DPC Error
>     Source ID register and that Message is discarded and not forwarded Upstream.
>     "
>     Removed the patch where AER checks if DPC service is active
> Changes since v1:
>     Kbuild errors fixed:
>         > pci_find_dpc_dev made static
>         > ras_event.h updated
>         > pci_find_aer_service call with CONFIG check
>         > pci_find_dpc_service call with CONFIG check

Woof, v8, v9, v10, and v11 all in the last two days.  It's OK to wait
a couple days for feedback to settle out before posting a new version :)

> Oza Pawandeep (7):
>   PCI/AER: Rename error recovery to generic pci naming
>   PCI/AER: factor out error reporting from AER
>   PCI/ERR: add mutex to synchronize recovery
>   PCI/DPC: Unify and plumb error handling into DPC
>   PCI/AER: Unify aer error defines at single space
>   PCI/DPC: Enumerate the devices after DPC trigger event
>   PCI: Unify wait for link active into generic pci

Please capitalize the subject lines consistently.

Please capitalize acronyms in English text, e.g., PCI, AER, DPC.

Please use a blank line between paragraphs in changelogs.

I have more comments on individual patches.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v11 2/7] PCI/AER: factor out error reporting from AER
  2018-02-23  8:23 ` [PATCH v11 2/7] PCI/AER: factor out error reporting from AER Oza Pawandeep
@ 2018-02-23 23:42   ` Bjorn Helgaas
  2018-02-26  5:32     ` poza
  0 siblings, 1 reply; 22+ messages in thread
From: Bjorn Helgaas @ 2018-02-23 23:42 UTC (permalink / raw)
  To: Oza Pawandeep
  Cc: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi

On Fri, Feb 23, 2018 at 01:53:59PM +0530, Oza Pawandeep wrote:
> This patch factors out error reporting callbacks, which are currently
> tightly coupled with AER.

Add blank line between paragraphs.

> DPC should be able to register callbacks and attmept recovery when DPC
> trigger event occurs.

s/attmept/attempt/

This patch basically moves code from aerdrv_core.c to pcie-err.c.  Make it
do *only* that.

> Signed-off-by: Oza Pawandeep <poza@codeaurora.org>
> 
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index fcd8191..abc514e 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -342,6 +342,9 @@ static inline resource_size_t pci_resource_alignment(struct pci_dev *dev,
>  
>  void pci_enable_acs(struct pci_dev *dev);
>  
> +/* PCI error reporting and recovery */
> +void pcie_do_recovery(struct pci_dev *dev, int severity);

Add this declaration in the first patch, the one where you rename the
function.

>  #ifdef CONFIG_PCIEASPM
>  void pcie_aspm_init_link_state(struct pci_dev *pdev);
>  void pcie_aspm_exit_link_state(struct pci_dev *pdev);
> diff --git a/drivers/pci/pcie/Makefile b/drivers/pci/pcie/Makefile
> index 223e4c3..d669497 100644
> --- a/drivers/pci/pcie/Makefile
> +++ b/drivers/pci/pcie/Makefile
> @@ -6,7 +6,7 @@
>  # Build PCI Express ASPM if needed
>  obj-$(CONFIG_PCIEASPM)		+= aspm.o
>  
> -pcieportdrv-y			:= portdrv_core.o portdrv_pci.o portdrv_bus.o
> +pcieportdrv-y			:= portdrv_core.o portdrv_pci.o portdrv_bus.o pcie-err.o

Can we name this just "drivers/pci/pcie/err.c"?  I know we have
pcie-dpc.c already, but it does get a little repetitious to type
"pci" THREE times in that path.

>  pcieportdrv-$(CONFIG_ACPI)	+= portdrv_acpi.o
>  
>  obj-$(CONFIG_PCIEPORTBUS)	+= pcieportdrv.o
> diff --git a/drivers/pci/pcie/aer/aerdrv.h b/drivers/pci/pcie/aer/aerdrv.h
> index 5449e5c..bc9db53 100644
> --- a/drivers/pci/pcie/aer/aerdrv.h
> +++ b/drivers/pci/pcie/aer/aerdrv.h
> @@ -76,36 +76,6 @@ struct aer_rpc {
>  					 */
>  };
>  
> -struct aer_broadcast_data {
> -	enum pci_channel_state state;
> -	enum pci_ers_result result;
> -};
> -
> -static inline pci_ers_result_t merge_result(enum pci_ers_result orig,
> -		enum pci_ers_result new)
> -{
> -	if (new == PCI_ERS_RESULT_NO_AER_DRIVER)
> -		return PCI_ERS_RESULT_NO_AER_DRIVER;
> -
> -	if (new == PCI_ERS_RESULT_NONE)
> -		return orig;
> -
> -	switch (orig) {
> -	case PCI_ERS_RESULT_CAN_RECOVER:
> -	case PCI_ERS_RESULT_RECOVERED:
> -		orig = new;
> -		break;
> -	case PCI_ERS_RESULT_DISCONNECT:
> -		if (new == PCI_ERS_RESULT_NEED_RESET)
> -			orig = PCI_ERS_RESULT_NEED_RESET;
> -		break;
> -	default:
> -		break;
> -	}
> -
> -	return orig;
> -}
> -
>  extern struct bus_type pcie_port_bus_type;
>  void aer_isr(struct work_struct *work);
>  void aer_print_error(struct pci_dev *dev, struct aer_err_info *info);
> diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
> index aeb83a0..f60b1bb 100644
> --- a/drivers/pci/pcie/aer/aerdrv_core.c
> +++ b/drivers/pci/pcie/aer/aerdrv_core.c
> @@ -23,6 +23,7 @@
>  #include <linux/slab.h>
>  #include <linux/kfifo.h>
>  #include "aerdrv.h"
> +#include "../../pci.h"
>  
>  #define	PCI_EXP_AER_FLAGS	(PCI_EXP_DEVCTL_CERE | PCI_EXP_DEVCTL_NFERE | \
>  				 PCI_EXP_DEVCTL_FERE | PCI_EXP_DEVCTL_URRE)
> @@ -230,191 +231,6 @@ static bool find_source_device(struct pci_dev *parent,
>  	return true;
>  }
>  
> -static int report_error_detected(struct pci_dev *dev, void *data)
> -{
> -	pci_ers_result_t vote;
> -	const struct pci_error_handlers *err_handler;
> -	struct aer_broadcast_data *result_data;
> -	result_data = (struct aer_broadcast_data *) data;
> -
> -	device_lock(&dev->dev);
> -	dev->error_state = result_data->state;
> -
> -	if (!dev->driver ||
> -		!dev->driver->err_handler ||
> -		!dev->driver->err_handler->error_detected) {
> -		if (result_data->state == pci_channel_io_frozen &&
> -			dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
> -			/*
> -			 * In case of fatal recovery, if one of down-
> -			 * stream device has no driver. We might be
> -			 * unable to recover because a later insmod
> -			 * of a driver for this device is unaware of
> -			 * its hw state.
> -			 */
> -			pci_printk(KERN_DEBUG, dev, "device has %s\n",
> -				   dev->driver ?
> -				   "no AER-aware driver" : "no driver");
> -		}
> -
> -		/*
> -		 * If there's any device in the subtree that does not
> -		 * have an error_detected callback, returning
> -		 * PCI_ERS_RESULT_NO_AER_DRIVER prevents calling of
> -		 * the subsequent mmio_enabled/slot_reset/resume
> -		 * callbacks of "any" device in the subtree. All the
> -		 * devices in the subtree are left in the error state
> -		 * without recovery.
> -		 */
> -
> -		if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE)
> -			vote = PCI_ERS_RESULT_NO_AER_DRIVER;
> -		else
> -			vote = PCI_ERS_RESULT_NONE;
> -	} else {
> -		err_handler = dev->driver->err_handler;
> -		vote = err_handler->error_detected(dev, result_data->state);
> -		pci_uevent_ers(dev, PCI_ERS_RESULT_NONE);
> -	}
> -
> -	result_data->result = merge_result(result_data->result, vote);
> -	device_unlock(&dev->dev);
> -	return 0;
> -}
> -
> -static int report_mmio_enabled(struct pci_dev *dev, void *data)
> -{
> -	pci_ers_result_t vote;
> -	const struct pci_error_handlers *err_handler;
> -	struct aer_broadcast_data *result_data;
> -	result_data = (struct aer_broadcast_data *) data;
> -
> -	device_lock(&dev->dev);
> -	if (!dev->driver ||
> -		!dev->driver->err_handler ||
> -		!dev->driver->err_handler->mmio_enabled)
> -		goto out;
> -
> -	err_handler = dev->driver->err_handler;
> -	vote = err_handler->mmio_enabled(dev);
> -	result_data->result = merge_result(result_data->result, vote);
> -out:
> -	device_unlock(&dev->dev);
> -	return 0;
> -}
> -
> -static int report_slot_reset(struct pci_dev *dev, void *data)
> -{
> -	pci_ers_result_t vote;
> -	const struct pci_error_handlers *err_handler;
> -	struct aer_broadcast_data *result_data;
> -	result_data = (struct aer_broadcast_data *) data;
> -
> -	device_lock(&dev->dev);
> -	if (!dev->driver ||
> -		!dev->driver->err_handler ||
> -		!dev->driver->err_handler->slot_reset)
> -		goto out;
> -
> -	err_handler = dev->driver->err_handler;
> -	vote = err_handler->slot_reset(dev);
> -	result_data->result = merge_result(result_data->result, vote);
> -out:
> -	device_unlock(&dev->dev);
> -	return 0;
> -}
> -
> -static int report_resume(struct pci_dev *dev, void *data)
> -{
> -	const struct pci_error_handlers *err_handler;
> -
> -	device_lock(&dev->dev);
> -	dev->error_state = pci_channel_io_normal;
> -
> -	if (!dev->driver ||
> -		!dev->driver->err_handler ||
> -		!dev->driver->err_handler->resume)
> -		goto out;
> -
> -	err_handler = dev->driver->err_handler;
> -	err_handler->resume(dev);
> -	pci_uevent_ers(dev, PCI_ERS_RESULT_RECOVERED);
> -out:
> -	device_unlock(&dev->dev);
> -	return 0;
> -}
> -
> -/**
> - * broadcast_error_message - handle message broadcast to downstream drivers
> - * @dev: pointer to from where in a hierarchy message is broadcasted down
> - * @state: error state
> - * @error_mesg: message to print
> - * @cb: callback to be broadcasted
> - *
> - * Invoked during error recovery process. Once being invoked, the content
> - * of error severity will be broadcasted to all downstream drivers in a
> - * hierarchy in question.
> - */
> -static pci_ers_result_t broadcast_error_message(struct pci_dev *dev,
> -	enum pci_channel_state state,
> -	char *error_mesg,
> -	int (*cb)(struct pci_dev *, void *))
> -{
> -	struct aer_broadcast_data result_data;
> -
> -	pci_printk(KERN_DEBUG, dev, "broadcast %s message\n", error_mesg);
> -	result_data.state = state;
> -	if (cb == report_error_detected)
> -		result_data.result = PCI_ERS_RESULT_CAN_RECOVER;
> -	else
> -		result_data.result = PCI_ERS_RESULT_RECOVERED;
> -
> -	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
> -		/*
> -		 * If the error is reported by a bridge, we think this error
> -		 * is related to the downstream link of the bridge, so we
> -		 * do error recovery on all subordinates of the bridge instead
> -		 * of the bridge and clear the error status of the bridge.
> -		 */
> -		if (cb == report_error_detected)
> -			dev->error_state = state;
> -		pci_walk_bus(dev->subordinate, cb, &result_data);
> -		if (cb == report_resume) {
> -			pci_cleanup_aer_uncorrect_error_status(dev);
> -			dev->error_state = pci_channel_io_normal;
> -		}
> -	} else {
> -		/*
> -		 * If the error is reported by an end point, we think this
> -		 * error is related to the upstream link of the end point.
> -		 */
> -		if (state == pci_channel_io_normal)
> -			/*
> -			 * the error is non fatal so the bus is ok, just invoke
> -			 * the callback for the function that logged the error.
> -			 */
> -			cb(dev, &result_data);
> -		else
> -			pci_walk_bus(dev->bus, cb, &result_data);
> -	}
> -
> -	return result_data.result;
> -}
> -
> -/**
> - * default_reset_link - default reset function
> - * @dev: pointer to pci_dev data structure
> - *
> - * Invoked when performing link reset on a Downstream Port or a
> - * Root Port with no aer driver.
> - */
> -static pci_ers_result_t default_reset_link(struct pci_dev *dev)
> -{
> -	pci_reset_bridge_secondary_bus(dev);
> -	pci_printk(KERN_DEBUG, dev, "downstream link has been reset\n");
> -	return PCI_ERS_RESULT_RECOVERED;
> -}
> -
>  static int find_aer_service_iter(struct device *device, void *data)
>  {
>  	struct pcie_port_service_driver *service_driver, **drv;
> @@ -432,7 +248,7 @@ static int find_aer_service_iter(struct device *device, void *data)
>  	return 0;
>  }
>  
> -static struct pcie_port_service_driver *find_aer_service(struct pci_dev *dev)
> +struct pcie_port_service_driver *pci_find_aer_service(struct pci_dev *dev)

Move this rename to a different patch.  The new name should probably
start with "pcie" like you did with pcie_do_recovery().

>  {
>  	struct pcie_port_service_driver *drv = NULL;
>  
> @@ -440,107 +256,7 @@ static struct pcie_port_service_driver *find_aer_service(struct pci_dev *dev)
>  
>  	return drv;
>  }
> -
> -static pci_ers_result_t reset_link(struct pci_dev *dev)
> -{
> -	struct pci_dev *udev;
> -	pci_ers_result_t status;
> -	struct pcie_port_service_driver *driver;
> -
> -	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
> -		/* Reset this port for all subordinates */
> -		udev = dev;
> -	} else {
> -		/* Reset the upstream component (likely downstream port) */
> -		udev = dev->bus->self;
> -	}
> -
> -	/* Use the aer driver of the component firstly */
> -	driver = find_aer_service(udev);
> -
> -	if (driver && driver->reset_link) {
> -		status = driver->reset_link(udev);
> -	} else if (udev->has_secondary_link) {
> -		status = default_reset_link(udev);
> -	} else {
> -		pci_printk(KERN_DEBUG, dev, "no link-reset support at upstream device %s\n",
> -			pci_name(udev));
> -		return PCI_ERS_RESULT_DISCONNECT;
> -	}
> -
> -	if (status != PCI_ERS_RESULT_RECOVERED) {
> -		pci_printk(KERN_DEBUG, dev, "link reset at upstream device %s failed\n",
> -			pci_name(udev));
> -		return PCI_ERS_RESULT_DISCONNECT;
> -	}
> -
> -	return status;
> -}
> -
> -/**
> - * pcie_do_recovery - handle nonfatal/fatal error recovery process
> - * @dev: pointer to a pci_dev data structure of agent detecting an error
> - * @severity: error severity type
> - *
> - * Invoked when an error is nonfatal/fatal. Once being invoked, broadcast
> - * error detected message to all downstream drivers within a hierarchy in
> - * question and return the returned code.
> - */
> -static void pcie_do_recovery(struct pci_dev *dev, int severity)
> -{
> -	pci_ers_result_t status, result = PCI_ERS_RESULT_RECOVERED;
> -	enum pci_channel_state state;
> -
> -	if (severity == AER_FATAL)
> -		state = pci_channel_io_frozen;
> -	else
> -		state = pci_channel_io_normal;
> -
> -	status = broadcast_error_message(dev,
> -			state,
> -			"error_detected",
> -			report_error_detected);
> -
> -	if (severity == AER_FATAL) {
> -		result = reset_link(dev);
> -		if (result != PCI_ERS_RESULT_RECOVERED)
> -			goto failed;
> -	}
> -
> -	if (status == PCI_ERS_RESULT_CAN_RECOVER)
> -		status = broadcast_error_message(dev,
> -				state,
> -				"mmio_enabled",
> -				report_mmio_enabled);
> -
> -	if (status == PCI_ERS_RESULT_NEED_RESET) {
> -		/*
> -		 * TODO: Should call platform-specific
> -		 * functions to reset slot before calling
> -		 * drivers' slot_reset callbacks?
> -		 */
> -		status = broadcast_error_message(dev,
> -				state,
> -				"slot_reset",
> -				report_slot_reset);
> -	}
> -
> -	if (status != PCI_ERS_RESULT_RECOVERED)
> -		goto failed;
> -
> -	broadcast_error_message(dev,
> -				state,
> -				"resume",
> -				report_resume);
> -
> -	pci_info(dev, "AER: Device recovery successful\n");
> -	return;
> -
> -failed:
> -	pci_uevent_ers(dev, PCI_ERS_RESULT_DISCONNECT);
> -	/* TODO: Should kernel panic here? */
> -	pci_info(dev, "AER: Device recovery failed\n");
> -}
> +EXPORT_SYMBOL_GPL(pci_find_aer_service);

This is never called from a module, so I don't think you need to
export it at all.  If you do, it should be a separate patch, not
buried in the middle of this big one.

>  /**
>   * handle_error_source - handle logging error into an event log
> diff --git a/drivers/pci/pcie/pcie-err.c b/drivers/pci/pcie/pcie-err.c
> new file mode 100644
> index 0000000..fcd5add
> --- /dev/null
> +++ b/drivers/pci/pcie/pcie-err.c
> @@ -0,0 +1,334 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * This file implements the error recovery as a core part of PCIe error reporting.
> + * When a PCIe error is delivered, an error message will be collected and printed
> + * to console, then, an error recovery procedure will be executed by following
> + * the PCI error recovery rules.

Wrap this so it fits in 80 columns.

> + * Copyright (C) 2006 Intel Corp.
> + *	Tom Long Nguyen (tom.l.nguyen@intel.com)
> + *	Zhang Yanmin (yanmin.zhang@intel.com)
> + *
> + */
> +
> +#include <linux/pci.h>
> +#include <linux/module.h>
> +#include <linux/pci.h>
> +#include <linux/kernel.h>
> +#include <linux/errno.h>
> +#include <linux/aer.h>
> +#include <linux/pcieport_if.h>
> +#include "portdrv.h"
> +
> +struct aer_broadcast_data {
> +	enum pci_channel_state state;
> +	enum pci_ers_result result;
> +};
> +
> +static pci_ers_result_t merge_result(enum pci_ers_result orig,
> +				  enum pci_ers_result new)
> +{
> +	if (new == PCI_ERS_RESULT_NO_AER_DRIVER)
> +		return PCI_ERS_RESULT_NO_AER_DRIVER;
> +
> +	if (new == PCI_ERS_RESULT_NONE)
> +		return orig;
> +
> +	switch (orig) {
> +	case PCI_ERS_RESULT_CAN_RECOVER:
> +	case PCI_ERS_RESULT_RECOVERED:
> +		orig = new;
> +		break;
> +	case PCI_ERS_RESULT_DISCONNECT:
> +		if (new == PCI_ERS_RESULT_NEED_RESET)
> +			orig = PCI_ERS_RESULT_NEED_RESET;
> +		break;
> +	default:
> +		break;
> +	}
> +
> +	return orig;
> +}
> +
> +static int report_mmio_enabled(struct pci_dev *dev, void *data)
> +{
> +	pci_ers_result_t vote;
> +	const struct pci_error_handlers *err_handler;
> +	struct aer_broadcast_data *result_data;
> +
> +	result_data = (struct aer_broadcast_data *) data;
> +
> +	device_lock(&dev->dev);
> +	if (!dev->driver ||
> +		!dev->driver->err_handler ||
> +		!dev->driver->err_handler->mmio_enabled)
> +		goto out;
> +
> +	err_handler = dev->driver->err_handler;
> +	vote = err_handler->mmio_enabled(dev);
> +	result_data->result = merge_result(result_data->result, vote);
> +out:
> +	device_unlock(&dev->dev);
> +	return 0;
> +}
> +
> +static int report_slot_reset(struct pci_dev *dev, void *data)
> +{
> +	pci_ers_result_t vote;
> +	const struct pci_error_handlers *err_handler;
> +	struct aer_broadcast_data *result_data;
> +
> +	result_data = (struct aer_broadcast_data *) data;
> +
> +	device_lock(&dev->dev);
> +	if (!dev->driver ||
> +		!dev->driver->err_handler ||
> +		!dev->driver->err_handler->slot_reset)
> +		goto out;
> +
> +	err_handler = dev->driver->err_handler;
> +	vote = err_handler->slot_reset(dev);
> +	result_data->result = merge_result(result_data->result, vote);
> +out:
> +	device_unlock(&dev->dev);
> +	return 0;
> +}
> +
> +static int report_resume(struct pci_dev *dev, void *data)
> +{
> +	const struct pci_error_handlers *err_handler;
> +
> +	device_lock(&dev->dev);
> +	dev->error_state = pci_channel_io_normal;
> +
> +	if (!dev->driver ||
> +		!dev->driver->err_handler ||
> +		!dev->driver->err_handler->resume)
> +		goto out;
> +
> +	err_handler = dev->driver->err_handler;
> +	err_handler->resume(dev);
> +out:
> +	device_unlock(&dev->dev);
> +	return 0;
> +}
> +
> +static int report_error_detected(struct pci_dev *dev, void *data)
> +{
> +	pci_ers_result_t vote;
> +	const struct pci_error_handlers *err_handler;
> +	struct aer_broadcast_data *result_data;
> +
> +	result_data = (struct aer_broadcast_data *) data;
> +
> +	device_lock(&dev->dev);
> +	dev->error_state = result_data->state;
> +
> +	if (!dev->driver ||
> +		!dev->driver->err_handler ||
> +		!dev->driver->err_handler->error_detected) {
> +		if (result_data->state == pci_channel_io_frozen &&
> +			dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
> +			/*
> +			 * In case of fatal recovery, if one of down-
> +			 * stream device has no driver. We might be
> +			 * unable to recover because a later insmod
> +			 * of a driver for this device is unaware of
> +			 * its hw state.
> +			 */
> +			dev_printk(KERN_DEBUG, &dev->dev, "device has %s\n",
> +				   dev->driver ?
> +				   "no error-aware driver" : "no driver");

This was a pci_printk() before you moved it and it should be the same here.

> +		}
> +
> +		/*
> +		 * If there's any device in the subtree that does not
> +		 * have an error_detected callback, returning
> +		 * PCI_ERS_RESULT_NO_AER_DRIVER prevents calling of
> +		 * the subsequent mmio_enabled/slot_reset/resume
> +		 * callbacks of "any" device in the subtree. All the
> +		 * devices in the subtree are left in the error state
> +		 * without recovery.
> +		 */
> +
> +		if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE)
> +			vote = PCI_ERS_RESULT_NO_AER_DRIVER;
> +		else
> +			vote = PCI_ERS_RESULT_NONE;
> +	} else {
> +		err_handler = dev->driver->err_handler;
> +		vote = err_handler->error_detected(dev, result_data->state);
> +	}
> +
> +	result_data->result = merge_result(result_data->result, vote);
> +	device_unlock(&dev->dev);
> +	return 0;
> +}
> +
> +/**
> + * default_reset_link - default reset function
> + * @dev: pointer to pci_dev data structure
> + *
> + * Invoked when performing link reset on a Downstream Port or a
> + * Root Port with no aer driver.
> + */
> +static pci_ers_result_t default_reset_link(struct pci_dev *dev)
> +{
> +	pci_reset_bridge_secondary_bus(dev);
> +	dev_printk(KERN_DEBUG, &dev->dev, "downstream link has been reset\n");

This should be a pci_printk() as it was before the move.  There are more
below.

> +	return PCI_ERS_RESULT_RECOVERED;
> +}
> +
> +static pci_ers_result_t reset_link(struct pci_dev *dev)
> +{
> +	struct pci_dev *udev;
> +	pci_ers_result_t status;
> +	struct pcie_port_service_driver *driver = NULL;
> +
> +	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
> +		/* Reset this port for all subordinates */
> +		udev = dev;
> +	} else {
> +		/* Reset the upstream component (likely downstream port) */
> +		udev = dev->bus->self;
> +	}
> +
> +#if IS_ENABLED(CONFIG_PCIEAER)

AER can't be a module, so you can use just:

  #ifdef CONFIG_PCIEAER

This ifdef should be added in the patch where you add a caller from non-AER
code.  This patch should only move code, not change it.

> +	/* Use the aer driver of the component firstly */
> +	driver = pci_find_aer_service(udev);
> +#endif
> +
> +	if (driver && driver->reset_link) {
> +		status = driver->reset_link(udev);
> +	} else if (udev->has_secondary_link) {
> +		status = default_reset_link(udev);
> +	} else {
> +		dev_printk(KERN_DEBUG, &dev->dev,
> +			"no link-reset support at upstream device %s\n",
> +			pci_name(udev));
> +		return PCI_ERS_RESULT_DISCONNECT;
> +	}
> +
> +	if (status != PCI_ERS_RESULT_RECOVERED) {
> +		dev_printk(KERN_DEBUG, &dev->dev,
> +			"link reset at upstream device %s failed\n",
> +			pci_name(udev));
> +		return PCI_ERS_RESULT_DISCONNECT;
> +	}
> +
> +	return status;
> +}
> +
> +/**
> + * broadcast_error_message - handle message broadcast to downstream drivers
> + * @dev: pointer to where in a hierarchy message is broadcasted down
> + * @state: error state
> + * @error_mesg: message to print
> + * @cb: callback to be broadcast
> + *
> + * Invoked during error recovery process. Once being invoked, the content
> + * of error severity will be broadcast to all downstream drivers in a
> + * hierarchy in question.
> + */
> +static pci_ers_result_t broadcast_error_message(struct pci_dev *dev,
> +	enum pci_channel_state state,
> +	char *error_mesg,
> +	int (*cb)(struct pci_dev *, void *))
> +{
> +	struct aer_broadcast_data result_data;
> +
> +	dev_printk(KERN_DEBUG, &dev->dev, "broadcast %s message\n", error_mesg);
> +	result_data.state = state;
> +	if (cb == report_error_detected)
> +		result_data.result = PCI_ERS_RESULT_CAN_RECOVER;
> +	else
> +		result_data.result = PCI_ERS_RESULT_RECOVERED;
> +
> +	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
> +		/*
> +		 * If the error is reported by a bridge, we think this error
> +		 * is related to the downstream link of the bridge, so we
> +		 * do error recovery on all subordinates of the bridge instead
> +		 * of the bridge and clear the error status of the bridge.
> +		 */
> +		if (cb == report_error_detected)
> +			dev->error_state = state;
> +		pci_walk_bus(dev->subordinate, cb, &result_data);
> +		if (cb == report_resume) {
> +			pci_cleanup_aer_uncorrect_error_status(dev);
> +			dev->error_state = pci_channel_io_normal;
> +		}
> +	} else {
> +		/*
> +		 * If the error is reported by an end point, we think this
> +		 * error is related to the upstream link of the end point.
> +		 */
> +		pci_walk_bus(dev->bus, cb, &result_data);
> +	}
> +
> +	return result_data.result;
> +}
> +
> +/**
> + * pcie_do_recovery - handle nonfatal/fatal error recovery process
> + * @dev: pointer to a pci_dev data structure of agent detecting an error
> + * @severity: error severity type
> + *
> + * Invoked when an error is nonfatal/fatal. Once being invoked, broadcast
> + * error detected message to all downstream drivers within a hierarchy in
> + * question and return the returned code.
> + */
> +void pcie_do_recovery(struct pci_dev *dev, int severity)
> +{
> +	pci_ers_result_t status, result = PCI_ERS_RESULT_RECOVERED;
> +	enum pci_channel_state state;
> +
> +	if (severity == AER_FATAL)
> +		state = pci_channel_io_frozen;
> +	else
> +		state = pci_channel_io_normal;
> +
> +	status = broadcast_error_message(dev,
> +			state,
> +			"error_detected",
> +			report_error_detected);
> +
> +	if (severity == AER_FATAL) {
> +		result = reset_link(dev);
> +		if (result != PCI_ERS_RESULT_RECOVERED)
> +			goto failed;
> +	}
> +
> +	if (status == PCI_ERS_RESULT_CAN_RECOVER)
> +		status = broadcast_error_message(dev,
> +				state,
> +				"mmio_enabled",
> +				report_mmio_enabled);
> +
> +	if (status == PCI_ERS_RESULT_NEED_RESET) {
> +		/*
> +		 * TODO: Should call platform-specific
> +		 * functions to reset slot before calling
> +		 * drivers' slot_reset callbacks?
> +		 */
> +		status = broadcast_error_message(dev,
> +				state,
> +				"slot_reset",
> +				report_slot_reset);
> +	}
> +
> +	if (status != PCI_ERS_RESULT_RECOVERED)
> +		goto failed;
> +
> +	broadcast_error_message(dev,
> +				state,
> +				"resume",
> +				report_resume);
> +
> +	dev_info(&dev->dev, "Device recovery successful\n");
> +	return;
> +
> +failed:
> +	/* TODO: Should kernel panic here? */
> +	dev_info(&dev->dev, "Device recovery failed\n");
> +}
> diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
> index a854bc5..4f1992d 100644
> --- a/drivers/pci/pcie/portdrv.h
> +++ b/drivers/pci/pcie/portdrv.h
> @@ -79,4 +79,5 @@ static inline void pcie_port_platform_notify(struct pci_dev *port, int *mask)
>  static inline void pcie_port_platform_notify(struct pci_dev *port, int *mask){}
>  #endif /* !CONFIG_ACPI */
>  
> +struct pcie_port_service_driver *pci_find_aer_service(struct pci_dev *dev);

Should be in a different patch, maybe the one where you rename it.

>  #endif /* _PORTDRV_H_ */
> -- 
> Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.,
> a Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v11 3/7] PCI/ERR: add mutex to synchronize recovery
  2018-02-23  8:24 ` [PATCH v11 3/7] PCI/ERR: add mutex to synchronize recovery Oza Pawandeep
@ 2018-02-23 23:45   ` Bjorn Helgaas
  2018-02-27  5:16     ` poza
  0 siblings, 1 reply; 22+ messages in thread
From: Bjorn Helgaas @ 2018-02-23 23:45 UTC (permalink / raw)
  To: Oza Pawandeep
  Cc: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi

On Fri, Feb 23, 2018 at 01:54:00PM +0530, Oza Pawandeep wrote:
> This patch protects pci_do_recovery with mutex.

pcie_do_recovery()

Please explain why the mutex is necessary.  What bad things happen
without the mutex?

You named (some) of the other things "pcie"; maybe use "pcie" in the
mutex name as well so they look the same.

> Signed-off-by: Oza Pawandeep <poza@codeaurora.org>
> 
> diff --git a/drivers/pci/pcie/pcie-err.c b/drivers/pci/pcie/pcie-err.c
> index fcd5add..f830975 100644
> --- a/drivers/pci/pcie/pcie-err.c
> +++ b/drivers/pci/pcie/pcie-err.c
> @@ -20,6 +20,8 @@
>  #include <linux/pcieport_if.h>
>  #include "portdrv.h"
>  
> +static DEFINE_MUTEX(pci_err_recovery_lock);
> +
>  struct aer_broadcast_data {
>  	enum pci_channel_state state;
>  	enum pci_ers_result result;
> @@ -283,6 +285,8 @@ void pcie_do_recovery(struct pci_dev *dev, int severity)
>  	pci_ers_result_t status, result = PCI_ERS_RESULT_RECOVERED;
>  	enum pci_channel_state state;
>  
> +	mutex_lock(&pci_err_recovery_lock);
> +
>  	if (severity == AER_FATAL)
>  		state = pci_channel_io_frozen;
>  	else
> @@ -326,9 +330,11 @@ void pcie_do_recovery(struct pci_dev *dev, int severity)
>  				report_resume);
>  
>  	dev_info(&dev->dev, "Device recovery successful\n");
> +	mutex_unlock(&pci_err_recovery_lock);
>  	return;
>  
>  failed:
>  	/* TODO: Should kernel panic here? */
>  	dev_info(&dev->dev, "Device recovery failed\n");
> +	mutex_unlock(&pci_err_recovery_lock);
>  }
> -- 
> Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.,
> a Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v11 4/7] PCI/DPC: Unify and plumb error handling into DPC
  2018-02-23  8:24 ` [PATCH v11 4/7] PCI/DPC: Unify and plumb error handling into DPC Oza Pawandeep
@ 2018-02-24  0:07   ` Bjorn Helgaas
  2018-02-27  6:06     ` poza
  0 siblings, 1 reply; 22+ messages in thread
From: Bjorn Helgaas @ 2018-02-24  0:07 UTC (permalink / raw)
  To: Oza Pawandeep
  Cc: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi

On Fri, Feb 23, 2018 at 01:54:01PM +0530, Oza Pawandeep wrote:
> Current DPC driver does not do recovery, e.g. calling end-point's driver's
> callbacks, which sanitize the sw.
> 
> DPC driver implements link_reset callback, and calls pcie_do_recovery.

s/pcie_do_recovery/pcie_do_recovery()/

> Signed-off-by: Oza Pawandeep <poza@codeaurora.org>
> 
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index abc514e..f8575da 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -343,6 +343,8 @@ static inline resource_size_t pci_resource_alignment(struct pci_dev *dev,
>  void pci_enable_acs(struct pci_dev *dev);
>  
>  /* PCI error reporting and recovery */
> +#define DPC_FATAL	4

This needs to go next to the AER_FATAL, etc., definitions because
DPC_FATAL shares the namespace and they all need to have distinct
values.  I can't tell from this patch whether they do or not.

>  void pcie_do_recovery(struct pci_dev *dev, int severity);
>  
>  #ifdef CONFIG_PCIEASPM
> diff --git a/drivers/pci/pcie/pcie-dpc.c b/drivers/pci/pcie/pcie-dpc.c
> index 38e40c6..5c01c63 100644
> --- a/drivers/pci/pcie/pcie-dpc.c
> +++ b/drivers/pci/pcie/pcie-dpc.c
> @@ -13,6 +13,7 @@
>  #include <linux/pcieport_if.h>
>  #include "../pci.h"
>  #include "aer/aerdrv.h"
> +#include "portdrv.h"
>  
>  struct dpc_dev {
>  	struct pcie_device	*dev;
> @@ -45,6 +46,58 @@ struct dpc_dev {
>  	"Memory Request Completion Timeout",		 /* Bit Position 18 */
>  };
>  
> +static int find_dpc_dev_iter(struct device *device, void *data)
> +{
> +	struct pcie_port_service_driver *service_driver;
> +	struct device **dev = (struct device **) data;;
> +
> +	if (device->bus == &pcie_port_bus_type && device->driver) {
> +		service_driver = to_service_driver(device->driver);
> +		if (service_driver->service == PCIE_PORT_SERVICE_DPC) {
> +			*dev = device;
> +			return 1;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +static struct device *pci_find_dpc_dev(struct pci_dev *pdev)
> +{
> +	struct device *dev = NULL;
> +
> +	device_for_each_child(&pdev->dev, &dev, find_dpc_dev_iter);
> +
> +	return dev;
> +}

Ugh.  You're not responsible for this and you don't need to do
anything, but hanging the struct dpc_dev off the struct pcie_device
and then having to grub around like this to locate it from the pci_dev
is just ... clunky.  OK, rant over, sorry :)

> +static int find_dpc_service_iter(struct device *device, void *data)
> +{
> +	struct pcie_port_service_driver *service_driver;
> +	struct pcie_port_service_driver **drv =
> +		(struct pcie_port_service_driver **) data;
> +
> +	if (device->bus == &pcie_port_bus_type && device->driver) {
> +		service_driver = to_service_driver(device->driver);
> +		if (service_driver->service == PCIE_PORT_SERVICE_DPC) {
> +			*drv = service_driver;
> +			return 1;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +struct pcie_port_service_driver *pci_find_dpc_service(struct pci_dev *dev)
> +{
> +	struct pcie_port_service_driver *drv = NULL;
> +
> +	device_for_each_child(&dev->dev, &drv, find_dpc_service_iter);
> +
> +	return drv;
> +}
> +EXPORT_SYMBOL_GPL(pci_find_dpc_service);

No module uses this, so it doesn't need to be exported.

This is a clone of find_aer_service().  Can you add a preliminary patch to
make a generic "find service" interface that accepts the service type
(PCIE_PORT_SERVICE_AER, PCIE_PORT_SERVICE_DPC) as a parameter?

This whole "find service" thing is ugly as sin.  You're not responsible for
cleaning it up, but maybe we can at least limit the proliferation of it.

>  static int dpc_wait_rp_inactive(struct dpc_dev *dpc)
>  {
>  	unsigned long timeout = jiffies + HZ;
> @@ -82,12 +135,25 @@ static void dpc_wait_link_inactive(struct dpc_dev *dpc)
>  		dev_warn(dev, "Link state not disabled for DPC event\n");
>  }
>  
> -static void dpc_work(struct work_struct *work)
> +/**
> + * dpc_reset_link - reset link DPC  routine

s/  / / (remove extra space)

> + * @dev: pointer to Root Port's pci_dev data structure
> + *
> + * Invoked by Port Bus driver when performing link reset at Root Port.
> + */
> +static pci_ers_result_t dpc_reset_link(struct pci_dev *pdev)
>  {
> -	struct dpc_dev *dpc = container_of(work, struct dpc_dev, work);
> -	struct pci_dev *dev, *temp, *pdev = dpc->dev->port;
>  	struct pci_bus *parent = pdev->subordinate;
> -	u16 cap = dpc->cap_pos, ctl;
> +	struct pci_dev *dev, *temp;
> +	struct dpc_dev *dpc;
> +	struct pcie_device *pciedev;
> +	struct device *devdpc;
> +	u16 cap, ctl;
> +
> +	devdpc = pci_find_dpc_dev(pdev);
> +	pciedev = to_pcie_device(devdpc);
> +	dpc = get_service_data(pciedev);
> +	cap = dpc->cap_pos;
>  
>  	pci_lock_rescan_remove();
>  	list_for_each_entry_safe_reverse(dev, temp, &parent->devices,
> @@ -104,21 +170,31 @@ static void dpc_work(struct work_struct *work)
>  
>  	dpc_wait_link_inactive(dpc);
>  	if (dpc->rp_extensions && dpc_wait_rp_inactive(dpc))
> -		return;
> +		return PCI_ERS_RESULT_DISCONNECT;
>  	if (dpc->rp_extensions && dpc->rp_pio_status) {
>  		pci_write_config_dword(pdev, cap + PCI_EXP_DPC_RP_PIO_STATUS,
>  				       dpc->rp_pio_status);
>  		dpc->rp_pio_status = 0;
>  	}
>  
> -	pci_write_config_word(pdev, cap + PCI_EXP_DPC_STATUS,
> +	pci_write_config_word(pdev, dpc->cap_pos + PCI_EXP_DPC_STATUS,
>  		PCI_EXP_DPC_STATUS_TRIGGER | PCI_EXP_DPC_STATUS_INTERRUPT);
>  
>  	pci_read_config_word(pdev, cap + PCI_EXP_DPC_CTL, &ctl);
>  	pci_write_config_word(pdev, cap + PCI_EXP_DPC_CTL,
> -			      ctl | PCI_EXP_DPC_CTL_INT_EN);
> +				ctl | PCI_EXP_DPC_CTL_INT_EN);

Align "ctl" with "pdev".

> +	return PCI_ERS_RESULT_RECOVERED;
>  }
>  
> +static void dpc_work(struct work_struct *work)
> +{
> +	struct dpc_dev *dpc = container_of(work, struct dpc_dev, work);
> +	struct pci_dev *pdev = dpc->dev->port;
> +
> +	/* From DPC point of view error is always FATAL. */
> +	pcie_do_recovery(pdev, DPC_FATAL);
> +}
>  static void dpc_process_rp_pio_error(struct dpc_dev *dpc)
>  {
>  	struct device *dev = &dpc->dev->device;
> @@ -297,6 +373,7 @@ static void dpc_remove(struct pcie_device *dev)
>  	.service	= PCIE_PORT_SERVICE_DPC,
>  	.probe		= dpc_probe,
>  	.remove		= dpc_remove,
> +	.reset_link     = dpc_reset_link,
>  };
>  
>  static int __init dpc_service_init(void)
> diff --git a/drivers/pci/pcie/pcie-err.c b/drivers/pci/pcie/pcie-err.c
> index f830975..1ea4b9a 100644
> --- a/drivers/pci/pcie/pcie-err.c
> +++ b/drivers/pci/pcie/pcie-err.c
> @@ -19,6 +19,7 @@
>  #include <linux/aer.h>
>  #include <linux/pcieport_if.h>
>  #include "portdrv.h"
> +#include "./../pci.h"
>  
>  static DEFINE_MUTEX(pci_err_recovery_lock);
>  
> @@ -181,7 +182,7 @@ static pci_ers_result_t default_reset_link(struct pci_dev *dev)
>  	return PCI_ERS_RESULT_RECOVERED;
>  }
>  
> -static pci_ers_result_t reset_link(struct pci_dev *dev)
> +static pci_ers_result_t reset_link(struct pci_dev *dev, int severity)
>  {
>  	struct pci_dev *udev;
>  	pci_ers_result_t status;
> @@ -195,9 +196,17 @@ static pci_ers_result_t reset_link(struct pci_dev *dev)
>  		udev = dev->bus->self;
>  	}
>  
> +
> +	/* Use the service driver of the component firstly */
> +#if IS_ENABLED(CONFIG_PCIE_DPC)

#ifdef CONFIG_PCIE_DPC

> +	if (severity == DPC_FATAL)
> +		driver = pci_find_dpc_service(udev);
> +#endif
>  #if IS_ENABLED(CONFIG_PCIEAER)
> -	/* Use the aer driver of the component firstly */
> -	driver = pci_find_aer_service(udev);
> +	if (severity == AER_FATAL ||
> +	    severity == AER_NONFATAL ||
> +	    severity == AER_CORRECTABLE)

This change (to check for AER_FATAL, etc) looks like it belongs in a
different patch.  This patch doesn't change any places that set the
severity.

> +		driver = pci_find_aer_service(udev);
>  #endif
>  
>  	if (driver && driver->reset_link) {
> @@ -287,7 +296,8 @@ void pcie_do_recovery(struct pci_dev *dev, int severity)
>  
>  	mutex_lock(&pci_err_recovery_lock);
>  
> -	if (severity == AER_FATAL)
> +	if (severity == AER_FATAL ||
> +	    severity == DPC_FATAL)
>  		state = pci_channel_io_frozen;
>  	else
>  		state = pci_channel_io_normal;
> @@ -297,10 +307,14 @@ void pcie_do_recovery(struct pci_dev *dev, int severity)
>  			"error_detected",
>  			report_error_detected);
>  
> -	if (severity == AER_FATAL) {
> -		result = reset_link(dev);
> +	if (severity == AER_FATAL ||
> +	    severity == DPC_FATAL) {
> +		result = reset_link(dev, severity);
>  		if (result != PCI_ERS_RESULT_RECOVERED)
>  			goto failed;
> +		else if (severity == DPC_FATAL)
> +			goto resume;
> +
>  	}
>  
>  	if (status == PCI_ERS_RESULT_CAN_RECOVER)
> @@ -324,6 +338,7 @@ void pcie_do_recovery(struct pci_dev *dev, int severity)
>  	if (status != PCI_ERS_RESULT_RECOVERED)
>  		goto failed;
>  
> +resume:
>  	broadcast_error_message(dev,
>  				state,
>  				"resume",
> diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
> index 4f1992d..b013e24 100644
> --- a/drivers/pci/pcie/portdrv.h
> +++ b/drivers/pci/pcie/portdrv.h
> @@ -80,4 +80,5 @@ static inline void pcie_port_platform_notify(struct pci_dev *port, int *mask){}
>  #endif /* !CONFIG_ACPI */
>  
>  struct pcie_port_service_driver *pci_find_aer_service(struct pci_dev *dev);
> +struct pcie_port_service_driver *pci_find_dpc_service(struct pci_dev *dev);
>  #endif /* _PORTDRV_H_ */
> -- 
> Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.,
> a Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v11 5/7] PCI/AER: Unify aer error defines at single space
  2018-02-23  8:24 ` [PATCH v11 5/7] PCI/AER: Unify aer error defines at single space Oza Pawandeep
@ 2018-02-24 15:36   ` Bjorn Helgaas
  2018-02-27  6:12     ` poza
  0 siblings, 1 reply; 22+ messages in thread
From: Bjorn Helgaas @ 2018-02-24 15:36 UTC (permalink / raw)
  To: Oza Pawandeep
  Cc: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi

On Fri, Feb 23, 2018 at 01:54:02PM +0530, Oza Pawandeep wrote:
> This patch moves AER error defines to drivers/pci/pci.h.
> So that it unifies the error repoting codes at single place along with dpc

s/repoting/reporting/
s/dpc/DPC/ (in English text)

> Signed-off-by: Oza Pawandeep <poza@codeaurora.org>
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 1efefe9..7ae9bb3 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -56,6 +56,7 @@
>  #include <ras/ras_event.h>
>  
>  #include "apei-internal.h"
> +#include "../../pci/pci.h"

You're right, it's ugly to use this sort of path to a private PCI
header file from outside drivers/pci.

Could you just add DPC_FATAL to include/linux/aer.h?  Maybe we
discarded that for some reason?  Having pcie-dpc.c include linux/aer.h
seems like it would be better than having this ACPI code include
"../../pci/pci.h"

>  #define GHES_PFX	"GHES: "
>  
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index f8575da..c8394ec 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -343,7 +343,11 @@ static inline resource_size_t pci_resource_alignment(struct pci_dev *dev,
>  void pci_enable_acs(struct pci_dev *dev);
>  
>  /* PCI error reporting and recovery */
> -#define DPC_FATAL	4
> +#define AER_NONFATAL		0
> +#define AER_FATAL		1
> +#define AER_CORRECTABLE		2
> +
> +#define DPC_FATAL		4
>  
>  void pcie_do_recovery(struct pci_dev *dev, int severity);
>  
> diff --git a/drivers/pci/pcie/aer/aerdrv_errprint.c b/drivers/pci/pcie/aer/aerdrv_errprint.c
> index 6a352e6..4c59f37 100644
> --- a/drivers/pci/pcie/aer/aerdrv_errprint.c
> +++ b/drivers/pci/pcie/aer/aerdrv_errprint.c
> @@ -19,6 +19,7 @@
>  #include <linux/cper.h>
>  
>  #include "aerdrv.h"
> +#include "../../pci.h"
>  #include <ras/ras_event.h>
>  
>  #define AER_AGENT_RECEIVER		0
> diff --git a/include/linux/aer.h b/include/linux/aer.h
> index 8f87bbe..3eac8ed 100644
> --- a/include/linux/aer.h
> +++ b/include/linux/aer.h
> @@ -11,10 +11,6 @@
>  #include <linux/errno.h>
>  #include <linux/types.h>
>  
> -#define AER_NONFATAL			0
> -#define AER_FATAL			1
> -#define AER_CORRECTABLE			2
> -
>  struct pci_dev;
>  
>  struct aer_header_log_regs {
> diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
> index 9c68986..d75c75b 100644
> --- a/include/ras/ras_event.h
> +++ b/include/ras/ras_event.h
> @@ -13,6 +13,7 @@
>  #include <linux/aer.h>
>  #include <linux/cper.h>
>  #include <linux/mm.h>
> +#include "../../../drivers/pci/pci.h"
>  
>  /*
>   * MCE Extended Error Log trace event
> -- 
> Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.,
> a Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v11 6/7] PCI: Unify wait for link active into generic pci
  2018-02-23  8:24 ` [PATCH v11 6/7] PCI: Unify wait for link active into generic pci Oza Pawandeep
@ 2018-02-24 15:41   ` Bjorn Helgaas
  2018-02-27  8:39     ` poza
  0 siblings, 1 reply; 22+ messages in thread
From: Bjorn Helgaas @ 2018-02-24 15:41 UTC (permalink / raw)
  To: Oza Pawandeep
  Cc: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi

In subject:

s/pci/PCI/

On Fri, Feb 23, 2018 at 01:54:03PM +0530, Oza Pawandeep wrote:
> Clients such as pciehp, dpc are using pcie_wait_link_active, which waits
> till the link becomes active or inactive.

Use "()" after function names so we have a visual clue that they are
functions.

> Made generic function and moved it to drivers/pci/pci.c
> 
> Signed-off-by: Oza Pawandeep <poza@codeaurora.org>
> 
> diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
> index 18a42f8..de9b0ea 100644
> --- a/drivers/pci/hotplug/pciehp_hpc.c
> +++ b/drivers/pci/hotplug/pciehp_hpc.c
> @@ -231,25 +231,11 @@ bool pciehp_check_link_active(struct controller *ctrl)
>  	return ret;
>  }
>  
> -static void __pcie_wait_link_active(struct controller *ctrl, bool active)
> +static bool pcie_wait_link_active(struct controller *ctrl)
>  {
> -	int timeout = 1000;
> -
> -	if (pciehp_check_link_active(ctrl) == active)
> -		return;
> -	while (timeout > 0) {
> -		msleep(10);
> -		timeout -= 10;
> -		if (pciehp_check_link_active(ctrl) == active)
> -			return;
> -	}
> -	ctrl_dbg(ctrl, "Data Link Layer Link Active not %s in 1000 msec\n",
> -			active ? "set" : "cleared");
> -}
> +	struct pci_dev *pdev = ctrl_dev(ctrl);
>  
> -static void pcie_wait_link_active(struct controller *ctrl)
> -{
> -	__pcie_wait_link_active(ctrl, true);
> +	return pci_wait_for_link(pdev, true);
>  }
>  
>  static bool pci_bus_check_dev(struct pci_bus *bus, int devfn)
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index f6a4dd1..f8d44b4 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -4176,6 +4176,37 @@ static int pci_pm_reset(struct pci_dev *dev, int probe)
>  	return 0;
>  }
>  
> +/**
> + * pci_wait_for_link - Wait for link till its active/inactive

s/its/it's/

> + * @pdev: Bridge device
> + * @active: waiting for active or inactive ?
> + *
> + * Use this to wait till link becomes active or inactive.
> + */
> +bool pci_wait_for_link(struct pci_dev *pdev, bool active)

I think this should be "pcie_wait_for_link()".  There's no concept of a
link in conventional PCI.

> +{
> +	int timeout = 1000;
> +	bool ret;
> +	u16 lnk_status;
> +
> +	for (;;) {
> +		pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &lnk_status);
> +		ret = !!(lnk_status & PCI_EXP_LNKSTA_DLLLA);
> +		if (ret == active)
> +			return true;
> +		if (timeout <= 0)
> +			break;
> +		timeout -= 10;
> +	}
> +
> +	dev_printk(KERN_DEBUG, &pdev->dev,

pci_info().  Distros often seem to have logging configured so
KERN_DEBUG things aren't captured, and this definitely seems worth
capturing.

> +		   "Data Link Layer Link Active not %s in 1000 msec\n",
> +		   active ? "set" : "cleared");
> +
> +	return false;
> +}
> +EXPORT_SYMBOL(pci_wait_for_link);

I don't think this needs to be exported.

>  void pci_reset_secondary_bus(struct pci_dev *dev)
>  {
>  	u16 ctrl;
> diff --git a/drivers/pci/pcie/pcie-dpc.c b/drivers/pci/pcie/pcie-dpc.c
> index 5c01c63..e15bcda 100644
> --- a/drivers/pci/pcie/pcie-dpc.c
> +++ b/drivers/pci/pcie/pcie-dpc.c
> @@ -120,19 +120,9 @@ static int dpc_wait_rp_inactive(struct dpc_dev *dpc)
>  
>  static void dpc_wait_link_inactive(struct dpc_dev *dpc)
>  {
> -	unsigned long timeout = jiffies + HZ;
>  	struct pci_dev *pdev = dpc->dev->port;
> -	struct device *dev = &dpc->dev->device;
> -	u16 lnk_status;
>  
> -	pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &lnk_status);
> -	while (lnk_status & PCI_EXP_LNKSTA_DLLLA &&
> -					!time_after(jiffies, timeout)) {
> -		msleep(10);
> -		pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &lnk_status);
> -	}
> -	if (lnk_status & PCI_EXP_LNKSTA_DLLLA)
> -		dev_warn(dev, "Link state not disabled for DPC event\n");
> +	pci_wait_for_link(pdev, false);
>  }
>  
>  /**
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 024a1be..cb674c3 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1195,6 +1195,7 @@ int pci_add_ext_cap_save_buffer(struct pci_dev *dev,
>  int pci_request_selected_regions(struct pci_dev *, int, const char *);
>  int pci_request_selected_regions_exclusive(struct pci_dev *, int, const char *);
>  void pci_release_selected_regions(struct pci_dev *, int);
> +bool pci_wait_for_link(struct pci_dev *pdev, bool active);

I don't think this needs to be available outside the PCI core, so this
could probably be declared in drivers/pci/pci.h

>  /* drivers/pci/bus.c */
>  struct pci_bus *pci_bus_get(struct pci_bus *bus);
> -- 
> Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.,
> a Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v11 2/7] PCI/AER: factor out error reporting from AER
  2018-02-23 23:42   ` Bjorn Helgaas
@ 2018-02-26  5:32     ` poza
  2018-02-26  5:39       ` poza
  2018-02-26 20:23       ` Bjorn Helgaas
  0 siblings, 2 replies; 22+ messages in thread
From: poza @ 2018-02-26  5:32 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi

On 2018-02-24 05:12, Bjorn Helgaas wrote:
> On Fri, Feb 23, 2018 at 01:53:59PM +0530, Oza Pawandeep wrote:
>> This patch factors out error reporting callbacks, which are currently
>> tightly coupled with AER.
> 
> Add blank line between paragraphs.
> 
>> DPC should be able to register callbacks and attmept recovery when DPC
>> trigger event occurs.
> 
> s/attmept/attempt/
> 
> This patch basically moves code from aerdrv_core.c to pcie-err.c.  Make 
> it
> do *only* that.
> 

sure.

>> Signed-off-by: Oza Pawandeep <poza@codeaurora.org>
>> 
>> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
>> index fcd8191..abc514e 100644
>> --- a/drivers/pci/pci.h
>> +++ b/drivers/pci/pci.h
>> @@ -342,6 +342,9 @@ static inline resource_size_t 
>> pci_resource_alignment(struct pci_dev *dev,
>> 
>>  void pci_enable_acs(struct pci_dev *dev);
>> 
>> +/* PCI error reporting and recovery */
>> +void pcie_do_recovery(struct pci_dev *dev, int severity);
> 
> Add this declaration in the first patch, the one where you rename the
> function.
> 

done.

>>  #ifdef CONFIG_PCIEASPM
>>  void pcie_aspm_init_link_state(struct pci_dev *pdev);
>>  void pcie_aspm_exit_link_state(struct pci_dev *pdev);
>> diff --git a/drivers/pci/pcie/Makefile b/drivers/pci/pcie/Makefile
>> index 223e4c3..d669497 100644
>> --- a/drivers/pci/pcie/Makefile
>> +++ b/drivers/pci/pcie/Makefile
>> @@ -6,7 +6,7 @@
>>  # Build PCI Express ASPM if needed
>>  obj-$(CONFIG_PCIEASPM)		+= aspm.o
>> 
>> -pcieportdrv-y			:= portdrv_core.o portdrv_pci.o portdrv_bus.o
>> +pcieportdrv-y			:= portdrv_core.o portdrv_pci.o portdrv_bus.o 
>> pcie-err.o
> 
> Can we name this just "drivers/pci/pcie/err.c"?  I know we have
> pcie-dpc.c already, but it does get a little repetitious to type
> "pci" THREE times in that path.
> 

sure, will rename.

>>  pcieportdrv-$(CONFIG_ACPI)	+= portdrv_acpi.o
>> 
>>  obj-$(CONFIG_PCIEPORTBUS)	+= pcieportdrv.o
>> diff --git a/drivers/pci/pcie/aer/aerdrv.h 
>> b/drivers/pci/pcie/aer/aerdrv.h
>> index 5449e5c..bc9db53 100644
>> --- a/drivers/pci/pcie/aer/aerdrv.h
>> +++ b/drivers/pci/pcie/aer/aerdrv.h
>> @@ -76,36 +76,6 @@ struct aer_rpc {
>>  					 */
>>  };
>> 
>> -struct aer_broadcast_data {
>> -	enum pci_channel_state state;
>> -	enum pci_ers_result result;
>> -};
>> -
>> -static inline pci_ers_result_t merge_result(enum pci_ers_result orig,
>> -		enum pci_ers_result new)
>> -{
>> -	if (new == PCI_ERS_RESULT_NO_AER_DRIVER)
>> -		return PCI_ERS_RESULT_NO_AER_DRIVER;
>> -
>> -	if (new == PCI_ERS_RESULT_NONE)
>> -		return orig;
>> -
>> -	switch (orig) {
>> -	case PCI_ERS_RESULT_CAN_RECOVER:
>> -	case PCI_ERS_RESULT_RECOVERED:
>> -		orig = new;
>> -		break;
>> -	case PCI_ERS_RESULT_DISCONNECT:
>> -		if (new == PCI_ERS_RESULT_NEED_RESET)
>> -			orig = PCI_ERS_RESULT_NEED_RESET;
>> -		break;
>> -	default:
>> -		break;
>> -	}
>> -
>> -	return orig;
>> -}
>> -
>>  extern struct bus_type pcie_port_bus_type;
>>  void aer_isr(struct work_struct *work);
>>  void aer_print_error(struct pci_dev *dev, struct aer_err_info *info);
>> diff --git a/drivers/pci/pcie/aer/aerdrv_core.c 
>> b/drivers/pci/pcie/aer/aerdrv_core.c
>> index aeb83a0..f60b1bb 100644
>> --- a/drivers/pci/pcie/aer/aerdrv_core.c
>> +++ b/drivers/pci/pcie/aer/aerdrv_core.c
>> @@ -23,6 +23,7 @@
>>  #include <linux/slab.h>
>>  #include <linux/kfifo.h>
>>  #include "aerdrv.h"
>> +#include "../../pci.h"
>> 
>>  #define	PCI_EXP_AER_FLAGS	(PCI_EXP_DEVCTL_CERE | PCI_EXP_DEVCTL_NFERE 
>> | \
>>  				 PCI_EXP_DEVCTL_FERE | PCI_EXP_DEVCTL_URRE)
>> @@ -230,191 +231,6 @@ static bool find_source_device(struct pci_dev 
>> *parent,
>>  	return true;
>>  }
>> 
>> -static int report_error_detected(struct pci_dev *dev, void *data)
>> -{
>> -	pci_ers_result_t vote;
>> -	const struct pci_error_handlers *err_handler;
>> -	struct aer_broadcast_data *result_data;
>> -	result_data = (struct aer_broadcast_data *) data;
>> -
>> -	device_lock(&dev->dev);
>> -	dev->error_state = result_data->state;
>> -
>> -	if (!dev->driver ||
>> -		!dev->driver->err_handler ||
>> -		!dev->driver->err_handler->error_detected) {
>> -		if (result_data->state == pci_channel_io_frozen &&
>> -			dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
>> -			/*
>> -			 * In case of fatal recovery, if one of down-
>> -			 * stream device has no driver. We might be
>> -			 * unable to recover because a later insmod
>> -			 * of a driver for this device is unaware of
>> -			 * its hw state.
>> -			 */
>> -			pci_printk(KERN_DEBUG, dev, "device has %s\n",
>> -				   dev->driver ?
>> -				   "no AER-aware driver" : "no driver");
>> -		}
>> -
>> -		/*
>> -		 * If there's any device in the subtree that does not
>> -		 * have an error_detected callback, returning
>> -		 * PCI_ERS_RESULT_NO_AER_DRIVER prevents calling of
>> -		 * the subsequent mmio_enabled/slot_reset/resume
>> -		 * callbacks of "any" device in the subtree. All the
>> -		 * devices in the subtree are left in the error state
>> -		 * without recovery.
>> -		 */
>> -
>> -		if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE)
>> -			vote = PCI_ERS_RESULT_NO_AER_DRIVER;
>> -		else
>> -			vote = PCI_ERS_RESULT_NONE;
>> -	} else {
>> -		err_handler = dev->driver->err_handler;
>> -		vote = err_handler->error_detected(dev, result_data->state);
>> -		pci_uevent_ers(dev, PCI_ERS_RESULT_NONE);
>> -	}
>> -
>> -	result_data->result = merge_result(result_data->result, vote);
>> -	device_unlock(&dev->dev);
>> -	return 0;
>> -}
>> -
>> -static int report_mmio_enabled(struct pci_dev *dev, void *data)
>> -{
>> -	pci_ers_result_t vote;
>> -	const struct pci_error_handlers *err_handler;
>> -	struct aer_broadcast_data *result_data;
>> -	result_data = (struct aer_broadcast_data *) data;
>> -
>> -	device_lock(&dev->dev);
>> -	if (!dev->driver ||
>> -		!dev->driver->err_handler ||
>> -		!dev->driver->err_handler->mmio_enabled)
>> -		goto out;
>> -
>> -	err_handler = dev->driver->err_handler;
>> -	vote = err_handler->mmio_enabled(dev);
>> -	result_data->result = merge_result(result_data->result, vote);
>> -out:
>> -	device_unlock(&dev->dev);
>> -	return 0;
>> -}
>> -
>> -static int report_slot_reset(struct pci_dev *dev, void *data)
>> -{
>> -	pci_ers_result_t vote;
>> -	const struct pci_error_handlers *err_handler;
>> -	struct aer_broadcast_data *result_data;
>> -	result_data = (struct aer_broadcast_data *) data;
>> -
>> -	device_lock(&dev->dev);
>> -	if (!dev->driver ||
>> -		!dev->driver->err_handler ||
>> -		!dev->driver->err_handler->slot_reset)
>> -		goto out;
>> -
>> -	err_handler = dev->driver->err_handler;
>> -	vote = err_handler->slot_reset(dev);
>> -	result_data->result = merge_result(result_data->result, vote);
>> -out:
>> -	device_unlock(&dev->dev);
>> -	return 0;
>> -}
>> -
>> -static int report_resume(struct pci_dev *dev, void *data)
>> -{
>> -	const struct pci_error_handlers *err_handler;
>> -
>> -	device_lock(&dev->dev);
>> -	dev->error_state = pci_channel_io_normal;
>> -
>> -	if (!dev->driver ||
>> -		!dev->driver->err_handler ||
>> -		!dev->driver->err_handler->resume)
>> -		goto out;
>> -
>> -	err_handler = dev->driver->err_handler;
>> -	err_handler->resume(dev);
>> -	pci_uevent_ers(dev, PCI_ERS_RESULT_RECOVERED);
>> -out:
>> -	device_unlock(&dev->dev);
>> -	return 0;
>> -}
>> -
>> -/**
>> - * broadcast_error_message - handle message broadcast to downstream 
>> drivers
>> - * @dev: pointer to from where in a hierarchy message is broadcasted 
>> down
>> - * @state: error state
>> - * @error_mesg: message to print
>> - * @cb: callback to be broadcasted
>> - *
>> - * Invoked during error recovery process. Once being invoked, the 
>> content
>> - * of error severity will be broadcasted to all downstream drivers in 
>> a
>> - * hierarchy in question.
>> - */
>> -static pci_ers_result_t broadcast_error_message(struct pci_dev *dev,
>> -	enum pci_channel_state state,
>> -	char *error_mesg,
>> -	int (*cb)(struct pci_dev *, void *))
>> -{
>> -	struct aer_broadcast_data result_data;
>> -
>> -	pci_printk(KERN_DEBUG, dev, "broadcast %s message\n", error_mesg);
>> -	result_data.state = state;
>> -	if (cb == report_error_detected)
>> -		result_data.result = PCI_ERS_RESULT_CAN_RECOVER;
>> -	else
>> -		result_data.result = PCI_ERS_RESULT_RECOVERED;
>> -
>> -	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
>> -		/*
>> -		 * If the error is reported by a bridge, we think this error
>> -		 * is related to the downstream link of the bridge, so we
>> -		 * do error recovery on all subordinates of the bridge instead
>> -		 * of the bridge and clear the error status of the bridge.
>> -		 */
>> -		if (cb == report_error_detected)
>> -			dev->error_state = state;
>> -		pci_walk_bus(dev->subordinate, cb, &result_data);
>> -		if (cb == report_resume) {
>> -			pci_cleanup_aer_uncorrect_error_status(dev);
>> -			dev->error_state = pci_channel_io_normal;
>> -		}
>> -	} else {
>> -		/*
>> -		 * If the error is reported by an end point, we think this
>> -		 * error is related to the upstream link of the end point.
>> -		 */
>> -		if (state == pci_channel_io_normal)
>> -			/*
>> -			 * the error is non fatal so the bus is ok, just invoke
>> -			 * the callback for the function that logged the error.
>> -			 */
>> -			cb(dev, &result_data);
>> -		else
>> -			pci_walk_bus(dev->bus, cb, &result_data);
>> -	}
>> -
>> -	return result_data.result;
>> -}
>> -
>> -/**
>> - * default_reset_link - default reset function
>> - * @dev: pointer to pci_dev data structure
>> - *
>> - * Invoked when performing link reset on a Downstream Port or a
>> - * Root Port with no aer driver.
>> - */
>> -static pci_ers_result_t default_reset_link(struct pci_dev *dev)
>> -{
>> -	pci_reset_bridge_secondary_bus(dev);
>> -	pci_printk(KERN_DEBUG, dev, "downstream link has been reset\n");
>> -	return PCI_ERS_RESULT_RECOVERED;
>> -}
>> -
>>  static int find_aer_service_iter(struct device *device, void *data)
>>  {
>>  	struct pcie_port_service_driver *service_driver, **drv;
>> @@ -432,7 +248,7 @@ static int find_aer_service_iter(struct device 
>> *device, void *data)
>>  	return 0;
>>  }
>> 
>> -static struct pcie_port_service_driver *find_aer_service(struct 
>> pci_dev *dev)
>> +struct pcie_port_service_driver *pci_find_aer_service(struct pci_dev 
>> *dev)
> 
> Move this rename to a different patch.  The new name should probably
> start with "pcie" like you did with pcie_do_recovery().
> 

sure, will do that.

>>  {
>>  	struct pcie_port_service_driver *drv = NULL;
>> 
>> @@ -440,107 +256,7 @@ static struct pcie_port_service_driver 
>> *find_aer_service(struct pci_dev *dev)
>> 
>>  	return drv;
>>  }
>> -
>> -static pci_ers_result_t reset_link(struct pci_dev *dev)
>> -{
>> -	struct pci_dev *udev;
>> -	pci_ers_result_t status;
>> -	struct pcie_port_service_driver *driver;
>> -
>> -	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
>> -		/* Reset this port for all subordinates */
>> -		udev = dev;
>> -	} else {
>> -		/* Reset the upstream component (likely downstream port) */
>> -		udev = dev->bus->self;
>> -	}
>> -
>> -	/* Use the aer driver of the component firstly */
>> -	driver = find_aer_service(udev);
>> -
>> -	if (driver && driver->reset_link) {
>> -		status = driver->reset_link(udev);
>> -	} else if (udev->has_secondary_link) {
>> -		status = default_reset_link(udev);
>> -	} else {
>> -		pci_printk(KERN_DEBUG, dev, "no link-reset support at upstream 
>> device %s\n",
>> -			pci_name(udev));
>> -		return PCI_ERS_RESULT_DISCONNECT;
>> -	}
>> -
>> -	if (status != PCI_ERS_RESULT_RECOVERED) {
>> -		pci_printk(KERN_DEBUG, dev, "link reset at upstream device %s 
>> failed\n",
>> -			pci_name(udev));
>> -		return PCI_ERS_RESULT_DISCONNECT;
>> -	}
>> -
>> -	return status;
>> -}
>> -
>> -/**
>> - * pcie_do_recovery - handle nonfatal/fatal error recovery process
>> - * @dev: pointer to a pci_dev data structure of agent detecting an 
>> error
>> - * @severity: error severity type
>> - *
>> - * Invoked when an error is nonfatal/fatal. Once being invoked, 
>> broadcast
>> - * error detected message to all downstream drivers within a 
>> hierarchy in
>> - * question and return the returned code.
>> - */
>> -static void pcie_do_recovery(struct pci_dev *dev, int severity)
>> -{
>> -	pci_ers_result_t status, result = PCI_ERS_RESULT_RECOVERED;
>> -	enum pci_channel_state state;
>> -
>> -	if (severity == AER_FATAL)
>> -		state = pci_channel_io_frozen;
>> -	else
>> -		state = pci_channel_io_normal;
>> -
>> -	status = broadcast_error_message(dev,
>> -			state,
>> -			"error_detected",
>> -			report_error_detected);
>> -
>> -	if (severity == AER_FATAL) {
>> -		result = reset_link(dev);
>> -		if (result != PCI_ERS_RESULT_RECOVERED)
>> -			goto failed;
>> -	}
>> -
>> -	if (status == PCI_ERS_RESULT_CAN_RECOVER)
>> -		status = broadcast_error_message(dev,
>> -				state,
>> -				"mmio_enabled",
>> -				report_mmio_enabled);
>> -
>> -	if (status == PCI_ERS_RESULT_NEED_RESET) {
>> -		/*
>> -		 * TODO: Should call platform-specific
>> -		 * functions to reset slot before calling
>> -		 * drivers' slot_reset callbacks?
>> -		 */
>> -		status = broadcast_error_message(dev,
>> -				state,
>> -				"slot_reset",
>> -				report_slot_reset);
>> -	}
>> -
>> -	if (status != PCI_ERS_RESULT_RECOVERED)
>> -		goto failed;
>> -
>> -	broadcast_error_message(dev,
>> -				state,
>> -				"resume",
>> -				report_resume);
>> -
>> -	pci_info(dev, "AER: Device recovery successful\n");
>> -	return;
>> -
>> -failed:
>> -	pci_uevent_ers(dev, PCI_ERS_RESULT_DISCONNECT);
>> -	/* TODO: Should kernel panic here? */
>> -	pci_info(dev, "AER: Device recovery failed\n");
>> -}
>> +EXPORT_SYMBOL_GPL(pci_find_aer_service);
> 
> This is never called from a module, so I don't think you need to
> export it at all.  If you do, it should be a separate patch, not
> buried in the middle of this big one.
> 

got it, will see if its really required to be exported.
but certainly, will remove it from this patch.

>>  /**
>>   * handle_error_source - handle logging error into an event log
>> diff --git a/drivers/pci/pcie/pcie-err.c b/drivers/pci/pcie/pcie-err.c
>> new file mode 100644
>> index 0000000..fcd5add
>> --- /dev/null
>> +++ b/drivers/pci/pcie/pcie-err.c
>> @@ -0,0 +1,334 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * This file implements the error recovery as a core part of PCIe 
>> error reporting.
>> + * When a PCIe error is delivered, an error message will be collected 
>> and printed
>> + * to console, then, an error recovery procedure will be executed by 
>> following
>> + * the PCI error recovery rules.
> 
> Wrap this so it fits in 80 columns.

I thought of keeping the way it was before (hence did not change it)
I would change it now.

> 
>> + * Copyright (C) 2006 Intel Corp.
>> + *	Tom Long Nguyen (tom.l.nguyen@intel.com)
>> + *	Zhang Yanmin (yanmin.zhang@intel.com)
>> + *
>> + */
>> +
>> +#include <linux/pci.h>
>> +#include <linux/module.h>
>> +#include <linux/pci.h>
>> +#include <linux/kernel.h>
>> +#include <linux/errno.h>
>> +#include <linux/aer.h>
>> +#include <linux/pcieport_if.h>
>> +#include "portdrv.h"
>> +
>> +struct aer_broadcast_data {
>> +	enum pci_channel_state state;
>> +	enum pci_ers_result result;
>> +};
>> +
>> +static pci_ers_result_t merge_result(enum pci_ers_result orig,
>> +				  enum pci_ers_result new)
>> +{
>> +	if (new == PCI_ERS_RESULT_NO_AER_DRIVER)
>> +		return PCI_ERS_RESULT_NO_AER_DRIVER;
>> +
>> +	if (new == PCI_ERS_RESULT_NONE)
>> +		return orig;
>> +
>> +	switch (orig) {
>> +	case PCI_ERS_RESULT_CAN_RECOVER:
>> +	case PCI_ERS_RESULT_RECOVERED:
>> +		orig = new;
>> +		break;
>> +	case PCI_ERS_RESULT_DISCONNECT:
>> +		if (new == PCI_ERS_RESULT_NEED_RESET)
>> +			orig = PCI_ERS_RESULT_NEED_RESET;
>> +		break;
>> +	default:
>> +		break;
>> +	}
>> +
>> +	return orig;
>> +}
>> +
>> +static int report_mmio_enabled(struct pci_dev *dev, void *data)
>> +{
>> +	pci_ers_result_t vote;
>> +	const struct pci_error_handlers *err_handler;
>> +	struct aer_broadcast_data *result_data;
>> +
>> +	result_data = (struct aer_broadcast_data *) data;
>> +
>> +	device_lock(&dev->dev);
>> +	if (!dev->driver ||
>> +		!dev->driver->err_handler ||
>> +		!dev->driver->err_handler->mmio_enabled)
>> +		goto out;
>> +
>> +	err_handler = dev->driver->err_handler;
>> +	vote = err_handler->mmio_enabled(dev);
>> +	result_data->result = merge_result(result_data->result, vote);
>> +out:
>> +	device_unlock(&dev->dev);
>> +	return 0;
>> +}
>> +
>> +static int report_slot_reset(struct pci_dev *dev, void *data)
>> +{
>> +	pci_ers_result_t vote;
>> +	const struct pci_error_handlers *err_handler;
>> +	struct aer_broadcast_data *result_data;
>> +
>> +	result_data = (struct aer_broadcast_data *) data;
>> +
>> +	device_lock(&dev->dev);
>> +	if (!dev->driver ||
>> +		!dev->driver->err_handler ||
>> +		!dev->driver->err_handler->slot_reset)
>> +		goto out;
>> +
>> +	err_handler = dev->driver->err_handler;
>> +	vote = err_handler->slot_reset(dev);
>> +	result_data->result = merge_result(result_data->result, vote);
>> +out:
>> +	device_unlock(&dev->dev);
>> +	return 0;
>> +}
>> +
>> +static int report_resume(struct pci_dev *dev, void *data)
>> +{
>> +	const struct pci_error_handlers *err_handler;
>> +
>> +	device_lock(&dev->dev);
>> +	dev->error_state = pci_channel_io_normal;
>> +
>> +	if (!dev->driver ||
>> +		!dev->driver->err_handler ||
>> +		!dev->driver->err_handler->resume)
>> +		goto out;
>> +
>> +	err_handler = dev->driver->err_handler;
>> +	err_handler->resume(dev);
>> +out:
>> +	device_unlock(&dev->dev);
>> +	return 0;
>> +}
>> +
>> +static int report_error_detected(struct pci_dev *dev, void *data)
>> +{
>> +	pci_ers_result_t vote;
>> +	const struct pci_error_handlers *err_handler;
>> +	struct aer_broadcast_data *result_data;
>> +
>> +	result_data = (struct aer_broadcast_data *) data;
>> +
>> +	device_lock(&dev->dev);
>> +	dev->error_state = result_data->state;
>> +
>> +	if (!dev->driver ||
>> +		!dev->driver->err_handler ||
>> +		!dev->driver->err_handler->error_detected) {
>> +		if (result_data->state == pci_channel_io_frozen &&
>> +			dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
>> +			/*
>> +			 * In case of fatal recovery, if one of down-
>> +			 * stream device has no driver. We might be
>> +			 * unable to recover because a later insmod
>> +			 * of a driver for this device is unaware of
>> +			 * its hw state.
>> +			 */
>> +			dev_printk(KERN_DEBUG, &dev->dev, "device has %s\n",
>> +				   dev->driver ?
>> +				   "no error-aware driver" : "no driver");
> 
> This was a pci_printk() before you moved it and it should be the same 
> here.
> 

sure, will correct this.

>> +		}
>> +
>> +		/*
>> +		 * If there's any device in the subtree that does not
>> +		 * have an error_detected callback, returning
>> +		 * PCI_ERS_RESULT_NO_AER_DRIVER prevents calling of
>> +		 * the subsequent mmio_enabled/slot_reset/resume
>> +		 * callbacks of "any" device in the subtree. All the
>> +		 * devices in the subtree are left in the error state
>> +		 * without recovery.
>> +		 */
>> +
>> +		if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE)
>> +			vote = PCI_ERS_RESULT_NO_AER_DRIVER;
>> +		else
>> +			vote = PCI_ERS_RESULT_NONE;
>> +	} else {
>> +		err_handler = dev->driver->err_handler;
>> +		vote = err_handler->error_detected(dev, result_data->state);
>> +	}
>> +
>> +	result_data->result = merge_result(result_data->result, vote);
>> +	device_unlock(&dev->dev);
>> +	return 0;
>> +}
>> +
>> +/**
>> + * default_reset_link - default reset function
>> + * @dev: pointer to pci_dev data structure
>> + *
>> + * Invoked when performing link reset on a Downstream Port or a
>> + * Root Port with no aer driver.
>> + */
>> +static pci_ers_result_t default_reset_link(struct pci_dev *dev)
>> +{
>> +	pci_reset_bridge_secondary_bus(dev);
>> +	dev_printk(KERN_DEBUG, &dev->dev, "downstream link has been 
>> reset\n");
> 
> This should be a pci_printk() as it was before the move.  There are 
> more
> below.
> 

yes all will be corrected.
thanks.

>> +	return PCI_ERS_RESULT_RECOVERED;
>> +}
>> +
>> +static pci_ers_result_t reset_link(struct pci_dev *dev)
>> +{
>> +	struct pci_dev *udev;
>> +	pci_ers_result_t status;
>> +	struct pcie_port_service_driver *driver = NULL;
>> +
>> +	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
>> +		/* Reset this port for all subordinates */
>> +		udev = dev;
>> +	} else {
>> +		/* Reset the upstream component (likely downstream port) */
>> +		udev = dev->bus->self;
>> +	}
>> +
>> +#if IS_ENABLED(CONFIG_PCIEAER)
> 
> AER can't be a module, so you can use just:
> 
>   #ifdef CONFIG_PCIEAER
> 
> This ifdef should be added in the patch where you add a caller from 
> non-AER
> code.  This patch should only move code, not change it.

ok, it can remain unchanged. but reset_link() is called by 
pcie_do_recovery()
and pcie_do_recovery can be called by various agents such as AER, DPC.
so let us say if DPC calls pcie_do_recovery, then DPC has no way of 
knowing that AER is enabled or not.
in fact it should not know, but err.c/reset_link() should take care 
somehow.

I can make it a separate patch to do #ifdef CONFIG_PCIEAER inside 
reset_link()
or
I can add severity parameter in reset_link() so based on severity it can 
find the service.

but I think you have comment to unify the find_aer_service and 
find_dpc_service into a pcie_find_service (routine)
so I will see how I can club and take care of this comment. [without the 
need of #ifdef]

> 
>> +	/* Use the aer driver of the component firstly */
>> +	driver = pci_find_aer_service(udev);
>> +#endif
>> +
>> +	if (driver && driver->reset_link) {
>> +		status = driver->reset_link(udev);
>> +	} else if (udev->has_secondary_link) {
>> +		status = default_reset_link(udev);
>> +	} else {
>> +		dev_printk(KERN_DEBUG, &dev->dev,
>> +			"no link-reset support at upstream device %s\n",
>> +			pci_name(udev));
>> +		return PCI_ERS_RESULT_DISCONNECT;
>> +	}
>> +
>> +	if (status != PCI_ERS_RESULT_RECOVERED) {
>> +		dev_printk(KERN_DEBUG, &dev->dev,
>> +			"link reset at upstream device %s failed\n",
>> +			pci_name(udev));
>> +		return PCI_ERS_RESULT_DISCONNECT;
>> +	}
>> +
>> +	return status;
>> +}
>> +
>> +/**
>> + * broadcast_error_message - handle message broadcast to downstream 
>> drivers
>> + * @dev: pointer to where in a hierarchy message is broadcasted down
>> + * @state: error state
>> + * @error_mesg: message to print
>> + * @cb: callback to be broadcast
>> + *
>> + * Invoked during error recovery process. Once being invoked, the 
>> content
>> + * of error severity will be broadcast to all downstream drivers in a
>> + * hierarchy in question.
>> + */
>> +static pci_ers_result_t broadcast_error_message(struct pci_dev *dev,
>> +	enum pci_channel_state state,
>> +	char *error_mesg,
>> +	int (*cb)(struct pci_dev *, void *))
>> +{
>> +	struct aer_broadcast_data result_data;
>> +
>> +	dev_printk(KERN_DEBUG, &dev->dev, "broadcast %s message\n", 
>> error_mesg);
>> +	result_data.state = state;
>> +	if (cb == report_error_detected)
>> +		result_data.result = PCI_ERS_RESULT_CAN_RECOVER;
>> +	else
>> +		result_data.result = PCI_ERS_RESULT_RECOVERED;
>> +
>> +	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
>> +		/*
>> +		 * If the error is reported by a bridge, we think this error
>> +		 * is related to the downstream link of the bridge, so we
>> +		 * do error recovery on all subordinates of the bridge instead
>> +		 * of the bridge and clear the error status of the bridge.
>> +		 */
>> +		if (cb == report_error_detected)
>> +			dev->error_state = state;
>> +		pci_walk_bus(dev->subordinate, cb, &result_data);
>> +		if (cb == report_resume) {
>> +			pci_cleanup_aer_uncorrect_error_status(dev);
>> +			dev->error_state = pci_channel_io_normal;
>> +		}
>> +	} else {
>> +		/*
>> +		 * If the error is reported by an end point, we think this
>> +		 * error is related to the upstream link of the end point.
>> +		 */
>> +		pci_walk_bus(dev->bus, cb, &result_data);
>> +	}
>> +
>> +	return result_data.result;
>> +}
>> +
>> +/**
>> + * pcie_do_recovery - handle nonfatal/fatal error recovery process
>> + * @dev: pointer to a pci_dev data structure of agent detecting an 
>> error
>> + * @severity: error severity type
>> + *
>> + * Invoked when an error is nonfatal/fatal. Once being invoked, 
>> broadcast
>> + * error detected message to all downstream drivers within a 
>> hierarchy in
>> + * question and return the returned code.
>> + */
>> +void pcie_do_recovery(struct pci_dev *dev, int severity)
>> +{
>> +	pci_ers_result_t status, result = PCI_ERS_RESULT_RECOVERED;
>> +	enum pci_channel_state state;
>> +
>> +	if (severity == AER_FATAL)
>> +		state = pci_channel_io_frozen;
>> +	else
>> +		state = pci_channel_io_normal;
>> +
>> +	status = broadcast_error_message(dev,
>> +			state,
>> +			"error_detected",
>> +			report_error_detected);
>> +
>> +	if (severity == AER_FATAL) {
>> +		result = reset_link(dev);
>> +		if (result != PCI_ERS_RESULT_RECOVERED)
>> +			goto failed;
>> +	}
>> +
>> +	if (status == PCI_ERS_RESULT_CAN_RECOVER)
>> +		status = broadcast_error_message(dev,
>> +				state,
>> +				"mmio_enabled",
>> +				report_mmio_enabled);
>> +
>> +	if (status == PCI_ERS_RESULT_NEED_RESET) {
>> +		/*
>> +		 * TODO: Should call platform-specific
>> +		 * functions to reset slot before calling
>> +		 * drivers' slot_reset callbacks?
>> +		 */
>> +		status = broadcast_error_message(dev,
>> +				state,
>> +				"slot_reset",
>> +				report_slot_reset);
>> +	}
>> +
>> +	if (status != PCI_ERS_RESULT_RECOVERED)
>> +		goto failed;
>> +
>> +	broadcast_error_message(dev,
>> +				state,
>> +				"resume",
>> +				report_resume);
>> +
>> +	dev_info(&dev->dev, "Device recovery successful\n");
>> +	return;
>> +
>> +failed:
>> +	/* TODO: Should kernel panic here? */
>> +	dev_info(&dev->dev, "Device recovery failed\n");
>> +}
>> diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
>> index a854bc5..4f1992d 100644
>> --- a/drivers/pci/pcie/portdrv.h
>> +++ b/drivers/pci/pcie/portdrv.h
>> @@ -79,4 +79,5 @@ static inline void pcie_port_platform_notify(struct 
>> pci_dev *port, int *mask)
>>  static inline void pcie_port_platform_notify(struct pci_dev *port, 
>> int *mask){}
>>  #endif /* !CONFIG_ACPI */
>> 
>> +struct pcie_port_service_driver *pci_find_aer_service(struct pci_dev 
>> *dev);
> 
> Should be in a different patch, maybe the one where you rename it.
> 
>>  #endif /* _PORTDRV_H_ */
>> --
>> Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
>> Technologies, Inc.,
>> a Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a 
>> Linux Foundation Collaborative Project.
>> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v11 2/7] PCI/AER: factor out error reporting from AER
  2018-02-26  5:32     ` poza
@ 2018-02-26  5:39       ` poza
  2018-02-26 20:23       ` Bjorn Helgaas
  1 sibling, 0 replies; 22+ messages in thread
From: poza @ 2018-02-26  5:39 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi,
	linux-pci-owner

On 2018-02-26 11:02, poza@codeaurora.org wrote:
> On 2018-02-24 05:12, Bjorn Helgaas wrote:
>> On Fri, Feb 23, 2018 at 01:53:59PM +0530, Oza Pawandeep wrote:
>>> This patch factors out error reporting callbacks, which are currently
>>> tightly coupled with AER.
>> 
>> Add blank line between paragraphs.
>> 
>>> DPC should be able to register callbacks and attmept recovery when 
>>> DPC
>>> trigger event occurs.
>> 
>> s/attmept/attempt/
>> 
>> This patch basically moves code from aerdrv_core.c to pcie-err.c.  
>> Make it
>> do *only* that.
>> 
> 
> sure.
> 
>>> Signed-off-by: Oza Pawandeep <poza@codeaurora.org>
>>> 
>>> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
>>> index fcd8191..abc514e 100644
>>> --- a/drivers/pci/pci.h
>>> +++ b/drivers/pci/pci.h
>>> @@ -342,6 +342,9 @@ static inline resource_size_t 
>>> pci_resource_alignment(struct pci_dev *dev,
>>> 
>>>  void pci_enable_acs(struct pci_dev *dev);
>>> 
>>> +/* PCI error reporting and recovery */
>>> +void pcie_do_recovery(struct pci_dev *dev, int severity);
>> 
>> Add this declaration in the first patch, the one where you rename the
>> function.
>> 
> 
> done.
> 
>>>  #ifdef CONFIG_PCIEASPM
>>>  void pcie_aspm_init_link_state(struct pci_dev *pdev);
>>>  void pcie_aspm_exit_link_state(struct pci_dev *pdev);
>>> diff --git a/drivers/pci/pcie/Makefile b/drivers/pci/pcie/Makefile
>>> index 223e4c3..d669497 100644
>>> --- a/drivers/pci/pcie/Makefile
>>> +++ b/drivers/pci/pcie/Makefile
>>> @@ -6,7 +6,7 @@
>>>  # Build PCI Express ASPM if needed
>>>  obj-$(CONFIG_PCIEASPM)		+= aspm.o
>>> 
>>> -pcieportdrv-y			:= portdrv_core.o portdrv_pci.o portdrv_bus.o
>>> +pcieportdrv-y			:= portdrv_core.o portdrv_pci.o portdrv_bus.o 
>>> pcie-err.o
>> 
>> Can we name this just "drivers/pci/pcie/err.c"?  I know we have
>> pcie-dpc.c already, but it does get a little repetitious to type
>> "pci" THREE times in that path.
>> 
> 
> sure, will rename.
> 
>>>  pcieportdrv-$(CONFIG_ACPI)	+= portdrv_acpi.o
>>> 
>>>  obj-$(CONFIG_PCIEPORTBUS)	+= pcieportdrv.o
>>> diff --git a/drivers/pci/pcie/aer/aerdrv.h 
>>> b/drivers/pci/pcie/aer/aerdrv.h
>>> index 5449e5c..bc9db53 100644
>>> --- a/drivers/pci/pcie/aer/aerdrv.h
>>> +++ b/drivers/pci/pcie/aer/aerdrv.h
>>> @@ -76,36 +76,6 @@ struct aer_rpc {
>>>  					 */
>>>  };
>>> 
>>> -struct aer_broadcast_data {
>>> -	enum pci_channel_state state;
>>> -	enum pci_ers_result result;
>>> -};
>>> -
>>> -static inline pci_ers_result_t merge_result(enum pci_ers_result 
>>> orig,
>>> -		enum pci_ers_result new)
>>> -{
>>> -	if (new == PCI_ERS_RESULT_NO_AER_DRIVER)
>>> -		return PCI_ERS_RESULT_NO_AER_DRIVER;
>>> -
>>> -	if (new == PCI_ERS_RESULT_NONE)
>>> -		return orig;
>>> -
>>> -	switch (orig) {
>>> -	case PCI_ERS_RESULT_CAN_RECOVER:
>>> -	case PCI_ERS_RESULT_RECOVERED:
>>> -		orig = new;
>>> -		break;
>>> -	case PCI_ERS_RESULT_DISCONNECT:
>>> -		if (new == PCI_ERS_RESULT_NEED_RESET)
>>> -			orig = PCI_ERS_RESULT_NEED_RESET;
>>> -		break;
>>> -	default:
>>> -		break;
>>> -	}
>>> -
>>> -	return orig;
>>> -}
>>> -
>>>  extern struct bus_type pcie_port_bus_type;
>>>  void aer_isr(struct work_struct *work);
>>>  void aer_print_error(struct pci_dev *dev, struct aer_err_info 
>>> *info);
>>> diff --git a/drivers/pci/pcie/aer/aerdrv_core.c 
>>> b/drivers/pci/pcie/aer/aerdrv_core.c
>>> index aeb83a0..f60b1bb 100644
>>> --- a/drivers/pci/pcie/aer/aerdrv_core.c
>>> +++ b/drivers/pci/pcie/aer/aerdrv_core.c
>>> @@ -23,6 +23,7 @@
>>>  #include <linux/slab.h>
>>>  #include <linux/kfifo.h>
>>>  #include "aerdrv.h"
>>> +#include "../../pci.h"
>>> 
>>>  #define	PCI_EXP_AER_FLAGS	(PCI_EXP_DEVCTL_CERE | 
>>> PCI_EXP_DEVCTL_NFERE | \
>>>  				 PCI_EXP_DEVCTL_FERE | PCI_EXP_DEVCTL_URRE)
>>> @@ -230,191 +231,6 @@ static bool find_source_device(struct pci_dev 
>>> *parent,
>>>  	return true;
>>>  }
>>> 
>>> -static int report_error_detected(struct pci_dev *dev, void *data)
>>> -{
>>> -	pci_ers_result_t vote;
>>> -	const struct pci_error_handlers *err_handler;
>>> -	struct aer_broadcast_data *result_data;
>>> -	result_data = (struct aer_broadcast_data *) data;
>>> -
>>> -	device_lock(&dev->dev);
>>> -	dev->error_state = result_data->state;
>>> -
>>> -	if (!dev->driver ||
>>> -		!dev->driver->err_handler ||
>>> -		!dev->driver->err_handler->error_detected) {
>>> -		if (result_data->state == pci_channel_io_frozen &&
>>> -			dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
>>> -			/*
>>> -			 * In case of fatal recovery, if one of down-
>>> -			 * stream device has no driver. We might be
>>> -			 * unable to recover because a later insmod
>>> -			 * of a driver for this device is unaware of
>>> -			 * its hw state.
>>> -			 */
>>> -			pci_printk(KERN_DEBUG, dev, "device has %s\n",
>>> -				   dev->driver ?
>>> -				   "no AER-aware driver" : "no driver");
>>> -		}
>>> -
>>> -		/*
>>> -		 * If there's any device in the subtree that does not
>>> -		 * have an error_detected callback, returning
>>> -		 * PCI_ERS_RESULT_NO_AER_DRIVER prevents calling of
>>> -		 * the subsequent mmio_enabled/slot_reset/resume
>>> -		 * callbacks of "any" device in the subtree. All the
>>> -		 * devices in the subtree are left in the error state
>>> -		 * without recovery.
>>> -		 */
>>> -
>>> -		if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE)
>>> -			vote = PCI_ERS_RESULT_NO_AER_DRIVER;
>>> -		else
>>> -			vote = PCI_ERS_RESULT_NONE;
>>> -	} else {
>>> -		err_handler = dev->driver->err_handler;
>>> -		vote = err_handler->error_detected(dev, result_data->state);
>>> -		pci_uevent_ers(dev, PCI_ERS_RESULT_NONE);
>>> -	}
>>> -
>>> -	result_data->result = merge_result(result_data->result, vote);
>>> -	device_unlock(&dev->dev);
>>> -	return 0;
>>> -}
>>> -
>>> -static int report_mmio_enabled(struct pci_dev *dev, void *data)
>>> -{
>>> -	pci_ers_result_t vote;
>>> -	const struct pci_error_handlers *err_handler;
>>> -	struct aer_broadcast_data *result_data;
>>> -	result_data = (struct aer_broadcast_data *) data;
>>> -
>>> -	device_lock(&dev->dev);
>>> -	if (!dev->driver ||
>>> -		!dev->driver->err_handler ||
>>> -		!dev->driver->err_handler->mmio_enabled)
>>> -		goto out;
>>> -
>>> -	err_handler = dev->driver->err_handler;
>>> -	vote = err_handler->mmio_enabled(dev);
>>> -	result_data->result = merge_result(result_data->result, vote);
>>> -out:
>>> -	device_unlock(&dev->dev);
>>> -	return 0;
>>> -}
>>> -
>>> -static int report_slot_reset(struct pci_dev *dev, void *data)
>>> -{
>>> -	pci_ers_result_t vote;
>>> -	const struct pci_error_handlers *err_handler;
>>> -	struct aer_broadcast_data *result_data;
>>> -	result_data = (struct aer_broadcast_data *) data;
>>> -
>>> -	device_lock(&dev->dev);
>>> -	if (!dev->driver ||
>>> -		!dev->driver->err_handler ||
>>> -		!dev->driver->err_handler->slot_reset)
>>> -		goto out;
>>> -
>>> -	err_handler = dev->driver->err_handler;
>>> -	vote = err_handler->slot_reset(dev);
>>> -	result_data->result = merge_result(result_data->result, vote);
>>> -out:
>>> -	device_unlock(&dev->dev);
>>> -	return 0;
>>> -}
>>> -
>>> -static int report_resume(struct pci_dev *dev, void *data)
>>> -{
>>> -	const struct pci_error_handlers *err_handler;
>>> -
>>> -	device_lock(&dev->dev);
>>> -	dev->error_state = pci_channel_io_normal;
>>> -
>>> -	if (!dev->driver ||
>>> -		!dev->driver->err_handler ||
>>> -		!dev->driver->err_handler->resume)
>>> -		goto out;
>>> -
>>> -	err_handler = dev->driver->err_handler;
>>> -	err_handler->resume(dev);
>>> -	pci_uevent_ers(dev, PCI_ERS_RESULT_RECOVERED);
>>> -out:
>>> -	device_unlock(&dev->dev);
>>> -	return 0;
>>> -}
>>> -
>>> -/**
>>> - * broadcast_error_message - handle message broadcast to downstream 
>>> drivers
>>> - * @dev: pointer to from where in a hierarchy message is broadcasted 
>>> down
>>> - * @state: error state
>>> - * @error_mesg: message to print
>>> - * @cb: callback to be broadcasted
>>> - *
>>> - * Invoked during error recovery process. Once being invoked, the 
>>> content
>>> - * of error severity will be broadcasted to all downstream drivers 
>>> in a
>>> - * hierarchy in question.
>>> - */
>>> -static pci_ers_result_t broadcast_error_message(struct pci_dev *dev,
>>> -	enum pci_channel_state state,
>>> -	char *error_mesg,
>>> -	int (*cb)(struct pci_dev *, void *))
>>> -{
>>> -	struct aer_broadcast_data result_data;
>>> -
>>> -	pci_printk(KERN_DEBUG, dev, "broadcast %s message\n", error_mesg);
>>> -	result_data.state = state;
>>> -	if (cb == report_error_detected)
>>> -		result_data.result = PCI_ERS_RESULT_CAN_RECOVER;
>>> -	else
>>> -		result_data.result = PCI_ERS_RESULT_RECOVERED;
>>> -
>>> -	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
>>> -		/*
>>> -		 * If the error is reported by a bridge, we think this error
>>> -		 * is related to the downstream link of the bridge, so we
>>> -		 * do error recovery on all subordinates of the bridge instead
>>> -		 * of the bridge and clear the error status of the bridge.
>>> -		 */
>>> -		if (cb == report_error_detected)
>>> -			dev->error_state = state;
>>> -		pci_walk_bus(dev->subordinate, cb, &result_data);
>>> -		if (cb == report_resume) {
>>> -			pci_cleanup_aer_uncorrect_error_status(dev);
>>> -			dev->error_state = pci_channel_io_normal;
>>> -		}
>>> -	} else {
>>> -		/*
>>> -		 * If the error is reported by an end point, we think this
>>> -		 * error is related to the upstream link of the end point.
>>> -		 */
>>> -		if (state == pci_channel_io_normal)
>>> -			/*
>>> -			 * the error is non fatal so the bus is ok, just invoke
>>> -			 * the callback for the function that logged the error.
>>> -			 */
>>> -			cb(dev, &result_data);
>>> -		else
>>> -			pci_walk_bus(dev->bus, cb, &result_data);
>>> -	}
>>> -
>>> -	return result_data.result;
>>> -}
>>> -
>>> -/**
>>> - * default_reset_link - default reset function
>>> - * @dev: pointer to pci_dev data structure
>>> - *
>>> - * Invoked when performing link reset on a Downstream Port or a
>>> - * Root Port with no aer driver.
>>> - */
>>> -static pci_ers_result_t default_reset_link(struct pci_dev *dev)
>>> -{
>>> -	pci_reset_bridge_secondary_bus(dev);
>>> -	pci_printk(KERN_DEBUG, dev, "downstream link has been reset\n");
>>> -	return PCI_ERS_RESULT_RECOVERED;
>>> -}
>>> -
>>>  static int find_aer_service_iter(struct device *device, void *data)
>>>  {
>>>  	struct pcie_port_service_driver *service_driver, **drv;
>>> @@ -432,7 +248,7 @@ static int find_aer_service_iter(struct device 
>>> *device, void *data)
>>>  	return 0;
>>>  }
>>> 
>>> -static struct pcie_port_service_driver *find_aer_service(struct 
>>> pci_dev *dev)
>>> +struct pcie_port_service_driver *pci_find_aer_service(struct pci_dev 
>>> *dev)
>> 
>> Move this rename to a different patch.  The new name should probably
>> start with "pcie" like you did with pcie_do_recovery().
>> 
> 
> sure, will do that.
> 
>>>  {
>>>  	struct pcie_port_service_driver *drv = NULL;
>>> 
>>> @@ -440,107 +256,7 @@ static struct pcie_port_service_driver 
>>> *find_aer_service(struct pci_dev *dev)
>>> 
>>>  	return drv;
>>>  }
>>> -
>>> -static pci_ers_result_t reset_link(struct pci_dev *dev)
>>> -{
>>> -	struct pci_dev *udev;
>>> -	pci_ers_result_t status;
>>> -	struct pcie_port_service_driver *driver;
>>> -
>>> -	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
>>> -		/* Reset this port for all subordinates */
>>> -		udev = dev;
>>> -	} else {
>>> -		/* Reset the upstream component (likely downstream port) */
>>> -		udev = dev->bus->self;
>>> -	}
>>> -
>>> -	/* Use the aer driver of the component firstly */
>>> -	driver = find_aer_service(udev);
>>> -
>>> -	if (driver && driver->reset_link) {
>>> -		status = driver->reset_link(udev);
>>> -	} else if (udev->has_secondary_link) {
>>> -		status = default_reset_link(udev);
>>> -	} else {
>>> -		pci_printk(KERN_DEBUG, dev, "no link-reset support at upstream 
>>> device %s\n",
>>> -			pci_name(udev));
>>> -		return PCI_ERS_RESULT_DISCONNECT;
>>> -	}
>>> -
>>> -	if (status != PCI_ERS_RESULT_RECOVERED) {
>>> -		pci_printk(KERN_DEBUG, dev, "link reset at upstream device %s 
>>> failed\n",
>>> -			pci_name(udev));
>>> -		return PCI_ERS_RESULT_DISCONNECT;
>>> -	}
>>> -
>>> -	return status;
>>> -}
>>> -
>>> -/**
>>> - * pcie_do_recovery - handle nonfatal/fatal error recovery process
>>> - * @dev: pointer to a pci_dev data structure of agent detecting an 
>>> error
>>> - * @severity: error severity type
>>> - *
>>> - * Invoked when an error is nonfatal/fatal. Once being invoked, 
>>> broadcast
>>> - * error detected message to all downstream drivers within a 
>>> hierarchy in
>>> - * question and return the returned code.
>>> - */
>>> -static void pcie_do_recovery(struct pci_dev *dev, int severity)
>>> -{
>>> -	pci_ers_result_t status, result = PCI_ERS_RESULT_RECOVERED;
>>> -	enum pci_channel_state state;
>>> -
>>> -	if (severity == AER_FATAL)
>>> -		state = pci_channel_io_frozen;
>>> -	else
>>> -		state = pci_channel_io_normal;
>>> -
>>> -	status = broadcast_error_message(dev,
>>> -			state,
>>> -			"error_detected",
>>> -			report_error_detected);
>>> -
>>> -	if (severity == AER_FATAL) {
>>> -		result = reset_link(dev);
>>> -		if (result != PCI_ERS_RESULT_RECOVERED)
>>> -			goto failed;
>>> -	}
>>> -
>>> -	if (status == PCI_ERS_RESULT_CAN_RECOVER)
>>> -		status = broadcast_error_message(dev,
>>> -				state,
>>> -				"mmio_enabled",
>>> -				report_mmio_enabled);
>>> -
>>> -	if (status == PCI_ERS_RESULT_NEED_RESET) {
>>> -		/*
>>> -		 * TODO: Should call platform-specific
>>> -		 * functions to reset slot before calling
>>> -		 * drivers' slot_reset callbacks?
>>> -		 */
>>> -		status = broadcast_error_message(dev,
>>> -				state,
>>> -				"slot_reset",
>>> -				report_slot_reset);
>>> -	}
>>> -
>>> -	if (status != PCI_ERS_RESULT_RECOVERED)
>>> -		goto failed;
>>> -
>>> -	broadcast_error_message(dev,
>>> -				state,
>>> -				"resume",
>>> -				report_resume);
>>> -
>>> -	pci_info(dev, "AER: Device recovery successful\n");
>>> -	return;
>>> -
>>> -failed:
>>> -	pci_uevent_ers(dev, PCI_ERS_RESULT_DISCONNECT);
>>> -	/* TODO: Should kernel panic here? */
>>> -	pci_info(dev, "AER: Device recovery failed\n");
>>> -}
>>> +EXPORT_SYMBOL_GPL(pci_find_aer_service);
>> 
>> This is never called from a module, so I don't think you need to
>> export it at all.  If you do, it should be a separate patch, not
>> buried in the middle of this big one.
>> 
> 
> got it, will see if its really required to be exported.
> but certainly, will remove it from this patch.
> 
>>>  /**
>>>   * handle_error_source - handle logging error into an event log
>>> diff --git a/drivers/pci/pcie/pcie-err.c 
>>> b/drivers/pci/pcie/pcie-err.c
>>> new file mode 100644
>>> index 0000000..fcd5add
>>> --- /dev/null
>>> +++ b/drivers/pci/pcie/pcie-err.c
>>> @@ -0,0 +1,334 @@
>>> +// SPDX-License-Identifier: GPL-2.0
>>> +/*
>>> + * This file implements the error recovery as a core part of PCIe 
>>> error reporting.
>>> + * When a PCIe error is delivered, an error message will be 
>>> collected and printed
>>> + * to console, then, an error recovery procedure will be executed by 
>>> following
>>> + * the PCI error recovery rules.
>> 
>> Wrap this so it fits in 80 columns.
> 
> I thought of keeping the way it was before (hence did not change it)
> I would change it now.
> 
>> 
>>> + * Copyright (C) 2006 Intel Corp.
>>> + *	Tom Long Nguyen (tom.l.nguyen@intel.com)
>>> + *	Zhang Yanmin (yanmin.zhang@intel.com)
>>> + *
>>> + */
>>> +
>>> +#include <linux/pci.h>
>>> +#include <linux/module.h>
>>> +#include <linux/pci.h>
>>> +#include <linux/kernel.h>
>>> +#include <linux/errno.h>
>>> +#include <linux/aer.h>
>>> +#include <linux/pcieport_if.h>
>>> +#include "portdrv.h"
>>> +
>>> +struct aer_broadcast_data {
>>> +	enum pci_channel_state state;
>>> +	enum pci_ers_result result;
>>> +};
>>> +
>>> +static pci_ers_result_t merge_result(enum pci_ers_result orig,
>>> +				  enum pci_ers_result new)
>>> +{
>>> +	if (new == PCI_ERS_RESULT_NO_AER_DRIVER)
>>> +		return PCI_ERS_RESULT_NO_AER_DRIVER;
>>> +
>>> +	if (new == PCI_ERS_RESULT_NONE)
>>> +		return orig;
>>> +
>>> +	switch (orig) {
>>> +	case PCI_ERS_RESULT_CAN_RECOVER:
>>> +	case PCI_ERS_RESULT_RECOVERED:
>>> +		orig = new;
>>> +		break;
>>> +	case PCI_ERS_RESULT_DISCONNECT:
>>> +		if (new == PCI_ERS_RESULT_NEED_RESET)
>>> +			orig = PCI_ERS_RESULT_NEED_RESET;
>>> +		break;
>>> +	default:
>>> +		break;
>>> +	}
>>> +
>>> +	return orig;
>>> +}
>>> +
>>> +static int report_mmio_enabled(struct pci_dev *dev, void *data)
>>> +{
>>> +	pci_ers_result_t vote;
>>> +	const struct pci_error_handlers *err_handler;
>>> +	struct aer_broadcast_data *result_data;
>>> +
>>> +	result_data = (struct aer_broadcast_data *) data;
>>> +
>>> +	device_lock(&dev->dev);
>>> +	if (!dev->driver ||
>>> +		!dev->driver->err_handler ||
>>> +		!dev->driver->err_handler->mmio_enabled)
>>> +		goto out;
>>> +
>>> +	err_handler = dev->driver->err_handler;
>>> +	vote = err_handler->mmio_enabled(dev);
>>> +	result_data->result = merge_result(result_data->result, vote);
>>> +out:
>>> +	device_unlock(&dev->dev);
>>> +	return 0;
>>> +}
>>> +
>>> +static int report_slot_reset(struct pci_dev *dev, void *data)
>>> +{
>>> +	pci_ers_result_t vote;
>>> +	const struct pci_error_handlers *err_handler;
>>> +	struct aer_broadcast_data *result_data;
>>> +
>>> +	result_data = (struct aer_broadcast_data *) data;
>>> +
>>> +	device_lock(&dev->dev);
>>> +	if (!dev->driver ||
>>> +		!dev->driver->err_handler ||
>>> +		!dev->driver->err_handler->slot_reset)
>>> +		goto out;
>>> +
>>> +	err_handler = dev->driver->err_handler;
>>> +	vote = err_handler->slot_reset(dev);
>>> +	result_data->result = merge_result(result_data->result, vote);
>>> +out:
>>> +	device_unlock(&dev->dev);
>>> +	return 0;
>>> +}
>>> +
>>> +static int report_resume(struct pci_dev *dev, void *data)
>>> +{
>>> +	const struct pci_error_handlers *err_handler;
>>> +
>>> +	device_lock(&dev->dev);
>>> +	dev->error_state = pci_channel_io_normal;
>>> +
>>> +	if (!dev->driver ||
>>> +		!dev->driver->err_handler ||
>>> +		!dev->driver->err_handler->resume)
>>> +		goto out;
>>> +
>>> +	err_handler = dev->driver->err_handler;
>>> +	err_handler->resume(dev);
>>> +out:
>>> +	device_unlock(&dev->dev);
>>> +	return 0;
>>> +}
>>> +
>>> +static int report_error_detected(struct pci_dev *dev, void *data)
>>> +{
>>> +	pci_ers_result_t vote;
>>> +	const struct pci_error_handlers *err_handler;
>>> +	struct aer_broadcast_data *result_data;
>>> +
>>> +	result_data = (struct aer_broadcast_data *) data;
>>> +
>>> +	device_lock(&dev->dev);
>>> +	dev->error_state = result_data->state;
>>> +
>>> +	if (!dev->driver ||
>>> +		!dev->driver->err_handler ||
>>> +		!dev->driver->err_handler->error_detected) {
>>> +		if (result_data->state == pci_channel_io_frozen &&
>>> +			dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
>>> +			/*
>>> +			 * In case of fatal recovery, if one of down-
>>> +			 * stream device has no driver. We might be
>>> +			 * unable to recover because a later insmod
>>> +			 * of a driver for this device is unaware of
>>> +			 * its hw state.
>>> +			 */
>>> +			dev_printk(KERN_DEBUG, &dev->dev, "device has %s\n",
>>> +				   dev->driver ?
>>> +				   "no error-aware driver" : "no driver");
>> 
>> This was a pci_printk() before you moved it and it should be the same 
>> here.
>> 
> 
> sure, will correct this.
> 
>>> +		}
>>> +
>>> +		/*
>>> +		 * If there's any device in the subtree that does not
>>> +		 * have an error_detected callback, returning
>>> +		 * PCI_ERS_RESULT_NO_AER_DRIVER prevents calling of
>>> +		 * the subsequent mmio_enabled/slot_reset/resume
>>> +		 * callbacks of "any" device in the subtree. All the
>>> +		 * devices in the subtree are left in the error state
>>> +		 * without recovery.
>>> +		 */
>>> +
>>> +		if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE)
>>> +			vote = PCI_ERS_RESULT_NO_AER_DRIVER;
>>> +		else
>>> +			vote = PCI_ERS_RESULT_NONE;
>>> +	} else {
>>> +		err_handler = dev->driver->err_handler;
>>> +		vote = err_handler->error_detected(dev, result_data->state);
>>> +	}
>>> +
>>> +	result_data->result = merge_result(result_data->result, vote);
>>> +	device_unlock(&dev->dev);
>>> +	return 0;
>>> +}
>>> +
>>> +/**
>>> + * default_reset_link - default reset function
>>> + * @dev: pointer to pci_dev data structure
>>> + *
>>> + * Invoked when performing link reset on a Downstream Port or a
>>> + * Root Port with no aer driver.
>>> + */
>>> +static pci_ers_result_t default_reset_link(struct pci_dev *dev)
>>> +{
>>> +	pci_reset_bridge_secondary_bus(dev);
>>> +	dev_printk(KERN_DEBUG, &dev->dev, "downstream link has been 
>>> reset\n");
>> 
>> This should be a pci_printk() as it was before the move.  There are 
>> more
>> below.
>> 
> 
> yes all will be corrected.
> thanks.
> 
>>> +	return PCI_ERS_RESULT_RECOVERED;
>>> +}
>>> +
>>> +static pci_ers_result_t reset_link(struct pci_dev *dev)
>>> +{
>>> +	struct pci_dev *udev;
>>> +	pci_ers_result_t status;
>>> +	struct pcie_port_service_driver *driver = NULL;
>>> +
>>> +	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
>>> +		/* Reset this port for all subordinates */
>>> +		udev = dev;
>>> +	} else {
>>> +		/* Reset the upstream component (likely downstream port) */
>>> +		udev = dev->bus->self;
>>> +	}
>>> +
>>> +#if IS_ENABLED(CONFIG_PCIEAER)
>> 
>> AER can't be a module, so you can use just:
>> 
>>   #ifdef CONFIG_PCIEAER
>> 
>> This ifdef should be added in the patch where you add a caller from 
>> non-AER
>> code.  This patch should only move code, not change it.
> 
> ok, it can remain unchanged. but reset_link() is called by 
> pcie_do_recovery()
> and pcie_do_recovery can be called by various agents such as AER, DPC.
> so let us say if DPC calls pcie_do_recovery, then DPC has no way of
> knowing that AER is enabled or not.
> in fact it should not know, but err.c/reset_link() should take care 
> somehow.
> 
> I can make it a separate patch to do #ifdef CONFIG_PCIEAER inside 
> reset_link()
> or
> I can add severity parameter in reset_link() so based on severity it
> can find the service.
> 
> but I think you have comment to unify the find_aer_service and
> find_dpc_service into a pcie_find_service (routine)
> so I will see how I can club and take care of this comment. [without
> the need of #ifdef]
> 
>> 
>>> +	/* Use the aer driver of the component firstly */
>>> +	driver = pci_find_aer_service(udev);
>>> +#endif
>>> +
>>> +	if (driver && driver->reset_link) {
>>> +		status = driver->reset_link(udev);
>>> +	} else if (udev->has_secondary_link) {
>>> +		status = default_reset_link(udev);
>>> +	} else {
>>> +		dev_printk(KERN_DEBUG, &dev->dev,
>>> +			"no link-reset support at upstream device %s\n",
>>> +			pci_name(udev));
>>> +		return PCI_ERS_RESULT_DISCONNECT;
>>> +	}
>>> +
>>> +	if (status != PCI_ERS_RESULT_RECOVERED) {
>>> +		dev_printk(KERN_DEBUG, &dev->dev,
>>> +			"link reset at upstream device %s failed\n",
>>> +			pci_name(udev));
>>> +		return PCI_ERS_RESULT_DISCONNECT;
>>> +	}
>>> +
>>> +	return status;
>>> +}
>>> +
>>> +/**
>>> + * broadcast_error_message - handle message broadcast to downstream 
>>> drivers
>>> + * @dev: pointer to where in a hierarchy message is broadcasted down
>>> + * @state: error state
>>> + * @error_mesg: message to print
>>> + * @cb: callback to be broadcast
>>> + *
>>> + * Invoked during error recovery process. Once being invoked, the 
>>> content
>>> + * of error severity will be broadcast to all downstream drivers in 
>>> a
>>> + * hierarchy in question.
>>> + */
>>> +static pci_ers_result_t broadcast_error_message(struct pci_dev *dev,
>>> +	enum pci_channel_state state,
>>> +	char *error_mesg,
>>> +	int (*cb)(struct pci_dev *, void *))
>>> +{
>>> +	struct aer_broadcast_data result_data;
>>> +
>>> +	dev_printk(KERN_DEBUG, &dev->dev, "broadcast %s message\n", 
>>> error_mesg);
>>> +	result_data.state = state;
>>> +	if (cb == report_error_detected)
>>> +		result_data.result = PCI_ERS_RESULT_CAN_RECOVER;
>>> +	else
>>> +		result_data.result = PCI_ERS_RESULT_RECOVERED;
>>> +
>>> +	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
>>> +		/*
>>> +		 * If the error is reported by a bridge, we think this error
>>> +		 * is related to the downstream link of the bridge, so we
>>> +		 * do error recovery on all subordinates of the bridge instead
>>> +		 * of the bridge and clear the error status of the bridge.
>>> +		 */
>>> +		if (cb == report_error_detected)
>>> +			dev->error_state = state;
>>> +		pci_walk_bus(dev->subordinate, cb, &result_data);
>>> +		if (cb == report_resume) {
>>> +			pci_cleanup_aer_uncorrect_error_status(dev);
>>> +			dev->error_state = pci_channel_io_normal;
>>> +		}
>>> +	} else {
>>> +		/*
>>> +		 * If the error is reported by an end point, we think this
>>> +		 * error is related to the upstream link of the end point.
>>> +		 */
>>> +		pci_walk_bus(dev->bus, cb, &result_data);
>>> +	}
>>> +
>>> +	return result_data.result;
>>> +}
>>> +
>>> +/**
>>> + * pcie_do_recovery - handle nonfatal/fatal error recovery process
>>> + * @dev: pointer to a pci_dev data structure of agent detecting an 
>>> error
>>> + * @severity: error severity type
>>> + *
>>> + * Invoked when an error is nonfatal/fatal. Once being invoked, 
>>> broadcast
>>> + * error detected message to all downstream drivers within a 
>>> hierarchy in
>>> + * question and return the returned code.
>>> + */
>>> +void pcie_do_recovery(struct pci_dev *dev, int severity)
>>> +{
>>> +	pci_ers_result_t status, result = PCI_ERS_RESULT_RECOVERED;
>>> +	enum pci_channel_state state;
>>> +
>>> +	if (severity == AER_FATAL)
>>> +		state = pci_channel_io_frozen;
>>> +	else
>>> +		state = pci_channel_io_normal;
>>> +
>>> +	status = broadcast_error_message(dev,
>>> +			state,
>>> +			"error_detected",
>>> +			report_error_detected);
>>> +
>>> +	if (severity == AER_FATAL) {
>>> +		result = reset_link(dev);
>>> +		if (result != PCI_ERS_RESULT_RECOVERED)
>>> +			goto failed;
>>> +	}
>>> +
>>> +	if (status == PCI_ERS_RESULT_CAN_RECOVER)
>>> +		status = broadcast_error_message(dev,
>>> +				state,
>>> +				"mmio_enabled",
>>> +				report_mmio_enabled);
>>> +
>>> +	if (status == PCI_ERS_RESULT_NEED_RESET) {
>>> +		/*
>>> +		 * TODO: Should call platform-specific
>>> +		 * functions to reset slot before calling
>>> +		 * drivers' slot_reset callbacks?
>>> +		 */
>>> +		status = broadcast_error_message(dev,
>>> +				state,
>>> +				"slot_reset",
>>> +				report_slot_reset);
>>> +	}
>>> +
>>> +	if (status != PCI_ERS_RESULT_RECOVERED)
>>> +		goto failed;
>>> +
>>> +	broadcast_error_message(dev,
>>> +				state,
>>> +				"resume",
>>> +				report_resume);
>>> +
>>> +	dev_info(&dev->dev, "Device recovery successful\n");
>>> +	return;
>>> +
>>> +failed:
>>> +	/* TODO: Should kernel panic here? */
>>> +	dev_info(&dev->dev, "Device recovery failed\n");
>>> +}
>>> diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
>>> index a854bc5..4f1992d 100644
>>> --- a/drivers/pci/pcie/portdrv.h
>>> +++ b/drivers/pci/pcie/portdrv.h
>>> @@ -79,4 +79,5 @@ static inline void pcie_port_platform_notify(struct 
>>> pci_dev *port, int *mask)
>>>  static inline void pcie_port_platform_notify(struct pci_dev *port, 
>>> int *mask){}
>>>  #endif /* !CONFIG_ACPI */
>>> 
>>> +struct pcie_port_service_driver *pci_find_aer_service(struct pci_dev 
>>> *dev);
>> 
>> Should be in a different patch, maybe the one where you rename it.

this can not be in a different patch I suppose, because this patch would 
not compile saying

error: implicit declaration of function ‘find_aer_service’ 
[-Werror=implicit-function-declaration]
driver = find_aer_service(udev);

err.c calls find_aer_service() so it needs to find declaration 
somewhere.
Though I will make a separate patch renaming this as you suggested.

>> 
>>>  #endif /* _PORTDRV_H_ */
>>> --
>>> Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
>>> Technologies, Inc.,
>>> a Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a 
>>> Linux Foundation Collaborative Project.
>>> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v11 2/7] PCI/AER: factor out error reporting from AER
  2018-02-26  5:32     ` poza
  2018-02-26  5:39       ` poza
@ 2018-02-26 20:23       ` Bjorn Helgaas
  1 sibling, 0 replies; 22+ messages in thread
From: Bjorn Helgaas @ 2018-02-26 20:23 UTC (permalink / raw)
  To: poza
  Cc: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi

On Mon, Feb 26, 2018 at 11:02:50AM +0530, poza@codeaurora.org wrote:
> On 2018-02-24 05:12, Bjorn Helgaas wrote:
> > On Fri, Feb 23, 2018 at 01:53:59PM +0530, Oza Pawandeep wrote:

> > >   * handle_error_source - handle logging error into an event log
> > > diff --git a/drivers/pci/pcie/pcie-err.c b/drivers/pci/pcie/pcie-err.c
> > > new file mode 100644
> > > index 0000000..fcd5add
> > > --- /dev/null
> > > +++ b/drivers/pci/pcie/pcie-err.c
> > > @@ -0,0 +1,334 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +/*
> > > + * This file implements the error recovery as a core part of PCIe
> > > error reporting.
> > > + * When a PCIe error is delivered, an error message will be
> > > collected and printed
> > > + * to console, then, an error recovery procedure will be executed
> > > by following
> > > + * the PCI error recovery rules.
> > 
> > Wrap this so it fits in 80 columns.
> 
> I thought of keeping the way it was before (hence did not change it)
> I would change it now.

The original text fit in 80 columns, but you changed the text a little
bit as part of making this code more generic, which made it not fit
anymore.  Ideally I would leave the text the same in this patch that
only moves code, then update the text (and rewrap it) in the patch
that makes the code more generic.

> > > +static pci_ers_result_t reset_link(struct pci_dev *dev)
> > > +{
> > > +	struct pci_dev *udev;
> > > +	pci_ers_result_t status;
> > > +	struct pcie_port_service_driver *driver = NULL;
> > > +
> > > +	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
> > > +		/* Reset this port for all subordinates */
> > > +		udev = dev;
> > > +	} else {
> > > +		/* Reset the upstream component (likely downstream port) */
> > > +		udev = dev->bus->self;
> > > +	}
> > > +
> > > +#if IS_ENABLED(CONFIG_PCIEAER)
> > 
> > AER can't be a module, so you can use just:
> > 
> >   #ifdef CONFIG_PCIEAER
> > 
> > This ifdef should be added in the patch where you add a caller from
> > non-AER
> > code.  This patch should only move code, not change it.
> 
> ok, it can remain unchanged. but reset_link() is called by
> pcie_do_recovery()
> and pcie_do_recovery can be called by various agents such as AER, DPC.
> so let us say if DPC calls pcie_do_recovery, then DPC has no way of knowing
> that AER is enabled or not.
> in fact it should not know, but err.c/reset_link() should take care somehow.

If all you're doing is moving code, the functionality isn't changing
and you shouldn't need to add the ifdef.  At the point where you add a
new caller and the #ifdef becomes necessary, you can add it there.
Then it will make sense because we can connect the ifdef with the need
for it.

> I can make it a separate patch to do #ifdef CONFIG_PCIEAER inside
> reset_link()
> or
> I can add severity parameter in reset_link() so based on severity it can
> find the service.
> 
> but I think you have comment to unify the find_aer_service and
> find_dpc_service into a pcie_find_service (routine)
> so I will see how I can club and take care of this comment. [without the
> need of #ifdef]

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v11 3/7] PCI/ERR: add mutex to synchronize recovery
  2018-02-23 23:45   ` Bjorn Helgaas
@ 2018-02-27  5:16     ` poza
  2018-02-27 14:41       ` Bjorn Helgaas
  0 siblings, 1 reply; 22+ messages in thread
From: poza @ 2018-02-27  5:16 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi

On 2018-02-24 05:15, Bjorn Helgaas wrote:
> On Fri, Feb 23, 2018 at 01:54:00PM +0530, Oza Pawandeep wrote:
>> This patch protects pci_do_recovery with mutex.
> 
> pcie_do_recovery()
> 
> Please explain why the mutex is necessary.  What bad things happen
> without the mutex?
> 
> You named (some) of the other things "pcie"; maybe use "pcie" in the
> mutex name as well so they look the same.
> 

PCIe specification: 6.2.10
When DPC is triggered due to receipt of an uncorrectable error Message, 
the Requester ID from the Message is recorded in the DPC Error Source ID 
register and that Message is discarded and not forwarded Upstream.

So, having said that, what we think is we dont need Mutex, because in 
DPC enabled system either AER or DPC can be triggered, not both.
so right now there is no need of guarding pcie_do_recovery() with mutex.

but I was in a thought that; since pcie_do_recovery is supposed to be 
used by error clients,
from sw architecture point of view, adding mutex takes care of 
concurrency if it exists (in corner cases, faulty hw where both AER and 
DPC triggered etc..)

We can choose to drop this patch, since we dont require mutex.
Bjorn, please advise.


>> Signed-off-by: Oza Pawandeep <poza@codeaurora.org>
>> 
>> diff --git a/drivers/pci/pcie/pcie-err.c b/drivers/pci/pcie/pcie-err.c
>> index fcd5add..f830975 100644
>> --- a/drivers/pci/pcie/pcie-err.c
>> +++ b/drivers/pci/pcie/pcie-err.c
>> @@ -20,6 +20,8 @@
>>  #include <linux/pcieport_if.h>
>>  #include "portdrv.h"
>> 
>> +static DEFINE_MUTEX(pci_err_recovery_lock);
>> +
>>  struct aer_broadcast_data {
>>  	enum pci_channel_state state;
>>  	enum pci_ers_result result;
>> @@ -283,6 +285,8 @@ void pcie_do_recovery(struct pci_dev *dev, int 
>> severity)
>>  	pci_ers_result_t status, result = PCI_ERS_RESULT_RECOVERED;
>>  	enum pci_channel_state state;
>> 
>> +	mutex_lock(&pci_err_recovery_lock);
>> +
>>  	if (severity == AER_FATAL)
>>  		state = pci_channel_io_frozen;
>>  	else
>> @@ -326,9 +330,11 @@ void pcie_do_recovery(struct pci_dev *dev, int 
>> severity)
>>  				report_resume);
>> 
>>  	dev_info(&dev->dev, "Device recovery successful\n");
>> +	mutex_unlock(&pci_err_recovery_lock);
>>  	return;
>> 
>>  failed:
>>  	/* TODO: Should kernel panic here? */
>>  	dev_info(&dev->dev, "Device recovery failed\n");
>> +	mutex_unlock(&pci_err_recovery_lock);
>>  }
>> --
>> Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
>> Technologies, Inc.,
>> a Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a 
>> Linux Foundation Collaborative Project.
>> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v11 4/7] PCI/DPC: Unify and plumb error handling into DPC
  2018-02-24  0:07   ` Bjorn Helgaas
@ 2018-02-27  6:06     ` poza
  0 siblings, 0 replies; 22+ messages in thread
From: poza @ 2018-02-27  6:06 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi

On 2018-02-24 05:37, Bjorn Helgaas wrote:
> On Fri, Feb 23, 2018 at 01:54:01PM +0530, Oza Pawandeep wrote:
>> Current DPC driver does not do recovery, e.g. calling end-point's 
>> driver's
>> callbacks, which sanitize the sw.
>> 
>> DPC driver implements link_reset callback, and calls pcie_do_recovery.
> 
> s/pcie_do_recovery/pcie_do_recovery()/
> 

sure.

>> Signed-off-by: Oza Pawandeep <poza@codeaurora.org>
>> 
>> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
>> index abc514e..f8575da 100644
>> --- a/drivers/pci/pci.h
>> +++ b/drivers/pci/pci.h
>> @@ -343,6 +343,8 @@ static inline resource_size_t 
>> pci_resource_alignment(struct pci_dev *dev,
>>  void pci_enable_acs(struct pci_dev *dev);
>> 
>>  /* PCI error reporting and recovery */
>> +#define DPC_FATAL	4
> 
> This needs to go next to the AER_FATAL, etc., definitions because
> DPC_FATAL shares the namespace and they all need to have distinct
> values.  I can't tell from this patch whether they do or not.
> 

sure.

>>  void pcie_do_recovery(struct pci_dev *dev, int severity);
>> 
>>  #ifdef CONFIG_PCIEASPM
>> diff --git a/drivers/pci/pcie/pcie-dpc.c b/drivers/pci/pcie/pcie-dpc.c
>> index 38e40c6..5c01c63 100644
>> --- a/drivers/pci/pcie/pcie-dpc.c
>> +++ b/drivers/pci/pcie/pcie-dpc.c
>> @@ -13,6 +13,7 @@
>>  #include <linux/pcieport_if.h>
>>  #include "../pci.h"
>>  #include "aer/aerdrv.h"
>> +#include "portdrv.h"
>> 
>>  struct dpc_dev {
>>  	struct pcie_device	*dev;
>> @@ -45,6 +46,58 @@ struct dpc_dev {
>>  	"Memory Request Completion Timeout",		 /* Bit Position 18 */
>>  };
>> 
>> +static int find_dpc_dev_iter(struct device *device, void *data)
>> +{
>> +	struct pcie_port_service_driver *service_driver;
>> +	struct device **dev = (struct device **) data;;
>> +
>> +	if (device->bus == &pcie_port_bus_type && device->driver) {
>> +		service_driver = to_service_driver(device->driver);
>> +		if (service_driver->service == PCIE_PORT_SERVICE_DPC) {
>> +			*dev = device;
>> +			return 1;
>> +		}
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static struct device *pci_find_dpc_dev(struct pci_dev *pdev)
>> +{
>> +	struct device *dev = NULL;
>> +
>> +	device_for_each_child(&pdev->dev, &dev, find_dpc_dev_iter);
>> +
>> +	return dev;
>> +}
> 
> Ugh.  You're not responsible for this and you don't need to do
> anything, but hanging the struct dpc_dev off the struct pcie_device
> and then having to grub around like this to locate it from the pci_dev
> is just ... clunky.  OK, rant over, sorry :)
> 

:) I keep it for now.

>> +static int find_dpc_service_iter(struct device *device, void *data)
>> +{
>> +	struct pcie_port_service_driver *service_driver;
>> +	struct pcie_port_service_driver **drv =
>> +		(struct pcie_port_service_driver **) data;
>> +
>> +	if (device->bus == &pcie_port_bus_type && device->driver) {
>> +		service_driver = to_service_driver(device->driver);
>> +		if (service_driver->service == PCIE_PORT_SERVICE_DPC) {
>> +			*drv = service_driver;
>> +			return 1;
>> +		}
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +struct pcie_port_service_driver *pci_find_dpc_service(struct pci_dev 
>> *dev)
>> +{
>> +	struct pcie_port_service_driver *drv = NULL;
>> +
>> +	device_for_each_child(&dev->dev, &drv, find_dpc_service_iter);
>> +
>> +	return drv;
>> +}
>> +EXPORT_SYMBOL_GPL(pci_find_dpc_service);
> 
> No module uses this, so it doesn't need to be exported.
> 
> This is a clone of find_aer_service().  Can you add a preliminary patch 
> to
> make a generic "find service" interface that accepts the service type
> (PCIE_PORT_SERVICE_AER, PCIE_PORT_SERVICE_DPC) as a parameter?
> 
> This whole "find service" thing is ugly as sin.  You're not responsible 
> for
> cleaning it up, but maybe we can at least limit the proliferation of 
> it.
> 

I have taken care of making this as a generic find_Service in pcie port 
driver now.

>>  static int dpc_wait_rp_inactive(struct dpc_dev *dpc)
>>  {
>>  	unsigned long timeout = jiffies + HZ;
>> @@ -82,12 +135,25 @@ static void dpc_wait_link_inactive(struct dpc_dev 
>> *dpc)
>>  		dev_warn(dev, "Link state not disabled for DPC event\n");
>>  }
>> 
>> -static void dpc_work(struct work_struct *work)
>> +/**
>> + * dpc_reset_link - reset link DPC  routine
> 
> s/  / / (remove extra space)
sure.
> 
>> + * @dev: pointer to Root Port's pci_dev data structure
>> + *
>> + * Invoked by Port Bus driver when performing link reset at Root 
>> Port.
>> + */
>> +static pci_ers_result_t dpc_reset_link(struct pci_dev *pdev)
>>  {
>> -	struct dpc_dev *dpc = container_of(work, struct dpc_dev, work);
>> -	struct pci_dev *dev, *temp, *pdev = dpc->dev->port;
>>  	struct pci_bus *parent = pdev->subordinate;
>> -	u16 cap = dpc->cap_pos, ctl;
>> +	struct pci_dev *dev, *temp;
>> +	struct dpc_dev *dpc;
>> +	struct pcie_device *pciedev;
>> +	struct device *devdpc;
>> +	u16 cap, ctl;
>> +
>> +	devdpc = pci_find_dpc_dev(pdev);
>> +	pciedev = to_pcie_device(devdpc);
>> +	dpc = get_service_data(pciedev);
>> +	cap = dpc->cap_pos;
>> 
>>  	pci_lock_rescan_remove();
>>  	list_for_each_entry_safe_reverse(dev, temp, &parent->devices,
>> @@ -104,21 +170,31 @@ static void dpc_work(struct work_struct *work)
>> 
>>  	dpc_wait_link_inactive(dpc);
>>  	if (dpc->rp_extensions && dpc_wait_rp_inactive(dpc))
>> -		return;
>> +		return PCI_ERS_RESULT_DISCONNECT;
>>  	if (dpc->rp_extensions && dpc->rp_pio_status) {
>>  		pci_write_config_dword(pdev, cap + PCI_EXP_DPC_RP_PIO_STATUS,
>>  				       dpc->rp_pio_status);
>>  		dpc->rp_pio_status = 0;
>>  	}
>> 
>> -	pci_write_config_word(pdev, cap + PCI_EXP_DPC_STATUS,
>> +	pci_write_config_word(pdev, dpc->cap_pos + PCI_EXP_DPC_STATUS,
>>  		PCI_EXP_DPC_STATUS_TRIGGER | PCI_EXP_DPC_STATUS_INTERRUPT);
>> 
>>  	pci_read_config_word(pdev, cap + PCI_EXP_DPC_CTL, &ctl);
>>  	pci_write_config_word(pdev, cap + PCI_EXP_DPC_CTL,
>> -			      ctl | PCI_EXP_DPC_CTL_INT_EN);
>> +				ctl | PCI_EXP_DPC_CTL_INT_EN);
> 
> Align "ctl" with "pdev".

sure.
> 
>> +	return PCI_ERS_RESULT_RECOVERED;
>>  }
>> 
>> +static void dpc_work(struct work_struct *work)
>> +{
>> +	struct dpc_dev *dpc = container_of(work, struct dpc_dev, work);
>> +	struct pci_dev *pdev = dpc->dev->port;
>> +
>> +	/* From DPC point of view error is always FATAL. */
>> +	pcie_do_recovery(pdev, DPC_FATAL);
>> +}
>>  static void dpc_process_rp_pio_error(struct dpc_dev *dpc)
>>  {
>>  	struct device *dev = &dpc->dev->device;
>> @@ -297,6 +373,7 @@ static void dpc_remove(struct pcie_device *dev)
>>  	.service	= PCIE_PORT_SERVICE_DPC,
>>  	.probe		= dpc_probe,
>>  	.remove		= dpc_remove,
>> +	.reset_link     = dpc_reset_link,
>>  };
>> 
>>  static int __init dpc_service_init(void)
>> diff --git a/drivers/pci/pcie/pcie-err.c b/drivers/pci/pcie/pcie-err.c
>> index f830975..1ea4b9a 100644
>> --- a/drivers/pci/pcie/pcie-err.c
>> +++ b/drivers/pci/pcie/pcie-err.c
>> @@ -19,6 +19,7 @@
>>  #include <linux/aer.h>
>>  #include <linux/pcieport_if.h>
>>  #include "portdrv.h"
>> +#include "./../pci.h"
>> 
>>  static DEFINE_MUTEX(pci_err_recovery_lock);
>> 
>> @@ -181,7 +182,7 @@ static pci_ers_result_t default_reset_link(struct 
>> pci_dev *dev)
>>  	return PCI_ERS_RESULT_RECOVERED;
>>  }
>> 
>> -static pci_ers_result_t reset_link(struct pci_dev *dev)
>> +static pci_ers_result_t reset_link(struct pci_dev *dev, int severity)
>>  {
>>  	struct pci_dev *udev;
>>  	pci_ers_result_t status;
>> @@ -195,9 +196,17 @@ static pci_ers_result_t reset_link(struct pci_dev 
>> *dev)
>>  		udev = dev->bus->self;
>>  	}
>> 
>> +
>> +	/* Use the service driver of the component firstly */
>> +#if IS_ENABLED(CONFIG_PCIE_DPC)
> 
> #ifdef CONFIG_PCIE_DPC
> 
>> +	if (severity == DPC_FATAL)
>> +		driver = pci_find_dpc_service(udev);
>> +#endif
>>  #if IS_ENABLED(CONFIG_PCIEAER)
>> -	/* Use the aer driver of the component firstly */
>> -	driver = pci_find_aer_service(udev);
>> +	if (severity == AER_FATAL ||
>> +	    severity == AER_NONFATAL ||
>> +	    severity == AER_CORRECTABLE)
> 
> This change (to check for AER_FATAL, etc) looks like it belongs in a
> different patch.  This patch doesn't change any places that set the
> severity.

I have made generic service and probably now it does not look like this.

> 
>> +		driver = pci_find_aer_service(udev);
>>  #endif
>> 
>>  	if (driver && driver->reset_link) {
>> @@ -287,7 +296,8 @@ void pcie_do_recovery(struct pci_dev *dev, int 
>> severity)
>> 
>>  	mutex_lock(&pci_err_recovery_lock);
>> 
>> -	if (severity == AER_FATAL)
>> +	if (severity == AER_FATAL ||
>> +	    severity == DPC_FATAL)
>>  		state = pci_channel_io_frozen;
>>  	else
>>  		state = pci_channel_io_normal;
>> @@ -297,10 +307,14 @@ void pcie_do_recovery(struct pci_dev *dev, int 
>> severity)
>>  			"error_detected",
>>  			report_error_detected);
>> 
>> -	if (severity == AER_FATAL) {
>> -		result = reset_link(dev);
>> +	if (severity == AER_FATAL ||
>> +	    severity == DPC_FATAL) {
>> +		result = reset_link(dev, severity);
>>  		if (result != PCI_ERS_RESULT_RECOVERED)
>>  			goto failed;
>> +		else if (severity == DPC_FATAL)
>> +			goto resume;
>> +
>>  	}
>> 
>>  	if (status == PCI_ERS_RESULT_CAN_RECOVER)
>> @@ -324,6 +338,7 @@ void pcie_do_recovery(struct pci_dev *dev, int 
>> severity)
>>  	if (status != PCI_ERS_RESULT_RECOVERED)
>>  		goto failed;
>> 
>> +resume:
>>  	broadcast_error_message(dev,
>>  				state,
>>  				"resume",
>> diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
>> index 4f1992d..b013e24 100644
>> --- a/drivers/pci/pcie/portdrv.h
>> +++ b/drivers/pci/pcie/portdrv.h
>> @@ -80,4 +80,5 @@ static inline void pcie_port_platform_notify(struct 
>> pci_dev *port, int *mask){}
>>  #endif /* !CONFIG_ACPI */
>> 
>>  struct pcie_port_service_driver *pci_find_aer_service(struct pci_dev 
>> *dev);
>> +struct pcie_port_service_driver *pci_find_dpc_service(struct pci_dev 
>> *dev);
>>  #endif /* _PORTDRV_H_ */
>> --
>> Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
>> Technologies, Inc.,
>> a Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a 
>> Linux Foundation Collaborative Project.
>> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v11 5/7] PCI/AER: Unify aer error defines at single space
  2018-02-24 15:36   ` Bjorn Helgaas
@ 2018-02-27  6:12     ` poza
  0 siblings, 0 replies; 22+ messages in thread
From: poza @ 2018-02-27  6:12 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi

On 2018-02-24 21:06, Bjorn Helgaas wrote:
> On Fri, Feb 23, 2018 at 01:54:02PM +0530, Oza Pawandeep wrote:
>> This patch moves AER error defines to drivers/pci/pci.h.
>> So that it unifies the error repoting codes at single place along with 
>> dpc
> 
> s/repoting/reporting/
> s/dpc/DPC/ (in English text)
> 
>> Signed-off-by: Oza Pawandeep <poza@codeaurora.org>
>> 
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index 1efefe9..7ae9bb3 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -56,6 +56,7 @@
>>  #include <ras/ras_event.h>
>> 
>>  #include "apei-internal.h"
>> +#include "../../pci/pci.h"
> 
> You're right, it's ugly to use this sort of path to a private PCI
> header file from outside drivers/pci.
> 
> Could you just add DPC_FATAL to include/linux/aer.h?  Maybe we
> discarded that for some reason?  Having pcie-dpc.c include linux/aer.h
> seems like it would be better than having this ACPI code include
> "../../pci/pci.h"

pcie-dpc already includes #include "aer/aerdrv.h", which in turn 
includes aer.h
so aer.h is the place where DPC_FATAL should go.

> 
>>  #define GHES_PFX	"GHES: "
>> 
>> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
>> index f8575da..c8394ec 100644
>> --- a/drivers/pci/pci.h
>> +++ b/drivers/pci/pci.h
>> @@ -343,7 +343,11 @@ static inline resource_size_t 
>> pci_resource_alignment(struct pci_dev *dev,
>>  void pci_enable_acs(struct pci_dev *dev);
>> 
>>  /* PCI error reporting and recovery */
>> -#define DPC_FATAL	4
>> +#define AER_NONFATAL		0
>> +#define AER_FATAL		1
>> +#define AER_CORRECTABLE		2
>> +
>> +#define DPC_FATAL		4
>> 
>>  void pcie_do_recovery(struct pci_dev *dev, int severity);
>> 
>> diff --git a/drivers/pci/pcie/aer/aerdrv_errprint.c 
>> b/drivers/pci/pcie/aer/aerdrv_errprint.c
>> index 6a352e6..4c59f37 100644
>> --- a/drivers/pci/pcie/aer/aerdrv_errprint.c
>> +++ b/drivers/pci/pcie/aer/aerdrv_errprint.c
>> @@ -19,6 +19,7 @@
>>  #include <linux/cper.h>
>> 
>>  #include "aerdrv.h"
>> +#include "../../pci.h"
>>  #include <ras/ras_event.h>
>> 
>>  #define AER_AGENT_RECEIVER		0
>> diff --git a/include/linux/aer.h b/include/linux/aer.h
>> index 8f87bbe..3eac8ed 100644
>> --- a/include/linux/aer.h
>> +++ b/include/linux/aer.h
>> @@ -11,10 +11,6 @@
>>  #include <linux/errno.h>
>>  #include <linux/types.h>
>> 
>> -#define AER_NONFATAL			0
>> -#define AER_FATAL			1
>> -#define AER_CORRECTABLE			2
>> -
>>  struct pci_dev;
>> 
>>  struct aer_header_log_regs {
>> diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
>> index 9c68986..d75c75b 100644
>> --- a/include/ras/ras_event.h
>> +++ b/include/ras/ras_event.h
>> @@ -13,6 +13,7 @@
>>  #include <linux/aer.h>
>>  #include <linux/cper.h>
>>  #include <linux/mm.h>
>> +#include "../../../drivers/pci/pci.h"
>> 
>>  /*
>>   * MCE Extended Error Log trace event
>> --
>> Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
>> Technologies, Inc.,
>> a Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a 
>> Linux Foundation Collaborative Project.
>> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v11 6/7] PCI: Unify wait for link active into generic pci
  2018-02-24 15:41   ` Bjorn Helgaas
@ 2018-02-27  8:39     ` poza
  0 siblings, 0 replies; 22+ messages in thread
From: poza @ 2018-02-27  8:39 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi

On 2018-02-24 21:11, Bjorn Helgaas wrote:
> In subject:
> 
> s/pci/PCI/
sure.
> 
> On Fri, Feb 23, 2018 at 01:54:03PM +0530, Oza Pawandeep wrote:
>> Clients such as pciehp, dpc are using pcie_wait_link_active, which 
>> waits
>> till the link becomes active or inactive.
> 
> Use "()" after function names so we have a visual clue that they are
> functions.
> 
>> Made generic function and moved it to drivers/pci/pci.c
>> 
>> Signed-off-by: Oza Pawandeep <poza@codeaurora.org>
>> 
>> diff --git a/drivers/pci/hotplug/pciehp_hpc.c 
>> b/drivers/pci/hotplug/pciehp_hpc.c
>> index 18a42f8..de9b0ea 100644
>> --- a/drivers/pci/hotplug/pciehp_hpc.c
>> +++ b/drivers/pci/hotplug/pciehp_hpc.c
>> @@ -231,25 +231,11 @@ bool pciehp_check_link_active(struct controller 
>> *ctrl)
>>  	return ret;
>>  }
>> 
>> -static void __pcie_wait_link_active(struct controller *ctrl, bool 
>> active)
>> +static bool pcie_wait_link_active(struct controller *ctrl)
>>  {
>> -	int timeout = 1000;
>> -
>> -	if (pciehp_check_link_active(ctrl) == active)
>> -		return;
>> -	while (timeout > 0) {
>> -		msleep(10);
>> -		timeout -= 10;
>> -		if (pciehp_check_link_active(ctrl) == active)
>> -			return;
>> -	}
>> -	ctrl_dbg(ctrl, "Data Link Layer Link Active not %s in 1000 msec\n",
>> -			active ? "set" : "cleared");
>> -}
>> +	struct pci_dev *pdev = ctrl_dev(ctrl);
>> 
>> -static void pcie_wait_link_active(struct controller *ctrl)
>> -{
>> -	__pcie_wait_link_active(ctrl, true);
>> +	return pci_wait_for_link(pdev, true);
>>  }
>> 
>>  static bool pci_bus_check_dev(struct pci_bus *bus, int devfn)
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index f6a4dd1..f8d44b4 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -4176,6 +4176,37 @@ static int pci_pm_reset(struct pci_dev *dev, 
>> int probe)
>>  	return 0;
>>  }
>> 
>> +/**
>> + * pci_wait_for_link - Wait for link till its active/inactive
> 
> s/its/it's/
sure.
> 
>> + * @pdev: Bridge device
>> + * @active: waiting for active or inactive ?
>> + *
>> + * Use this to wait till link becomes active or inactive.
>> + */
>> +bool pci_wait_for_link(struct pci_dev *pdev, bool active)
> 
> I think this should be "pcie_wait_for_link()".  There's no concept of a
> link in conventional PCI.
> 
>> +{
>> +	int timeout = 1000;
>> +	bool ret;
>> +	u16 lnk_status;
>> +
>> +	for (;;) {
>> +		pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &lnk_status);
>> +		ret = !!(lnk_status & PCI_EXP_LNKSTA_DLLLA);
>> +		if (ret == active)
>> +			return true;
>> +		if (timeout <= 0)
>> +			break;
>> +		timeout -= 10;
>> +	}
>> +
>> +	dev_printk(KERN_DEBUG, &pdev->dev,
> 
> pci_info().  Distros often seem to have logging configured so
> KERN_DEBUG things aren't captured, and this definitely seems worth
> capturing.
> 
>> +		   "Data Link Layer Link Active not %s in 1000 msec\n",
>> +		   active ? "set" : "cleared");
>> +
>> +	return false;
>> +}
>> +EXPORT_SYMBOL(pci_wait_for_link);
> 
> I don't think this needs to be exported.
> 
sure.
>>  void pci_reset_secondary_bus(struct pci_dev *dev)
>>  {
>>  	u16 ctrl;
>> diff --git a/drivers/pci/pcie/pcie-dpc.c b/drivers/pci/pcie/pcie-dpc.c
>> index 5c01c63..e15bcda 100644
>> --- a/drivers/pci/pcie/pcie-dpc.c
>> +++ b/drivers/pci/pcie/pcie-dpc.c
>> @@ -120,19 +120,9 @@ static int dpc_wait_rp_inactive(struct dpc_dev 
>> *dpc)
>> 
>>  static void dpc_wait_link_inactive(struct dpc_dev *dpc)
>>  {
>> -	unsigned long timeout = jiffies + HZ;
>>  	struct pci_dev *pdev = dpc->dev->port;
>> -	struct device *dev = &dpc->dev->device;
>> -	u16 lnk_status;
>> 
>> -	pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &lnk_status);
>> -	while (lnk_status & PCI_EXP_LNKSTA_DLLLA &&
>> -					!time_after(jiffies, timeout)) {
>> -		msleep(10);
>> -		pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &lnk_status);
>> -	}
>> -	if (lnk_status & PCI_EXP_LNKSTA_DLLLA)
>> -		dev_warn(dev, "Link state not disabled for DPC event\n");
>> +	pci_wait_for_link(pdev, false);
>>  }
>> 
>>  /**
>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>> index 024a1be..cb674c3 100644
>> --- a/include/linux/pci.h
>> +++ b/include/linux/pci.h
>> @@ -1195,6 +1195,7 @@ int pci_add_ext_cap_save_buffer(struct pci_dev 
>> *dev,
>>  int pci_request_selected_regions(struct pci_dev *, int, const char 
>> *);
>>  int pci_request_selected_regions_exclusive(struct pci_dev *, int, 
>> const char *);
>>  void pci_release_selected_regions(struct pci_dev *, int);
>> +bool pci_wait_for_link(struct pci_dev *pdev, bool active);
> 
> I don't think this needs to be available outside the PCI core, so this
> could probably be declared in drivers/pci/pci.h
> 

thanks, will move it.

>>  /* drivers/pci/bus.c */
>>  struct pci_bus *pci_bus_get(struct pci_bus *bus);
>> --
>> Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
>> Technologies, Inc.,
>> a Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a 
>> Linux Foundation Collaborative Project.
>> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v11 3/7] PCI/ERR: add mutex to synchronize recovery
  2018-02-27  5:16     ` poza
@ 2018-02-27 14:41       ` Bjorn Helgaas
  0 siblings, 0 replies; 22+ messages in thread
From: Bjorn Helgaas @ 2018-02-27 14:41 UTC (permalink / raw)
  To: poza
  Cc: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi

On Tue, Feb 27, 2018 at 10:46:34AM +0530, poza@codeaurora.org wrote:
> On 2018-02-24 05:15, Bjorn Helgaas wrote:
> > On Fri, Feb 23, 2018 at 01:54:00PM +0530, Oza Pawandeep wrote:
> > > This patch protects pci_do_recovery with mutex.
> > 
> > pcie_do_recovery()
> > 
> > Please explain why the mutex is necessary.  What bad things happen
> > without the mutex?
> > 
> > You named (some) of the other things "pcie"; maybe use "pcie" in the
> > mutex name as well so they look the same.
> > 
> 
> PCIe specification: 6.2.10
> When DPC is triggered due to receipt of an uncorrectable error Message, the
> Requester ID from the Message is recorded in the DPC Error Source ID
> register and that Message is discarded and not forwarded Upstream.
> 
> So, having said that, what we think is we dont need Mutex, because in DPC
> enabled system either AER or DPC can be triggered, not both.
> so right now there is no need of guarding pcie_do_recovery() with mutex.
> 
> but I was in a thought that; since pcie_do_recovery is supposed to be used
> by error clients,
> from sw architecture point of view, adding mutex takes care of concurrency
> if it exists (in corner cases, faulty hw where both AER and DPC triggered
> etc..)
> 
> We can choose to drop this patch, since we dont require mutex.
> Bjorn, please advise.

I'm not trying to convince you that we don't need the mutex.  My
point is that if we *do* need it, the changelog needs to say *why*
(and ideally the code will either have a comment or it will be obvious
from the code why it's necessary).

If we don't have a clear indication that it's required, I guess I
would omit it.

> > > Signed-off-by: Oza Pawandeep <poza@codeaurora.org>
> > > 
> > > diff --git a/drivers/pci/pcie/pcie-err.c b/drivers/pci/pcie/pcie-err.c
> > > index fcd5add..f830975 100644
> > > --- a/drivers/pci/pcie/pcie-err.c
> > > +++ b/drivers/pci/pcie/pcie-err.c
> > > @@ -20,6 +20,8 @@
> > >  #include <linux/pcieport_if.h>
> > >  #include "portdrv.h"
> > > 
> > > +static DEFINE_MUTEX(pci_err_recovery_lock);
> > > +
> > >  struct aer_broadcast_data {
> > >  	enum pci_channel_state state;
> > >  	enum pci_ers_result result;
> > > @@ -283,6 +285,8 @@ void pcie_do_recovery(struct pci_dev *dev, int
> > > severity)
> > >  	pci_ers_result_t status, result = PCI_ERS_RESULT_RECOVERED;
> > >  	enum pci_channel_state state;
> > > 
> > > +	mutex_lock(&pci_err_recovery_lock);
> > > +
> > >  	if (severity == AER_FATAL)
> > >  		state = pci_channel_io_frozen;
> > >  	else
> > > @@ -326,9 +330,11 @@ void pcie_do_recovery(struct pci_dev *dev, int
> > > severity)
> > >  				report_resume);
> > > 
> > >  	dev_info(&dev->dev, "Device recovery successful\n");
> > > +	mutex_unlock(&pci_err_recovery_lock);
> > >  	return;
> > > 
> > >  failed:
> > >  	/* TODO: Should kernel panic here? */
> > >  	dev_info(&dev->dev, "Device recovery failed\n");
> > > +	mutex_unlock(&pci_err_recovery_lock);
> > >  }
> > > --
> > > Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
> > > Technologies, Inc.,
> > > a Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
> > > a Linux Foundation Collaborative Project.
> > > 

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2018-02-27 14:41 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-23  8:23 [PATCH v11 0/7] Address error and recovery for AER and DPC Oza Pawandeep
2018-02-23  8:23 ` [PATCH v11 1/7] PCI/AER: Rename error recovery to generic pci naming Oza Pawandeep
2018-02-23  8:23 ` [PATCH v11 2/7] PCI/AER: factor out error reporting from AER Oza Pawandeep
2018-02-23 23:42   ` Bjorn Helgaas
2018-02-26  5:32     ` poza
2018-02-26  5:39       ` poza
2018-02-26 20:23       ` Bjorn Helgaas
2018-02-23  8:24 ` [PATCH v11 3/7] PCI/ERR: add mutex to synchronize recovery Oza Pawandeep
2018-02-23 23:45   ` Bjorn Helgaas
2018-02-27  5:16     ` poza
2018-02-27 14:41       ` Bjorn Helgaas
2018-02-23  8:24 ` [PATCH v11 4/7] PCI/DPC: Unify and plumb error handling into DPC Oza Pawandeep
2018-02-24  0:07   ` Bjorn Helgaas
2018-02-27  6:06     ` poza
2018-02-23  8:24 ` [PATCH v11 5/7] PCI/AER: Unify aer error defines at single space Oza Pawandeep
2018-02-24 15:36   ` Bjorn Helgaas
2018-02-27  6:12     ` poza
2018-02-23  8:24 ` [PATCH v11 6/7] PCI: Unify wait for link active into generic pci Oza Pawandeep
2018-02-24 15:41   ` Bjorn Helgaas
2018-02-27  8:39     ` poza
2018-02-23  8:24 ` [PATCH v11 7/7] PCI/DPC: Enumerate the devices after DPC trigger event Oza Pawandeep
2018-02-23 23:12 ` [PATCH v11 0/7] Address error and recovery for AER and DPC Bjorn Helgaas

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).