All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v12 0/6] Address error and recovery for AER and DPC
@ 2018-02-28 17:04 Oza Pawandeep
  2018-02-28 17:04 ` [PATCH v12 1/6] PCI/AER: Rename error recovery to generic PCI naming Oza Pawandeep
                   ` (6 more replies)
  0 siblings, 7 replies; 25+ messages in thread
From: Oza Pawandeep @ 2018-02-28 17:04 UTC (permalink / raw)
  To: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi
  Cc: Oza Pawandeep

This patch set brings in error handling support for DPC

The current implementation of AER and error message broadcasting to the
EP driver is tightly coupled and limited to AER service driver.
It is important to factor out broadcasting and other link handling
callbacks. So that not only when AER gets triggered, but also when DPC get
triggered (for e.g. ERR_FATAL), callbacks are handled appropriately.

DPC should enumerate the devices after recovering the link, which is
achieved by implementing error_resume callback.

Changes since v11:
    Bjorn's comments addressed.
	> rename pcie-err.c to err.c
	> removed EXPORT_SYMBOL
	> made generic find_serivce function in port driver.
	> removed mutex patch as no need to have mutex in pcie_do_recovery
	> brough in DPC_FATAL in aer.h
	> so now all the error codes (AER and DPC) are unified in aer.h
Changes since v10:
    Christoph Hellwig's, David Laight's and Randy Dunlap's
    comments addressed.
        > renamed pci_do_recovery to pcie_do_recovery
        > removed inner braces in conditional statements.
        > restrctured the code in pci_wait_for_link
        > EXPORT_SYMBOL_GPL
Changes since v9:
    Sinan's comments addressed.
        > bool active = true; unnecessary variable removed.
Changes since v8:
    Fixed Kbuild errors.
Changes since v7:
    Rebased the code on pci master
        > https://kernel.googlesource.com/pub/scm/linux/kernel/git/helgaas/pci
Changes since v6:
    Sinan's and Stefan's comments implemented.
        > reordered patch 6 and 7
        > cleaned up
Changes since v5:
    Sinan's and Keith's comments incorporated.
        > made separate patch for mutex
        > unified error repotting codes into driver/pci/pci.h
        > got rid of wait link active/inactive and
          made generic function in driver/pci/pci.c
Changes since v4:
    Bjorn's comments incorporated.
        > Renamed only do_recovery.
        > moved the things more locally to drivers/pci/pci.h
Changes since v3:
    Bjorn's comments incorporated.
        > Made separate patch renaming generic pci_err.c
        > Introduce pci_err.h to contain all the error types and recovery
        > removed all the dependencies on pci.h
Changes since v2:
    Based on feedback from Keith:
    "
    When DPC is triggered due to receipt of an uncorrectable error Message,
    the Requester ID from the Message is recorded in the DPC Error
    Source ID register and that Message is discarded and not forwarded Upstream.
    "
    Removed the patch where AER checks if DPC service is active
Changes since v1:
    Kbuild errors fixed:
        > pci_find_dpc_dev made static
        > ras_event.h updated
        > pci_find_aer_service call with CONFIG check
        > pci_find_dpc_service call with CONFIG check

Oza Pawandeep (6):
  PCI/AER: Rename error recovery to generic PCI naming
  PCI/AER: Factor out error reporting from AER
  PCI/PORTDRV: Implement generic find service
  PCI/DPC: Unify and plumb error handling into DPC
  PCI: Unify wait for link active into generic PCI
  PCI/DPC: Enumerate the devices after DPC trigger event

 drivers/pci/hotplug/pciehp_hpc.c   |  20 +--
 drivers/pci/pci.c                  |  29 +++
 drivers/pci/pci.h                  |   5 +
 drivers/pci/pcie/Makefile          |   2 +-
 drivers/pci/pcie/aer/aerdrv.h      |  30 ----
 drivers/pci/pcie/aer/aerdrv_core.c | 317 +-------------------------------
 drivers/pci/pcie/err.c             | 359 +++++++++++++++++++++++++++++++++++++
 drivers/pci/pcie/pcie-dpc.c        |  90 ++++++++--
 drivers/pci/pcie/portdrv.h         |   2 +
 drivers/pci/pcie/portdrv_core.c    |  43 +++++
 include/linux/aer.h                |   2 +
 11 files changed, 521 insertions(+), 378 deletions(-)
 create mode 100644 drivers/pci/pcie/err.c

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.,
a Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v12 1/6] PCI/AER: Rename error recovery to generic PCI naming
  2018-02-28 17:04 [PATCH v12 0/6] Address error and recovery for AER and DPC Oza Pawandeep
@ 2018-02-28 17:04 ` Oza Pawandeep
  2018-02-28 17:04 ` [PATCH v12 2/6] PCI/AER: Factor out error reporting from AER Oza Pawandeep
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 25+ messages in thread
From: Oza Pawandeep @ 2018-02-28 17:04 UTC (permalink / raw)
  To: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi
  Cc: Oza Pawandeep

This patch renames error recovery to generic name with pcie prefix

Signed-off-by: Oza Pawandeep <poza@codeaurora.org>

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index fcd8191..abc514e 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -342,6 +342,9 @@ static inline resource_size_t pci_resource_alignment(struct pci_dev *dev,
 
 void pci_enable_acs(struct pci_dev *dev);
 
+/* PCI error reporting and recovery */
+void pcie_do_recovery(struct pci_dev *dev, int severity);
+
 #ifdef CONFIG_PCIEASPM
 void pcie_aspm_init_link_state(struct pci_dev *pdev);
 void pcie_aspm_exit_link_state(struct pci_dev *pdev);
diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
index a4bfea5..aeb83a0 100644
--- a/drivers/pci/pcie/aer/aerdrv_core.c
+++ b/drivers/pci/pcie/aer/aerdrv_core.c
@@ -478,7 +478,7 @@ static pci_ers_result_t reset_link(struct pci_dev *dev)
 }
 
 /**
- * do_recovery - handle nonfatal/fatal error recovery process
+ * pcie_do_recovery - handle nonfatal/fatal error recovery process
  * @dev: pointer to a pci_dev data structure of agent detecting an error
  * @severity: error severity type
  *
@@ -486,7 +486,7 @@ static pci_ers_result_t reset_link(struct pci_dev *dev)
  * error detected message to all downstream drivers within a hierarchy in
  * question and return the returned code.
  */
-static void do_recovery(struct pci_dev *dev, int severity)
+static void pcie_do_recovery(struct pci_dev *dev, int severity)
 {
 	pci_ers_result_t status, result = PCI_ERS_RESULT_RECOVERED;
 	enum pci_channel_state state;
@@ -566,7 +566,7 @@ static void handle_error_source(struct pcie_device *aerdev,
 			pci_write_config_dword(dev, pos + PCI_ERR_COR_STATUS,
 					info->status);
 	} else
-		do_recovery(dev, info->severity);
+		pcie_do_recovery(dev, info->severity);
 }
 
 #ifdef CONFIG_ACPI_APEI_PCIEAER
@@ -631,7 +631,7 @@ static void aer_recover_work_func(struct work_struct *work)
 		}
 		cper_print_aer(pdev, entry.severity, entry.regs);
 		if (entry.severity != AER_CORRECTABLE)
-			do_recovery(pdev, entry.severity);
+			pcie_do_recovery(pdev, entry.severity);
 		pci_dev_put(pdev);
 	}
 }
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.,
a Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v12 2/6] PCI/AER: Factor out error reporting from AER
  2018-02-28 17:04 [PATCH v12 0/6] Address error and recovery for AER and DPC Oza Pawandeep
  2018-02-28 17:04 ` [PATCH v12 1/6] PCI/AER: Rename error recovery to generic PCI naming Oza Pawandeep
@ 2018-02-28 17:04 ` Oza Pawandeep
  2018-02-28 17:04 ` [PATCH v12 3/6] PCI/PORTDRV: Implement generic find service Oza Pawandeep
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 25+ messages in thread
From: Oza Pawandeep @ 2018-02-28 17:04 UTC (permalink / raw)
  To: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi
  Cc: Oza Pawandeep

This patch factors out error reporting callbacks, which are currently
tightly coupled with AER.

DPC should be able to register callbacks and attempt recovery when DPC
trigger event occurs.

Signed-off-by: Oza Pawandeep <poza@codeaurora.org>

diff --git a/drivers/pci/pcie/Makefile b/drivers/pci/pcie/Makefile
index 223e4c3..f0b1a78 100644
--- a/drivers/pci/pcie/Makefile
+++ b/drivers/pci/pcie/Makefile
@@ -6,7 +6,7 @@
 # Build PCI Express ASPM if needed
 obj-$(CONFIG_PCIEASPM)		+= aspm.o
 
-pcieportdrv-y			:= portdrv_core.o portdrv_pci.o portdrv_bus.o
+pcieportdrv-y			:= portdrv_core.o portdrv_pci.o portdrv_bus.o err.o
 pcieportdrv-$(CONFIG_ACPI)	+= portdrv_acpi.o
 
 obj-$(CONFIG_PCIEPORTBUS)	+= pcieportdrv.o
diff --git a/drivers/pci/pcie/aer/aerdrv.h b/drivers/pci/pcie/aer/aerdrv.h
index 5449e5c..bc9db53 100644
--- a/drivers/pci/pcie/aer/aerdrv.h
+++ b/drivers/pci/pcie/aer/aerdrv.h
@@ -76,36 +76,6 @@ struct aer_rpc {
 					 */
 };
 
-struct aer_broadcast_data {
-	enum pci_channel_state state;
-	enum pci_ers_result result;
-};
-
-static inline pci_ers_result_t merge_result(enum pci_ers_result orig,
-		enum pci_ers_result new)
-{
-	if (new == PCI_ERS_RESULT_NO_AER_DRIVER)
-		return PCI_ERS_RESULT_NO_AER_DRIVER;
-
-	if (new == PCI_ERS_RESULT_NONE)
-		return orig;
-
-	switch (orig) {
-	case PCI_ERS_RESULT_CAN_RECOVER:
-	case PCI_ERS_RESULT_RECOVERED:
-		orig = new;
-		break;
-	case PCI_ERS_RESULT_DISCONNECT:
-		if (new == PCI_ERS_RESULT_NEED_RESET)
-			orig = PCI_ERS_RESULT_NEED_RESET;
-		break;
-	default:
-		break;
-	}
-
-	return orig;
-}
-
 extern struct bus_type pcie_port_bus_type;
 void aer_isr(struct work_struct *work);
 void aer_print_error(struct pci_dev *dev, struct aer_err_info *info);
diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
index aeb83a0..4acec3b 100644
--- a/drivers/pci/pcie/aer/aerdrv_core.c
+++ b/drivers/pci/pcie/aer/aerdrv_core.c
@@ -23,6 +23,7 @@
 #include <linux/slab.h>
 #include <linux/kfifo.h>
 #include "aerdrv.h"
+#include "../../pci.h"
 
 #define	PCI_EXP_AER_FLAGS	(PCI_EXP_DEVCTL_CERE | PCI_EXP_DEVCTL_NFERE | \
 				 PCI_EXP_DEVCTL_FERE | PCI_EXP_DEVCTL_URRE)
@@ -230,191 +231,6 @@ static bool find_source_device(struct pci_dev *parent,
 	return true;
 }
 
-static int report_error_detected(struct pci_dev *dev, void *data)
-{
-	pci_ers_result_t vote;
-	const struct pci_error_handlers *err_handler;
-	struct aer_broadcast_data *result_data;
-	result_data = (struct aer_broadcast_data *) data;
-
-	device_lock(&dev->dev);
-	dev->error_state = result_data->state;
-
-	if (!dev->driver ||
-		!dev->driver->err_handler ||
-		!dev->driver->err_handler->error_detected) {
-		if (result_data->state == pci_channel_io_frozen &&
-			dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
-			/*
-			 * In case of fatal recovery, if one of down-
-			 * stream device has no driver. We might be
-			 * unable to recover because a later insmod
-			 * of a driver for this device is unaware of
-			 * its hw state.
-			 */
-			pci_printk(KERN_DEBUG, dev, "device has %s\n",
-				   dev->driver ?
-				   "no AER-aware driver" : "no driver");
-		}
-
-		/*
-		 * If there's any device in the subtree that does not
-		 * have an error_detected callback, returning
-		 * PCI_ERS_RESULT_NO_AER_DRIVER prevents calling of
-		 * the subsequent mmio_enabled/slot_reset/resume
-		 * callbacks of "any" device in the subtree. All the
-		 * devices in the subtree are left in the error state
-		 * without recovery.
-		 */
-
-		if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE)
-			vote = PCI_ERS_RESULT_NO_AER_DRIVER;
-		else
-			vote = PCI_ERS_RESULT_NONE;
-	} else {
-		err_handler = dev->driver->err_handler;
-		vote = err_handler->error_detected(dev, result_data->state);
-		pci_uevent_ers(dev, PCI_ERS_RESULT_NONE);
-	}
-
-	result_data->result = merge_result(result_data->result, vote);
-	device_unlock(&dev->dev);
-	return 0;
-}
-
-static int report_mmio_enabled(struct pci_dev *dev, void *data)
-{
-	pci_ers_result_t vote;
-	const struct pci_error_handlers *err_handler;
-	struct aer_broadcast_data *result_data;
-	result_data = (struct aer_broadcast_data *) data;
-
-	device_lock(&dev->dev);
-	if (!dev->driver ||
-		!dev->driver->err_handler ||
-		!dev->driver->err_handler->mmio_enabled)
-		goto out;
-
-	err_handler = dev->driver->err_handler;
-	vote = err_handler->mmio_enabled(dev);
-	result_data->result = merge_result(result_data->result, vote);
-out:
-	device_unlock(&dev->dev);
-	return 0;
-}
-
-static int report_slot_reset(struct pci_dev *dev, void *data)
-{
-	pci_ers_result_t vote;
-	const struct pci_error_handlers *err_handler;
-	struct aer_broadcast_data *result_data;
-	result_data = (struct aer_broadcast_data *) data;
-
-	device_lock(&dev->dev);
-	if (!dev->driver ||
-		!dev->driver->err_handler ||
-		!dev->driver->err_handler->slot_reset)
-		goto out;
-
-	err_handler = dev->driver->err_handler;
-	vote = err_handler->slot_reset(dev);
-	result_data->result = merge_result(result_data->result, vote);
-out:
-	device_unlock(&dev->dev);
-	return 0;
-}
-
-static int report_resume(struct pci_dev *dev, void *data)
-{
-	const struct pci_error_handlers *err_handler;
-
-	device_lock(&dev->dev);
-	dev->error_state = pci_channel_io_normal;
-
-	if (!dev->driver ||
-		!dev->driver->err_handler ||
-		!dev->driver->err_handler->resume)
-		goto out;
-
-	err_handler = dev->driver->err_handler;
-	err_handler->resume(dev);
-	pci_uevent_ers(dev, PCI_ERS_RESULT_RECOVERED);
-out:
-	device_unlock(&dev->dev);
-	return 0;
-}
-
-/**
- * broadcast_error_message - handle message broadcast to downstream drivers
- * @dev: pointer to from where in a hierarchy message is broadcasted down
- * @state: error state
- * @error_mesg: message to print
- * @cb: callback to be broadcasted
- *
- * Invoked during error recovery process. Once being invoked, the content
- * of error severity will be broadcasted to all downstream drivers in a
- * hierarchy in question.
- */
-static pci_ers_result_t broadcast_error_message(struct pci_dev *dev,
-	enum pci_channel_state state,
-	char *error_mesg,
-	int (*cb)(struct pci_dev *, void *))
-{
-	struct aer_broadcast_data result_data;
-
-	pci_printk(KERN_DEBUG, dev, "broadcast %s message\n", error_mesg);
-	result_data.state = state;
-	if (cb == report_error_detected)
-		result_data.result = PCI_ERS_RESULT_CAN_RECOVER;
-	else
-		result_data.result = PCI_ERS_RESULT_RECOVERED;
-
-	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
-		/*
-		 * If the error is reported by a bridge, we think this error
-		 * is related to the downstream link of the bridge, so we
-		 * do error recovery on all subordinates of the bridge instead
-		 * of the bridge and clear the error status of the bridge.
-		 */
-		if (cb == report_error_detected)
-			dev->error_state = state;
-		pci_walk_bus(dev->subordinate, cb, &result_data);
-		if (cb == report_resume) {
-			pci_cleanup_aer_uncorrect_error_status(dev);
-			dev->error_state = pci_channel_io_normal;
-		}
-	} else {
-		/*
-		 * If the error is reported by an end point, we think this
-		 * error is related to the upstream link of the end point.
-		 */
-		if (state == pci_channel_io_normal)
-			/*
-			 * the error is non fatal so the bus is ok, just invoke
-			 * the callback for the function that logged the error.
-			 */
-			cb(dev, &result_data);
-		else
-			pci_walk_bus(dev->bus, cb, &result_data);
-	}
-
-	return result_data.result;
-}
-
-/**
- * default_reset_link - default reset function
- * @dev: pointer to pci_dev data structure
- *
- * Invoked when performing link reset on a Downstream Port or a
- * Root Port with no aer driver.
- */
-static pci_ers_result_t default_reset_link(struct pci_dev *dev)
-{
-	pci_reset_bridge_secondary_bus(dev);
-	pci_printk(KERN_DEBUG, dev, "downstream link has been reset\n");
-	return PCI_ERS_RESULT_RECOVERED;
-}
-
 static int find_aer_service_iter(struct device *device, void *data)
 {
 	struct pcie_port_service_driver *service_driver, **drv;
@@ -432,7 +248,7 @@ static int find_aer_service_iter(struct device *device, void *data)
 	return 0;
 }
 
-static struct pcie_port_service_driver *find_aer_service(struct pci_dev *dev)
+struct pcie_port_service_driver *find_aer_service(struct pci_dev *dev)
 {
 	struct pcie_port_service_driver *drv = NULL;
 
@@ -441,107 +257,6 @@ static struct pcie_port_service_driver *find_aer_service(struct pci_dev *dev)
 	return drv;
 }
 
-static pci_ers_result_t reset_link(struct pci_dev *dev)
-{
-	struct pci_dev *udev;
-	pci_ers_result_t status;
-	struct pcie_port_service_driver *driver;
-
-	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
-		/* Reset this port for all subordinates */
-		udev = dev;
-	} else {
-		/* Reset the upstream component (likely downstream port) */
-		udev = dev->bus->self;
-	}
-
-	/* Use the aer driver of the component firstly */
-	driver = find_aer_service(udev);
-
-	if (driver && driver->reset_link) {
-		status = driver->reset_link(udev);
-	} else if (udev->has_secondary_link) {
-		status = default_reset_link(udev);
-	} else {
-		pci_printk(KERN_DEBUG, dev, "no link-reset support at upstream device %s\n",
-			pci_name(udev));
-		return PCI_ERS_RESULT_DISCONNECT;
-	}
-
-	if (status != PCI_ERS_RESULT_RECOVERED) {
-		pci_printk(KERN_DEBUG, dev, "link reset at upstream device %s failed\n",
-			pci_name(udev));
-		return PCI_ERS_RESULT_DISCONNECT;
-	}
-
-	return status;
-}
-
-/**
- * pcie_do_recovery - handle nonfatal/fatal error recovery process
- * @dev: pointer to a pci_dev data structure of agent detecting an error
- * @severity: error severity type
- *
- * Invoked when an error is nonfatal/fatal. Once being invoked, broadcast
- * error detected message to all downstream drivers within a hierarchy in
- * question and return the returned code.
- */
-static void pcie_do_recovery(struct pci_dev *dev, int severity)
-{
-	pci_ers_result_t status, result = PCI_ERS_RESULT_RECOVERED;
-	enum pci_channel_state state;
-
-	if (severity == AER_FATAL)
-		state = pci_channel_io_frozen;
-	else
-		state = pci_channel_io_normal;
-
-	status = broadcast_error_message(dev,
-			state,
-			"error_detected",
-			report_error_detected);
-
-	if (severity == AER_FATAL) {
-		result = reset_link(dev);
-		if (result != PCI_ERS_RESULT_RECOVERED)
-			goto failed;
-	}
-
-	if (status == PCI_ERS_RESULT_CAN_RECOVER)
-		status = broadcast_error_message(dev,
-				state,
-				"mmio_enabled",
-				report_mmio_enabled);
-
-	if (status == PCI_ERS_RESULT_NEED_RESET) {
-		/*
-		 * TODO: Should call platform-specific
-		 * functions to reset slot before calling
-		 * drivers' slot_reset callbacks?
-		 */
-		status = broadcast_error_message(dev,
-				state,
-				"slot_reset",
-				report_slot_reset);
-	}
-
-	if (status != PCI_ERS_RESULT_RECOVERED)
-		goto failed;
-
-	broadcast_error_message(dev,
-				state,
-				"resume",
-				report_resume);
-
-	pci_info(dev, "AER: Device recovery successful\n");
-	return;
-
-failed:
-	pci_uevent_ers(dev, PCI_ERS_RESULT_DISCONNECT);
-	/* TODO: Should kernel panic here? */
-	pci_info(dev, "AER: Device recovery failed\n");
-}
-
 /**
  * handle_error_source - handle logging error into an event log
  * @aerdev: pointer to pcie_device data structure of the root port
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
new file mode 100644
index 0000000..c48eb0a
--- /dev/null
+++ b/drivers/pci/pcie/err.c
@@ -0,0 +1,332 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * This file implements the error recovery as a core part of PCIe error
+ * reporting. When a PCIe error is delivered, an error message will be
+ * collected and printed to console, then, an error recovery procedure
+ * will be executed by following the PCI error recovery rules.
+ *
+ * Copyright (C) 2006 Intel Corp.
+ *	Tom Long Nguyen (tom.l.nguyen@intel.com)
+ *	Zhang Yanmin (yanmin.zhang@intel.com)
+ *
+ */
+
+#include <linux/pci.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/aer.h>
+#include <linux/pcieport_if.h>
+#include "portdrv.h"
+
+struct aer_broadcast_data {
+	enum pci_channel_state state;
+	enum pci_ers_result result;
+};
+
+static pci_ers_result_t merge_result(enum pci_ers_result orig,
+				  enum pci_ers_result new)
+{
+	if (new == PCI_ERS_RESULT_NO_AER_DRIVER)
+		return PCI_ERS_RESULT_NO_AER_DRIVER;
+
+	if (new == PCI_ERS_RESULT_NONE)
+		return orig;
+
+	switch (orig) {
+	case PCI_ERS_RESULT_CAN_RECOVER:
+	case PCI_ERS_RESULT_RECOVERED:
+		orig = new;
+		break;
+	case PCI_ERS_RESULT_DISCONNECT:
+		if (new == PCI_ERS_RESULT_NEED_RESET)
+			orig = PCI_ERS_RESULT_NEED_RESET;
+		break;
+	default:
+		break;
+	}
+
+	return orig;
+}
+
+static int report_mmio_enabled(struct pci_dev *dev, void *data)
+{
+	pci_ers_result_t vote;
+	const struct pci_error_handlers *err_handler;
+	struct aer_broadcast_data *result_data;
+
+	result_data = (struct aer_broadcast_data *) data;
+
+	device_lock(&dev->dev);
+	if (!dev->driver ||
+		!dev->driver->err_handler ||
+		!dev->driver->err_handler->mmio_enabled)
+		goto out;
+
+	err_handler = dev->driver->err_handler;
+	vote = err_handler->mmio_enabled(dev);
+	result_data->result = merge_result(result_data->result, vote);
+out:
+	device_unlock(&dev->dev);
+	return 0;
+}
+
+static int report_slot_reset(struct pci_dev *dev, void *data)
+{
+	pci_ers_result_t vote;
+	const struct pci_error_handlers *err_handler;
+	struct aer_broadcast_data *result_data;
+
+	result_data = (struct aer_broadcast_data *) data;
+
+	device_lock(&dev->dev);
+	if (!dev->driver ||
+		!dev->driver->err_handler ||
+		!dev->driver->err_handler->slot_reset)
+		goto out;
+
+	err_handler = dev->driver->err_handler;
+	vote = err_handler->slot_reset(dev);
+	result_data->result = merge_result(result_data->result, vote);
+out:
+	device_unlock(&dev->dev);
+	return 0;
+}
+
+static int report_resume(struct pci_dev *dev, void *data)
+{
+	const struct pci_error_handlers *err_handler;
+
+	device_lock(&dev->dev);
+	dev->error_state = pci_channel_io_normal;
+
+	if (!dev->driver ||
+		!dev->driver->err_handler ||
+		!dev->driver->err_handler->resume)
+		goto out;
+
+	err_handler = dev->driver->err_handler;
+	err_handler->resume(dev);
+out:
+	device_unlock(&dev->dev);
+	return 0;
+}
+
+static int report_error_detected(struct pci_dev *dev, void *data)
+{
+	pci_ers_result_t vote;
+	const struct pci_error_handlers *err_handler;
+	struct aer_broadcast_data *result_data;
+
+	result_data = (struct aer_broadcast_data *) data;
+
+	device_lock(&dev->dev);
+	dev->error_state = result_data->state;
+
+	if (!dev->driver ||
+		!dev->driver->err_handler ||
+		!dev->driver->err_handler->error_detected) {
+		if (result_data->state == pci_channel_io_frozen &&
+			dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
+			/*
+			 * In case of fatal recovery, if one of down-
+			 * stream device has no driver. We might be
+			 * unable to recover because a later insmod
+			 * of a driver for this device is unaware of
+			 * its hw state.
+			 */
+			pci_printk(KERN_DEBUG, dev, "device has %s\n",
+				   dev->driver ?
+				   "no error-aware driver" : "no driver");
+		}
+
+		/*
+		 * If there's any device in the subtree that does not
+		 * have an error_detected callback, returning
+		 * PCI_ERS_RESULT_NO_AER_DRIVER prevents calling of
+		 * the subsequent mmio_enabled/slot_reset/resume
+		 * callbacks of "any" device in the subtree. All the
+		 * devices in the subtree are left in the error state
+		 * without recovery.
+		 */
+
+		if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE)
+			vote = PCI_ERS_RESULT_NO_AER_DRIVER;
+		else
+			vote = PCI_ERS_RESULT_NONE;
+	} else {
+		err_handler = dev->driver->err_handler;
+		vote = err_handler->error_detected(dev, result_data->state);
+	}
+
+	result_data->result = merge_result(result_data->result, vote);
+	device_unlock(&dev->dev);
+	return 0;
+}
+
+/**
+ * default_reset_link - default reset function
+ * @dev: pointer to pci_dev data structure
+ *
+ * Invoked when performing link reset on a Downstream Port or a
+ * Root Port with no aer driver.
+ */
+static pci_ers_result_t default_reset_link(struct pci_dev *dev)
+{
+	pci_reset_bridge_secondary_bus(dev);
+	pci_printk(KERN_DEBUG, dev, "downstream link has been reset\n");
+	return PCI_ERS_RESULT_RECOVERED;
+}
+
+static pci_ers_result_t reset_link(struct pci_dev *dev)
+{
+	struct pci_dev *udev;
+	pci_ers_result_t status;
+	struct pcie_port_service_driver *driver = NULL;
+
+	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
+		/* Reset this port for all subordinates */
+		udev = dev;
+	} else {
+		/* Reset the upstream component (likely downstream port) */
+		udev = dev->bus->self;
+	}
+
+	/* Use the aer driver of the component firstly */
+	driver = find_aer_service(udev);
+
+	if (driver && driver->reset_link) {
+		status = driver->reset_link(udev);
+	} else if (udev->has_secondary_link) {
+		status = default_reset_link(udev);
+	} else {
+		pci_printk(KERN_DEBUG, dev,
+			"no link-reset support at upstream device %s\n",
+			pci_name(udev));
+		return PCI_ERS_RESULT_DISCONNECT;
+	}
+
+	if (status != PCI_ERS_RESULT_RECOVERED) {
+		pci_printk(KERN_DEBUG, dev,
+			"link reset at upstream device %s failed\n",
+			pci_name(udev));
+		return PCI_ERS_RESULT_DISCONNECT;
+	}
+
+	return status;
+}
+
+/**
+ * broadcast_error_message - handle message broadcast to downstream drivers
+ * @dev: pointer to where in a hierarchy message is broadcasted down
+ * @state: error state
+ * @error_mesg: message to print
+ * @cb: callback to be broadcast
+ *
+ * Invoked during error recovery process. Once being invoked, the content
+ * of error severity will be broadcast to all downstream drivers in a
+ * hierarchy in question.
+ */
+static pci_ers_result_t broadcast_error_message(struct pci_dev *dev,
+	enum pci_channel_state state,
+	char *error_mesg,
+	int (*cb)(struct pci_dev *, void *))
+{
+	struct aer_broadcast_data result_data;
+
+	pci_printk(KERN_DEBUG, dev, "broadcast %s message\n", error_mesg);
+	result_data.state = state;
+	if (cb == report_error_detected)
+		result_data.result = PCI_ERS_RESULT_CAN_RECOVER;
+	else
+		result_data.result = PCI_ERS_RESULT_RECOVERED;
+
+	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
+		/*
+		 * If the error is reported by a bridge, we think this error
+		 * is related to the downstream link of the bridge, so we
+		 * do error recovery on all subordinates of the bridge instead
+		 * of the bridge and clear the error status of the bridge.
+		 */
+		if (cb == report_error_detected)
+			dev->error_state = state;
+		pci_walk_bus(dev->subordinate, cb, &result_data);
+		if (cb == report_resume) {
+			pci_cleanup_aer_uncorrect_error_status(dev);
+			dev->error_state = pci_channel_io_normal;
+		}
+	} else {
+		/*
+		 * If the error is reported by an end point, we think this
+		 * error is related to the upstream link of the end point.
+		 */
+		pci_walk_bus(dev->bus, cb, &result_data);
+	}
+
+	return result_data.result;
+}
+
+/**
+ * pcie_do_recovery - handle nonfatal/fatal error recovery process
+ * @dev: pointer to a pci_dev data structure of agent detecting an error
+ * @severity: error severity type
+ *
+ * Invoked when an error is nonfatal/fatal. Once being invoked, broadcast
+ * error detected message to all downstream drivers within a hierarchy in
+ * question and return the returned code.
+ */
+void pcie_do_recovery(struct pci_dev *dev, int severity)
+{
+	pci_ers_result_t status, result = PCI_ERS_RESULT_RECOVERED;
+	enum pci_channel_state state;
+
+	if (severity == AER_FATAL)
+		state = pci_channel_io_frozen;
+	else
+		state = pci_channel_io_normal;
+
+	status = broadcast_error_message(dev,
+			state,
+			"error_detected",
+			report_error_detected);
+
+	if (severity == AER_FATAL) {
+		result = reset_link(dev);
+		if (result != PCI_ERS_RESULT_RECOVERED)
+			goto failed;
+	}
+
+	if (status == PCI_ERS_RESULT_CAN_RECOVER)
+		status = broadcast_error_message(dev,
+				state,
+				"mmio_enabled",
+				report_mmio_enabled);
+
+	if (status == PCI_ERS_RESULT_NEED_RESET) {
+		/*
+		 * TODO: Should call platform-specific
+		 * functions to reset slot before calling
+		 * drivers' slot_reset callbacks?
+		 */
+		status = broadcast_error_message(dev,
+				state,
+				"slot_reset",
+				report_slot_reset);
+	}
+
+	if (status != PCI_ERS_RESULT_RECOVERED)
+		goto failed;
+
+	broadcast_error_message(dev,
+				state,
+				"resume",
+				report_resume);
+
+	dev_info(&dev->dev, "Device recovery successful\n");
+	return;
+
+failed:
+	/* TODO: Should kernel panic here? */
+	dev_info(&dev->dev, "Device recovery failed\n");
+}
diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
index a854bc5..9a8d0dd 100644
--- a/drivers/pci/pcie/portdrv.h
+++ b/drivers/pci/pcie/portdrv.h
@@ -79,4 +79,5 @@ static inline void pcie_port_platform_notify(struct pci_dev *port, int *mask)
 static inline void pcie_port_platform_notify(struct pci_dev *port, int *mask){}
 #endif /* !CONFIG_ACPI */
 
+struct pcie_port_service_driver *find_aer_service(struct pci_dev *dev);
 #endif /* _PORTDRV_H_ */
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.,
a Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v12 3/6] PCI/PORTDRV: Implement generic find service
  2018-02-28 17:04 [PATCH v12 0/6] Address error and recovery for AER and DPC Oza Pawandeep
  2018-02-28 17:04 ` [PATCH v12 1/6] PCI/AER: Rename error recovery to generic PCI naming Oza Pawandeep
  2018-02-28 17:04 ` [PATCH v12 2/6] PCI/AER: Factor out error reporting from AER Oza Pawandeep
@ 2018-02-28 17:04 ` Oza Pawandeep
  2018-03-06 14:02   ` Sinan Kaya
  2018-02-28 17:04 ` [PATCH v12 4/6] PCI/DPC: Unify and plumb error handling into DPC Oza Pawandeep
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 25+ messages in thread
From: Oza Pawandeep @ 2018-02-28 17:04 UTC (permalink / raw)
  To: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi
  Cc: Oza Pawandeep

This patch implements generic pcie_port_find_service() routine.

Signed-off-by: Oza Pawandeep <poza@codeaurora.org>

diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
index 4acec3b..aeb8236 100644
--- a/drivers/pci/pcie/aer/aerdrv_core.c
+++ b/drivers/pci/pcie/aer/aerdrv_core.c
@@ -231,32 +231,6 @@ static bool find_source_device(struct pci_dev *parent,
 	return true;
 }
 
-static int find_aer_service_iter(struct device *device, void *data)
-{
-	struct pcie_port_service_driver *service_driver, **drv;
-
-	drv = (struct pcie_port_service_driver **) data;
-
-	if (device->bus == &pcie_port_bus_type && device->driver) {
-		service_driver = to_service_driver(device->driver);
-		if (service_driver->service == PCIE_PORT_SERVICE_AER) {
-			*drv = service_driver;
-			return 1;
-		}
-	}
-
-	return 0;
-}
-
-struct pcie_port_service_driver *find_aer_service(struct pci_dev *dev)
-{
-	struct pcie_port_service_driver *drv = NULL;
-
-	device_for_each_child(&dev->dev, &drv, find_aer_service_iter);
-
-	return drv;
-}
-
 /**
  * handle_error_source - handle logging error into an event log
  * @aerdev: pointer to pcie_device data structure of the root port
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index c48eb0a..98aeec4 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -194,7 +194,7 @@ static pci_ers_result_t reset_link(struct pci_dev *dev)
 	}
 
 	/* Use the aer driver of the component firstly */
-	driver = find_aer_service(udev);
+	driver = pcie_port_find_service(udev, PCIE_PORT_SERVICE_AER);
 
 	if (driver && driver->reset_link) {
 		status = driver->reset_link(udev);
diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
index 9a8d0dd..419bdf3 100644
--- a/drivers/pci/pcie/portdrv.h
+++ b/drivers/pci/pcie/portdrv.h
@@ -79,5 +79,6 @@ static inline void pcie_port_platform_notify(struct pci_dev *port, int *mask)
 static inline void pcie_port_platform_notify(struct pci_dev *port, int *mask){}
 #endif /* !CONFIG_ACPI */
 
-struct pcie_port_service_driver *find_aer_service(struct pci_dev *dev);
+struct pcie_port_service_driver *pcie_port_find_service(struct pci_dev *dev,
+							u32 service);
 #endif /* _PORTDRV_H_ */
diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
index ef3bad4..ffc6f4f 100644
--- a/drivers/pci/pcie/portdrv_core.c
+++ b/drivers/pci/pcie/portdrv_core.c
@@ -23,6 +23,11 @@
 
 bool pciehp_msi_disabled;
 
+struct portdrv_service_data {
+	struct pcie_port_service_driver *drv;
+	u32 service;
+};
+
 static int __init pciehp_setup(char *str)
 {
 	if (!strncmp(str, "nomsi", 5))
@@ -414,6 +419,45 @@ static int remove_iter(struct device *dev, void *data)
 	return 0;
 }
 
+static int find_service_iter(struct device *device, void *data)
+{
+	struct pcie_port_service_driver *service_driver;
+	struct portdrv_service_data *pdrvs;
+	u32 service;
+
+	pdrvs = (struct portdrv_service_data *) data;
+	service = pdrvs->service;
+
+	if (device->bus == &pcie_port_bus_type && device->driver) {
+		service_driver = to_service_driver(device->driver);
+		if (service_driver->service == service) {
+			pdrvs->drv = service_driver;
+			return 1;
+		}
+	}
+
+	return 0;
+}
+/**
+ * pcie_port_find_service - find the service driver
+ * @dev: PCI Express port the service devices associated with
+ * @service: Service to find
+ *
+ * Find PCI Express port service driver associated with given service
+ */
+struct pcie_port_service_driver *pcie_port_find_service(struct pci_dev *dev,
+							u32 service)
+{
+	struct pcie_port_service_driver *drv = NULL;
+	struct portdrv_service_data pdrvs;
+
+	pdrvs.service = service;
+	device_for_each_child(&dev->dev, &pdrvs, find_service_iter);
+
+	drv = pdrvs.drv;
+	return drv;
+}
+
 /**
  * pcie_port_device_remove - unregister PCI Express port service devices
  * @dev: PCI Express port the service devices to unregister are associated with
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.,
a Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v12 4/6] PCI/DPC: Unify and plumb error handling into DPC
  2018-02-28 17:04 [PATCH v12 0/6] Address error and recovery for AER and DPC Oza Pawandeep
                   ` (2 preceding siblings ...)
  2018-02-28 17:04 ` [PATCH v12 3/6] PCI/PORTDRV: Implement generic find service Oza Pawandeep
@ 2018-02-28 17:04 ` Oza Pawandeep
  2018-02-28 17:04 ` [PATCH v12 5/6] PCI: Unify wait for link active into generic PCI Oza Pawandeep
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 25+ messages in thread
From: Oza Pawandeep @ 2018-02-28 17:04 UTC (permalink / raw)
  To: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi
  Cc: Oza Pawandeep

Current DPC driver does not do recovery, e.g. calling end-point's driver's
callbacks, which sanitize the sw.

DPC driver implements link_reset callback, and calls pci_do_recovery().

Signed-off-by: Oza Pawandeep <poza@codeaurora.org>

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 98aeec4..9f1c983 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -19,6 +19,7 @@
 #include <linux/aer.h>
 #include <linux/pcieport_if.h>
 #include "portdrv.h"
+#include "./../pci.h"
 
 struct aer_broadcast_data {
 	enum pci_channel_state state;
@@ -179,11 +180,12 @@ static pci_ers_result_t default_reset_link(struct pci_dev *dev)
 	return PCI_ERS_RESULT_RECOVERED;
 }
 
-static pci_ers_result_t reset_link(struct pci_dev *dev)
+static pci_ers_result_t reset_link(struct pci_dev *dev, int severity)
 {
 	struct pci_dev *udev;
 	pci_ers_result_t status;
 	struct pcie_port_service_driver *driver = NULL;
+	u32 service;
 
 	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
 		/* Reset this port for all subordinates */
@@ -193,8 +195,12 @@ static pci_ers_result_t reset_link(struct pci_dev *dev)
 		udev = dev->bus->self;
 	}
 
-	/* Use the aer driver of the component firstly */
-	driver = pcie_port_find_service(udev, PCIE_PORT_SERVICE_AER);
+	if (severity == DPC_FATAL)
+		service = PCIE_PORT_SERVICE_DPC;
+	else
+		service = PCIE_PORT_SERVICE_AER;
+
+	driver = pcie_port_find_service(udev, service);
 
 	if (driver && driver->reset_link) {
 		status = driver->reset_link(udev);
@@ -281,7 +287,8 @@ void pcie_do_recovery(struct pci_dev *dev, int severity)
 	pci_ers_result_t status, result = PCI_ERS_RESULT_RECOVERED;
 	enum pci_channel_state state;
 
-	if (severity == AER_FATAL)
+	if ((severity == AER_FATAL) ||
+	    (severity == DPC_FATAL))
 		state = pci_channel_io_frozen;
 	else
 		state = pci_channel_io_normal;
@@ -291,10 +298,13 @@ void pcie_do_recovery(struct pci_dev *dev, int severity)
 			"error_detected",
 			report_error_detected);
 
-	if (severity == AER_FATAL) {
-		result = reset_link(dev);
+	if ((severity == AER_FATAL) ||
+	    (severity == DPC_FATAL)) {
+		result = reset_link(dev, severity);
 		if (result != PCI_ERS_RESULT_RECOVERED)
 			goto failed;
+		else if (severity == DPC_FATAL)
+			goto resume;
 	}
 
 	if (status == PCI_ERS_RESULT_CAN_RECOVER)
@@ -318,6 +328,7 @@ void pcie_do_recovery(struct pci_dev *dev, int severity)
 	if (status != PCI_ERS_RESULT_RECOVERED)
 		goto failed;
 
+resume:
 	broadcast_error_message(dev,
 				state,
 				"resume",
diff --git a/drivers/pci/pcie/pcie-dpc.c b/drivers/pci/pcie/pcie-dpc.c
index 38e40c6..517c8b4 100644
--- a/drivers/pci/pcie/pcie-dpc.c
+++ b/drivers/pci/pcie/pcie-dpc.c
@@ -13,6 +13,7 @@
 #include <linux/pcieport_if.h>
 #include "../pci.h"
 #include "aer/aerdrv.h"
+#include "portdrv.h"
 
 struct dpc_dev {
 	struct pcie_device	*dev;
@@ -45,6 +46,33 @@ struct dpc_dev {
 	"Memory Request Completion Timeout",		 /* Bit Position 18 */
 };
 
+static int find_dpc_dev_iter(struct device *device, void *data)
+{
+	struct pcie_port_service_driver *service_driver;
+	struct device **dev;
+
+	dev = (struct device **) data;
+
+	if (device->bus == &pcie_port_bus_type && device->driver) {
+		service_driver = to_service_driver(device->driver);
+		if (service_driver->service == PCIE_PORT_SERVICE_DPC) {
+			*dev = device;
+			return 1;
+		}
+	}
+
+	return 0;
+}
+
+static struct device *pci_find_dpc_dev(struct pci_dev *pdev)
+{
+	struct device *dev = NULL;
+
+	device_for_each_child(&pdev->dev, &dev, find_dpc_dev_iter);
+
+	return dev;
+}
+
 static int dpc_wait_rp_inactive(struct dpc_dev *dpc)
 {
 	unsigned long timeout = jiffies + HZ;
@@ -82,12 +110,25 @@ static void dpc_wait_link_inactive(struct dpc_dev *dpc)
 		dev_warn(dev, "Link state not disabled for DPC event\n");
 }
 
-static void dpc_work(struct work_struct *work)
+/**
+ * dpc_reset_link - reset link DPC routine
+ * @pdev: pointer to Root Port's pci_dev data structure
+ *
+ * Invoked by Port Bus driver when performing link reset at Root Port.
+ */
+static pci_ers_result_t dpc_reset_link(struct pci_dev *pdev)
 {
-	struct dpc_dev *dpc = container_of(work, struct dpc_dev, work);
-	struct pci_dev *dev, *temp, *pdev = dpc->dev->port;
 	struct pci_bus *parent = pdev->subordinate;
-	u16 cap = dpc->cap_pos, ctl;
+	struct pci_dev *dev, *temp;
+	struct dpc_dev *dpc;
+	struct pcie_device *pciedev;
+	struct device *devdpc;
+	u16 cap, ctl;
+
+	devdpc = pci_find_dpc_dev(pdev);
+	pciedev = to_pcie_device(devdpc);
+	dpc = get_service_data(pciedev);
+	cap = dpc->cap_pos;
 
 	pci_lock_rescan_remove();
 	list_for_each_entry_safe_reverse(dev, temp, &parent->devices,
@@ -104,21 +145,31 @@ static void dpc_work(struct work_struct *work)
 
 	dpc_wait_link_inactive(dpc);
 	if (dpc->rp_extensions && dpc_wait_rp_inactive(dpc))
-		return;
+		return PCI_ERS_RESULT_DISCONNECT;
 	if (dpc->rp_extensions && dpc->rp_pio_status) {
 		pci_write_config_dword(pdev, cap + PCI_EXP_DPC_RP_PIO_STATUS,
 				       dpc->rp_pio_status);
 		dpc->rp_pio_status = 0;
 	}
 
-	pci_write_config_word(pdev, cap + PCI_EXP_DPC_STATUS,
+	pci_write_config_word(pdev, dpc->cap_pos + PCI_EXP_DPC_STATUS,
 		PCI_EXP_DPC_STATUS_TRIGGER | PCI_EXP_DPC_STATUS_INTERRUPT);
 
 	pci_read_config_word(pdev, cap + PCI_EXP_DPC_CTL, &ctl);
 	pci_write_config_word(pdev, cap + PCI_EXP_DPC_CTL,
 			      ctl | PCI_EXP_DPC_CTL_INT_EN);
+
+	return PCI_ERS_RESULT_RECOVERED;
 }
 
+static void dpc_work(struct work_struct *work)
+{
+	struct dpc_dev *dpc = container_of(work, struct dpc_dev, work);
+	struct pci_dev *pdev = dpc->dev->port;
+
+	/* From DPC point of view error is always FATAL. */
+	pcie_do_recovery(pdev, DPC_FATAL);
+}
 static void dpc_process_rp_pio_error(struct dpc_dev *dpc)
 {
 	struct device *dev = &dpc->dev->device;
@@ -297,6 +348,7 @@ static void dpc_remove(struct pcie_device *dev)
 	.service	= PCIE_PORT_SERVICE_DPC,
 	.probe		= dpc_probe,
 	.remove		= dpc_remove,
+	.reset_link     = dpc_reset_link,
 };
 
 static int __init dpc_service_init(void)
diff --git a/include/linux/aer.h b/include/linux/aer.h
index 8f87bbe..9cfd0b8 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -15,6 +15,8 @@
 #define AER_FATAL			1
 #define AER_CORRECTABLE			2
 
+#define DPC_FATAL			4
+
 struct pci_dev;
 
 struct aer_header_log_regs {
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.,
a Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v12 5/6] PCI: Unify wait for link active into generic PCI
  2018-02-28 17:04 [PATCH v12 0/6] Address error and recovery for AER and DPC Oza Pawandeep
                   ` (3 preceding siblings ...)
  2018-02-28 17:04 ` [PATCH v12 4/6] PCI/DPC: Unify and plumb error handling into DPC Oza Pawandeep
@ 2018-02-28 17:04 ` Oza Pawandeep
  2018-02-28 17:04 ` [PATCH v12 6/6] PCI/DPC: Enumerate the devices after DPC trigger event Oza Pawandeep
  2018-03-11 22:03 ` [PATCH v12 0/6] Address error and recovery for AER and DPC Bjorn Helgaas
  6 siblings, 0 replies; 25+ messages in thread
From: Oza Pawandeep @ 2018-02-28 17:04 UTC (permalink / raw)
  To: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi
  Cc: Oza Pawandeep

Clients such as HP, DPC are using pcie_wait_link_active(), which waits
till the link becomes active or inactive.

Made generic function and moved it to drivers/pci/pci.c

Signed-off-by: Oza Pawandeep <poza@codeaurora.org>

diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 18a42f8..e0c2b8e 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -231,25 +231,11 @@ bool pciehp_check_link_active(struct controller *ctrl)
 	return ret;
 }
 
-static void __pcie_wait_link_active(struct controller *ctrl, bool active)
-{
-	int timeout = 1000;
-
-	if (pciehp_check_link_active(ctrl) == active)
-		return;
-	while (timeout > 0) {
-		msleep(10);
-		timeout -= 10;
-		if (pciehp_check_link_active(ctrl) == active)
-			return;
-	}
-	ctrl_dbg(ctrl, "Data Link Layer Link Active not %s in 1000 msec\n",
-			active ? "set" : "cleared");
-}
-
 static void pcie_wait_link_active(struct controller *ctrl)
 {
-	__pcie_wait_link_active(ctrl, true);
+	struct pci_dev *pdev = ctrl_dev(ctrl);
+
+	pcie_wait_for_link(pdev, true);
 }
 
 static bool pci_bus_check_dev(struct pci_bus *bus, int devfn)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index f6a4dd1..bbea997 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4176,6 +4176,35 @@ static int pci_pm_reset(struct pci_dev *dev, int probe)
 	return 0;
 }
 
+/**
+ * pcie_wait_for_link - Wait for link till it's active/inactive
+ * @pdev: Bridge device
+ * @active: waiting for active or inactive ?
+ *
+ * Use this to wait till link becomes active or inactive.
+ */
+bool pcie_wait_for_link(struct pci_dev *pdev, bool active)
+{
+	int timeout = 1000;
+	bool ret;
+	u16 lnk_status;
+
+	for (;;) {
+		pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &lnk_status);
+		ret = !!(lnk_status & PCI_EXP_LNKSTA_DLLLA);
+		if (ret == active)
+			return true;
+		if (timeout <= 0)
+			break;
+		timeout -= 10;
+	}
+
+	pci_info(pdev, "Data Link Layer Link Active not %s in 1000 msec\n",
+		 active ? "set" : "cleared");
+
+	return false;
+}
+
 void pci_reset_secondary_bus(struct pci_dev *dev)
 {
 	u16 ctrl;
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index abc514e..5c44fbc 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -345,6 +345,8 @@ static inline resource_size_t pci_resource_alignment(struct pci_dev *dev,
 /* PCI error reporting and recovery */
 void pcie_do_recovery(struct pci_dev *dev, int severity);
 
+bool pcie_wait_for_link(struct pci_dev *pdev, bool active);
+
 #ifdef CONFIG_PCIEASPM
 void pcie_aspm_init_link_state(struct pci_dev *pdev);
 void pcie_aspm_exit_link_state(struct pci_dev *pdev);
diff --git a/drivers/pci/pcie/pcie-dpc.c b/drivers/pci/pcie/pcie-dpc.c
index 517c8b4..8e1553b 100644
--- a/drivers/pci/pcie/pcie-dpc.c
+++ b/drivers/pci/pcie/pcie-dpc.c
@@ -95,19 +95,9 @@ static int dpc_wait_rp_inactive(struct dpc_dev *dpc)
 
 static void dpc_wait_link_inactive(struct dpc_dev *dpc)
 {
-	unsigned long timeout = jiffies + HZ;
 	struct pci_dev *pdev = dpc->dev->port;
-	struct device *dev = &dpc->dev->device;
-	u16 lnk_status;
 
-	pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &lnk_status);
-	while (lnk_status & PCI_EXP_LNKSTA_DLLLA &&
-					!time_after(jiffies, timeout)) {
-		msleep(10);
-		pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, &lnk_status);
-	}
-	if (lnk_status & PCI_EXP_LNKSTA_DLLLA)
-		dev_warn(dev, "Link state not disabled for DPC event\n");
+	pcie_wait_for_link(pdev, false);
 }
 
 /**
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.,
a Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v12 6/6] PCI/DPC: Enumerate the devices after DPC trigger event
  2018-02-28 17:04 [PATCH v12 0/6] Address error and recovery for AER and DPC Oza Pawandeep
                   ` (4 preceding siblings ...)
  2018-02-28 17:04 ` [PATCH v12 5/6] PCI: Unify wait for link active into generic PCI Oza Pawandeep
@ 2018-02-28 17:04 ` Oza Pawandeep
  2018-03-11 22:03 ` [PATCH v12 0/6] Address error and recovery for AER and DPC Bjorn Helgaas
  6 siblings, 0 replies; 25+ messages in thread
From: Oza Pawandeep @ 2018-02-28 17:04 UTC (permalink / raw)
  To: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi
  Cc: Oza Pawandeep

Implement error_resume callback in DPC so, after DPC trigger event
enumerates the devices beneath.

Signed-off-by: Oza Pawandeep <poza@codeaurora.org>

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 9f1c983..d439bfd 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -237,7 +237,8 @@ static pci_ers_result_t reset_link(struct pci_dev *dev, int severity)
 static pci_ers_result_t broadcast_error_message(struct pci_dev *dev,
 	enum pci_channel_state state,
 	char *error_mesg,
-	int (*cb)(struct pci_dev *, void *))
+	int (*cb)(struct pci_dev *, void *),
+	int severity)
 {
 	struct aer_broadcast_data result_data;
 
@@ -249,6 +250,17 @@ static pci_ers_result_t broadcast_error_message(struct pci_dev *dev,
 		result_data.result = PCI_ERS_RESULT_RECOVERED;
 
 	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE) {
+		/* If DPC is triggered, call resume error handler
+		 * because, at this point we can safely assume that
+		 * link recovery has happened, this is only handled if
+		 * callback is resume, as this function can be called
+		 * with multiple callbacks.
+		 */
+		if ((severity == DPC_FATAL) &&
+			(cb == report_resume)) {
+			cb(dev, NULL);
+			return PCI_ERS_RESULT_RECOVERED;
+		}
 		/*
 		 * If the error is reported by a bridge, we think this error
 		 * is related to the downstream link of the bridge, so we
@@ -296,7 +308,8 @@ void pcie_do_recovery(struct pci_dev *dev, int severity)
 	status = broadcast_error_message(dev,
 			state,
 			"error_detected",
-			report_error_detected);
+			report_error_detected,
+			severity);
 
 	if ((severity == AER_FATAL) ||
 	    (severity == DPC_FATAL)) {
@@ -311,7 +324,8 @@ void pcie_do_recovery(struct pci_dev *dev, int severity)
 		status = broadcast_error_message(dev,
 				state,
 				"mmio_enabled",
-				report_mmio_enabled);
+				report_mmio_enabled,
+				severity);
 
 	if (status == PCI_ERS_RESULT_NEED_RESET) {
 		/*
@@ -322,7 +336,8 @@ void pcie_do_recovery(struct pci_dev *dev, int severity)
 		status = broadcast_error_message(dev,
 				state,
 				"slot_reset",
-				report_slot_reset);
+				report_slot_reset,
+				severity);
 	}
 
 	if (status != PCI_ERS_RESULT_RECOVERED)
@@ -332,7 +347,8 @@ void pcie_do_recovery(struct pci_dev *dev, int severity)
 	broadcast_error_message(dev,
 				state,
 				"resume",
-				report_resume);
+				report_resume,
+				severity);
 
 	dev_info(&dev->dev, "Device recovery successful\n");
 	return;
diff --git a/drivers/pci/pcie/pcie-dpc.c b/drivers/pci/pcie/pcie-dpc.c
index 8e1553b..fb6bbd3 100644
--- a/drivers/pci/pcie/pcie-dpc.c
+++ b/drivers/pci/pcie/pcie-dpc.c
@@ -101,9 +101,24 @@ static void dpc_wait_link_inactive(struct dpc_dev *dpc)
 }
 
 /**
- * dpc_reset_link - reset link DPC routine
+ * dpc_error_resume - enumerate the devices beneath
  * @pdev: pointer to Root Port's pci_dev data structure
  *
+ * Invoked by Port Bus driver during fatal recovery.
+ */
+static void dpc_error_resume(struct pci_dev *pdev)
+{
+	if (pcie_wait_for_link(pdev, true)) {
+		pci_lock_rescan_remove();
+		pci_rescan_bus(pdev->bus);
+		pci_unlock_rescan_remove();
+	}
+}
+
+/**
+ * dpc_reset_link - reset link DPC routine
+ * @dev: pointer to Root Port's pci_dev data structure
+ *
  * Invoked by Port Bus driver when performing link reset at Root Port.
  */
 static pci_ers_result_t dpc_reset_link(struct pci_dev *pdev)
@@ -338,6 +353,7 @@ static void dpc_remove(struct pcie_device *dev)
 	.service	= PCIE_PORT_SERVICE_DPC,
 	.probe		= dpc_probe,
 	.remove		= dpc_remove,
+	.error_resume	= dpc_error_resume,
 	.reset_link     = dpc_reset_link,
 };
 
-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.,
a Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v12 3/6] PCI/PORTDRV: Implement generic find service
  2018-02-28 17:04 ` [PATCH v12 3/6] PCI/PORTDRV: Implement generic find service Oza Pawandeep
@ 2018-03-06 14:02   ` Sinan Kaya
  2018-03-08  7:56     ` poza
  0 siblings, 1 reply; 25+ messages in thread
From: Sinan Kaya @ 2018-03-06 14:02 UTC (permalink / raw)
  To: Oza Pawandeep, Bjorn Helgaas, Philippe Ombredanne,
	Thomas Gleixner, Greg Kroah-Hartman, Kate Stewart, linux-pci,
	linux-kernel, Dongdong Liu, Keith Busch, Wei Zhang, Timur Tabi

On 2/28/2018 12:04 PM, Oza Pawandeep wrote:
> +struct pcie_port_service_driver *pcie_port_find_service(struct pci_dev *dev,
> +							u32 service)
> +{
> +	struct pcie_port_service_driver *drv = NULL;

Remove initialization

> +	struct portdrv_service_data pdrvs;
> +
> +	pdrvs.service = service;

initialize pdrvs.drv = NULL here.

> +	device_for_each_child(&dev->dev, &pdrvs, find_service_iter);
> +
> +	drv = pdrvs.drv;
> +	return drv;
> +}


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v12 3/6] PCI/PORTDRV: Implement generic find service
  2018-03-06 14:02   ` Sinan Kaya
@ 2018-03-08  7:56     ` poza
  0 siblings, 0 replies; 25+ messages in thread
From: poza @ 2018-03-08  7:56 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Timur Tabi

On 2018-03-06 19:32, Sinan Kaya wrote:
> On 2/28/2018 12:04 PM, Oza Pawandeep wrote:
>> +struct pcie_port_service_driver *pcie_port_find_service(struct 
>> pci_dev *dev,
>> +							u32 service)
>> +{
>> +	struct pcie_port_service_driver *drv = NULL;
> 
> Remove initialization
> 
>> +	struct portdrv_service_data pdrvs;
>> +
>> +	pdrvs.service = service;
> 
> initialize pdrvs.drv = NULL here.
> 
good point, will take care of this. [along with other comments if any]

>> +	device_for_each_child(&dev->dev, &pdrvs, find_service_iter);
>> +
>> +	drv = pdrvs.drv;
>> +	return drv;
>> +}

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v12 0/6] Address error and recovery for AER and DPC
  2018-02-28 17:04 [PATCH v12 0/6] Address error and recovery for AER and DPC Oza Pawandeep
                   ` (5 preceding siblings ...)
  2018-02-28 17:04 ` [PATCH v12 6/6] PCI/DPC: Enumerate the devices after DPC trigger event Oza Pawandeep
@ 2018-03-11 22:03 ` Bjorn Helgaas
  2018-03-12  3:03   ` Sinan Kaya
  2018-03-12 14:01   ` poza
  6 siblings, 2 replies; 25+ messages in thread
From: Bjorn Helgaas @ 2018-03-11 22:03 UTC (permalink / raw)
  To: Oza Pawandeep
  Cc: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi

On Wed, Feb 28, 2018 at 10:34:11PM +0530, Oza Pawandeep wrote:
> This patch set brings in error handling support for DPC
> 
> The current implementation of AER and error message broadcasting to the
> EP driver is tightly coupled and limited to AER service driver.
> It is important to factor out broadcasting and other link handling
> callbacks. So that not only when AER gets triggered, but also when DPC get
> triggered (for e.g. ERR_FATAL), callbacks are handled appropriately.
> 
> DPC should enumerate the devices after recovering the link, which is
> achieved by implementing error_resume callback.

The main point of this series is to call the driver error handling
callbacks (error_detected(), mmio_enabled(), slot_reset(), etc.) when
DPC events occur.  We've always called them for AER events, but prior
to this series, we didn't call them for DPC events.

That's a good thing -- we should treat DPC events as much like AER
events as possible.

This series does make it more obvious that there's still a big
difference between AER and DPC handling -- for DPC, we remove and
re-enumerate all the devices, but we don't for AER.  

That difference has been there since the beginning of DPC, so it has
nothing to do with *this* series EXCEPT for the fact that it really
complicates the logic you're adding to reset_link() and
broadcast_error_message().

We ought to be able to simplify that somehow because the only real
difference between AER and DPC should be that DPC automatically
disables the link and AER does it in software.

Bjorn

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v12 0/6] Address error and recovery for AER and DPC
  2018-03-11 22:03 ` [PATCH v12 0/6] Address error and recovery for AER and DPC Bjorn Helgaas
@ 2018-03-12  3:03   ` Sinan Kaya
  2018-03-12 14:25     ` Keith Busch
  2018-03-12 14:01   ` poza
  1 sibling, 1 reply; 25+ messages in thread
From: Sinan Kaya @ 2018-03-12  3:03 UTC (permalink / raw)
  To: Bjorn Helgaas, Oza Pawandeep
  Cc: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Timur Tabi

On 3/11/2018 6:03 PM, Bjorn Helgaas wrote:
> On Wed, Feb 28, 2018 at 10:34:11PM +0530, Oza Pawandeep wrote:

> That difference has been there since the beginning of DPC, so it has
> nothing to do with *this* series EXCEPT for the fact that it really
> complicates the logic you're adding to reset_link() and
> broadcast_error_message().
> 
> We ought to be able to simplify that somehow because the only real
> difference between AER and DPC should be that DPC automatically
> disables the link and AER does it in software.

I agree this should be possible. Code execution path should be almost
identical to fatal error case.

Is there any reason why you went to stop driver path, Keith?

Was it the absence of error callback? or are we missing something in our
assumptions?

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v12 0/6] Address error and recovery for AER and DPC
  2018-03-11 22:03 ` [PATCH v12 0/6] Address error and recovery for AER and DPC Bjorn Helgaas
  2018-03-12  3:03   ` Sinan Kaya
@ 2018-03-12 14:01   ` poza
  1 sibling, 0 replies; 25+ messages in thread
From: poza @ 2018-03-12 14:01 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi

On 2018-03-12 03:33, Bjorn Helgaas wrote:
> On Wed, Feb 28, 2018 at 10:34:11PM +0530, Oza Pawandeep wrote:
>> This patch set brings in error handling support for DPC
>> 
>> The current implementation of AER and error message broadcasting to 
>> the
>> EP driver is tightly coupled and limited to AER service driver.
>> It is important to factor out broadcasting and other link handling
>> callbacks. So that not only when AER gets triggered, but also when DPC 
>> get
>> triggered (for e.g. ERR_FATAL), callbacks are handled appropriately.
>> 
>> DPC should enumerate the devices after recovering the link, which is
>> achieved by implementing error_resume callback.
> 
> The main point of this series is to call the driver error handling
> callbacks (error_detected(), mmio_enabled(), slot_reset(), etc.) when
> DPC events occur.  We've always called them for AER events, but prior
> to this series, we didn't call them for DPC events.
> 
> That's a good thing -- we should treat DPC events as much like AER
> events as possible.
> 
> This series does make it more obvious that there's still a big
> difference between AER and DPC handling -- for DPC, we remove and
> re-enumerate all the devices, but we don't for AER.
> 
> That difference has been there since the beginning of DPC, so it has
> nothing to do with *this* series EXCEPT for the fact that it really
> complicates the logic you're adding to reset_link() and
> broadcast_error_message().
> 
> We ought to be able to simplify that somehow because the only real
> difference between AER and DPC should be that DPC automatically
> disables the link and AER does it in software.
> 
> Bjorn

Ok, so here is slight modification which I will do and will post final 
looking patches.

1) do not enumerate the devices after link is up (since AER does not do 
it)
2) remove the call to pci_stop_and_remove_bus_device(dev); because now 
DPC also makes use of driver's callback. (like AER does)

above 2 changes should make behavior of DPC identical to AER.

Regards,
Oza.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v12 0/6] Address error and recovery for AER and DPC
  2018-03-12  3:03   ` Sinan Kaya
@ 2018-03-12 14:25     ` Keith Busch
  2018-03-12 14:46       ` poza
  2018-03-12 19:47       ` Bjorn Helgaas
  0 siblings, 2 replies; 25+ messages in thread
From: Keith Busch @ 2018-03-12 14:25 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: Bjorn Helgaas, Oza Pawandeep, Bjorn Helgaas, Philippe Ombredanne,
	Thomas Gleixner, Greg Kroah-Hartman, Kate Stewart, linux-pci,
	linux-kernel, Dongdong Liu, Wei Zhang, Timur Tabi

On Sun, Mar 11, 2018 at 11:03:58PM -0400, Sinan Kaya wrote:
> On 3/11/2018 6:03 PM, Bjorn Helgaas wrote:
> > On Wed, Feb 28, 2018 at 10:34:11PM +0530, Oza Pawandeep wrote:
> 
> > That difference has been there since the beginning of DPC, so it has
> > nothing to do with *this* series EXCEPT for the fact that it really
> > complicates the logic you're adding to reset_link() and
> > broadcast_error_message().
> > 
> > We ought to be able to simplify that somehow because the only real
> > difference between AER and DPC should be that DPC automatically
> > disables the link and AER does it in software.
> 
> I agree this should be possible. Code execution path should be almost
> identical to fatal error case.
> 
> Is there any reason why you went to stop driver path, Keith?

The fact is the link is truly down during a DPC event. When the link
is enabled again, you don't know at that point if the device(s) on the
other side have changed. Calling a driver's error handler for the wrong
device in an unknown state may have undefined results. Enumerating the
slot from scratch should be safe, and will assign resources, tune bus
settings, and bind to the matching driver.

Per spec, DPC is the recommended way for handling surprise removal
events and even recommends DPC capable slots *not* set 'Surprise'
in Slot Capabilities so that removals are always handled by DPC. This
service driver was developed with that use in mind.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v12 0/6] Address error and recovery for AER and DPC
  2018-03-12 14:25     ` Keith Busch
@ 2018-03-12 14:46       ` poza
  2018-03-12 14:58         ` Keith Busch
  2018-03-12 19:47       ` Bjorn Helgaas
  1 sibling, 1 reply; 25+ messages in thread
From: poza @ 2018-03-12 14:46 UTC (permalink / raw)
  To: Keith Busch
  Cc: Sinan Kaya, Bjorn Helgaas, Bjorn Helgaas, Philippe Ombredanne,
	Thomas Gleixner, Greg Kroah-Hartman, Kate Stewart, linux-pci,
	linux-kernel, Dongdong Liu, Wei Zhang, Timur Tabi,
	linux-pci-owner

On 2018-03-12 19:55, Keith Busch wrote:
> On Sun, Mar 11, 2018 at 11:03:58PM -0400, Sinan Kaya wrote:
>> On 3/11/2018 6:03 PM, Bjorn Helgaas wrote:
>> > On Wed, Feb 28, 2018 at 10:34:11PM +0530, Oza Pawandeep wrote:
>> 
>> > That difference has been there since the beginning of DPC, so it has
>> > nothing to do with *this* series EXCEPT for the fact that it really
>> > complicates the logic you're adding to reset_link() and
>> > broadcast_error_message().
>> >
>> > We ought to be able to simplify that somehow because the only real
>> > difference between AER and DPC should be that DPC automatically
>> > disables the link and AER does it in software.
>> 
>> I agree this should be possible. Code execution path should be almost
>> identical to fatal error case.
>> 
>> Is there any reason why you went to stop driver path, Keith?
> 
> The fact is the link is truly down during a DPC event. When the link
> is enabled again, you don't know at that point if the device(s) on the
> other side have changed. Calling a driver's error handler for the wrong
> device in an unknown state may have undefined results. Enumerating the
> slot from scratch should be safe, and will assign resources, tune bus
> settings, and bind to the matching driver.
> 
> Per spec, DPC is the recommended way for handling surprise removal
> events and even recommends DPC capable slots *not* set 'Surprise'
> in Slot Capabilities so that removals are always handled by DPC. This
> service driver was developed with that use in mind.

Now it begs the question, that

after DPC trigger

should we enumerate the devices, ?
or
error handling callbacks, followed by stop devices followed by 
enumeration ?
or
error handling callbacks, followed by enumeration ? (no stop devices)

Regards,
Oza.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v12 0/6] Address error and recovery for AER and DPC
  2018-03-12 14:46       ` poza
@ 2018-03-12 14:58         ` Keith Busch
  2018-03-12 15:34           ` poza
  0 siblings, 1 reply; 25+ messages in thread
From: Keith Busch @ 2018-03-12 14:58 UTC (permalink / raw)
  To: poza
  Cc: Sinan Kaya, Bjorn Helgaas, Bjorn Helgaas, Philippe Ombredanne,
	Thomas Gleixner, Greg Kroah-Hartman, Kate Stewart, linux-pci,
	linux-kernel, Dongdong Liu, Wei Zhang, Timur Tabi,
	linux-pci-owner

On Mon, Mar 12, 2018 at 08:16:38PM +0530, poza@codeaurora.org wrote:
> On 2018-03-12 19:55, Keith Busch wrote:
> > On Sun, Mar 11, 2018 at 11:03:58PM -0400, Sinan Kaya wrote:
> > > On 3/11/2018 6:03 PM, Bjorn Helgaas wrote:
> > > > On Wed, Feb 28, 2018 at 10:34:11PM +0530, Oza Pawandeep wrote:
> > > 
> > > > That difference has been there since the beginning of DPC, so it has
> > > > nothing to do with *this* series EXCEPT for the fact that it really
> > > > complicates the logic you're adding to reset_link() and
> > > > broadcast_error_message().
> > > >
> > > > We ought to be able to simplify that somehow because the only real
> > > > difference between AER and DPC should be that DPC automatically
> > > > disables the link and AER does it in software.
> > > 
> > > I agree this should be possible. Code execution path should be almost
> > > identical to fatal error case.
> > > 
> > > Is there any reason why you went to stop driver path, Keith?
> > 
> > The fact is the link is truly down during a DPC event. When the link
> > is enabled again, you don't know at that point if the device(s) on the
> > other side have changed. Calling a driver's error handler for the wrong
> > device in an unknown state may have undefined results. Enumerating the
> > slot from scratch should be safe, and will assign resources, tune bus
> > settings, and bind to the matching driver.
> > 
> > Per spec, DPC is the recommended way for handling surprise removal
> > events and even recommends DPC capable slots *not* set 'Surprise'
> > in Slot Capabilities so that removals are always handled by DPC. This
> > service driver was developed with that use in mind.
> 
> Now it begs the question, that
> 
> after DPC trigger
> 
> should we enumerate the devices, ?
> or
> error handling callbacks, followed by stop devices followed by enumeration ?
> or
> error handling callbacks, followed by enumeration ? (no stop devices)

I'm not sure I understand. The link is disabled while DPC is triggered,
so if anything, you'd want to un-enumerate everything below the contained
port (that's what it does today).

After releasing a slot from DPC, the link is allowed to retrain. If there
is a working device on the other side, a link up event occurs. That
event is handled by the pciehp driver, and that schedules enumeration
no matter what you do to the DPC driver.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v12 0/6] Address error and recovery for AER and DPC
  2018-03-12 14:58         ` Keith Busch
@ 2018-03-12 15:34           ` poza
  2018-03-12 17:33             ` Keith Busch
  0 siblings, 1 reply; 25+ messages in thread
From: poza @ 2018-03-12 15:34 UTC (permalink / raw)
  To: Keith Busch
  Cc: Sinan Kaya, Bjorn Helgaas, Bjorn Helgaas, Philippe Ombredanne,
	Thomas Gleixner, Greg Kroah-Hartman, Kate Stewart, linux-pci,
	linux-kernel, Dongdong Liu, Wei Zhang, Timur Tabi,
	linux-pci-owner

On 2018-03-12 20:28, Keith Busch wrote:
> On Mon, Mar 12, 2018 at 08:16:38PM +0530, poza@codeaurora.org wrote:
>> On 2018-03-12 19:55, Keith Busch wrote:
>> > On Sun, Mar 11, 2018 at 11:03:58PM -0400, Sinan Kaya wrote:
>> > > On 3/11/2018 6:03 PM, Bjorn Helgaas wrote:
>> > > > On Wed, Feb 28, 2018 at 10:34:11PM +0530, Oza Pawandeep wrote:
>> > >
>> > > > That difference has been there since the beginning of DPC, so it has
>> > > > nothing to do with *this* series EXCEPT for the fact that it really
>> > > > complicates the logic you're adding to reset_link() and
>> > > > broadcast_error_message().
>> > > >
>> > > > We ought to be able to simplify that somehow because the only real
>> > > > difference between AER and DPC should be that DPC automatically
>> > > > disables the link and AER does it in software.
>> > >
>> > > I agree this should be possible. Code execution path should be almost
>> > > identical to fatal error case.
>> > >
>> > > Is there any reason why you went to stop driver path, Keith?
>> >
>> > The fact is the link is truly down during a DPC event. When the link
>> > is enabled again, you don't know at that point if the device(s) on the
>> > other side have changed. Calling a driver's error handler for the wrong
>> > device in an unknown state may have undefined results. Enumerating the
>> > slot from scratch should be safe, and will assign resources, tune bus
>> > settings, and bind to the matching driver.
>> >
>> > Per spec, DPC is the recommended way for handling surprise removal
>> > events and even recommends DPC capable slots *not* set 'Surprise'
>> > in Slot Capabilities so that removals are always handled by DPC. This
>> > service driver was developed with that use in mind.
>> 
>> Now it begs the question, that
>> 
>> after DPC trigger
>> 
>> should we enumerate the devices, ?
>> or
>> error handling callbacks, followed by stop devices followed by 
>> enumeration ?
>> or
>> error handling callbacks, followed by enumeration ? (no stop devices)
> 
> I'm not sure I understand. The link is disabled while DPC is triggered,
> so if anything, you'd want to un-enumerate everything below the 
> contained
> port (that's what it does today).
> 
> After releasing a slot from DPC, the link is allowed to retrain. If 
> there
> is a working device on the other side, a link up event occurs. That
> event is handled by the pciehp driver, and that schedules enumeration
> no matter what you do to the DPC driver.

yes, that is what i current, but this patch-set makes DPC aware of error 
handling driver callbacks.

besides, in absence of pciehp there is nobody to do enumeration.

And, I was talking about pci_stop_and_remove_bus_device() in dpc.
if DPC calls driver's error callbacks, is it required to stop the 
devices  ?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v12 0/6] Address error and recovery for AER and DPC
  2018-03-12 15:34           ` poza
@ 2018-03-12 17:33             ` Keith Busch
  2018-03-12 17:41               ` Sinan Kaya
  0 siblings, 1 reply; 25+ messages in thread
From: Keith Busch @ 2018-03-12 17:33 UTC (permalink / raw)
  To: poza
  Cc: Sinan Kaya, Bjorn Helgaas, Bjorn Helgaas, Philippe Ombredanne,
	Thomas Gleixner, Greg Kroah-Hartman, Kate Stewart, linux-pci,
	linux-kernel, Dongdong Liu, Wei Zhang, Timur Tabi,
	linux-pci-owner

On Mon, Mar 12, 2018 at 09:04:47PM +0530, poza@codeaurora.org wrote:
> On 2018-03-12 20:28, Keith Busch wrote:
> > I'm not sure I understand. The link is disabled while DPC is triggered,
> > so if anything, you'd want to un-enumerate everything below the
> > contained
> > port (that's what it does today).
> > 
> > After releasing a slot from DPC, the link is allowed to retrain. If
> > there
> > is a working device on the other side, a link up event occurs. That
> > event is handled by the pciehp driver, and that schedules enumeration
> > no matter what you do to the DPC driver.
> 
> yes, that is what i current, but this patch-set makes DPC aware of error
> handling driver callbacks.

I've been questioning the utility of doing that since the very first
version of this patch set.

> besides, in absence of pciehp there is nobody to do enumeration.

If you configure your kernel to not have a feature, you don't get to
expect the feature works.

You can still manually rescan through sysfs, /sys/bus/pci/rescan.
 
> And, I was talking about pci_stop_and_remove_bus_device() in dpc.
> if DPC calls driver's error callbacks, is it required to stop the devices  ?

DPC is PCIe's preferred surprise removal mechanism. If you don't want
the DPC driver to handle removing devices downstream the containment,
how do you want those devices get removed? Just calling driver's error
callbacks leaves the kernel's view of the topology in the wrong state.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v12 0/6] Address error and recovery for AER and DPC
  2018-03-12 17:33             ` Keith Busch
@ 2018-03-12 17:41               ` Sinan Kaya
  2018-03-12 17:56                 ` Keith Busch
  0 siblings, 1 reply; 25+ messages in thread
From: Sinan Kaya @ 2018-03-12 17:41 UTC (permalink / raw)
  To: Keith Busch, poza
  Cc: Bjorn Helgaas, Bjorn Helgaas, Philippe Ombredanne,
	Thomas Gleixner, Greg Kroah-Hartman, Kate Stewart, linux-pci,
	linux-kernel, Dongdong Liu, Wei Zhang, Timur Tabi,
	linux-pci-owner

I was just writing a reply to you. You acted first :)

On 3/12/2018 1:33 PM, Keith Busch wrote:
>>> After releasing a slot from DPC, the link is allowed to retrain. If
>>> there
>>> is a working device on the other side, a link up event occurs. That
>>> event is handled by the pciehp driver, and that schedules enumeration
>>> no matter what you do to the DPC driver.
>> yes, that is what i current, but this patch-set makes DPC aware of error
>> handling driver callbacks.
> I've been questioning the utility of doing that since the very first
> version of this patch set.
> 

I think we should all agree that shutting down the device drivers with active
work is not safe. There could be outstanding work that the endpoint driver
needs to take care of. 

That was the motivation for this change so that we give endpoint drivers an 
error callback when something goes wrong. 

The rest is implementation detail that we can all figure out.

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v12 0/6] Address error and recovery for AER and DPC
  2018-03-12 17:41               ` Sinan Kaya
@ 2018-03-12 17:56                 ` Keith Busch
  0 siblings, 0 replies; 25+ messages in thread
From: Keith Busch @ 2018-03-12 17:56 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: poza, Bjorn Helgaas, Bjorn Helgaas, Philippe Ombredanne,
	Thomas Gleixner, Greg Kroah-Hartman, Kate Stewart, linux-pci,
	linux-kernel, Dongdong Liu, Wei Zhang, Timur Tabi,
	linux-pci-owner

On Mon, Mar 12, 2018 at 01:41:07PM -0400, Sinan Kaya wrote:
> I was just writing a reply to you. You acted first :)
> 
> On 3/12/2018 1:33 PM, Keith Busch wrote:
> >>> After releasing a slot from DPC, the link is allowed to retrain. If
> >>> there
> >>> is a working device on the other side, a link up event occurs. That
> >>> event is handled by the pciehp driver, and that schedules enumeration
> >>> no matter what you do to the DPC driver.
> >> yes, that is what i current, but this patch-set makes DPC aware of error
> >> handling driver callbacks.
> > I've been questioning the utility of doing that since the very first
> > version of this patch set.
> > 
> 
> I think we should all agree that shutting down the device drivers with active
> work is not safe. There could be outstanding work that the endpoint driver
> needs to take care of. 
> 
> That was the motivation for this change so that we give endpoint drivers an 
> error callback when something goes wrong. 
> 
> The rest is implementation detail that we can all figure out.

I'm not sure if I agree here. All Linux device drivers are supposed to
cope with sudden/unexpected loss of communication at any time. This
includes cleaning up appropriately when requested to unbind from an
inaccessible device with active outstanding work.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v12 0/6] Address error and recovery for AER and DPC
  2018-03-12 14:25     ` Keith Busch
  2018-03-12 14:46       ` poza
@ 2018-03-12 19:47       ` Bjorn Helgaas
  2018-03-12 23:26         ` Keith Busch
  1 sibling, 1 reply; 25+ messages in thread
From: Bjorn Helgaas @ 2018-03-12 19:47 UTC (permalink / raw)
  To: Keith Busch
  Cc: Sinan Kaya, Oza Pawandeep, Bjorn Helgaas, Philippe Ombredanne,
	Thomas Gleixner, Greg Kroah-Hartman, Kate Stewart, linux-pci,
	linux-kernel, Dongdong Liu, Wei Zhang, Timur Tabi,
	Alex Williamson

[+cc Alex]

On Mon, Mar 12, 2018 at 08:25:51AM -0600, Keith Busch wrote:
> On Sun, Mar 11, 2018 at 11:03:58PM -0400, Sinan Kaya wrote:
> > On 3/11/2018 6:03 PM, Bjorn Helgaas wrote:
> > > On Wed, Feb 28, 2018 at 10:34:11PM +0530, Oza Pawandeep wrote:
> > 
> > > That difference has been there since the beginning of DPC, so it has
> > > nothing to do with *this* series EXCEPT for the fact that it really
> > > complicates the logic you're adding to reset_link() and
> > > broadcast_error_message().
> > > 
> > > We ought to be able to simplify that somehow because the only real
> > > difference between AER and DPC should be that DPC automatically
> > > disables the link and AER does it in software.
> > 
> > I agree this should be possible. Code execution path should be almost
> > identical to fatal error case.
> > 
> > Is there any reason why you went to stop driver path, Keith?
> 
> The fact is the link is truly down during a DPC event. When the link
> is enabled again, you don't know at that point if the device(s) on the
> other side have changed.

When DPC is triggered, the port takes the link down.  When we handle
an uncorrectable (nonfatal or fatal) AER event, software takes the
link down.

In both cases, devices on the other side are at least reset.  Whenever
the link goes down, it's conceivable the device could be replaced with
a different one before the link comes back up.  Is this why you remove
and re-enumerate?  (See tangent [1] below.)

The point is that from the device's hardware perspective, these
scenarios are the same (it sent a ERR_NONFATAL or ERR_FATAL message
and it sees the link go down).  I think we should make them the same
on the software side, too: the driver should see the same callbacks,
in the same order, whether we're doing AER or DPC.

If removing and re-enumerating is the right thing for DPC, I think
that means it's also the right thing for AER.

Along this line, we had a question a while back about logging AER
information after a DPC trigger.  Obviously we can't collect any
logged information from the downstream devices while link is down, but
I noticed the AER status bits are RW1CS, which means they're sticky
and are not modified by hot reset or FLR.

So we may be able to read and log the AER information after the DPC
driver brings the link back up.  We may want to do the same with AER,
i.e., reset the downstream devices first, then collect the AER status
bits after the link comes back up.

> Calling a driver's error handler for the wrong device in an unknown
> state may have undefined results. Enumerating the slot from scratch
> should be safe, and will assign resources, tune bus settings, and
> bind to the matching driver.

I agree with this; I think this is heading toward doing
remove/re-enumerate on AER errors as well because it has also reset
the device.

> Per spec, DPC is the recommended way for handling surprise removal
> events and even recommends DPC capable slots *not* set 'Surprise'
> in Slot Capabilities so that removals are always handled by DPC. This
> service driver was developed with that use in mind.

Thanks for this tip.  The only thing I've found so far is the mention
of Surprise Down triggering DPC in the last paragraph of sec 6.7.5.
Are there other references I should look at?  I haven't found the
recommendation about not setting 'Surprise' in Slot Capabilities yet.

Bjorn

[1] Tangent: I have similar concerns with the device reset paths.  We
currently save some config state, reset the device, and restore the
config state.  It's theoretically possible that the device was
replaced or came up with different firmware after the reset, so I
would really prefer to remove and re-enumerate there, too.  But that
would make it hard for things up the stack that want to reset the
device but not re-setup the whole stack.

I wonder if DPC is going to cause trouble for that scenario.  That's
not an argument against DPC, but it might be a stronger reason to
figure out how to deal with remove/re-enumerate in those stacks.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v12 0/6] Address error and recovery for AER and DPC
  2018-03-12 19:47       ` Bjorn Helgaas
@ 2018-03-12 23:26         ` Keith Busch
  2018-03-13  3:47           ` Sinan Kaya
  2018-05-08 19:25           ` Bjorn Helgaas
  0 siblings, 2 replies; 25+ messages in thread
From: Keith Busch @ 2018-03-12 23:26 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Sinan Kaya, Oza Pawandeep, Bjorn Helgaas, Philippe Ombredanne,
	Thomas Gleixner, Greg Kroah-Hartman, Kate Stewart, linux-pci,
	linux-kernel, Dongdong Liu, Wei Zhang, Timur Tabi,
	Alex Williamson

On Mon, Mar 12, 2018 at 02:47:30PM -0500, Bjorn Helgaas wrote:
> [+cc Alex]
> 
> On Mon, Mar 12, 2018 at 08:25:51AM -0600, Keith Busch wrote:
> > On Sun, Mar 11, 2018 at 11:03:58PM -0400, Sinan Kaya wrote:
> > > On 3/11/2018 6:03 PM, Bjorn Helgaas wrote:
> > > > On Wed, Feb 28, 2018 at 10:34:11PM +0530, Oza Pawandeep wrote:
> > > 
> > > > That difference has been there since the beginning of DPC, so it has
> > > > nothing to do with *this* series EXCEPT for the fact that it really
> > > > complicates the logic you're adding to reset_link() and
> > > > broadcast_error_message().
> > > > 
> > > > We ought to be able to simplify that somehow because the only real
> > > > difference between AER and DPC should be that DPC automatically
> > > > disables the link and AER does it in software.
> > > 
> > > I agree this should be possible. Code execution path should be almost
> > > identical to fatal error case.
> > > 
> > > Is there any reason why you went to stop driver path, Keith?
> > 
> > The fact is the link is truly down during a DPC event. When the link
> > is enabled again, you don't know at that point if the device(s) on the
> > other side have changed.
> 
> When DPC is triggered, the port takes the link down.  When we handle
> an uncorrectable (nonfatal or fatal) AER event, software takes the
> link down.
> 
> In both cases, devices on the other side are at least reset.  Whenever
> the link goes down, it's conceivable the device could be replaced with
> a different one before the link comes back up.  Is this why you remove
> and re-enumerate?  (See tangent [1] below.)

Yes. Truthfully, DPC events due to changing topologies was the motivating
use case when we initially developed this. We were also going for
simplicity (at least initially), and remove + re-enumerate seemed
safe without concerning this driver with other capability regsiters, or
coordinating with/depending on other drivers. For example, a successful
reset may depend on any particular driver calling pci_restore_state from
a good saved state.

> The point is that from the device's hardware perspective, these
> scenarios are the same (it sent a ERR_NONFATAL or ERR_FATAL message
> and it sees the link go down).  I think we should make them the same
> on the software side, too: the driver should see the same callbacks,
> in the same order, whether we're doing AER or DPC.
> 
> If removing and re-enumerating is the right thing for DPC, I think
> that means it's also the right thing for AER.
> 
> Along this line, we had a question a while back about logging AER
> information after a DPC trigger.  Obviously we can't collect any
> logged information from the downstream devices while link is down, but
> I noticed the AER status bits are RW1CS, which means they're sticky
> and are not modified by hot reset or FLR.
> 
> So we may be able to read and log the AER information after the DPC
> driver brings the link back up.  We may want to do the same with AER,
> i.e., reset the downstream devices first, then collect the AER status
> bits after the link comes back up.

I totally agree. We could consult Slot and AER status to guide a
smarter approach.

> > Calling a driver's error handler for the wrong device in an unknown
> > state may have undefined results. Enumerating the slot from scratch
> > should be safe, and will assign resources, tune bus settings, and
> > bind to the matching driver.
> 
> I agree with this; I think this is heading toward doing
> remove/re-enumerate on AER errors as well because it has also reset
> the device.
> 
> > Per spec, DPC is the recommended way for handling surprise removal
> > events and even recommends DPC capable slots *not* set 'Surprise'
> > in Slot Capabilities so that removals are always handled by DPC. This
> > service driver was developed with that use in mind.
> 
> Thanks for this tip.  The only thing I've found so far is the mention
> of Surprise Down triggering DPC in the last paragraph of sec 6.7.5.
> Are there other references I should look at?  I haven't found the
> recommendation about not setting 'Surprise' in Slot Capabilities yet.

No problem, it's in the "IMPLEMENTATION NOTE" at the end of 6.2.10.4,
"Avoid Disabled Link and Hot-Plug Surprise Use with DPC".

Outside the spec, Microsemi as one of the PCI-SIG contributors and early
adopters of the capability published a white paper "Hot and Surprise
Plug Recommendations for Enterprise PCIe" providing guidance for OS
drivers using DPC. We originally developed to that guidance. The paper
unfortunately appears to be pay-walled now. :(

DPC triggers don't necessarily mean a surprise removal occurred, and
I understand those conditions are motivating motivating these current
proposals. I've no qualms adding smarter handling as long as we don't
break removals: there are installations relying on this.

> [1] Tangent: I have similar concerns with the device reset paths.  We
> currently save some config state, reset the device, and restore the
> config state.  It's theoretically possible that the device was
> replaced or came up with different firmware after the reset, so I
> would really prefer to remove and re-enumerate there, too.  But that
> would make it hard for things up the stack that want to reset the
> device but not re-setup the whole stack.
> 
> I wonder if DPC is going to cause trouble for that scenario.  That's
> not an argument against DPC, but it might be a stronger reason to
> figure out how to deal with remove/re-enumerate in those stacks.

Indeed, that's a great point. From a storage perspective, when a removal
tears down the block devices, re-adding the same device initializes as a
new block handle. Applications with open file descriptors on old handles
are going to have a bad time. You can open through device mappers to
avoid those problems, but inflight IO may be aborted.

I assume other classes of devices have similar implications to consider,
so I agree expanding remove/re-enumerate may need to be considered
carefully.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v12 0/6] Address error and recovery for AER and DPC
  2018-03-12 23:26         ` Keith Busch
@ 2018-03-13  3:47           ` Sinan Kaya
  2018-03-14 20:50             ` Keith Busch
  2018-05-08 19:25           ` Bjorn Helgaas
  1 sibling, 1 reply; 25+ messages in thread
From: Sinan Kaya @ 2018-03-13  3:47 UTC (permalink / raw)
  To: Keith Busch, Bjorn Helgaas
  Cc: Oza Pawandeep, Bjorn Helgaas, Philippe Ombredanne,
	Thomas Gleixner, Greg Kroah-Hartman, Kate Stewart, linux-pci,
	linux-kernel, Dongdong Liu, Wei Zhang, Timur Tabi,
	Alex Williamson

On 3/12/2018 7:26 PM, Keith Busch wrote:
> On Mon, Mar 12, 2018 at 02:47:30PM -0500, Bjorn Helgaas wrote:
>> [+cc Alex]
>>
>> On Mon, Mar 12, 2018 at 08:25:51AM -0600, Keith Busch wrote:
>>> On Sun, Mar 11, 2018 at 11:03:58PM -0400, Sinan Kaya wrote:
>>>> On 3/11/2018 6:03 PM, Bjorn Helgaas wrote:
>>>>> On Wed, Feb 28, 2018 at 10:34:11PM +0530, Oza Pawandeep wrote:
>>>>
>>>>> That difference has been there since the beginning of DPC, so it has
>>>>> nothing to do with *this* series EXCEPT for the fact that it really
>>>>> complicates the logic you're adding to reset_link() and
>>>>> broadcast_error_message().
>>>>>
>>>>> We ought to be able to simplify that somehow because the only real
>>>>> difference between AER and DPC should be that DPC automatically
>>>>> disables the link and AER does it in software.
>>>>
>>>> I agree this should be possible. Code execution path should be almost
>>>> identical to fatal error case.
>>>>
>>>> Is there any reason why you went to stop driver path, Keith?
>>>
>>> The fact is the link is truly down during a DPC event. When the link
>>> is enabled again, you don't know at that point if the device(s) on the
>>> other side have changed.
>>
>> When DPC is triggered, the port takes the link down.  When we handle
>> an uncorrectable (nonfatal or fatal) AER event, software takes the
>> link down.
>>
>> In both cases, devices on the other side are at least reset.  Whenever
>> the link goes down, it's conceivable the device could be replaced with
>> a different one before the link comes back up.  Is this why you remove
>> and re-enumerate?  (See tangent [1] below.)
> 
> Yes. Truthfully, DPC events due to changing topologies was the motivating
> use case when we initially developed this. We were also going for
> simplicity (at least initially), and remove + re-enumerate seemed
> safe without concerning this driver with other capability regsiters, or
> coordinating with/depending on other drivers. For example, a successful
> reset may depend on any particular driver calling pci_restore_state from
> a good saved state.

The spec is recommending code to use "Hotplug Surprise" to differentiate
these two cases we are looking for. 

The use case Keith is looking for is for hotplug support. 
The case I and Oza are more interested is for error handling on platforms
with no hotplug support.

According to the spec, if "Hotplug Surprise" is set in slot capabilities,
then hotplug driver handles link up and DPC driver doesn't interfere with
its operation. Hotplug driver observes link up interrupt like it is doing today.
When link up event is observed, hotplug driver will do the enumeration.

If "Hotplug Surprise" bit is not set, it is the job of the DPC driver to
bring up the link. I believe this path should follow the AER driver path
as there is a very well defined error reporting and recovery framework
in the code.

The link comes back up automatically when DPC driver handles its interrupt
very similar to what secondary bus reset does for AER. I don't believe
there is a hotplug possibility under this condition since it is not supported
to begin with.

Should we plumb the "Hotplug Surprise" condition into the code to satisfy
both cases and leave the error handling path according to this code series?

> 
>> The point is that from the device's hardware perspective, these
>> scenarios are the same (it sent a ERR_NONFATAL or ERR_FATAL message
>> and it sees the link go down).  I think we should make them the same
>> on the software side, too: the driver should see the same callbacks,
>> in the same order, whether we're doing AER or DPC.
>>
>> If removing and re-enumerating is the right thing for DPC, I think
>> that means it's also the right thing for AER.
>>
-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v12 0/6] Address error and recovery for AER and DPC
  2018-03-13  3:47           ` Sinan Kaya
@ 2018-03-14 20:50             ` Keith Busch
  2018-03-14 21:00               ` Sinan Kaya
  0 siblings, 1 reply; 25+ messages in thread
From: Keith Busch @ 2018-03-14 20:50 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: Bjorn Helgaas, Oza Pawandeep, Bjorn Helgaas, Philippe Ombredanne,
	Thomas Gleixner, Greg Kroah-Hartman, Kate Stewart, linux-pci,
	linux-kernel, Dongdong Liu, Wei Zhang, Timur Tabi,
	Alex Williamson

On Mon, Mar 12, 2018 at 11:47:12PM -0400, Sinan Kaya wrote:
> 
> The spec is recommending code to use "Hotplug Surprise" to differentiate
> these two cases we are looking for. 
> 
> The use case Keith is looking for is for hotplug support. 
> The case I and Oza are more interested is for error handling on platforms
> with no hotplug support.
> 
> According to the spec, if "Hotplug Surprise" is set in slot capabilities,
> then hotplug driver handles link up and DPC driver doesn't interfere with
> its operation. Hotplug driver observes link up interrupt like it is doing today.
> When link up event is observed, hotplug driver will do the enumeration.
> 
> If "Hotplug Surprise" bit is not set, it is the job of the DPC driver to
> bring up the link. I believe this path should follow the AER driver path
> as there is a very well defined error reporting and recovery framework
> in the code.
> 
> The link comes back up automatically when DPC driver handles its interrupt
> very similar to what secondary bus reset does for AER. I don't believe
> there is a hotplug possibility under this condition since it is not supported
> to begin with.
> 
> Should we plumb the "Hotplug Surprise" condition into the code to satisfy
> both cases and leave the error handling path according to this code series?

I'm on board with this, but I think you mean "Hotplug Capable" rather
than "Hotplug Surprise". The former may co-exist with DPC and handles
the link-ups, leaving DPC responsible for handling the link-down.

The "Hotplug Surprise" capability is recommended to not co-exist with DPC
since that capability may suppress the "surprise link down" notification
that DPC needs to see in order to trigger.

If the "Hotplug Capable" bit is not set, detecting the link-up after a
DPC event would have to fall under a different component's responsibility,
so I think I'm finally seeing your point.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v12 0/6] Address error and recovery for AER and DPC
  2018-03-14 20:50             ` Keith Busch
@ 2018-03-14 21:00               ` Sinan Kaya
  0 siblings, 0 replies; 25+ messages in thread
From: Sinan Kaya @ 2018-03-14 21:00 UTC (permalink / raw)
  To: Keith Busch
  Cc: Bjorn Helgaas, Oza Pawandeep, Bjorn Helgaas, Philippe Ombredanne,
	Thomas Gleixner, Greg Kroah-Hartman, Kate Stewart, linux-pci,
	linux-kernel, Dongdong Liu, Wei Zhang, Timur Tabi,
	Alex Williamson

On 3/14/2018 4:50 PM, Keith Busch wrote:
> On Mon, Mar 12, 2018 at 11:47:12PM -0400, Sinan Kaya wrote:
>>
>> The spec is recommending code to use "Hotplug Surprise" to differentiate
>> these two cases we are looking for. 
>>
>> The use case Keith is looking for is for hotplug support. 
>> The case I and Oza are more interested is for error handling on platforms
>> with no hotplug support.
>>
>> According to the spec, if "Hotplug Surprise" is set in slot capabilities,
>> then hotplug driver handles link up and DPC driver doesn't interfere with
>> its operation. Hotplug driver observes link up interrupt like it is doing today.
>> When link up event is observed, hotplug driver will do the enumeration.
>>
>> If "Hotplug Surprise" bit is not set, it is the job of the DPC driver to
>> bring up the link. I believe this path should follow the AER driver path
>> as there is a very well defined error reporting and recovery framework
>> in the code.
>>
>> The link comes back up automatically when DPC driver handles its interrupt
>> very similar to what secondary bus reset does for AER. I don't believe
>> there is a hotplug possibility under this condition since it is not supported
>> to begin with.
>>
>> Should we plumb the "Hotplug Surprise" condition into the code to satisfy
>> both cases and leave the error handling path according to this code series?
> 
> I'm on board with this, but I think you mean "Hotplug Capable" rather
> than "Hotplug Surprise". The former may co-exist with DPC and handles
> the link-ups, leaving DPC responsible for handling the link-down.
> 

Yes, we can go with "Hotplug Capable".

> The "Hotplug Surprise" capability is recommended to not co-exist with DPC
> since that capability may suppress the "surprise link down" notification
> that DPC needs to see in order to trigger.
> 
> If the "Hotplug Capable" bit is not set, detecting the link-up after a
> DPC event would have to fall under a different component's responsibility,
> so I think I'm finally seeing your point.
> 

OK. To simplify our life, we can check if hotplug service is attached to this
device or not and follow two different paths.

We can get rid of re-enumeration and stop devices for the non-hotplug case
and make it behave exactly like AER.

Bjorn, will that work for you?


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v12 0/6] Address error and recovery for AER and DPC
  2018-03-12 23:26         ` Keith Busch
  2018-03-13  3:47           ` Sinan Kaya
@ 2018-05-08 19:25           ` Bjorn Helgaas
  1 sibling, 0 replies; 25+ messages in thread
From: Bjorn Helgaas @ 2018-05-08 19:25 UTC (permalink / raw)
  To: Keith Busch
  Cc: Sinan Kaya, Oza Pawandeep, Bjorn Helgaas, Philippe Ombredanne,
	Thomas Gleixner, Greg Kroah-Hartman, Kate Stewart, linux-pci,
	linux-kernel, Dongdong Liu, Wei Zhang, Timur Tabi,
	Alex Williamson

On Mon, Mar 12, 2018 at 05:26:26PM -0600, Keith Busch wrote:
> On Mon, Mar 12, 2018 at 02:47:30PM -0500, Bjorn Helgaas wrote:
> > On Mon, Mar 12, 2018 at 08:25:51AM -0600, Keith Busch wrote:

> > > Per spec, DPC is the recommended way for handling surprise removal
> > > events and even recommends DPC capable slots *not* set 'Surprise'
> > > in Slot Capabilities so that removals are always handled by DPC. This
> > > service driver was developed with that use in mind.
> > 
> > Thanks for this tip.  The only thing I've found so far is the mention
> > of Surprise Down triggering DPC in the last paragraph of sec 6.7.5.
> > Are there other references I should look at?  I haven't found the
> > recommendation about not setting 'Surprise' in Slot Capabilities yet.
> 
> No problem, it's in the "IMPLEMENTATION NOTE" at the end of 6.2.10.4,
> "Avoid Disabled Link and Hot-Plug Surprise Use with DPC".
> 
> Outside the spec, Microsemi as one of the PCI-SIG contributors and early
> adopters of the capability published a white paper "Hot and Surprise
> Plug Recommendations for Enterprise PCIe" providing guidance for OS
> drivers using DPC. We originally developed to that guidance. The paper
> unfortunately appears to be pay-walled now. :(

Any chance you have a URL or contact info for this pay-walled white
paper?  Sounds like it might have useful information in it.

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2018-05-08 19:25 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-28 17:04 [PATCH v12 0/6] Address error and recovery for AER and DPC Oza Pawandeep
2018-02-28 17:04 ` [PATCH v12 1/6] PCI/AER: Rename error recovery to generic PCI naming Oza Pawandeep
2018-02-28 17:04 ` [PATCH v12 2/6] PCI/AER: Factor out error reporting from AER Oza Pawandeep
2018-02-28 17:04 ` [PATCH v12 3/6] PCI/PORTDRV: Implement generic find service Oza Pawandeep
2018-03-06 14:02   ` Sinan Kaya
2018-03-08  7:56     ` poza
2018-02-28 17:04 ` [PATCH v12 4/6] PCI/DPC: Unify and plumb error handling into DPC Oza Pawandeep
2018-02-28 17:04 ` [PATCH v12 5/6] PCI: Unify wait for link active into generic PCI Oza Pawandeep
2018-02-28 17:04 ` [PATCH v12 6/6] PCI/DPC: Enumerate the devices after DPC trigger event Oza Pawandeep
2018-03-11 22:03 ` [PATCH v12 0/6] Address error and recovery for AER and DPC Bjorn Helgaas
2018-03-12  3:03   ` Sinan Kaya
2018-03-12 14:25     ` Keith Busch
2018-03-12 14:46       ` poza
2018-03-12 14:58         ` Keith Busch
2018-03-12 15:34           ` poza
2018-03-12 17:33             ` Keith Busch
2018-03-12 17:41               ` Sinan Kaya
2018-03-12 17:56                 ` Keith Busch
2018-03-12 19:47       ` Bjorn Helgaas
2018-03-12 23:26         ` Keith Busch
2018-03-13  3:47           ` Sinan Kaya
2018-03-14 20:50             ` Keith Busch
2018-03-14 21:00               ` Sinan Kaya
2018-05-08 19:25           ` Bjorn Helgaas
2018-03-12 14:01   ` poza

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.