linux-pci.vger.kernel.org archive mirror
* [PATCH v11 0/8] Add Error Disconnect Recover (EDR) support
@ 2019-12-27  0:39 sathyanarayanan.kuppuswamy
  2019-12-27  0:39 ` [PATCH v11 1/8] PCI/ERR: Update error status after reset_link() sathyanarayanan.kuppuswamy
                   ` (7 more replies)
  0 siblings, 8 replies; 17+ messages in thread
From: sathyanarayanan.kuppuswamy @ 2019-12-27  0:39 UTC (permalink / raw)
  To: bhelgaas
  Cc: linux-pci, linux-kernel, ashok.raj, keith.busch,
	sathyanarayanan.kuppuswamy

From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

This patchset adds support for the following features:

1. Error Disconnect Recover (EDR) support.
2. _OSC based negotiation support for DPC.

The EDR specification can be found at the following link:

https://members.pcisig.com/wg/PCI-SIG/document/12614

Changes since v10:
 * Added an "edr_enabled" member to the dpc private structure, which caches
   whether EDR is enabled, based on pcie_ports_dpc_native and FF mode.
 * Changed the type of the _DSM argument from Integer to Package in
   acpi_enable_dpc_port() to fix ACPI-related boot warnings.
 * Rebased on top of v5.5-rc3

Changes since v9:
 * Removed caching of pcie_aer_get_firmware_first() in the dpc driver.
 * Added proper spec references to the git logs of patches 5 and 7.
 * Added a new function parameter, "ff_check", to pci_cleanup_aer_uncorrect_error_status(),
   pci_aer_clear_fatal_status() and pci_cleanup_aer_error_status_regs().
 * Rebased on top of v5.4-rc5

Changes since v8:
 * Rebased on top of v5.4-rc1

Changes since v7:
 * Updated DSM version number to match the spec.

Changes since v6:
 * Reordered the patches so that EDR is enabled only after all necessary support is added to the kernel.
 * Addressed Bjorn's comments.

Changes since v5:
 * Addressed Keith's comments.
 * Added an additional check for FF mode in pci_aer_init().
 * Updated the commit message of the "PCI/DPC: Add support for DPC recovery on NON_FATAL errors" patch.

Changes since v4:
 * Rebased on top of v5.3-rc1
 * Fixed a lock/unlock issue in edr_handle_event().
 * Merged the "Update error status after reset_link()" patch into this patchset.

Changes since v3:
 * Moved EDR-related ACPI functions/definitions to pci-acpi.c.
 * Modified the commit messages of a few patches to include spec references.
 * Added support for handling DPC triggered by NON_FATAL errors.
 * Added edr_lock to protect a PCI device from duplicate EDR notifications.
 * Addressed Bjorn's comments.

Changes since v2:
 * Split the EDR support patch into multiple patches.
 * Addressed Bjorn's comments.

Changes since v1:
 * Rebased on top of v5.1-rc1

Kuppuswamy Sathyanarayanan (8):
  PCI/ERR: Update error status after reset_link()
  PCI/DPC: Allow dpc_probe() even if firmware first mode is enabled
  PCI/DPC: Add dpc_process_error() wrapper function
  PCI/DPC: Add Error Disconnect Recover (EDR) support
  PCI/AER: Allow clearing Error Status Register in FF mode
  PCI/DPC: Update comments related to DPC recovery on NON_FATAL errors
  PCI/DPC: Clear AER registers in EDR mode
  PCI/ACPI: Enable EDR support

 Documentation/PCI/pcieaer-howto.rst       |   2 +-
 drivers/acpi/pci_root.c                   |   9 +
 drivers/net/ethernet/intel/ice/ice_main.c |   2 +-
 drivers/ntb/hw/idt/ntb_hw_idt.c           |   2 +-
 drivers/pci/pci-acpi.c                    |  98 ++++++++++
 drivers/pci/pci.c                         |   2 +-
 drivers/pci/pci.h                         |   2 +-
 drivers/pci/pcie/Kconfig                  |  10 +
 drivers/pci/pcie/aer.c                    |  16 +-
 drivers/pci/pcie/dpc.c                    | 214 ++++++++++++++++++++--
 drivers/pci/pcie/err.c                    |  12 +-
 drivers/pci/pcie/portdrv_core.c           |   7 +-
 drivers/pci/probe.c                       |   1 +
 drivers/scsi/lpfc/lpfc_attr.c             |   2 +-
 include/linux/acpi.h                      |   6 +-
 include/linux/aer.h                       |  12 +-
 include/linux/pci-acpi.h                  |  11 ++
 include/linux/pci.h                       |   3 +-
 18 files changed, 365 insertions(+), 46 deletions(-)

-- 
2.21.0



* [PATCH v11 1/8] PCI/ERR: Update error status after reset_link()
  2019-12-27  0:39 [PATCH v11 0/8] Add Error Disconnect Recover (EDR) support sathyanarayanan.kuppuswamy
@ 2019-12-27  0:39 ` sathyanarayanan.kuppuswamy
  2020-01-04  0:34   ` Bjorn Helgaas
  2019-12-27  0:39 ` [PATCH v11 2/8] PCI/DPC: Allow dpc_probe() even if firmware first mode is enabled sathyanarayanan.kuppuswamy
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 17+ messages in thread
From: sathyanarayanan.kuppuswamy @ 2019-12-27  0:39 UTC (permalink / raw)
  To: bhelgaas
  Cc: linux-pci, linux-kernel, ashok.raj, keith.busch,
	sathyanarayanan.kuppuswamy

From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery") uses
reset_link() to recover from fatal errors. But if the reset is
successful, there is no need to continue with the rest of the error
recovery checks. Also, during fatal error recovery, if the initial error
status is PCI_ERS_RESULT_DISCONNECT or PCI_ERS_RESULT_NO_AER_DRIVER, then
even after a successful recovery (using reset_link()), pcie_do_recovery()
reports the recovery result as a failure. So update the error status
after reset_link().
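A minimal userspace sketch contrasting the old and new status handling for a frozen channel. It is not kernel code: the `ERS_*` names are hypothetical stand-ins for the kernel's `pci_ers_result_t` values, and it assumes (for simplicity) that every mmio_enabled/slot_reset driver callback reports recovery.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's pci_ers_result_t values. */
enum ers_result {
	ERS_NONE,
	ERS_CAN_RECOVER,
	ERS_NEED_RESET,
	ERS_DISCONNECT,
	ERS_RECOVERED,
	ERS_NO_AER_DRIVER,
};

/*
 * Simplified tail of pcie_do_recovery(): assume the mmio_enabled and
 * slot_reset callbacks all report recovery.
 */
static enum ers_result finish_callbacks(enum ers_result status)
{
	if (status == ERS_CAN_RECOVER || status == ERS_NEED_RESET)
		return ERS_RECOVERED;
	return status;
}

/* Old flow: reset_link()'s result is tested but never stored. */
bool old_flow_succeeds(enum ers_result status, bool frozen,
		       enum ers_result reset_result)
{
	if (frozen && reset_result != ERS_RECOVERED)
		return false;
	return finish_callbacks(status) == ERS_RECOVERED;
}

/* New flow: for a frozen channel, reset_link() decides the outcome. */
bool new_flow_succeeds(enum ers_result status, bool frozen,
		       enum ers_result reset_result)
{
	if (frozen) {
		status = reset_result;
		return status == ERS_RECOVERED; /* "goto done" or "goto failed" */
	}
	return finish_callbacks(status) == ERS_RECOVERED;
}
```

With an initial status of `ERS_DISCONNECT` and a successful reset, `old_flow_succeeds()` returns false while `new_flow_succeeds()` returns true, which is the mis-reported failure described above.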

Fixes: bdb5ac85777d ("PCI/ERR: Handle fatal error recovery")
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Acked-by: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/pcie/err.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index b0e6048a9208..53cd9200ec2c 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -204,9 +204,12 @@ void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
 	else
 		pci_walk_bus(bus, report_normal_detected, &status);
 
-	if (state == pci_channel_io_frozen &&
-	    reset_link(dev, service) != PCI_ERS_RESULT_RECOVERED)
-		goto failed;
+	if (state == pci_channel_io_frozen) {
+		status = reset_link(dev, service);
+		if (status != PCI_ERS_RESULT_RECOVERED)
+			goto failed;
+		goto done;
+	}
 
 	if (status == PCI_ERS_RESULT_CAN_RECOVER) {
 		status = PCI_ERS_RESULT_RECOVERED;
@@ -228,6 +231,7 @@ void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
 	if (status != PCI_ERS_RESULT_RECOVERED)
 		goto failed;
 
+done:
 	pci_dbg(dev, "broadcast resume message\n");
 	pci_walk_bus(bus, report_resume, &status);
 
-- 
2.21.0



* [PATCH v11 2/8] PCI/DPC: Allow dpc_probe() even if firmware first mode is enabled
  2019-12-27  0:39 [PATCH v11 0/8] Add Error Disconnect Recover (EDR) support sathyanarayanan.kuppuswamy
  2019-12-27  0:39 ` [PATCH v11 1/8] PCI/ERR: Update error status after reset_link() sathyanarayanan.kuppuswamy
@ 2019-12-27  0:39 ` sathyanarayanan.kuppuswamy
  2019-12-27  0:39 ` [PATCH v11 3/8] PCI/DPC: Add dpc_process_error() wrapper function sathyanarayanan.kuppuswamy
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 17+ messages in thread
From: sathyanarayanan.kuppuswamy @ 2019-12-27  0:39 UTC (permalink / raw)
  To: bhelgaas
  Cc: linux-pci, linux-kernel, ashok.raj, keith.busch,
	sathyanarayanan.kuppuswamy

From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

As per ACPI specification v6.3, sec 5.6.6, the Error Disconnect Recover
(EDR) notification is used by firmware to let the OS know about a DPC
event and to permit the OS to perform error recovery while processing
the EDR notification. Also, as per PCI firmware specification r3.2
Downstream Port Containment Related Enhancements ECN, sec 4.5.1, table
4-6, if DPC is controlled by firmware (firmware first mode), firmware is
responsible for initializing the Downstream Port Containment Extended
Capability Structures per firmware policy, and the OS is permitted to
read or write the DPC Control and Status registers of a port while
processing an EDR notification from firmware on that port.

Currently, if firmware controls DPC (firmware first mode), the OS does
not create/enumerate DPC PCIe port services. But if the OS supports the
EDR feature, then, per the spec references above, it should permit
enumeration of the DPC driver and also support handling of the ACPI EDR
notification. So, as a first step, allow dpc_probe() to continue even if
firmware first mode is enabled. Also add appropriate checks to ensure
device registers are not modified outside the EDR notification window in
firmware first mode. This is a preparatory patch for adding EDR support.
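The probe-time decisions this patch makes can be summarized as a pure function. This is a hypothetical userspace model: `struct dpc_probe_plan` and `dpc_plan()` are invented names, and the three inputs stand in for `pcie_aer_get_firmware_first()`, `pcie_ports_dpc_native` and `IS_ENABLED(CONFIG_ACPI)`.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical summary of what dpc_probe() decides after this patch. */
struct dpc_probe_plan {
	bool edr_enabled; /* firmware owns DPC; OS acts only on EDR */
	bool request_irq; /* install the DPC interrupt handler */
	bool write_ctl;   /* program the DPC Control register */
	int err;          /* 0, or -ENODEV if probe is refused */
};

struct dpc_probe_plan dpc_plan(bool firmware_first, bool dpc_native,
			       bool acpi_enabled)
{
	struct dpc_probe_plan p = { 0 };

	p.edr_enabled = firmware_first && !dpc_native;

	/* FF mode without ACPI: nobody could deliver EDR, so refuse probe. */
	if (p.edr_enabled && !acpi_enabled) {
		p.err = -ENODEV;
		return p;
	}

	/* Only touch IRQs and DPC registers when the OS owns DPC. */
	p.request_irq = !p.edr_enabled;
	p.write_ctl = !p.edr_enabled;
	return p;
}
```

Keeping the decision in one place like this makes it easy to see that, in FF mode, the OS never programs DPC at probe time and leaves that to firmware.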

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Acked-by: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/pcie/dpc.c | 68 ++++++++++++++++++++++++++++++++++--------
 1 file changed, 55 insertions(+), 13 deletions(-)

diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index e06f42f58d3d..2c1251afb6a2 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -22,6 +22,7 @@ struct dpc_dev {
 	u16			cap_pos;
 	bool			rp_extensions;
 	u8			rp_log_size;
+	bool			edr_enabled; /* EDR mode is supported */
 };
 
 static const char * const rp_pio_error_string[] = {
@@ -69,6 +70,14 @@ void pci_save_dpc_state(struct pci_dev *dev)
 	if (!dpc)
 		return;
 
+	/*
+	 * If DPC is controlled by firmware then save/restore tasks are also
+	 * controlled by firmware, so skip the rest of the function if DPC is
+	 * controlled by firmware.
+	 */
+	if (dpc->edr_enabled)
+		return;
+
 	save_state = pci_find_saved_ext_cap(dev, PCI_EXT_CAP_ID_DPC);
 	if (!save_state)
 		return;
@@ -90,6 +99,14 @@ void pci_restore_dpc_state(struct pci_dev *dev)
 	if (!dpc)
 		return;
 
+	/*
+	 * If DPC is controlled by firmware then save/restore tasks are also
+	 * controlled by firmware, so skip the rest of the function if DPC is
+	 * controlled by firmware.
+	 */
+	if (dpc->edr_enabled)
+		return;
+
 	save_state = pci_find_saved_ext_cap(dev, PCI_EXT_CAP_ID_DPC);
 	if (!save_state)
 		return;
@@ -291,24 +308,42 @@ static int dpc_probe(struct pcie_device *dev)
 	int status;
 	u16 ctl, cap;
 
-	if (pcie_aer_get_firmware_first(pdev) && !pcie_ports_dpc_native)
-		return -ENOTSUPP;
-
 	dpc = devm_kzalloc(device, sizeof(*dpc), GFP_KERNEL);
 	if (!dpc)
 		return -ENOMEM;
 
 	dpc->cap_pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_DPC);
 	dpc->dev = dev;
+	if (pcie_aer_get_firmware_first(pdev) && !pcie_ports_dpc_native)
+		dpc->edr_enabled = true;
 	set_service_data(dev, dpc);
 
-	status = devm_request_threaded_irq(device, dev->irq, dpc_irq,
-					   dpc_handler, IRQF_SHARED,
-					   "pcie-dpc", dpc);
-	if (status) {
-		pci_warn(pdev, "request IRQ%d failed: %d\n", dev->irq,
-			 status);
-		return status;
+	/*
+	 * As per PCIe spec r5.0, implementation note titled "Determination
+	 * of DPC Control", to avoid conflicts over whether platform
+	 * firmware or the operating system has control of DPC, it is
+	 * recommended that platform firmware and operating systems always link
+	 * the control of DPC to the control of Advanced Error Reporting.
+	 *
+	 * So use AER FF mode check API pcie_aer_get_firmware_first() to decide
+	 * whether DPC is controlled by software or firmware.
+	 *
+	 * If DPC is handled in firmware and ACPI support is not enabled
+	 * in OS, skip probe and return error.
+	 */
+	if (dpc->edr_enabled && !IS_ENABLED(CONFIG_ACPI))
+		return -ENODEV;
+
+	/* Register interrupt handler only if OS controls DPC */
+	if (!dpc->edr_enabled) {
+		status = devm_request_threaded_irq(device, dev->irq, dpc_irq,
+						   dpc_handler, IRQF_SHARED,
+						   "pcie-dpc", dpc);
+		if (status) {
+			pci_warn(pdev, "request IRQ%d failed: %d\n", dev->irq,
+				 status);
+			return status;
+		}
 	}
 
 	pci_read_config_word(pdev, dpc->cap_pos + PCI_EXP_DPC_CAP, &cap);
@@ -323,9 +358,12 @@ static int dpc_probe(struct pcie_device *dev)
 			dpc->rp_log_size = 0;
 		}
 	}
-
-	ctl = (ctl & 0xfff4) | PCI_EXP_DPC_CTL_EN_FATAL | PCI_EXP_DPC_CTL_INT_EN;
-	pci_write_config_word(pdev, dpc->cap_pos + PCI_EXP_DPC_CTL, ctl);
+	if (!dpc->edr_enabled) {
+		ctl = (ctl & 0xfff4) |
+			(PCI_EXP_DPC_CTL_EN_FATAL | PCI_EXP_DPC_CTL_INT_EN);
+		pci_write_config_word(pdev, dpc->cap_pos + PCI_EXP_DPC_CTL,
+				      ctl);
+	}
 
 	pci_info(pdev, "error containment capabilities: Int Msg #%d, RPExt%c PoisonedTLP%c SwTrigger%c RP PIO Log %d, DL_ActiveErr%c\n",
 		 cap & PCI_EXP_DPC_IRQ, FLAG(cap, PCI_EXP_DPC_CAP_RP_EXT),
@@ -343,6 +381,10 @@ static void dpc_remove(struct pcie_device *dev)
 	struct pci_dev *pdev = dev->port;
 	u16 ctl;
 
+	/* Skip updating DPC registers if DPC is controlled by firmware */
+	if (dpc->edr_enabled)
+		return;
+
 	pci_read_config_word(pdev, dpc->cap_pos + PCI_EXP_DPC_CTL, &ctl);
 	ctl &= ~(PCI_EXP_DPC_CTL_EN_FATAL | PCI_EXP_DPC_CTL_INT_EN);
 	pci_write_config_word(pdev, dpc->cap_pos + PCI_EXP_DPC_CTL, ctl);
-- 
2.21.0



* [PATCH v11 3/8] PCI/DPC: Add dpc_process_error() wrapper function
  2019-12-27  0:39 [PATCH v11 0/8] Add Error Disconnect Recover (EDR) support sathyanarayanan.kuppuswamy
  2019-12-27  0:39 ` [PATCH v11 1/8] PCI/ERR: Update error status after reset_link() sathyanarayanan.kuppuswamy
  2019-12-27  0:39 ` [PATCH v11 2/8] PCI/DPC: Allow dpc_probe() even if firmware first mode is enabled sathyanarayanan.kuppuswamy
@ 2019-12-27  0:39 ` sathyanarayanan.kuppuswamy
  2019-12-27  0:39 ` [PATCH v11 4/8] PCI/DPC: Add Error Disconnect Recover (EDR) support sathyanarayanan.kuppuswamy
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 17+ messages in thread
From: sathyanarayanan.kuppuswamy @ 2019-12-27  0:39 UTC (permalink / raw)
  To: bhelgaas
  Cc: linux-pci, linux-kernel, ashok.raj, keith.busch,
	sathyanarayanan.kuppuswamy

From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

With Error Disconnect Recover (EDR) support, a DPC event can be reported
either by the DPC IRQ or by an ACPI EDR notification. So create a
wrapper function, dpc_process_error(), and move the common error
handling code into it. It will be used to process the DPC event in both
the DPC IRQ and ACPI EDR event contexts.
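The refactoring pattern is two thin entry points funnelling into one shared routine. Below is a hypothetical userspace sketch of that shape; the `_model` names are invented, and the only real value used is 0x0F, the ACPI Disconnect Recover notification code (`ACPI_NOTIFY_DISCONNECT_RECOVER`).

```c
/* Hypothetical, simplified stand-in for struct dpc_dev. */
struct dpc_dev_model {
	int events_processed; /* how many times the common path ran */
};

/* Common error handling shared by both entry points. */
void dpc_process_error_model(struct dpc_dev_model *dpc)
{
	/* In the kernel: read DPC status, log the source, run recovery. */
	dpc->events_processed++;
}

/* IRQ-context entry point: the context pointer is the DPC device. */
int dpc_handler_model(int irq, void *context)
{
	(void)irq;
	dpc_process_error_model(context);
	return 1; /* stands in for IRQ_HANDLED */
}

/* ACPI notify entry point: only Disconnect Recover (0x0F) is handled. */
void edr_notify_model(unsigned int event, void *data)
{
	if (event != 0x0F)
		return;
	dpc_process_error_model(data);
}

/* Demo: one IRQ, one EDR notification and one unrelated event. */
int count_processed_events(void)
{
	struct dpc_dev_model dpc = { 0 };

	dpc_handler_model(42, &dpc);
	edr_notify_model(0x0F, &dpc);
	edr_notify_model(0x80, &dpc); /* not an EDR event: ignored */
	return dpc.events_processed;
}
```

`count_processed_events()` returns 2: the IRQ and the EDR notification both reach the shared routine, while the unrelated event is dropped.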

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Acked-by: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/pcie/dpc.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index 2c1251afb6a2..d29f5d25f3f9 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -241,10 +241,9 @@ static int dpc_get_aer_uncorrect_severity(struct pci_dev *dev,
 	return 1;
 }
 
-static irqreturn_t dpc_handler(int irq, void *context)
+static void dpc_process_error(struct dpc_dev *dpc)
 {
 	struct aer_err_info info;
-	struct dpc_dev *dpc = context;
 	struct pci_dev *pdev = dpc->dev->port;
 	u16 cap = dpc->cap_pos, status, source, reason, ext_reason;
 
@@ -277,6 +276,13 @@ static irqreturn_t dpc_handler(int irq, void *context)
 
 	/* We configure DPC so it only triggers on ERR_FATAL */
 	pcie_do_recovery(pdev, pci_channel_io_frozen, PCIE_PORT_SERVICE_DPC);
+}
+
+static irqreturn_t dpc_handler(int irq, void *context)
+{
+	struct dpc_dev *dpc = context;
+
+	dpc_process_error(dpc);
 
 	return IRQ_HANDLED;
 }
-- 
2.21.0



* [PATCH v11 4/8] PCI/DPC: Add Error Disconnect Recover (EDR) support
  2019-12-27  0:39 [PATCH v11 0/8] Add Error Disconnect Recover (EDR) support sathyanarayanan.kuppuswamy
                   ` (2 preceding siblings ...)
  2019-12-27  0:39 ` [PATCH v11 3/8] PCI/DPC: Add dpc_process_error() wrapper function sathyanarayanan.kuppuswamy
@ 2019-12-27  0:39 ` sathyanarayanan.kuppuswamy
  2019-12-27  0:39 ` [PATCH v11 5/8] PCI/AER: Allow clearing Error Status Register in FF mode sathyanarayanan.kuppuswamy
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 17+ messages in thread
From: sathyanarayanan.kuppuswamy @ 2019-12-27  0:39 UTC (permalink / raw)
  To: bhelgaas
  Cc: linux-pci, linux-kernel, ashok.raj, keith.busch,
	sathyanarayanan.kuppuswamy, Huong Nguyen, Austin Bolen

From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

As per ACPI specification r6.3, sec 5.6.6, when firmware owns Downstream
Port Containment (DPC), it's expected to use the "Error Disconnect
Recover" (EDR) notification to alert OSPM of a DPC event. If the OS
supports EDR, it's expected to handle the software state invalidation and
port recovery, and then let firmware know the recovery status via the
_OST ACPI method. The related _OST status codes can be found in ACPI
specification r6.3, sec 6.3.5.2.
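The _OST status word this patch builds packs the port's BDF into the upper 16 bits and the result code (0x80 on success, 0x81 on failure, as used by edr_handle_event() below) into the lower 16 bits. The locate _DSM returns a BDF in the same 15:8 bus, 7:3 device, 2:0 function layout. A small sketch of both encodings; the helper names are hypothetical and the packing mirrors `PCI_DEVID(bus, PCI_DEVFN(dev, fn))`.

```c
#include <stdint.h>

/* _OST result codes sent by this patch for an EDR event. */
#define EDR_OST_SUCCESS 0x80u
#define EDR_OST_FAILURE 0x81u

/* Pack bus/device/function like PCI_DEVID(bus, PCI_DEVFN(dev, fn)). */
uint16_t bdf_pack(unsigned int bus, unsigned int dev, unsigned int fn)
{
	return (uint16_t)(((bus & 0xffu) << 8) | ((dev & 0x1fu) << 3) |
			  (fn & 0x7u));
}

/* Unpack a BDF as returned by the locate _DSM. */
unsigned int bdf_bus(uint16_t bdf) { return (bdf >> 8) & 0xffu; }
unsigned int bdf_dev(uint16_t bdf) { return (bdf >> 3) & 0x1fu; }
unsigned int bdf_fn(uint16_t bdf) { return bdf & 0x7u; }

/* _OST status argument: BDF in bits 31:16, result code in bits 15:0. */
uint32_t edr_ost_status(uint16_t bdf, int recovered)
{
	return ((uint32_t)bdf << 16) |
	       (recovered ? EDR_OST_SUCCESS : EDR_OST_FAILURE);
}
```

For example, a successful recovery of port 3a:00.0 yields 0x3a000080, and a failed recovery of 00:1c.4 yields 0x00e40081.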

Also, as per PCI firmware specification r3.2 Downstream Port Containment
Related Enhancements ECN, sec 4.5.1, table 4-6, if DPC is controlled by
firmware (firmware first mode), firmware is responsible for configuring
DPC and the OS is responsible for error recovery. Also, the OS is allowed
to modify DPC registers only during the EDR notification window. So with
EDR support, the OS should provide DPC port services even in firmware
first mode.
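The ownership rule above can be pictured as a guard on DPC register writes. This is a hypothetical model, not how the patch implements it (the patch instead skips DPC register writes whenever `edr_enabled` is set); `struct dpc_port_model`, `in_edr_window` and `dpc_write_ctl_model()` are invented for illustration.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one DPC-capable port. */
struct dpc_port_model {
	bool firmware_first; /* firmware owns DPC (FF mode) */
	bool in_edr_window;  /* set only while an EDR notification is handled */
	uint16_t ctl;        /* stand-in for the DPC Control register */
};

/*
 * Write the modelled DPC Control register. In FF mode the OS may only
 * touch it while handling an EDR notification; otherwise it backs off.
 */
int dpc_write_ctl_model(struct dpc_port_model *port, uint16_t val)
{
	if (port->firmware_first && !port->in_edr_window)
		return -EPERM;
	port->ctl = val;
	return 0;
}
```

When the OS owns DPC (`firmware_first` false) every write goes through; in FF mode only writes made inside the EDR window do.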

Cc: Keith Busch <keith.busch@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Acked-by: Keith Busch <keith.busch@intel.com>
Tested-by: Huong Nguyen <huong.nguyen@dell.com>
Tested-by: Austin Bolen <Austin.Bolen@dell.com>
---
 drivers/pci/pci-acpi.c   |  98 +++++++++++++++++++++++++++++++
 drivers/pci/pcie/Kconfig |  10 ++++
 drivers/pci/pcie/dpc.c   | 122 ++++++++++++++++++++++++++++++++++++++-
 include/linux/pci-acpi.h |  11 ++++
 4 files changed, 240 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pci-acpi.c b/drivers/pci/pci-acpi.c
index 0c02d500158f..13086518de27 100644
--- a/drivers/pci/pci-acpi.c
+++ b/drivers/pci/pci-acpi.c
@@ -103,6 +103,104 @@ int acpi_get_rc_resources(struct device *dev, const char *hid, u16 segment,
 }
 #endif
 
+#if defined(CONFIG_PCIE_DPC) && defined(CONFIG_ACPI)
+
+/*
+ * _DSM wrapper function to enable/disable DPC port.
+ * @dpc   : DPC device structure
+ * @enable: status of DPC port (0 or 1).
+ *
+ * returns 0 on success or errno on failure.
+ */
+int acpi_enable_dpc_port(struct pci_dev *pdev, acpi_handle handle, bool enable)
+{
+	union acpi_object *obj, argv4, req;
+	int status = 0;
+
+	req.type = ACPI_TYPE_INTEGER;
+	req.integer.value = enable;
+
+	argv4.type = ACPI_TYPE_PACKAGE;
+	argv4.package.count = 1;
+	argv4.package.elements = &req;
+
+	/* As per PCI firmware specification r3.2 Downstream Port Containment
+	 * Related Enhancements ECN, sec 4.6.12, EDR_PORT_ENABLE_DSM is
+	 * optional, so if it's not implemented, return success.
+	 */
+	obj = acpi_evaluate_dsm(handle, &pci_acpi_dsm_guid, 5,
+				EDR_PORT_ENABLE_DSM, &argv4);
+	if (!obj)
+		return status;
+
+	if (obj->type != ACPI_TYPE_INTEGER || obj->integer.value != enable)
+		status = -EIO;
+
+	ACPI_FREE(obj);
+
+	return status;
+}
+
+/*
+ * _DSM wrapper function to locate DPC port.
+ * @pdev, @handle: PCI device that received the EDR event and its ACPI handle
+ *
+ * returns a referenced pci_dev (release with pci_dev_put()) or NULL.
+ */
+struct pci_dev *acpi_locate_dpc_port(struct pci_dev *pdev, acpi_handle handle)
+{
+	union acpi_object *obj;
+	u16 port;
+
+	obj = acpi_evaluate_dsm(handle, &pci_acpi_dsm_guid, 5,
+				EDR_PORT_LOCATE_DSM, NULL);
+	if (!obj)
+		return pci_dev_get(pdev);
+
+	if (obj->type != ACPI_TYPE_INTEGER) {
+		ACPI_FREE(obj);
+		return NULL;
+	}
+
+	/*
+	 * Firmware returns DPC port BDF details in following format:
+	 *	15:8 = bus
+	 *	7:3 = device
+	 *	2:0 = function
+	 */
+	port = obj->integer.value;
+
+	ACPI_FREE(obj);
+
+	return pci_get_domain_bus_and_slot(pci_domain_nr(pdev->bus),
+					   PCI_BUS_NUM(port), port & 0xff);
+}
+
+/*
+ * _OST wrapper function to let firmware know the status of EDR event.
+ * @pdev, @handle: PCI device of the DPC port and its ACPI handle.
+ * @status: Status of EDR event.
+ */
+int acpi_send_edr_status(struct pci_dev *pdev, acpi_handle handle, u16 status)
+{
+	u32 ost_status;
+
+	pci_dbg(pdev, "Sending EDR status: %x\n", status);
+
+	ost_status = PCI_DEVID(pdev->bus->number, pdev->devfn);
+	ost_status = (ost_status << 16) | status;
+
+	status = acpi_evaluate_ost(handle,
+				   ACPI_NOTIFY_DISCONNECT_RECOVER,
+				   ost_status, NULL);
+	if (ACPI_FAILURE(status))
+		return -EINVAL;
+
+	return 0;
+}
+
+#endif /* CONFIG_PCIE_DPC && CONFIG_ACPI */
+
 phys_addr_t acpi_pci_root_get_mcfg_addr(acpi_handle handle)
 {
 	acpi_status status = AE_NOT_EXIST;
diff --git a/drivers/pci/pcie/Kconfig b/drivers/pci/pcie/Kconfig
index 6e3c04b46fb1..2db8a3109cb5 100644
--- a/drivers/pci/pcie/Kconfig
+++ b/drivers/pci/pcie/Kconfig
@@ -140,3 +140,13 @@ config PCIE_BW
 	  This enables PCI Express Bandwidth Change Notification.  If
 	  you know link width or rate changes occur only to correct
 	  unreliable links, you may answer Y.
+
+config PCIE_EDR
+	bool "PCI Express Error Disconnect Recover support"
+	default n
+	depends on PCIE_DPC && ACPI
+	help
+	  This option adds Error Disconnect Recover support as specified
+	  in PCI firmware specification v3.2 Downstream Port Containment
+	  Related Enhancements ECN. Enable this if you want to support hybrid
+	  DPC model which uses both firmware and OS to implement DPC.
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index d29f5d25f3f9..b19d707db222 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -13,6 +13,8 @@
 #include <linux/interrupt.h>
 #include <linux/init.h>
 #include <linux/pci.h>
+#include <linux/acpi.h>
+#include <linux/pci-acpi.h>
 
 #include "portdrv.h"
 #include "../pci.h"
@@ -23,6 +25,11 @@ struct dpc_dev {
 	bool			rp_extensions;
 	u8			rp_log_size;
 	bool			edr_enabled; /* EDR mode is supported */
+	pci_ers_result_t	error_state;
+	struct mutex		edr_lock;
+#ifdef CONFIG_ACPI
+	struct acpi_device	*adev;
+#endif
 };
 
 static const char * const rp_pio_error_string[] = {
@@ -161,6 +168,9 @@ static pci_ers_result_t dpc_reset_link(struct pci_dev *pdev)
 	if (!pcie_wait_for_link(pdev, true))
 		return PCI_ERS_RESULT_DISCONNECT;
 
+	/* Since the device recovery is done just update the error state */
+	dpc->error_state = PCI_ERS_RESULT_RECOVERED;
+
 	return PCI_ERS_RESULT_RECOVERED;
 }
 
@@ -305,13 +315,90 @@ static irqreturn_t dpc_irq(int irq, void *context)
 	return IRQ_HANDLED;
 }
 
+static void dpc_error_resume(struct pci_dev *dev)
+{
+	struct dpc_dev *dpc = to_dpc_dev(dev);
+
+	dpc->error_state = PCI_ERS_RESULT_RECOVERED;
+}
+
+#ifdef CONFIG_ACPI
+
+static void edr_handle_event(acpi_handle handle, u32 event, void *data)
+{
+	struct dpc_dev *dpc = (struct dpc_dev *) data;
+	struct pci_dev *pdev = dpc->dev->port;
+	u16 status;
+
+	pci_info(pdev, "ACPI event %x received\n", event);
+
+	if (event != ACPI_NOTIFY_DISCONNECT_RECOVER)
+		return;
+
+	/*
+	 * Check if _DSM(0xD) is available, and if present locate the
+	 * port which issued EDR event.
+	 */
+	pdev = acpi_locate_dpc_port(pdev, dpc->adev->handle);
+	if (!pdev) {
+		pdev = dpc->dev->port;
+		pci_err(pdev, "No valid port found\n");
+		return;
+	}
+
+	dpc = to_dpc_dev(pdev);
+	if (!dpc) {
+		pci_err(pdev, "DPC port is NULL\n");
+		goto done;
+	}
+
+	mutex_lock(&dpc->edr_lock);
+
+	dpc->error_state = PCI_ERS_RESULT_DISCONNECT;
+
+	/*
+	 * Check if the port supports DPC:
+	 *
+	 * If port supports DPC, then fall back to default error
+	 * recovery.
+	 */
+	if (dpc->cap_pos) {
+		/* Check if there is a valid DPC trigger */
+		pci_read_config_word(pdev, dpc->cap_pos + PCI_EXP_DPC_STATUS,
+				     &status);
+		if (!(status & PCI_EXP_DPC_STATUS_TRIGGER)) {
+			pci_err(pdev, "Invalid DPC trigger %x\n", status);
+			goto edr_unlock;
+		}
+		dpc_process_error(dpc);
+	}
+
+	/*
+	 * If recovery is successful, send _OST(0xF, BDF << 16 | 0x80)
+	 * to firmware. If not successful, send _OST(0xF, BDF << 16 | 0x81).
+	 */
+	if (dpc->error_state == PCI_ERS_RESULT_RECOVERED)
+		status = 0x80;
+	else
+		status = 0x81;
+
+	acpi_send_edr_status(pdev, dpc->adev->handle, status);
+
+edr_unlock:
+	mutex_unlock(&dpc->edr_lock);
+done:
+	pci_dev_put(pdev);
+}
+
+#endif
+
 #define FLAG(x, y) (((x) & (y)) ? '+' : '-')
 static int dpc_probe(struct pcie_device *dev)
 {
 	struct dpc_dev *dpc;
 	struct pci_dev *pdev = dev->port;
 	struct device *device = &dev->device;
-	int status;
+	int status = 0;
 	u16 ctl, cap;
 
 	dpc = devm_kzalloc(device, sizeof(*dpc), GFP_KERNEL);
@@ -324,6 +411,9 @@ static int dpc_probe(struct pcie_device *dev)
 		dpc->edr_enabled = 1;
 	set_service_data(dev, dpc);
 
+	dpc->error_state = PCI_ERS_RESULT_NONE;
+	mutex_init(&dpc->edr_lock);
+
 	/*
 	 * As per PCIe spec r5.0, implementation note titled "Determination
 	 * of DPC Control", to avoid conflicts over whether platform
@@ -352,6 +442,35 @@ static int dpc_probe(struct pcie_device *dev)
 		}
 	}
 
+#ifdef CONFIG_ACPI
+	struct acpi_device *adev = ACPI_COMPANION(&pdev->dev);
+
+	if (pcie_aer_get_firmware_first(pdev) && adev) {
+		acpi_status astatus;
+
+		dpc->adev = adev;
+
+		astatus = acpi_install_notify_handler(adev->handle,
+						      ACPI_SYSTEM_NOTIFY,
+						      edr_handle_event,
+						      dpc);
+
+		if (ACPI_FAILURE(astatus)) {
+			pci_err(pdev,
+				"Install ACPI_SYSTEM_NOTIFY handler failed\n");
+			return -EBUSY;
+		}
+
+		status = acpi_enable_dpc_port(pdev, adev->handle, true);
+		if (status) {
+			pci_warn(pdev, "Enable DPC port failed\n");
+			acpi_remove_notify_handler(adev->handle,
+						   ACPI_SYSTEM_NOTIFY,
+						   edr_handle_event);
+			return status;
+		}
+	}
+#endif
 	pci_read_config_word(pdev, dpc->cap_pos + PCI_EXP_DPC_CAP, &cap);
 	pci_read_config_word(pdev, dpc->cap_pos + PCI_EXP_DPC_CTL, &ctl);
 
@@ -403,6 +522,7 @@ static struct pcie_port_service_driver dpcdriver = {
 	.probe		= dpc_probe,
 	.remove		= dpc_remove,
 	.reset_link	= dpc_reset_link,
+	.error_resume	= dpc_error_resume,
 };
 
 int __init pcie_dpc_init(void)
diff --git a/include/linux/pci-acpi.h b/include/linux/pci-acpi.h
index 62b7fdcc661c..36ffe1e16e69 100644
--- a/include/linux/pci-acpi.h
+++ b/include/linux/pci-acpi.h
@@ -117,6 +117,17 @@ static inline void acpi_pci_add_bus(struct pci_bus *bus) { }
 static inline void acpi_pci_remove_bus(struct pci_bus *bus) { }
 #endif	/* CONFIG_ACPI */
 
+#if defined(CONFIG_PCIE_DPC) && defined(CONFIG_ACPI)
+#define EDR_PORT_ENABLE_DSM     0x0C
+#define EDR_PORT_LOCATE_DSM     0x0D
+int acpi_enable_dpc_port(struct pci_dev *pdev, acpi_handle handle,
+			 bool enable);
+struct pci_dev *acpi_locate_dpc_port(struct pci_dev *pdev,
+				     acpi_handle handle);
+int acpi_send_edr_status(struct pci_dev *pdev,
+			 acpi_handle handle, u16 status);
+#endif
+
 #ifdef CONFIG_ACPI_APEI
 extern bool aer_acpi_firmware_first(void);
 #else
-- 
2.21.0



* [PATCH v11 5/8] PCI/AER: Allow clearing Error Status Register in FF mode
  2019-12-27  0:39 [PATCH v11 0/8] Add Error Disconnect Recover (EDR) support sathyanarayanan.kuppuswamy
                   ` (3 preceding siblings ...)
  2019-12-27  0:39 ` [PATCH v11 4/8] PCI/DPC: Add Error Disconnect Recover (EDR) support sathyanarayanan.kuppuswamy
@ 2019-12-27  0:39 ` sathyanarayanan.kuppuswamy
  2019-12-30 23:59   ` Bjorn Helgaas
  2019-12-27  0:39 ` [PATCH v11 6/8] PCI/DPC: Update comments related to DPC recovery on NON_FATAL errors sathyanarayanan.kuppuswamy
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 17+ messages in thread
From: sathyanarayanan.kuppuswamy @ 2019-12-27  0:39 UTC (permalink / raw)
  To: bhelgaas
  Cc: linux-pci, linux-kernel, ashok.raj, keith.busch,
	sathyanarayanan.kuppuswamy

From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

As per PCI firmware specification r3.2 System Firmware Intermediary
(SFI) _OSC and DPC Updates ECR
(https://members.pcisig.com/wg/PCI-SIG/document/13563), sec titled
"DPC Event Handling Implementation Note", page 10, Error Disconnect
Recover (EDR) support allows the OS to handle error recovery and to clear
Error Registers even in FF mode. So create an exception to the FF mode
checks in pci_cleanup_aer_uncorrect_error_status(),
pci_aer_clear_fatal_status() and pci_cleanup_aer_error_status_regs()
when they are called from the DPC code path.
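A rough model of the new `ff_check` gate, assuming (as the commit message describes) that the DPC/EDR path passes false while the existing callers touched by this patch pass 1 to keep the old behavior. `clear_nonfatal_model()` is a hypothetical name; it mirrors only the FF-mode check and the "clear ERR_NONFATAL bits only" step of pci_cleanup_aer_uncorrect_error_status(), treating the Uncorrectable Error Status register as write-1-to-clear.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns -EIO if the FF-mode check refuses the request, otherwise the
 * value the status register would hold after clearing non-fatal bits.
 * A severity bit set to 1 marks that error as fatal.
 */
int64_t clear_nonfatal_model(bool firmware_first, bool ff_check,
			     uint32_t severity, uint32_t status)
{
	uint32_t to_clear;

	if (ff_check && firmware_first)
		return -EIO; /* firmware owns AER; regular callers back off */

	to_clear = status & ~severity; /* ERR_NONFATAL bits only */
	/* Writing to_clear to an RW1C register clears exactly those bits. */
	return status & ~to_clear;
}
```

With `ff_check` false the FF-mode refusal is bypassed, so an EDR handler can clear the non-fatal bits even when firmware nominally owns AER; fatal bits (here 0x10) are left set in every case.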

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Acked-by: Keith Busch <keith.busch@intel.com>
---
 Documentation/PCI/pcieaer-howto.rst       |  2 +-
 drivers/net/ethernet/intel/ice/ice_main.c |  2 +-
 drivers/ntb/hw/idt/ntb_hw_idt.c           |  2 +-
 drivers/pci/pci.c                         |  2 +-
 drivers/pci/pci.h                         |  2 +-
 drivers/pci/pcie/aer.c                    | 16 ++++++++--------
 drivers/pci/pcie/dpc.c                    |  4 ++--
 drivers/pci/pcie/err.c                    |  2 +-
 drivers/scsi/lpfc/lpfc_attr.c             |  2 +-
 include/linux/aer.h                       | 12 ++++++++----
 10 files changed, 25 insertions(+), 21 deletions(-)

diff --git a/Documentation/PCI/pcieaer-howto.rst b/Documentation/PCI/pcieaer-howto.rst
index 18bdefaafd1a..184c966b61cb 100644
--- a/Documentation/PCI/pcieaer-howto.rst
+++ b/Documentation/PCI/pcieaer-howto.rst
@@ -243,7 +243,7 @@ messages to root port when an error is detected.
 
 ::
 
-  int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev);`
+  int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev, bool ff_check);
 
 pci_cleanup_aer_uncorrect_error_status cleanups the uncorrectable
 error status register.
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index 69bff085acf7..8378dbf3500b 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -3491,7 +3491,7 @@ static pci_ers_result_t ice_pci_err_slot_reset(struct pci_dev *pdev)
 			result = PCI_ERS_RESULT_DISCONNECT;
 	}
 
-	err = pci_cleanup_aer_uncorrect_error_status(pdev);
+	err = pci_cleanup_aer_uncorrect_error_status(pdev, true);
 	if (err)
 		dev_dbg(&pdev->dev,
 			"pci_cleanup_aer_uncorrect_error_status failed, error %d\n",
diff --git a/drivers/ntb/hw/idt/ntb_hw_idt.c b/drivers/ntb/hw/idt/ntb_hw_idt.c
index dcf234680535..9023308be2de 100644
--- a/drivers/ntb/hw/idt/ntb_hw_idt.c
+++ b/drivers/ntb/hw/idt/ntb_hw_idt.c
@@ -2675,7 +2675,7 @@ static int idt_init_pci(struct idt_ntb_dev *ndev)
 	if (ret != 0)
 		dev_warn(&pdev->dev, "PCIe AER capability disabled\n");
 	else /* Cleanup uncorrectable error status before getting to init */
-		pci_cleanup_aer_uncorrect_error_status(pdev);
+		pci_cleanup_aer_uncorrect_error_status(pdev, true);
 
 	/* First enable the PCI device */
 	ret = pcim_enable_device(pdev);
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index e87196cc1a7f..6264244d92bf 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1499,7 +1499,7 @@ void pci_restore_state(struct pci_dev *dev)
 	pci_restore_rebar_state(dev);
 	pci_restore_dpc_state(dev);
 
-	pci_cleanup_aer_error_status_regs(dev);
+	pci_cleanup_aer_error_status_regs(dev, true);
 	pci_restore_aer_state(dev);
 
 	pci_restore_config_space(dev);
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index a0a53bd05a0b..0b4452f72a9a 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -646,7 +646,7 @@ void pci_no_aer(void);
 void pci_aer_init(struct pci_dev *dev);
 void pci_aer_exit(struct pci_dev *dev);
 extern const struct attribute_group aer_stats_attr_group;
-void pci_aer_clear_fatal_status(struct pci_dev *dev);
+void pci_aer_clear_fatal_status(struct pci_dev *dev, bool ff_check);
 void pci_aer_clear_device_status(struct pci_dev *dev);
 #else
 static inline void pci_no_aer(void) { }
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 1ca86f2e0166..c64a91b347d2 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -376,7 +376,7 @@ void pci_aer_clear_device_status(struct pci_dev *dev)
 	pcie_capability_write_word(dev, PCI_EXP_DEVSTA, sta);
 }
 
-int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev)
+int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev, bool ff_check)
 {
 	int pos;
 	u32 status, sev;
@@ -385,7 +385,7 @@ int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev)
 	if (!pos)
 		return -EIO;
 
-	if (pcie_aer_get_firmware_first(dev))
+	if (ff_check && pcie_aer_get_firmware_first(dev))
 		return -EIO;
 
 	/* Clear status bits for ERR_NONFATAL errors only */
@@ -399,7 +399,7 @@ int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev)
 }
 EXPORT_SYMBOL_GPL(pci_cleanup_aer_uncorrect_error_status);
 
-void pci_aer_clear_fatal_status(struct pci_dev *dev)
+void pci_aer_clear_fatal_status(struct pci_dev *dev, bool ff_check)
 {
 	int pos;
 	u32 status, sev;
@@ -408,8 +408,8 @@ void pci_aer_clear_fatal_status(struct pci_dev *dev)
 	if (!pos)
 		return;
 
-	if (pcie_aer_get_firmware_first(dev))
-		return;
+	if (ff_check && pcie_aer_get_firmware_first(dev))
+		return;
 
 	/* Clear status bits for ERR_FATAL errors only */
 	pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, &status);
@@ -419,7 +419,7 @@ void pci_aer_clear_fatal_status(struct pci_dev *dev)
 		pci_write_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, status);
 }
 
-int pci_cleanup_aer_error_status_regs(struct pci_dev *dev)
+int pci_cleanup_aer_error_status_regs(struct pci_dev *dev, bool ff_check)
 {
 	int pos;
 	u32 status;
@@ -432,7 +432,7 @@ int pci_cleanup_aer_error_status_regs(struct pci_dev *dev)
 	if (!pos)
 		return -EIO;
 
-	if (pcie_aer_get_firmware_first(dev))
+	if (ff_check && pcie_aer_get_firmware_first(dev))
 		return -EIO;
 
 	port_type = pci_pcie_type(dev);
@@ -515,7 +515,7 @@ void pci_aer_init(struct pci_dev *dev)
 	n = pcie_cap_has_rtctl(dev) ? 5 : 4;
 	pci_add_ext_cap_save_buffer(dev, PCI_EXT_CAP_ID_ERR, sizeof(u32) * n);
 
-	pci_cleanup_aer_error_status_regs(dev);
+	pci_cleanup_aer_error_status_regs(dev, 1);
 }
 
 void pci_aer_exit(struct pci_dev *dev)
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index b19d707db222..e711a3747f48 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -280,8 +280,8 @@ static void dpc_process_error(struct dpc_dev *dpc)
 		 dpc_get_aer_uncorrect_severity(pdev, &info) &&
 		 aer_get_device_error_info(pdev, &info)) {
 		aer_print_error(pdev, &info);
-		pci_cleanup_aer_uncorrect_error_status(pdev);
-		pci_aer_clear_fatal_status(pdev);
+		pci_cleanup_aer_uncorrect_error_status(pdev, 0);
+		pci_aer_clear_fatal_status(pdev, 0);
 	}
 
 	/* We configure DPC so it only triggers on ERR_FATAL */
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 53cd9200ec2c..043eab17e45d 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -236,7 +236,7 @@ void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
 	pci_walk_bus(bus, report_resume, &status);
 
 	pci_aer_clear_device_status(dev);
-	pci_cleanup_aer_uncorrect_error_status(dev);
+	pci_cleanup_aer_uncorrect_error_status(dev, 1);
 	pci_info(dev, "AER: Device recovery successful\n");
 	return;
 
diff --git a/drivers/scsi/lpfc/lpfc_attr.c b/drivers/scsi/lpfc/lpfc_attr.c
index 4ff82b36a37a..5b37efd29e20 100644
--- a/drivers/scsi/lpfc/lpfc_attr.c
+++ b/drivers/scsi/lpfc/lpfc_attr.c
@@ -4810,7 +4810,7 @@ lpfc_aer_cleanup_state(struct device *dev, struct device_attribute *attr,
 		return -EINVAL;
 
 	if (phba->hba_flag & HBA_AER_ENABLED)
-		rc = pci_cleanup_aer_uncorrect_error_status(phba->pcidev);
+		rc = pci_cleanup_aer_uncorrect_error_status(phba->pcidev, 1);
 
 	if (rc == 0)
 		return strlen(buf);
diff --git a/include/linux/aer.h b/include/linux/aer.h
index fa19e01f418a..f4f49df500ad 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -44,8 +44,10 @@ struct aer_capability_regs {
 /* PCIe port driver needs this function to enable AER */
 int pci_enable_pcie_error_reporting(struct pci_dev *dev);
 int pci_disable_pcie_error_reporting(struct pci_dev *dev);
-int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev);
-int pci_cleanup_aer_error_status_regs(struct pci_dev *dev);
+int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev,
+					   bool ff_check);
+int pci_cleanup_aer_error_status_regs(struct pci_dev *dev,
+				      bool ff_check);
 void pci_save_aer_state(struct pci_dev *dev);
 void pci_restore_aer_state(struct pci_dev *dev);
 #else
@@ -57,11 +59,13 @@ static inline int pci_disable_pcie_error_reporting(struct pci_dev *dev)
 {
 	return -EINVAL;
 }
-static inline int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev)
+static inline int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev,
+							 bool ff_check)
 {
 	return -EINVAL;
 }
-static inline int pci_cleanup_aer_error_status_regs(struct pci_dev *dev)
+static inline int pci_cleanup_aer_error_status_regs(struct pci_dev *dev,
+						    bool ff_check)
 {
 	return -EINVAL;
 }
-- 
2.21.0



* [PATCH v11 6/8] PCI/DPC: Update comments related to DPC recovery on NON_FATAL errors
  2019-12-27  0:39 [PATCH v11 0/8] Add Error Disconnect Recover (EDR) support sathyanarayanan.kuppuswamy
                   ` (4 preceding siblings ...)
  2019-12-27  0:39 ` [PATCH v11 5/8] PCI/AER: Allow clearing Error Status Register in FF mode sathyanarayanan.kuppuswamy
@ 2019-12-27  0:39 ` sathyanarayanan.kuppuswamy
  2019-12-27  0:39 ` [PATCH v11 7/8] PCI/DPC: Clear AER registers in EDR mode sathyanarayanan.kuppuswamy
  2019-12-27  0:39 ` [PATCH v11 8/8] PCI/ACPI: Enable EDR support sathyanarayanan.kuppuswamy
  7 siblings, 0 replies; 17+ messages in thread
From: sathyanarayanan.kuppuswamy @ 2019-12-27  0:39 UTC (permalink / raw)
  To: bhelgaas
  Cc: linux-pci, linux-kernel, ashok.raj, keith.busch,
	sathyanarayanan.kuppuswamy

From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

In native mode, the DPC driver configures DPC to trigger only on
ERR_FATAL, so it only supports port recovery for fatal errors. With
Error Disconnect Recover (EDR) support, however, DPC is configured by
firmware, so DPC may be triggered by both ERR_FATAL and ERR_NONFATAL
errors. Update the comments to describe how DPC recovery from
ERR_NONFATAL errors is handled.
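
To illustrate the comment change, here is a small userspace sketch (not
kernel code) of how the DPC Trigger Reason could be decoded for reporting
while every trigger still goes down the frozen recovery path. The bit
positions follow the DPC Status register in the PCIe Base Specification;
the macro, enum, and function names are local to this sketch and are not
kernel APIs.

```c
#include <stdint.h>

/*
 * DPC Status register fields as laid out in the PCIe Base Specification.
 * Local names for this sketch only, not kernel macros.
 */
#define DPC_STATUS_TRIGGER   0x0001u /* bit 0: DPC Trigger Status */
#define DPC_STATUS_RSN_MASK  0x0006u /* bits 2:1: DPC Trigger Reason */
#define DPC_STATUS_RSN_UNCOR 0x0000u /* unmasked uncorrectable error */
#define DPC_STATUS_RSN_NFE   0x0002u /* ERR_NONFATAL message received */
#define DPC_STATUS_RSN_FE    0x0004u /* ERR_FATAL message received */

enum channel_state { CHANNEL_NORMAL, CHANNEL_FROZEN };

/* Human-readable trigger reason, e.g. for a log message. */
const char *dpc_trigger_reason(uint16_t status)
{
	switch (status & DPC_STATUS_RSN_MASK) {
	case DPC_STATUS_RSN_UNCOR:
		return "unmasked uncorrectable error";
	case DPC_STATUS_RSN_NFE:
		return "ERR_NONFATAL";
	case DPC_STATUS_RSN_FE:
		return "ERR_FATAL";
	default:
		return "see Trigger Reason Extension";
	}
}

/*
 * Once DPC has triggered, the link is already down whatever the reason,
 * so the reason is only reported; recovery always uses the frozen path.
 */
enum channel_state dpc_recovery_state(uint16_t status)
{
	return (status & DPC_STATUS_TRIGGER) ? CHANNEL_FROZEN : CHANNEL_NORMAL;
}
```

Both an ERR_NONFATAL and an ERR_FATAL trigger map to CHANNEL_FROZEN,
which is what passing pci_channel_io_frozen unconditionally amounts to.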

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Acked-by: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/pcie/dpc.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index e711a3747f48..db2e5cb635d7 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -284,7 +284,11 @@ static void dpc_process_error(struct dpc_dev *dpc)
 		pci_aer_clear_fatal_status(pdev, 0);
 	}
 
-	/* We configure DPC so it only triggers on ERR_FATAL */
+	/*
+	 * Irrespective of whether the DPC event is triggered by
+	 * ERR_FATAL or ERR_NONFATAL, since the link is already down,
+	 * use the FATAL error recovery path for both cases.
+	 */
 	pcie_do_recovery(pdev, pci_channel_io_frozen, PCIE_PORT_SERVICE_DPC);
 }
 
-- 
2.21.0



* [PATCH v11 7/8] PCI/DPC: Clear AER registers in EDR mode
  2019-12-27  0:39 [PATCH v11 0/8] Add Error Disconnect Recover (EDR) support sathyanarayanan.kuppuswamy
                   ` (5 preceding siblings ...)
  2019-12-27  0:39 ` [PATCH v11 6/8] PCI/DPC: Update comments related to DPC recovery on NON_FATAL errors sathyanarayanan.kuppuswamy
@ 2019-12-27  0:39 ` sathyanarayanan.kuppuswamy
  2019-12-27  0:39 ` [PATCH v11 8/8] PCI/ACPI: Enable EDR support sathyanarayanan.kuppuswamy
  7 siblings, 0 replies; 17+ messages in thread
From: sathyanarayanan.kuppuswamy @ 2019-12-27  0:39 UTC (permalink / raw)
  To: bhelgaas
  Cc: linux-pci, linux-kernel, ashok.raj, keith.busch,
	sathyanarayanan.kuppuswamy

From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

As per the PCI Firmware Specification r3.2 System Firmware Intermediary
(SFI) _OSC and DPC Updates ECR
(https://members.pcisig.com/wg/PCI-SIG/document/13563), section titled
"DPC Event Handling Implementation Note", page 10, the OS is responsible
for clearing the AER registers in EDR mode. So clear the AER registers
in dpc_process_error().
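
A minimal userspace model of what the ff_check parameter changes,
assuming a fake register struct in place of config space accesses. The
Uncorrectable Error Status register is RW1C, so "clearing" means writing
the set bits back. The struct, its firmware_first flag (a stand-in for
pcie_aer_get_firmware_first()), and the helper names are hypothetical.
Unlike the real pci_aer_clear_fatal_status(), which returns void, the
model returns an int so the firmware-first skip is observable.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for the AER capability registers of one device. */
struct fake_aer {
	uint32_t uncor_status;   /* Uncorrectable Error Status (RW1C) */
	uint32_t uncor_severity; /* Uncorrectable Error Severity: 1 = fatal */
	bool firmware_first;     /* stands in for pcie_aer_get_firmware_first() */
};

/* RW1C semantics: writing 1 clears a bit, writing 0 leaves it alone. */
static void rw1c_write(uint32_t *reg, uint32_t val)
{
	*reg &= ~val;
}

/* Shaped like pci_cleanup_aer_uncorrect_error_status(dev, ff_check). */
int clear_nonfatal_status(struct fake_aer *aer, bool ff_check)
{
	uint32_t status;

	/* Non-DPC callers keep the firmware-first bail-out. */
	if (ff_check && aer->firmware_first)
		return -EIO;

	/* Clear status bits for ERR_NONFATAL errors only */
	status = aer->uncor_status & ~aer->uncor_severity;
	if (status)
		rw1c_write(&aer->uncor_status, status);
	return 0;
}

/* Shaped like pci_aer_clear_fatal_status(dev, ff_check). */
int clear_fatal_status(struct fake_aer *aer, bool ff_check)
{
	uint32_t status;

	if (ff_check && aer->firmware_first)
		return -EIO;

	/* Clear status bits for ERR_FATAL errors only */
	status = aer->uncor_status & aer->uncor_severity;
	if (status)
		rw1c_write(&aer->uncor_status, status);
	return 0;
}
```

Passing ff_check = false, as the DPC path does, lets the OS clear the
status bits even when firmware-first mode is in effect; every other
caller passes 1 and keeps the old behavior.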

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Acked-by: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/pcie/dpc.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index db2e5cb635d7..588ae4f99781 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -284,6 +284,10 @@ static void dpc_process_error(struct dpc_dev *dpc)
 		pci_aer_clear_fatal_status(pdev, 0);
 	}
 
+	/* In EDR mode, OS is responsible for clearing AER registers */
+	if (pcie_aer_get_firmware_first(pdev))
+		pci_cleanup_aer_error_status_regs(pdev, 0);
+
 	/*
 	 * Irrespective of whether the DPC event is triggered by
 	 * ERR_FATAL or ERR_NONFATAL, since the link is already down,
-- 
2.21.0



* [PATCH v11 8/8] PCI/ACPI: Enable EDR support
  2019-12-27  0:39 [PATCH v11 0/8] Add Error Disconnect Recover (EDR) support sathyanarayanan.kuppuswamy
                   ` (6 preceding siblings ...)
  2019-12-27  0:39 ` [PATCH v11 7/8] PCI/DPC: Clear AER registers in EDR mode sathyanarayanan.kuppuswamy
@ 2019-12-27  0:39 ` sathyanarayanan.kuppuswamy
  7 siblings, 0 replies; 17+ messages in thread
From: sathyanarayanan.kuppuswamy @ 2019-12-27  0:39 UTC (permalink / raw)
  To: bhelgaas
  Cc: linux-pci, linux-kernel, ashok.raj, keith.busch,
	sathyanarayanan.kuppuswamy, Rafael J. Wysocki, Len Brown,
	Huong Nguyen, Austin Bolen

From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

As per the PCI Firmware Specification r3.2 Downstream Port Containment
Related Enhancements ECN, sec 4.5.1, the OS must implement the following
steps to enable and use the EDR feature.

1. The OS can use bit 7 of the _OSC Control Field to negotiate control
over the Downstream Port Containment (DPC) configuration of PCIe ports.
After _OSC negotiation, firmware will Set this bit to grant the OS
control over PCIe DPC configuration, and Clear it if this feature was
requested and denied, or was not requested.

2. If the OS supports EDR, it should advertise that support to firmware
by setting bit 7 of the _OSC Support Field. If the OS sets bit 7 of the
_OSC Control Field, it must also advertise EDR support by setting bit 7
of the _OSC Support Field.
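
The negotiation in steps 1 and 2, together with the native_dpc and
get_port_device_capability() changes in the diff, can be sketched in
userspace C. The OSC_* constants are copied from the include/linux/acpi.h
hunk; struct os_config and all function names are hypothetical stand-ins
for the Kconfig options and kernel code paths, and the firmware side is
reduced to "returns the granted Control bits".

```c
#include <stdbool.h>
#include <stdint.h>

/* Values from the include/linux/acpi.h hunk of this patch. */
#define OSC_PCI_EDR_SUPPORT         0x00000080u
#define OSC_PCI_SUPPORT_MASKS       0x0000019fu
#define OSC_PCI_EXPRESS_DPC_CONTROL 0x00000080u
#define OSC_PCI_CONTROL_MASKS       0x000000bfu

/* Hypothetical stand-ins for the Kconfig options used in the patch. */
struct os_config {
	bool pcie_edr; /* CONFIG_PCIE_EDR */
	bool pcie_dpc; /* CONFIG_PCIE_DPC */
};

/* Bit 7 of the _OSC Support Field: the OS advertises EDR support. */
uint32_t osc_support(const struct os_config *cfg, uint32_t support)
{
	if (cfg->pcie_edr)
		support |= OSC_PCI_EDR_SUPPORT;
	return support & OSC_PCI_SUPPORT_MASKS;
}

/* Bit 7 of the _OSC Control Field: the OS requests DPC control. */
uint32_t osc_control(const struct os_config *cfg, uint32_t control)
{
	if (cfg->pcie_dpc)
		control |= OSC_PCI_EXPRESS_DPC_CONTROL;
	return control & OSC_PCI_CONTROL_MASKS;
}

/* Step 2: requesting DPC control requires advertising EDR support. */
bool osc_request_valid(uint32_t support, uint32_t control)
{
	return !(control & OSC_PCI_EXPRESS_DPC_CONTROL) ||
	       (support & OSC_PCI_EDR_SUPPORT);
}

/* native_dpc stays set only if firmware granted bit 7. */
bool native_dpc(uint32_t granted)
{
	return (granted & OSC_PCI_EXPRESS_DPC_CONTROL) != 0;
}

/*
 * Condition from get_port_device_capability(): enable the DPC service
 * with EDR when firmware kept DPC control, or natively alongside AER.
 */
bool dpc_service_enabled(const struct os_config *cfg, bool has_dpc_cap,
			 bool native, bool aer_available,
			 bool dpc_native_param, bool aer_service)
{
	if (!has_dpc_cap)
		return false;
	return (cfg->pcie_edr && !native) ||
	       (aer_available && (dpc_native_param || aer_service));
}
```

osc_request_valid() encodes the rule from step 2 directly, so it can be
checked against whatever combination of Support and Control bits the OS
ends up requesting.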

Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Acked-by: Keith Busch <keith.busch@intel.com>
Tested-by: Huong Nguyen <huong.nguyen@dell.com>
Tested-by: Austin Bolen <Austin.Bolen@dell.com>
---
 drivers/acpi/pci_root.c         | 9 +++++++++
 drivers/pci/pcie/portdrv_core.c | 7 +++++--
 drivers/pci/probe.c             | 1 +
 include/linux/acpi.h            | 6 ++++--
 include/linux/pci.h             | 3 ++-
 5 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/drivers/acpi/pci_root.c b/drivers/acpi/pci_root.c
index d1e666ef3fcc..134e20474dfd 100644
--- a/drivers/acpi/pci_root.c
+++ b/drivers/acpi/pci_root.c
@@ -131,6 +131,7 @@ static struct pci_osc_bit_struct pci_osc_support_bit[] = {
 	{ OSC_PCI_CLOCK_PM_SUPPORT, "ClockPM" },
 	{ OSC_PCI_SEGMENT_GROUPS_SUPPORT, "Segments" },
 	{ OSC_PCI_MSI_SUPPORT, "MSI" },
+	{ OSC_PCI_EDR_SUPPORT, "EDR" },
 	{ OSC_PCI_HPX_TYPE_3_SUPPORT, "HPX-Type3" },
 };
 
@@ -141,6 +142,7 @@ static struct pci_osc_bit_struct pci_osc_control_bit[] = {
 	{ OSC_PCI_EXPRESS_AER_CONTROL, "AER" },
 	{ OSC_PCI_EXPRESS_CAPABILITY_CONTROL, "PCIeCapability" },
 	{ OSC_PCI_EXPRESS_LTR_CONTROL, "LTR" },
+	{ OSC_PCI_EXPRESS_DPC_CONTROL, "DPC" },
 };
 
 static void decode_osc_bits(struct acpi_pci_root *root, char *msg, u32 word,
@@ -440,6 +442,8 @@ static void negotiate_os_control(struct acpi_pci_root *root, int *no_aspm,
 		support |= OSC_PCI_ASPM_SUPPORT | OSC_PCI_CLOCK_PM_SUPPORT;
 	if (pci_msi_enabled())
 		support |= OSC_PCI_MSI_SUPPORT;
+	if (IS_ENABLED(CONFIG_PCIE_EDR))
+		support |= OSC_PCI_EDR_SUPPORT;
 
 	decode_osc_support(root, "OS supports", support);
 	status = acpi_pci_osc_support(root, support);
@@ -487,6 +491,9 @@ static void negotiate_os_control(struct acpi_pci_root *root, int *no_aspm,
 			control |= OSC_PCI_EXPRESS_AER_CONTROL;
 	}
 
+	if (IS_ENABLED(CONFIG_PCIE_DPC))
+		control |= OSC_PCI_EXPRESS_DPC_CONTROL;
+
 	requested = control;
 	status = acpi_pci_osc_control_set(handle, &control,
 					  OSC_PCI_EXPRESS_CAPABILITY_CONTROL);
@@ -916,6 +923,8 @@ struct pci_bus *acpi_pci_root_create(struct acpi_pci_root *root,
 		host_bridge->native_pme = 0;
 	if (!(root->osc_control_set & OSC_PCI_EXPRESS_LTR_CONTROL))
 		host_bridge->native_ltr = 0;
+	if (!(root->osc_control_set & OSC_PCI_EXPRESS_DPC_CONTROL))
+		host_bridge->native_dpc = 0;
 
 	/*
 	 * Evaluate the "PCI Boot Configuration" _DSM Function.  If it
diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
index 5075cb9e850c..009742c865d6 100644
--- a/drivers/pci/pcie/portdrv_core.c
+++ b/drivers/pci/pcie/portdrv_core.c
@@ -253,10 +253,13 @@ static int get_port_device_capability(struct pci_dev *dev)
 	/*
 	 * With dpc-native, allow Linux to use DPC even if it doesn't have
 	 * permission to use AER.
+	 * If EDR is supported and firmware retains control of DPC, the DPC
+	 * service can be enabled even if the OS does not handle AER.
 	 */
 	if (pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DPC) &&
-	    pci_aer_available() &&
-	    (pcie_ports_dpc_native || (services & PCIE_PORT_SERVICE_AER)))
+	    ((IS_ENABLED(CONFIG_PCIE_EDR) && !host->native_dpc) ||
+	    (pci_aer_available() &&
+	    (pcie_ports_dpc_native || (services & PCIE_PORT_SERVICE_AER)))))
 		services |= PCIE_PORT_SERVICE_DPC;
 
 	if (pci_pcie_type(dev) == PCI_EXP_TYPE_DOWNSTREAM ||
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 512cb4312ddd..c9a9c5b42e72 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -598,6 +598,7 @@ static void pci_init_host_bridge(struct pci_host_bridge *bridge)
 	bridge->native_shpc_hotplug = 1;
 	bridge->native_pme = 1;
 	bridge->native_ltr = 1;
+	bridge->native_dpc = 1;
 }
 
 struct pci_host_bridge *pci_alloc_host_bridge(size_t priv)
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 0f37a7d5fa77..0a7aaa452a98 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -515,8 +515,9 @@ extern bool osc_pc_lpi_support_confirmed;
 #define OSC_PCI_CLOCK_PM_SUPPORT		0x00000004
 #define OSC_PCI_SEGMENT_GROUPS_SUPPORT		0x00000008
 #define OSC_PCI_MSI_SUPPORT			0x00000010
+#define OSC_PCI_EDR_SUPPORT			0x00000080
 #define OSC_PCI_HPX_TYPE_3_SUPPORT		0x00000100
-#define OSC_PCI_SUPPORT_MASKS			0x0000011f
+#define OSC_PCI_SUPPORT_MASKS			0x0000019f
 
 /* PCI Host Bridge _OSC: Capabilities DWORD 3: Control Field */
 #define OSC_PCI_EXPRESS_NATIVE_HP_CONTROL	0x00000001
@@ -525,7 +526,8 @@ extern bool osc_pc_lpi_support_confirmed;
 #define OSC_PCI_EXPRESS_AER_CONTROL		0x00000008
 #define OSC_PCI_EXPRESS_CAPABILITY_CONTROL	0x00000010
 #define OSC_PCI_EXPRESS_LTR_CONTROL		0x00000020
-#define OSC_PCI_CONTROL_MASKS			0x0000003f
+#define OSC_PCI_EXPRESS_DPC_CONTROL		0x00000080
+#define OSC_PCI_CONTROL_MASKS			0x000000bf
 
 #define ACPI_GSB_ACCESS_ATTRIB_QUICK		0x00000002
 #define ACPI_GSB_ACCESS_ATTRIB_SEND_RCV         0x00000004
diff --git a/include/linux/pci.h b/include/linux/pci.h
index c393dff2d66f..0b7c63c7888d 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -510,8 +510,9 @@ struct pci_host_bridge {
 	unsigned int	native_shpc_hotplug:1;	/* OS may use SHPC hotplug */
 	unsigned int	native_pme:1;		/* OS may use PCIe PME */
 	unsigned int	native_ltr:1;		/* OS may use PCIe LTR */
-	unsigned int	preserve_config:1;	/* Preserve FW resource setup */
+	unsigned int	native_dpc:1;		/* OS may use PCIe DPC */
 
+	unsigned int	preserve_config:1;	/* Preserve FW resource setup */
 	/* Resource alignment requirements */
 	resource_size_t (*align_resource)(struct pci_dev *dev,
 			const struct resource *res,
-- 
2.21.0



* Re: [PATCH v11 5/8] PCI/AER: Allow clearing Error Status Register in FF mode
  2019-12-27  0:39 ` [PATCH v11 5/8] PCI/AER: Allow clearing Error Status Register in FF mode sathyanarayanan.kuppuswamy
@ 2019-12-30 23:59   ` Bjorn Helgaas
  2019-12-31 18:11     ` Kuppuswamy Sathyanarayanan
  0 siblings, 1 reply; 17+ messages in thread
From: Bjorn Helgaas @ 2019-12-30 23:59 UTC (permalink / raw)
  To: sathyanarayanan.kuppuswamy
  Cc: linux-pci, linux-kernel, ashok.raj, keith.busch, Austin.Bolen

[+cc Austin]

On Thu, Dec 26, 2019 at 04:39:11PM -0800, sathyanarayanan.kuppuswamy@linux.intel.com wrote:
> From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> 
> As per PCI firmware specification r3.2 System Firmware Intermediary
> (SFI) _OSC and DPC Updates ECR
> (https://members.pcisig.com/wg/PCI-SIG/document/13563),

What is the state of this ECR?  I see it in the "PCI Express Review
Zone Archive".  I don't know what the usage is of the "Review Zone" vs
the "Review Zone Archive / PCI Express Review Zone Archive".  AFAICS,
it is not listed in any of the "Documents for 60 Day Member Review".

And I think it needs some clarification (for one thing, it needs to
say what the red/blue text means).  I've mentioned some other items to
Austin, but I haven't read it in detail because it seems like it's not
quite baked yet.

E.g., there's language about "it may make sense for an embedded system
OS to own SFI, but it's recommended that general-purpose OSes never
request SFI ownership."  That's useless: Linux is certainly a general
purpose OS, but Linux is also often an embedded OS.  So the ECR
doesn't provide useful guidance about how an OS should decide whether
to request SFI ownership.

Making code changes based on a published spec or ECN is fine,
obviously.  Changes based on an ECR that is well on track to being
accepted, e.g., is in the 60-day review period, are probably OK.  I
don't yet have warm fuzzies about this ECR because I have no idea how
far along it is.

We might be able to justify some of these changes based on other
specs; it just sounds weird to me to say "based on this Engineering
Change Request that might be accepted someday, we must do X".  Anybody
can dream up an ECR that says anything at all, so AFAICT, an ECR is
not at all authoritative.

> sec titled
> "DPC Event Handling Implementation Note", page 10, Error Disconnect
> Recover (EDR) support allows OS to handle error recovery and clearing
> Error Registers even in FF mode. So create exception for FF mode checks
> in pci_cleanup_aer_uncorrect_error_status(), pci_aer_clear_fatal_status()
> and pci_cleanup_aer_error_status_regs() functions when its being called
> from DPC code path.


* Re: [PATCH v11 5/8] PCI/AER: Allow clearing Error Status Register in FF mode
  2019-12-30 23:59   ` Bjorn Helgaas
@ 2019-12-31 18:11     ` Kuppuswamy Sathyanarayanan
  0 siblings, 0 replies; 17+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2019-12-31 18:11 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, linux-kernel, ashok.raj, keith.busch, Austin.Bolen

Hi Bjorn,

On 12/30/19 3:59 PM, Bjorn Helgaas wrote:
> [+cc Austin]
>
> On Thu, Dec 26, 2019 at 04:39:11PM -0800, sathyanarayanan.kuppuswamy@linux.intel.com wrote:
>> From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>>
>> As per PCI firmware specification r3.2 System Firmware Intermediary
>> (SFI) _OSC and DPC Updates ECR
>> (https://members.pcisig.com/wg/PCI-SIG/document/13563),
> What is the state of this ECR?  I see it in the "PCI Express Review
> Zone Archive".  I don't know what the usage is of the "Review Zone" vs
> the "Review Zone Archive / PCI Express Review Zone Archive".  AFAICS,
> it is not listed in any of the "Documents for 60 Day Member Review".
>
> And I think it needs some clarification (for one thing, it needs to
> say what the red/blue text means).  I've mentioned some other items to
> Austin, but I haven't read it in detail because it seems like it's not
> quite baked yet.
>
> E.g., there's language about "it may make sense for an embedded system
> OS to own SFI, but it's recommended that general-purpose OSes never
> request SFI ownership."  That's useless: Linux is certainly a general
> purpose OS, but Linux is also often an embedded OS.  So the ECR
> doesn't provide useful guidance about how an OS should decide whether
> to request SFI ownership.
This ECR has merged three different change proposals (SFI related,
_OSC related updates and update to implementation note of DPC
handling with EDR support) into a single document.  Out of these
three changes, we only care about "DPC implementation note update".

We already have an ECR specification for Error Disconnect Recover (EDR)
support (https://members.pcisig.com/wg/PCI-SIG/document/12888) in the
published spec section. But that document has some ambiguous statements
and missing details, which are clarified in the implementation note
section of the ECR mentioned above.
>
> Making code changes based on a published spec or ECN is fine,
> obviously.  Changes based on an ECR that is well on track to being
> accepted, e.g., is in the 60-day review period, are probably OK.  I
> don't yet have warm fuzzies about this ECR because I have no idea how
> far along it is.
>
> We might be able to justify some of these changes based on other
> specs; it just sounds weird to me to say "based on this Engineering
> Change Request that might be accepted someday, we must do X".  Anybody
> can dream up an ECR that says anything at all, so AFAICT, an ECR is
> not at all authoritative.
>
>> sec titled
>> "DPC Event Handling Implementation Note", page 10, Error Disconnect
>> Recover (EDR) support allows OS to handle error recovery and clearing
>> Error Registers even in FF mode. So create exception for FF mode checks
>> in pci_cleanup_aer_uncorrect_error_status(), pci_aer_clear_fatal_status()
>> and pci_cleanup_aer_error_status_regs() functions when its being called
>> from DPC code path.

-- 
Sathyanarayanan Kuppuswamy
Linux kernel developer



* Re: [PATCH v11 1/8] PCI/ERR: Update error status after reset_link()
  2019-12-27  0:39 ` [PATCH v11 1/8] PCI/ERR: Update error status after reset_link() sathyanarayanan.kuppuswamy
@ 2020-01-04  0:34   ` Bjorn Helgaas
  2020-01-04  1:03     ` Kuppuswamy Sathyanarayanan
  0 siblings, 1 reply; 17+ messages in thread
From: Bjorn Helgaas @ 2020-01-04  0:34 UTC (permalink / raw)
  To: sathyanarayanan.kuppuswamy
  Cc: linux-pci, linux-kernel, ashok.raj, keith.busch

On Thu, Dec 26, 2019 at 04:39:07PM -0800, sathyanarayanan.kuppuswamy@linux.intel.com wrote:
> From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> 
> Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery") uses
> reset_link() to recover from fatal errors. But, if the reset is
> successful there is no need to continue the rest of the error recovery
> checks. Also, during fatal error recovery, if the initial value of error
> status is PCI_ERS_RESULT_DISCONNECT or PCI_ERS_RESULT_NO_AER_DRIVER then
> even after successful recovery (using reset_link()) pcie_do_recovery()
> will report the recovery result as failure. So update the status of
> error after reset_link().

I like the part about updating "status" with the result of
reset_link(), and I split that into its own patch because it
seems like a fix that *can* be separated.

But I'm not convinced that we should skip the ->slot_reset()
callbacks if the reset_link() was successful.  According to
Documentation/PCI/pci-error-recovery.rst, we should call
->slot_reset() after completion of the reset.

For example, rsxx_err_handler implements ->slot_reset(), but
not ->resume().  If we reset the device, we'll claim success and
return, but we won't call rsxx_slot_reset(), which does a bunch
of important-looking recovery stuff.

If pci-error-recovery.rst is wrong, we should fix that (after
auditing all the drivers to make sure they match).

> Fixes: bdb5ac85777d ("PCI/ERR: Handle fatal error recovery")
> Cc: Ashok Raj <ashok.raj@intel.com>
> Cc: Keith Busch <keith.busch@intel.com>
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Acked-by: Keith Busch <keith.busch@intel.com>
> ---
>  drivers/pci/pcie/err.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> index b0e6048a9208..53cd9200ec2c 100644
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
> @@ -204,9 +204,12 @@ void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
>  	else
>  		pci_walk_bus(bus, report_normal_detected, &status);
>  
> -	if (state == pci_channel_io_frozen &&
> -	    reset_link(dev, service) != PCI_ERS_RESULT_RECOVERED)
> -		goto failed;
> +	if (state == pci_channel_io_frozen) {
> +		status = reset_link(dev, service);
> +		if (status != PCI_ERS_RESULT_RECOVERED)
> +			goto failed;
> +		goto done;
> +	}
>  
>  	if (status == PCI_ERS_RESULT_CAN_RECOVER) {
>  		status = PCI_ERS_RESULT_RECOVERED;
> @@ -228,6 +231,7 @@ void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
>  	if (status != PCI_ERS_RESULT_RECOVERED)
>  		goto failed;
>  
> +done:
>  	pci_dbg(dev, "broadcast resume message\n");
>  	pci_walk_bus(bus, report_resume, &status);
>  
> -- 
> 2.21.0
> 


* Re: [PATCH v11 1/8] PCI/ERR: Update error status after reset_link()
  2020-01-04  0:34   ` Bjorn Helgaas
@ 2020-01-04  1:03     ` Kuppuswamy Sathyanarayanan
  2020-01-04  2:54       ` Bjorn Helgaas
  0 siblings, 1 reply; 17+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2020-01-04  1:03 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: linux-pci, linux-kernel, ashok.raj, keith.busch


On 1/3/20 4:34 PM, Bjorn Helgaas wrote:
> On Thu, Dec 26, 2019 at 04:39:07PM -0800, sathyanarayanan.kuppuswamy@linux.intel.com wrote:
>> From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>>
>> Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery") uses
>> reset_link() to recover from fatal errors. But, if the reset is
>> successful there is no need to continue the rest of the error recovery
>> checks. Also, during fatal error recovery, if the initial value of error
>> status is PCI_ERS_RESULT_DISCONNECT or PCI_ERS_RESULT_NO_AER_DRIVER then
>> even after successful recovery (using reset_link()) pcie_do_recovery()
>> will report the recovery result as failure. So update the status of
>> error after reset_link().
> I like the part about updating "status" with the result of
> reset_link(), and I split that into its own patch because it
> seems like a fix that *can* be separated.
>
> But I'm not convinced that we should skip the ->slot_reset()
> callbacks if the reset_link() was successful.
If the reset_link() call succeeds, the result value will be
"PCI_ERS_RESULT_RECOVERED". So even if we proceed with the rest of
the code, ->slot_reset() will never get called, right?
>    According to
> Documentation/PCI/pci-error-recovery.rst, we should call
> ->slot_reset() after completion of the reset.
>
> For example, rsxx_err_handler implements ->slot_reset(), but
> not ->resume().  If we reset the device, we'll claim success and
> return, but we won't call rsxx_slot_reset(), which does a bunch
> of important-looking recovery stuff.
>
> If pci-error-recovery.rst is wrong, we should fix that (after
> auditing all the drivers to make sure they match).
>
>> Fixes: bdb5ac85777d ("PCI/ERR: Handle fatal error recovery")
>> Cc: Ashok Raj <ashok.raj@intel.com>
>> Cc: Keith Busch <keith.busch@intel.com>
>> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>> Acked-by: Keith Busch <keith.busch@intel.com>
>> ---
>>   drivers/pci/pcie/err.c | 10 +++++++---
>>   1 file changed, 7 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
>> index b0e6048a9208..53cd9200ec2c 100644
>> --- a/drivers/pci/pcie/err.c
>> +++ b/drivers/pci/pcie/err.c
>> @@ -204,9 +204,12 @@ void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
>>   	else
>>   		pci_walk_bus(bus, report_normal_detected, &status);
>>   
>> -	if (state == pci_channel_io_frozen &&
>> -	    reset_link(dev, service) != PCI_ERS_RESULT_RECOVERED)
>> -		goto failed;
>> +	if (state == pci_channel_io_frozen) {
>> +		status = reset_link(dev, service);
>> +		if (status != PCI_ERS_RESULT_RECOVERED)
>> +			goto failed;
>> +		goto done;
>> +	}
>>   
>>   	if (status == PCI_ERS_RESULT_CAN_RECOVER) {
>>   		status = PCI_ERS_RESULT_RECOVERED;
>> @@ -228,6 +231,7 @@ void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
>>   	if (status != PCI_ERS_RESULT_RECOVERED)
>>   		goto failed;
>>   
>> +done:
>>   	pci_dbg(dev, "broadcast resume message\n");
>>   	pci_walk_bus(bus, report_resume, &status);
>>   
>> -- 
>> 2.21.0
>>
-- 
Sathyanarayanan Kuppuswamy
Linux kernel developer



* Re: [PATCH v11 1/8] PCI/ERR: Update error status after reset_link()
  2020-01-04  1:03     ` Kuppuswamy Sathyanarayanan
@ 2020-01-04  2:54       ` Bjorn Helgaas
  2020-01-09  0:14         ` Kuppuswamy Sathyanarayanan
  0 siblings, 1 reply; 17+ messages in thread
From: Bjorn Helgaas @ 2020-01-04  2:54 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: linux-pci, linux-kernel, ashok.raj, keith.busch

On Fri, Jan 03, 2020 at 05:03:03PM -0800, Kuppuswamy Sathyanarayanan wrote:
> On 1/3/20 4:34 PM, Bjorn Helgaas wrote:
> > On Thu, Dec 26, 2019 at 04:39:07PM -0800, sathyanarayanan.kuppuswamy@linux.intel.com wrote:
> > > From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> > > 
> > > Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery") uses
> > > reset_link() to recover from fatal errors. But, if the reset is
> > > successful there is no need to continue the rest of the error recovery
> > > checks. Also, during fatal error recovery, if the initial value of error
> > > status is PCI_ERS_RESULT_DISCONNECT or PCI_ERS_RESULT_NO_AER_DRIVER then
> > > even after successful recovery (using reset_link()) pcie_do_recovery()
> > > will report the recovery result as failure. So update the status of
> > > error after reset_link().
> > I like the part about updating "status" with the result of
> > reset_link(), and I split that into its own patch because it
> > seems like a fix that *can* be separated.
> > 
> > But I'm not convinced that we should skip the ->slot_reset()
> > callbacks if the reset_link() was successful.
>
> If reset_link() call is successful then the result value will be
> "PCI_ERS_RESULT_RECOVERED". So even if you proceed with
> rest of the code, slot_reset() will never get called right ?

The current code:

        if (state == pci_channel_io_frozen &&
            reset_link(dev, service) != PCI_ERS_RESULT_RECOVERED)
                goto failed;
        ...
        if (status == PCI_ERS_RESULT_NEED_RESET) {
                status = PCI_ERS_RESULT_RECOVERED;
                pci_walk_bus(bus, report_slot_reset, &status);

doesn't save the result of reset_link(), so if status was
PCI_ERS_RESULT_NEED_RESET and the reset succeeds, we will call
->slot_reset().

After your patch, if "state == pci_channel_io_frozen", we *never* call
->slot_reset().

Do you think that matches pci-error-recovery.rst?  It doesn't seem
like it to me, but perhaps I haven't read it closely enough.
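
A simplified userspace model of the two control flows may make the
difference concrete. It assumes ->error_detected() has already voted and
that no ->mmio_enabled() callback changes that vote; the enum and
function names are local to this sketch, not the kernel's.

```c
#include <stdbool.h>

/* Local stand-ins for the enum pci_ers_result values involved. */
enum ers_result {
	ERS_CAN_RECOVER,
	ERS_NEED_RESET,
	ERS_DISCONNECT,
	ERS_RECOVERED,
};

/* Current code: the reset_link() result is not stored in "status". */
bool old_calls_slot_reset(bool frozen, enum ers_result status,
			  enum ers_result reset_result)
{
	if (frozen && reset_result != ERS_RECOVERED)
		return false; /* goto failed */
	/* status still holds the ->error_detected() vote */
	return status == ERS_NEED_RESET;
}

/* Patched code: a frozen channel always leaves via failed or done. */
bool new_calls_slot_reset(bool frozen, enum ers_result status,
			  enum ers_result reset_result)
{
	(void)reset_result; /* only decides between "failed" and "done" */
	if (frozen)
		return false; /* status = reset_link(); goto failed / done */
	return status == ERS_NEED_RESET;
}
```

With a frozen channel, a NEED_RESET vote, and a successful reset, the old
flow still reaches ->slot_reset() while the patched flow does not; for a
non-frozen channel the two agree.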

> > According to
> > Documentation/PCI/pci-error-recovery.rst, we should call
> > ->slot_reset() after completion of the reset.
> > 
> > For example, rsxx_err_handler implements ->slot_reset(), but
> > not ->resume().  If we reset the device, we'll claim success and
> > return, but we won't call rsxx_slot_reset(), which does a bunch
> > of important-looking recovery stuff.
> > 
> > If pci-error-recovery.rst is wrong, we should fix that (after
> > auditing all the drivers to make sure they match).
> > 
> > > Fixes: bdb5ac85777d ("PCI/ERR: Handle fatal error recovery")
> > > Cc: Ashok Raj <ashok.raj@intel.com>
> > > Cc: Keith Busch <keith.busch@intel.com>
> > > Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> > > Acked-by: Keith Busch <keith.busch@intel.com>
> > > ---
> > >   drivers/pci/pcie/err.c | 10 +++++++---
> > >   1 file changed, 7 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> > > index b0e6048a9208..53cd9200ec2c 100644
> > > --- a/drivers/pci/pcie/err.c
> > > +++ b/drivers/pci/pcie/err.c
> > > @@ -204,9 +204,12 @@ void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
> > >   	else
> > >   		pci_walk_bus(bus, report_normal_detected, &status);
> > > -	if (state == pci_channel_io_frozen &&
> > > -	    reset_link(dev, service) != PCI_ERS_RESULT_RECOVERED)
> > > -		goto failed;
> > > +	if (state == pci_channel_io_frozen) {
> > > +		status = reset_link(dev, service);
> > > +		if (status != PCI_ERS_RESULT_RECOVERED)
> > > +			goto failed;
> > > +		goto done;
> > > +	}
> > >   	if (status == PCI_ERS_RESULT_CAN_RECOVER) {
> > >   		status = PCI_ERS_RESULT_RECOVERED;
> > > @@ -228,6 +231,7 @@ void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
> > >   	if (status != PCI_ERS_RESULT_RECOVERED)
> > >   		goto failed;
> > > +done:
> > >   	pci_dbg(dev, "broadcast resume message\n");
> > >   	pci_walk_bus(bus, report_resume, &status);
> > > -- 
> > > 2.21.0
> > > 
> -- 
> Sathyanarayanan Kuppuswamy
> Linux kernel developer
> 


* Re: [PATCH v11 1/8] PCI/ERR: Update error status after reset_link()
  2020-01-04  2:54       ` Bjorn Helgaas
@ 2020-01-09  0:14         ` Kuppuswamy Sathyanarayanan
  2020-01-09 23:26           ` Bjorn Helgaas
  0 siblings, 1 reply; 17+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2020-01-09  0:14 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: linux-pci, linux-kernel, ashok.raj, keith.busch

Hi Bjorn,

Thanks for the comments.

On 1/3/20 6:54 PM, Bjorn Helgaas wrote:
> On Fri, Jan 03, 2020 at 05:03:03PM -0800, Kuppuswamy Sathyanarayanan wrote:
>> On 1/3/20 4:34 PM, Bjorn Helgaas wrote:
>>> On Thu, Dec 26, 2019 at 04:39:07PM -0800, sathyanarayanan.kuppuswamy@linux.intel.com wrote:
>>>> From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>>>>
>>>> Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery") uses
>>>> reset_link() to recover from fatal errors. But if the reset is
>>>> successful, there is no need to continue with the rest of the error
>>>> recovery checks. Also, during fatal error recovery, if the initial error
>>>> status is PCI_ERS_RESULT_DISCONNECT or PCI_ERS_RESULT_NO_AER_DRIVER, then
>>>> even after a successful recovery (using reset_link()), pcie_do_recovery()
>>>> will report the recovery result as a failure. So update the error status
>>>> after reset_link().
>>> I like the part about updating "status" with the result of
>>> reset_link(), and I split that into its own patch because it
>>> seems like a fix that *can* be separated.
>>>
>>> But I'm not convinced that we should skip the ->slot_reset()
>>> callbacks if the reset_link() was successful.
>> If the reset_link() call is successful, then the result value will be
>> "PCI_ERS_RESULT_RECOVERED". So even if you proceed with the
>> rest of the code, slot_reset() will never get called, right?
> The current code:
>
>          if (state == pci_channel_io_frozen &&
>              reset_link(dev, service) != PCI_ERS_RESULT_RECOVERED)
>                  goto failed;
>          ...
>          if (status == PCI_ERS_RESULT_NEED_RESET) {
>                  status = PCI_ERS_RESULT_RECOVERED;
>                  pci_walk_bus(bus, report_slot_reset, &status);
>
> doesn't save the result of reset_link(), so if status was
> PCI_ERS_RESULT_NEED_RESET and the reset succeeds, we will call
> ->slot_reset().
>
> After your patch, if "state == pci_channel_io_frozen", we *never* call
> ->slot_reset().
>
> Do you think that matches pci-error-recovery.rst?  It doesn't seem
> like it to me, but perhaps I haven't read it closely enough.

The documentation does not clearly say what to do with the return
value of reset_link() (step 3). But IMO, if step 3 recovers the device and
returns PCI_ERS_RESULT_RECOVERED, then there is no need to proceed
to slot reset (step 4). Maybe we should update the documentation?

Keith,
Do you have any comments?
>
>>> According to
>>> Documentation/PCI/pci-error-recovery.rst, we should call
>>> ->slot_reset() after completion of the reset.
>>>
>>> For example, rsxx_err_handler implements ->slot_reset(), but
>>> not ->resume().  If we reset the device, we'll claim success and
>>> return, but we won't call rsxx_slot_reset(), which does a bunch
>>> of important-looking recovery stuff.
>>>
>>> If pci-error-recovery.rst is wrong, we should fix that (after
>>> auditing all the drivers to make sure they match).
>>>
-- 
Sathyanarayanan Kuppuswamy
Linux kernel developer



* Re: [PATCH v11 1/8] PCI/ERR: Update error status after reset_link()
  2020-01-09  0:14         ` Kuppuswamy Sathyanarayanan
@ 2020-01-09 23:26           ` Bjorn Helgaas
  2020-01-10  1:08             ` Kuppuswamy Sathyanarayanan
  0 siblings, 1 reply; 17+ messages in thread
From: Bjorn Helgaas @ 2020-01-09 23:26 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan
  Cc: linux-pci, linux-kernel, ashok.raj, keith.busch

On Wed, Jan 08, 2020 at 04:14:09PM -0800, Kuppuswamy Sathyanarayanan wrote:
> On 1/3/20 6:54 PM, Bjorn Helgaas wrote:
> > On Fri, Jan 03, 2020 at 05:03:03PM -0800, Kuppuswamy Sathyanarayanan wrote:
> > > On 1/3/20 4:34 PM, Bjorn Helgaas wrote:
> > > > On Thu, Dec 26, 2019 at 04:39:07PM -0800, sathyanarayanan.kuppuswamy@linux.intel.com wrote:
> > > > > From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> > > > > 
> > > > > Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery") uses
> > > > > reset_link() to recover from fatal errors. But if the reset is
> > > > > successful, there is no need to continue with the rest of the error
> > > > > recovery checks. Also, during fatal error recovery, if the initial error
> > > > > status is PCI_ERS_RESULT_DISCONNECT or PCI_ERS_RESULT_NO_AER_DRIVER, then
> > > > > even after a successful recovery (using reset_link()), pcie_do_recovery()
> > > > > will report the recovery result as a failure. So update the error status
> > > > > after reset_link().
> > > > I like the part about updating "status" with the result of
> > > > reset_link(), and I split that into its own patch because it
> > > > seems like a fix that *can* be separated.
> > > > 
> > > > But I'm not convinced that we should skip the ->slot_reset()
> > > > callbacks if the reset_link() was successful.
> > > If the reset_link() call is successful, then the result value will be
> > > "PCI_ERS_RESULT_RECOVERED". So even if you proceed with the
> > > rest of the code, slot_reset() will never get called, right?
> > The current code:
> > 
> >          if (state == pci_channel_io_frozen &&
> >              reset_link(dev, service) != PCI_ERS_RESULT_RECOVERED)
> >                  goto failed;
> >          ...
> >          if (status == PCI_ERS_RESULT_NEED_RESET) {
> >                  status = PCI_ERS_RESULT_RECOVERED;
> >                  pci_walk_bus(bus, report_slot_reset, &status);
> > 
> > doesn't save the result of reset_link(), so if status was
> > PCI_ERS_RESULT_NEED_RESET and the reset succeeds, we will call
> > ->slot_reset().
> > 
> > After your patch, if "state == pci_channel_io_frozen", we *never* call
> > ->slot_reset().
> > 
> > Do you think that matches pci-error-recovery.rst?  It doesn't seem
> > like it to me, but perhaps I haven't read it closely enough.
> The documentation does not clearly say what to do with the return
> value of reset_link() (step 3). But IMO, if step 3 recovers the device and
> returns PCI_ERS_RESULT_RECOVERED, then there is no need to proceed
> to slot reset (step 4). Maybe we should update the documentation?

Are you suggesting we don't need to call a driver callback after
resetting the device?  Note that the ->slot_reset() doesn't *perform*
a reset; it is called *after* completion of a reset.

The doc says:

  ... Upon completion of slot reset, the platform will call the device
  slot_reset() callback.
  ...
  This call gives drivers the chance to re-initialize the hardware
  (re-download firmware, etc.).  At this point, the driver may assume
  that the card is in a fresh state and is fully functional. The slot
  is unfrozen and the driver has full access to PCI config space,
  memory mapped I/O space and DMA. Interrupts (Legacy, MSI, or MSI-X)
  will also be available.

After we reset a device, the driver certainly needs a chance to
reinitialize it.
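For illustration only, here is a rough user-space model of one flow that follows the documented order: keep the outcome of reset_link() (the part of the patch worth keeping), but still broadcast ->slot_reset() before ->resume(). This is neither the kernel's code nor a proposal for the next version; enum callback, struct trace, record() and recovery_sketch() are hypothetical names, and each bus walk is reduced to a single merged vote.

```c
#include <stdbool.h>

/*
 * Hypothetical user-space sketch, not kernel code: ERS_* stand in for
 * PCI_ERS_RESULT_*, and each pci_walk_bus() broadcast is reduced to
 * appending a callback to a trace plus one merged driver vote.
 */
enum ers_result {
	ERS_CAN_RECOVER,
	ERS_NEED_RESET,
	ERS_DISCONNECT,
	ERS_RECOVERED,
};

enum callback {
	CB_ERROR_DETECTED,
	CB_MMIO_ENABLED,
	CB_SLOT_RESET,
	CB_RESUME,
};

/* Order in which the sketch broadcast callbacks, and the final verdict. */
struct trace {
	enum callback calls[4];
	int n;
	bool recovered;
};

static void record(struct trace *t, enum callback cb)
{
	t->calls[t->n++] = cb;
}

/*
 * frozen:      channel state is pci_channel_io_frozen
 * status:      merged ->error_detected() vote
 * link_result: value returned by reset_link()
 * slot_vote:   merged ->slot_reset() vote
 */
struct trace recovery_sketch(bool frozen, enum ers_result status,
			     enum ers_result link_result,
			     enum ers_result slot_vote)
{
	struct trace t = { .n = 0, .recovered = false };

	record(&t, CB_ERROR_DETECTED);
	if (frozen) {
		if (link_result != ERS_RECOVERED)
			return t;		/* failed */
		/*
		 * The link was reset, so an earlier DISCONNECT or
		 * NO_AER_DRIVER vote no longer decides the outcome, but
		 * drivers still need ->slot_reset() to reinitialize.
		 */
		status = ERS_NEED_RESET;
	}
	if (status == ERS_CAN_RECOVER) {
		record(&t, CB_MMIO_ENABLED);
		status = ERS_RECOVERED;		/* mmio_enabled votes elided */
	}
	if (status == ERS_NEED_RESET) {
		record(&t, CB_SLOT_RESET);
		status = slot_vote;
	}
	if (status != ERS_RECOVERED)
		return t;			/* failed */
	record(&t, CB_RESUME);
	t.recovered = true;
	return t;
}
```

Forcing status to NEED_RESET after a successful reset_link() is one way to keep both halves of the discussion: a successful reset overrides a stale failure vote, as the commit message wants, and drivers such as rsxx that implement ->slot_reset() but not ->resume() still get their reinitialization call. Whether a reset should override a driver's DISCONNECT vote is a policy question this sketch does not settle.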

> > > > According to
> > > > Documentation/PCI/pci-error-recovery.rst, we should call
> > > > ->slot_reset() after completion of the reset.
> > > > 
> > > > For example, rsxx_err_handler implements ->slot_reset(), but
> > > > not ->resume().  If we reset the device, we'll claim success and
> > > > return, but we won't call rsxx_slot_reset(), which does a bunch
> > > > of important-looking recovery stuff.
> > > > 
> > > > If pci-error-recovery.rst is wrong, we should fix that (after
> > > > auditing all the drivers to make sure they match).
> > > > 


* Re: [PATCH v11 1/8] PCI/ERR: Update error status after reset_link()
  2020-01-09 23:26           ` Bjorn Helgaas
@ 2020-01-10  1:08             ` Kuppuswamy Sathyanarayanan
  0 siblings, 0 replies; 17+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2020-01-10  1:08 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: linux-pci, linux-kernel, ashok.raj, keith.busch


On 1/9/20 3:26 PM, Bjorn Helgaas wrote:
> On Wed, Jan 08, 2020 at 04:14:09PM -0800, Kuppuswamy Sathyanarayanan wrote:
>> On 1/3/20 6:54 PM, Bjorn Helgaas wrote:
>>> On Fri, Jan 03, 2020 at 05:03:03PM -0800, Kuppuswamy Sathyanarayanan wrote:
>>>> On 1/3/20 4:34 PM, Bjorn Helgaas wrote:
>>>>> On Thu, Dec 26, 2019 at 04:39:07PM -0800, sathyanarayanan.kuppuswamy@linux.intel.com wrote:
>>>>>> From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>>>>>>
>>>>>> Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery") uses
>>>>>> reset_link() to recover from fatal errors. But if the reset is
>>>>>> successful, there is no need to continue with the rest of the error
>>>>>> recovery checks. Also, during fatal error recovery, if the initial error
>>>>>> status is PCI_ERS_RESULT_DISCONNECT or PCI_ERS_RESULT_NO_AER_DRIVER, then
>>>>>> even after a successful recovery (using reset_link()), pcie_do_recovery()
>>>>>> will report the recovery result as a failure. So update the error status
>>>>>> after reset_link().
>>>>> I like the part about updating "status" with the result of
>>>>> reset_link(), and I split that into its own patch because it
>>>>> seems like a fix that *can* be separated.
>>>>>
>>>>> But I'm not convinced that we should skip the ->slot_reset()
>>>>> callbacks if the reset_link() was successful.
>>>> If the reset_link() call is successful, then the result value will be
>>>> "PCI_ERS_RESULT_RECOVERED". So even if you proceed with the
>>>> rest of the code, slot_reset() will never get called, right?
>>> The current code:
>>>
>>>           if (state == pci_channel_io_frozen &&
>>>               reset_link(dev, service) != PCI_ERS_RESULT_RECOVERED)
>>>                   goto failed;
>>>           ...
>>>           if (status == PCI_ERS_RESULT_NEED_RESET) {
>>>                   status = PCI_ERS_RESULT_RECOVERED;
>>>                   pci_walk_bus(bus, report_slot_reset, &status);
>>>
>>> doesn't save the result of reset_link(), so if status was
>>> PCI_ERS_RESULT_NEED_RESET and the reset succeeds, we will call
>>> ->slot_reset().
>>>
>>> After your patch, if "state == pci_channel_io_frozen", we *never* call
>>> ->slot_reset().
>>>
>>> Do you think that matches pci-error-recovery.rst?  It doesn't seem
>>> like it to me, but perhaps I haven't read it closely enough.
>> The documentation does not clearly say what to do with the return
>> value of reset_link() (step 3). But IMO, if step 3 recovers the device and
>> returns PCI_ERS_RESULT_RECOVERED, then there is no need to proceed
>> to slot reset (step 4). Maybe we should update the documentation?
> Are you suggesting we don't need to call a driver callback after
> resetting the device?  Note that the ->slot_reset() doesn't *perform*
> a reset; it is called *after* completion of a reset.
>
> The doc says:
>
>    ... Upon completion of slot reset, the platform will call the device
>    slot_reset() callback.
>    ...
>    This call gives drivers the chance to re-initialize the hardware
>    (re-download firmware, etc.).  At this point, the driver may assume
>    that the card is in a fresh state and is fully functional. The slot
>    is unfrozen and the driver has full access to PCI config space,
>    memory mapped I/O space and DMA. Interrupts (Legacy, MSI, or MSI-X)
>    will also be available.

Got it. I will fix it in the next version.
But will this sequence apply to fatal error recovery (triggered from the
DPC code path)? I think the device is removed and re-added in the hotplug
path (on a DLLSC transition).

>
> After we reset a device, the driver certainly needs a chance to
> reinitialize it.
>
>>>>> According to
>>>>> Documentation/PCI/pci-error-recovery.rst, we should call
>>>>> ->slot_reset() after completion of the reset.
>>>>>
>>>>> For example, rsxx_err_handler implements ->slot_reset(), but
>>>>> not ->resume().  If we reset the device, we'll claim success and
>>>>> return, but we won't call rsxx_slot_reset(), which does a bunch
>>>>> of important-looking recovery stuff.
>>>>>
>>>>> If pci-error-recovery.rst is wrong, we should fix that (after
>>>>> auditing all the drivers to make sure they match).
>>>>>
-- 
Sathyanarayanan Kuppuswamy
Linux kernel developer



end of thread

Thread overview: 17+ messages
2019-12-27  0:39 [PATCH v11 0/8] Add Error Disconnect Recover (EDR) support sathyanarayanan.kuppuswamy
2019-12-27  0:39 ` [PATCH v11 1/8] PCI/ERR: Update error status after reset_link() sathyanarayanan.kuppuswamy
2020-01-04  0:34   ` Bjorn Helgaas
2020-01-04  1:03     ` Kuppuswamy Sathyanarayanan
2020-01-04  2:54       ` Bjorn Helgaas
2020-01-09  0:14         ` Kuppuswamy Sathyanarayanan
2020-01-09 23:26           ` Bjorn Helgaas
2020-01-10  1:08             ` Kuppuswamy Sathyanarayanan
2019-12-27  0:39 ` [PATCH v11 2/8] PCI/DPC: Allow dpc_probe() even if firmware first mode is enabled sathyanarayanan.kuppuswamy
2019-12-27  0:39 ` [PATCH v11 3/8] PCI/DPC: Add dpc_process_error() wrapper function sathyanarayanan.kuppuswamy
2019-12-27  0:39 ` [PATCH v11 4/8] PCI/DPC: Add Error Disconnect Recover (EDR) support sathyanarayanan.kuppuswamy
2019-12-27  0:39 ` [PATCH v11 5/8] PCI/AER: Allow clearing Error Status Register in FF mode sathyanarayanan.kuppuswamy
2019-12-30 23:59   ` Bjorn Helgaas
2019-12-31 18:11     ` Kuppuswamy Sathyanarayanan
2019-12-27  0:39 ` [PATCH v11 6/8] PCI/DPC: Update comments related to DPC recovery on NON_FATAL errors sathyanarayanan.kuppuswamy
2019-12-27  0:39 ` [PATCH v11 7/8] PCI/DPC: Clear AER registers in EDR mode sathyanarayanan.kuppuswamy
2019-12-27  0:39 ` [PATCH v11 8/8] PCI/ACPI: Enable EDR support sathyanarayanan.kuppuswamy
