linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/6] Fix issues and cleanup for ERR_FATAL and ERR_NONFATAL
@ 2018-06-22  9:58 Oza Pawandeep
  2018-06-22  9:58 ` [PATCH v2 1/6] PCI/AER: Take severity mask into account while clearing error bits Oza Pawandeep
                   ` (6 more replies)
  0 siblings, 7 replies; 12+ messages in thread
From: Oza Pawandeep @ 2018-06-22  9:58 UTC (permalink / raw)
  To: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi
  Cc: Oza Pawandeep

These are follow up patches for the series which modifies ERR_FATAL handling.
although there were couple of problems existed before which, itis also fixing.

patch-1:
Fixes the problem where ERR_FATAL and ERR_NONFATAL status should be cleared
taking severity mask into account.

patch-2:
Takes care of clearing error fatal status

patch-3:
Follow up patch where no more need of handling ERR_FATAL
case.

patch-4:
Fixes clearing device status in case of uncorrectable errors.
(e.g. ERR_FATAL and ERR_NONFATAL)

patch-5:
Fixes clearing device status in case of correctable errors.

patch-6:
Follow up patch where no more need of handling pci_channel_io_frozen
in pcie_portdrv_slot_reset()

Oza Pawandeep (6):
  PCI/AER: Take severity mask into account while clearing error bits
  PCI/AER: Clear uncorrectable fatal error status bits
  PCI/ERR: Cleanup ERR_FATAL of error broadcast
  PCI/AER: Clear device error status bits during ERR_FATAL and
    ERR_NONFATAL
  PCI/AER: Fix correctable status bits clearing in device register
  PCI/PORTDRV: Remove ERR_FATAL handling from pcie_portdrv_slot_reset()

 drivers/pci/pcie/aer.c         | 35 +++++++++++++++++++++++------------
 drivers/pci/pcie/err.c         | 15 +++++++--------
 drivers/pci/pcie/portdrv_pci.c | 18 ------------------
 include/linux/aer.h            |  5 +++++
 4 files changed, 35 insertions(+), 38 deletions(-)

-- 
2.7.4


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v2 1/6] PCI/AER: Take severity mask into account while clearing error bits
  2018-06-22  9:58 [PATCH v2 0/6] Fix issues and cleanup for ERR_FATAL and ERR_NONFATAL Oza Pawandeep
@ 2018-06-22  9:58 ` Oza Pawandeep
  2018-07-17 19:03   ` Bjorn Helgaas
  2018-06-22  9:58 ` [PATCH v2 2/6] PCI/AER: Clear uncorrectable fatal error status bits Oza Pawandeep
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 12+ messages in thread
From: Oza Pawandeep @ 2018-06-22  9:58 UTC (permalink / raw)
  To: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi
  Cc: Oza Pawandeep

pci_cleanup_aer_uncorrect_error_status() is called by different slot_reset
callbacks in case of ERR_NONFATAL.

AER uncorrectable error status should take severity into account in order
to clear the bits, so that ERR_NONFATAL path does not clear the bit which
are marked with severity fatal.

Signed-off-by: Oza Pawandeep <poza@codeaurora.org>

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index a2e8838..d6cb1f0 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -360,13 +360,16 @@ EXPORT_SYMBOL_GPL(pci_disable_pcie_error_reporting);
 int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev)
 {
 	int pos;
-	u32 status;
+	u32 status, mask;
 
 	pos = dev->aer_cap;
 	if (!pos)
 		return -EIO;
 
+	/* Clean AER Root Error Status */
 	pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, &status);
+	pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_SEVER, &mask);
+	status &= ~mask; /* Clear corresponding nonfatal bits */
 	if (status)
 		pci_write_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, status);
 
@@ -1336,8 +1339,6 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
  */
 static void aer_error_resume(struct pci_dev *dev)
 {
-	int pos;
-	u32 status, mask;
 	u16 reg16;
 
 	/* Clean up Root device status */
@@ -1345,11 +1346,7 @@ static void aer_error_resume(struct pci_dev *dev)
 	pcie_capability_write_word(dev, PCI_EXP_DEVSTA, reg16);
 
 	/* Clean AER Root Error Status */
-	pos = dev->aer_cap;
-	pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, &status);
-	pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_SEVER, &mask);
-	status &= ~mask; /* Clear corresponding nonfatal bits */
-	pci_write_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, status);
+	pci_cleanup_aer_uncorrect_error_status(dev);
 }
 
 static struct pcie_port_service_driver aerdriver = {
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v2 2/6] PCI/AER: Clear uncorrectable fatal error status bits
  2018-06-22  9:58 [PATCH v2 0/6] Fix issues and cleanup for ERR_FATAL and ERR_NONFATAL Oza Pawandeep
  2018-06-22  9:58 ` [PATCH v2 1/6] PCI/AER: Take severity mask into account while clearing error bits Oza Pawandeep
@ 2018-06-22  9:58 ` Oza Pawandeep
  2018-06-22  9:58 ` [PATCH v2 3/6] PCI/ERR: Cleanup ERR_FATAL of error broadcast Oza Pawandeep
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 12+ messages in thread
From: Oza Pawandeep @ 2018-06-22  9:58 UTC (permalink / raw)
  To: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi
  Cc: Oza Pawandeep

During ERR_FATAL handling, AER calls pci_cleanup_aer_uncorrect_error_status
which should handle pci_channel_io_frozen case in order to determine if it
has to clear fatal bits.

Signed-off-by: Oza Pawandeep <poza@codeaurora.org>

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index d6cb1f0..e9c115d 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -369,7 +369,12 @@ int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev)
 	/* Clean AER Root Error Status */
 	pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, &status);
 	pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_SEVER, &mask);
-	status &= ~mask; /* Clear corresponding nonfatal bits */
+
+	if (dev->error_state == pci_channel_io_normal)
+		status &= ~mask; /* Clear corresponding nonfatal bits */
+	else
+		status &= mask; /* Clear corresponding fatal bits */
+
 	if (status)
 		pci_write_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, status);
 
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index f7ce0cb..00d2875 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -288,6 +288,7 @@ void pcie_do_fatal_recovery(struct pci_dev *dev, u32 service)
 	struct pci_dev *pdev, *temp;
 	pci_ers_result_t result;
 
+	dev->error_state = pci_channel_io_frozen;
 	if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE)
 		udev = dev;
 	else
@@ -323,6 +324,7 @@ void pcie_do_fatal_recovery(struct pci_dev *dev, u32 service)
 		if (pcie_wait_for_link(udev, true))
 			pci_rescan_bus(udev->bus);
 		pci_info(dev, "Device recovery from fatal error successful\n");
+		dev->error_state = pci_channel_io_normal;
 	} else {
 		pci_uevent_ers(dev, PCI_ERS_RESULT_DISCONNECT);
 		pci_info(dev, "Device recovery from fatal error failed\n");
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v2 3/6] PCI/ERR: Cleanup ERR_FATAL of error broadcast
  2018-06-22  9:58 [PATCH v2 0/6] Fix issues and cleanup for ERR_FATAL and ERR_NONFATAL Oza Pawandeep
  2018-06-22  9:58 ` [PATCH v2 1/6] PCI/AER: Take severity mask into account while clearing error bits Oza Pawandeep
  2018-06-22  9:58 ` [PATCH v2 2/6] PCI/AER: Clear uncorrectable fatal error status bits Oza Pawandeep
@ 2018-06-22  9:58 ` Oza Pawandeep
  2018-06-22  9:58 ` [PATCH v2 4/6] PCI/AER: Clear device error status bits during ERR_FATAL and ERR_NONFATAL Oza Pawandeep
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 12+ messages in thread
From: Oza Pawandeep @ 2018-06-22  9:58 UTC (permalink / raw)
  To: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi
  Cc: Oza Pawandeep

ERR_FATAL is handled by resetting the Link in software, skipping the
driver pci_error_handlers callbacks, removing the devices from the PCI
subsystem, and re-enumerating, so now no more ERR_FATAL handling is
required inside pci_broadcast_error_message()

Signed-off-by: Oza Pawandeep <poza@codeaurora.org>

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 00d2875..404bb69 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -259,15 +259,10 @@ static pci_ers_result_t broadcast_error_message(struct pci_dev *dev,
 		/*
 		 * If the error is reported by an end point, we think this
 		 * error is related to the upstream link of the end point.
+		 * the error is non fatal so the bus is ok, just invoke
+		 * the callback for the function that logged the error.
 		 */
-		if (state == pci_channel_io_normal)
-			/*
-			 * the error is non fatal so the bus is ok, just invoke
-			 * the callback for the function that logged the error.
-			 */
-			cb(dev, &result_data);
-		else
-			pci_walk_bus(dev->bus, cb, &result_data);
+		cb(dev, &result_data);
 	}
 
 	return result_data.result;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v2 4/6] PCI/AER: Clear device error status bits during ERR_FATAL and ERR_NONFATAL
  2018-06-22  9:58 [PATCH v2 0/6] Fix issues and cleanup for ERR_FATAL and ERR_NONFATAL Oza Pawandeep
                   ` (2 preceding siblings ...)
  2018-06-22  9:58 ` [PATCH v2 3/6] PCI/ERR: Cleanup ERR_FATAL of error broadcast Oza Pawandeep
@ 2018-06-22  9:58 ` Oza Pawandeep
  2018-06-22  9:58 ` [PATCH v2 5/6] PCI/AER: Fix correctable status bits clearing in device register Oza Pawandeep
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 12+ messages in thread
From: Oza Pawandeep @ 2018-06-22  9:58 UTC (permalink / raw)
  To: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi
  Cc: Oza Pawandeep

In both ERR_FATAL and ERR_NONFATA cases the device error status
bits needs to be cleared.

Signed-off-by: Oza Pawandeep <poza@codeaurora.org>

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index e9c115d..d2d6868 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -357,6 +357,17 @@ int pci_disable_pcie_error_reporting(struct pci_dev *dev)
 }
 EXPORT_SYMBOL_GPL(pci_disable_pcie_error_reporting);
 
+int pci_cleanup_aer_error_device_status(struct pci_dev *dev)
+{
+	u16 reg16;
+
+	/* Clean up Root device status */
+	pcie_capability_read_word(dev, PCI_EXP_DEVSTA, &reg16);
+	pcie_capability_write_word(dev, PCI_EXP_DEVSTA, reg16);
+
+	return 0;
+}
+
 int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev)
 {
 	int pos;
@@ -1344,11 +1355,8 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
  */
 static void aer_error_resume(struct pci_dev *dev)
 {
-	u16 reg16;
-
 	/* Clean up Root device status */
-	pcie_capability_read_word(dev, PCI_EXP_DEVSTA, &reg16);
-	pcie_capability_write_word(dev, PCI_EXP_DEVSTA, reg16);
+	pci_cleanup_aer_error_device_status(dev);
 
 	/* Clean AER Root Error Status */
 	pci_cleanup_aer_uncorrect_error_status(dev);
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 404bb69..410c35c 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -252,6 +252,7 @@ static pci_ers_result_t broadcast_error_message(struct pci_dev *dev,
 			dev->error_state = state;
 		pci_walk_bus(dev->subordinate, cb, &result_data);
 		if (cb == report_resume) {
+			pci_cleanup_aer_error_device_status(dev);
 			pci_cleanup_aer_uncorrect_error_status(dev);
 			dev->error_state = pci_channel_io_normal;
 		}
@@ -312,6 +313,7 @@ void pcie_do_fatal_recovery(struct pci_dev *dev, u32 service)
 		 * do error recovery on all subordinates of the bridge instead
 		 * of the bridge and clear the error status of the bridge.
 		 */
+		pci_cleanup_aer_error_device_status(dev);
 		pci_cleanup_aer_uncorrect_error_status(dev);
 	}
 
diff --git a/include/linux/aer.h b/include/linux/aer.h
index 514bffa..165a147 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -44,6 +44,7 @@ struct aer_capability_regs {
 /* PCIe port driver needs this function to enable AER */
 int pci_enable_pcie_error_reporting(struct pci_dev *dev);
 int pci_disable_pcie_error_reporting(struct pci_dev *dev);
+int pci_cleanup_aer_error_device_status(struct pci_dev *dev);
 int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev);
 int pci_cleanup_aer_error_status_regs(struct pci_dev *dev);
 #else
@@ -55,6 +56,10 @@ static inline int pci_disable_pcie_error_reporting(struct pci_dev *dev)
 {
 	return -EINVAL;
 }
+static inline int pci_cleanup_aer_error_device_status(struct pci_dev *dev)
+{
+	return -EINVAL;
+}
 static inline int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev)
 {
 	return -EINVAL;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v2 5/6] PCI/AER: Fix correctable status bits clearing in device register
  2018-06-22  9:58 [PATCH v2 0/6] Fix issues and cleanup for ERR_FATAL and ERR_NONFATAL Oza Pawandeep
                   ` (3 preceding siblings ...)
  2018-06-22  9:58 ` [PATCH v2 4/6] PCI/AER: Clear device error status bits during ERR_FATAL and ERR_NONFATAL Oza Pawandeep
@ 2018-06-22  9:58 ` Oza Pawandeep
  2018-06-22  9:58 ` [PATCH v2 6/6] PCI/PORTDRV: Remove ERR_FATAL handling from pcie_portdrv_slot_reset() Oza Pawandeep
  2018-07-05 15:12 ` [PATCH v2 0/6] Fix issues and cleanup for ERR_FATAL and ERR_NONFATAL poza
  6 siblings, 0 replies; 12+ messages in thread
From: Oza Pawandeep @ 2018-06-22  9:58 UTC (permalink / raw)
  To: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi
  Cc: Oza Pawandeep

In case of correctable error Device Status Register sets
Correctable Error Detected, which should be cleared after handling
the error

Signed-off-by: Oza Pawandeep <poza@codeaurora.org>

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index d2d6868..a42b071 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -818,6 +818,7 @@ static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
 		if (pos)
 			pci_write_config_dword(dev, pos + PCI_ERR_COR_STATUS,
 					info->status);
+		pci_cleanup_aer_error_device_status(dev);
 	} else if (info->severity == AER_NONFATAL)
 		pcie_do_nonfatal_recovery(dev);
 	else if (info->severity == AER_FATAL)
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v2 6/6] PCI/PORTDRV: Remove ERR_FATAL handling from pcie_portdrv_slot_reset()
  2018-06-22  9:58 [PATCH v2 0/6] Fix issues and cleanup for ERR_FATAL and ERR_NONFATAL Oza Pawandeep
                   ` (4 preceding siblings ...)
  2018-06-22  9:58 ` [PATCH v2 5/6] PCI/AER: Fix correctable status bits clearing in device register Oza Pawandeep
@ 2018-06-22  9:58 ` Oza Pawandeep
  2018-07-18 19:10   ` Bjorn Helgaas
  2018-07-05 15:12 ` [PATCH v2 0/6] Fix issues and cleanup for ERR_FATAL and ERR_NONFATAL poza
  6 siblings, 1 reply; 12+ messages in thread
From: Oza Pawandeep @ 2018-06-22  9:58 UTC (permalink / raw)
  To: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi
  Cc: Oza Pawandeep

We are handling ERR_FATAL by resetting the Link in software,skipping the
driver pci_error_handlers callbacks, removing the devices from the PCI
subsystem, and re-enumerating, because of, no need to handle
pci_channel_io_frozen case anymore.

Besides the walk on the bus is happening on subordinates, inside
broadcast_error_message(), which means that pcie_portdrv_slot_reset()
is never called for RP, and now since the all the devices are removed under
this downstream link, we can safely get rid of ERR_FATAL handling code
in pcie_portdrv_slot_reset().

Signed-off-by: Oza Pawandeep <poza@codeaurora.org>

diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
index 973f1b8..b970a6d 100644
--- a/drivers/pci/pcie/portdrv_pci.c
+++ b/drivers/pci/pcie/portdrv_pci.c
@@ -42,17 +42,6 @@ __setup("pcie_ports=", pcie_port_setup);
 
 /* global data */
 
-static int pcie_portdrv_restore_config(struct pci_dev *dev)
-{
-	int retval;
-
-	retval = pci_enable_device(dev);
-	if (retval)
-		return retval;
-	pci_set_master(dev);
-	return 0;
-}
-
 #ifdef CONFIG_PM
 static int pcie_port_runtime_suspend(struct device *dev)
 {
@@ -163,13 +152,6 @@ static pci_ers_result_t pcie_portdrv_mmio_enabled(struct pci_dev *dev)
 static pci_ers_result_t pcie_portdrv_slot_reset(struct pci_dev *dev)
 {
 	/* If fatal, restore cfg space for possible link reset at upstream */
-	if (dev->error_state == pci_channel_io_frozen) {
-		dev->state_saved = true;
-		pci_restore_state(dev);
-		pcie_portdrv_restore_config(dev);
-		pci_enable_pcie_error_reporting(dev);
-	}
-
 	return PCI_ERS_RESULT_RECOVERED;
 }
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 0/6] Fix issues and cleanup for ERR_FATAL and ERR_NONFATAL
  2018-06-22  9:58 [PATCH v2 0/6] Fix issues and cleanup for ERR_FATAL and ERR_NONFATAL Oza Pawandeep
                   ` (5 preceding siblings ...)
  2018-06-22  9:58 ` [PATCH v2 6/6] PCI/PORTDRV: Remove ERR_FATAL handling from pcie_portdrv_slot_reset() Oza Pawandeep
@ 2018-07-05 15:12 ` poza
  6 siblings, 0 replies; 12+ messages in thread
From: poza @ 2018-07-05 15:12 UTC (permalink / raw)
  To: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi

Hi Bjorn,

Could you help to review this series ?

Regards,
Oza.

On 2018-06-22 15:28, Oza Pawandeep wrote:
> These are follow up patches for the series which modifies ERR_FATAL 
> handling.
> although there were couple of problems existed before which, itis also 
> fixing.
> 
> patch-1:
> Fixes the problem where ERR_FATAL and ERR_NONFATAL status should be 
> cleared
> taking severity mask into account.
> 
> patch-2:
> Takes care of clearing error fatal status
> 
> patch-3:
> Follow up patch where no more need of handling ERR_FATAL
> case.
> 
> patch-4:
> Fixes clearing device status in case of uncorrectable errors.
> (e.g. ERR_FATAL and ERR_NONFATAL)
> 
> patch-5:
> Fixes clearing device status in case of correctable errors.
> 
> patch-6:
> Follow up patch where no more need of handling pci_channel_io_frozen
> in pcie_portdrv_slot_reset()
> 
> Oza Pawandeep (6):
>   PCI/AER: Take severity mask into account while clearing error bits
>   PCI/AER: Clear uncorrectable fatal error status bits
>   PCI/ERR: Cleanup ERR_FATAL of error broadcast
>   PCI/AER: Clear device error status bits during ERR_FATAL and
>     ERR_NONFATAL
>   PCI/AER: Fix correctable status bits clearing in device register
>   PCI/PORTDRV: Remove ERR_FATAL handling from pcie_portdrv_slot_reset()
> 
>  drivers/pci/pcie/aer.c         | 35 
> +++++++++++++++++++++++------------
>  drivers/pci/pcie/err.c         | 15 +++++++--------
>  drivers/pci/pcie/portdrv_pci.c | 18 ------------------
>  include/linux/aer.h            |  5 +++++
>  4 files changed, 35 insertions(+), 38 deletions(-)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 1/6] PCI/AER: Take severity mask into account while clearing error bits
  2018-06-22  9:58 ` [PATCH v2 1/6] PCI/AER: Take severity mask into account while clearing error bits Oza Pawandeep
@ 2018-07-17 19:03   ` Bjorn Helgaas
  2018-07-17 21:36     ` Bjorn Helgaas
  0 siblings, 1 reply; 12+ messages in thread
From: Bjorn Helgaas @ 2018-07-17 19:03 UTC (permalink / raw)
  To: Oza Pawandeep
  Cc: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi

Hi Oza,

Thanks for doing this!

On Fri, Jun 22, 2018 at 05:58:09AM -0400, Oza Pawandeep wrote:
> pci_cleanup_aer_uncorrect_error_status() is called by different slot_reset
> callbacks in case of ERR_NONFATAL.

IIRC, the current strategy is:

  ERR_COR: log only
  ERR_NONFATAL: call driver callbacks (pci_error_handlers)
  ERR_FATAL: remove device and re-enumerate

So these slot_reset callbacks are only used for ERR_NONFATAL, which
are all uncorrectable errors, of course.

This patch makes it so that when the slot_reset callbacks call
pci_cleanup_aer_uncorrect_error_status(), we only clear the
bits set by ERR_NONFATAL events (this is determined by
PCI_ERR_UNCOR_SEVER).

That makes good sense to me.  All these status bits are RW1CS, so they
will be preserved across a reset but will be cleared when we
re-enumerate, in this path:

  pci_init_capabilities
    pci_aer_init
      pci_cleanup_aer_error_status_regs
      # clear all PCI_ERR_UNCOR_STATUS and PCI_ERR_COR_STATUS bits

> AER uncorrectable error status should take severity into account in order
> to clear the bits, so that ERR_NONFATAL path does not clear the bit which
> are marked with severity fatal.

Two comments:

1) Can you split this into two patches, one that changes
pci_cleanup_aer_uncorrect_error_status() so it looks like the error
clearing code in aer_error_resume(), and a second that factors out the
duplicate code?

2) Maybe use "sev" or "sever" instead of "mask" for the local
variable, since there is also a separate PCI_ERR_UNCOR_MASK register,
which is not involved here.

3) The "pci_cleanup_aer_uncorrect_error_status()" name no longer
really describes what this does.  Something like
"pci_aer_clear_nonfatal_status()" would be more descriptive.  But I
see you have a subsequent patch (which I haven't looked at yet) that
is related to this.

4) I don't think the driver slot_reset callbacks should be responsible
for clearing these AER status bits.  Can we clear them somewhere in
the pcie_do_nonfatal_recovery() path and remove these calls from the
drivers?

5) In principle, we should only read PCI_ERR_UNCOR_STATUS *once* per
device when handling an error.  We currently read it three times:

  aer_isr
    aer_isr_one_error
      find_source_device
        find_device_iter
          is_error_source
            read PCI_ERR_UNCOR_STATUS              # 1
      aer_process_err_devices
        get_device_error_info(e_info->dev[i])
          read PCI_ERR_UNCOR_STATUS                # 2
        handle_error_source
          pcie_do_nonfatal_recovery
            ...
              report_slot_reset
                driver->err_handler->slot_reset
                  pci_cleanup_aer_uncorrect_error_status
                    read PCI_ERR_UNCOR_STATUS      # 3

OK, that was more than two comments :)

> Signed-off-by: Oza Pawandeep <poza@codeaurora.org>
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index a2e8838..d6cb1f0 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -360,13 +360,16 @@ EXPORT_SYMBOL_GPL(pci_disable_pcie_error_reporting);
>  int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev)
>  {
>  	int pos;
> -	u32 status;
> +	u32 status, mask;
>  
>  	pos = dev->aer_cap;
>  	if (!pos)
>  		return -EIO;
>  
> +	/* Clean AER Root Error Status */

s/Clean/Clear/ (I see you copied this from aer_error_resume(), and maybe
the second patch could remove or update that comment, too)

>  	pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, &status);
> +	pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_SEVER, &mask);
> +	status &= ~mask; /* Clear corresponding nonfatal bits */
>  	if (status)
>  		pci_write_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, status);
>  
> @@ -1336,8 +1339,6 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
>   */
>  static void aer_error_resume(struct pci_dev *dev)
>  {
> -	int pos;
> -	u32 status, mask;
>  	u16 reg16;
>  
>  	/* Clean up Root device status */
> @@ -1345,11 +1346,7 @@ static void aer_error_resume(struct pci_dev *dev)
>  	pcie_capability_write_word(dev, PCI_EXP_DEVSTA, reg16);
>  
>  	/* Clean AER Root Error Status */
> -	pos = dev->aer_cap;
> -	pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, &status);
> -	pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_SEVER, &mask);
> -	status &= ~mask; /* Clear corresponding nonfatal bits */
> -	pci_write_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, status);
> +	pci_cleanup_aer_uncorrect_error_status(dev);
>  }
>  
>  static struct pcie_port_service_driver aerdriver = {
> -- 
> 2.7.4
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 1/6] PCI/AER: Take severity mask into account while clearing error bits
  2018-07-17 19:03   ` Bjorn Helgaas
@ 2018-07-17 21:36     ` Bjorn Helgaas
  2018-07-18  9:13       ` poza
  0 siblings, 1 reply; 12+ messages in thread
From: Bjorn Helgaas @ 2018-07-17 21:36 UTC (permalink / raw)
  To: Oza Pawandeep
  Cc: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi

On Tue, Jul 17, 2018 at 02:03:29PM -0500, Bjorn Helgaas wrote:
> Hi Oza,
> 
> Thanks for doing this!
> 
> On Fri, Jun 22, 2018 at 05:58:09AM -0400, Oza Pawandeep wrote:
> > pci_cleanup_aer_uncorrect_error_status() is called by different slot_reset
> > callbacks in case of ERR_NONFATAL.
> 
> IIRC, the current strategy is:
> 
>   ERR_COR: log only
>   ERR_NONFATAL: call driver callbacks (pci_error_handlers)
>   ERR_FATAL: remove device and re-enumerate
> 
> So these slot_reset callbacks are only used for ERR_NONFATAL, which
> are all uncorrectable errors, of course.
> 
> This patch makes it so that when the slot_reset callbacks call
> pci_cleanup_aer_uncorrect_error_status(), we only clear the
> bits set by ERR_NONFATAL events (this is determined by
> PCI_ERR_UNCOR_SEVER).
> 
> That makes good sense to me.  All these status bits are RW1CS, so they
> will be preserved across a reset but will be cleared when we
> re-enumerate, in this path:
> 
>   pci_init_capabilities
>     pci_aer_init
>       pci_cleanup_aer_error_status_regs
>       # clear all PCI_ERR_UNCOR_STATUS and PCI_ERR_COR_STATUS bits
> 
> > AER uncorrectable error status should take severity into account in order
> > to clear the bits, so that ERR_NONFATAL path does not clear the bit which
> > are marked with severity fatal.
> 
> Two comments:
> 
> 1) Can you split this into two patches, one that changes
> pci_cleanup_aer_uncorrect_error_status() so it looks like the error
> clearing code in aer_error_resume(), and a second that factors out the
> duplicate code?
> 
> 2) Maybe use "sev" or "sever" instead of "mask" for the local
> variable, since there is also a separate PCI_ERR_UNCOR_MASK register,
> which is not involved here.

Let me back up a little here: I'm not asking you to do the things
below here.  They're just possible future things, so we can think
about them after this series.  And the things above are things I can
easily do myself.  So no action required from you, unless you think
I'm on the wrong track :)

> 3) The "pci_cleanup_aer_uncorrect_error_status()" name no longer
> really describes what this does.  Something like
> "pci_aer_clear_nonfatal_status()" would be more descriptive.  But I
> see you have a subsequent patch (which I haven't looked at yet) that
> is related to this.
> 
> 4) I don't think the driver slot_reset callbacks should be responsible
> for clearing these AER status bits.  Can we clear them somewhere in
> the pcie_do_nonfatal_recovery() path and remove these calls from the
> drivers?
> 
> 5) In principle, we should only read PCI_ERR_UNCOR_STATUS *once* per
> device when handling an error.  We currently read it three times:
> 
>   aer_isr
>     aer_isr_one_error
>       find_source_device
>         find_device_iter
>           is_error_source
>             read PCI_ERR_UNCOR_STATUS              # 1
>       aer_process_err_devices
>         get_device_error_info(e_info->dev[i])
>           read PCI_ERR_UNCOR_STATUS                # 2
>         handle_error_source
>           pcie_do_nonfatal_recovery
>             ...
>               report_slot_reset
>                 driver->err_handler->slot_reset
>                   pci_cleanup_aer_uncorrect_error_status
>                     read PCI_ERR_UNCOR_STATUS      # 3
> 
> OK, that was more than two comments :)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 1/6] PCI/AER: Take severity mask into account while clearing error bits
  2018-07-17 21:36     ` Bjorn Helgaas
@ 2018-07-18  9:13       ` poza
  0 siblings, 0 replies; 12+ messages in thread
From: poza @ 2018-07-18  9:13 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi

On 2018-07-18 03:06, Bjorn Helgaas wrote:
> On Tue, Jul 17, 2018 at 02:03:29PM -0500, Bjorn Helgaas wrote:
>> Hi Oza,
>> 
>> Thanks for doing this!
>> 
>> On Fri, Jun 22, 2018 at 05:58:09AM -0400, Oza Pawandeep wrote:
>> > pci_cleanup_aer_uncorrect_error_status() is called by different slot_reset
>> > callbacks in case of ERR_NONFATAL.
>> 
>> IIRC, the current strategy is:
>> 
>>   ERR_COR: log only
>>   ERR_NONFATAL: call driver callbacks (pci_error_handlers)
>>   ERR_FATAL: remove device and re-enumerate
>> 
>> So these slot_reset callbacks are only used for ERR_NONFATAL, which
>> are all uncorrectable errors, of course.
>> 
>> This patch makes it so that when the slot_reset callbacks call
>> pci_cleanup_aer_uncorrect_error_status(), we only clear the
>> bits set by ERR_NONFATAL events (this is determined by
>> PCI_ERR_UNCOR_SEVER).
>> 
>> That makes good sense to me.  All these status bits are RW1CS, so they
>> will be preserved across a reset but will be cleared when we
>> re-enumerate, in this path:
>> 
>>   pci_init_capabilities
>>     pci_aer_init
>>       pci_cleanup_aer_error_status_regs
>>       # clear all PCI_ERR_UNCOR_STATUS and PCI_ERR_COR_STATUS bits
>> 
>> > AER uncorrectable error status should take severity into account in order
>> > to clear the bits, so that ERR_NONFATAL path does not clear the bit which
>> > are marked with severity fatal.
>> 
>> Two comments:
>> 
>> 1) Can you split this into two patches, one that changes
>> pci_cleanup_aer_uncorrect_error_status() so it looks like the error
>> clearing code in aer_error_resume(), and a second that factors out the
>> duplicate code?
>> 
>> 2) Maybe use "sev" or "sever" instead of "mask" for the local
>> variable, since there is also a separate PCI_ERR_UNCOR_MASK register,
>> which is not involved here.
> 
> Let me back up a little here: I'm not asking you to do the things
> below here.  They're just possible future things, so we can think
> about them after this series.  And the things above are things I can
> easily do myself.  So no action required from you, unless you think
> I'm on the wrong track :)

I agree with your points, and have taken them into account for future 
series reference as well.

what about PATCH-2 of this series ?
that clears ERR_FATAL bits, but as you said, during re-enumeration
pci_init_capabilities
      pci_aer_init
        pci_cleanup_aer_error_status_regs
        # clear all PCI_ERR_UNCOR_STATUS and PCI_ERR_COR_STATUS bits

but that should clear the ERR_FATAL of the devices beneath.

PATCH2: we are doing it for BRIDGE where we think where ERR_FATAL was 
reported by bridge and the problem is with downstream link.
if ((service == PCIE_PORT_SERVICE_AER) &&
	    (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE)) {
		/*
		 * If the error is reported by a bridge, we think this error
		 * is related to the downstream link of the bridge, so we
		 * do error recovery on all subordinates of the bridge instead
		 * of the bridge and clear the error status of the bridge.
		 */
		pci_cleanup_aer_uncorrect_error_status(dev);
	}


so overall, I think all the patches are required, if you have comments 
please let me know.
so far I see that, no action is required from me.

> 
>> 3) The "pci_cleanup_aer_uncorrect_error_status()" name no longer
>> really describes what this does.  Something like
>> "pci_aer_clear_nonfatal_status()" would be more descriptive.  But I
>> see you have a subsequent patch (which I haven't looked at yet) that
>> is related to this.
>> 
>> 4) I don't think the driver slot_reset callbacks should be responsible
>> for clearing these AER status bits.  Can we clear them somewhere in
>> the pcie_do_nonfatal_recovery() path and remove these calls from the
>> drivers?
>> 
>> 5) In principle, we should only read PCI_ERR_UNCOR_STATUS *once* per
>> device when handling an error.  We currently read it three times:
>> 
>>   aer_isr
>>     aer_isr_one_error
>>       find_source_device
>>         find_device_iter
>>           is_error_source
>>             read PCI_ERR_UNCOR_STATUS              # 1
>>       aer_process_err_devices
>>         get_device_error_info(e_info->dev[i])
>>           read PCI_ERR_UNCOR_STATUS                # 2
>>         handle_error_source
>>           pcie_do_nonfatal_recovery
>>             ...
>>               report_slot_reset
>>                 driver->err_handler->slot_reset
>>                   pci_cleanup_aer_uncorrect_error_status
>>                     read PCI_ERR_UNCOR_STATUS      # 3
>> 
>> OK, that was more than two comments :)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 6/6] PCI/PORTDRV: Remove ERR_FATAL handling from pcie_portdrv_slot_reset()
  2018-06-22  9:58 ` [PATCH v2 6/6] PCI/PORTDRV: Remove ERR_FATAL handling from pcie_portdrv_slot_reset() Oza Pawandeep
@ 2018-07-18 19:10   ` Bjorn Helgaas
  0 siblings, 0 replies; 12+ messages in thread
From: Bjorn Helgaas @ 2018-07-18 19:10 UTC (permalink / raw)
  To: Oza Pawandeep
  Cc: Bjorn Helgaas, Philippe Ombredanne, Thomas Gleixner,
	Greg Kroah-Hartman, Kate Stewart, linux-pci, linux-kernel,
	Dongdong Liu, Keith Busch, Wei Zhang, Sinan Kaya, Timur Tabi

On Fri, Jun 22, 2018 at 05:58:14AM -0400, Oza Pawandeep wrote:
> We are handling ERR_FATAL by resetting the Link in software,skipping the
> driver pci_error_handlers callbacks, removing the devices from the PCI
> subsystem, and re-enumerating, because of, no need to handle
> pci_channel_io_frozen case anymore.
> 
> Besides the walk on the bus is happening on subordinates, inside
> broadcast_error_message(), which means that pcie_portdrv_slot_reset()
> is never called for RP, and now since the all the devices are removed under
> this downstream link, we can safely get rid of ERR_FATAL handling code
> in pcie_portdrv_slot_reset().

portdrv only binds to bridges, and it looks like
broadcast_error_message() only calls the pci_error_handlers callbacks
for subordinate devices, never for the bridge itself:

  broadcast_error_message
    if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE)
      pci_walk_bus(dev->subordinate, ...)

If that's the case, why do we need pcie_portdrv_err_handler at all?
We should remove it completely if we don't need it.

ISTR some arcane call path for pcie_portdrv_err_resume(), but I don't
remember the details.

> Signed-off-by: Oza Pawandeep <poza@codeaurora.org>
> 
> diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
> index 973f1b8..b970a6d 100644
> --- a/drivers/pci/pcie/portdrv_pci.c
> +++ b/drivers/pci/pcie/portdrv_pci.c
> @@ -42,17 +42,6 @@ __setup("pcie_ports=", pcie_port_setup);
>  
>  /* global data */
>  
> -static int pcie_portdrv_restore_config(struct pci_dev *dev)
> -{
> -	int retval;
> -
> -	retval = pci_enable_device(dev);
> -	if (retval)
> -		return retval;
> -	pci_set_master(dev);
> -	return 0;
> -}
> -
>  #ifdef CONFIG_PM
>  static int pcie_port_runtime_suspend(struct device *dev)
>  {
> @@ -163,13 +152,6 @@ static pci_ers_result_t pcie_portdrv_mmio_enabled(struct pci_dev *dev)
>  static pci_ers_result_t pcie_portdrv_slot_reset(struct pci_dev *dev)
>  {
>  	/* If fatal, restore cfg space for possible link reset at upstream */
> -	if (dev->error_state == pci_channel_io_frozen) {
> -		dev->state_saved = true;
> -		pci_restore_state(dev);
> -		pcie_portdrv_restore_config(dev);
> -		pci_enable_pcie_error_reporting(dev);
> -	}
> -
>  	return PCI_ERS_RESULT_RECOVERED;
>  }
>  
> -- 
> 2.7.4
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2018-07-18 19:10 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-22  9:58 [PATCH v2 0/6] Fix issues and cleanup for ERR_FATAL and ERR_NONFATAL Oza Pawandeep
2018-06-22  9:58 ` [PATCH v2 1/6] PCI/AER: Take severity mask into account while clearing error bits Oza Pawandeep
2018-07-17 19:03   ` Bjorn Helgaas
2018-07-17 21:36     ` Bjorn Helgaas
2018-07-18  9:13       ` poza
2018-06-22  9:58 ` [PATCH v2 2/6] PCI/AER: Clear uncorrectable fatal error status bits Oza Pawandeep
2018-06-22  9:58 ` [PATCH v2 3/6] PCI/ERR: Cleanup ERR_FATAL of error broadcast Oza Pawandeep
2018-06-22  9:58 ` [PATCH v2 4/6] PCI/AER: Clear device error status bits during ERR_FATAL and ERR_NONFATAL Oza Pawandeep
2018-06-22  9:58 ` [PATCH v2 5/6] PCI/AER: Fix correctable status bits clearing in device register Oza Pawandeep
2018-06-22  9:58 ` [PATCH v2 6/6] PCI/PORTDRV: Remove ERR_FATAL handling from pcie_portdrv_slot_reset() Oza Pawandeep
2018-07-18 19:10   ` Bjorn Helgaas
2018-07-05 15:12 ` [PATCH v2 0/6] Fix issues and cleanup for ERR_FATAL and ERR_NONFATAL poza

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).