* [PATCH v3] PCIe AER: report uncorrectable errors only to the functions that logged the errors
@ 2017-09-28 14:33 Gabriele Paoloni
2017-09-29 21:15 ` Bjorn Helgaas
0 siblings, 1 reply; 4+ messages in thread
From: Gabriele Paoloni @ 2017-09-28 14:33 UTC
To: bhelgaas, helgaas
Cc: gabriele.paoloni, linuxarm, linux-pci, linux-kernel, liudongdong3
Currently, if an uncorrectable error is reported by an EP, the AER
driver walks over all the devices connected to the upstream port
bus and in turn calls the report_error_detected() callback on each.
If any of the devices connected to the bus does not implement
dev->driver->err_handler->error_detected(), do_recovery() will fail,
leaving all the bus hierarchy devices unrecovered.
According to section "6.2.2.2.2. Non-Fatal Errors" of the PCIe specs
<< Non-fatal errors are uncorrectable errors which cause a particular
transaction to be unreliable but the Link is otherwise fully functional.
Isolating Non-fatal from Fatal errors provides Requester/Receiver logic
in a device or system management software the opportunity to recover
from the error without resetting the components on the Link and
disturbing other transactions in progress. Devices not associated with
the transaction in error are not impacted by the error.>>
Therefore, for non-fatal errors the PCIe link should not be considered
compromised, and it makes sense to report the error only to the
functions that logged an error.
This patch implements this new behaviour for non-fatal errors.
It also fixes the bug filed at the link below.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=197055
Fixes: 6c2b374d7485 ("PCI-Express AER implemetation: AER core and aerdriver")
Signed-off-by: Gabriele Paoloni <gabriele.paoloni@huawei.com>
Signed-off-by: Dongdong Liu <liudongdong3@huawei.com>
---
Changes from v2:
- no functional changes
- Added reference in the commit log to the bugzilla ticket
- Added reference in the commit log to the commit that this patch fixes
- Added reference in the commit log to the PCIe specs for Non-fatal
error handling rules
Changes from v1:
- now errors are reported only to the functions that logged the error
instead of all the functions in the same device.
- the patch subject has changed to match the new implementation
---
drivers/pci/pcie/aer/aerdrv_core.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
index 890efcc..7448052 100644
--- a/drivers/pci/pcie/aer/aerdrv_core.c
+++ b/drivers/pci/pcie/aer/aerdrv_core.c
@@ -390,7 +390,14 @@ static pci_ers_result_t broadcast_error_message(struct pci_dev *dev,
* If the error is reported by an end point, we think this
* error is related to the upstream link of the end point.
*/
- pci_walk_bus(dev->bus, cb, &result_data);
+ if (state == pci_channel_io_normal)
+ /*
+ * the error is non fatal so the bus is ok, just invoke
+ * the callback for the function that logged the error.
+ */
+ cb(dev, &result_data);
+ else
+ pci_walk_bus(dev->bus, cb, &result_data);
}
return result_data.result;
--
2.7.4
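The behaviour change in the hunk above can be sketched as a small user-space model. This is not kernel code: every `model_*` name is hypothetical, the bus is reduced to a flat array, and the mapping from severity to channel state is an assumption based on the patch comment ("the error is non fatal so the bus is ok").

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
enum model_channel_state { MODEL_IO_NORMAL, MODEL_IO_FROZEN };
enum model_severity { MODEL_NONFATAL, MODEL_FATAL };

struct model_dev {
	const char *name;
	int notified;	/* how many times the callback ran for this device */
};

struct model_bus {
	struct model_dev *devs;
	size_t ndevs;
};

/* Assumption: a non-fatal error leaves the channel state normal. */
static enum model_channel_state model_state_for(enum model_severity sev)
{
	return sev == MODEL_FATAL ? MODEL_IO_FROZEN : MODEL_IO_NORMAL;
}

/* Stand-in for the error callback: just count the invocation. */
static void model_notify(struct model_dev *dev)
{
	dev->notified++;
}

/*
 * Model of the endpoint branch after the patch: notify only the reporting
 * function when the channel state is normal, otherwise walk the whole bus.
 */
static void model_broadcast(struct model_bus *bus, struct model_dev *reporter,
			    enum model_channel_state state)
{
	size_t i;

	assert(bus != NULL && reporter != NULL);

	if (state == MODEL_IO_NORMAL) {
		model_notify(reporter);
		return;
	}

	for (i = 0; i < bus->ndevs; i++)
		model_notify(&bus->devs[i]);
}
```

In this model a non-fatal error on one function leaves its siblings untouched, while a fatal error still reaches every function on the bus, which is the split the patch aims for.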
* Re: [PATCH v3] PCIe AER: report uncorrectable errors only to the functions that logged the errors
2017-09-28 14:33 [PATCH v3] PCIe AER: report uncorrectable errors only to the functions that logged the errors Gabriele Paoloni
@ 2017-09-29 21:15 ` Bjorn Helgaas
2017-09-29 21:31 ` Bjorn Helgaas
0 siblings, 1 reply; 4+ messages in thread
From: Bjorn Helgaas @ 2017-09-29 21:15 UTC
To: Gabriele Paoloni
Cc: bhelgaas, linuxarm, linux-pci, linux-kernel, liudongdong3
On Thu, Sep 28, 2017 at 03:33:05PM +0100, Gabriele Paoloni wrote:
> Currently, if an uncorrectable error is reported by an EP, the AER
> driver walks over all the devices connected to the upstream port
> bus and in turn calls the report_error_detected() callback on each.
> If any of the devices connected to the bus does not implement
> dev->driver->err_handler->error_detected(), do_recovery() will fail,
> leaving all the bus hierarchy devices unrecovered.
>
> According to section "6.2.2.2.2. Non-Fatal Errors" of the PCIe specs
> << Non-fatal errors are uncorrectable errors which cause a particular
> transaction to be unreliable but the Link is otherwise fully functional.
> Isolating Non-fatal from Fatal errors provides Requester/Receiver logic
> in a device or system management software the opportunity to recover
> from the error without resetting the components on the Link and
> disturbing other transactions in progress. Devices not associated with
> the transaction in error are not impacted by the error.>>
> Therefore, for non-fatal errors the PCIe link should not be considered
> compromised, and it makes sense to report the error only to the
> functions that logged an error.
>
> This patch implements this new behaviour for non-fatal errors.
> It also fixes the bug filed at the link below.
>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=197055
> Fixes: 6c2b374d7485 ("PCI-Express AER implemetation: AER core and aerdriver")
> Signed-off-by: Gabriele Paoloni <gabriele.paoloni@huawei.com>
> Signed-off-by: Dongdong Liu <liudongdong3@huawei.com>
Applied to pci/aer for v4.15, thanks!
I rewrote some of the changelog to say "non-fatal" instead of
"uncorrectable", since "uncorrectable" also includes fatal errors,
and you're not changing those. Take a look and let me know if
I broke anything.
> ---
> Changes from v2:
> - no functional changes
> - Added reference in the commit log to the bugzilla ticket
> - Added reference in the commit log to the commit that this patch fixes
> - Added reference in the commit log to the PCIe specs for Non-fatal
> error handling rules
>
> Changes from v1:
> - now errors are reported only to the functions that logged the error
> instead of all the functions in the same device.
> - the patch subject has changed to match the new implementation
> ---
> drivers/pci/pcie/aer/aerdrv_core.c | 9 ++++++++-
> 1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
> index 890efcc..7448052 100644
> --- a/drivers/pci/pcie/aer/aerdrv_core.c
> +++ b/drivers/pci/pcie/aer/aerdrv_core.c
> @@ -390,7 +390,14 @@ static pci_ers_result_t broadcast_error_message(struct pci_dev *dev,
> * If the error is reported by an end point, we think this
> * error is related to the upstream link of the end point.
> */
> - pci_walk_bus(dev->bus, cb, &result_data);
> + if (state == pci_channel_io_normal)
> + /*
> + * the error is non fatal so the bus is ok, just invoke
> + * the callback for the function that logged the error.
> + */
> + cb(dev, &result_data);
> + else
> + pci_walk_bus(dev->bus, cb, &result_data);
> }
>
> return result_data.result;
> --
> 2.7.4
>
>
* Re: [PATCH v3] PCIe AER: report uncorrectable errors only to the functions that logged the errors
2017-09-29 21:15 ` Bjorn Helgaas
@ 2017-09-29 21:31 ` Bjorn Helgaas
2017-10-02 10:20 ` Gabriele Paoloni
0 siblings, 1 reply; 4+ messages in thread
From: Bjorn Helgaas @ 2017-09-29 21:31 UTC
To: Gabriele Paoloni
Cc: bhelgaas, linuxarm, linux-pci, linux-kernel, liudongdong3
On Fri, Sep 29, 2017 at 04:15:26PM -0500, Bjorn Helgaas wrote:
> On Thu, Sep 28, 2017 at 03:33:05PM +0100, Gabriele Paoloni wrote:
> > Currently, if an uncorrectable error is reported by an EP, the AER
> > driver walks over all the devices connected to the upstream port
> > bus and in turn calls the report_error_detected() callback on each.
> > If any of the devices connected to the bus does not implement
> > dev->driver->err_handler->error_detected(), do_recovery() will fail,
> > leaving all the bus hierarchy devices unrecovered.
> >
> > According to section "6.2.2.2.2. Non-Fatal Errors" of the PCIe specs
> > << Non-fatal errors are uncorrectable errors which cause a particular
> > transaction to be unreliable but the Link is otherwise fully functional.
> > Isolating Non-fatal from Fatal errors provides Requester/Receiver logic
> > in a device or system management software the opportunity to recover
> > from the error without resetting the components on the Link and
> > disturbing other transactions in progress. Devices not associated with
> > the transaction in error are not impacted by the error.>>
> > Therefore, for non-fatal errors the PCIe link should not be considered
> > compromised, and it makes sense to report the error only to the
> > functions that logged an error.
> >
> > This patch implements this new behaviour for non-fatal errors.
> > It also fixes the bug filed at the link below.
> >
> > Link: https://bugzilla.kernel.org/show_bug.cgi?id=197055
> > Fixes: 6c2b374d7485 ("PCI-Express AER implemetation: AER core and aerdriver")
> > Signed-off-by: Gabriele Paoloni <gabriele.paoloni@huawei.com>
> > Signed-off-by: Dongdong Liu <liudongdong3@huawei.com>
>
> Applied to pci/aer for v4.15, thanks!
>
> I rewrote some of the changelog to say "non-fatal" instead of
> "uncorrectable", since "uncorrectable" also includes fatal errors,
> and you're not changing those. Take a look and let me know if
> I broke anything.
Here it is so you don't have to look it up :)
commit 34ba6e7d5f3e37a369097c07c00bfed567860b8c
Author: Gabriele Paoloni <gabriele.paoloni@huawei.com>
Date: Thu Sep 28 15:33:05 2017 +0100
PCI/AER: Report non-fatal errors only to the affected endpoint
Previously, if a non-fatal error was reported by an endpoint, we
called report_error_detected() for the endpoint, every sibling on the
bus, and their descendants. If any of them did not implement the
.error_detected() method, do_recovery() failed, leaving all these
devices unrecovered.
For example, the system described in the bugzilla below has two devices:
0000:74:02.0 [19e5:a230] SAS controller, driver has .error_detected()
0000:74:03.0 [19e5:a235] SATA controller, driver lacks .error_detected()
When a device such as 74:02.0 reported a non-fatal error, do_recovery()
failed because 74:03.0 lacked an .error_detected() method. But per PCIe
r3.1, sec 6.2.2.2.2, such an error does not compromise the Link and
does not affect 74:03.0:
Non-fatal errors are uncorrectable errors which cause a particular
transaction to be unreliable but the Link is otherwise fully functional.
Isolating Non-fatal from Fatal errors provides Requester/Receiver logic
in a device or system management software the opportunity to recover from
the error without resetting the components on the Link and disturbing
other transactions in progress. Devices not associated with the
transaction in error are not impacted by the error.
Report non-fatal errors only to the endpoint that reported them. We really
want to check for AER_NONFATAL here, but the current code structure doesn't
allow that. Looking for pci_channel_io_normal is the best we can do now.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=197055
Fixes: 6c2b374d7485 ("PCI-Express AER implemetation: AER core and aerdriver")
Signed-off-by: Gabriele Paoloni <gabriele.paoloni@huawei.com>
Signed-off-by: Dongdong Liu <liudongdong3@huawei.com>
[bhelgaas: changelog]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
index 890efcc574cb..744805232155 100644
--- a/drivers/pci/pcie/aer/aerdrv_core.c
+++ b/drivers/pci/pcie/aer/aerdrv_core.c
@@ -390,7 +390,14 @@ static pci_ers_result_t broadcast_error_message(struct pci_dev *dev,
* If the error is reported by an end point, we think this
* error is related to the upstream link of the end point.
*/
- pci_walk_bus(dev->bus, cb, &result_data);
+ if (state == pci_channel_io_normal)
+ /*
+ * the error is non fatal so the bus is ok, just invoke
+ * the callback for the function that logged the error.
+ */
+ cb(dev, &result_data);
+ else
+ pci_walk_bus(dev->bus, cb, &result_data);
}
return result_data.result;
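The failure described in the changelog (74:03.0 lacking .error_detected() and blocking recovery of 74:02.0) can be illustrated with a simplified user-space model of how per-function votes are combined. This is a sketch under stated assumptions, not the kernel implementation: the `model_*` names and the two-value result enum are hypothetical, and the only rule modelled is that a function without the hook forces a "no AER driver" outcome for everyone that voted with it.

```c
#include <stddef.h>

/* Hypothetical, simplified result codes; the kernel's enum has more values. */
enum model_result {
	MODEL_CAN_RECOVER,
	MODEL_NO_AER_DRIVER,
};

struct model_fn {
	const char *name;
	int has_error_detected;	/* does the bound driver provide the hook? */
};

/* A function without the hook votes "no AER driver"; that vote is sticky. */
static enum model_result model_vote(enum model_result acc,
				    const struct model_fn *fn)
{
	if (!fn->has_error_detected)
		return MODEL_NO_AER_DRIVER;
	return acc;
}

/* Old behaviour: every function on the bus votes. */
static enum model_result model_report_all(const struct model_fn *fns, size_t n)
{
	enum model_result acc = MODEL_CAN_RECOVER;
	size_t i;

	for (i = 0; i < n; i++)
		acc = model_vote(acc, &fns[i]);
	return acc;
}

/* New behaviour for non-fatal errors: only the reporter votes. */
static enum model_result model_report_one(const struct model_fn *reporter)
{
	return model_vote(MODEL_CAN_RECOVER, reporter);
}
```

With the two functions from the changelog, polling the whole bus yields the "no AER driver" outcome, while polling only 74:02.0 lets its recovery proceed. The model also shows the limit of the change: if the reporting function itself lacks the hook, the result is unchanged.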
* RE: [PATCH v3] PCIe AER: report uncorrectable errors only to the functions that logged the errors
2017-09-29 21:31 ` Bjorn Helgaas
@ 2017-10-02 10:20 ` Gabriele Paoloni
0 siblings, 0 replies; 4+ messages in thread
From: Gabriele Paoloni @ 2017-10-02 10:20 UTC
To: Bjorn Helgaas
Cc: bhelgaas, Linuxarm, linux-pci, linux-kernel, liudongdong (C)
[...]
> >
> > Applied to pci/aer for v4.15, thanks!
> >
> > I rewrote some of the changelog to say "non-fatal" instead of
> > "uncorrectable", since "uncorrectable" also includes fatal errors,
> > and you're not changing those. Take a look and let me know if
> > I broke anything.
>
> Here it is so you don't have to look it up :)
>
Many thanks Bjorn!
Gab
[...]