* [ PATCH RESEND ] PCI-AER: Do not report successful error recovery for devices with AER-unaware drivers @ 2012-11-10 7:41 Pandarathil, Vijaymohan R 2012-11-15 0:51 ` Bjorn Helgaas 0 siblings, 1 reply; 6+ messages in thread From: Pandarathil, Vijaymohan R @ 2012-11-10 7:41 UTC (permalink / raw) To: bhelgaas; +Cc: linux-kernel, linux-pci, linasvepstas, Myron Stowe When an error is detected on a PCIe device which does not have an AER-aware driver, prevent AER infrastructure from reporting successful error recovery. This is because the report_error_detected() function that gets called in the first phase of recovery process allows forward progress even when the driver for the device does not have AER capabilities. It seems that all callbacks (in pci_error_handlers structure) registered by drivers that gets called during error recovery are not mandatory. So the intention of the infrastructure design seems to be to allow forward progress even when a specific callback has not been registered by a driver. However, if error handler structure itself has not been registered, it doesn't make sense to allow forward progress. As a result of the current design, in the case of a single device having an AER-unaware driver or in the case of any function in a multi-function card having an AER-unaware driver, a successful recovery is reported. Typical scenario this happens is when a PCI device is detached from a KVM host and the pci-stub driver on the host claims the device. The pci-stub driver does not have error handling capabilities but the AER infrastructure still reports that the device recovered successfully. The changes proposed here leaves the device in an unrecovered state if the driver for the device or for any function in a multi-function card does not have error handler structure registered. This reflects the true state of the device and prevents any partial recovery (or no recovery at all) reported as successful. Please also see comments from Linas Vepstas at the following link http://www.spinics.net/lists/linux-pci/msg18572.html Reviewed-by: Linas Vepstas <linasvepstas <at> gmail.com> Reviewed-by: Myron Stowe <mstowe <at> redhat.com> Signed-off-by: Vijay Mohan Pandarathil <vijaymohan.pandarathil <at> hp.com> --- drivers/pci/pcie/aer/aerdrv_core.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c index 06bad96..030b229 100644 --- a/drivers/pci/pcie/aer/aerdrv_core.c +++ b/drivers/pci/pcie/aer/aerdrv_core.c @@ -215,6 +215,12 @@ static int report_error_detected(struct pci_dev *dev, void *data) dev->error_state = result_data->state; + if ((!dev->driver || !dev->driver->err_handler) && + !(dev->hdr_type & PCI_HEADER_TYPE_BRIDGE)) { + dev_info(&dev->dev, "AER: Error detected but no driver has claimed this device or the driver is AER-unaware\n"); + result_data->result = PCI_ERS_RESULT_NONE; + return 1; + } if (!dev->driver || !dev->driver->err_handler || !dev->driver->err_handler->error_detected) { ^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [ PATCH RESEND ] PCI-AER: Do not report successful error recovery for devices with AER-unaware drivers 2012-11-10 7:41 [ PATCH RESEND ] PCI-AER: Do not report successful error recovery for devices with AER-unaware drivers Pandarathil, Vijaymohan R @ 2012-11-15 0:51 ` Bjorn Helgaas 2012-11-15 1:35 ` Linas Vepstas 2012-11-15 7:09 ` Pandarathil, Vijaymohan R 0 siblings, 2 replies; 6+ messages in thread From: Bjorn Helgaas @ 2012-11-15 0:51 UTC (permalink / raw) To: Pandarathil, Vijaymohan R Cc: linux-kernel, linux-pci, linasvepstas, Myron Stowe, Lance Ortiz, Huang Ying, Hidetoshi Seto, Andrew Patterson, Zhang Yanmin [+cc Lance, Huang, Hidetoshi, Andrew, Zhang] On Sat, Nov 10, 2012 at 07:41:04AM +0000, Pandarathil, Vijaymohan R wrote: > When an error is detected on a PCIe device which does not have an > AER-aware driver, prevent AER infrastructure from reporting > successful error recovery. > > This is because the report_error_detected() function that gets > called in the first phase of recovery process allows forward > progress even when the driver for the device does not have AER > capabilities. It seems that all callbacks (in pci_error_handlers > structure) registered by drivers that gets called during error > recovery are not mandatory. So the intention of the infrastructure > design seems to be to allow forward progress even when a specific > callback has not been registered by a driver. However, if error > handler structure itself has not been registered, it doesn't make > sense to allow forward progress. > > As a result of the current design, in the case of a single device > having an AER-unaware driver or in the case of any function in a > multi-function card having an AER-unaware driver, a successful > recovery is reported. > > Typical scenario this happens is when a PCI device is detached > from a KVM host and the pci-stub driver on the host claims the > device. The pci-stub driver does not have error handling capabilities > but the AER infrastructure still reports that the device recovered > successfully. > > The changes proposed here leaves the device in an unrecovered state > if the driver for the device or for any function in a multi-function > card does not have error handler structure registered. This reflects > the true state of the device and prevents any partial recovery (or no > recovery at all) reported as successful. > > Please also see comments from Linas Vepstas at the following link > http://www.spinics.net/lists/linux-pci/msg18572.html > > Reviewed-by: Linas Vepstas <linasvepstas <at> gmail.com> > Reviewed-by: Myron Stowe <mstowe <at> redhat.com> > Signed-off-by: Vijay Mohan Pandarathil <vijaymohan.pandarathil <at> hp.com> > > --- > > drivers/pci/pcie/aer/aerdrv_core.c | 6 ++++++ > 1 file changed, 6 insertions(+) > > diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c > index 06bad96..030b229 100644 > --- a/drivers/pci/pcie/aer/aerdrv_core.c > +++ b/drivers/pci/pcie/aer/aerdrv_core.c > @@ -215,6 +215,12 @@ static int report_error_detected(struct pci_dev *dev, void *data) > > dev->error_state = result_data->state; > > + if ((!dev->driver || !dev->driver->err_handler) && > + !(dev->hdr_type & PCI_HEADER_TYPE_BRIDGE)) { > + dev_info(&dev->dev, "AER: Error detected but no driver has claimed this device or the driver is AER-unaware\n"); > + result_data->result = PCI_ERS_RESULT_NONE; > + return 1; This doesn't seem right because returning 1 here causes pci_walk_bus() to terminate, which means we won't set dev->error_state or notify drivers for any devices we haven't visited yet. > + } > if (!dev->driver || > !dev->driver->err_handler || > !dev->driver->err_handler->error_detected) { If none of the drivers in the affected hierarchy supports error handling, I think the call tree looks like this: do_recovery # uncorrectable only broadcast_error_message(dev, ..., report_error_detected) result_data.result = CAN_RECOVER pci_walk_bus(..., report_error_detected) report_error_detected # (each dev in subtree) return 0 # no change to result return result_data.result broadcast_error_message(dev, ..., report_mmio_enabled) result_data.result = PCI_ERS_RESULT_RECOVERED pci_walk_bus(..., report_mmio_enabled) report_mmio_enabled # (each dev in subtree) return 0 # no change to result dev_info("recovery successful") Specifically, there are no error_handler functions, so we never change result_data.result, and the default is that we treat the error as "recovered successfully." That seems broken. An uncorrectable error is by definition recoverable only by device-specific software, i.e., the driver. We didn't call any drivers, so we can't have recovered anything. What do you think of the following alternative? I don't know why you checked for bridge devices in your patch, so I don't know whether that's important here or not. diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c index 06bad96..a109c68 100644 --- a/drivers/pci/pcie/aer/aerdrv_core.c +++ b/drivers/pci/pcie/aer/aerdrv_core.c @@ -231,11 +231,11 @@ static int report_error_detected(struct pci_dev *dev, void *data) dev->driver ? "no AER-aware driver" : "no driver"); } - return 0; + vote = PCI_ERS_RESULT_DISCONNECT; + } else { + err_handler = dev->driver->err_handler; + vote = err_handler->error_detected(dev, result_data->state); } - - err_handler = dev->driver->err_handler; - vote = err_handler->error_detected(dev, result_data->state); result_data->result = merge_result(result_data->result, vote); return 0; } ^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [ PATCH RESEND ] PCI-AER: Do not report successful error recovery for devices with AER-unaware drivers 2012-11-15 0:51 ` Bjorn Helgaas @ 2012-11-15 1:35 ` Linas Vepstas 2012-11-15 7:09 ` Pandarathil, Vijaymohan R 1 sibling, 0 replies; 6+ messages in thread From: Linas Vepstas @ 2012-11-15 1:35 UTC (permalink / raw) To: Bjorn Helgaas Cc: Pandarathil, Vijaymohan R, linux-kernel, linux-pci, Myron Stowe, Lance Ortiz, Huang Ying, Hidetoshi Seto, Andrew Patterson, Zhang Yanmin Yes, what you describe appears to be the correct semantics; this would then be the more correct patch. Read-the-email-but-didn't-try-to-test-by: Linas Vepstas <linasvepstas <at> gmail.com> -- Linas On 14 November 2012 18:51, Bjorn Helgaas <bhelgaas@google.com> wrote: > > [+cc Lance, Huang, Hidetoshi, Andrew, Zhang] > > On Sat, Nov 10, 2012 at 07:41:04AM +0000, Pandarathil, Vijaymohan R wrote: > > When an error is detected on a PCIe device which does not have an > > AER-aware driver, prevent AER infrastructure from reporting > > successful error recovery. > > > > This is because the report_error_detected() function that gets > > called in the first phase of recovery process allows forward > > progress even when the driver for the device does not have AER > > capabilities. It seems that all callbacks (in pci_error_handlers > > structure) registered by drivers that gets called during error > > recovery are not mandatory. So the intention of the infrastructure > > design seems to be to allow forward progress even when a specific > > callback has not been registered by a driver. However, if error > > handler structure itself has not been registered, it doesn't make > > sense to allow forward progress. > > > > As a result of the current design, in the case of a single device > > having an AER-unaware driver or in the case of any function in a > > multi-function card having an AER-unaware driver, a successful > > recovery is reported. > > > > Typical scenario this happens is when a PCI device is detached > > from a KVM host and the pci-stub driver on the host claims the > > device. The pci-stub driver does not have error handling capabilities > > but the AER infrastructure still reports that the device recovered > > successfully. > > > > The changes proposed here leaves the device in an unrecovered state > > if the driver for the device or for any function in a multi-function > > card does not have error handler structure registered. This reflects > > the true state of the device and prevents any partial recovery (or no > > recovery at all) reported as successful. > > > > Please also see comments from Linas Vepstas at the following link > > http://www.spinics.net/lists/linux-pci/msg18572.html > > > > Reviewed-by: Linas Vepstas <linasvepstas <at> gmail.com> > > Reviewed-by: Myron Stowe <mstowe <at> redhat.com> > > Signed-off-by: Vijay Mohan Pandarathil <vijaymohan.pandarathil <at> hp.com> > > > > --- > > > > drivers/pci/pcie/aer/aerdrv_core.c | 6 ++++++ > > 1 file changed, 6 insertions(+) > > > > diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c > > index 06bad96..030b229 100644 > > --- a/drivers/pci/pcie/aer/aerdrv_core.c > > +++ b/drivers/pci/pcie/aer/aerdrv_core.c > > @@ -215,6 +215,12 @@ static int report_error_detected(struct pci_dev *dev, void *data) > > > > dev->error_state = result_data->state; > > > > + if ((!dev->driver || !dev->driver->err_handler) && > > + !(dev->hdr_type & PCI_HEADER_TYPE_BRIDGE)) { > > + dev_info(&dev->dev, "AER: Error detected but no driver has claimed this device or the driver is AER-unaware\n"); > > + result_data->result = PCI_ERS_RESULT_NONE; > > + return 1; > > This doesn't seem right because returning 1 here causes pci_walk_bus() > to terminate, which means we won't set dev->error_state or notify > drivers for any devices we haven't visited yet. > > > + } > > if (!dev->driver || > > !dev->driver->err_handler || > > !dev->driver->err_handler->error_detected) { > > If none of the drivers in the affected hierarchy supports error handling, > I think the call tree looks like this: > > do_recovery # uncorrectable only > broadcast_error_message(dev, ..., report_error_detected) > result_data.result = CAN_RECOVER > pci_walk_bus(..., report_error_detected) > report_error_detected # (each dev in subtree) > return 0 # no change to result > return result_data.result > broadcast_error_message(dev, ..., report_mmio_enabled) > result_data.result = PCI_ERS_RESULT_RECOVERED > pci_walk_bus(..., report_mmio_enabled) > report_mmio_enabled # (each dev in subtree) > return 0 # no change to result > dev_info("recovery successful") > > Specifically, there are no error_handler functions, so we never change > result_data.result, and the default is that we treat the error as > "recovered successfully." That seems broken. An uncorrectable error > is by definition recoverable only by device-specific software, i.e., > the driver. We didn't call any drivers, so we can't have recovered > anything. > > What do you think of the following alternative? I don't know why you > checked for bridge devices in your patch, so I don't know whether > that's important here or not. > > diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c > index 06bad96..a109c68 100644 > --- a/drivers/pci/pcie/aer/aerdrv_core.c > +++ b/drivers/pci/pcie/aer/aerdrv_core.c > @@ -231,11 +231,11 @@ static int report_error_detected(struct pci_dev *dev, void *data) > dev->driver ? > "no AER-aware driver" : "no driver"); > } > - return 0; > + vote = PCI_ERS_RESULT_DISCONNECT; > + } else { > + err_handler = dev->driver->err_handler; > + vote = err_handler->error_detected(dev, result_data->state); > } > - > - err_handler = dev->driver->err_handler; > - vote = err_handler->error_detected(dev, result_data->state); > result_data->result = merge_result(result_data->result, vote); > return 0; > } ^ permalink raw reply [flat|nested] 6+ messages in thread
* RE: [ PATCH RESEND ] PCI-AER: Do not report successful error recovery for devices with AER-unaware drivers 2012-11-15 0:51 ` Bjorn Helgaas 2012-11-15 1:35 ` Linas Vepstas @ 2012-11-15 7:09 ` Pandarathil, Vijaymohan R 2012-11-16 1:08 ` Bjorn Helgaas 1 sibling, 1 reply; 6+ messages in thread From: Pandarathil, Vijaymohan R @ 2012-11-15 7:09 UTC (permalink / raw) To: Bjorn Helgaas Cc: linux-kernel, linux-pci, linasvepstas, Myron Stowe, Ortiz, Lance E, Huang Ying, Hidetoshi Seto, Patterson, Andrew D (LeftHand Networks), Zhang Yanmin Thanks for the comments. See my response below. > -----Original Message----- > From: Bjorn Helgaas [mailto:bhelgaas@google.com] > Sent: Wednesday, November 14, 2012 4:51 PM > To: Pandarathil, Vijaymohan R > Cc: linux-kernel@vger.kernel.org; linux-pci@vger.kernel.org; > linasvepstas@gmail.com; Myron Stowe; Ortiz, Lance E; Huang Ying; Hidetoshi > Seto; Patterson, Andrew D (LeftHand Networks); Zhang Yanmin > Subject: Re: [ PATCH RESEND ] PCI-AER: Do not report successful error > recovery for devices with AER-unaware drivers > > [+cc Lance, Huang, Hidetoshi, Andrew, Zhang] > > On Sat, Nov 10, 2012 at 07:41:04AM +0000, Pandarathil, Vijaymohan R wrote: > > When an error is detected on a PCIe device which does not have an > > AER-aware driver, prevent AER infrastructure from reporting > > successful error recovery. > > > > This is because the report_error_detected() function that gets > > called in the first phase of recovery process allows forward > > progress even when the driver for the device does not have AER > > capabilities. It seems that all callbacks (in pci_error_handlers > > structure) registered by drivers that gets called during error > > recovery are not mandatory. So the intention of the infrastructure > > design seems to be to allow forward progress even when a specific > > callback has not been registered by a driver. However, if error > > handler structure itself has not been registered, it doesn't make > > sense to allow forward progress. > > > > As a result of the current design, in the case of a single device > > having an AER-unaware driver or in the case of any function in a > > multi-function card having an AER-unaware driver, a successful > > recovery is reported. > > > > Typical scenario this happens is when a PCI device is detached > > from a KVM host and the pci-stub driver on the host claims the > > device. The pci-stub driver does not have error handling capabilities > > but the AER infrastructure still reports that the device recovered > > successfully. > > > > The changes proposed here leaves the device in an unrecovered state > > if the driver for the device or for any function in a multi-function > > card does not have error handler structure registered. This reflects > > the true state of the device and prevents any partial recovery (or no > > recovery at all) reported as successful. > > > > Please also see comments from Linas Vepstas at the following link > > http://www.spinics.net/lists/linux-pci/msg18572.html > > > > Reviewed-by: Linas Vepstas <linasvepstas <at> gmail.com> > > Reviewed-by: Myron Stowe <mstowe <at> redhat.com> > > Signed-off-by: Vijay Mohan Pandarathil <vijaymohan.pandarathil <at> > hp.com> > > > > --- > > > > drivers/pci/pcie/aer/aerdrv_core.c | 6 ++++++ > > 1 file changed, 6 insertions(+) > > > > diff --git a/drivers/pci/pcie/aer/aerdrv_core.c > b/drivers/pci/pcie/aer/aerdrv_core.c > > index 06bad96..030b229 100644 > > --- a/drivers/pci/pcie/aer/aerdrv_core.c > > +++ b/drivers/pci/pcie/aer/aerdrv_core.c > > @@ -215,6 +215,12 @@ static int report_error_detected(struct pci_dev > *dev, void *data) > > > > dev->error_state = result_data->state; > > > > + if ((!dev->driver || !dev->driver->err_handler) && > > + !(dev->hdr_type & PCI_HEADER_TYPE_BRIDGE)) { > > + dev_info(&dev->dev, "AER: Error detected but no driver has > claimed this device or the driver is AER-unaware\n"); > > + result_data->result = PCI_ERS_RESULT_NONE; > > + return 1; > > This doesn't seem right because returning 1 here causes pci_walk_bus() > to terminate, which means we won't set dev->error_state or notify > drivers for any devices we haven't visited yet. > > > + } > > if (!dev->driver || > > !dev->driver->err_handler || > > !dev->driver->err_handler->error_detected) { > > If none of the drivers in the affected hierarchy supports error handling, > I think the call tree looks like this: > > do_recovery # uncorrectable only > broadcast_error_message(dev, ..., report_error_detected) > result_data.result = CAN_RECOVER > pci_walk_bus(..., report_error_detected) > report_error_detected # (each dev in subtree) > return 0 # no change to result > return result_data.result > broadcast_error_message(dev, ..., report_mmio_enabled) > result_data.result = PCI_ERS_RESULT_RECOVERED > pci_walk_bus(..., report_mmio_enabled) > report_mmio_enabled # (each dev in subtree) > return 0 # no change to result > dev_info("recovery successful") > > Specifically, there are no error_handler functions, so we never change > result_data.result, and the default is that we treat the error as > "recovered successfully." That seems broken. An uncorrectable error > is by definition recoverable only by device-specific software, i.e., > the driver. We didn't call any drivers, so we can't have recovered > anything. > > What do you think of the following alternative? I don't know why you > checked for bridge devices in your patch, so I don't know whether > that's important here or not. > > diff --git a/drivers/pci/pcie/aer/aerdrv_core.c > b/drivers/pci/pcie/aer/aerdrv_core.c > index 06bad96..a109c68 100644 > --- a/drivers/pci/pcie/aer/aerdrv_core.c > +++ b/drivers/pci/pcie/aer/aerdrv_core.c > @@ -231,11 +231,11 @@ static int report_error_detected(struct pci_dev *dev, > void *data) > dev->driver ? > "no AER-aware driver" : "no driver"); > } > - return 0; > + vote = PCI_ERS_RESULT_DISCONNECT; > + } else { > + err_handler = dev->driver->err_handler; > + vote = err_handler->error_detected(dev, result_data->state); > } > - > - err_handler = dev->driver->err_handler; > - vote = err_handler->error_detected(dev, result_data->state); > result_data->result = merge_result(result_data->result, vote); > return 0; > } This would definitely set the error_state for all devices correctly. However, with the current implementation of merge_result(), won't we still end up reporting successful recovery ? The following case statement in merge_result() can set back the result from PCI_ERS_RESULT_DISCONNECT to PCI_ERS_RESULT_NEED_RESET for a subsequent device on the bus which returned PCI_ERS_RESULT_NEED_RESET from its error_detected callback . merge_result() ... case PCI_ERS_RESULT_DISCONNECT: if (new == PCI_ERS_RESULT_NEED_RESET) orig = new; break; This would mean do_recovery() proceeds along to the next broadcast_message and ultimately report success. Right ? Let me know if I am missing something. I looked at a few options and the following looked more promising. Thoughts ? Introduce a new pci_ers_result enum PCI_ERS_RESULT_NO_AER_DRIVER and make changes as follows. diff --git a/drivers/pci/pcie/aer/aerdrv.h b/drivers/pci/pcie/aer/aerdrv.h index 94a7598..149ba79 100644 --- a/drivers/pci/pcie/aer/aerdrv.h +++ b/drivers/pci/pcie/aer/aerdrv.h @@ -87,6 +87,9 @@ struct aer_broadcast_data { static inline pci_ers_result_t merge_result(enum pci_ers_result orig, enum pci_ers_result new) { + if (new == PCI_ERS_RESULT_NO_AER_DRIVER) + return new; + if (new == PCI_ERS_RESULT_NONE) return orig; diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c index 06bad96..729580a 100644 --- a/drivers/pci/pcie/aer/aerdrv_core.c +++ b/drivers/pci/pcie/aer/aerdrv_core.c @@ -231,11 +231,12 @@ static int report_error_detected(struct pci_dev *dev, void *data) dev->driver ? "no AER-aware driver" : "no driver"); } - return 0; + vote = PCI_ERS_RESULT_NO_AER_DRIVER; + } else { + err_handler = dev->driver->err_handler; + vote = err_handler->error_detected(dev, result_data->state); } - err_handler = dev->driver->err_handler; - vote = err_handler->error_detected(dev, result_data->state); result_data->result = merge_result(result_data->result, vote); return 0; } diff --git a/include/linux/pci.h b/include/linux/pci.h index ee21795..fb7e869 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -538,6 +538,9 @@ enum pci_ers_result { /* Device driver is fully recovered and operational */ PCI_ERS_RESULT_RECOVERED = (__force pci_ers_result_t) 5, + + /* No AER capabilities registered for the driver */ + PCI_ERS_RESULT_NO_AER_DRIVER = (__force pci_ers_result_t) 6, }; /* PCI bus error event callbacks */ ^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [ PATCH RESEND ] PCI-AER: Do not report successful error recovery for devices with AER-unaware drivers 2012-11-15 7:09 ` Pandarathil, Vijaymohan R @ 2012-11-16 1:08 ` Bjorn Helgaas 2012-11-16 4:26 ` Pandarathil, Vijaymohan R 0 siblings, 1 reply; 6+ messages in thread From: Bjorn Helgaas @ 2012-11-16 1:08 UTC (permalink / raw) To: Pandarathil, Vijaymohan R Cc: linux-kernel, linux-pci, linasvepstas, Myron Stowe, Ortiz, Lance E, Huang Ying, Hidetoshi Seto, Patterson, Andrew D (LeftHand Networks), Zhang Yanmin On Thu, Nov 15, 2012 at 12:09 AM, Pandarathil, Vijaymohan R <vijaymohan.pandarathil@hp.com> wrote: > Thanks for the comments. See my response below. > >> -----Original Message----- >> From: Bjorn Helgaas [mailto:bhelgaas@google.com] >> Sent: Wednesday, November 14, 2012 4:51 PM >> To: Pandarathil, Vijaymohan R >> Cc: linux-kernel@vger.kernel.org; linux-pci@vger.kernel.org; >> linasvepstas@gmail.com; Myron Stowe; Ortiz, Lance E; Huang Ying; Hidetoshi >> Seto; Patterson, Andrew D (LeftHand Networks); Zhang Yanmin >> Subject: Re: [ PATCH RESEND ] PCI-AER: Do not report successful error >> recovery for devices with AER-unaware drivers >> >> [+cc Lance, Huang, Hidetoshi, Andrew, Zhang] >> >> On Sat, Nov 10, 2012 at 07:41:04AM +0000, Pandarathil, Vijaymohan R wrote: >> > When an error is detected on a PCIe device which does not have an >> > AER-aware driver, prevent AER infrastructure from reporting >> > successful error recovery. >> > >> > This is because the report_error_detected() function that gets >> > called in the first phase of recovery process allows forward >> > progress even when the driver for the device does not have AER >> > capabilities. It seems that all callbacks (in pci_error_handlers >> > structure) registered by drivers that gets called during error >> > recovery are not mandatory. So the intention of the infrastructure >> > design seems to be to allow forward progress even when a specific >> > callback has not been registered by a driver. However, if error >> > handler structure itself has not been registered, it doesn't make >> > sense to allow forward progress. >> > >> > As a result of the current design, in the case of a single device >> > having an AER-unaware driver or in the case of any function in a >> > multi-function card having an AER-unaware driver, a successful >> > recovery is reported. >> > >> > Typical scenario this happens is when a PCI device is detached >> > from a KVM host and the pci-stub driver on the host claims the >> > device. The pci-stub driver does not have error handling capabilities >> > but the AER infrastructure still reports that the device recovered >> > successfully. >> > >> > The changes proposed here leaves the device in an unrecovered state >> > if the driver for the device or for any function in a multi-function >> > card does not have error handler structure registered. This reflects >> > the true state of the device and prevents any partial recovery (or no >> > recovery at all) reported as successful. >> > >> > Please also see comments from Linas Vepstas at the following link >> > http://www.spinics.net/lists/linux-pci/msg18572.html >> > >> > Reviewed-by: Linas Vepstas <linasvepstas <at> gmail.com> >> > Reviewed-by: Myron Stowe <mstowe <at> redhat.com> >> > Signed-off-by: Vijay Mohan Pandarathil <vijaymohan.pandarathil <at> >> hp.com> >> > >> > --- >> > >> > drivers/pci/pcie/aer/aerdrv_core.c | 6 ++++++ >> > 1 file changed, 6 insertions(+) >> > >> > diff --git a/drivers/pci/pcie/aer/aerdrv_core.c >> b/drivers/pci/pcie/aer/aerdrv_core.c >> > index 06bad96..030b229 100644 >> > --- a/drivers/pci/pcie/aer/aerdrv_core.c >> > +++ b/drivers/pci/pcie/aer/aerdrv_core.c >> > @@ -215,6 +215,12 @@ static int report_error_detected(struct pci_dev >> *dev, void *data) >> > >> > dev->error_state = result_data->state; >> > >> > + if ((!dev->driver || !dev->driver->err_handler) && >> > + !(dev->hdr_type & PCI_HEADER_TYPE_BRIDGE)) { >> > + dev_info(&dev->dev, "AER: Error detected but no driver has >> claimed this device or the driver is AER-unaware\n"); >> > + result_data->result = PCI_ERS_RESULT_NONE; >> > + return 1; >> >> This doesn't seem right because returning 1 here causes pci_walk_bus() >> to terminate, which means we won't set dev->error_state or notify >> drivers for any devices we haven't visited yet. >> >> > + } >> > if (!dev->driver || >> > !dev->driver->err_handler || >> > !dev->driver->err_handler->error_detected) { >> >> If none of the drivers in the affected hierarchy supports error handling, >> I think the call tree looks like this: >> >> do_recovery # uncorrectable only >> broadcast_error_message(dev, ..., report_error_detected) >> result_data.result = CAN_RECOVER >> pci_walk_bus(..., report_error_detected) >> report_error_detected # (each dev in subtree) >> return 0 # no change to result >> return result_data.result >> broadcast_error_message(dev, ..., report_mmio_enabled) >> result_data.result = PCI_ERS_RESULT_RECOVERED >> pci_walk_bus(..., report_mmio_enabled) >> report_mmio_enabled # (each dev in subtree) >> return 0 # no change to result >> dev_info("recovery successful") >> >> Specifically, there are no error_handler functions, so we never change >> result_data.result, and the default is that we treat the error as >> "recovered successfully." That seems broken. An uncorrectable error >> is by definition recoverable only by device-specific software, i.e., >> the driver. We didn't call any drivers, so we can't have recovered >> anything. >> >> What do you think of the following alternative? I don't know why you >> checked for bridge devices in your patch, so I don't know whether >> that's important here or not. >> >> diff --git a/drivers/pci/pcie/aer/aerdrv_core.c >> b/drivers/pci/pcie/aer/aerdrv_core.c >> index 06bad96..a109c68 100644 >> --- a/drivers/pci/pcie/aer/aerdrv_core.c >> +++ b/drivers/pci/pcie/aer/aerdrv_core.c >> @@ -231,11 +231,11 @@ static int report_error_detected(struct pci_dev *dev, >> void *data) >> dev->driver ? >> "no AER-aware driver" : "no driver"); >> } >> - return 0; >> + vote = PCI_ERS_RESULT_DISCONNECT; >> + } else { >> + err_handler = dev->driver->err_handler; >> + vote = err_handler->error_detected(dev, result_data->state); >> } >> - >> - err_handler = dev->driver->err_handler; >> - vote = err_handler->error_detected(dev, result_data->state); >> result_data->result = merge_result(result_data->result, vote); >> return 0; >> } > > This would definitely set the error_state for all devices correctly. However, with the > current implementation of merge_result(), won't we still end up reporting successful > recovery ? The following case statement in merge_result() can set back the result > from PCI_ERS_RESULT_DISCONNECT to PCI_ERS_RESULT_NEED_RESET for a subsequent device > on the bus which returned PCI_ERS_RESULT_NEED_RESET from its error_detected callback . > > merge_result() > ... > case PCI_ERS_RESULT_DISCONNECT: > if (new == PCI_ERS_RESULT_NEED_RESET) > orig = new; > break; > > This would mean do_recovery() proceeds along to the next broadcast_message and > ultimately report success. Right ? Let me know if I am missing something. Yes, I think you're right. I only looked at the case where none of the devices in the subtree had drivers. > I looked at a few options and the following looked more promising. Thoughts ? > > Introduce a new pci_ers_result enum PCI_ERS_RESULT_NO_AER_DRIVER and make changes as follows. I don't really like this new enum because it's not really obvious how it's different from PCI_ERS_RESULT_NONE and, more importantly, it makes merge_result() even more of a kludge than it already is. It's hard to write a nice simple description of this algorithm (visiting all the devices in a subtree, collecting error handler results, and merging them). It would be a lot easier to describe if merge_result() could be written simply as "max()", but I'm not sure the pci_ers_result codes could be ordered in a way that would make that possible. And the desired semantics might make it impossible, too. I think the intent of your patch is that if there's any device in the subtree that lacks an .error_detected() method, we do not call .mmio_enabled() or .slot_reset() or .resume() for *any* device in the subtree. Right? So maybe this is the best we can do, and it certainly seems better than what we have now. Can you repost this as a fresh v2 patch? > diff --git a/drivers/pci/pcie/aer/aerdrv.h b/drivers/pci/pcie/aer/aerdrv.h > index 94a7598..149ba79 100644 > --- a/drivers/pci/pcie/aer/aerdrv.h > +++ b/drivers/pci/pcie/aer/aerdrv.h > @@ -87,6 +87,9 @@ struct aer_broadcast_data { > static inline pci_ers_result_t merge_result(enum pci_ers_result orig, > enum pci_ers_result new) > { > + if (new == PCI_ERS_RESULT_NO_AER_DRIVER) > + return new; BTW, if you keep this, please just use "return PCI_ERS_RESULT_NO_AER_DRIVER" rather than "return new" since we *know* what we're returning. I think there's another instance of this in merge_result() that you could fix, too. > + > if (new == PCI_ERS_RESULT_NONE) > return orig; > > diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c > index 06bad96..729580a 100644 > --- a/drivers/pci/pcie/aer/aerdrv_core.c > +++ b/drivers/pci/pcie/aer/aerdrv_core.c > @@ -231,11 +231,12 @@ static int report_error_detected(struct pci_dev *dev, void *data) > dev->driver ? > "no AER-aware driver" : "no driver"); > } > - return 0; > + vote = PCI_ERS_RESULT_NO_AER_DRIVER; > + } else { > + err_handler = dev->driver->err_handler; > + vote = err_handler->error_detected(dev, result_data->state); > } > > - err_handler = dev->driver->err_handler; > - vote = err_handler->error_detected(dev, result_data->state); > result_data->result = merge_result(result_data->result, vote); > return 0; > } > diff --git a/include/linux/pci.h b/include/linux/pci.h > index ee21795..fb7e869 100644 > --- a/include/linux/pci.h > +++ b/include/linux/pci.h > @@ -538,6 +538,9 @@ enum pci_ers_result { > > /* Device driver is fully recovered and operational */ > PCI_ERS_RESULT_RECOVERED = (__force pci_ers_result_t) 5, > + > + /* No AER capabilities registered for the driver */ > + PCI_ERS_RESULT_NO_AER_DRIVER = (__force pci_ers_result_t) 6, > }; > > /* PCI bus error event callbacks */ > ^ permalink raw reply [flat|nested] 6+ messages in thread
* RE: [ PATCH RESEND ] PCI-AER: Do not report successful error recovery for devices with AER-unaware drivers 2012-11-16 1:08 ` Bjorn Helgaas @ 2012-11-16 4:26 ` Pandarathil, Vijaymohan R 0 siblings, 0 replies; 6+ messages in thread From: Pandarathil, Vijaymohan R @ 2012-11-16 4:26 UTC (permalink / raw) To: Bjorn Helgaas Cc: linux-kernel, linux-pci, linasvepstas, Myron Stowe, Ortiz, Lance E, Huang Ying, Hidetoshi Seto, Patterson, Andrew D (LeftHand Networks), Zhang Yanmin > -----Original Message----- > From: Bjorn Helgaas [mailto:bhelgaas@google.com] > Sent: Thursday, November 15, 2012 5:09 PM > To: Pandarathil, Vijaymohan R > Cc: linux-kernel@vger.kernel.org; linux-pci@vger.kernel.org; > linasvepstas@gmail.com; Myron Stowe; Ortiz, Lance E; Huang Ying; Hidetoshi > Seto; Patterson, Andrew D (LeftHand Networks); Zhang Yanmin > Subject: Re: [ PATCH RESEND ] PCI-AER: Do not report successful error > recovery for devices with AER-unaware drivers > > On Thu, Nov 15, 2012 at 12:09 AM, Pandarathil, Vijaymohan R > <vijaymohan.pandarathil@hp.com> wrote: > > Thanks for the comments. See my response below. > > > >> -----Original Message----- > >> From: Bjorn Helgaas [mailto:bhelgaas@google.com] > >> Sent: Wednesday, November 14, 2012 4:51 PM > >> To: Pandarathil, Vijaymohan R > >> Cc: linux-kernel@vger.kernel.org; linux-pci@vger.kernel.org; > >> linasvepstas@gmail.com; Myron Stowe; Ortiz, Lance E; Huang Ying; > Hidetoshi > >> Seto; Patterson, Andrew D (LeftHand Networks); Zhang Yanmin > >> Subject: Re: [ PATCH RESEND ] PCI-AER: Do not report successful error > >> recovery for devices with AER-unaware drivers > >> > >> [+cc Lance, Huang, Hidetoshi, Andrew, Zhang] > >> > >> On Sat, Nov 10, 2012 at 07:41:04AM +0000, Pandarathil, Vijaymohan R > wrote: > >> > When an error is detected on a PCIe device which does not have an > >> > AER-aware driver, prevent AER infrastructure from reporting > >> > successful error recovery. > >> > > >> > This is because the report_error_detected() function that gets > >> > called in the first phase of recovery process allows forward > >> > progress even when the driver for the device does not have AER > >> > capabilities. It seems that all callbacks (in pci_error_handlers > >> > structure) registered by drivers that gets called during error > >> > recovery are not mandatory. So the intention of the infrastructure > >> > design seems to be to allow forward progress even when a specific > >> > callback has not been registered by a driver. However, if error > >> > handler structure itself has not been registered, it doesn't make > >> > sense to allow forward progress. > >> > > >> > As a result of the current design, in the case of a single device > >> > having an AER-unaware driver or in the case of any function in a > >> > multi-function card having an AER-unaware driver, a successful > >> > recovery is reported. > >> > > >> > Typical scenario this happens is when a PCI device is detached > >> > from a KVM host and the pci-stub driver on the host claims the > >> > device. The pci-stub driver does not have error handling capabilities > >> > but the AER infrastructure still reports that the device recovered > >> > successfully. > >> > > >> > The changes proposed here leaves the device in an unrecovered state > >> > if the driver for the device or for any function in a multi-function > >> > card does not have error handler structure registered. This reflects > >> > the true state of the device and prevents any partial recovery (or no > >> > recovery at all) reported as successful. > >> > > >> > Please also see comments from Linas Vepstas at the following link > >> > http://www.spinics.net/lists/linux-pci/msg18572.html > >> > > >> > Reviewed-by: Linas Vepstas <linasvepstas <at> gmail.com> > >> > Reviewed-by: Myron Stowe <mstowe <at> redhat.com> > >> > Signed-off-by: Vijay Mohan Pandarathil <vijaymohan.pandarathil <at> > >> hp.com> > >> > > >> > --- > >> > > >> > drivers/pci/pcie/aer/aerdrv_core.c | 6 ++++++ > >> > 1 file changed, 6 insertions(+) > >> > > >> > diff --git a/drivers/pci/pcie/aer/aerdrv_core.c > >> b/drivers/pci/pcie/aer/aerdrv_core.c > >> > index 06bad96..030b229 100644 > >> > --- a/drivers/pci/pcie/aer/aerdrv_core.c > >> > +++ b/drivers/pci/pcie/aer/aerdrv_core.c > >> > @@ -215,6 +215,12 @@ static int report_error_detected(struct pci_dev > >> *dev, void *data) > >> > > >> > dev->error_state = result_data->state; > >> > > >> > + if ((!dev->driver || !dev->driver->err_handler) && > >> > + !(dev->hdr_type & PCI_HEADER_TYPE_BRIDGE)) { > >> > + dev_info(&dev->dev, "AER: Error detected but no driver has > >> claimed this device or the driver is AER-unaware\n"); > >> > + result_data->result = PCI_ERS_RESULT_NONE; > >> > + return 1; > >> > >> This doesn't seem right because returning 1 here causes pci_walk_bus() > >> to terminate, which means we won't set dev->error_state or notify > >> drivers for any devices we haven't visited yet. > >> > >> > + } > >> > if (!dev->driver || > >> > !dev->driver->err_handler || > >> > !dev->driver->err_handler->error_detected) { > >> > >> If none of the drivers in the affected hierarchy supports error > handling, > >> I think the call tree looks like this: > >> > >> do_recovery # uncorrectable only > >> broadcast_error_message(dev, ..., report_error_detected) > >> result_data.result = CAN_RECOVER > >> pci_walk_bus(..., report_error_detected) > >> report_error_detected # (each dev in subtree) > >> return 0 # no change to result > >> return result_data.result > >> broadcast_error_message(dev, ..., report_mmio_enabled) > >> result_data.result = PCI_ERS_RESULT_RECOVERED > >> pci_walk_bus(..., report_mmio_enabled) > >> report_mmio_enabled # (each dev in subtree) > >> return 0 # no change to result > >> dev_info("recovery successful") > >> > >> Specifically, there are no error_handler functions, so we never change > >> result_data.result, and the default is that we treat the error as > >> "recovered successfully." That seems broken. An uncorrectable error > >> is by definition recoverable only by device-specific software, i.e., > >> the driver. We didn't call any drivers, so we can't have recovered > >> anything. > >> > >> What do you think of the following alternative? I don't know why you > >> checked for bridge devices in your patch, so I don't know whether > >> that's important here or not. > >> > >> diff --git a/drivers/pci/pcie/aer/aerdrv_core.c > >> b/drivers/pci/pcie/aer/aerdrv_core.c > >> index 06bad96..a109c68 100644 > >> --- a/drivers/pci/pcie/aer/aerdrv_core.c > >> +++ b/drivers/pci/pcie/aer/aerdrv_core.c > >> @@ -231,11 +231,11 @@ static int report_error_detected(struct pci_dev > *dev, > >> void *data) > >> dev->driver ? > >> "no AER-aware driver" : "no driver"); > >> } > >> - return 0; > >> + vote = PCI_ERS_RESULT_DISCONNECT; > >> + } else { > >> + err_handler = dev->driver->err_handler; > >> + vote = err_handler->error_detected(dev, result_data- > >state); > >> } > >> - > >> - err_handler = dev->driver->err_handler; > >> - vote = err_handler->error_detected(dev, result_data->state); > >> result_data->result = merge_result(result_data->result, vote); > >> return 0; > >> } > > > > This would definitely set the error_state for all devices correctly. > However, with the > > current implementation of merge_result(), won't we still end up reporting > successful > > recovery ? The following case statement in merge_result() can set back > the result > > from PCI_ERS_RESULT_DISCONNECT to PCI_ERS_RESULT_NEED_RESET for a > subsequent device > > on the bus which returned PCI_ERS_RESULT_NEED_RESET from its > error_detected callback . > > > > merge_result() > > ... > > case PCI_ERS_RESULT_DISCONNECT: > > if (new == PCI_ERS_RESULT_NEED_RESET) > > orig = new; > > break; > > > > This would mean do_recovery() proceeds along to the next > broadcast_message and > > ultimately report success. Right ? Let me know if I am missing something. > > Yes, I think you're right. I only looked at the case where none of > the devices in the subtree had drivers. > > > I looked at a few options and the following looked more promising. > Thoughts ? > > > > Introduce a new pci_ers_result enum PCI_ERS_RESULT_NO_AER_DRIVER and make > changes as follows. > > I don't really like this new enum because it's not really obvious how > it's different from PCI_ERS_RESULT_NONE and, more importantly, it > makes merge_result() even more of a kludge than it already is. > > It's hard to write a nice simple description of this algorithm > (visiting all the devices in a subtree, collecting error handler > results, and merging them). It would be a lot easier to describe if > merge_result() could be written simply as "max()", but I'm not sure > the pci_ers_result codes could be ordered in a way that would make > that possible. And the desired semantics might make it impossible, > too. > > I think the intent of your patch is that if there's any device in the > subtree that lacks an .error_detected() method, we do not call > .mmio_enabled() or .slot_reset() or .resume() for *any* device in the > subtree. Right? Right. BTW, I did look at usage of PCI_ERS_RESULT_NONE in all places. It seemed to me that the intended semantics of PCI_ERS_RESULT_NONE may not have been followed everywhere and I did not want to take a risk of breaking something else at this time. I will see if I can cleanup merge_result() as a separate patch in future. > > So maybe this is the best we can do, and it certainly seems better > than what we have now. Can you repost this as a fresh v2 patch? Okay. I will post a v2 patch. > > > diff --git a/drivers/pci/pcie/aer/aerdrv.h > b/drivers/pci/pcie/aer/aerdrv.h > > index 94a7598..149ba79 100644 > > --- a/drivers/pci/pcie/aer/aerdrv.h > > +++ b/drivers/pci/pcie/aer/aerdrv.h > > @@ -87,6 +87,9 @@ struct aer_broadcast_data { > > static inline pci_ers_result_t merge_result(enum pci_ers_result orig, > > enum pci_ers_result new) > > { > > + if (new == PCI_ERS_RESULT_NO_AER_DRIVER) > > + return new; > > BTW, if you keep this, please just use "return > PCI_ERS_RESULT_NO_AER_DRIVER" rather than "return new" since we *know* > what we're returning. I think there's another instance of this in > merge_result() that you could fix, too. Will do. > > > + > > if (new == PCI_ERS_RESULT_NONE) > > return orig; > > > > diff --git a/drivers/pci/pcie/aer/aerdrv_core.c > b/drivers/pci/pcie/aer/aerdrv_core.c > > index 06bad96..729580a 100644 > > --- a/drivers/pci/pcie/aer/aerdrv_core.c > > +++ b/drivers/pci/pcie/aer/aerdrv_core.c > > @@ -231,11 +231,12 @@ static int report_error_detected(struct pci_dev > *dev, void *data) > > dev->driver ? > > "no AER-aware driver" : "no driver"); > > } > > - return 0; > > + vote = PCI_ERS_RESULT_NO_AER_DRIVER; > > + } else { > > + err_handler = dev->driver->err_handler; > > + vote = err_handler->error_detected(dev, result_data- > >state); > > } > > > > - err_handler = dev->driver->err_handler; > > - vote = err_handler->error_detected(dev, result_data->state); > > result_data->result = merge_result(result_data->result, vote); > > return 0; > > } > > diff --git a/include/linux/pci.h b/include/linux/pci.h > > index ee21795..fb7e869 100644 > > --- a/include/linux/pci.h > > +++ b/include/linux/pci.h > > @@ -538,6 +538,9 @@ enum pci_ers_result { > > > > /* Device driver is fully recovered and operational */ > > PCI_ERS_RESULT_RECOVERED = (__force pci_ers_result_t) 5, > > + > > + /* No AER capabilities registered for the driver */ > > + PCI_ERS_RESULT_NO_AER_DRIVER = (__force pci_ers_result_t) 6, > > }; > > > > /* PCI bus error event callbacks */ > > ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2012-11-16 4:33 UTC | newest] Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2012-11-10 7:41 [ PATCH RESEND ] PCI-AER: Do not report successful error recovery for devices with AER-unaware drivers Pandarathil, Vijaymohan R 2012-11-15 0:51 ` Bjorn Helgaas 2012-11-15 1:35 ` Linas Vepstas 2012-11-15 7:09 ` Pandarathil, Vijaymohan R 2012-11-16 1:08 ` Bjorn Helgaas 2012-11-16 4:26 ` Pandarathil, Vijaymohan R
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).