* Re: pci_do_recovery not handling fata errors
       [not found] <MN2PR10MB4093188B8CDC659AE68E5640996F9@MN2PR10MB4093.namprd10.prod.outlook.com>
@ 2021-03-12 22:25 ` Kelley, Sean V
  2021-03-12 22:57   ` James Puthukattukaran
  0 siblings, 1 reply; 7+ messages in thread
From: Kelley, Sean V @ 2021-03-12 22:25 UTC (permalink / raw)
  To: James Puthukattukaran, Kuppuswamy, Sathyanarayanan; +Cc: Linux PCI, bhelgaas



> On Mar 12, 2021, at 12:56 PM, James Puthukattukaran <james.puthukattukaran@oracle.com> wrote:
> 
> Hi -
> I’m trying to understand why pcie_do_recovery() clears only non-fatal errors and not fatal errors. My immediate concern is the call from dpc_handler(). If a device sends an ERR_FATAL to the root port, I would expect recovery to clear the fatal status in the AER registers of the endpoint device. Is that not the case?
>  


Adding Sathya who mentioned to me that:

Fatal errors are cleared in

void dpc_process_error(struct pci_dev *pdev)

                 dpc_get_aer_uncorrect_severity(pdev, &info) &&
                 aer_get_device_error_info(pdev, &info)) {
                aer_print_error(pdev, &info);
                pci_aer_clear_nonfatal_status(pdev);
                pci_aer_clear_fatal_status(pdev);

Thanks,

Sean

> Snippet of concern in pcie_do_recovery():
>  
>         /*
>          * If we have native control of AER, clear error status in the Root
>          * Port or Downstream Port that signaled the error.  If the
>          * platform retained control of AER, it is responsible for clearing
>          * this status.  In that case, the signaling device may not even be
>          * visible to the OS.
>          */
>         if (host->native_aer || pcie_ports_native) {
>                 pcie_clear_device_status(bridge);
>                 pci_aer_clear_nonfatal_status(bridge);   <<<< Just clearing nonfatal. What about fatal?
>         }
>  
> Thanks
> James



* RE: pci_do_recovery not handling fata errors
  2021-03-12 22:25 ` pci_do_recovery not handling fata errors Kelley, Sean V
@ 2021-03-12 22:57   ` James Puthukattukaran
  2021-03-13 17:11     ` Keith Busch
  0 siblings, 1 reply; 7+ messages in thread
From: James Puthukattukaran @ 2021-03-12 22:57 UTC (permalink / raw)
  To: Kelley, Sean V, Kuppuswamy, Sathyanarayanan; +Cc: Linux PCI, bhelgaas

But dpc_process_error() clears the fatal status only when the DPC trigger reason is "unmasked uncorrectable error".
If the trigger reason is ERR_FATAL, the else clause is not taken, and the status is not cleared in pcie_do_recovery() either.

From dpc_process_error() with more context:

        else if (reason == 0 &&  <<<<<<< only for "unmasked uncorrectable". What about ERR_FATAL?
                 dpc_get_aer_uncorrect_severity(pdev, &info) &&
                 aer_get_device_error_info(pdev, &info)) {
                aer_print_error(pdev, &info);
                pci_aer_clear_nonfatal_status(pdev);
                pci_aer_clear_fatal_status(pdev);
        }
 




* Re: pci_do_recovery not handling fata errors
  2021-03-12 22:57   ` James Puthukattukaran
@ 2021-03-13 17:11     ` Keith Busch
  2021-03-16 21:13       ` [External] : " James Puthukattukaran
  0 siblings, 1 reply; 7+ messages in thread
From: Keith Busch @ 2021-03-13 17:11 UTC (permalink / raw)
  To: James Puthukattukaran
  Cc: Kelley, Sean V, Kuppuswamy, Sathyanarayanan, Linux PCI, bhelgaas

On Fri, Mar 12, 2021 at 10:57:18PM +0000, James Puthukattukaran wrote:
> But the clearing of fatal error in the dpc_process_error is only for DPC trigger due to "unmaskable uncorrectable". 
> If the trigger reason is ERR_FATAL, then it does not hit the else clause and neither is it cleared in the pci_do_recovery code.

If the reason is ERR_FATAL, then the port didn't detect the error; it is
just the first DPC capable downstream port to receive the message from
some device downstream, so there's nothing to clear in its AER register.
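
For reference, the trigger reason discussed here is a two-bit field in the DPC Status register. Below is a minimal userspace sketch of the decode. The macro, enum, and function names are made up for this illustration (they are not the kernel's identifiers); only the bit layout is taken from the PCIe DPC capability definition.

```c
#include <stdint.h>

/* Trigger Reason occupies bits 2:1 of the DPC Status register.
 * Name is local to this sketch, not a kernel macro. */
#define DPC_STATUS_TRIGGER_RSN_MASK 0x0006u

/* Hypothetical enum for illustration; values follow the field encoding. */
enum dpc_trigger_reason {
	DPC_RSN_UNMASKED_UNCORRECTABLE = 0, /* the port itself detected the error */
	DPC_RSN_ERR_NONFATAL = 1,           /* the port received an ERR_NONFATAL message */
	DPC_RSN_ERR_FATAL = 2,              /* the port received an ERR_FATAL message */
	DPC_RSN_EXTENDED = 3,               /* see the Trigger Reason Extension field */
};

/* Extract the trigger reason from a raw DPC Status value. */
static enum dpc_trigger_reason dpc_trigger_reason(uint16_t status)
{
	return (enum dpc_trigger_reason)((status & DPC_STATUS_TRIGGER_RSN_MASK) >> 1);
}

/* Returns 1 only when the port's own AER status is expected to hold the
 * error, i.e. when the port detected it rather than received a message. */
static int port_aer_has_error(uint16_t status)
{
	return dpc_trigger_reason(status) == DPC_RSN_UNMASKED_UNCORRECTABLE;
}
```

With a status of 0x0005 (trigger bit set, reason field 2), the decode yields ERR_FATAL and port_aer_has_error() returns 0, which matches the point that the port has nothing of its own to clear in that case.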
 
> From dpc_process_error with more context -- 
> 
>        else if (reason == 0 &&  <<<<<<< only for "unmaskable uncorrectable". What about for ERR_FATAL?
>                  dpc_get_aer_uncorrect_severity(pdev, &info) &&
>                  aer_get_device_error_info(pdev, &info)) {
>                 aer_print_error(pdev, &info);
>                 pci_aer_clear_nonfatal_status(pdev);
>                 pci_aer_clear_fatal_status(pdev);
>         }
>  
> 


* RE: [External] : Re: pci_do_recovery not handling fata errors
  2021-03-13 17:11     ` Keith Busch
@ 2021-03-16 21:13       ` James Puthukattukaran
  2021-03-16 21:51         ` Keith Busch
  0 siblings, 1 reply; 7+ messages in thread
From: James Puthukattukaran @ 2021-03-16 21:13 UTC (permalink / raw)
  To: Keith Busch
  Cc: Kelley, Sean V, Kuppuswamy, Sathyanarayanan, Linux PCI, bhelgaas

Keith -
I understand that the RP did not detect the error, so there is nothing to clear in its AER register. My question is: where is the fatal error status cleared in the AER registers of the device that caused the fatal error? It does not seem to be done in pcie_do_recovery() while walking the hierarchy (unless I'm missing it).




* Re: [External] : Re: pci_do_recovery not handling fata errors
  2021-03-16 21:13       ` [External] : " James Puthukattukaran
@ 2021-03-16 21:51         ` Keith Busch
  2021-04-01  2:15           ` James Puthukattukaran
  0 siblings, 1 reply; 7+ messages in thread
From: Keith Busch @ 2021-03-16 21:51 UTC (permalink / raw)
  To: James Puthukattukaran
  Cc: Kelley, Sean V, Kuppuswamy, Sathyanarayanan, Linux PCI, bhelgaas

On Tue, Mar 16, 2021 at 09:13:56PM +0000, James Puthukattukaran wrote:
> Keith -
> I understand that the RP did not detect the error and so nothing to
> clear in its AER register. My question is - where is the fatal error
> register cleared in the device's (the device that was the cause of the
> fata error) AER register? It does not seem to be done in
> pci_do_recovery walking the hierarchy (unless I'm missing it)....

Gotcha.

All PCI drivers that implement error handling should be calling
pci_restore_state() somewhere in their error recovery callbacks
(typically .slot_reset()), which invokes pci_aer_clear_status() to
clear the device's AER status.
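
To make the difference between the clearing helpers concrete, here is a small userspace model of the AER Uncorrectable Error Status register, which is write-1-to-clear. The struct and function names are hypothetical. The helpers only approximate what pci_aer_clear_nonfatal_status(), pci_aer_clear_fatal_status(), and pci_aer_clear_status() are understood to do, assuming a severity bit of 1 marks the corresponding error as fatal.

```c
#include <stdint.h>

/* Hypothetical model of two AER registers; not a kernel structure. */
struct aer_regs {
	uint32_t uncor_status; /* Uncorrectable Error Status (write-1-to-clear) */
	uint32_t uncor_sever;  /* Uncorrectable Error Severity (1 = fatal) */
};

/* Model a config write to a write-1-to-clear register:
 * every bit written as 1 is cleared, bits written as 0 are untouched. */
static void rw1c_write(uint32_t *reg, uint32_t val)
{
	*reg &= ~val;
}

/* Clear only the status bits whose severity is non-fatal. */
static void aer_clear_nonfatal(struct aer_regs *r)
{
	rw1c_write(&r->uncor_status, r->uncor_status & ~r->uncor_sever);
}

/* Clear only the status bits whose severity is fatal. */
static void aer_clear_fatal(struct aer_regs *r)
{
	rw1c_write(&r->uncor_status, r->uncor_status & r->uncor_sever);
}

/* Clear every logged uncorrectable error, regardless of severity. */
static void aer_clear_all(struct aer_regs *r)
{
	rw1c_write(&r->uncor_status, r->uncor_status);
}
```

In this model, a device with one fatal and one non-fatal bit logged still has the fatal bit set after aer_clear_nonfatal(); only aer_clear_fatal() or aer_clear_all() removes it. That is the gap the question is about, and the full clear is the part attributed to pci_restore_state() above.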


* RE: [External] : Re: pci_do_recovery not handling fata errors
  2021-03-16 21:51         ` Keith Busch
@ 2021-04-01  2:15           ` James Puthukattukaran
  2021-04-01  2:22             ` Keith Busch
  0 siblings, 1 reply; 7+ messages in thread
From: James Puthukattukaran @ 2021-04-01  2:15 UTC (permalink / raw)
  To: Keith Busch
  Cc: Kelley, Sean V, Kuppuswamy, Sathyanarayanan, Linux PCI, bhelgaas

What's the rationale for overriding the status returned by the error_detected callback with the return value of reset_link() in pcie_do_recovery()? If error_detected returned NEED_RESET and reset_link() returned RECOVERED (as dpc_reset_link() does), then the driver's slot_reset callback won't be called.

        pci_dbg(dev, "broadcast error_detected message\n");
        if (state == pci_channel_io_frozen) {
                pci_walk_bus(bus, report_frozen_detected, &status); <-- returns NEED_RESET
                status = reset_link(dev);  <--- overwrites status with RECOVERED
                if (status != PCI_ERS_RESULT_RECOVERED) {
                        pci_warn(dev, "link reset failed\n");
                        goto failed;

--James



* Re: [External] : Re: pci_do_recovery not handling fata errors
  2021-04-01  2:15           ` James Puthukattukaran
@ 2021-04-01  2:22             ` Keith Busch
  0 siblings, 0 replies; 7+ messages in thread
From: Keith Busch @ 2021-04-01  2:22 UTC (permalink / raw)
  To: James Puthukattukaran
  Cc: Kelley, Sean V, Kuppuswamy, Sathyanarayanan, Linux PCI, bhelgaas

On Thu, Apr 01, 2021 at 02:15:38AM +0000, James Puthukattukaran wrote:
> What's the rationale for overriding the status returned by the err_detected callback with the reset_link in pcie_do_recovery?

That was a bug that's been fixed:

  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=387c72cdd7fb6bef650fb078d0f6ae9682abf631
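
The effect of the override can be modelled outside the kernel. The sketch below is not the linked commit. It assumes, from the snippet in James's message, that the old code assigned the return value of reset_link() to status, and that a fixed version only checks that return value and keeps the drivers' vote. The enum and function names are hypothetical.

```c
/* Hypothetical stand-in for the recovery result codes; values are arbitrary. */
enum ers_result {
	ERS_CAN_RECOVER,
	ERS_NEED_RESET,
	ERS_DISCONNECT,
	ERS_RECOVERED,
};

/* Old behavior as described in the question: the link reset result
 * replaces the status voted by the drivers' error_detected callbacks.
 * Returns 1 if the slot_reset stage would run, 0 otherwise. */
static int slot_reset_runs_old(enum ers_result detected, enum ers_result link)
{
	enum ers_result status = detected;

	status = link;                 /* driver vote is lost here */
	if (status != ERS_RECOVERED)
		return 0;              /* link reset failed, recovery aborted */
	return status == ERS_NEED_RESET;
}

/* Assumed fixed behavior: the link reset result is only checked,
 * so a NEED_RESET vote survives and slot_reset is reached. */
static int slot_reset_runs_fixed(enum ers_result detected, enum ers_result link)
{
	enum ers_result status = detected;

	if (link != ERS_RECOVERED)
		return 0;              /* link reset failed, recovery aborted */
	return status == ERS_NEED_RESET;
}
```

For the case raised in the question (error_detected votes NEED_RESET, the link reset reports RECOVERED), the first function returns 0 and the second returns 1. Since drivers commonly restore state from slot_reset, skipping that stage would also skip the AER status clear discussed earlier in the thread.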
