From: Bjorn Helgaas <helgaas@kernel.org>
To: Keith Busch <keith.busch@intel.com>
Cc: Linux PCI <linux-pci@vger.kernel.org>,
	Bjorn Helgaas <bhelgaas@google.com>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Sinan Kaya <okaya@kernel.org>, Thomas Tai <thomas.tai@oracle.com>,
	poza@codeaurora.org, Lukas Wunner <lukas@wunner.de>,
	Christoph Hellwig <hch@lst.de>,
	Mika Westerberg <mika.westerberg@linux.intel.com>
Subject: Re: [PATCHv4 08/12] PCI: ERR: Always use the first downstream port
Date: Fri, 28 Sep 2018 18:28:02 -0500
Message-ID: <20180928232801.GB119911@bhelgaas-glaptop.roam.corp.google.com>
In-Reply-To: <20180928213523.GA22508@localhost.localdomain>

On Fri, Sep 28, 2018 at 03:35:23PM -0600, Keith Busch wrote:
> On Fri, Sep 28, 2018 at 03:50:34PM -0500, Bjorn Helgaas wrote:
> > On Fri, Sep 28, 2018 at 09:42:20AM -0600, Keith Busch wrote:
> > > On Thu, Sep 27, 2018 at 05:56:25PM -0500, Bjorn Helgaas wrote:
> > > > On Wed, Sep 26, 2018 at 04:19:25PM -0600, Keith Busch wrote:
> > > The "first downstream port" was supposed to mean the first DSP we
> > > find when walking toward the root from the pci_dev that reported
> > > ERR_[NON]FATAL.
> > > 
> > > By "use", I mean "walking down the sub-tree for error handling".
> > > 
> > > After thinking more on this, that doesn't really capture the intent. A
> > > better way might be:
> > > 
> > >   Run error handling starting from the downstream port of the bus
> > >   reporting an error
> > 
> > I think the light is beginning to dawn.  Does this make sense (as far
> > as it goes)?
> > 
> >   PCI/ERR: Run error recovery callbacks for all affected devices
> > 
> >   If an Endpoint reports an error with ERR_FATAL, we previously ran
> >   driver error recovery callbacks only for the Endpoint.  But if
> >   recovery requires that we reset the Endpoint, we may have to use a
> >   port farther upstream to initiate a Link reset, and that may affect
> >   components other than the Endpoint, e.g., multi-function peers and
> >   their children.  Drivers for those devices were never notified of
> >   the impending reset.
> > 
> >   Call driver error recovery callbacks for every device that will be
> >   reset.
> 
> Yes!
>  
> > Now help me understand this part:
> > 
> > > This allows two other clean-ups.  First, error handling can only run
> > > on bridges so this patch removes checks for endpoints.  
> > 
> > "error handling can only run on bridges"?  I *think* only Root Ports
> > and Root Complex Event Collectors can assert AER interrupts, so
> > aer_irq() is only run for them (actually I don't think we quite
> > support Event Collectors yet).  Is this what you're getting at?
> 
> I mean the pci_dev sent to pcie_do_recovery(), which may be any device
> in the topology including or below the root port that aer_irq() serviced.

Yep.

> > Probably not, because the "dev" passed to pcie_do_recovery() isn't an
> > RP or RCEC.  But the e_info->dev[i] that aer_process_err_devices()
> > eventually passes in doesn't have to be a bridge at all, does it?
> 
> Yes, e_info->dev[i] is sent to pcie_do_recovery(). That could be an RP,
> but it may also be anything below it.

Yep.

> The assumption I'm making (which I think is a safe assumption with
> general consensus) is that errors detected on an end point or an upstream
> port happened because of something wrong with the link going upstream:
> end devices have no other option, 

Is this really true?  It looks like "Internal Errors" (sec 6.2.9) may
be unrelated to a packet or event (though they are supposed to be
associated with a PCIe interface).

It says the only method of recovering is reset or hardware
replacement.  It doesn't specify, but it seems like an FLR or similar
reset might be appropriate and we may not have to reset the link.

Getting back to the changelog, "error handling can only run on
bridges" clearly doesn't refer to the driver callbacks (since those
only apply to endpoints).  Maybe "error handling" refers to the
reset_link(), which can only be done on a bridge?

That would make sense to me, although the current code may be
resetting more devices than necessary if Internal Errors can be
handled without a link reset.

Bjorn
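
For readers following the thread, here is a minimal sketch of the idea in
the proposed changelog: start from the device that reported the error, walk
toward the root until reaching a port that owns a link (a Root Port or
Downstream Port), then notify every device below that port, since all of
them would be hit by a link reset.  This is a toy model, not kernel code:
struct toy_dev, toy_recovery_port() and toy_notify_below() are hypothetical
names invented for illustration and do not correspond to struct pci_dev or
to anything in drivers/pci/pcie/err.c.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: hypothetical types, not the kernel's struct pci_dev. */
enum toy_type {
	TOY_ROOT_PORT,
	TOY_UPSTREAM_PORT,
	TOY_DOWNSTREAM_PORT,
	TOY_ENDPOINT,
};

#define TOY_MAX_CHILDREN 4

struct toy_dev {
	enum toy_type type;
	struct toy_dev *parent;	/* bridge above this device, NULL at the top */
	struct toy_dev *child[TOY_MAX_CHILDREN];
	int nr_children;
	int notified;		/* set once the "driver callback" has run */
};

/* Attach 'child' directly below 'parent'. */
static void toy_add_child(struct toy_dev *parent, struct toy_dev *child)
{
	assert(parent->nr_children < TOY_MAX_CHILDREN);
	child->parent = parent;
	parent->child[parent->nr_children++] = child;
}

/*
 * Walk toward the root from the reporting device until we reach a Root
 * Port or Downstream Port, i.e. a port with a link below it that could be
 * reset.  A port of either kind that reported the error is used as-is.
 */
static struct toy_dev *toy_recovery_port(struct toy_dev *dev)
{
	while (dev->parent &&
	       dev->type != TOY_ROOT_PORT &&
	       dev->type != TOY_DOWNSTREAM_PORT)
		dev = dev->parent;
	return dev;
}

/*
 * Mark every device below 'port' as notified, recursing through any
 * switches.  Returns the number of devices marked.
 */
static int toy_notify_below(struct toy_dev *port)
{
	int i, count = 0;

	for (i = 0; i < port->nr_children; i++) {
		struct toy_dev *c = port->child[i];

		c->notified = 1;
		count += 1 + toy_notify_below(c);
	}
	return count;
}
```

In this model, an error reported by one function of a multi-function
Endpoint resolves to the port above it, and the peer function is notified
as well.  It deliberately does not model the Internal Error case raised
above, where a narrower reset such as FLR might avoid touching the peers.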
