All of lore.kernel.org
 help / color / mirror / Atom feed
From: Bjorn Helgaas <helgaas@kernel.org>
To: Keith Busch <keith.busch@intel.com>
Cc: Linux PCI <linux-pci@vger.kernel.org>,
	Bjorn Helgaas <bhelgaas@google.com>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Sinan Kaya <okaya@kernel.org>, Thomas Tai <thomas.tai@oracle.com>,
	poza@codeaurora.org, Lukas Wunner <lukas@wunner.de>,
	Christoph Hellwig <hch@lst.de>,
	Mika Westerberg <mika.westerberg@linux.intel.com>
Subject: Re: [PATCHv4 08/12] PCI: ERR: Always use the first downstream port
Date: Fri, 28 Sep 2018 15:50:34 -0500	[thread overview]
Message-ID: <20180928205034.GA119911@bhelgaas-glaptop.roam.corp.google.com> (raw)
In-Reply-To: <20180928154220.GA21996@localhost.localdomain>

On Fri, Sep 28, 2018 at 09:42:20AM -0600, Keith Busch wrote:
> On Thu, Sep 27, 2018 at 05:56:25PM -0500, Bjorn Helgaas wrote:
> > On Wed, Sep 26, 2018 at 04:19:25PM -0600, Keith Busch wrote:
> > > On Wed, Sep 26, 2018 at 05:01:16PM -0500, Bjorn Helgaas wrote:
> > > > On Thu, Sep 20, 2018 at 10:27:13AM -0600, Keith Busch wrote:
> > > > > The link reset always used the first bridge device, but AER broadcast
> > > > > error handling may have reported an end device. This means the reset may
> > > > > hit devices that were never notified of the impending error recovery.
> > > > > 
> > > > > This patch uses the first downstream port in the hierarchy considered
> > > > > reliable. An error detected by a switch upstream port should mean it
> > > > > occurred on its upstream link, so the patch selects the parent device
> > > > > if the error is not a root or downstream port.
> > > > 
> > > > I'm not really clear on what "Always use the first downstream port"
> > > > means.  Always use it for *what*?
> > 
> > And I forgot to ask what "first downstream port" means.
> 
> The "first downstream port" was supposed to mean the first DSP we
> find when walking toward the root from the pci_dev that reported
> ERR_[NON]FATAL.
> 
> By "use", I mean "walking down the sub-tree for error handling".
> 
> After thinking more on this, that doesn't really capture the intent. A
> better way might be:
> 
>   Run error handling starting from the downstream port of the bus
>   reporting an error

I think the light is beginning to dawn.  Does this make sense (as far
as it goes)?

  PCI/ERR: Run error recovery callbacks for all affected devices

  If an Endpoint reports an error with ERR_FATAL, we previously ran
  driver error recovery callbacks only for the Endpoint.  But if
  recovery requires that we reset the Endpoint, we may have to use a
  port farther upstream to initiate a Link reset, and that may affect
  components other than the Endpoint, e.g., multi-function peers and
  their children.  Drivers for those devices were never notified of
  the impending reset.

  Call driver error recovery callbacks for every device that will be
  reset.

Now help me understand this part:

> This allows two other clean-ups.  First, error handling can only run
> on bridges so this patch removes checks for endpoints.  

"error handling can only run on bridges"?  I *think* only Root Ports
and Root Complex Event Collectors can assert AER interrupts, so
aer_irq() is only run for them (actually I don't think we quite
support Event Collectors yet).  Is this what you're getting at?

Probably not, because the "dev" passed to pcie_do_recovery() isn't an
RP or RCEC.  But the e_info->dev[i] that aer_process_err_devices()
eventually passes in doesn't have to be a bridge at all, does it?

So I can't quite figure out the bridge connection here.

> Second, the first accessible port does not inherit the channel error
> state since we can access it, so the special cases for error detect
> and resume are no longer necessary.

When we call pcie_do_recovery(), I guess we specify a "dev" that
logged an AER event and a pci_channel_state.  It seems a little bit
weird to handle those separately, as opposed to incorporating into 
the pci_dev or something.

But if you're just saying that "if A is frozen because of an error,
that doesn't mean bridges upstream from A are frozen", that makes good
sense.

Bjorn

  reply	other threads:[~2018-09-28 20:50 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-09-20 16:27 [PATCHv4 00/12] pci error handling fixes Keith Busch
2018-09-20 16:27 ` [PATCHv4 01/12] PCI: portdrv: Initialize service drivers directly Keith Busch
2018-09-20 16:27 ` [PATCHv4 02/12] PCI: portdrv: Restore pci state on slot reset Keith Busch
2018-09-20 16:27 ` [PATCHv4 03/12] PCI: DPC: Save and restore control state Keith Busch
2018-09-20 19:46   ` Sinan Kaya
2018-09-20 19:47     ` Sinan Kaya
2018-09-20 19:54       ` Keith Busch
2018-09-20 16:27 ` [PATCHv4 04/12] PCI: AER: Take reference on error devices Keith Busch
2018-09-20 16:27 ` [PATCHv4 05/12] PCI: AER: Don't read upstream ports below fatal errors Keith Busch
2018-09-20 16:27 ` [PATCHv4 06/12] PCI: ERR: Use slot reset if available Keith Busch
2018-09-20 16:27 ` [PATCHv4 07/12] PCI: ERR: Handle fatal error recovery Keith Busch
2018-09-20 16:27 ` [PATCHv4 08/12] PCI: ERR: Always use the first downstream port Keith Busch
2018-09-26 22:01   ` Bjorn Helgaas
2018-09-26 22:19     ` Keith Busch
2018-09-27 22:56       ` Bjorn Helgaas
2018-09-28 15:42         ` Keith Busch
2018-09-28 20:50           ` Bjorn Helgaas [this message]
2018-09-28 21:35             ` Keith Busch
2018-09-28 23:28               ` Bjorn Helgaas
2018-10-01 15:14                 ` Keith Busch
2018-10-02 19:35                   ` Bjorn Helgaas
2018-10-02 19:55                     ` Keith Busch
2018-09-20 16:27 ` [PATCHv4 09/12] PCI: ERR: Simplify broadcast callouts Keith Busch
2018-09-20 16:27 ` [PATCHv4 10/12] PCI: ERR: Report current recovery status for udev Keith Busch
2018-09-20 16:27 ` [PATCHv4 11/12] PCI: Unify device inaccessible Keith Busch
2018-09-20 16:27 ` [PATCHv4 12/12] PCI: Make link active reporting detection generic Keith Busch
2018-09-20 20:00 ` [PATCHv4 00/12] pci error handling fixes Sinan Kaya
2018-09-20 21:17 ` Bjorn Helgaas

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180928205034.GA119911@bhelgaas-glaptop.roam.corp.google.com \
    --to=helgaas@kernel.org \
    --cc=benh@kernel.crashing.org \
    --cc=bhelgaas@google.com \
    --cc=hch@lst.de \
    --cc=keith.busch@intel.com \
    --cc=linux-pci@vger.kernel.org \
    --cc=lukas@wunner.de \
    --cc=mika.westerberg@linux.intel.com \
    --cc=okaya@kernel.org \
    --cc=poza@codeaurora.org \
    --cc=thomas.tai@oracle.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.