linux-pci.vger.kernel.org archive mirror
From: Bjorn Helgaas <helgaas@kernel.org>
To: Austin.Bolen@dell.com
Cc: sathyanarayanan.kuppuswamy@linux.intel.com,
	linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org,
	ashok.raj@intel.com
Subject: Re: [PATCH v17 09/12] PCI/AER: Allow clearing Error Status Register in FF mode
Date: Wed, 11 Mar 2020 15:33:26 -0500
Message-ID: <20200311203326.GA163074@google.com>
In-Reply-To: <7b8d47f9180e43a7bdb01f9d8754c9f6@AUSX13MPC107.AMER.DELL.COM>

On Wed, Mar 11, 2020 at 05:27:35PM +0000, Austin.Bolen@dell.com wrote:
> On 3/11/2020 12:12 PM, Bjorn Helgaas wrote:
> <SNIP>
> > 
> > I'm probably missing your intent, but that sounds like "the OS can
> > read/write AER bits whenever it wants, regardless of ownership."
> > 
> > That doesn't sound practical to me, and I don't think it's really
> > similar to DPC, where it's pretty clear that the OS can touch DPC bits
> > it doesn't own but only *during the EDR processing window*.
> 
> Yes, by treating AER bits like DPC bits I meant I'd define the specific 
> time windows when OS can touch the AER status bits similar to how it's 
> done for DPC in the current ECN.

Makes sense, thanks.

> >>>> For the normative text describing when OS clears the AER bits
> >>>> following the informative flow chart, it could say that OS clears
> >>>> AER as soon as possible after _OST returns and before OS processes
> >>>> _HPX and loads drivers.  Open to other suggestions as well.
> >>>
> >>> I'm not sure what to do with "as soon as possible" either.  That
> >>> doesn't seem like something firmware and the OS can agree on.
> >>
> >> I can just state that it's done after _OST returns but before _HPX is
> >> evaluated or drivers are loaded. Any time in that range is fine. I
> >> can't get super
> >> specific here because different OSes do different things.  Even for
> >> a given OS they change over time. And I need something generic
> >> enough to support a wide variety of OS implementations.
> > 
> > Yeah.  I don't know how to solve this.
> > 
> > Linux doesn't actually unload and reload drivers for the child devices
> > (Sathy, correct me if I'm wrong here) even though DPC containment
> > takes the link down and effectively unplugs and replugs the device.  I
> > would *like* to handle it like hotplug, but some higher-level software
> > doesn't deal well with things like storage devices disappearing and
> > reappearing.
> > 
> > Since Linux doesn't actually re-enumerate the child devices, it
> > wouldn't evaluate _HPX again.  It would probably be cleaner if it did,
> > but it's all tied up with the whole unplug/replug problem.
> 
> DPC resets everything below it and so to get it back up and running it 
> would mean that all buses and resources need to be assigned, _HPX 
> evaluated, and drivers reloaded. If those things don't happen then the 
> whole hierarchy below the port that triggered DPC will be inaccessible.

Hmm, I think I might be confusing this with another situation.  Sathy,
can you help me understand this?  I don't have a way to actually
exercise this EDR path.  Is there some way the pciehp hotplug driver
gets involved here?

Here's how this seems to work as far as I can tell:

  - Linux does not have DPC or AER control

  - Linux installs EDR notify handler

  - Linux evaluates DPC Enable _DSM

  - DPC containment event occurs

  - Firmware fields DPC interrupt

  - DPC event is not a surprise remove

  - Firmware sends EDR notification

  - Linux EDR notify handler evaluates Locate _DSM

  - Linux reads and logs DPC and AER error information for port in
    containment mode.  [If it was an RP PIO error, Linux clears RP PIO
    error status, which is an asymmetry with the non-RP PIO path.]

  - Linux clears AER error status (pci_aer_raw_clear_status())

  - Linux calls driver .error_detected() methods for all child devices
    of the port in containment mode (pcie_do_recovery()).  These
    devices are inaccessible because the link is down.

  - Linux clears DPC Trigger Status (dpc_reset_link() from
    pcie_do_recovery()).

  - Linux calls driver .mmio_enabled() methods for all child devices.

This is where I get lost.  These child devices are now accessible, but
they've been reset, so I don't know how their config space got
restored.  Did pciehp enumerate them?  Did we do something like
pci_restore_state()?  I don't see where either of these happens.
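
To make the question concrete, here is a toy C model of the sequence
listed above.  It is only a sketch, not kernel code: struct port_model,
struct callback_view, observe(), containment_event(), and edr_notify()
are made-up names, and the assumption that nothing restores the child
config space between the two callbacks is exactly the part I can't
confirm.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a port in DPC containment; not kernel code. */
struct port_model {
	bool dpc_trigger_status;   /* set while the port holds the link down */
	bool aer_status_pending;   /* port AER error status not yet cleared */
	bool children_configured;  /* child config space still valid */
};

/* Snapshot of what a child driver callback would observe. */
struct callback_view {
	bool accessible;
	bool configured;
};

static struct callback_view observe(const struct port_model *p)
{
	struct callback_view v = {
		.accessible = !p->dpc_trigger_status,
		.configured = p->children_configured,
	};
	return v;
}

/* DPC containment: link goes down, everything below the port is reset. */
static void containment_event(struct port_model *p)
{
	assert(p != NULL);
	p->dpc_trigger_status = true;
	p->aer_status_pending = true;
	p->children_configured = false;
}

/*
 * Model of the EDR notify handler steps, in the order listed above.
 * Records what .error_detected() and .mmio_enabled() would observe.
 */
static void edr_notify(struct port_model *p,
		       struct callback_view *at_error_detected,
		       struct callback_view *at_mmio_enabled)
{
	assert(p != NULL && at_error_detected != NULL && at_mmio_enabled != NULL);

	/* Locate _DSM and logging of DPC/AER info are omitted here. */
	p->aer_status_pending = false;    /* cf. pci_aer_raw_clear_status() */
	*at_error_detected = observe(p);  /* link is still down */
	p->dpc_trigger_status = false;    /* cf. dpc_reset_link() */
	*at_mmio_enabled = observe(p);    /* link is up, nothing restored */
}
```

In this model, .error_detected() sees inaccessible children, and
.mmio_enabled() sees children that are accessible but unconfigured,
which is the gap I'm asking about.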

> For higher level software not handling storage device disappearing due 
> to hot-plug, they will have the same problem with DPC since DPC holds 
> the port in the disabled state (and hence will be inaccessible). And 
> once DPC is released the devices will be unconfigured and so still 
> inaccessible to upper-level software.  A lot of upper-level storage 
> software I've seen can already handle this gracefully.
> 
> >>> For child devices of that port, obviously it's impossible to
> >>> access AER registers until DPC Trigger Status is cleared, and the
> >>> flowchart says the OS shouldn't access them until after _OST.
> >>>
> >>> I'm actually not sure we currently do *anything* with child device
> >>> AER info in the EDR path.  pcie_do_recovery() does walk the
> >>> sub-hierarchy of child devices, but it only calls error handling
> >>> callbacks in the child drivers; it doesn't do anything with the
> >>> child AER registers itself.  And of course, this happens before
> >>> _OST, so it would be too early in any case.  But maybe I'm missing
> >>> something here.
> >>
> >> My understanding is that the OS read/clears AER in the case where OS
> >> has native control of AER.  Feedback from OSVs is they wanted to
> >> continue to do that to keep the native OS controlled AER and FF
> >> mechanism similar.  The other way we could have done it would be to
> >> have the firmware read/clear AER and report them to OS via APEI.
> > 
> > When Linux has native control of AER, it reads/clears AER status.
> > The flowchart is for the case where firmware has AER control, so I
> > guess Linux would not field AER interrupts and wouldn't expect to
> > read/clear AER status.  So I *guess* Linux would assume APEI?  But
> > that doesn't seem to be what the flowchart assumes.
> 
> Correct on the flowchart.  The OSVs we talked with did not want to use 
> APEI.  They wanted to read and clear AER themselves and hence the 
> flowchart is written that way.

So they want to basically do native AER handling even though firmware
owns AER?  My head hurts.

Bjorn
