From: Bjorn Helgaas <helgaas@kernel.org>
To: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Cc: Austin.Bolen@dell.com, linux-pci@vger.kernel.org,
linux-kernel@vger.kernel.org, ashok.raj@intel.com
Subject: Re: [PATCH v17 09/12] PCI/AER: Allow clearing Error Status Register in FF mode
Date: Wed, 11 Mar 2020 17:23:52 -0500
Message-ID: <20200311222352.GA200510@google.com>
In-Reply-To: <5328126a-7cf0-58c9-7dff-978fe2cae0ee@linux.intel.com>
On Wed, Mar 11, 2020 at 03:11:06PM -0700, Kuppuswamy Sathyanarayanan wrote:
> On 3/11/20 2:53 PM, Austin.Bolen@dell.com wrote:
> > On 3/11/2020 4:27 PM, Kuppuswamy Sathyanarayanan wrote:
> > > On 3/11/20 1:33 PM, Bjorn Helgaas wrote:
> > > > On Wed, Mar 11, 2020 at 05:27:35PM +0000, Austin.Bolen@dell.com wrote:
> > > > > On 3/11/2020 12:12 PM, Bjorn Helgaas wrote:
> > > > > <SNIP>
> > > > > > I'm probably missing your intent, but that sounds like "the OS can
> > > > > > read/write AER bits whenever it wants, regardless of ownership."
> > > > > >
> > > > > > That doesn't sound practical to me, and I don't think it's really
> > > > > > similar to DPC, where it's pretty clear that the OS can touch DPC bits
> > > > > > it doesn't own but only *during the EDR processing window*.
> > > > > Yes, by treating AER bits like DPC bits I meant I'd define the specific
> > > > > time windows when OS can touch the AER status bits similar to how it's
> > > > > done for DPC in the current ECN.
> > > > Makes sense, thanks.
> > > >
> > > > > > > > > For the normative text describing when OS clears the AER bits
> > > > > > > > > following the informative flow chart, it could say that OS clears
> > > > > > > > > AER as soon as possible after OST returns and before OS processes
> > > > > > > > > _HPX and loading drivers. Open to other suggestions as well.
> > > > > > > > I'm not sure what to do with "as soon as possible" either. That
> > > > > > > > doesn't seem like something firmware and the OS can agree on.
> > > > > > > I can just state that it's done after OST returns but before _HPX or
> > > > > > > driver is loaded. Any time in that range is fine. I can't get super
> > > > > > > specific here because different OSes do different things. Even for
> > > > > > > a given OS they change over time. And I need something generic
> > > > > > > enough to support a wide variety of OS implementations.
> > > > > > Yeah. I don't know how to solve this.
> > > > > >
> > > > > > Linux doesn't actually unload and reload drivers for the child devices
> > > > > > (Sathy, correct me if I'm wrong here) even though DPC containment
> > > > > > takes the link down and effectively unplugs and replugs the device. I
> > > > > > would *like* to handle it like hotplug, but some higher-level software
> > > > > > doesn't deal well with things like storage devices disappearing and
> > > > > > reappearing.
> > > > > >
> > > > > > Since Linux doesn't actually re-enumerate the child devices, it
> > > > > > wouldn't evaluate _HPX again. It would probably be cleaner if it did,
> > > > > > but it's all tied up with the whole unplug/replug problem.
> > > > > DPC resets everything below it and so to get it back up and running it
> > > > > would mean that all buses and resources need to be assigned, _HPX
> > > > > evaluated, and drivers reloaded. If those things don't happen then the
> > > > > whole hierarchy below the port that triggered DPC will be inaccessible.
> > > > Hmm, I think I might be confusing this with another situation. Sathy,
> > > > can you help me understand this? I don't have a way to actually
> > > > exercise this EDR path. Is there some way the pciehp hotplug driver
> > > > gets involved here?
> > If the port has hot-plug enabled then DPC trigger will cause the link to
> > go down (disabled state) and will generate a DLLSC hot-plug interrupt.
> > When DPC is released, the link will become active and generate another
> > DLLSC hot-plug interrupt.
> Yes, device/driver enumeration and removal will be triggered by the
> DLLSC state change interrupt in the pciehp driver.
> >
> > > > Here's how this seems to work as far as I can tell:
> > > >
> > > > - Linux does not have DPC or AER control
> > > >
> > > > - Linux installs EDR notify handler
> > > >
> > > > - Linux evaluates DPC Enable _DSM
> > > >
> > > > - DPC containment event occurs
> > > >
> > > > - Firmware fields DPC interrupt
> > > >
> > > > - DPC event is not a surprise remove
> > > >
> > > > - Firmware sends EDR notification
> > > >
> > > > - Linux EDR notify handler evaluates Locate _DSM
> > > >
> > > > - Linux reads and logs DPC and AER error information for port in
> > > > containment mode. [If it was an RP PIO error, Linux clears RP PIO
> > > > error status, which is an asymmetry with the non-RP PIO path.]
> > > >
> > > > - Linux clears AER error status (pci_aer_raw_clear_status())
> > > >
> > > > - Linux calls driver .error_detected() methods for all child devices
> > > > of the port in containment mode (pcie_do_recovery()). These
> > > > devices are inaccessible because the link is down.
> > > >
> > > > - Linux clears DPC Trigger Status (dpc_reset_link() from
> > > > pcie_do_recovery()).
> > > >
> > > > - Linux calls driver .mmio_enabled() methods for all child devices.
> > > >
> > > > This is where I get lost. These child devices are now accessible, but
> > > > they've been reset, so I don't know how their config space got
> > > > restored. Did pciehp enumerate them? Did we do something like
> > > > pci_restore_state()? I don't see where either of these happens.
> > > AFAIK, AER error status registers are sticky (RW1CS) and hence
> > > will be preserved during reset.
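
For anyone following the RW1CS point, here is a minimal sketch of what
"sticky, write-1-to-clear" means for a status register. It is a toy
model, not kernel code: struct status_reg and the reg_*() helpers are
hypothetical names made up for illustration, and the non-sticky default
value is assumed to be 0.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of one 32-bit status register (hypothetical, not kernel code). */
struct status_reg {
	uint32_t value;
	int sticky;	/* nonzero: contents survive a hot reset */
};

/* Hardware side: logging an error sets the corresponding status bits. */
static void reg_log_error(struct status_reg *r, uint32_t bits)
{
	r->value |= bits;
}

/* Software side, RW1C: writing 1 clears a bit, writing 0 leaves it alone. */
static void reg_write(struct status_reg *r, uint32_t val)
{
	r->value &= ~val;
}

/* Hot reset: non-sticky bits return to the assumed default (0), sticky bits stay. */
static void reg_hot_reset(struct status_reg *r)
{
	if (!r->sticky)
		r->value = 0;
}
```

In this model a logged error in a sticky register is still readable
after reg_hot_reset() and only goes away when software writes a 1 to
that bit. This only covers the status bits; it says nothing about how
the rest of config space below the port gets reprogrammed.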
> > In our testing, the device directly connected to the port that was
> > contained does get reprogrammed and the driver is reloaded. These are
> > hot-plug slots and so might be due to DLLSC hot-plug interrupt when
> > containment is released and link goes back to active state.
> >
> > However, if a switch is connected to the port where DPC was triggered
> > then we do not see the whole switch hierarchy being re-enumerated.
> Now that I have hardware to verify this scenario, I will look into
> it. I suspect there is a transient state in link status which causes
> this disconnect issue. But I think this issue is not related to
> EDR support and hence should be reproducible in native handling
> as well.
> >
> > Also, DPC could be enabled on non-hot-plug slots so can't always rely on
> > hot-plug to re-init devices in the recovery path.
> If hotplug is not supported, devices can still be enumerated
> via polling or ACPI events. Note, however, that the enumeration
> path is independent of the error handler path, so the error
> handler path has no explicit trigger or event to kick-start
> enumeration.
Is any synchronization needed here between the EDR path and the
hotplug/enumeration path?
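
To make the gap in the recovery sequence quoted above concrete, here is
a rough sketch of the two states I am trying to tell apart: "accessible"
(link is up) versus "usable" (link is up and config space has been
reprogrammed). Everything in it is hypothetical: struct child_dev and
the helper names are invented for illustration and do not correspond to
functions in the PCI core. restore_config() is a stand-in for whatever
actually reprograms the device (re-enumeration, a state restore, or
something else), which is exactly the part I can't find.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a device below the port in containment. */
struct child_dev {
	bool link_up;		/* device can be reached at all */
	bool config_valid;	/* config space holds the OS's programming */
};

/* DPC containment: link goes down and the device below is reset. */
static void dpc_trigger(struct child_dev *d)
{
	d->link_up = false;
	d->config_valid = false;
}

/* Clearing DPC Trigger Status: link comes back, config is still at defaults. */
static void dpc_release(struct child_dev *d)
{
	d->link_up = true;
}

/* Stand-in for re-enumeration or a state restore; fails while the link is down. */
static bool restore_config(struct child_dev *d)
{
	if (!d->link_up)
		return false;
	d->config_valid = true;
	return true;
}

/* State at .error_detected(): false. State at .mmio_enabled(): true. */
static bool dev_accessible(const struct child_dev *d)
{
	return d->link_up;
}

/* What a driver needs before touching the device's BARs. */
static bool dev_usable(const struct child_dev *d)
{
	return d->link_up && d->config_valid;
}
```

In this model, after dpc_trigger() and dpc_release() the device is
accessible but not usable until restore_config() runs. If that restore
happens in the hotplug/enumeration path, then the ordering between it
and the .mmio_enabled() callbacks is what the synchronization question
is about.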