From: David Laight <David.Laight@ACULAB.COM>
To: "'Alex_Gagniuc@Dellteam.com'" <Alex_Gagniuc@Dellteam.com>,
	"lukas@wunner.de" <lukas@wunner.de>
Cc: "mr.nuke.me@gmail.com" <mr.nuke.me@gmail.com>,
	"keith.busch@intel.com" <keith.busch@intel.com>,
	"linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>,
	"Austin.Bolen@dell.com" <Austin.Bolen@dell.com>,
	"Stuart.Hayes@dell.com" <Stuart.Hayes@dell.com>,
	"Narendra.K@dell.com" <Narendra.K@dell.com>,
	"Christopher.Arzola@dell.com" <Christopher.Arzola@dell.com>,
	"David.Chalfant@dell.com" <David.Chalfant@dell.com>,
	"okaya@kernel.org" <okaya@kernel.org>
Subject: RE: Should a PCIe Link Down event set the PCI_DEV_DISCONNECTED bit?
Date: Wed, 1 Aug 2018 08:58:39 +0000
Message-ID: <9870eb1907ae425bb9671c73845193f3@AcuMS.aculab.com>
In-Reply-To: <f05149cf4d9743878397de7763561221@ausx13mps321.AMER.DELL.COM>

From: Alex_Gagniuc@Dellteam.com
> Sent: 31 July 2018 17:36
> 
> On 07/31/2018 04:29 AM, Lukas Wunner wrote:
> > On Mon, Jul 30, 2018 at 09:38:04PM +0000, Alex_Gagniuc@Dellteam.com wrote:
> >> On 07/28/2018 01:31 PM, Lukas Wunner wrote:
> >>> On Fri, Jul 27, 2018 at 05:51:04PM +0000, Alex_Gagniuc@Dellteam.com wrote:
> >>>> I think PCI_DEV_DISCONNECTED is a documentation issue above all else.
> >>>> The history I was given is that drivers would take a very long time to
> >>>> tear down a device. Config space IO to an nonexistent device took a long
> >>>> while to time out. Performance was one motivation -- and was not
> >>>> documented.
> >>>
> >>> Often it is possible for the driver to detect surprise removal by
> >>> checking if mmio reads return "all ones".  But in some cases that's
> >>> a valid value to read from mmio and then this approach won't work.
> >>> Also, checking every mmio read may negatively impact performance.
> >>
> >> A colleague and me beat that dead horse to the afterdeath. Consensus was
> >> that the return value is less reliable than a coin toss (of a two-heads
> >> coin).

Something cheap-ish to find out whether a -1 (all ones) was caused by
a card removal might be sensible, especially if it can be done without
a config space read.
Clearly you can't check anything BEFORE doing the read.
And reading the PCI ID from config space isn't sufficient on its own.
If the card has reset itself (and the link has recovered) then you
need to read a BAR register and check that it is still set up.
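
A minimal sketch of that two-step check, written as plain C so it can
be tried outside the kernel. The function name, the enum and the
expected_bar0 parameter are made up for illustration; the caller is
assumed to have read the vendor ID and BAR0 from config space after an
MMIO read returned all ones. It assumes a 32-bit memory BAR, whose low
four bits are flags rather than address bits.

```c
#include <stdint.h>

/* Hypothetical result codes for the check described above. */
enum dev_state {
	DEV_GONE,	/* config reads return all ones: no device */
	DEV_RESET,	/* device answers, but its BAR is no longer programmed */
	DEV_OK		/* device answers and the BAR is still set up */
};

/*
 * Classify the device from two config-space values read after an
 * MMIO read returned -1.  expected_bar0 is the value the driver
 * (or the PCI core) originally programmed into BAR0.
 */
static enum dev_state classify_after_all_ones(uint16_t vendor_id,
					      uint32_t bar0,
					      uint32_t expected_bar0)
{
	/* 0xffff is not a valid vendor ID, so nothing answered. */
	if (vendor_id == 0xffff)
		return DEV_GONE;

	/* Compare address bits only; the low nibble holds BAR flags. */
	if ((bar0 & ~0xfu) != (expected_bar0 & ~0xfu))
		return DEV_RESET;

	return DEV_OK;
}
```

This still costs two config reads, so it is only a fallback for the
case where nothing cheaper is available.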

More interestingly, a read request that is inside the bridge's address
window but outside any BAR (fairly easy to set up if the target has
a large BAR and a small one) will also time out (and return -1) even
though there is no failure of the link.
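
To make the "inside the window, outside any BAR" case concrete, here is
a small sketch that tests an address against a bridge window and a list
of BARs. The struct and function names are hypothetical, and the
addresses in the usage note are invented example values, not taken from
any real system.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of an address range: [start, start + len). */
struct mem_range {
	uint64_t start;
	uint64_t len;
};

static bool range_contains(const struct mem_range *r, uint64_t addr)
{
	return addr >= r->start && addr - r->start < r->len;
}

/*
 * Return true if addr is forwarded by the bridge (inside its memory
 * window) but is not claimed by any of the target's BARs.  A read to
 * such an address fails even though the link is fine.
 */
static bool read_hits_hole(const struct mem_range *window,
			   const struct mem_range *bars, size_t nbars,
			   uint64_t addr)
{
	size_t i;

	if (!range_contains(window, addr))
		return false;

	for (i = 0; i < nbars; i++)
		if (range_contains(&bars[i], addr))
			return false;

	return true;
}
```

With a 128MB BAR at 0xe0000000 and a 4KB BAR at 0xe8000000, the bridge
window has to extend to the next 1MB boundary, so an address such as
0xe8001000 is inside the window but belongs to neither BAR.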

If the target supports AER, the information about the failed cycle
ends up in the target's AER registers, even if the host bridge
doesn't support AER (or it is being ignored).
So it might be useful to be able to read the AER registers even when
no AER interrupt (or other notification) actually happens.
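
As a rough illustration of polling the target's AER state, the sketch
below walks the PCIe extended capability list in a 4KB config-space
image and returns the Uncorrectable Error Status register. It is a
userspace approximation operating on a byte buffer; the helper names
are invented here. In the kernel the same lookup would normally go
through pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR) followed by
pci_read_config_dword() at PCI_ERR_UNCOR_STATUS.

```c
#include <stddef.h>
#include <stdint.h>

#define CFG_SPACE_SIZE		4096
#define EXT_CAP_START		0x100	/* first extended capability */
#define EXT_CAP_ID_AER		0x0001	/* Advanced Error Reporting */
#define AER_UNCOR_STATUS	0x04	/* offset within the AER capability */
#define AER_UNC_UNSUP_REQ	(1u << 20)	/* Unsupported Request error */

/* Read a little-endian dword from the config-space image. */
static uint32_t cfg_read32(const uint8_t *cfg, size_t off)
{
	return (uint32_t)cfg[off] |
	       (uint32_t)cfg[off + 1] << 8 |
	       (uint32_t)cfg[off + 2] << 16 |
	       (uint32_t)cfg[off + 3] << 24;
}

/* Return the offset of the AER capability, or 0 if there is none. */
static size_t find_aer_cap(const uint8_t *cfg)
{
	size_t pos = EXT_CAP_START;
	int ttl = (CFG_SPACE_SIZE - EXT_CAP_START) / 8;	/* loop guard */

	while (pos >= EXT_CAP_START && pos <= CFG_SPACE_SIZE - 4 && ttl-- > 0) {
		uint32_t hdr = cfg_read32(cfg, pos);

		if (hdr == 0 || hdr == 0xffffffffu)
			break;			/* empty list or device gone */
		if ((hdr & 0xffff) == EXT_CAP_ID_AER)
			return pos;
		pos = (hdr >> 20) & 0xffc;	/* next capability offset */
	}
	return 0;
}

/* Fetch the uncorrectable error status; returns 0 on success, -1 if no AER. */
static int aer_uncor_status(const uint8_t *cfg, uint32_t *status)
{
	size_t pos = find_aer_cap(cfg);

	if (!pos || pos + AER_UNCOR_STATUS + 4 > CFG_SPACE_SIZE)
		return -1;
	*status = cfg_read32(cfg, pos + AER_UNCOR_STATUS);
	return 0;
}
```

A driver that has just seen a -1 could check the returned status for
AER_UNC_UNSUP_REQ to tell a read that missed every BAR apart from a
dead link, assuming the target's config space is still reachable.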

I've not managed to get Linux to pick up AER interrupts even on
systems where the hardware clearly supports them (at least on
some slots).  I suspect the BIOS is carefully disabling them
because of reports of message logs being spammed with AER errors.

We also have one system (possibly a Dell 740) where any failure
of a PCIe link leads to an NMI and a kernel crash!
Not entirely useful in a server model that is supposed to be
resilient against various errors.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
