* PCI: hotplug: Erroneous removal of hotplug PCI devices
@ 2019-01-23 18:20 Alex_Gagniuc
2019-01-23 18:44 ` Keith Busch
2019-01-23 18:54 ` Lukas Wunner
0 siblings, 2 replies; 23+ messages in thread
From: Alex_Gagniuc @ 2019-01-23 18:20 UTC (permalink / raw)
To: linux-pci; +Cc: bhelgaas, lukas, keith.busch, Austin.Bolen
Hi all,
This may be a mind-twisting explanation, so please bear with me.
In PCIe, the presence detect bit (PD) in the slot status register should
be a logical OR of in-band and out-of-band presence. In-band presence is
the data link layer status. So one would expect that a link up event
would be accompanied by a PD changed event with PD set. Not everyone
follows that.
I have a system here with the following order of events:
* 0 ms : Link up
* 400 ms : Presence detect up
On the first event, the device is probed as expected, and on the second
event, the device is removed as a SURPRISE!!!_REMOVAL. This is a bug.
The logic is that on every change of presence detect:
/* Even if [the slot]'s occupied again, we cannot assume the card is the
same. */
Reasonable, but the resulting behavior is a bug.
Solution 1 is to say it's a spec violation and ignore it. They'll change
the "logical OR" thing in the next PCIe spec, but we'll still have to
worry about this on existing hardware.
It's obvious that just relying on presence detect state is prone to race
conditions. However, if a device is replaced, we'd expect the data link
layer state to change as well. So I think the best way to proceed is to
skip the SURPRISE!!!_REMOVAL if the following are true:
* presence detect is set
* DLL changed is not set
* presence detect was not previously set
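The three conditions above could be expressed as a predicate over the Slot
Status register bits. A minimal sketch (the register bit values are the
spec-defined ones; the helper name and the previous-presence bookkeeping are
assumptions for illustration, not actual pciehp code):

```c
#include <stdbool.h>
#include <stdint.h>

/* PCIe Slot Status register bits (values as defined by the spec). */
#define PCI_EXP_SLTSTA_PDC   0x0008  /* Presence Detect Changed */
#define PCI_EXP_SLTSTA_PDS   0x0040  /* Presence Detect State */
#define PCI_EXP_SLTSTA_DLLSC 0x0100  /* Data Link Layer State Changed */

/*
 * Hypothetical check: treat this PDC event as a belated "presence up"
 * (i.e. skip the surprise removal) when presence is now asserted, the
 * link did not change, and presence was previously deasserted.
 */
static bool skip_surprise_removal(uint16_t events, uint16_t status,
                                  bool was_present)
{
    return (events & PCI_EXP_SLTSTA_PDC) &&
           !(events & PCI_EXP_SLTSTA_DLLSC) &&
           (status & PCI_EXP_SLTSTA_PDS) &&
           !was_present;
}
```
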
Thoughts?
Alex
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: PCI: hotplug: Erroneous removal of hotplug PCI devices
2019-01-23 18:20 PCI: hotplug: Erroneous removal of hotplug PCI devices Alex_Gagniuc
@ 2019-01-23 18:44 ` Keith Busch
2019-01-23 19:02 ` Lukas Wunner
[not found] ` <b32e6ca62ae2494f98450df81ca1ee14@AUSX13MPC131.AMER.DELL.COM>
2019-01-23 18:54 ` Lukas Wunner
1 sibling, 2 replies; 23+ messages in thread
From: Keith Busch @ 2019-01-23 18:44 UTC (permalink / raw)
To: Alex_Gagniuc; +Cc: linux-pci, bhelgaas, lukas, Austin.Bolen
On Wed, Jan 23, 2019 at 06:20:57PM +0000, Alex_Gagniuc@Dellteam.com wrote:
> Hi all,
>
> This may be a mind-twisting explanation, so please bear with me.
>
> In PCIe, the presence detect bit (PD) in the slot status register should
> be a logical OR of in-band and out-of-band presence. In-band presence is
> the data link layer status. So one would expect that a link up event
> would be accompanied by a PD changed event with PD set. Not everyone
> follows that.
>
> I have a system here with the following order of events:
> * 0 ms : Link up
> * 400 ms : Presence detect up
> On the first event, the device is probed as expected, and on the second
> event, the device is removed as a SURPRISE!!!_REMOVAL. This is a bug.
>
> The logic is that on every change of presence detect:
> /* Even if [the slot]'s occupied again, we cannot assume the card is the
> same. */
> Reasonable, but the resulting behavior is a bug.
>
> Solution 1 is to say it's a spec violation and ignore it. They'll change
> the "logical OR" thing in the next PCIe spec, but we'll still have to
> worry about this on existing hardware.
When's that changing? 5.0 is the next spec, and it still says:
Presence Detect State - This bit indicates the presence of an adapter
in the slot, reflected by the logical “OR” of the Physical Layer
in-band presence detect mechanism and, if present, any out-of-band
presence detect mechanism defined for the slot’s corresponding
form factor.
> It's obvious that just relying on presence detect state is prone to race
> conditions. However, if a device is replaced, we'd expect the data link
> layer state to change as well. So I think the best way to proceed is to
> skip the SURPRISE!!!_REMOVAL if the following are true:
> * presence detect is set
> * DLL changed is not set
> * presence detect was not previously set
>
> Thoughts?
What is the value of PDS on the Link up event? If it's still "Slot
Empty", could we just ignore the Link event instead and wait for the PDC
event?
* Re: PCI: hotplug: Erroneous removal of hotplug PCI devices
2019-01-23 18:20 PCI: hotplug: Erroneous removal of hotplug PCI devices Alex_Gagniuc
2019-01-23 18:44 ` Keith Busch
@ 2019-01-23 18:54 ` Lukas Wunner
2019-01-23 19:07 ` Lukas Wunner
2019-01-24 22:33 ` Austin.Bolen
1 sibling, 2 replies; 23+ messages in thread
From: Lukas Wunner @ 2019-01-23 18:54 UTC (permalink / raw)
To: Alex_Gagniuc; +Cc: linux-pci, bhelgaas, keith.busch, Austin.Bolen
On Wed, Jan 23, 2019 at 06:20:57PM +0000, Alex_Gagniuc@Dellteam.com wrote:
> In PCIe, the presence detect bit (PD) in the slot status register should
> be a logical OR of in-band and out-of-band presence. In-band presence is
> the data link layer status. So one would expect that a link up event
> would be accompanied by a PD changed event with PD set. Not everyone
> follows that.
>
> I have a system here with the following order of events:
> * 0 ms : Link up
> * 400 ms : Presence detect up
> On the first event, the device is probed as expected, and on the second
> event, the device is removed as a SURPRISE!!!_REMOVAL. This is a bug.
It's normal that there's a lag between presence and link changes.
There's even hardware which flaps the link and presence bits a
couple of times before they settle. Since commit 6c35a1ac3da63a7
("PCI: pciehp: Tolerate initially unstable link") we're quite lenient
towards such devices and tolerate link and presence flaps for up to
100 ms. That's basically also the maximum delay we allow between link up
and presence detect up.
Theoretically you could say: Let's wait after the link goes up
until presence also goes up. But guess what, some vendors managed
to hardwire the presence detect flag to zero, so you'd wait forever
on those devices. We have a workaround specifically for such hardware
since commit 80696f991424 ("PCI: pciehp: Tolerate Presence Detect
hardwired to zero").
> It's obvious that just relying on presence detect state is prone to race
> conditions. However, if a device is replaced, we'd expect the data link
> layer state to change as well. So I think the best way to proceed is to
> skip the SURPRISE!!!_REMOVAL if the following are true:
> * presence detect is set
> * DLL changed is not set
> * presence detect was not previously set
Unfortunately it's not that simple. All we know is that presence has
changed and is currently up and the link is also up. This could mean
that the device was swapped and it may just take a little longer until
the link flag also flaps.
A possible solution for your hardware would be to amend
pciehp_check_link_status() such that after the call to pci_bus_check_dev()
we wait until presence also goes up. However it would have to be quirked
to your particular hardware, otherwise it would break devices which
hardwire presence detect to zero. Or we'd have to define a limit of,
say, 1 second and if presence hasn't gone up within that period,
we'd just silently carry on. This wouldn't break hardware which hardwires
presence detect to zero, but would introduce a 1 sec delay for that
hardware to bring up the slot.
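That bounded wait can be sketched outside the kernel by folding the polled
register reads into a sequence of Slot Status samples taken once per poll
interval (the function name, poll interval, and 1 second limit here are
illustrative assumptions, not pciehp's actual timing):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_EXP_SLTSTA_PDS 0x0040  /* Presence Detect State */

/*
 * Simulated bounded wait: "samples" are Slot Status reads taken once
 * per poll interval.  Returns true if presence is seen before the
 * limit; returns false otherwise, in which case the caller would
 * silently carry on (so slots that hardwire PDS to zero still come up,
 * just with the added delay).
 */
static bool presence_up_within(const uint16_t *samples, size_t nsamples,
                               unsigned int poll_ms, unsigned int limit_ms)
{
    unsigned int waited = 0;
    size_t i;

    for (i = 0; i < nsamples && waited < limit_ms; i++, waited += poll_ms)
        if (samples[i] & PCI_EXP_SLTSTA_PDS)
            return true;
    return false;
}
```
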
So I don't see a perfect solution. What device are we talking about
anyway? 400 ms is a *long* time.
Thanks,
Lukas
* Re: PCI: hotplug: Erroneous removal of hotplug PCI devices
2019-01-23 18:44 ` Keith Busch
@ 2019-01-23 19:02 ` Lukas Wunner
2019-01-23 19:07 ` Keith Busch
2019-01-24 22:43 ` Austin.Bolen
[not found] ` <b32e6ca62ae2494f98450df81ca1ee14@AUSX13MPC131.AMER.DELL.COM>
1 sibling, 2 replies; 23+ messages in thread
From: Lukas Wunner @ 2019-01-23 19:02 UTC (permalink / raw)
To: Keith Busch; +Cc: Alex_Gagniuc, linux-pci, bhelgaas, Austin.Bolen
On Wed, Jan 23, 2019 at 11:44:53AM -0700, Keith Busch wrote:
> On Wed, Jan 23, 2019 at 06:20:57PM +0000, Alex_Gagniuc@Dellteam.com wrote:
> > It's obvious that just relying on presence detect state is prone to race
> > conditions. However, if a device is replaced, we'd expect the data link
> > layer state to change as well. So I think the best way to proceed is to
> > skip the SURPRISE!!!_REMOVAL if the following are true:
> > * presence detect is set
> > * DLL changed is not set
> > * presence detect was not previously set
> >
> > Thoughts?
>
> What is the value of PDS on the Link up event? If it's still "Slot
> Empty", could we just ignore the Link event instead and wait for the PDC
> event?
Well, usually it's desirable to bring up the slot as quickly as possible,
so once we get any kind of link or presence event, we immediately try to
bring up the slot.
We do allow a 20 + 100 ms delay in pcie_wait_for_link() between link up
and presence detect up, just not 400 ms.
Thanks,
Lukas
* Re: PCI: hotplug: Erroneous removal of hotplug PCI devices
2019-01-23 18:54 ` Lukas Wunner
@ 2019-01-23 19:07 ` Lukas Wunner
2019-01-23 19:09 ` Keith Busch
2019-01-23 23:50 ` Alex_Gagniuc
2019-01-24 22:33 ` Austin.Bolen
1 sibling, 2 replies; 23+ messages in thread
From: Lukas Wunner @ 2019-01-23 19:07 UTC (permalink / raw)
To: Alex_Gagniuc; +Cc: linux-pci, bhelgaas, keith.busch, Austin.Bolen
On Wed, Jan 23, 2019 at 07:54:20PM +0100, Lukas Wunner wrote:
> So I don't see a perfect solution. What device are we talking about
> anyway? 400 ms is a *long* time.
Also, how exactly does this issue manifest itself: Is it just an
annoyance that the slot is brought up/down/up or does it not work
at all?
Thanks,
Lukas
* Re: PCI: hotplug: Erroneous removal of hotplug PCI devices
2019-01-23 19:02 ` Lukas Wunner
@ 2019-01-23 19:07 ` Keith Busch
2019-01-23 19:15 ` Lukas Wunner
2019-01-24 22:43 ` Austin.Bolen
1 sibling, 1 reply; 23+ messages in thread
From: Keith Busch @ 2019-01-23 19:07 UTC (permalink / raw)
To: Lukas Wunner; +Cc: Alex_Gagniuc, linux-pci, bhelgaas, Austin.Bolen
On Wed, Jan 23, 2019 at 08:02:04PM +0100, Lukas Wunner wrote:
> On Wed, Jan 23, 2019 at 11:44:53AM -0700, Keith Busch wrote:
> > On Wed, Jan 23, 2019 at 06:20:57PM +0000, Alex_Gagniuc@Dellteam.com wrote:
> > > It's obvious that just relying on presence detect state is prone to race
> > > conditions. However, if a device is replaced, we'd expect the data link
> > > layer state to change as well. So I think the best way to proceed is to
> > > skip the SURPRISE!!!_REMOVAL if the following are true:
> > > * presence detect is set
> > > * DLL changed is not set
> > > * presence detect was not previously set
> > >
> > > Thoughts?
> >
> > What is the value of PDS on the Link up event? If it's still "Slot
> > Empty", could we just ignore the Link event instead and wait for the PDC
> > event?
>
> Well, usually it's desirable to bring up the slot as quickly as possible,
> so once we get any kind of link or presence event, we immediately try to
> bring up the slot.
>
> We do allow a 20 + 100 ms delay in pcie_wait_for_link() between link up
> and presence detect up, just not 400 ms.
Right, so in Alex's case, it looks like we are observing
pcie_wait_for_link() returning true before the PDC event.
I'm wondering about PDS because if the link is up but Presence reports an
empty slot, does that matter for any implementations? Or is it perfectly
fine to enumerate an active link on an empty slot? An empty slot and
active link doesn't make a lot of sense, but that observation appears to
be what is reported here.
* Re: PCI: hotplug: Erroneous removal of hotplug PCI devices
2019-01-23 19:07 ` Lukas Wunner
@ 2019-01-23 19:09 ` Keith Busch
2019-01-23 19:28 ` Lukas Wunner
2019-01-23 23:50 ` Alex_Gagniuc
1 sibling, 1 reply; 23+ messages in thread
From: Keith Busch @ 2019-01-23 19:09 UTC (permalink / raw)
To: Lukas Wunner; +Cc: Alex_Gagniuc, linux-pci, bhelgaas, Austin.Bolen
On Wed, Jan 23, 2019 at 08:07:23PM +0100, Lukas Wunner wrote:
> On Wed, Jan 23, 2019 at 07:54:20PM +0100, Lukas Wunner wrote:
> > So I don't see a perfect solution. What device are we talking about
> > anyway? 400 ms is a *long* time.
>
> Also, how exactly does this issue manifest itself: Is it just an
> annoyance that the slot is brought up/down/up or does it not work
> at all?
Yeah, there is an nvme driver bug that hits a deadlock on a very quick
add-remove sequence. The nvme remove tries to delete IO resources before
the async probe side has set them up, so the driver doesn't actually see
that they're invalid. I have a proposed fix, but am waiting to hear
whether it is successful.
bz: https://bugzilla.kernel.org/show_bug.cgi?id=202081
* Re: PCI: hotplug: Erroneous removal of hotplug PCI devices
2019-01-23 19:07 ` Keith Busch
@ 2019-01-23 19:15 ` Lukas Wunner
2019-01-23 19:33 ` Keith Busch
0 siblings, 1 reply; 23+ messages in thread
From: Lukas Wunner @ 2019-01-23 19:15 UTC (permalink / raw)
To: Keith Busch; +Cc: Alex_Gagniuc, linux-pci, bhelgaas, Austin.Bolen
On Wed, Jan 23, 2019 at 12:07:38PM -0700, Keith Busch wrote:
> Right, so in Alex's case, it looks like we are observing
> pcie_wait_for_link() returning true before the PDC event.
>
> I'm wondering about PDS because if the link is up but Presence reports an
> empty slot, does that matter for any implementations? Or is it perfectly
> fine to enumerate an active link on an empty slot? An empty slot and
> active link doesn't make a lot of sense, but that observation appears to
> be what is reported here.
We allow enumeration with an active link of an allegedly empty slot
because some hardware hardwires PDS to zero, see commit 80696f991424
("PCI: pciehp: Tolerate Presence Detect hardwired to zero").
The brokenness of PCIe hotplug implementations found in the wild is
astonishing.
Thanks,
Lukas
* Re: PCI: hotplug: Erroneous removal of hotplug PCI devices
2019-01-23 19:09 ` Keith Busch
@ 2019-01-23 19:28 ` Lukas Wunner
2019-01-23 19:47 ` Keith Busch
0 siblings, 1 reply; 23+ messages in thread
From: Lukas Wunner @ 2019-01-23 19:28 UTC (permalink / raw)
To: Keith Busch; +Cc: Alex_Gagniuc, linux-pci, bhelgaas, Austin.Bolen
On Wed, Jan 23, 2019 at 12:09:46PM -0700, Keith Busch wrote:
> On Wed, Jan 23, 2019 at 08:07:23PM +0100, Lukas Wunner wrote:
> > On Wed, Jan 23, 2019 at 07:54:20PM +0100, Lukas Wunner wrote:
> > > So I don't see a perfect solution. What device are we talking about
> > > anyway? 400 ms is a *long* time.
> >
> > Also, how exactly does this issue manifest itself: Is it just an
> > annoyance that the slot is brought up/down/up or does it not work
> > at all?
>
> Yeah, there is an nvme driver bug that hits a deadlock on a very quick
> add-remove sequence. The nvme remove tries to delete IO resources before
> the async probe side has set them up, so the driver doesn't actually see
> that they're invalid. I have a proposed fix, but am waiting to hear
> whether it is successful.
>
> bz: https://bugzilla.kernel.org/show_bug.cgi?id=202081
Hm, there's no full dmesg output attached, so it's not possible to
tell what the topology looks like and what the vendor/device ID of
0000:b0:04.0 is.
Also, there's only a card present / link up sequence visible in the
abridged dmesg output which has a 4 usec delay, but no link up / card
present sequence with a 400 msec delay?
Thanks,
Lukas
* Re: PCI: hotplug: Erroneous removal of hotplug PCI devices
2019-01-23 19:15 ` Lukas Wunner
@ 2019-01-23 19:33 ` Keith Busch
0 siblings, 0 replies; 23+ messages in thread
From: Keith Busch @ 2019-01-23 19:33 UTC (permalink / raw)
To: Lukas Wunner; +Cc: Alex_Gagniuc, linux-pci, bhelgaas, Austin.Bolen
On Wed, Jan 23, 2019 at 08:15:27PM +0100, Lukas Wunner wrote:
> On Wed, Jan 23, 2019 at 12:07:38PM -0700, Keith Busch wrote:
> > Right, so in Alex's case, it looks like we are observing
> > pcie_wait_for_link() returning true before the PDC event.
> >
> > I'm wondering about PDS because if the link is up but Presence reports an
> > empty slot, does that matter for any implementations? Or is it perfectly
> > fine to enumerate an active link on an empty slot? An empty slot and
> > active link doesn't make a lot of sense, but that observation appears to
> > be what is reported here.
>
> We allow enumeration with an active link of an allegedly empty slot
> because some hardware hardwires PDS to zero, see commit 80696f991424
> ("PCI: pciehp: Tolerate Presence Detect hardwired to zero").
>
> The brokenness of PCIe hotplug implementations found in the wild is
> astonishing.
Heh, sounds like Alex's proposal may work until we find a slot
implementation that doesn't do DLLSC as expected. :)
* Re: PCI: hotplug: Erroneous removal of hotplug PCI devices
2019-01-23 19:28 ` Lukas Wunner
@ 2019-01-23 19:47 ` Keith Busch
2019-01-23 20:10 ` Alex_Gagniuc
0 siblings, 1 reply; 23+ messages in thread
From: Keith Busch @ 2019-01-23 19:47 UTC (permalink / raw)
To: Lukas Wunner; +Cc: Alex_Gagniuc, linux-pci, bhelgaas, Austin.Bolen
On Wed, Jan 23, 2019 at 08:28:29PM +0100, Lukas Wunner wrote:
> On Wed, Jan 23, 2019 at 12:09:46PM -0700, Keith Busch wrote:
> > On Wed, Jan 23, 2019 at 08:07:23PM +0100, Lukas Wunner wrote:
> > > On Wed, Jan 23, 2019 at 07:54:20PM +0100, Lukas Wunner wrote:
> > > > So I don't see a perfect solution. What device are we talking about
> > > > anyway? 400 ms is a *long* time.
> > >
> > > Also, how exactly does this issue manifest itself: Is it just an
> > > annoyance that the slot is brought up/down/up or does it not work
> > > at all?
> >
> > Yeah, there is an nvme driver bug that hits a deadlock on a very quick
> > add-remove sequence. The nvme remove tries to delete IO resources before
> > the async probe side has set them up, so the driver doesn't actually see
> > that they're invalid. I have a proposed fix, but am waiting to hear
> > whether it is successful.
> >
> > bz: https://bugzilla.kernel.org/show_bug.cgi?id=202081
>
> Hm, there's no full dmesg output attached, so it's not possible to
> tell what the topology looks like and what the vendor/device ID of
> 0000:b0:04.0 is.
>
> Also, there's only a card present / link up sequence visible in the
> abridged dmesg output which has a 4 usec delay, but no link up / card
> present sequence with a 400 msec delay?
Yeah, not easy to follow, and some discussion was off the bz.
Link Change:
[ 838.784541] pciehp 0000:b0:04.0:pcie204: Slot(178): Link Up
Presence Detect Change +400 msec:
[ 839.183506] pciehp 0000:b0:04.0:pcie204: Slot(178): Card not present
In between these two entries, nvme starts setting up the controller it
detected on the link up. The "not present" side tries to remove the same
nvme device, but fails to invalidate the IO resources because it's racing
with probe before it even set them up, leaving probe unable to complete
IO a moment later because its IRQ resources were disabled.
Meanwhile, the blk-mq timeout handler can't do anything because the
device state is disconnected and believes the removal side is handling
things. What a mess...
We can fix it, just want to hear if Alex can confirm the proposal is
successful.
* Re: PCI: hotplug: Erroneous removal of hotplug PCI devices
2019-01-23 19:47 ` Keith Busch
@ 2019-01-23 20:10 ` Alex_Gagniuc
0 siblings, 0 replies; 23+ messages in thread
From: Alex_Gagniuc @ 2019-01-23 20:10 UTC (permalink / raw)
To: keith.busch, lukas; +Cc: linux-pci, bhelgaas, Austin.Bolen
On 1/23/19 1:53 PM, Keith Busch wrote:
> On Wed, Jan 23, 2019 at 08:28:29PM +0100, Lukas Wunner wrote:
>> On Wed, Jan 23, 2019 at 12:09:46PM -0700, Keith Busch wrote:
>>> On Wed, Jan 23, 2019 at 08:07:23PM +0100, Lukas Wunner wrote:
>>>> On Wed, Jan 23, 2019 at 07:54:20PM +0100, Lukas Wunner wrote:
>>>>> So I don't see a perfect solution. What device are we talking about
>>>>> anyway? 400 ms is a *long* time.
>>>>
>>>> Also, how exactly does this issue manifest itself: Is it just an
>>>> annoyance that the slot is brought up/down/up or does it not work
>>>> at all?
>>>
>>> Yeah, there is an nvme driver bug that hits a deadlock on a very quick
>>> add-remove sequence. The nvme remove tries to delete IO resources before
>>> the async probe side has set them up, so the driver doesn't actually see
>>> that they're invalid. I have a proposed fix, but am waiting to hear
>>> whether it is successful.
>>>
>>> bz: https://bugzilla.kernel.org/show_bug.cgi?id=202081
>>
>> Hm, there's no full dmesg output attached, so it's not possible to
>> tell what the topology looks like and what the vendor/device ID of
>> 0000:b0:04.0 is.
>>
>> Also, there's only a card present / link up sequence visible in the
>> abridged dmesg output which has a 4 usec delay, but no link up / card
>> present sequence with a 400 msec delay?
>
> Yeah, not easy to follow, and some discussion was off the bz.
>
> Link Change:
>
> [ 838.784541] pciehp 0000:b0:04.0:pcie204: Slot(178): Link Up
>
> Presence Detect Change +400 msec:
>
> [ 839.183506] pciehp 0000:b0:04.0:pcie204: Slot(178): Card not present
>
> In between these two entries, nvme starts setting up the controller it
> detected on the link up. The "not present" side tries to remove the same
> nvme device, but fails to invalidate the IO resources because it's racing
> with probe before it even set them up, leaving probe unable to complete
> IO a moment later because its IRQ resources were disabled.
>
> Meanwhile, the blk-mq timeout handler can't do anything because the
> device state is disconnected and believes the removal side is handling
> things. What a mess...
>
> We can fix it, just want to hear if Alex can confirm the proposal is
> successful.
OOPS! Totally missed there was a patch on bz. Will update bz once
testing is done.
Alex
* Re: PCI: hotplug: Erroneous removal of hotplug PCI devices
2019-01-23 19:07 ` Lukas Wunner
2019-01-23 19:09 ` Keith Busch
@ 2019-01-23 23:50 ` Alex_Gagniuc
2019-01-24 9:25 ` Lukas Wunner
1 sibling, 1 reply; 23+ messages in thread
From: Alex_Gagniuc @ 2019-01-23 23:50 UTC (permalink / raw)
To: lukas; +Cc: linux-pci, bhelgaas, keith.busch, Austin.Bolen
On 1/23/19 1:07 PM, Lukas Wunner wrote:
> On Wed, Jan 23, 2019 at 07:54:20PM +0100, Lukas Wunner wrote:
>> So I don't see a perfect solution. What device are we talking about
>> anyway? 400 ms is a *long* time.
>
> Also, how exactly does this issue manifest itself: Is it just an
> annoyance that the slot is brought up/down/up or does it not work
> at all?
nvme issue aside, the device is removed. Then, a few seconds later, we
see Link Up and Card Present events, but no pciehp_isr() invocation. I
think it all happens in one pciehp_handle_presence_or_link_change(), so
pciehp_ist() is held up for a few seconds -- but that's a matter for a
different thread.
Alex.
[ 785.007640] nvme 0000:b1:00.0: enabling device (0000 -> 0002)
[ 785.099567] pcieport 0000:b0:04.0: ZOPA: Status(0x00000040):
presence-detected
[ 785.099574] pcieport 0000:b0:04.0: ZOPA: Events(0x00000008):
presence-detect-changed
[ 785.099580] pcieport 0000:b0:04.0: ZOPA: Queued up(0x00000008):
presence-detect-changed
[ 785.099582] pcieport 0000:b0:04.0: ZOPA: pciehp_isr: exiting
[ 785.099611] pcieport 0000:b0:04.0: ZOPA: pciehp_ist: Entered with
events(0x00000008): presence-detect-changed
[ 785.099615] pciehp 0000:b0:04.0:pcie204: ZOPA:
pciehp_handle_presence_or_link_change: locked &ctrl->state_lock
[ 785.099618] pciehp 0000:b0:04.0:pcie204: ZOPA:
pciehp_handle_presence_or_link_change unlocked (POWEROFF)
[ 785.099621] pciehp 0000:b0:04.0:pcie204: Slot(178): Card not present
[ 788.756953] nvme nvme4: failed to mark controller CONNECTING
[ 788.756958] nvme nvme4: Removing after probe failure status: 0
[ 788.763317] pciehp 0000:b0:04.0:pcie204: ZOPA:
pciehp_handle_presence_or_link_change unlocked (POWERON)
[ 788.763322] pciehp 0000:b0:04.0:pcie204: Slot(178): Card present
[ 788.763349] pciehp 0000:b0:04.0:pcie204: Slot(178): Link Up
[ 788.892432] pci 0000:b1:00.0: [8086:0a55] type 00 class 0x010802
[ 788.892474] pci 0000:b1:00.0: reg 0x10: [mem 0xe1500000-0xe1503fff 64bit]
* Re: PCI: hotplug: Erroneous removal of hotplug PCI devices
2019-01-23 23:50 ` Alex_Gagniuc
@ 2019-01-24 9:25 ` Lukas Wunner
0 siblings, 0 replies; 23+ messages in thread
From: Lukas Wunner @ 2019-01-24 9:25 UTC (permalink / raw)
To: Alex_Gagniuc; +Cc: linux-pci, bhelgaas, keith.busch, Austin.Bolen
On Wed, Jan 23, 2019 at 11:50:30PM +0000, Alex_Gagniuc@Dellteam.com wrote:
> nvme issue aside, the device is removed. Then, a few seconds later, we
> see Link Up and Card Present events, but no pciehp_isr() invocation. I
> think it all happens in one pciehp_handle_presence_or_link_change(), so
> pciehp_ist() is held up for a few seconds -- but that's a matter for a
> different thread.
Yes, that's normal and expected. The rationale is that we only know that
link or presence has changed. It could even have changed multiple times.
Even if the link is now back up, we don't know if it's still the same card.
So the slot is brought down and if it's currently occupied or the link is
up, the slot is brought up again.
It would of course be possible to come up with more complicated schemes to
try to handle odd behavior such as a belated presence change after a link
change, but then it also becomes more complicated to reason about the
scheme's correctness and ensure that other types of odd behavior (such as
presence hardwired to zero) are still handled correctly.
Thanks,
Lukas
> [ 785.007640] nvme 0000:b1:00.0: enabling device (0000 -> 0002)
> [ 785.099567] pcieport 0000:b0:04.0: ZOPA: Status(0x00000040):
> presence-detected
> [ 785.099574] pcieport 0000:b0:04.0: ZOPA: Events(0x00000008):
> presence-detect-changed
> [ 785.099580] pcieport 0000:b0:04.0: ZOPA: Queued up(0x00000008):
> presence-detect-changed
> [ 785.099582] pcieport 0000:b0:04.0: ZOPA: pciehp_isr: exiting
> [ 785.099611] pcieport 0000:b0:04.0: ZOPA: pciehp_ist: Entered with
> events(0x00000008): presence-detect-changed
> [ 785.099615] pciehp 0000:b0:04.0:pcie204: ZOPA:
> pciehp_handle_presence_or_link_change: locked &ctrl->state_lock
> [ 785.099618] pciehp 0000:b0:04.0:pcie204: ZOPA:
> pciehp_handle_presence_or_link_change unlocked (POWEROFF)
> [ 785.099621] pciehp 0000:b0:04.0:pcie204: Slot(178): Card not present
> [ 788.756953] nvme nvme4: failed to mark controller CONNECTING
> [ 788.756958] nvme nvme4: Removing after probe failure status: 0
> [ 788.763317] pciehp 0000:b0:04.0:pcie204: ZOPA:
> pciehp_handle_presence_or_link_change unlocked (POWERON)
> [ 788.763322] pciehp 0000:b0:04.0:pcie204: Slot(178): Card present
> [ 788.763349] pciehp 0000:b0:04.0:pcie204: Slot(178): Link Up
> [ 788.892432] pci 0000:b1:00.0: [8086:0a55] type 00 class 0x010802
> [ 788.892474] pci 0000:b1:00.0: reg 0x10: [mem 0xe1500000-0xe1503fff 64bit]
>
* Re: PCI: hotplug: Erroneous removal of hotplug PCI devices
[not found] ` <b32e6ca62ae2494f98450df81ca1ee14@AUSX13MPC131.AMER.DELL.COM>
@ 2019-01-24 20:20 ` Keith Busch
2019-01-24 22:00 ` Austin.Bolen
0 siblings, 1 reply; 23+ messages in thread
From: Keith Busch @ 2019-01-24 20:20 UTC (permalink / raw)
To: Austin.Bolen; +Cc: Alex_Gagniuc, linux-pci, bhelgaas, lukas
[in plain text for the mailing lists]
On Thu, Jan 24, 2019 at 08:10:47PM +0000, Austin.Bolen@dell.com wrote:
> On 1/23/2019 12:46 PM, Keith Busch wrote:
>
> On Wed, Jan 23, 2019 at 06:20:57PM +0000, Alex_Gagniuc@Dellteam.com<mailto:Alex_Gagniuc@Dellteam.com> wrote:
>
>
> <snip>
>
> Solution 1 is to say it's a spec violation and ignore it. They'll change
> the "logical OR" thing in the next PCIe spec, but we'll still have to
> worry about this on existing hardware.
>
>
>
> When's that changing? 5.0 is the next spec, and it still says:
>
> Presence Detect State - This bit indicates the presence of an adapter
> in the slot, reflected by the logical "OR" of the Physical Layer
> in-band presence detect mechanism and, if present, any out-of-band
> presence detect mechanism defined for the slot's corresponding
> form factor.
>
>
> The ECN that changes this behavior (Async Hot-Plug Updates ECN) is
> ratified and published to the PCI-SIG website. It modifies PCIe Base
> Spec 4.0 and will be incorporated into PCIe Base Spec 5.0. Check out
> the ECN section titled "In-band Presence Detect Mechanism Deprecated
> for Async Hot-Plug".
My copy of the 5.0-0.9 spec missed out on this ECN. The new changes look
like they more clearly define the expected software and hardware behavior,
so I'm looking forward to seeing the hardware implementations. Thanks
for driving that.
* Re: PCI: hotplug: Erroneous removal of hotplug PCI devices
2019-01-24 20:20 ` Keith Busch
@ 2019-01-24 22:00 ` Austin.Bolen
2019-01-25 8:22 ` Lukas Wunner
0 siblings, 1 reply; 23+ messages in thread
From: Austin.Bolen @ 2019-01-24 22:00 UTC (permalink / raw)
To: keith.busch, Austin.Bolen; +Cc: Alex_Gagniuc, linux-pci, bhelgaas, lukas
On 1/24/2019 2:21 PM, Keith Busch wrote:
>
<snip>
> My copy of the 5.0-0.9 spec missed out on this ECN. The new changes look
> like they more clearly define the expected software and hardware behavior,
> so I'm looking forward to seeing the hardware implementations. Thanks
> for driving that.
>
You're welcome! And yes, this ECN is not yet in the 5.0 draft but should
be there soon. Much of what is in the ECN can be implemented with existing
DPC hardware. There are associated ECNs in ACPI and PCI Firmware
Specifications as well. The ACPI spec changes will not be public until
the 6.3 spec is released (soon). The PCI Firmware Specification changes
should be ratified in a few weeks as well. In the meantime, UEFI org
(which ACPI is a part of) has a git repo accessible to UEFI forum
members where the Linux code to support these new async hot-plug
mechanisms is being developed. Will be made public once the specs are
public.
Thanks,
Austin
* Re: PCI: hotplug: Erroneous removal of hotplug PCI devices
2019-01-23 18:54 ` Lukas Wunner
2019-01-23 19:07 ` Lukas Wunner
@ 2019-01-24 22:33 ` Austin.Bolen
1 sibling, 0 replies; 23+ messages in thread
From: Austin.Bolen @ 2019-01-24 22:33 UTC (permalink / raw)
To: lukas, Alex_Gagniuc; +Cc: linux-pci, bhelgaas, keith.busch, Austin.Bolen
On 1/23/2019 12:54 PM, Lukas Wunner wrote:
>
<snip
>
> So I don't see a perfect solution. What device are we talking about
> anyway? 400 ms is a *long* time.
Some people have slow hands :-). This is a U.2 device. There are 2
pins (Pin 4 = IfDet#/Pin 10 = PRSNT#) that determine presence and each
pin is a different length (On insert, IfDet# mates first, PRSNT# mates
second, PCIe data lanes mate 3rd).
SAS drive: IfDet# = 0, PRSNT# = 0
NVMe drive: IfDet# = 0, PRSNT# = 1.
These 2 pins determine whether it is a SAS or NVMe drive. We have to
wait until both pins mate in order to know for sure it's an NVMe drive
and therefore generate PRSNT# to the DSP above the drive. But there is
no way to know when both pins have mated, due to the way the U.2
connector signals were defined, so we have to dead-wait.
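The pin decode described above can be written down directly (the enum and
the treatment of a deasserted IfDet# as an empty bay are assumptions for
illustration):

```c
enum u2_drive_type { U2_EMPTY, U2_SAS, U2_NVME };

/*
 * Decode per the table above: with IfDet# (pin 4) asserted (0),
 * PRSNT# (pin 10) distinguishes SAS (0) from NVMe (1).  Treating a
 * deasserted IfDet# as "no drive" is an assumption here; the real
 * backplane must also dead-wait for both pins to mate before trusting
 * the PRSNT# value.
 */
static enum u2_drive_type u2_decode(int ifdet_n, int prsnt_n)
{
    if (ifdet_n)
        return U2_EMPTY;
    return prsnt_n ? U2_NVME : U2_SAS;
}
```
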
The time delta between these 2 pins mating on insert can vary
substantially with different users. For some users we measured a delta
greater than 100 ms (others were less than 10 ms), and that was for a
"normal" insert. It takes much longer if the user intentionally inserts
slowly.
We chose 500 ms as the delay after P4 (IfDet#) mates before sampling the
P10 (PRSNT#). About 100 ms of that is chewed up by the time delta of
presence pins mating and PCIe data lanes mating plus the time to train
the link leaving around 400 ms of time between DLLSC interrupt and PDSC
interrupt.
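As a back-of-envelope check of the numbers above (these are the approximate lab figures quoted in this mail, not spec values):

```c
#include <assert.h>

/* Delay budget described above: the platform samples P10 (PRSNT#)
 * 500 ms after P4 (IfDet#) mates, and roughly 100 ms of that is
 * consumed by pin-mating skew plus link training.  The remainder is
 * the observed gap between the DLLSC and PDSC interrupts. */
enum {
	U2_PRSNT_SAMPLE_DELAY_MS = 500,  /* wait after IfDet# mates  */
	U2_MATE_AND_TRAIN_MS     = 100,  /* pin skew + link training */
};

static int dllsc_to_pdsc_gap_ms(void)
{
	return U2_PRSNT_SAMPLE_DELAY_MS - U2_MATE_AND_TRAIN_MS;
}
```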
In the future, we will hide the U.2 device from the OS for 1 second after
insert to allow the device to be fully inserted (in most cases) and reach
steady state.
New form factors will (hopefully) have presence pins that are
last-to-mate so platforms/OSes do not have to rely on such guesswork to
know when devices are fully inserted.
>
> Thanks,
>
> Lukas
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: PCI: hotplug: Erroneous removal of hotplug PCI devices
2019-01-23 19:02 ` Lukas Wunner
2019-01-23 19:07 ` Keith Busch
@ 2019-01-24 22:43 ` Austin.Bolen
2019-01-24 22:52 ` Austin.Bolen
1 sibling, 1 reply; 23+ messages in thread
From: Austin.Bolen @ 2019-01-24 22:43 UTC (permalink / raw)
To: lukas, keith.busch; +Cc: Alex_Gagniuc, linux-pci, bhelgaas, Austin.Bolen
On 1/23/2019 1:02 PM, Lukas Wunner wrote:
>
<snip>
>
> Well, usually it's desirable to bring up the slot as quickly as possible,
> so once we get any kind of link or presence event, we immediately try to
> bring up the slot.
>
> We do allow a 20 + 100 ms delay in pcie_wait_for_link() between link up
> and presence detect up, just not 400 ms.
Is the hot-inserted device's config space accessed immediately after
waiting this 20 + 100 ms delay? Per PCIe spec, in Gen 3 mode, software
should wait at least (1s - CTO value) after Data Link Layer Link Active
before accessing config space. The exception is if OS enables CRS
Software Visibility in which case config space can be accessed 100 ms
after Data Link Layer Link Active. Is CRS Software Visibility being used?
Cheers,
Austin
>
> Thanks,
>
> Lukas
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: PCI: hotplug: Erroneous removal of hotplug PCI devices
2019-01-24 22:43 ` Austin.Bolen
@ 2019-01-24 22:52 ` Austin.Bolen
0 siblings, 0 replies; 23+ messages in thread
From: Austin.Bolen @ 2019-01-24 22:52 UTC (permalink / raw)
To: Austin.Bolen, lukas, keith.busch; +Cc: Alex_Gagniuc, linux-pci, bhelgaas
On 1/24/2019 4:43 PM, Bolen, Austin wrote:
> On 1/23/2019 1:02 PM, Lukas Wunner wrote:
>>
>
> <snip>
>
>>
>> Well, usually it's desirable to bring up the slot as quickly as possible,
>> so once we get any kind of link or presence event, we immediately try to
>> bring up the slot.
>>
>> We do allow a 20 + 100 ms delay in pcie_wait_for_link() between link up
>> and presence detect up, just not 400 ms.
>
> Is the hot-inserted device's config space accessed immediately after
> waiting this 20 + 100 ms delay? Per PCIe spec, in Gen 3 mode, software
> should wait at least (1s - CTO value) after Data Link Layer Link Active
> before accessing config space. The exception is if OS enables CRS
> Software Visibility in which case config space can be accessed 100 ms
> after Data Link Layer Link Active. Is CRS Software Visibility being used?
>
I forgot... Readiness Notifications can also let software bypass these
first access times but I've not seen anybody implement RN so assume it
is not being used here.
> Cheers,
> Austin
>
>>
>> Thanks,
>>
>> Lukas
>>
>
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: PCI: hotplug: Erroneous removal of hotplug PCI devices
2019-01-24 22:00 ` Austin.Bolen
@ 2019-01-25 8:22 ` Lukas Wunner
2019-01-25 22:39 ` Austin.Bolen
0 siblings, 1 reply; 23+ messages in thread
From: Lukas Wunner @ 2019-01-25 8:22 UTC (permalink / raw)
To: Austin.Bolen; +Cc: keith.busch, Alex_Gagniuc, linux-pci, bhelgaas
On Thu, Jan 24, 2019 at 10:00:49PM +0000, Austin.Bolen@dell.com wrote:
> You're welcome! And yes, this ECN is not yet in 5.0 draft but should be
> there soon. Much of what is in the ECN can be implemented with existing
> DPC hardware. There are associated ECNs in ACPI and PCI Firmware
> Specifications as well. The ACPI spec changes will not be public until
> the 6.3 spec is released (soon). The PCI Firmware Specification changes
> should be ratified in a few weeks as well. In the meantime, UEFI org
> (which ACPI is a part of) has a git repo accessible to UEFI forum
> members where the Linux code to support these new async hot-plug
> mechanisms is being developed. Will be made public once the specs are
> public.
Unfortunately I don't have access to those ECNs or the non-public
git repo.
> Some people have slow hands :-). This is a U.2 device. There are 2
> pins (Pin 4 = IfDet#/Pin 10 = PRSNT#) that determine presence and each
> pin is a different length (On insert, IfDet# mates first, PRSNT# mates
> second, PCIe data lanes mate 3rd).
But I've found a public spec for that:
http://www.ssdformfactor.org/docs/SSD_Form_Factor_Version1_00.pdf
> These 2 pins determine if it is a SAS or NVMe drive. We have to wait
> until both pins mate in order to know for sure it's an NVMe drive and
> therefore generate PRSNT# to DSP above the drive. But there is no way
> to know when both pins have mated due to the way the U.2 connector
> signals were defined so we have to dead wait.
What I don't get is, a lot of the pins are used either for PCIe or SAS:
If the host controller doesn't know yet whether it's PCIe or SAS, why does
it already start driving them in PCIe mode and bring up the link?
Also, those PCIe pins mate 3rd, so by the time the host controller starts
driving those pins in PCIe mode, IfDet# and PRSNT# pins are known to be
connected (because they mate earlier), so the host controller can stop
waiting and start sampling the two pins? Or is it mechanically possible
to insert the drive in such a way that the pins do not mate before the
PCIe pins, contrary to the spec?
Link up means in-band presence, so I assume this host controller already
adheres to the ECN discussed earlier? What is the rationale of the ECN
given that it causes these issues? Is this product being shipped or
can it still be changed?
> > We do allow a 20 + 100 ms delay in pcie_wait_for_link() between link up
> > and presence detect up, just not 400 ms.
>
> Is the hot-inserted device's config space accessed immediately after
> waiting this 20 + 100 ms delay?
Yes.
You can follow the code via this link, click your way down the call stack:
https://elixir.bootlin.com/linux/v5.0-rc3/source/drivers/pci/hotplug/pciehp_hpc.c#L602
pciehp_ist()
pciehp_handle_presence_or_link_change() /* called on PDC or DLLSC event */
pciehp_enable_slot() /* called if PDS or DLLLA is 1 */
__pciehp_enable_slot()
board_added()
pciehp_check_link_status()
pcie_wait_for_link()
/*
* wait 20 ms per PCIe r4.0 sec 6.6.1,
* then wait for up to 1 s until DLLLA is 1,
* then wait 100 ms per PCIe r4.0 sec 6.7.3.3
*/
pci_bus_check_dev()
/*
* first access to hotplugged device:
* wait for up to 1 s until vendor ID is accessible
*/
/*
* any additional PDC or DLLSC event that occurred up to this point
* is ignored because the link is up and config space accessible;
* read Link Status register and verify link is trained
*/
pciehp_configure_device()
/* enumerate hotplugged device */
The 20 ms delay in pcie_wait_for_link() was added by Keith with commit
f0157160b359 ("PCI: Make link active reporting detection generic")
so that the function can also be used by the DPC driver. I think that
additional delay wouldn't normally be necessary for hotplug but it
doesn't hurt much either.
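The wait sequence above boils down to something like the following sketch (simplified; link_active(), vendor_id_valid() and sleep_ms() are stand-ins for the real register reads and delays, not pciehp functions):

```c
#include <assert.h>
#include <stdbool.h>

#define LINK_WAIT_MS         20    /* PCIe r4.0 sec 6.6.1          */
#define LINK_UP_TIMEOUT_MS   1000  /* poll DLLLA for up to 1 s     */
#define POST_LINK_DELAY_MS   100   /* PCIe r4.0 sec 6.7.3.3        */
#define CFG_READY_TIMEOUT_MS 1000  /* poll vendor ID for up to 1 s */
#define POLL_STEP_MS         10

static bool wait_for_device(bool (*link_active)(void),
                            bool (*vendor_id_valid)(void),
                            void (*sleep_ms)(int))
{
	int waited;

	sleep_ms(LINK_WAIT_MS);
	for (waited = 0; !link_active(); waited += POLL_STEP_MS) {
		if (waited >= LINK_UP_TIMEOUT_MS)
			return false;            /* link never came up */
		sleep_ms(POLL_STEP_MS);
	}
	sleep_ms(POST_LINK_DELAY_MS);
	for (waited = 0; !vendor_id_valid(); waited += POLL_STEP_MS) {
		if (waited >= CFG_READY_TIMEOUT_MS)
			return false;            /* device never became ready */
		sleep_ms(POLL_STEP_MS);
	}
	return true;
}
```

In the fast path (link already up, device already ready) this sleeps only 20 + 100 ms, which is exactly the window Austin is questioning.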
> Per PCIe spec, in Gen 3 mode, software
> should wait at least (1s - CTO value) after Data Link Layer Link Active
> before accessing config space.
Hm, do you have a spec reference for that? PCIe r4.0 sec 6.7.3.3 only says:
"Software must wait for 100 ms after the Data Link Layer Link Active
bit reads 1b before initiating a configuration access to the hot
added device (see Section 6.6)."
> The exception is if OS enables CRS
> Software Visibility in which case config space can be accessed 100 ms
> after Data Link Layer Link Active. Is CRS Software Visibility being used?
Not that I'm aware of, no.
> I forgot... Readiness Notifications can also let software bypass these
> first access times but I've not seen anybody implement RN so assume it
> is not being used here.
Right, I think this isn't used either so far.
Thanks,
Lukas
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: PCI: hotplug: Erroneous removal of hotplug PCI devices
2019-01-25 8:22 ` Lukas Wunner
@ 2019-01-25 22:39 ` Austin.Bolen
2019-01-26 12:12 ` Lukas Wunner
0 siblings, 1 reply; 23+ messages in thread
From: Austin.Bolen @ 2019-01-25 22:39 UTC (permalink / raw)
To: lukas, Austin.Bolen; +Cc: keith.busch, Alex_Gagniuc, linux-pci, bhelgaas
On 1/25/2019 2:22 AM, Lukas Wunner wrote:
>
> On Thu, Jan 24, 2019 at 10:00:49PM +0000, Austin.Bolen@dell.com wrote:
>> You're welcome! And yes, this ECN is not yet in 5.0 draft but should be
>> there soon. Much of what is in the ECN can be implemented with existing
>> DPC hardware. There are associated ECNs in ACPI and PCI Firmware
>> Specifications as well. The ACPI spec changes will not be public until
>> the 6.3 spec is released (soon). The PCI Firmware Specification changes
>> should be ratified in a few weeks as well. In the meantime, UEFI org
>> (which ACPI is a part of) has a git repo accessible to UEFI forum
>> members where the Linux code to support these new async hot-plug
>> mechanisms is being developed. Will be made public once the specs are
>> public.
>
> Unfortunately I don't have access to those ECNs or the non-public
> git repo.
>
The PCIe Async Hot-Plug ECN has most of the new stuff. It has been
released and has the same level of access as the PCIe Base Spec so if
you have access to the PCIe base spec then you should have access to
this ECN.
>
>> Some people have slow hands :-). This is a U.2 device. There are 2
>> pins (Pin 4 = IfDet#/Pin 10 = PRSNT#) that determine presence and each
>> pin is a different length (On insert, IfDet# mates first, PRSNT# mates
>> second, PCIe data lanes mate 3rd).
>
> But I've found a public spec for that:
> http://www.ssdformfactor.org/docs/SSD_Form_Factor_Version1_00.pdf
>
>> These 2 pins determine if it is a SAS or NVMe drive. We have to wait
>> until both pins mate in order to know for sure it's an NVMe drive and
>> therefore generate PRSNT# to DSP above the drive. But there is no way
>> to know when both pins have mated due to the way the U.2 connector
>> signals were defined so we have to dead wait.
>
> What I don't get is, a lot of the pins are used either for PCIe or SAS:
> If the host controller doesn't know yet whether it's PCIe or SAS, why does
> it already start driving them in PCIe mode and bring up the link?
>
> Also, those PCIe pins mate 3rd, so by the time the host controller starts
> driving those pins in PCIe mode, IfDet# and PRSNT# pins are known to be
> connected (because they mate earlier), so the host controller can stop
> waiting and start sampling the two pins? Or is it mechanically possible
> to insert the drive in such a way that the pins do not mate before the
> PCIe pins, contrary to the spec?
In this case, there's not a single host controller. There are two
things at play. The sideband signals (IfDet#, PRSNT#) are routed to
logic on the platform motherboard or drive backplane. The PCIe lanes
are routed to a downstream port (Root Port or Switch Downstream port).
The platform can see when the sideband signals connect but can't see
when the link trains. When the link trains and Data Link Layer goes to
Active state, the DSP will notify OS via DLLSC interrupt but platform
has no visibility into this (because we grant control of hot-plug to OS
natively) and there's no way for platform to get notified when the link
becomes active. The PCIe ECN referenced above provides the ability to
notify the platform in this scenario but requires a hardware change.
Likewise the DSP doesn't know or care if it's possible to plug a SAS
drive into the slot. As soon as it starts seeing training sets on the
PCIe link it's going to train regardless of the state of IfDet#/PRSNT#.
Also, the DSP doesn't know the state of IfDet#/PRSNT# directly. There's
a pin called PRSNT# in PCIe DSPs (different than the U2 P10 PRSNT#).
The platform will assert PCIe PRSNT# to the DSP when U.2 PRSNT#/IfDet#
indicate it's an NVMe drive (and platform doesn't know this until around
400 ms after IfDet# mates because of the reasons described earlier).
But PCIe Base spec does not require DSPs to see PCIe PRSNT# asserted in
order to start the training sequence. So that's why the PCIe link comes
up before the platform knows it's an NVMe drive.
>
> Link up means in-band presence,
Link Up and In-band Presence are slightly different states (see Table
4.15 in PCIe Base Spec). In-band presence is a lower-level detection
mechanism and can be set even when link is not up.
> so I assume this host controller already
> adheres to the ECN discussed earlier?
Yes, this platform disables in-band presence.
> What is the rationale of the ECN
> given that it causes these issues?
In certain link states like L1, DSPs can't detect in-band presence going
away. If you remove a device while the link is in one of these states
then OS will not get notified. OS will not be able to detect subsequent
insertions either. Only way out of this is for user to go into the OS
and manually get the link out of L1.
The PCIe spec doesn't have any strict requirements on how long the time
between DLLSC and PDSC can be but from our measurements 100 ms is not
long enough to cover the time delta from first pins mating until the
device is fully inserted. At a platform level, we use 1 second to allow
for insertion and removal events and this seems adequate for most cases
but can be defeated if someone makes an effort to slowly insert or
remove a device.
> Is this product being shipped or
> can it still be changed?
>
It's shipping. It could be changed with a firmware change to re-enable
in-band presence but then we'd have the issue where hot-plug breaks when
removing devices when link is in certain states like L1 so that is not a
viable solution.
>
>>> We do allow a 20 + 100 ms delay in pcie_wait_for_link() between link up
>>> and presence detect up, just not 400 ms.
>>
>> Is the hot-inserted device's config space accessed immediately after
>> waiting this 20 + 100 ms delay?
>
> Yes.
>
> You can follow the code via this link, click your way down the call stack:
> https://elixir.bootlin.com/linux/v5.0-rc3/source/drivers/pci/hotplug/pciehp_hpc.c#L602
>
> pciehp_ist()
> pciehp_handle_presence_or_link_change() /* called on PDC or DLLSC event */
> pciehp_enable_slot() /* called if PDS or DLLLA is 1 */
> __pciehp_enable_slot()
> board_added()
> pciehp_check_link_status()
> pcie_wait_for_link()
> /*
> * wait 20 ms per PCIe r4.0 sec 6.6.1,
> * then wait for up to 1 s until DLLLA is 1,
> * then wait 100 ms per PCIe r4.0 sec 6.7.3.3
> */
> pci_bus_check_dev()
> /*
> * first access to hotplugged device:
> * wait for up to 1 s until vendor ID is accessible
> */
> /*
> * any additional PDC or DLLSC event that occurred up to this point
> * is ignored because the link is up and config space accessible;
> * read Link Status register and verify link is trained
> */
> pciehp_configure_device()
> /* enumerate hotplugged device */
>
> The 20 ms delay in pcie_wait_for_link() was added by Keith with commit
> f0157160b359 ("PCI: Make link active reporting detection generic")
> so that the function can also be used by the DPC driver. I think that
> additional delay wouldn't normally be necessary for hotplug but it
> doesn't hurt much either.
>
>
>> Per PCIe spec, in Gen 3 mode, software
>> should wait at least (1s - CTO value) after Data Link Layer Link Active
>> before accessing config space.
>
> Hm, do you have a spec reference for that? PCIe r4.0 sec 6.7.3.3 only says:
>
> "Software must wait for 100 ms after the Data Link Layer Link Active
> bit reads 1b before initiating a configuration access to the hot
> added device (see Section 6.6)."
>
Next sentence says "Software must allow 1 second after the Data Link
Layer Link Active bit reads 1b before it is permitted to determine that
a hot plugged device which fails to return a Successful Completion for a
Valid Configuration Request is a broken device".
And Section 6.6.1 on Conventional Reset says "Software should use 100 ms
wait periods only if software enables CRS Software Visibility.
Otherwise, Completion timeouts, platform timeouts, or lengthy processor
instruction stalls may result."
The "Completion timeouts" reference here is the key. Since CRS Software
Visibility is not used, the device still has 900 ms before it has to
return a Successful Completion if you issue a config read 100 ms after
DLLSC.
CTO timers are typically set around 50 to 65 ms on x86 systems. So if
you issue config read after 100 ms you could see a CTO around 50 to 65
ms after it's issued which would crash the system. In order to avoid
this CTO software has to wait (1 s - CTO value) after DLLSC before
issuing a config read at Gen 3 (timing is slightly different for Gen 1/2
speeds). This timing constraint needs to be observed any time a device
is reset as well.
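The arithmetic above can be captured in a small helper (values and names are illustrative only):

```c
#include <assert.h>

/* Timing rule described above: without CRS Software Visibility the
 * device may take up to 1 s after link-up to return a Successful
 * Completion, so the first config read must be held off by
 * (1 s - Completion Timeout) to keep the CTO from firing first.
 * With CRS SV enabled, 100 ms per PCIe r4.0 sec 6.7.3.3 suffices. */
static unsigned int first_cfg_read_delay_ms(unsigned int cto_ms,
                                            int crs_sv_enabled)
{
	if (crs_sv_enabled)
		return 100;
	return 1000 - cto_ms;
}
```

With the typical x86 CTO values quoted above (50 to 65 ms), that works out to roughly a 935 to 950 ms hold-off when CRS SV is off.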
Probably won't see an issue most of the time because most devices can
come up and respond to config accesses way faster than 100 ms but I've
seen devices take longer than that in certain corner cases. Regardless
of most devices' ability to be ready faster than required by PCIe it's
probably a good idea to wait longer for hot-insert anyway since
insertion events can be slow so need time for everything to reach steady
state (fully inserted) before starting to configure things and 100 ms is
not enough time for that based on our lab measurements. Likewise, DLLSC
and PDSC interrupts can be spread pretty far apart on removal as well
depending on how slowly the user removes the device.
>
>> The exception is if OS enables CRS
>> Software Visibility in which case config space can be accessed 100 ms
>> after Data Link Layer Link Active. Is CRS Software Visibility being used?
>
> Not that I'm aware of, no.
>
>
>> I forgot... Readiness Notifications can also let software bypass these
>> first access times but I've not seen anybody implement RN so assume it
>> is not being used here.
>
> Right, I think this isn't used either so far.
>
> Thanks,
>
> Lukas
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: PCI: hotplug: Erroneous removal of hotplug PCI devices
2019-01-25 22:39 ` Austin.Bolen
@ 2019-01-26 12:12 ` Lukas Wunner
2019-01-30 14:28 ` Austin.Bolen
0 siblings, 1 reply; 23+ messages in thread
From: Lukas Wunner @ 2019-01-26 12:12 UTC (permalink / raw)
To: Austin.Bolen; +Cc: keith.busch, Alex_Gagniuc, linux-pci, bhelgaas
On Fri, Jan 25, 2019 at 10:39:04PM +0000, Austin.Bolen@dell.com wrote:
> On 1/25/2019 2:22 AM, Lukas Wunner wrote:
> > pcie_wait_for_link()
> > /*
> > * wait 20 ms per PCIe r4.0 sec 6.6.1,
> > * then wait for up to 1 s until DLLLA is 1,
> > * then wait 100 ms per PCIe r4.0 sec 6.7.3.3
> > */
> > pci_bus_check_dev()
> > /*
> > * first access to hotplugged device:
> > * wait for up to 1 s until vendor ID is accessible
> > */
[...]
> > PCIe r4.0 sec 6.7.3.3 only says:
> >
> > "Software must wait for 100 ms after the Data Link Layer Link Active
> > bit reads 1b before initiating a configuration access to the hot
> > added device (see Section 6.6)."
> >
>
> Next sentence says "Software must allow 1 second after the Data Link
> Layer Link Active bit reads 1b before it is permitted to determine that
> a hot plugged device which fails to return a Successful Completion for a
> Valid Configuration Request is a broken device".
Yes, that's why pci_bus_check_dev() polls the vendor ID register for
up to 1 sec before it declares failure. If that approach is not correct
or if we deviate from the spec somewhere else then that should of course
be fixed, so please let me know if so.
> And Section 6.6.1 on Conventional Reset says "Software should use 100 ms
> wait periods only if software enables CRS Software Visibility.
> Otherwise, Completion timeouts, platform timeouts, or lengthy processor
> instruction stalls may result."
>
> The "Completion timeouts" reference here is the key. Since CRS Software
> Visibility is not used, the device still has 900 ms before it has to
> return a Successful Completion if you issue a config read 100 ms after
> DLLSC.
I'm truly sorry, I double-checked the code and it turns out Linux does
enable CRS Software Visibility unconditionally for all bridges that
support it:
pci_scan_bridge_extend()
pci_enable_crs()
My apologies for the misinformation in my previous e-mail. Since your
explanation hinges on CRS not being used, I'm not sure in how far it
applies to the problem at hand. (I do thank you though for the
comprehensive explanation, I have a much better understanding now
of the issue you're facing.)
> > Is this product being shipped or can it still be changed?
>
> It's shipping. It could be changed with a firmware change to re-enable
> in-band presence but then we'd have the issue where hot-plug breaks when
> removing devices when link is in certain states like L1 so that is not a
> viable solution.
An alternative idea would be to change firmware to always drive PCIe PRSNT#
high as soon as U.2 PRSNT#/IfDet# is sensed, regardless whether it's NVMe
or not. Linux will try to bring up the slot and if it can't be brought up
because it's a SAS drive, all you'll get is an error message in dmesg,
but you shouldn't see other negative effects.
I think the proper solution would be to add a microcontroller to each
U.2 slot on your motherboard which snoops on the PCIe pins to determine
whether communication is taking place and sample the PRSNT#/IfDet# pins
if so. This leads to higher BOM cost and development cost of course.
Arguably your platform firmware-based solution tries to avoid that cost
by simulating what a microcontroller would do but can't really achieve
the same result because it lacks access to the PCIe pins. Vendors such
as Apple do go to the lengths of adding microcontrollers in similar cases,
e.g. Apple used one on their motherboards with first generation Thunderbolt
controllers:
Because the physical port can be used either to attach a Thunderbolt
or a DisplayPort device, they added an NXP LP1112A ARM controller
which snoops on the pins and drives a multiplexer to route signals
either to the Thunderbolt controller or the GPU. See page 75 of
these schematics ("DisplayPort/T29 A MUXing"):
https://forums.macrumors.com/attachments/a1278-j30-820-3115-b-051-9058-pdf.451409/
Newer Thunderbolt controllers integrate that functionality so the
BOM cost and development cost is no longer necessary today.
> > Unfortunately I don't have access to those ECNs or the non-public
> > git repo.
>
> The PCIe Async Hot-Plug ECN has most of the new stuff. It has been
> released and has the same level of access as the PCIe Base Spec so if
> you have access to the PCIe base spec then you should have access to
> this ECN.
Unfortunately I'm not associated with a member of the PCISIG,
so my level of access to the PCIe Base Spec is constrained
to leaked documents found on Ukrainian websites. ;-)
Thanks,
Lukas
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: PCI: hotplug: Erroneous removal of hotplug PCI devices
2019-01-26 12:12 ` Lukas Wunner
@ 2019-01-30 14:28 ` Austin.Bolen
0 siblings, 0 replies; 23+ messages in thread
From: Austin.Bolen @ 2019-01-30 14:28 UTC (permalink / raw)
To: lukas, Austin.Bolen; +Cc: keith.busch, Alex_Gagniuc, linux-pci, bhelgaas
On 1/26/2019 6:12 AM, Lukas Wunner wrote:
>
<snip>
>
> Yes, that's why pci_bus_check_dev() polls the vendor ID register for
> up to 1 sec before it declares failure. If that approach is not correct
> or if we deviate from the spec somewhere else then that should of course
> be fixed, so please let me know if so.
>
>
>> And Section 6.6.1 on Conventional Reset says "Software should use 100 ms
>> wait periods only if software enables CRS Software Visibility.
>> Otherwise, Completion timeouts, platform timeouts, or lengthy processor
>> instruction stalls may result."
>>
>> The "Completion timeouts" reference here is the key. Since CRS Software
>> Visibility is not used, the device still has 900 ms before it has to
>> return a Successful Completion if you issue a config read 100 ms after
>> DLLSC.
>
> I'm truly sorry, I double-checked the code and it turns out Linux does
> enable CRS Software Visibility unconditionally for all bridges that
> support it:
>
> pci_scan_bridge_extend()
> pci_enable_crs()
>
> My apologies for the misinformation in my previous e-mail. Since your
> explanation hinges on CRS not being used, I'm not sure in how far it
> applies to the problem at hand. (I do thank you though for the
> comprehensive explanation, I have a much better understanding now
> of the issue you're facing.)
>
>
Yes, great: with the CRS Software Visibility mechanism in place it sounds
good. Thanks for double-checking it! There is a weird corner case
where as you insert a device the Data Link Layer becomes active (gold
fingers on the connector just barely make contact) but as you continue
to insert the device the link goes back down then back up once it's
fully inserted (we've captured this and other similar oddities in the
lab). The link down/up would reset the device and so in this case, it
looks like the logic could do a config read before the device is ready.
From the platform side we usually wait 1 second around hot-plug events for
everything to reach steady state since you don't otherwise know when the
mechanical action around hot-insert/hot-remove has ended.
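One way to express that 1-second steady-state rule (a hypothetical helper, not actual platform firmware code):

```c
#include <assert.h>
#include <stdbool.h>

#define SLOT_SETTLE_MS 1000UL  /* platform settle window quoted above */

/* The slot is considered stable only after a full settle window has
 * elapsed since the last DLLSC/PDSC event, so a link that flaps while
 * the drive is still sliding in keeps pushing the deadline out. */
static bool slot_settled(unsigned long now_ms, unsigned long last_event_ms)
{
	return now_ms - last_event_ms >= SLOT_SETTLE_MS;
}
```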
The other way to handle this is via DPC since it will avoid link coming
back up on its own but more work is needed to support DPC on some
systems. The coming ACPI additions should hopefully allow DPC to be
enabled on such systems.
>>> Is this product being shipped or can it still be changed?
>>
>> It's shipping. It could be changed with a firmware change to re-enable
>> in-band presence but then we'd have the issue where hot-plug breaks when
>> removing devices when link is in certain states like L1 so that is not a
>> viable solution.
>
> An alternative idea would be to change firmware to always drive PCIe PRSNT#
> high as soon as U.2 PRSNT#/IfDet# is sensed, regardless whether it's NVMe
> or not. Linux will try to bring up the slot and if it can't be brought up
> because it's a SAS drive, all you'll get is an error message in dmesg,
> but you shouldn't see other negative effects.
We used to do this, but this error and similar ones in other OSes are what
prompted the 500 ms delay circuitry.
>
> I think the proper solution would be to add a microcontroller to each
> U.2 slot on your motherboard which snoops on the PCIe pins to determine
> whether communication is taking place and sample the PRSNT#/IfDet# pins
> if so. This leads to higher BOM cost and development cost of course.
>
> Arguably your platform firmware-based solution tries to avoid that cost
> by simulating what a microcontroller would do but can't really achieve
> the same result because it lacks access to the PCIe pins. Vendors such
> as Apple do go to the lengths of adding microcontrollers in similar cases,
> e.g. Apple used one on their motherboards with first generation Thunderbolt
> controllers:
>
> Because the physical port can be used either to attach a Thunderbolt
> or a DisplayPort device, they added an NXP LP1112A ARM controller
> which snoops on the pins and drives a multiplexer to route signals
> either to the Thunderbolt controller or the GPU. See page 75 of
> these schematics ("DisplayPort/T29 A MUXing"):
>
> https://forums.macrumors.com/attachments/a1278-j30-820-3115-b-051-9058-pdf.451409/
>
> Newer Thunderbolt controllers integrate that functionality so the
> BOM cost and development cost is no longer necessary today.
>
>
Thanks for the links. And yes, we looked at things like this too but
they're prohibitively expensive. It's one thing if you have a single
Thunderbolt connector, but we have 24 or more drives.
But we do have other less expensive ways to detect when a link has
trained but those mechanisms or the microcontroller idea above all
require hardware changes so won't be possible on the systems already in
the field. Future platforms + DPC should handle it better.
>>> Unfortunately I don't have access to those ECNs or the non-public
>>> git repo.
>>
>> The PCIe Async Hot-Plug ECN has most of the new stuff. It has been
>> released and has the same level of access as the PCIe Base Spec so if
>> you have access to the PCIe base spec then you should have access to
>> this ECN.
>
> Unfortunately I'm not associated with a member of the PCISIG,
> so my level of access to the PCIe Base Spec is constrained
> to leaked documents found on Ukrainian websites. ;-)
>
Maybe you'll get lucky and they'll leak the ECN too :-)
> Thanks,
>
> Lukas
>
^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads:[~2019-01-30 14:28 UTC | newest]
Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-23 18:20 PCI: hotplug: Erroneous removal of hotplug PCI devices Alex_Gagniuc
2019-01-23 18:44 ` Keith Busch
2019-01-23 19:02 ` Lukas Wunner
2019-01-23 19:07 ` Keith Busch
2019-01-23 19:15 ` Lukas Wunner
2019-01-23 19:33 ` Keith Busch
2019-01-24 22:43 ` Austin.Bolen
2019-01-24 22:52 ` Austin.Bolen
[not found] ` <b32e6ca62ae2494f98450df81ca1ee14@AUSX13MPC131.AMER.DELL.COM>
2019-01-24 20:20 ` Keith Busch
2019-01-24 22:00 ` Austin.Bolen
2019-01-25 8:22 ` Lukas Wunner
2019-01-25 22:39 ` Austin.Bolen
2019-01-26 12:12 ` Lukas Wunner
2019-01-30 14:28 ` Austin.Bolen
2019-01-23 18:54 ` Lukas Wunner
2019-01-23 19:07 ` Lukas Wunner
2019-01-23 19:09 ` Keith Busch
2019-01-23 19:28 ` Lukas Wunner
2019-01-23 19:47 ` Keith Busch
2019-01-23 20:10 ` Alex_Gagniuc
2019-01-23 23:50 ` Alex_Gagniuc
2019-01-24 9:25 ` Lukas Wunner
2019-01-24 22:33 ` Austin.Bolen