linux-pci.vger.kernel.org archive mirror
* PCI: hotplug: Erroneous removal of hotplug PCI devices
@ 2019-01-23 18:20 Alex_Gagniuc
  2019-01-23 18:44 ` Keith Busch
  2019-01-23 18:54 ` Lukas Wunner
  0 siblings, 2 replies; 23+ messages in thread
From: Alex_Gagniuc @ 2019-01-23 18:20 UTC (permalink / raw)
  To: linux-pci; +Cc: bhelgaas, lukas, keith.busch, Austin.Bolen

Hi all,

This may be a mind-twisting explanation, so please bear with me.

In PCIe, the presence detect bit (PD) in the slot status register should 
be a logical OR of in-band and out-of-band presence. In-band presence is 
the data link layer status. So one would expect that a link up event 
would be accompanied by a PD changed event with PD set. Not everyone 
follows that.

I have a system here with the following order of events:
  *   0 ms : Link up
  * 400 ms : Presence detect up
On the first event, the device is probed as expected, and on the second 
event, the device is removed as a SURPRISE!!!_REMOVAL. This is a bug.

The logic is that on every change of presence detect:
/* Even if [the slot]'s occupied again, we cannot assume the card is the 
same. */
Reasonable, but the resulting behavior is a bug.

Solution 1 is to say it's a spec violation, so ignore it. They'll change 
the "logical OR" thing in the next PCIe spec, so we still will have to 
worry about this.

It's obvious that just relying on presence detect state is prone to race 
conditions. However, if a device is replaced, we'd expect the data link 
layer state to change as well. So I think the best way to proceed is to 
skip the SURPRISE!!!_REMOVAL if the following are true:
  * presence detect is set
  * DLL changed is not set
  * presence detect was not previously set
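
Something like this, as a rough and untested sketch (the helper and the 
"slot was previously empty" bookkeeping below are made up for 
illustration; they are not existing pciehp code):

/*
 * Sketch only: detect a belated "presence detect up" that follows an
 * earlier link-up, so the caller can skip the surprise removal.
 * ctrl->slot_was_empty is hypothetical bookkeeping, not a real field.
 */
static bool pdc_is_belated_presence_up(struct controller *ctrl, u32 events)
{
	u16 slot_status;

	if (!(events & PCI_EXP_SLTSTA_PDC) || (events & PCI_EXP_SLTSTA_DLLSC))
		return false;

	pcie_capability_read_word(ctrl_dev(ctrl), PCI_EXP_SLTSTA, &slot_status);

	/* presence detect is set now, but was not set before */
	return (slot_status & PCI_EXP_SLTSTA_PDS) && ctrl->slot_was_empty;
}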

Thoughts?

Alex

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: PCI: hotplug: Erroneous removal of hotplug PCI devices
  2019-01-23 18:20 PCI: hotplug: Erroneous removal of hotplug PCI devices Alex_Gagniuc
@ 2019-01-23 18:44 ` Keith Busch
  2019-01-23 19:02   ` Lukas Wunner
       [not found]   ` <b32e6ca62ae2494f98450df81ca1ee14@AUSX13MPC131.AMER.DELL.COM>
  2019-01-23 18:54 ` Lukas Wunner
  1 sibling, 2 replies; 23+ messages in thread
From: Keith Busch @ 2019-01-23 18:44 UTC (permalink / raw)
  To: Alex_Gagniuc; +Cc: linux-pci, bhelgaas, lukas, Austin.Bolen

On Wed, Jan 23, 2019 at 06:20:57PM +0000, Alex_Gagniuc@Dellteam.com wrote:
> Hi all,
> 
> This may be a mind-twisting explanation, so please bear with me.
> 
> In PCIe, the presence detect bit (PD) in the slot status register should 
> be a logical OR of in-band and out-of-band presence. In-band presence is 
> the data link layer status. So one would expect that a link up event 
> would be accompanied by a PD changed event with PD set. Not everyone 
> follows that.
>
> I have a system here with the following order of events:
>   *   0 ms : Link up
>   * 400 ms : Presence detect up
> On the first event, the device is probed as expected, and on the second 
> event, the device is removed as a SURPRISE!!!_REMOVAL. This is a bug.
> 
> The logic is that on every change of presence detect:
> /* Even if [the slot]'s occupied again, we cannot assume the card is the 
> same. */
> Reasonable, but the resulting behavior is a bug.
> 
> Solution 1 is to say it's a spec violation, so ignore it. They'll change 
> the "logical OR" thing in the next PCIe spec, so we still will have to 
> worry about this.

When's that changing? 5.0 is the next spec, and it still says:

  Presence Detect State - This bit indicates the presence of an adapter
  in the slot, reflected by the logical “OR” of the Physical Layer
  in-band presence detect mechanism and, if present, any out-of-band
  presence detect mechanism defined for the slot’s corresponding
  form factor.

> It's obvious that just relying on presence detect state is prone to race 
> conditions. However, if a device is replaced, we'd expect the data link 
> layer state to change as well. So I think the best way to proceed is to 
> skip the SURPRISE!!!_REMOVAL if the following are true:
>   * presence detect is set
>   * DLL changed is not set
>   * presence detect was not previously set
> 
> Thoughts?

What is the value of PDS on the Link up event? If it's still "Slot
Empty", could we just ignore the Link event instead and wait for the PDC
event?
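
Just to make the question concrete, roughly (a sketch only, not a real
patch):

	/* sketch: in pciehp_ist(), before acting on a bare DLLSC event */
	if ((events & PCI_EXP_SLTSTA_DLLSC) && !(events & PCI_EXP_SLTSTA_PDC)) {
		u16 slot_status;

		pcie_capability_read_word(ctrl_dev(ctrl), PCI_EXP_SLTSTA,
					  &slot_status);
		/* slot still reads empty: wait for the PDC event instead */
		if (!(slot_status & PCI_EXP_SLTSTA_PDS))
			return IRQ_HANDLED;
	}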

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: PCI: hotplug: Erroneous removal of hotplug PCI devices
  2019-01-23 18:20 PCI: hotplug: Erroneous removal of hotplug PCI devices Alex_Gagniuc
  2019-01-23 18:44 ` Keith Busch
@ 2019-01-23 18:54 ` Lukas Wunner
  2019-01-23 19:07   ` Lukas Wunner
  2019-01-24 22:33   ` Austin.Bolen
  1 sibling, 2 replies; 23+ messages in thread
From: Lukas Wunner @ 2019-01-23 18:54 UTC (permalink / raw)
  To: Alex_Gagniuc; +Cc: linux-pci, bhelgaas, keith.busch, Austin.Bolen

On Wed, Jan 23, 2019 at 06:20:57PM +0000, Alex_Gagniuc@Dellteam.com wrote:
> In PCIe, the presence detect bit (PD) in the slot status register should 
> be a logical OR of in-band and out-of-band presence. In-band presence is 
> the data link layer status. So one would expect that a link up event 
> would be accompanied by a PD changed event with PD set. Not everyone 
> follows that.
> 
> I have a system here with the following order of events:
>   *   0 ms : Link up
>   * 400 ms : Presence detect up
> On the first event, the device is probed as expected, and on the second 
> event, the device is removed as a SURPRISE!!!_REMOVAL. This is a bug.

It's normal that there's a lag between presence and link changes.

There's even hardware which flaps the link and presence bits a
couple of times before they settle.  Since commit 6c35a1ac3da63a7
("PCI: pciehp: Tolerate initially unstable link") we're quite lenient
towards such devices and tolerate link and presence flaps for up to
100 ms.  That's basically also the maximum delay we allow between link up
and presence detect up.

Theoretically you could say:  Let's wait after the link goes up
until presence also goes up.  But guess what, some vendors managed
to hardwire the presence detect flag to zero, so you'd wait forever
on those devices.  We have a workaround specifically for such hardware
since commit 80696f991424 ("PCI: pciehp: Tolerate Presence Detect
hardwired to zero").


> It's obvious that just relying on presence detect state is prone to race 
> conditions. However, if a device is replaced, we'd expect the data link 
> layer state to change as well. So I think the best way to proceed is to 
> skip the SURPRISE!!!_REMOVAL if the following are true:
>   * presence detect is set
>   * DLL changed is not set
>   * presence detect was not previously set

Unfortunately it's not that simple.  All we know is that presence has
changed and is currently up and the link is also up.  This could mean
that the device was swapped and it may just take a little longer until
the link flag also flaps.

A possible solution for your hardware would be to amend
pciehp_check_link_status() such that after the call to pci_bus_check_dev()
we wait until presence also goes up.  However it would have to be quirked
to your particular hardware, otherwise it would break devices which
hardwire presence detect to zero.  Or we'd have to define a limit of,
say, 1 second and if presence hasn't gone up within that period,
we'd just silently carry on.  This wouldn't break hardware which hardwires
presence detect to zero, but introduce a 1 sec delay for that hardware
to bring up the slot.
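
Roughly like this (untested sketch; the helper name and the 1 second
cap are arbitrary):

	/* sketch: call from pciehp_check_link_status() after pci_bus_check_dev() */
	static void pciehp_wait_for_presence(struct controller *ctrl)
	{
		int timeout = 1000;	/* ms, arbitrary upper bound */
		u16 slot_status;

		do {
			pcie_capability_read_word(ctrl_dev(ctrl), PCI_EXP_SLTSTA,
						  &slot_status);
			if (slot_status & PCI_EXP_SLTSTA_PDS)
				return;
			msleep(10);
			timeout -= 10;
		} while (timeout > 0);

		/* silently carry on: some slots hardwire Presence Detect to zero */
	}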

So I don't see a perfect solution.  What device are we talking about
anyway?  400 ms is a *long* time.

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: PCI: hotplug: Erroneous removal of hotplug PCI devices
  2019-01-23 18:44 ` Keith Busch
@ 2019-01-23 19:02   ` Lukas Wunner
  2019-01-23 19:07     ` Keith Busch
  2019-01-24 22:43     ` Austin.Bolen
       [not found]   ` <b32e6ca62ae2494f98450df81ca1ee14@AUSX13MPC131.AMER.DELL.COM>
  1 sibling, 2 replies; 23+ messages in thread
From: Lukas Wunner @ 2019-01-23 19:02 UTC (permalink / raw)
  To: Keith Busch; +Cc: Alex_Gagniuc, linux-pci, bhelgaas, Austin.Bolen

On Wed, Jan 23, 2019 at 11:44:53AM -0700, Keith Busch wrote:
> On Wed, Jan 23, 2019 at 06:20:57PM +0000, Alex_Gagniuc@Dellteam.com wrote:
> > It's obvious that just relying on presence detect state is prone to race 
> > conditions. However, if a device is replaced, we'd expect the data link 
> > layer state to change as well. So I think the best way to proceed is to 
> > skip the SURPRISE!!!_REMOVAL if the following are true:
> >   * presence detect is set
> >   * DLL changed is not set
> >   * presence detect was not previously set
> > 
> > Thoughts?
> 
> What is the value of PDS on the Link up event? If it's still "Slot
> Empty", could we just ignore the Link event instead and wait for the PDC
> event?

Well, usually it's desirable to bring up the slot as quickly as possible,
so once we get any kind of link or presence event, we immediately try to
bring up the slot.

We do allow a 20 + 100 ms delay in pcie_wait_for_link() between link up
and presence detect up, just not 400 ms.

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: PCI: hotplug: Erroneous removal of hotplug PCI devices
  2019-01-23 18:54 ` Lukas Wunner
@ 2019-01-23 19:07   ` Lukas Wunner
  2019-01-23 19:09     ` Keith Busch
  2019-01-23 23:50     ` Alex_Gagniuc
  2019-01-24 22:33   ` Austin.Bolen
  1 sibling, 2 replies; 23+ messages in thread
From: Lukas Wunner @ 2019-01-23 19:07 UTC (permalink / raw)
  To: Alex_Gagniuc; +Cc: linux-pci, bhelgaas, keith.busch, Austin.Bolen

On Wed, Jan 23, 2019 at 07:54:20PM +0100, Lukas Wunner wrote:
> So I don't see a perfect solution.  What device are we talking about
> anyway?  400 ms is a *long* time.

Also, how exactly does this issue manifest itself:  Is it just an
annoyance that the slot is brought up/down/up or does it not work
at all?

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: PCI: hotplug: Erroneous removal of hotplug PCI devices
  2019-01-23 19:02   ` Lukas Wunner
@ 2019-01-23 19:07     ` Keith Busch
  2019-01-23 19:15       ` Lukas Wunner
  2019-01-24 22:43     ` Austin.Bolen
  1 sibling, 1 reply; 23+ messages in thread
From: Keith Busch @ 2019-01-23 19:07 UTC (permalink / raw)
  To: Lukas Wunner; +Cc: Alex_Gagniuc, linux-pci, bhelgaas, Austin.Bolen

On Wed, Jan 23, 2019 at 08:02:04PM +0100, Lukas Wunner wrote:
> On Wed, Jan 23, 2019 at 11:44:53AM -0700, Keith Busch wrote:
> > On Wed, Jan 23, 2019 at 06:20:57PM +0000, Alex_Gagniuc@Dellteam.com wrote:
> > > It's obvious that just relying on presence detect state is prone to race 
> > > conditions. However, if a device is replaced, we'd expect the data link 
> > > layer state to change as well. So I think the best way to proceed is to 
> > > skip the SURPRISE!!!_REMOVAL if the following are true:
> > >   * presence detect is set
> > >   * DLL changed is not set
> > >   * presence detect was not previously set
> > > 
> > > Thoughts?
> > 
> > What is the value of PDS on the Link up event? If it's still "Slot
> > Empty", could we just ignore the Link event instead and wait for the PDC
> > event?
> 
> Well, usually it's desirable to bring up the slot as quickly as possible,
> so once we get any kind of link or presence event, we immediately try to
> bring up the slot.
> 
> We do allow a 20 + 100 ms delay in pcie_wait_for_link() between link up
> and presence detect up, just not 400 ms.

Right, so in Alex's case, it looks like we are observing
pcie_wait_for_link() returning true before the PDC event.

I'm wondering about PDS because if the link is up but Presence reports an
empty slot, does that matter for any implementations? Or is it perfectly
fine to enumerate an active link on an empty slot? An empty slot and
active link doesn't make a lot of sense, but that observation appears to
be what is reported here.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: PCI: hotplug: Erroneous removal of hotplug PCI devices
  2019-01-23 19:07   ` Lukas Wunner
@ 2019-01-23 19:09     ` Keith Busch
  2019-01-23 19:28       ` Lukas Wunner
  2019-01-23 23:50     ` Alex_Gagniuc
  1 sibling, 1 reply; 23+ messages in thread
From: Keith Busch @ 2019-01-23 19:09 UTC (permalink / raw)
  To: Lukas Wunner; +Cc: Alex_Gagniuc, linux-pci, bhelgaas, Austin.Bolen

On Wed, Jan 23, 2019 at 08:07:23PM +0100, Lukas Wunner wrote:
> On Wed, Jan 23, 2019 at 07:54:20PM +0100, Lukas Wunner wrote:
> > So I don't see a perfect solution.  What device are we talking about
> > anyway?  400 ms is a *long* time.
> 
> Also, how exactly does this issue manifest itself:  Is it just an
> annoyance that the slot is brought up/down/up or does it not work
> at all?

Yeah, there is an nvme driver bug that hits a deadlock if you get a
very quick add-remove sequence. The nvme remove tries to delete IO
resources before the async probe side has set them up, so the driver
doesn't actually see that they're invalid. I have a proposed fix, but am
waiting to hear if it is successful.

bz: https://bugzilla.kernel.org/show_bug.cgi?id=202081

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: PCI: hotplug: Erroneous removal of hotplug PCI devices
  2019-01-23 19:07     ` Keith Busch
@ 2019-01-23 19:15       ` Lukas Wunner
  2019-01-23 19:33         ` Keith Busch
  0 siblings, 1 reply; 23+ messages in thread
From: Lukas Wunner @ 2019-01-23 19:15 UTC (permalink / raw)
  To: Keith Busch; +Cc: Alex_Gagniuc, linux-pci, bhelgaas, Austin.Bolen

On Wed, Jan 23, 2019 at 12:07:38PM -0700, Keith Busch wrote:
> Right, so in Alex's case, it looks like we are observing
> pcie_wait_for_link() returning true before the PDC event.
> 
> I'm wondering about PDS because if the link is up but Presence reports an
> empty slot, does that matter for any implementations? Or is it perfectly
> fine to enumerate an active link on an empty slot? An empty slot and
> active link doesn't make a lot of sense, but that observation appears to
> be what is reported here.

We allow enumeration with an active link of an allegedly empty slot
because some hardware hardwires PDS to zero, see commit 80696f991424
("PCI: pciehp: Tolerate Presence Detect hardwired to zero").

The brokenness of PCIe hotplug implementations found in the wild is
astonishing.

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: PCI: hotplug: Erroneous removal of hotplug PCI devices
  2019-01-23 19:09     ` Keith Busch
@ 2019-01-23 19:28       ` Lukas Wunner
  2019-01-23 19:47         ` Keith Busch
  0 siblings, 1 reply; 23+ messages in thread
From: Lukas Wunner @ 2019-01-23 19:28 UTC (permalink / raw)
  To: Keith Busch; +Cc: Alex_Gagniuc, linux-pci, bhelgaas, Austin.Bolen

On Wed, Jan 23, 2019 at 12:09:46PM -0700, Keith Busch wrote:
> On Wed, Jan 23, 2019 at 08:07:23PM +0100, Lukas Wunner wrote:
> > On Wed, Jan 23, 2019 at 07:54:20PM +0100, Lukas Wunner wrote:
> > > So I don't see a perfect solution.  What device are we talking about
> > > anyway?  400 ms is a *long* time.
> > 
> > Also, how exactly does this issue manifest itself:  Is it just an
> > annoyance that the slot is brought up/down/up or does it not work
> > at all?
> 
> Yeah, there is an nvme driver bug that hits a deadlock if you get a
> very quick add-remove sequence. The nvme remove tries to delete IO
> resources before the async probe side has set them up, so the driver
> doesn't actually see that they're invalid. I have a proposed fix, but am
> waiting to hear if it is successful.
> 
> bz: https://bugzilla.kernel.org/show_bug.cgi?id=202081

Hm, there's no full dmesg output attached, so it's not possible to
tell what the topology looks like and what the vendor/device ID of
0000:b0:04.0 is.

Also, there's only a card present / link up sequence visible in the
abridged dmesg output which has a 4 usec delay, but no link up / card
present sequence with a 400 msec delay?

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: PCI: hotplug: Erroneous removal of hotplug PCI devices
  2019-01-23 19:15       ` Lukas Wunner
@ 2019-01-23 19:33         ` Keith Busch
  0 siblings, 0 replies; 23+ messages in thread
From: Keith Busch @ 2019-01-23 19:33 UTC (permalink / raw)
  To: Lukas Wunner; +Cc: Alex_Gagniuc, linux-pci, bhelgaas, Austin.Bolen

On Wed, Jan 23, 2019 at 08:15:27PM +0100, Lukas Wunner wrote:
> On Wed, Jan 23, 2019 at 12:07:38PM -0700, Keith Busch wrote:
> > Right, so in Alex's case, it looks like we are observing
> > pcie_wait_for_link() returning true before the PDC event.
> > 
> > I'm wondering about PDS because if the link is up but Presence reports an
> > empty slot, does that matter for any implementations? Or is it perfectly
> > fine to enumerate an active link on an empty slot? An empty slot and
> > active link doesn't make a lot of sense, but that observation appears to
> > be what is reported here.
> 
> We allow enumeration with an active link of an allegedly empty slot
> because some hardware hardwires PDS to zero, see commit 80696f991424
> ("PCI: pciehp: Tolerate Presence Detect hardwired to zero").
> 
> The brokenness of PCIe hotplug implementations found in the wild is
> astonishing.

Heh, sounds like Alex's proposal may work until we find a slot
implementation that doesn't do DLLSC as expected. :)

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: PCI: hotplug: Erroneous removal of hotplug PCI devices
  2019-01-23 19:28       ` Lukas Wunner
@ 2019-01-23 19:47         ` Keith Busch
  2019-01-23 20:10           ` Alex_Gagniuc
  0 siblings, 1 reply; 23+ messages in thread
From: Keith Busch @ 2019-01-23 19:47 UTC (permalink / raw)
  To: Lukas Wunner; +Cc: Alex_Gagniuc, linux-pci, bhelgaas, Austin.Bolen

On Wed, Jan 23, 2019 at 08:28:29PM +0100, Lukas Wunner wrote:
> On Wed, Jan 23, 2019 at 12:09:46PM -0700, Keith Busch wrote:
> > On Wed, Jan 23, 2019 at 08:07:23PM +0100, Lukas Wunner wrote:
> > > On Wed, Jan 23, 2019 at 07:54:20PM +0100, Lukas Wunner wrote:
> > > > So I don't see a perfect solution.  What device are we talking about
> > > > anyway?  400 ms is a *long* time.
> > > 
> > > Also, how exactly does this issue manifest itself:  Is it just an
> > > annoyance that the slot is brought up/down/up or does it not work
> > > at all?
> > 
> > Yeah, there is an nvme driver bug that hits a deadlock if you get a
> > very quick add-remove sequence. The nvme remove tries to delete IO
> > resources before the async probe side has set them up, so the driver
> > doesn't actually see that they're invalid. I have a proposed fix, but am
> > waiting to hear if it is successful.
> > 
> > bz: https://bugzilla.kernel.org/show_bug.cgi?id=202081
> 
> Hm, there's no full dmesg output attached, so it's not possible to
> tell what the topology looks like and what the vendor/device ID of
> 0000:b0:04.0 is.
> 
> Also, there's only a card present / link up sequence visible in the
> abridged dmesg output which has a 4 usec delay, but no link up / card
> present sequence with a 400 msec delay?

Yeah, not easy to follow, and some discussion was off the bz.

Link Change:

  [  838.784541] pciehp 0000:b0:04.0:pcie204: Slot(178): Link Up

Presence Detect Change +400 msec:

  [  839.183506] pciehp 0000:b0:04.0:pcie204: Slot(178): Card not present

In between these two entries, nvme starts setting up its controller
detected on the link up. The "not present" side tries to remove the same
nvme device, but fails to invalidate the IO resources because it's racing
with probe before probe has even set them up, leaving probe unable to
complete IO a moment later because its IRQ resources were disabled.

Meanwhile, the blk-mq timeout handler can't do anything because the
device state is disconnected and believes the removal side is handling
things. What a mess...

We can fix it, just want to hear if Alex can confirm the proposal is
successful.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: PCI: hotplug: Erroneous removal of hotplug PCI devices
  2019-01-23 19:47         ` Keith Busch
@ 2019-01-23 20:10           ` Alex_Gagniuc
  0 siblings, 0 replies; 23+ messages in thread
From: Alex_Gagniuc @ 2019-01-23 20:10 UTC (permalink / raw)
  To: keith.busch, lukas; +Cc: linux-pci, bhelgaas, Austin.Bolen

On 1/23/19 1:53 PM, Keith Busch wrote:
> On Wed, Jan 23, 2019 at 08:28:29PM +0100, Lukas Wunner wrote:
>> On Wed, Jan 23, 2019 at 12:09:46PM -0700, Keith Busch wrote:
>>> On Wed, Jan 23, 2019 at 08:07:23PM +0100, Lukas Wunner wrote:
>>>> On Wed, Jan 23, 2019 at 07:54:20PM +0100, Lukas Wunner wrote:
>>>>> So I don't see a perfect solution.  What device are we talking about
>>>>> anyway?  400 ms is a *long* time.
>>>>
>>>> Also, how exactly does this issue manifest itself:  Is it just an
>>>> annoyance that the slot is brought up/down/up or does it not work
>>>> at all?
>>>
>>> Yeah, there is an nvme driver bug that hits a deadlock if you get a
>>> very quick add-remove sequence. The nvme remove tries to delete IO
>>> resources before the async probe side has set them up, so the driver
>>> doesn't actually see that they're invalid. I have a proposed fix, but am
>>> waiting to hear if it is successful.
>>>
>>> bz: https://bugzilla.kernel.org/show_bug.cgi?id=202081
>>
>> Hm, there's no full dmesg output attached, so it's not possible to
>> tell what the topology looks like and what the vendor/device ID of
>> 0000:b0:04.0 is.
>>
>> Also, there's only a card present / link up sequence visible in the
>> abridged dmesg output which has a 4 usec delay, but no link up / card
>> present sequence with a 400 msec delay?
> 
> Yeah, not easy to follow, and some discussion was off the bz.
> 
> Link Change:
> 
>    [  838.784541] pciehp 0000:b0:04.0:pcie204: Slot(178): Link Up
> 
> Presence Detect Change +400 msec:
> 
>    [  839.183506] pciehp 0000:b0:04.0:pcie204: Slot(178): Card not present
> 
> In between these two entries, nvme starts setting up its controller
> detected on the link up. The "not present" side tries to remove the same
> nvme device, but fails to invalidate the IO resources because it's racing
> with probe before probe has even set them up, leaving probe unable to
> complete IO a moment later because its IRQ resources were disabled.
> 
> Meanwhile, the blk-mq timeout handler can't do anything because the
> device state is disconnected and believes the removal side is handling
> things. What a mess...
> 
> We can fix it, just want to hear if Alex can confirm the proposal is
> successful.

OOPS! Totally missed there was a patch on bz. Will update bz once 
testing is done.

Alex


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: PCI: hotplug: Erroneous removal of hotplug PCI devices
  2019-01-23 19:07   ` Lukas Wunner
  2019-01-23 19:09     ` Keith Busch
@ 2019-01-23 23:50     ` Alex_Gagniuc
  2019-01-24  9:25       ` Lukas Wunner
  1 sibling, 1 reply; 23+ messages in thread
From: Alex_Gagniuc @ 2019-01-23 23:50 UTC (permalink / raw)
  To: lukas; +Cc: linux-pci, bhelgaas, keith.busch, Austin.Bolen

On 1/23/19 1:07 PM, Lukas Wunner wrote:
> On Wed, Jan 23, 2019 at 07:54:20PM +0100, Lukas Wunner wrote:
>> So I don't see a perfect solution.  What device are we talking about
>> anyway?  400 ms is a *long* time.
> 
> Also, how exactly does this issue manifest itself:  Is it just an
> annoyance that the slot is brought up/down/up or does it not work
> at all?

nvme issue aside, the device is removed. Then, a few seconds later, we 
see Link Up and Card Present events, but no pciehp_isr() invocation. I 
think it all happens in one pciehp_handle_presence_or_link_change(), so 
pciehp_ist() is held up for a few seconds -- but that's a matter for a 
different thread.

Alex.


[  785.007640] nvme 0000:b1:00.0: enabling device (0000 -> 0002)
[  785.099567] pcieport 0000:b0:04.0: ZOPA: Status(0x00000040): 
presence-detected
[  785.099574] pcieport 0000:b0:04.0: ZOPA: Events(0x00000008): 
presence-detect-changed
[  785.099580] pcieport 0000:b0:04.0: ZOPA: Queued up(0x00000008): 
presence-detect-changed
[  785.099582] pcieport 0000:b0:04.0: ZOPA: pciehp_isr: exiting
[  785.099611] pcieport 0000:b0:04.0: ZOPA: pciehp_ist: Entered with 
events(0x00000008): presence-detect-changed
[  785.099615] pciehp 0000:b0:04.0:pcie204: ZOPA: 
pciehp_handle_presence_or_link_change: locked &ctrl->state_lock
[  785.099618] pciehp 0000:b0:04.0:pcie204: ZOPA: 
pciehp_handle_presence_or_link_change unlocked (POWEROFF)
[  785.099621] pciehp 0000:b0:04.0:pcie204: Slot(178): Card not present
[  788.756953] nvme nvme4: failed to mark controller CONNECTING
[  788.756958] nvme nvme4: Removing after probe failure status: 0
[  788.763317] pciehp 0000:b0:04.0:pcie204: ZOPA: 
pciehp_handle_presence_or_link_change unlocked (POWERON)
[  788.763322] pciehp 0000:b0:04.0:pcie204: Slot(178): Card present
[  788.763349] pciehp 0000:b0:04.0:pcie204: Slot(178): Link Up
[  788.892432] pci 0000:b1:00.0: [8086:0a55] type 00 class 0x010802
[  788.892474] pci 0000:b1:00.0: reg 0x10: [mem 0xe1500000-0xe1503fff 64bit]


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: PCI: hotplug: Erroneous removal of hotplug PCI devices
  2019-01-23 23:50     ` Alex_Gagniuc
@ 2019-01-24  9:25       ` Lukas Wunner
  0 siblings, 0 replies; 23+ messages in thread
From: Lukas Wunner @ 2019-01-24  9:25 UTC (permalink / raw)
  To: Alex_Gagniuc; +Cc: linux-pci, bhelgaas, keith.busch, Austin.Bolen

On Wed, Jan 23, 2019 at 11:50:30PM +0000, Alex_Gagniuc@Dellteam.com wrote:
> nvme issue aside, the device is removed. Then, a few seconds later, we 
> see Link Up and Card Present events, but no pciehp_isr() invocation. I 
> think it all happens in one pciehp_handle_presence_or_link_change(), so 
> pciehp_ist() is held up for a few seconds -- but that's a matter for a 
> different thread.

Yes, that's normal and expected.  The rationale is that we only know that
link or presence has changed.  It could even have changed multiple times.
Even if the link is now back up, we don't know if it's still the same card.
So the slot is brought down and if it's currently occupied or the link is
up, the slot is brought up again.
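
Condensed, the flow in pciehp_handle_presence_or_link_change() is roughly
this (abridged paraphrase, not verbatim kernel code):

	/* slot was up and PDC and/or DLLSC fired: bring it down */
	if (ctrl->state == ON_STATE &&
	    (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC)))
		pciehp_disable_slot(ctrl, SURPRISE_REMOVAL);

	/* if the slot is now occupied or the link is up, bring it up again */
	if (pciehp_card_present(ctrl) || pciehp_check_link_active(ctrl))
		pciehp_enable_slot(ctrl);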

It would of course be possible to come up with more complicated schemes to
try to handle odd behavior such as a belated presence change after a link
change, but then it also becomes more complicated to reason about the
scheme's correctness and ensure that other types of odd behavior (such as
presence hardwired to zero) are still handled correctly.

Thanks,

Lukas

> [  785.007640] nvme 0000:b1:00.0: enabling device (0000 -> 0002)
> [  785.099567] pcieport 0000:b0:04.0: ZOPA: Status(0x00000040): 
> presence-detected
> [  785.099574] pcieport 0000:b0:04.0: ZOPA: Events(0x00000008): 
> presence-detect-changed
> [  785.099580] pcieport 0000:b0:04.0: ZOPA: Queued up(0x00000008): 
> presence-detect-changed
> [  785.099582] pcieport 0000:b0:04.0: ZOPA: pciehp_isr: exiting
> [  785.099611] pcieport 0000:b0:04.0: ZOPA: pciehp_ist: Entered with 
> events(0x00000008): presence-detect-changed
> [  785.099615] pciehp 0000:b0:04.0:pcie204: ZOPA: 
> pciehp_handle_presence_or_link_change: locked &ctrl->state_lock
> [  785.099618] pciehp 0000:b0:04.0:pcie204: ZOPA: 
> pciehp_handle_presence_or_link_change unlocked (POWEROFF)
> [  785.099621] pciehp 0000:b0:04.0:pcie204: Slot(178): Card not present
> [  788.756953] nvme nvme4: failed to mark controller CONNECTING
> [  788.756958] nvme nvme4: Removing after probe failure status: 0
> [  788.763317] pciehp 0000:b0:04.0:pcie204: ZOPA: 
> pciehp_handle_presence_or_link_change unlocked (POWERON)
> [  788.763322] pciehp 0000:b0:04.0:pcie204: Slot(178): Card present
> [  788.763349] pciehp 0000:b0:04.0:pcie204: Slot(178): Link Up
> [  788.892432] pci 0000:b1:00.0: [8086:0a55] type 00 class 0x010802
> [  788.892474] pci 0000:b1:00.0: reg 0x10: [mem 0xe1500000-0xe1503fff 64bit]
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: PCI: hotplug: Erroneous removal of hotplug PCI devices
       [not found]   ` <b32e6ca62ae2494f98450df81ca1ee14@AUSX13MPC131.AMER.DELL.COM>
@ 2019-01-24 20:20     ` Keith Busch
  2019-01-24 22:00       ` Austin.Bolen
  0 siblings, 1 reply; 23+ messages in thread
From: Keith Busch @ 2019-01-24 20:20 UTC (permalink / raw)
  To: Austin.Bolen; +Cc: Alex_Gagniuc, linux-pci, bhelgaas, lukas

[in plain text for the mailing lists]

On Thu, Jan 24, 2019 at 08:10:47PM +0000, Austin.Bolen@dell.com wrote:
> On 1/23/2019 12:46 PM, Keith Busch wrote:
> 
> > On Wed, Jan 23, 2019 at 06:20:57PM +0000, Alex_Gagniuc@Dellteam.com wrote:
> 
> <snip>
> 
> > > Solution 1 is to say it's a spec violation, so ignore it. They'll change
> > > the "logical OR" thing in the next PCIe spec, so we still will have to
> > > worry about this.
> 
> > When's that changing? 5.0 is the next spec, and it still says:
> > 
> >   Presence Detect State - This bit indicates the presence of an adapter
> >   in the slot, reflected by the logical "OR" of the Physical Layer
> >   in-band presence detect mechanism and, if present, any out-of-band
> >   presence detect mechanism defined for the slot's corresponding
> >   form factor.
> 
> The ECN that changes this behavior (Async Hot-Plug Updates ECN) is
> ratified and published to the PCI-SIG website.  It modifies PCIe Base
> Spec 4.0 and will be incorporated into PCIe Base Spec 5.0.  Check out
> the ECN section titled "In-band Presence Detect Mechanism Deprecated
> for Async Hot-Plug".

My copy of the 5.0-0.9 spec missed out on this ECN. The new changes look
like they more clearly define expected software and hardware behavior,
so I'm looking forward to seeing the hardware implementations. Thanks
for driving that.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: PCI: hotplug: Erroneous removal of hotplug PCI devices
  2019-01-24 20:20     ` Keith Busch
@ 2019-01-24 22:00       ` Austin.Bolen
  2019-01-25  8:22         ` Lukas Wunner
  0 siblings, 1 reply; 23+ messages in thread
From: Austin.Bolen @ 2019-01-24 22:00 UTC (permalink / raw)
  To: keith.busch, Austin.Bolen; +Cc: Alex_Gagniuc, linux-pci, bhelgaas, lukas

On 1/24/2019 2:21 PM, Keith Busch wrote:
> 

<snip>


> My copy of the 5.0-0.9 spec missed out on this ECN. The new changes look
> like they more clearly define expected software and hardware behavior,
> so I'm looking forward to seeing the hardware implementations. Thanks
> for driving that.
> 

You're welcome!  And yes, this ECN is not yet in 5.0 draft but should be 
there soon.  Much of what is in the ECN can be implemented with existing 
DPC hardware.  There are associated ECNs in ACPI and PCI Firmware 
Specifications as well.  The ACPI spec changes will not be public until 
the 6.3 spec is released (soon).  The PCI Firmware Specification changes 
should be ratified in a few weeks as well.  In the meantime, UEFI org 
(which ACPI is a part of) has a git repo accessible to UEFI forum 
members where the Linux code to support these new async hot-plug 
mechanisms is being developed.  Will be made public once the specs are 
public.

Thanks,
Austin

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: PCI: hotplug: Erroneous removal of hotplug PCI devices
  2019-01-23 18:54 ` Lukas Wunner
  2019-01-23 19:07   ` Lukas Wunner
@ 2019-01-24 22:33   ` Austin.Bolen
  1 sibling, 0 replies; 23+ messages in thread
From: Austin.Bolen @ 2019-01-24 22:33 UTC (permalink / raw)
  To: lukas, Alex_Gagniuc; +Cc: linux-pci, bhelgaas, keith.busch, Austin.Bolen

On 1/23/2019 12:54 PM, Lukas Wunner wrote:
> 

<snip>

> 
> So I don't see a perfect solution.  What device are we talking about
> anyway?  400 ms is a *long* time.

Some people have slow hands :-).  This is a U.2 device.  There are 2 
pins (Pin 4 = IfDet#/Pin 10 = PRSNT#) that determine presence and each 
pin is a different length (On insert, IfDet# mates first, PRSNT# mates 
second, PCIe data lanes mate 3rd).

SAS drive:  IfDet# = 0, PRSNT# = 0
NVMe drive: IfDet# = 0, PRSNT# = 1.

These 2 pins determine if it is a SAS or NVMe drive.  We have to wait 
until both pins mate in order to know for sure it's an NVMe drive and 
therefore generate PRSNT# to DSP above the drive.  But there is no way 
to know when both pins have mated due to the way the U.2 connector 
signals were defined so we have to dead wait.

The time delta between these 2 pins mating on insert can vary 
substantially with different users.  For some users we measured a delta 
of > 100 ms (others were less than 10 ms), and that was for a "normal" 
insert.  It takes much longer if the user intentionally tries to insert 
slowly.

We chose 500 ms as the delay after P4 (IfDet#) mates before sampling the 
P10 (PRSNT#).  About 100 ms of that is chewed up by the time delta 
between the presence pins mating and the PCIe data lanes mating, plus 
the time to train the link, leaving around 400 ms between the DLLSC 
interrupt and the PDSC interrupt.

In the future, we will hide the U.2 device from the OS for 1 second after 
insert to allow the device to be fully inserted (in most cases) and reach 
steady state.

New form factors will (hopefully) have presence pins that are 
last-to-mate so platforms/OSes do not have to rely on such guesswork to 
know when devices are fully inserted.

> 
> Thanks,
> 
> Lukas
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: PCI: hotplug: Erroneous removal of hotplug PCI devices
  2019-01-23 19:02   ` Lukas Wunner
  2019-01-23 19:07     ` Keith Busch
@ 2019-01-24 22:43     ` Austin.Bolen
  2019-01-24 22:52       ` Austin.Bolen
  1 sibling, 1 reply; 23+ messages in thread
From: Austin.Bolen @ 2019-01-24 22:43 UTC (permalink / raw)
  To: lukas, keith.busch; +Cc: Alex_Gagniuc, linux-pci, bhelgaas, Austin.Bolen

On 1/23/2019 1:02 PM, Lukas Wunner wrote:
> 

<snip>

> 
> Well, usually it's desirable to bring up the slot as quickly as possible,
> so once we get any kind of link or presence event, we immediately try to
> bring up the slot.
> 
> We do allow a 20 + 100 ms delay in pcie_wait_for_link() between link up
> and presence detect up, just not 400 ms.

Is the hot-inserted device's config space accessed immediately after 
waiting this 20 + 100 ms delay?  Per PCIe spec, in Gen 3 mode, software 
should wait at least (1s - CTO value) after Data Link Layer Link Active 
before accessing config space. The exception is if OS enables CRS 
Software Visibility in which case config space can be accessed 100 ms 
after Data Link Layer Link Active.  Is CRS Software Visibility being used?

Cheers,
Austin

> 
> Thanks,
> 
> Lukas
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: PCI: hotplug: Erroneous removal of hotplug PCI devices
  2019-01-24 22:43     ` Austin.Bolen
@ 2019-01-24 22:52       ` Austin.Bolen
  0 siblings, 0 replies; 23+ messages in thread
From: Austin.Bolen @ 2019-01-24 22:52 UTC (permalink / raw)
  To: Austin.Bolen, lukas, keith.busch; +Cc: Alex_Gagniuc, linux-pci, bhelgaas

On 1/24/2019 4:43 PM, Bolen, Austin wrote:
> On 1/23/2019 1:02 PM, Lukas Wunner wrote:
>>
> 
> <snip>
> 
>>
>> Well, usually it's desirable to bring up the slot as quickly as possible,
>> so once we get any kind of link or presence event, we immediately try to
>> bring up the slot.
>>
>> We do allow a 20 + 100 ms delay in pcie_wait_for_link() between link up
>> and presence detect up, just not 400 ms.
> 
> Is the hot-inserted device's config space accessed immediately after
> waiting this 20 + 100 ms delay?  Per PCIe spec, in Gen 3 mode, software
> should wait at least (1s - CTO value) after Data Link Layer Link Active
> before accessing config space. The exception is if OS enables CRS
> Software Visibility in which case config space can be accessed 100 ms
> after Data Link Layer Link Active.  Is CRS Software Visibility being used?
> 

I forgot... Readiness Notifications can also let software bypass these 
first access times but I've not seen anybody implement RN so assume it 
is not being used here.

> Cheers,
> Austin
> 
>>
>> Thanks,
>>
>> Lukas
>>
> 
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: PCI: hotplug: Erroneous removal of hotplug PCI devices
  2019-01-24 22:00       ` Austin.Bolen
@ 2019-01-25  8:22         ` Lukas Wunner
  2019-01-25 22:39           ` Austin.Bolen
  0 siblings, 1 reply; 23+ messages in thread
From: Lukas Wunner @ 2019-01-25  8:22 UTC (permalink / raw)
  To: Austin.Bolen; +Cc: keith.busch, Alex_Gagniuc, linux-pci, bhelgaas

On Thu, Jan 24, 2019 at 10:00:49PM +0000, Austin.Bolen@dell.com wrote:
> You're welcome!  And yes, this ECN is not yet in 5.0 draft but should be 
> there soon.  Much of what is in the ECN can be implemented with existing 
> DPC hardware.  There are associated ECNs in ACPI and PCI Firmware 
> Specifications as well.  The ACPI spec changes will not be public until 
> the 6.3 spec is released (soon).  The PCI Firmware Specification changes 
> should be ratified in a few weeks as well.  In the meantime, UEFI org 
> (which ACPI is a part of) has a git repo accessible to UEFI forum 
> members where the Linux code to support these new async hot-plug 
> mechanisms is being developed.  Will be made public once the specs are 
> public.

Unfortunately I don't have access to those ECNs or the non-public
git repo.


> Some people have slow hands :-).  This is a U.2 device.  There are 2 
> pins (Pin 4 = IfDet#/Pin 10 = PRSNT#) that determine presence and each 
> pin is a different length (On insert, IfDet# mates first, PRSNT# mates 
> second, PCIe data lanes mate 3rd).

But I've found a public spec for that:
http://www.ssdformfactor.org/docs/SSD_Form_Factor_Version1_00.pdf


> These 2 pins determine if it is a SAS or NVMe drive.  We have to wait 
> until both pins mate in order to know for sure it's an NVMe drive and 
> therefore generate PRSNT# to DSP above the drive.  But there is no way 
> to know when both pins have mated due to the way the U.2 connector 
> signals were defined so we have to dead wait.

What I don't get is, a lot of the pins are used either for PCIe or SAS:
If the host controller doesn't know yet whether it's PCIe or SAS, why does
it already start driving them in PCIe mode and bring up the link?

Also, those PCIe pins mate 3rd, so by the time the host controller starts
driving those pins in PCIe mode, IfDet# and PRSNT# pins are known to be
connected (because they mate earlier), so the host controller can stop
waiting and start sampling the two pins?  Or is it mechanically possible
to insert the drive in such a way that the pins do not mate before the
PCIe pins, contrary to the spec?

Link up means in-band presence, so I assume this host controller already
adheres to the ECN discussed earlier?  What is the rationale of the ECN
given that it causes these issues?  Is this product being shipped or
can it still be changed?


> > We do allow a 20 + 100 ms delay in pcie_wait_for_link() between link up
> > and presence detect up, just not 400 ms.
> 
> Is the hot-inserted device's config space accessed immediately after 
> waiting this 20 + 100 ms delay?

Yes.

You can follow the code via this link, click your way down the call stack:
https://elixir.bootlin.com/linux/v5.0-rc3/source/drivers/pci/hotplug/pciehp_hpc.c#L602

pciehp_ist()
  pciehp_handle_presence_or_link_change() /* called on PDC or DLLSC event */
    pciehp_enable_slot() /* called if PDS or DLLLA is 1 */
      __pciehp_enable_slot()
        board_added()
	  pciehp_check_link_status()
	    pcie_wait_for_link()
	      /*
	       * wait 20 ms per PCIe r4.0 sec 6.6.1,
	       * then wait for up to 1 s until DLLLA is 1,
	       * then wait 100 ms per PCIe r4.0 sec 6.7.3.3
	       */
	    pci_bus_check_dev()
	      /*
	       * first access to hotplugged device:
	       * wait for up to 1 s until vendor ID is accessible
	       */
	    /*
	     * any additionl PDC or DLLSC event that occurred up to this point
	     * is ignored because the link is up and config space accessible;
	     * read Link Status register and verify link is trained
	     */
	  pciehp_configure_device()
	    /* enumerate hotplugged device */

The 20 ms delay in pcie_wait_for_link() was added by Keith with commit
f0157160b359 ("PCI: Make link active reporting detection generic")
so that the function can also be used by the DPC driver.  I think that
additional delay wouldn't normally be necessary for hotplug but it
doesn't hurt much either.


> Per PCIe spec, in Gen 3 mode, software 
> should wait at least (1s - CTO value) after Data Link Layer Link Active 
> before accessing config space.

Hm, do you have a spec reference for that?  PCIe r4.0 sec 6.7.3.3 only says:

   "Software must wait for 100 ms after the Data Link Layer Link Active
    bit reads 1b before initiating a configuration access to the hot
    added device (see Section 6.6)."


> The exception is if OS enables CRS 
> Software Visibility in which case config space can be accessed 100 ms 
> after Data Link Layer Link Active.  Is CRS Software Visibility being used?

Not that I'm aware of, no.


> I forgot... Readiness Notifications can also let software bypass these 
> first access times but I've not seen anybody implement RN so assume it 
> is not being used here.

Right, I think this isn't used either so far.

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: PCI: hotplug: Erroneous removal of hotplug PCI devices
  2019-01-25  8:22         ` Lukas Wunner
@ 2019-01-25 22:39           ` Austin.Bolen
  2019-01-26 12:12             ` Lukas Wunner
  0 siblings, 1 reply; 23+ messages in thread
From: Austin.Bolen @ 2019-01-25 22:39 UTC (permalink / raw)
  To: lukas, Austin.Bolen; +Cc: keith.busch, Alex_Gagniuc, linux-pci, bhelgaas

On 1/25/2019 2:22 AM, Lukas Wunner wrote:
> On Thu, Jan 24, 2019 at 10:00:49PM +0000, Austin.Bolen@dell.com wrote:
>> You're welcome!  And yes, this ECN is not yet in 5.0 draft but should be
>> there soon.  Much of what is in the ECN can be implemented with existing
>> DPC hardware.  There are associated ECNs in ACPI and PCI Firmware
>> Specifications as well.  The ACPI spec changes will not be public until
>> the 6.3 spec is released (soon).  The PCI Firmware Specification changes
>> should be ratified in a few weeks as well.  In the meantime, UEFI org
>> (which ACPI is a part of) has a git repo accessible to UEFI forum
>> members where the Linux code to support these new async hot-plug
>> mechanisms is being developed.  Will be made public once the specs are
>> public.
> 
> Unfortunately I don't have access to those ECNs or the non-public
> git repo.
> 

The PCIe Async Hot-Plug ECN has most of the new stuff. It has been 
released and has the same level of access as the PCIe Base Spec so if 
you have access to the PCIe base spec then you should have access to 
this ECN.

> 
>> Some people have slow hands :-).  This is a U.2 device.  There are 2
>> pins (Pin 4 = IfDet#/Pin 10 = PRSNT#) that determine presence and each
>> pin is a different length (On insert, IfDet# mates first, PRSNT# mates
>> second, PCIe data lanes mate 3rd).
> 
> But I've found a public spec for that:
> http://www.ssdformfactor.org/docs/SSD_Form_Factor_Version1_00.pdf
> 
>> These 2 pins determine if it is a SAS or NVMe drive.  We have to wait
>> until both pins mate in order to know for sure it's an NVMe drive and
>> therefore generate PRSNT# to DSP above the drive.  But there is no way
>> to know when both pins have mated due to the way the U.2 connector
>> signals were defined so we have to dead wait.
> 
> What I don't get is, a lot of the pins are used either for PCIe or SAS:
> If the host controller doesn't know yet whether it's PCIe or SAS, why does
> it already start driving them in PCIe mode and bring up the link?
> 
> Also, those PCIe pins mate 3rd, so by the time the host controller starts
> driving those pins in PCIe mode, IfDet# and PRSNT# pins are known to be
> connected (because they mate earlier), so the host controller can stop
> waiting and start sampling the two pins?  Or is it mechanically possible
> to insert the drive in such a way that the pins do not mate before the
> PCIe pins, contrary to the spec?

In this case, there's not a single host controller.  There are two 
things at play.  The sideband signals (IfDet#, PRSNT#) are routed to 
logic on the platform motherboard or drive backplane.  The PCIe lanes 
are routed to a downstream port (Root Port or Switch Downstream port). 
The platform can see when the sideband signals connect but can't see 
when the link trains.  When the link trains and Data Link Layer goes to 
Active state, the DSP will notify OS via DLLSC interrupt but platform 
has no visibility into this (because we grant control of hot-plug to OS 
natively) and there's no way for platform to get notified when the link 
becomes active.  The PCIe ECN referenced above provides ability to 
notify the platform in this scenario but requires hardware change.

Likewise the DSP doesn't know or care if it's possible to plug a SAS 
drive into the slot.  As soon as it starts seeing training sets on the 
PCIe link it's going to train regardless of the state of IfDet#/PRSNT#.

Also, the DSP doesn't know the state of IfDet#/PRSNT# directly.  There's 
a pin called PRSNT# in PCIe DSPs (different from the U.2 P10 PRSNT#). 
The platform will assert PCIe PRSNT# to the DSP when U.2 PRSNT#/IfDet# 
indicate it's an NVMe drive (and platform doesn't know this until around 
400 ms after IfDet# mates because of the reasons described earlier). 
But PCIe Base spec does not require DSPs to see PCIe PRSNT# asserted in 
order to start the training sequence.  So that's why the PCIe link comes 
up before the platform knows it's an NVMe drive.

> 
> Link up means in-band presence, 

Link Up and In-band Presence are slightly different states (see Table 
4.15 in PCIe Base Spec).  In-band presence is a lower-level detection 
mechanism and can be set even when link is not up.

> so I assume this host controller already
> adheres to the ECN discussed earlier?  

Yes, this platform disables in-band presence.

> What is the rationale of the ECN
> given that it causes these issues?

In certain link states like L1, DSPs can't detect in-band presence going 
away.  If you remove a device while the link is in one of these states, 
the OS will not get notified.  The OS will not be able to detect 
subsequent insertions either.  The only way out of this is for the user 
to go into the OS and manually get the link out of L1.

The PCIe spec doesn't have any strict requirements on how long the time 
between DLLSC and PDSC can be, but from our measurements 100 ms is not 
long enough to cover the time delta from the first pins mating until the 
device is fully inserted.  At a platform level, we use 1 second to allow 
for insertion and removal events and this seems adequate for most cases, 
but it can be defeated if someone makes an effort to slowly insert or 
remove a device.

> Is this product being shipped or
> can it still be changed?
> 

It's shipping.  It could be changed with a firmware change to re-enable 
in-band presence but then we'd have the issue where hot-plug breaks when 
removing devices when link is in certain states like L1 so that is not a 
viable solution.

> 
>>> We do allow a 20 + 100 ms delay in pcie_wait_for_link() between link up
>>> and presence detect up, just not 400 ms.
>>
>> Is the hot-inserted device's config space accessed immediately after
>> waiting this 20 + 100 ms delay?
> 
> Yes.
> 
> You can follow the code via this link, click your way down the call stack:
> https://elixir.bootlin.com/linux/v5.0-rc3/source/drivers/pci/hotplug/pciehp_hpc.c#L602
> 
> pciehp_ist()
>    pciehp_handle_presence_or_link_change() /* called on PDC or DLLSC event */
>      pciehp_enable_slot() /* called if PDS or DLLLA is 1 */
>        __pciehp_enable_slot()
>          board_added()
> 	  pciehp_check_link_status()
> 	    pcie_wait_for_link()
> 	      /*
> 	       * wait 20 ms per PCIe r4.0 sec 6.6.1,
> 	       * then wait for up to 1 s until DLLLA is 1,
> 	       * then wait 100 ms per PCIe r4.0 sec 6.7.3.3
> 	       */
> 	    pci_bus_check_dev()
> 	      /*
> 	       * first access to hotplugged device:
> 	       * wait for up to 1 s until vendor ID is accessible
> 	       */
> 	    /*
> 	     * any additionl PDC or DLLSC event that occurred up to this point
> 	     * is ignored because the link is up and config space accessible;
> 	     * read Link Status register and verify link is trained
> 	     */
> 	  pciehp_configure_device()
> 	    /* enumerate hotplugged device */
> 
> The 20 ms delay in pcie_wait_for_link() was added by Keith with commit
> f0157160b359 ("PCI: Make link active reporting detection generic")
> so that the function can also be used by the DPC driver.  I think that
> additional delay wouldn't normally be necessary for hotplug but it
> doesn't hurt much either.
> 
> 
>> Per PCIe spec, in Gen 3 mode, software
>> should wait at least (1s - CTO value) after Data Link Layer Link Active
>> before accessing config space.
> 
> Hm, do you have a spec reference for that?  PCIe r4.0 sec 6.7.3.3 only says:
> 
>     "Software must wait for 100 ms after the Data Link Layer Link Active
>      bit reads 1b before initiating a configuration access to the hot
>      added device (see Section 6.6)."
> 

Next sentence says "Software must allow 1 second after the Data Link 
Layer Link Active bit reads 1b before it is permitted to determine that 
a hot plugged device which fails to return a Successful Completion for a 
Valid Configuration Request is a broken device".

And Section 6.6.1 on Conventional Reset says "Software should use 100 ms 
wait periods only if software enables CRS Software Visibility. 
Otherwise, Completion timeouts, platform timeouts, or lengthy processor 
instruction stalls may result."

The "Completion timeouts" reference here is the key. Since CRS Software 
Visibility is not used, the device still has 900 ms before it has to 
return a Successful Completion if you issue a config read 100 ms after 
DLLSC.

CTO timers are typically set around 50 to 65 ms on x86 systems.  So if 
you issue a config read after 100 ms you could see a CTO around 50 to 65 
ms after it's issued, which would crash the system.  In order to avoid 
this CTO, software has to wait (1 s - CTO value) after DLLSC before 
issuing a config read at Gen 3 (timing is slightly different for Gen 1/2 
speeds).  This timing constraint needs to be observed any time a device 
is reset as well.
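
To put numbers on it (illustrative values): with a 65 ms CTO and no CRS 
Software Visibility, a config read issued 100 ms after DLLLA to a device 
that isn't ready yet times out around 165 ms after DLLLA, well inside the 
1 second the device is allowed to take.  Waiting 1000 ms - 65 ms = 935 ms 
before the first read keeps the timeout from firing inside that window.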

We probably won't see an issue most of the time because most devices can 
come up and respond to config accesses way faster than 100 ms, but I've 
seen devices take longer than that in certain corner cases.  Regardless 
of most devices' ability to be ready faster than required by PCIe, it's 
probably a good idea to wait longer for hot-insert anyway: insertion 
events can be slow, so everything needs time to reach steady state 
(fully inserted) before we start configuring things, and 100 ms is not 
enough time for that based on our lab measurements.  Likewise, DLLSC 
and PDSC interrupts can be spread pretty far apart on removal as well, 
depending on how slowly the user removes the device.

>> The exception is if OS enables CRS
>> Software Visibility in which case config space can be accessed 100 ms
>> after Data Link Layer Link Active.  Is CRS Software Visibility being used?
> 
> Not that I'm aware of, no.
> 
> 
>> I forgot... Readiness Notifications can also let software bypass these
>> first access times but I've not seen anybody implement RN so assume it
>> is not being used here.
> 
> Right, I think this isn't used either so far.
> 
> Thanks,
> 
> Lukas
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: PCI: hotplug: Erroneous removal of hotplug PCI devices
  2019-01-25 22:39           ` Austin.Bolen
@ 2019-01-26 12:12             ` Lukas Wunner
  2019-01-30 14:28               ` Austin.Bolen
  0 siblings, 1 reply; 23+ messages in thread
From: Lukas Wunner @ 2019-01-26 12:12 UTC (permalink / raw)
  To: Austin.Bolen; +Cc: keith.busch, Alex_Gagniuc, linux-pci, bhelgaas

On Fri, Jan 25, 2019 at 10:39:04PM +0000, Austin.Bolen@dell.com wrote:
> On 1/25/2019 2:22 AM, Lukas Wunner wrote:
> > 	    pcie_wait_for_link()
> > 	      /*
> > 	       * wait 20 ms per PCIe r4.0 sec 6.6.1,
> > 	       * then wait for up to 1 s until DLLLA is 1,
> > 	       * then wait 100 ms per PCIe r4.0 sec 6.7.3.3
> > 	       */
> > 	    pci_bus_check_dev()
> > 	      /*
> > 	       * first access to hotplugged device:
> > 	       * wait for up to 1 s until vendor ID is accessible
> > 	       */
[...]
> >  PCIe r4.0 sec 6.7.3.3 only says:
> > 
> >     "Software must wait for 100 ms after the Data Link Layer Link Active
> >      bit reads 1b before initiating a configuration access to the hot
> >      added device (see Section 6.6)."
> > 
> 
> Next sentence says "Software must allow 1 second after the Data Link 
> Layer Link Active bit reads 1b before it is permitted to determine that 
> a hot plugged device which fails to return a Successful Completion for a 
> Valid Configuration Request is a broken device".

Yes, that's why pci_bus_check_dev() polls the vendor ID register for
up to 1 sec before it declares failure.  If that approach is not correct
or if we deviate from the spec somewhere else then that should of course
be fixed, so please let me know if so.
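
(For reference, the polling boils down to roughly this, paraphrased
rather than verbatim:)

	/* body of pci_bus_check_dev(): poll the Vendor ID for up to 1 s */
	int delay = 1000;
	u32 id;

	while (!pci_bus_read_dev_vendor_id(bus, devfn, &id, 0)) {
		if (delay <= 0)
			return false;	/* declare the device broken */
		msleep(20);
		delay -= 20;
	}
	return true;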


> And Section 6.6.1 on Conventional Reset says "Software should use 100 ms 
> wait periods only if software enables CRS Software Visibility. 
> Otherwise, Completion timeouts, platform timeouts, or lengthy processor 
> instruction stalls may result."
> 
> The "Completion timeouts" reference here is the key. Since CRS Software 
> Visibility is not used, the device still has 900 ms before it has to 
> return a Successful Completion if you issue a config read 100 ms after 
> DLLSC.

I'm truly sorry, I double-checked the code and it turns out Linux does
enable CRS Software Visibility unconditionally for all bridges that
support it:

pci_scan_bridge_extend()
  pci_enable_crs()
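
For reference, pci_enable_crs() boils down to setting the CRS Software
Visibility Enable bit in the Root Control register whenever the Root
Capabilities register advertises support (paraphrased):

	static void pci_enable_crs(struct pci_dev *pdev)
	{
		u16 root_cap = 0;

		/* Enable CRS Software Visibility if supported */
		pcie_capability_read_word(pdev, PCI_EXP_RTCAP, &root_cap);
		if (root_cap & PCI_EXP_RTCAP_CRSVIS)
			pcie_capability_set_word(pdev, PCI_EXP_RTCTL,
						 PCI_EXP_RTCTL_CRSSVE);
	}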

My apologies for the misinformation in my previous e-mail.  Since your
explanation hinges on CRS not being used, I'm not sure in how far it
applies to the problem at hand.  (I do thank you though for the
comprehensive explanation, I have a much better understanding now
of the issue you're facing.)


> > Is this product being shipped or can it still be changed?
> 
> It's shipping.  It could be changed with a firmware change to re-enable 
> in-band presence but then we'd have the issue where hot-plug breaks when 
> removing devices when link is in certain states like L1 so that is not a 
> viable solution.

An alternative idea would be to change firmware to always drive PCIe PRSNT#
high as soon as U.2 PRSNT#/IfDet# is sensed, regardless whether it's NVMe
or not.  Linux will try to bring up the slot and if it can't be brought up
because it's a SAS drive, all you'll get is an error message in dmesg,
but you shouldn't see other negative effects.

I think the proper solution would be to add a microcontroller to each
U.2 slot on your motherboard which snoops on the PCIe pins to determine
whether communication is taking place and sample the PRSNT#/IfDet# pins
if so.  This leads to higher BOM cost and development cost of course.

Arguably your platform firmware-based solution tries to avoid that cost
by simulating what a microcontroller would do but can't really achieve
the same result because it lacks access to the PCIe pins.  Vendors such
as Apple do go to the lengths of adding microcontrollers in similar cases,
e.g. Apple used one on their motherboards with first generation Thunderbolt
controllers:

Because the physical port can be used either to attach a Thunderbolt
or a DisplayPort device, they added an NXP LP1112A ARM controller
which snoops on the pins and drives a multiplexer to route signals
either to the Thunderbolt controller or the GPU.  See page 75 of
these schematics ("DisplayPort/T29 A MUXing"):

https://forums.macrumors.com/attachments/a1278-j30-820-3115-b-051-9058-pdf.451409/

Newer Thunderbolt controllers integrate that functionality so the
BOM cost and development cost is no longer necessary today.


> > Unfortunately I don't have access to those ECNs or the non-public
> > git repo.
> 
> The PCIe Async Hot-Plug ECN has most of the new stuff. It has been 
> released and has the same level of access as the PCIe Base Spec so if 
> you have access to the PCIe base spec then you should have access to 
> this ECN.

Unfortunately I'm not associated with a member of the PCISIG,
so my level of access to the PCIe Base Spec is constrained
to leaked documents found on Ukrainian websites. ;-)

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: PCI: hotplug: Erroneous removal of hotplug PCI devices
  2019-01-26 12:12             ` Lukas Wunner
@ 2019-01-30 14:28               ` Austin.Bolen
  0 siblings, 0 replies; 23+ messages in thread
From: Austin.Bolen @ 2019-01-30 14:28 UTC (permalink / raw)
  To: lukas, Austin.Bolen; +Cc: keith.busch, Alex_Gagniuc, linux-pci, bhelgaas

On 1/26/2019 6:12 AM, Lukas Wunner wrote:
> 
<snip>

> 
> Yes, that's why pci_bus_check_dev() polls the vendor ID register for
> up to 1 sec before it declares failure.  If that approach is not correct
> or if we deviate from the spec somewhere else then that should of course
> be fixed, so please let me know if so.
> 
> 
>> And Section 6.6.1 on Conventional Reset says "Software should use 100 ms
>> wait periods only if software enables CRS Software Visibility.
>> Otherwise, Completion timeouts, platform timeouts, or lengthy processor
>> instruction stalls may result."
>>
>> The "Completion timeouts" reference here is the key. Since CRS Software
>> Visibility is not used, the device still has 900 ms before it has to
>> return a Successful Completion if you issue a config read 100 ms after
>> DLLSC.
> 
> I'm truly sorry, I double-checked the code and it turns out Linux does
> enable CRS Software Visibility unconditionally for all bridges that
> support it:
> 
> pci_scan_bridge_extend()
>    pci_enable_crs()
> 
> My apologies for the misinformation in my previous e-mail.  Since your
> explanation hinges on CRS not being used, I'm not sure to what extent it
> applies to the problem at hand.  (I do thank you though for the
> comprehensive explanation, I have a much better understanding now
> of the issue you're facing.)
> 
> 

Yes, great: with the CRS Software Visibility mechanism in place it sounds 
good.  Thanks for double-checking on it!  There is a weird corner case, 
though: as you insert a device, the Data Link Layer becomes active (the 
gold fingers on the connector just barely make contact), but as you 
continue to insert the device the link goes back down and then comes back 
up once it's fully inserted (we've captured this and other similar 
oddities in the lab).  That link down/up would reset the device, so in 
this case it looks like the logic could do a config read before the 
device is ready.  From the platform side we usually wait 1 second around 
hot-plug events for everything to reach steady state, since you don't 
otherwise know when the mechanical action around hot-insert/hot-remove 
has ended.
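
For what it's worth, one way the OS side could approximate that settle 
time is to require DLL Link Active to stay set for some period before 
touching config space.  The sketch below is purely hypothetical (made-up 
helper name and 100 ms / 1 s numbers), not something pciehp does today:

  /* Hypothetical sketch: wait for the link to stay up for 100 ms,
   * giving up after 1 s, before doing the first config read.
   * Needs <linux/pci.h>, <linux/delay.h> and <linux/jiffies.h>.
   */
  static bool wait_for_stable_link(struct pci_dev *port)
  {
          unsigned long deadline = jiffies + msecs_to_jiffies(1000);
          int stable_ms = 0;
          u16 lnksta;

          while (time_before(jiffies, deadline)) {
                  pcie_capability_read_word(port, PCI_EXP_LNKSTA, &lnksta);

                  if (lnksta & PCI_EXP_LNKSTA_DLLLA)
                          stable_ms += 10;
                  else
                          stable_ms = 0;  /* link flapped, start over */

                  if (stable_ms >= 100)
                          return true;    /* link held up for 100 ms */

                  msleep(10);
          }

          return false;
  }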

The other way to handle this is via DPC, since it keeps the link from 
coming back up on its own, but more work is needed to support DPC on some 
systems.  The upcoming ACPI additions should hopefully allow DPC to be 
enabled on such systems.
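
For completeness, once firmware does grant the OS control of DPC, arming 
the trigger on a port is only a couple of register writes.  A rough 
sketch (hypothetical helper; real support also needs an interrupt handler 
and the _OSC negotiation mentioned above) would be:

  static int enable_dpc_trigger(struct pci_dev *port)
  {
          int dpc = pci_find_ext_capability(port, PCI_EXT_CAP_ID_DPC);
          u16 ctl;

          if (!dpc)
                  return -ENODEV;         /* port has no DPC capability */

          /* Trigger containment on ERR_FATAL and enable the DPC interrupt */
          pci_read_config_word(port, dpc + PCI_EXP_DPC_CTL, &ctl);
          ctl |= PCI_EXP_DPC_CTL_EN_FATAL | PCI_EXP_DPC_CTL_INT_EN;
          pci_write_config_word(port, dpc + PCI_EXP_DPC_CTL, ctl);

          return 0;
  }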

>>> Is this product being shipped or can it still be changed?
>>
>> It's shipping.  It could be changed with a firmware change to re-enable
>> in-band presence but then we'd have the issue where hot-plug breaks when
>> removing devices when link is in certain states like L1 so that is not a
>> viable solution.
> 
> An alternative idea would be to change firmware to always drive PCIe PRSNT#
> high as soon as U.2 PRSNT#/IfDet# is sensed, regardless of whether it's NVMe
> or not.  Linux will try to bring up the slot and if it can't be brought up
> because it's a SAS drive, all you'll get is an error message in dmesg,
> but you shouldn't see other negative effects.

We used to do this, but this error and similar ones in other OSes are 
what prompted the 500 ms delay circuitry.

> 
> I think the proper solution would be to add a microcontroller to each
> U.2 slot on your motherboard which snoops on the PCIe pins to determine
> whether communication is taking place and sample the PRSNT#/IfDet# pins
> if so.  This leads to higher BOM cost and development cost of course.
> 
> Arguably your platform firmware-based solution tries to avoid that cost
> by simulating what a microcontroller would do but can't really achieve
> the same result because it lacks access to the PCIe pins.  Vendors such
> as Apple do go to the lengths of adding microcontrollers in similar cases,
> e.g. Apple used one on their motherboards with first generation Thunderbolt
> controllers:
> 
> Because the physical port can be used either to attach a Thunderbolt
> or a DisplayPort device, they added an NXP LP1112A ARM controller
> which snoops on the pins and drives a multiplexer to route signals
> either to the Thunderbolt controller or the GPU.  See page 75 of
> these schematics ("DisplayPort/T29 A MUXing"):
> 
> https://forums.macrumors.com/attachments/a1278-j30-820-3115-b-051-9058-pdf.451409/
> 
> Newer Thunderbolt controllers integrate that functionality, so that BOM
> cost and development cost are no longer necessary today.
> 
> 

Thanks for the links.  And yes, we looked at things like this too, but 
they're prohibitively expensive.  It's one thing if you have a single 
Thunderbolt connector, but we have 24 or more drives.

We do have other, less expensive ways to detect when a link has trained, 
but those mechanisms, like the microcontroller idea above, all require 
hardware changes, so they won't be possible on the systems already in 
the field.  Future platforms + DPC should handle it better.

>>> Unfortunately I don't have access to those ECNs or the non-public
>>> git repo.
>>
>> The PCIe Async Hot-Plug ECN has most of the new stuff. It has been
>> released and has the same level of access as the PCIe Base Spec so if
>> you have access to the PCIe base spec then you should have access to
>> this ECN.
> 
> Unfortunately I'm not associated with a member of the PCISIG,
> so my level of access to the PCIe Base Spec is constrained
> to leaked documents found on Ukrainian websites. ;-)
>

Maybe you'll get lucky and they'll leak the ECN too :-)

> Thanks,
> 
> Lukas
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2019-01-30 14:28 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-23 18:20 PCI: hotplug: Erroneous removal of hotplug PCI devices Alex_Gagniuc
2019-01-23 18:44 ` Keith Busch
2019-01-23 19:02   ` Lukas Wunner
2019-01-23 19:07     ` Keith Busch
2019-01-23 19:15       ` Lukas Wunner
2019-01-23 19:33         ` Keith Busch
2019-01-24 22:43     ` Austin.Bolen
2019-01-24 22:52       ` Austin.Bolen
     [not found]   ` <b32e6ca62ae2494f98450df81ca1ee14@AUSX13MPC131.AMER.DELL.COM>
2019-01-24 20:20     ` Keith Busch
2019-01-24 22:00       ` Austin.Bolen
2019-01-25  8:22         ` Lukas Wunner
2019-01-25 22:39           ` Austin.Bolen
2019-01-26 12:12             ` Lukas Wunner
2019-01-30 14:28               ` Austin.Bolen
2019-01-23 18:54 ` Lukas Wunner
2019-01-23 19:07   ` Lukas Wunner
2019-01-23 19:09     ` Keith Busch
2019-01-23 19:28       ` Lukas Wunner
2019-01-23 19:47         ` Keith Busch
2019-01-23 20:10           ` Alex_Gagniuc
2019-01-23 23:50     ` Alex_Gagniuc
2019-01-24  9:25       ` Lukas Wunner
2019-01-24 22:33   ` Austin.Bolen
