linux-pci.vger.kernel.org archive mirror
* PCIe hot-plug issue: Failed to check link status
@ 2020-09-08 14:57 Myron Stowe
  2020-09-10 13:24 ` Lukas Wunner
  0 siblings, 1 reply; 3+ messages in thread
From: Myron Stowe @ 2020-09-08 14:57 UTC
  To: linux-pci; +Cc: bhelgaas, lukas, klimov.linux

On a system with a Mellanox Technologies MT27800 Family [ConnectX-5]
NIC controller containing a power button, hot-plug fails to function
properly.

  Normal, expected, scenario:
    o Press the OCP NIC's power button;
    o Power button LED blinks and turns off (delivering event message
      to CPU);
    o Verify NIC is offline via 'lspci';
    o Remove controller.

  Scenario with cmdline parameter 'pcie_port_pm=off':
    o Press NIC's power button;
    o LED turns off;
    o Verify NIC is offline;
    o Press power button (in an attempt to hot-add controller);
    o NIC is not recognized.

  Scenario with no cmdline parameter, or 'pcie_aspm=off', or
  'pcie_aspm=off pcie_port_pm=off':
    o Press NIC's power button;
    o LED continuously flashes;
    o 'lspci' shows the NIC as offline, but while the LED is
      flashing the controller cannot be removed.

The 'dmesg' and 'lspci' logs are attached to the associated
bugzilla:
  https://bugzilla.kernel.org/show_bug.cgi?id=209113


As stated in the bugzilla, I'm relaying all this information second
hand; I'm hoping to get the affected party involved directly.



* Re: PCIe hot-plug issue: Failed to check link status
  2020-09-08 14:57 PCIe hot-plug issue: Failed to check link status Myron Stowe
@ 2020-09-10 13:24 ` Lukas Wunner
  2020-09-14 20:08   ` Myron Stowe
  0 siblings, 1 reply; 3+ messages in thread
From: Lukas Wunner @ 2020-09-10 13:24 UTC
  To: Myron Stowe; +Cc: linux-pci, bhelgaas, klimov.linux

On Tue, Sep 08, 2020 at 08:57:26AM -0600, Myron Stowe wrote:
> On a system with a Mellanox Technologies MT27800 Family [ConnectX-5]
> NIC controller containing a power button, hot-plug fails to function
> properly.
[...]
> https://bugzilla.kernel.org/show_bug.cgi?id=209113

Thanks for the report.

So in the dmesg output you've provided, the card is already inserted
when the machine boots.  At 233 seconds, the Attention Button is pressed
twice within 200 msec (the second press cancels the first).  At 235 sec,
the button is pressed again and after 5 sec the slot is brought down.
So far so good.

At 291 sec the button is pressed but bringup of the slot fails.
What happens here is, pciehp notices that upon the button press,
a card is already present in the slot.  So for convenience,
instead of waiting the full 5 sec, it attempts to bring up the slot
immediately.  That fails because Data Link Layer Link Active isn't
set within 1 sec.

The difference to v4.18 is that back then, pciehp waited the full
5 sec before bringing up the slot.

Per PCIe r4.0 sec 6.7.1.8:

    After turning power on, software must wait for a Data Link Layer
    State Changed event, as described in Section 6.7.3.3.

And per sec 6.7.3.3:

    The Data Link Layer State Changed event must occur within 1 second
    of the event that initiates the hot-insertion. If a power controller
    is supported, the time out interval is measured from when software
    initiated a write to the Slot Control register to turn on the power.
    [...] Software is allowed to time out on a hot add operation if the
    Data Link Layer State Changed event does not occur within 1 second.

So we adhere to the spec regarding the timeout between enabling power
and waiting for DLLLA.  We do not exactly adhere to the spec regarding
the 5 sec delay between button press and acting on it.  But I can't
really imagine that's the problem.

As a shot in the dark, could you amend pcie_wait_for_link_delay()
in drivers/pci/pci.c and increase the "timeout = 1000" a little?
Maybe more than 1 sec is necessary in this case between enabling
power and timing out for lack of a link?
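
A minimal userspace sketch of the kind of poll-until-timeout loop in
question; this is a model, not the kernel's code.  wait_for_link() and
link_up_at_ms are hypothetical names: the latter stands for a card whose
link becomes active a fixed time after power-on, and the clock is
simulated in 10 ms steps instead of actually sleeping.

```c
#include <stdbool.h>

/*
 * Hypothetical model, not kernel code: poll a simulated link in 10 ms
 * steps until it reports active or timeout_ms has elapsed.
 * link_up_at_ms is the (assumed) time after power-on at which the
 * simulated card sets Data Link Layer Link Active.
 */
static bool wait_for_link(int timeout_ms, int link_up_at_ms)
{
	int elapsed_ms = 0;

	for (;;) {
		/* Stand-in for reading the Link Status register. */
		bool active = elapsed_ms >= link_up_at_ms;

		if (active)
			return true;
		if (elapsed_ms >= timeout_ms)
			return false;	/* timed out for lack of a link */
		elapsed_ms += 10;	/* stand-in for sleeping 10 ms */
	}
}
```

Under these assumptions, a card that needs an illustrative 1500 ms
fails with wait_for_link(1000, 1500) and succeeds with
wait_for_link(2000, 1500).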

The v4.18 output you've provided in the bugzilla is incomplete and
lacks time stamps.  Could you provide it in full?

Thanks,

Lukas


* Re: PCIe hot-plug issue: Failed to check link status
  2020-09-10 13:24 ` Lukas Wunner
@ 2020-09-14 20:08   ` Myron Stowe
  0 siblings, 0 replies; 3+ messages in thread
From: Myron Stowe @ 2020-09-14 20:08 UTC
  To: Lukas Wunner; +Cc: linux-pci, bhelgaas, klimov.linux

On Thu, 10 Sep 2020 15:24:40 +0200
Lukas Wunner <lukas@wunner.de> wrote:

> On Tue, Sep 08, 2020 at 08:57:26AM -0600, Myron Stowe wrote:
> > On a system with a Mellanox Technologies MT27800 Family [ConnectX-5]
> > NIC controller containing a power button, hot-plug fails to function
> > properly.  
> [...]
> > https://bugzilla.kernel.org/show_bug.cgi?id=209113  
> 
> Thanks for the report.
> 
> So in the dmesg output you've provided, the card is already inserted
> when the machine boots.  At 233 seconds, the Attention Button is
> pressed twice within 200 msec (the second press cancels the first).
> At 235 sec, the button is pressed again and after 5 sec the slot is
> brought down. So far so good.
> 
> At 291 sec the button is pressed but bringup of the slot fails.
> What happens here is, pciehp notices that upon the button press,
> a card is already present in the slot.  So for convenience,
> instead of waiting the full 5 sec, it attempts to bring up the slot
> immediately.  That fails because Data Link Layer Link Active isn't
> set within 1 sec.
> 
> The difference to v4.18 is that back then, pciehp waited the full
> 5 sec before bringing up the slot.
> 
> Per PCIe r4.0 sec 6.7.1.8:
> 
>     After turning power on, software must wait for a Data Link Layer
>     State Changed event, as described in Section 6.7.3.3.
> 
> And per sec 6.7.3.3:
> 
>     The Data Link Layer State Changed event must occur within 1 second
>     of the event that initiates the hot-insertion. If a power controller
>     is supported, the time out interval is measured from when software
>     initiated a write to the Slot Control register to turn on the power.
>     [...] Software is allowed to time out on a hot add operation if the
>     Data Link Layer State Changed event does not occur within 1 second.
> 
> So we adhere to the spec regarding the timeout between enabling power
> and waiting for DLLLA.  We do not exactly adhere to the spec regarding
> the 5 sec delay between button press and acting on it.  But I can't
> really imagine that's the problem.
> 
> As a shot in the dark, could you amend pcie_wait_for_link_delay()
> in drivers/pci/pci.c and increase the "timeout = 1000" a little?
> Maybe more than 1 sec is necessary in this case between enabling
> power and timing out for lack of a link?
> 
> The v4.18 output you've provided in the bugzilla is incomplete and
> lacks time stamps.  Could you provide it in full?

Hi Lukas,

I got back a full 'dmesg' log, with the hot-plug event included, from
the earlier kernel on which hot-plug works.  It's attached to the BZ as
"dmesg log of v3.10+ showing successful hot-plug event".  Note that it
is based on v3.10; I was mistaken earlier in thinking the working case
was based on v4.18.  In this log, the hot-plug event starts 113 seconds
in.

As for your "shot in the dark", the reporters doubled the timeout value
in pcie_wait_for_link_delay() and had positive results.  The 'dmesg'
log from that testing is also attached to the BZ as "dmesg log of
v5.8.8 with increased timeout".  So it looks as if this specific
controller does not deliver the Data Link Layer State Changed event
within the 1 second the spec allows.
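
One way to check a trace against the 1 second window is to look for the
first Link Status reading with Data Link Layer Link Active set.  A rough
sketch, assuming hypothetical 'struct lnksta_sample' records of (time
since power-on, raw Link Status value); the 0x2000 mask is the DLLLA bit
(bit 13) of the Link Status register, and none of this is taken from
the logs in the BZ.

```c
#include <stddef.h>
#include <stdint.h>

#define LNKSTA_DLLLA 0x2000	/* Link Status bit 13: Data Link Layer Link Active */

/* Hypothetical record: one Link Status reading and when it was taken. */
struct lnksta_sample {
	int ms_since_power_on;
	uint16_t lnksta;
};

/*
 * Return the time of the first sample with DLLLA set, or -1 if the
 * link never became active in the samples given.
 */
static int first_link_active_ms(const struct lnksta_sample *s, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (s[i].lnksta & LNKSTA_DLLLA)
			return s[i].ms_since_power_on;
	return -1;
}
```

A return value above 1000, or -1, would mean the card missed the
window for that trace.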


There was an earlier attachment of a couple of 'dmesg' logs that you
can ignore (i.e., dmesg with timestamp for RHEL...).

Myron


