linux-pci.vger.kernel.org archive mirror
* Re: [Bug report] NVMe and RAID probe fail with commit cdf6b7362108
       [not found] <ce7332ed-0dfb-da20-40e8-702c755d3e08@huawei.com>
@ 2018-10-17  7:54 ` Hanjun Guo
  2018-10-17  7:56 ` Lukas Wunner
  1 sibling, 0 replies; 3+ messages in thread
From: Hanjun Guo @ 2018-10-17  7:54 UTC
  To: Lukas Wunner, Bjorn Helgaas
  Cc: linux-pci, Rafael J. Wysocki, Linuxarm, Mika Westerberg

On 2018/10/17 11:38, Hanjun Guo wrote:
> Hi Lukas,
> 
> We hit NVMe and RAID probe failures on a 4.19-rc1+ based kernel on
> Hisilicon D06, then bisected them to this commit:
> 
> cdf6b7362108 "PCI: pciehp: Always enable occupied slot on probe"
> 
> Reverting this patch makes the system functional again. I'm not
> sure why it causes a regression on our board; could you share
> some ideas on this?

After applying these two patches from linux-next:

https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=11e87702be65780be92fb1f0a5b7b293954185f7
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=80696f991424d05a784c0cf9c314ac09ac280406

things are back to normal. Can we merge them for 4.19?

Thanks
Hanjun



* Re: [Bug report] NVMe and RAID probe fail with commit cdf6b7362108
       [not found] <ce7332ed-0dfb-da20-40e8-702c755d3e08@huawei.com>
  2018-10-17  7:54 ` [Bug report] NVMe and RAID probe fail with commit cdf6b7362108 Hanjun Guo
@ 2018-10-17  7:56 ` Lukas Wunner
  2018-10-17  8:43   ` Hanjun Guo
  1 sibling, 1 reply; 3+ messages in thread
From: Lukas Wunner @ 2018-10-17  7:56 UTC
  To: Hanjun Guo
  Cc: Bjorn Helgaas, Rafael J. Wysocki, Mika Westerberg,
	wangxiongfeng (C),
	linux-pci, Linuxarm, Xie XiuQi, liudongdong (C)

On Wed, Oct 17, 2018 at 11:38:11AM +0800, Hanjun Guo wrote:
> We hit NVMe and RAID probe failures on a 4.19-rc1+ based kernel on
> Hisilicon D06, then bisected them to this commit:
> 
> cdf6b7362108 "PCI: pciehp: Always enable occupied slot on probe"
> 
> Reverting this patch makes the system functional again. I'm not
> sure why it causes a regression on our board; could you share
> some ideas on this?
> 
> Boot log below, [0] is the failing one, [1] is the good one,
> please let me know if you need something more.

Please cherry-pick the following linux-next commit and test if the
issue goes away:

    commit 80696f991424d05a784c0cf9c314ac09ac280406
    Author: Lukas Wunner <lukas@wunner.de>
    Date:   Sat Sep 8 09:59:01 2018 +0200
    
    PCI: pciehp: Tolerate Presence Detect hardwired to zero

If this does not fix the issue, please provide full dmesg output for
the working and non-working case with "pciehp.pciehp_debug=1" on the
kernel command line, preferably attached to a bugzilla entry.

Detailed analysis of the dmesg output you've already provided:

There are 9 hotplug ports but devices are only present below 4 of them:
a RAID adapter, a BMC with VGA, a 2x 10G Ethernet adapter and the NVMe drive:

0000:00:00.0 [19e5:a120] bridge to [bus 01]
0000:00:04.0 [19e5:a120] bridge to [bus 02]
0000:00:08.0 [19e5:a120] bridge to [bus 03]: 0000:03:00.0 [1000:0097] mpt3sas *
0000:00:10.0 [19e5:a120] bridge to [bus 04]: 0000:04:00.0 [19e5:1711] hibmc   *
0000:00:12.0 [19e5:a120] bridge to [bus 05]: 0000:05:00.0 [8086:10fb] ixgbe
                                             0000:05:00.1 [8086:10fb] ixgbe
0000:80:00.0 [19e5:a120] bridge to [bus 81]: 0000:81:00.0 [19e5:0123] nvme    *
0000:80:08.0 [19e5:a120] bridge to [bus 82]
0000:80:0c.0 [19e5:a120] bridge to [bus 83]
0000:80:10.0 [19e5:a120] bridge to [bus 84]

Of the 4 occupied hotplug ports, 3 exhibit a hardware bug wherein the link
is reported to be up but the Presence Detect State bit in the Slot Status
register is zero.  Consequently pciehp deems the slot unoccupied.
I've marked the broken ports with an asterisk *.
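
To make the failure mode concrete, here is a minimal Python sketch of the two presence checks. It is an illustration under stated assumptions, not the actual patch: the function names and the sample register values are hypothetical, and only the two bit masks are taken from the PCIe capability layout (they match PCI_EXP_SLTSTA_PDS and PCI_EXP_LNKSTA_DLLLA in the kernel's pci_regs.h).

```python
# Bit masks from the PCI Express capability registers.
PCI_EXP_SLTSTA_PDS = 0x0040    # Slot Status: Presence Detect State
PCI_EXP_LNKSTA_DLLLA = 0x2000  # Link Status: Data Link Layer Link Active


def card_present(slot_status: int) -> bool:
    """True if the Presence Detect State bit is set."""
    return bool(slot_status & PCI_EXP_SLTSTA_PDS)


def link_active(link_status: int) -> bool:
    """True if the data link layer reports the link as up."""
    return bool(link_status & PCI_EXP_LNKSTA_DLLLA)


def slot_occupied_strict(slot_status: int, link_status: int) -> bool:
    """Hypothetical check that trusts Presence Detect State alone."""
    return card_present(slot_status)


def slot_occupied_tolerant(slot_status: int, link_status: int) -> bool:
    """Hypothetical check that also accepts an active link as presence."""
    return card_present(slot_status) or link_active(link_status)


# Made-up register values for illustration only.
healthy_port = (PCI_EXP_SLTSTA_PDS, PCI_EXP_LNKSTA_DLLLA)  # PDS set, link up
broken_port = (0x0000, PCI_EXP_LNKSTA_DLLLA)               # PDS zero, link up
empty_port = (0x0000, 0x0000)                              # PDS zero, link down

for name, regs in [("healthy", healthy_port),
                   ("broken", broken_port),
                   ("empty", empty_port)]:
    print(name, slot_occupied_strict(*regs), slot_occupied_tolerant(*regs))
```

With the strict check, a port whose Presence Detect State reads zero is treated as empty even though its link is up. The tolerant check accepts either signal, so only a port with neither bit set counts as unoccupied.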

The above-mentioned commit works around such broken hardware but so far
we knew only of a single affected chip, the Wilocity/QCA wil6210.
It looks like this bug is more common than we thought, so it might be
worth applying it already to v4.19.  However v4.19 is expected to be
released in 4 days, so backporting to a 4.19 stable release might be
more practical than squeezing the commit in before the coming Sunday.

For some reason the hotplug port 0000:00:12.0 with the 2x 10G Ethernet
adapter does not exhibit the hardware bug.

HTH,

Lukas


* Re: [Bug report] NVMe and RAID probe fail with commit cdf6b7362108
  2018-10-17  7:56 ` Lukas Wunner
@ 2018-10-17  8:43   ` Hanjun Guo
  0 siblings, 0 replies; 3+ messages in thread
From: Hanjun Guo @ 2018-10-17  8:43 UTC
  To: Lukas Wunner
  Cc: Bjorn Helgaas, Rafael J. Wysocki, Mika Westerberg,
	wangxiongfeng (C),
	linux-pci, Linuxarm, Xie XiuQi, liudongdong (C)

Hi Lukas,

Thank you for the quick reply.

On 2018/10/17 15:56, Lukas Wunner wrote:
> On Wed, Oct 17, 2018 at 11:38:11AM +0800, Hanjun Guo wrote:
>> We hit NVMe and RAID probe failures on a 4.19-rc1+ based kernel on
>> Hisilicon D06, then bisected them to this commit:
>>
>> cdf6b7362108 "PCI: pciehp: Always enable occupied slot on probe"
>>
>> Reverting this patch makes the system functional again. I'm not
>> sure why it causes a regression on our board; could you share
>> some ideas on this?
>>
>> Boot log below, [0] is the failing one, [1] is the good one,
>> please let me know if you need something more.
> 
> Please cherry-pick the following linux-next commit and test if the
> issue goes away:
> 
>     commit 80696f991424d05a784c0cf9c314ac09ac280406
>     Author: Lukas Wunner <lukas@wunner.de>
>     Date:   Sat Sep 8 09:59:01 2018 +0200
>     
>     PCI: pciehp: Tolerate Presence Detect hardwired to zero
> 
> If this does not fix the issue, please provide full dmesg output for
> the working and non-working case with "pciehp.pciehp_debug=1" on the
> kernel command line, preferably attached to a bugzilla entry.
> 
> Detailed analysis of the dmesg output you've already provided:
> 
> There are 9 hotplug ports but devices are only present below 4 of them:
> a RAID adapter, a BMC with VGA, a 2x 10G Ethernet adapter and the NVMe drive:
> 
> 0000:00:00.0 [19e5:a120] bridge to [bus 01]
> 0000:00:04.0 [19e5:a120] bridge to [bus 02]
> 0000:00:08.0 [19e5:a120] bridge to [bus 03]: 0000:03:00.0 [1000:0097] mpt3sas *
> 0000:00:10.0 [19e5:a120] bridge to [bus 04]: 0000:04:00.0 [19e5:1711] hibmc   *
> 0000:00:12.0 [19e5:a120] bridge to [bus 05]: 0000:05:00.0 [8086:10fb] ixgbe
>                                              0000:05:00.1 [8086:10fb] ixgbe
> 0000:80:00.0 [19e5:a120] bridge to [bus 81]: 0000:81:00.0 [19e5:0123] nvme    *
> 0000:80:08.0 [19e5:a120] bridge to [bus 82]
> 0000:80:0c.0 [19e5:a120] bridge to [bus 83]
> 0000:80:10.0 [19e5:a120] bridge to [bus 84]
> 
> Of the 4 occupied hotplug ports, 3 exhibit a hardware bug wherein the link
> is reported to be up but the Presence Detect State bit in the Slot Status
> register is zero.  Consequently pciehp deems the slot unoccupied.
> I've marked the broken ports with an asterisk *.

Thank you very much for the clue, it helps a lot.

> 
> The above-mentioned commit works around such broken hardware but so far
> we knew only of a single affected chip, the Wilocity/QCA wil6210.
> It looks like this bug is more common than we thought, so it might be
> worth applying it already to v4.19.  However v4.19 is expected to be
> released in 4 days, so backporting to a 4.19 stable release might be
> more practical than squeezing the commit in before the coming Sunday.

Makes sense to me.

> 
> For some reason the hotplug port 0000:00:12.0 with the 2x 10G Ethernet
> adapter does not exhibit the hardware bug.

After some debugging of the hardware, we found that only some ports were
set up with hotplug capability. 0000:00:12 is one of them (the ports for
nvme and mpt3sas obviously are not), so some ports behave correctly and
others do not. This can easily be fixed in the BIOS and the CPLD logic on
the board.

Thank you again for the help.
Hanjun

