linux-pci.vger.kernel.org archive mirror
* Re: [Bug 215525] New: HotPlug does not work on upstream kernel 5.17.0-rc1
       [not found] <bug-215525-41252@https.bugzilla.kernel.org/>
@ 2022-01-24 21:46 ` Bjorn Helgaas
  2022-01-25  8:58   ` Hans de Goede
                     ` (3 more replies)
  0 siblings, 4 replies; 20+ messages in thread
From: Bjorn Helgaas @ 2022-01-24 21:46 UTC (permalink / raw)
  To: linux-pci
  Cc: Blazej Kucman, Hans de Goede, Lukas Wunner, Naveen Naidu,
	Keith Busch, Nirmal Patel, Jonathan Derrick

[+cc linux-pci, Hans, Lukas, Naveen, Keith, Nirmal, Jonathan]

On Mon, Jan 24, 2022 at 11:46:14AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=215525
> 
>             Bug ID: 215525
>            Summary: HotPlug does not work on upstream kernel 5.17.0-rc1
>            Product: Drivers
>            Version: 2.5
>     Kernel Version: 5.17.0-rc1 upstream
>           Hardware: x86-64
>                 OS: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: PCI
>           Assignee: drivers_pci@kernel-bugs.osdl.org
>           Reporter: blazej.kucman@intel.com
>         Regression: No
> 
> Created attachment 300308
>   --> https://bugzilla.kernel.org/attachment.cgi?id=300308&action=edit
> dmesg
> 
> While testing on the latest upstream
> kernel (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/) we
> noticed that with the merge commit
> (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d0a231f01e5b25bacd23e6edc7c979a18a517b2b)
> hotplug and hot-unplug of NVMe drives stopped working.
> 
> Rescan PCI does not help.
> echo "1" > /sys/bus/pci/rescan
> 
> Issue does not reproduce on a kernel built on an antecedent
> commit(88db8458086b1dcf20b56682504bdb34d2bca0e2).
> 
> 
> During hot-remove the device does not disappear; however, when we try to do
> I/O on the disk, there is an I/O error and the device disappears.
> 
> Before the I/O, no messages about the disk appeared in dmesg; only after the
> I/O did entries like the ones below appear:
> [  177.943703] nvme nvme5: controller is down; will reset: CSTS=0xffffffff,
> PCI_STATUS=0xffff
> [  177.971661] nvme 10000:0b:00.0: can't change power state from D3cold to D0
> (config space inaccessible)
> [  177.981121] pcieport 10000:00:02.0: can't derive routing for PCI INT A
> [  177.987749] nvme 10000:0b:00.0: PCI INT A: no GSI
> [  177.992633] nvme nvme5: Removing after probe failure status: -19
> [  178.004633] nvme5n1: detected capacity change from 83984375 to 0
> [  178.004677] I/O error, dev nvme5n1, sector 0 op 0x0:(READ) flags 0x0
> phys_seg 1 prio class 0
> 
> 
> OS: RHEL 8.4 GA
> Platform: Intel Purley
> 
> The logs were collected on a non-recent upstream kernel, but the issue also
> occurs on the newest upstream kernel (dd81e1c7d5fb126e5fbc5c9e334d7b3ec29a16a0).

Apparently worked immediately before merging the PCI changes for
v5.17 and failed immediately after:

  good: 88db8458086b ("Merge tag 'exfat-for-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat")
  bad:  d0a231f01e5b ("Merge tag 'pci-v5.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci")

Only three commits touch pciehp:

  085a9f43433f ("PCI: pciehp: Use down_read/write_nested(reset_lock) to fix lockdep errors")
  23584c1ed3e1 ("PCI: pciehp: Fix infinite loop in IRQ handler upon power fault")
  a3b0f10db148 ("PCI: pciehp: Use PCI_POSSIBLE_ERROR() to check config reads")

None seems obviously related to me.  Blazej, could you try setting
CONFIG_DYNAMIC_DEBUG=y and booting with 'dyndbg="file pciehp* +p"' to
enable more debug messages?
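
For anyone reproducing this, a minimal sketch of both ways to enable
those messages (assuming CONFIG_DYNAMIC_DEBUG=y and debugfs mounted at
/sys/kernel/debug):

  # at boot, on the kernel command line
  dyndbg="file pciehp* +p"

  # or at runtime, without rebooting
  echo 'file pciehp* +p' > /sys/kernel/debug/dynamic_debug/control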

Bjorn

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Bug 215525] New: HotPlug does not work on upstream kernel 5.17.0-rc1
  2022-01-24 21:46 ` [Bug 215525] New: HotPlug does not work on upstream kernel 5.17.0-rc1 Bjorn Helgaas
@ 2022-01-25  8:58   ` Hans de Goede
  2022-01-25 15:33   ` Lukas Wunner
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 20+ messages in thread
From: Hans de Goede @ 2022-01-25  8:58 UTC (permalink / raw)
  To: Bjorn Helgaas, linux-pci
  Cc: Blazej Kucman, Lukas Wunner, Naveen Naidu, Keith Busch,
	Nirmal Patel, Jonathan Derrick

Hi,

On 1/24/22 22:46, Bjorn Helgaas wrote:
> [+cc linux-pci, Hans, Lukas, Naveen, Keith, Nirmal, Jonathan]
> 
> On Mon, Jan 24, 2022 at 11:46:14AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
>> https://bugzilla.kernel.org/show_bug.cgi?id=215525
>>
>>             Bug ID: 215525
>>            Summary: HotPlug does not work on upstream kernel 5.17.0-rc1
>>            Product: Drivers
>>            Version: 2.5
>>     Kernel Version: 5.17.0-rc1 upstream
>>           Hardware: x86-64
>>                 OS: Linux
>>               Tree: Mainline
>>             Status: NEW
>>           Severity: normal
>>           Priority: P1
>>          Component: PCI
>>           Assignee: drivers_pci@kernel-bugs.osdl.org
>>           Reporter: blazej.kucman@intel.com
>>         Regression: No
>>
>> Created attachment 300308
>>   --> https://bugzilla.kernel.org/attachment.cgi?id=300308&action=edit
>> dmesg
>>
>> While testing on latest upstream
>> kernel(https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/) we
>> noticed that with the merge commit
>> (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d0a231f01e5b25bacd23e6edc7c979a18a517b2b)
>> hotplug and hotunplug of nvme drives stopped working.
>>
>> Rescan PCI does not help.
>> echo "1" > /sys/bus/pci/rescan
>>
>> Issue does not reproduce on a kernel built on an antecedent
>> commit(88db8458086b1dcf20b56682504bdb34d2bca0e2).
>>
>>
>> During hot-remove device does not disappear, however when we try to do I/O on
>> the disk then there is an I/O error, and the device disappears.
>>
>> Before I/O no logs regarding the disk appeared in the dmesg, only after I/O the
>> entries appeared like below:
>> [  177.943703] nvme nvme5: controller is down; will reset: CSTS=0xffffffff,
>> PCI_STATUS=0xffff
>> [  177.971661] nvme 10000:0b:00.0: can't change power state from D3cold to D0
>> (config space inaccessible)
>> [  177.981121] pcieport 10000:00:02.0: can't derive routing for PCI INT A
>> [  177.987749] nvme 10000:0b:00.0: PCI INT A: no GSI
>> [  177.992633] nvme nvme5: Removing after probe failure status: -19
>> [  178.004633] nvme5n1: detected capacity change from 83984375 to 0
>> [  178.004677] I/O error, dev nvme5n1, sector 0 op 0x0:(READ) flags 0x0
>> phys_seg 1 prio class 0
>>
>>
>> OS: RHEL 8.4 GA
>> Platform: Intel Purley
>>
>> The logs are collected on a non-recent upstream kernel, but a issue also occurs
>> on the newest upstream kernel(dd81e1c7d5fb126e5fbc5c9e334d7b3ec29a16a0)
> 
> Apparently worked immediately before merging the PCI changes for
> v5.17 and failed immediately after:
> 
>   good: 88db8458086b ("Merge tag 'exfat-for-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat")
>   bad:  d0a231f01e5b ("Merge tag 'pci-v5.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci")
> 
> Only three commits touch pciehp:
> 
>   085a9f43433f ("PCI: pciehp: Use down_read/write_nested(reset_lock) to fix lockdep errors")
>   23584c1ed3e1 ("PCI: pciehp: Fix infinite loop in IRQ handler upon power fault")
>   a3b0f10db148 ("PCI: pciehp: Use PCI_POSSIBLE_ERROR() to check config reads")
> 
> None seems obviously related to me.  Blazej, could you try setting
> CONFIG_DYNAMIC_DEBUG=y and booting with 'dyndbg="file pciehp* +p"' to
> enable more debug messages?

Since there are only 3 commits, maybe try reverting them one by one in reverse
history order (so revert the latest commit first) and see if running a kernel
with the reverted commit(s) fixes things?
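
For example, starting from the bad merge and reverting one commit at a
time (the order below is illustrative, not necessarily the actual git
history order), rebuilding and retesting hotplug after each step:

  git revert --no-edit a3b0f10db148
  git revert --no-edit 23584c1ed3e1
  git revert --no-edit 085a9f43433f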

Regards,

Hans


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Bug 215525] New: HotPlug does not work on upstream kernel 5.17.0-rc1
  2022-01-24 21:46 ` [Bug 215525] New: HotPlug does not work on upstream kernel 5.17.0-rc1 Bjorn Helgaas
  2022-01-25  8:58   ` Hans de Goede
@ 2022-01-25 15:33   ` Lukas Wunner
  2022-01-26  7:31   ` Thorsten Leemhuis
  2022-01-27 14:46   ` Mariusz Tkaczyk
  3 siblings, 0 replies; 20+ messages in thread
From: Lukas Wunner @ 2022-01-25 15:33 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, Blazej Kucman, Hans de Goede, Naveen Naidu,
	Keith Busch, Nirmal Patel, Jonathan Derrick

On Mon, Jan 24, 2022 at 03:46:35PM -0600, Bjorn Helgaas wrote:
> On Mon, Jan 24, 2022 at 11:46:14AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> > https://bugzilla.kernel.org/show_bug.cgi?id=215525
> > 
> > While testing on latest upstream kernel we noticed that with the
> > merge commit d0a231f01e5b hotplug and hotunplug of nvme drives
> > stopped working.
[...]
> Only three commits touch pciehp:
> 
>   085a9f43433f ("PCI: pciehp: Use down_read/write_nested(reset_lock) to fix lockdep errors")
>   23584c1ed3e1 ("PCI: pciehp: Fix infinite loop in IRQ handler upon power fault")
>   a3b0f10db148 ("PCI: pciehp: Use PCI_POSSIBLE_ERROR() to check config reads")

Those commits pertain to *native* hotplug; however, the machine in question
does not grant hotplug control to OSPM, so pciehp isn't even probed for
any ports on that machine:

  acpi PNP0A08:09: _OSC: OS supports [ASPM ClockPM Segments MSI HPX-Type3]
  acpi PNP0A08:09: _OSC: not requesting OS control; OS requires [ExtendedConfig ASPM ClockPM MSI]

Are these ports supposed to be handled by native hotplug or acpiphp?
Perhaps CONFIG_HOTPLUG_PCI_PCIE was erroneously not enabled?

It's unfortunate that the bugzilla only contains the dmesg dump of
broken hotplug but not of working hotplug; having both would make it
easier to determine what's going wrong.

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Bug 215525] New: HotPlug does not work on upstream kernel 5.17.0-rc1
  2022-01-24 21:46 ` [Bug 215525] New: HotPlug does not work on upstream kernel 5.17.0-rc1 Bjorn Helgaas
  2022-01-25  8:58   ` Hans de Goede
  2022-01-25 15:33   ` Lukas Wunner
@ 2022-01-26  7:31   ` Thorsten Leemhuis
  2022-01-27 14:46   ` Mariusz Tkaczyk
  3 siblings, 0 replies; 20+ messages in thread
From: Thorsten Leemhuis @ 2022-01-26  7:31 UTC (permalink / raw)
  To: Bjorn Helgaas, linux-pci
  Cc: Blazej Kucman, Hans de Goede, Lukas Wunner, Naveen Naidu,
	Keith Busch, Nirmal Patel, Jonathan Derrick, regressions


[TLDR: I'm adding this regression to regzbot, the Linux kernel
regression tracking bot; most of the text you find below is compiled
from a few template paragraphs some of you might have seen already.]

Hi, this is your Linux kernel regression tracker speaking.

Adding the regression mailing list to the list of recipients, as it
should be in the loop for all regressions, as explained here:
https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html

On 24.01.22 22:46, Bjorn Helgaas wrote:
> [+cc linux-pci, Hans, Lukas, Naveen, Keith, Nirmal, Jonathan]
> 
> On Mon, Jan 24, 2022 at 11:46:14AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
>> https://bugzilla.kernel.org/show_bug.cgi?id=215525
>>
>>             Bug ID: 215525
>>            Summary: HotPlug does not work on upstream kernel 5.17.0-rc1
>>            Product: Drivers
>>            Version: 2.5
>>     Kernel Version: 5.17.0-rc1 upstream
>>           Hardware: x86-64
>>                 OS: Linux
>>               Tree: Mainline
>>             Status: NEW
>>           Severity: normal
>>           Priority: P1
>>          Component: PCI
>>           Assignee: drivers_pci@kernel-bugs.osdl.org
>>           Reporter: blazej.kucman@intel.com
>>         Regression: No
>>
>> Created attachment 300308
>>   --> https://bugzilla.kernel.org/attachment.cgi?id=300308&action=edit
>> dmesg
>>
>> While testing on latest upstream
>> kernel(https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/) we
>> noticed that with the merge commit
>> (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d0a231f01e5b25bacd23e6edc7c979a18a517b2b)
>> hotplug and hotunplug of nvme drives stopped working.
>>
>> Rescan PCI does not help.
>> echo "1" > /sys/bus/pci/rescan
>>
>> Issue does not reproduce on a kernel built on an antecedent
>> commit(88db8458086b1dcf20b56682504bdb34d2bca0e2).
>>
>>
>> During hot-remove device does not disappear, however when we try to do I/O on
>> the disk then there is an I/O error, and the device disappears.
>>
>> Before I/O no logs regarding the disk appeared in the dmesg, only after I/O the
>> entries appeared like below:
>> [  177.943703] nvme nvme5: controller is down; will reset: CSTS=0xffffffff,
>> PCI_STATUS=0xffff
>> [  177.971661] nvme 10000:0b:00.0: can't change power state from D3cold to D0
>> (config space inaccessible)
>> [  177.981121] pcieport 10000:00:02.0: can't derive routing for PCI INT A
>> [  177.987749] nvme 10000:0b:00.0: PCI INT A: no GSI
>> [  177.992633] nvme nvme5: Removing after probe failure status: -19
>> [  178.004633] nvme5n1: detected capacity change from 83984375 to 0
>> [  178.004677] I/O error, dev nvme5n1, sector 0 op 0x0:(READ) flags 0x0
>> phys_seg 1 prio class 0
>>
>>
>> OS: RHEL 8.4 GA
>> Platform: Intel Purley
>>
>> The logs are collected on a non-recent upstream kernel, but a issue also occurs
>> on the newest upstream kernel(dd81e1c7d5fb126e5fbc5c9e334d7b3ec29a16a0)
> 
> Apparently worked immediately before merging the PCI changes for
> v5.17 and failed immediately after:
> 
>   good: 88db8458086b ("Merge tag 'exfat-for-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat")
>   bad:  d0a231f01e5b ("Merge tag 'pci-v5.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci")
> 
> Only three commits touch pciehp:
> 
>   085a9f43433f ("PCI: pciehp: Use down_read/write_nested(reset_lock) to fix lockdep errors")
>   23584c1ed3e1 ("PCI: pciehp: Fix infinite loop in IRQ handler upon power fault")
>   a3b0f10db148 ("PCI: pciehp: Use PCI_POSSIBLE_ERROR() to check config reads")
> 
> None seems obviously related to me.  Blazej, could you try setting
> CONFIG_DYNAMIC_DEBUG=y and booting with 'dyndbg="file pciehp* +p"' to
> enable more debug messages?

To be sure this issue doesn't fall through the cracks unnoticed, I'm
adding it to regzbot, my Linux kernel regression tracking bot:

#regzbot ^introduced: d0a231f01e5b25bacd23e
#regzbot link: https://bugzilla.kernel.org/show_bug.cgi?id=215525
#regzbot from: blazej.kucman@intel.com
#regzbot title: pci: HotPlug does not work on upstream kernel 5.17.0-rc1

Ciao, Thorsten (wearing his 'Linux kernel regression tracker' hat)

P.S.: As a Linux kernel regression tracker I get a lot of reports on my
table. I can only look briefly into most of them, so unfortunately I
will sometimes get things wrong or miss something important. I hope
that's not the case here; if you think it is, don't hesitate to tell me
in a public reply, as that's in everyone's interest.

BTW, I have no personal interest in this issue, which is tracked using
regzbot, my Linux kernel regression tracking bot
(https://linux-regtracking.leemhuis.info/regzbot/). I'm only posting
this mail to get things rolling again and hence don't need to be CC'd
on all further activities with regard to this regression.

---
Additional information about regzbot:

If you want to know more about regzbot, check out its web interface,
the getting started guide, and/or the reference documentation:

https://linux-regtracking.leemhuis.info/regzbot/
https://gitlab.com/knurd42/regzbot/-/blob/main/docs/getting_started.md
https://gitlab.com/knurd42/regzbot/-/blob/main/docs/reference.md

The last two documents explain how you can interact with regzbot
yourself if you want to.

Hint for reporters: when reporting a regression, it's in your interest
to tell #regzbot about it in the report, as that will ensure the
regression gets on the radar of regzbot and the regression tracker, who
will make sure the report won't fall through the cracks unnoticed.

Hint for developers: you normally don't need to care about regzbot once
it's involved. Fix the issue as you normally would; just remember to
include a 'Link:' tag to the report in the commit message, as explained
in Documentation/process/submitting-patches.rst.
That aspect was recently made more explicit in commit 1f57bd42b77c:
https://git.kernel.org/linus/1f57bd42b77c

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Bug 215525] New: HotPlug does not work on upstream kernel 5.17.0-rc1
  2022-01-24 21:46 ` [Bug 215525] New: HotPlug does not work on upstream kernel 5.17.0-rc1 Bjorn Helgaas
                     ` (2 preceding siblings ...)
  2022-01-26  7:31   ` Thorsten Leemhuis
@ 2022-01-27 14:46   ` Mariusz Tkaczyk
  2022-01-27 20:47     ` Jonathan Derrick
                       ` (2 more replies)
  3 siblings, 3 replies; 20+ messages in thread
From: Mariusz Tkaczyk @ 2022-01-27 14:46 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, Blazej Kucman, Hans de Goede, Lukas Wunner,
	Naveen Naidu, Keith Busch, Nirmal Patel, Jonathan Derrick

On Mon, 24 Jan 2022 15:46:35 -0600
Bjorn Helgaas <helgaas@kernel.org> wrote:

> [+cc linux-pci, Hans, Lukas, Naveen, Keith, Nirmal, Jonathan]
> 
> On Mon, Jan 24, 2022 at 11:46:14AM +0000,
> bugzilla-daemon@bugzilla.kernel.org wrote:
> > https://bugzilla.kernel.org/show_bug.cgi?id=215525
> > 
> >             Bug ID: 215525
> >            Summary: HotPlug does not work on upstream kernel 5.17.0-rc1
> >            Product: Drivers
> >            Version: 2.5
> >     Kernel Version: 5.17.0-rc1 upstream
> >           Hardware: x86-64
> >                 OS: Linux
> >               Tree: Mainline
> >             Status: NEW
> >           Severity: normal
> >           Priority: P1
> >          Component: PCI
> >           Assignee: drivers_pci@kernel-bugs.osdl.org
> >           Reporter: blazej.kucman@intel.com
> >         Regression: No
> > 
> > Created attachment 300308  
> >   -->
> > https://bugzilla.kernel.org/attachment.cgi?id=300308&action=edit
> > dmesg
> > 
> > While testing on latest upstream
> > kernel(https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/)
> > we noticed that with the merge commit
> > (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d0a231f01e5b25bacd23e6edc7c979a18a517b2b)
> > hotplug and hotunplug of nvme drives stopped working.
> > 
> > Rescan PCI does not help.
> > echo "1" > /sys/bus/pci/rescan
> > 
> > Issue does not reproduce on a kernel built on an antecedent
> > commit(88db8458086b1dcf20b56682504bdb34d2bca0e2).
> > 
> > 
> > During hot-remove device does not disappear, however when we try to
> > do I/O on the disk then there is an I/O error, and the device
> > disappears.
> > 
> > Before I/O no logs regarding the disk appeared in the dmesg, only
> > after I/O the entries appeared like below:
> > [  177.943703] nvme nvme5: controller is down; will reset:
> > CSTS=0xffffffff, PCI_STATUS=0xffff
> > [  177.971661] nvme 10000:0b:00.0: can't change power state from
> > D3cold to D0 (config space inaccessible)
> > [  177.981121] pcieport 10000:00:02.0: can't derive routing for PCI
> > INT A [  177.987749] nvme 10000:0b:00.0: PCI INT A: no GSI
> > [  177.992633] nvme nvme5: Removing after probe failure status: -19
> > [  178.004633] nvme5n1: detected capacity change from 83984375 to 0
> > [  178.004677] I/O error, dev nvme5n1, sector 0 op 0x0:(READ) flags
> > 0x0 phys_seg 1 prio class 0
> > 
> > 
> > OS: RHEL 8.4 GA
> > Platform: Intel Purley
> > 
> > The logs are collected on a non-recent upstream kernel, but a issue
> > also occurs on the newest upstream
> > kernel(dd81e1c7d5fb126e5fbc5c9e334d7b3ec29a16a0)  
> 
> Apparently worked immediately before merging the PCI changes for
> v5.17 and failed immediately after:
> 
>   good: 88db8458086b ("Merge tag 'exfat-for-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat")
>   bad:  d0a231f01e5b ("Merge tag 'pci-v5.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci")
> 
> Only three commits touch pciehp:
> 
>   085a9f43433f ("PCI: pciehp: Use down_read/write_nested(reset_lock) to fix lockdep errors")
>   23584c1ed3e1 ("PCI: pciehp: Fix infinite loop in IRQ handler upon power fault")
>   a3b0f10db148 ("PCI: pciehp: Use PCI_POSSIBLE_ERROR() to check config reads")
> 
> None seems obviously related to me.  Blazej, could you try setting
> CONFIG_DYNAMIC_DEBUG=y and booting with 'dyndbg="file pciehp* +p"' to
> enable more debug messages?
> 

Hi Bjorn,

Thanks for your suggestions. Blazej did some tests and the results were
inconclusive. He tested on two identical platforms. On the first one
hotplug didn't work, even after he reverted all of the suggested
patches. On the second one hotplug always worked.

He noticed that on the first platform, where the issue was found
initially, the boot parameter "pci=nommconf" was set. After adding this
parameter on the second platform, hotplug stopped working there too.

He tested on tag pci-v5.17-changes, with CONFIG_HOTPLUG_PCI_PCIE and
CONFIG_DYNAMIC_DEBUG enabled in the config. He also attached two dmesg
logs to bugzilla with the boot parameter 'dyndbg="file pciehp* +p"' as
requested, one with "pci=nommconf" and one without.

The issue seems to be related to "pci=nommconf" and is probably caused
by a change outside pciehp.

He is currently setting up his email client so that he can reply
himself.

Thanks,
Mariusz



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Bug 215525] New: HotPlug does not work on upstream kernel 5.17.0-rc1
  2022-01-27 14:46   ` Mariusz Tkaczyk
@ 2022-01-27 20:47     ` Jonathan Derrick
  2022-01-27 22:31     ` Jonathan Derrick
  2022-01-28  2:52     ` Bjorn Helgaas
  2 siblings, 0 replies; 20+ messages in thread
From: Jonathan Derrick @ 2022-01-27 20:47 UTC (permalink / raw)
  To: Mariusz Tkaczyk, Bjorn Helgaas
  Cc: linux-pci, Blazej Kucman, Hans de Goede, Lukas Wunner,
	Naveen Naidu, Keith Busch, Nirmal Patel



On 1/27/2022 7:46 AM, Mariusz Tkaczyk wrote:
> On Mon, 24 Jan 2022 15:46:35 -0600
> Bjorn Helgaas <helgaas@kernel.org> wrote:
> 
>> [+cc linux-pci, Hans, Lukas, Naveen, Keith, Nirmal, Jonathan]
>>
>> On Mon, Jan 24, 2022 at 11:46:14AM +0000,
>> bugzilla-daemon@bugzilla.kernel.org wrote:
>>> https://bugzilla.kernel.org/show_bug.cgi?id=215525
>>>
>>>              Bug ID: 215525
>>>             Summary: HotPlug does not work on upstream kernel 5.17.0-rc1
>>>             Product: Drivers
>>>             Version: 2.5
>>>      Kernel Version: 5.17.0-rc1 upstream
>>>            Hardware: x86-64
>>>                  OS: Linux
>>>                Tree: Mainline
>>>              Status: NEW
>>>            Severity: normal
>>>            Priority: P1
>>>           Component: PCI
>>>            Assignee: drivers_pci@kernel-bugs.osdl.org
>>>            Reporter: blazej.kucman@intel.com
>>>          Regression: No
>>>
>>> Created attachment 300308
>>>    -->
>>> https://bugzilla.kernel.org/attachment.cgi?id=300308&action=edit
>>> dmesg
>>>
>>> While testing on latest upstream
>>> kernel(https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/)
>>> we noticed that with the merge commit
>>> (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d0a231f01e5b25bacd23e6edc7c979a18a517b2b)
>>> hotplug and hotunplug of nvme drives stopped working.
>>>
>>> Rescan PCI does not help.
>>> echo "1" > /sys/bus/pci/rescan
>>>
>>> Issue does not reproduce on a kernel built on an antecedent
>>> commit(88db8458086b1dcf20b56682504bdb34d2bca0e2).
>>>
>>>
>>> During hot-remove device does not disappear, however when we try to
>>> do I/O on the disk then there is an I/O error, and the device
>>> disappears.
>>>
>>> Before I/O no logs regarding the disk appeared in the dmesg, only
>>> after I/O the entries appeared like below:
>>> [  177.943703] nvme nvme5: controller is down; will reset:
>>> CSTS=0xffffffff, PCI_STATUS=0xffff
>>> [  177.971661] nvme 10000:0b:00.0: can't change power state from
>>> D3cold to D0 (config space inaccessible)
>>> [  177.981121] pcieport 10000:00:02.0: can't derive routing for PCI
>>> INT A [  177.987749] nvme 10000:0b:00.0: PCI INT A: no GSI
>>> [  177.992633] nvme nvme5: Removing after probe failure status: -19
>>> [  178.004633] nvme5n1: detected capacity change from 83984375 to 0
>>> [  178.004677] I/O error, dev nvme5n1, sector 0 op 0x0:(READ) flags
>>> 0x0 phys_seg 1 prio class 0
>>>
>>>
>>> OS: RHEL 8.4 GA
>>> Platform: Intel Purley
>>>
>>> The logs are collected on a non-recent upstream kernel, but a issue
>>> also occurs on the newest upstream
>>> kernel(dd81e1c7d5fb126e5fbc5c9e334d7b3ec29a16a0)
>>
>> Apparently worked immediately before merging the PCI changes for
>> v5.17 and failed immediately after:
>>
>>    good: 88db8458086b ("Merge tag 'exfat-for-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat")
>>    bad:  d0a231f01e5b ("Merge tag 'pci-v5.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci")
>>
>> Only three commits touch pciehp:
>>
>>    085a9f43433f ("PCI: pciehp: Use down_read/write_nested(reset_lock) to fix lockdep errors")
>>    23584c1ed3e1 ("PCI: pciehp: Fix infinite loop in IRQ handler upon power fault")
>>    a3b0f10db148 ("PCI: pciehp: Use PCI_POSSIBLE_ERROR() to check config reads")
>>
>> None seems obviously related to me.  Blazej, could you try setting
>> CONFIG_DYNAMIC_DEBUG=y and booting with 'dyndbg="file pciehp* +p"' to
>> enable more debug messages?
>>
> 
> Hi Bjorn,
> 
> Thanks for your suggestions. Blazej did some tests and results were
> inconclusive. He tested it on two same platforms. On the first one it
> didn't work, even if he reverted all suggested patches. On the second
> one hotplugs always worked.
> 
> He noticed that on first platform where issue has been found initally,
> there was boot parameter "pci=nommconf". After adding this parameter
> on the second platform, hotplugs stopped working too.
> 
> Tested on tag pci-v5.17-changes. He have CONFIG_HOTPLUG_PCI_PCIE
> and CONFIG_DYNAMIC_DEBUG enabled in config. He also attached two dmesg
> logs to bugzilla with boot parameter 'dyndbg="file pciehp* +p" as
> requested. One with "pci=nommconf" and one without.
> 
> Issue seems to related to "pci=nommconf" and it is probably caused
> by change outside pciehp.

Could it be related to this?

int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn,
		 int reg, int len, u32 *val)
{
	if (domain == 0 && reg < 256 && raw_pci_ops)
		return raw_pci_ops->read(domain, bus, devfn, reg, len, val);
	if (raw_pci_ext_ops)
		return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
	return -EINVAL;
}

It looks like raw_pci_ext_ops won't be set with nommconf, and VMD 
subdevice domain will be > 0.
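
To make that concrete, here is a toy model (illustrative only, not the
kernel source) of what the dispatch above means for the NVMe device from
the logs, which sits in VMD domain 0x10000:

/* Toy model of the raw_pci_read() dispatch quoted above; not kernel code.
 * With "pci=nommconf" raw_pci_ext_ops stays NULL, and a VMD child device
 * lives in a synthetic domain > 0, so neither branch is taken and every
 * config space read fails with -EINVAL.
 */
#include <stdio.h>

#define EINVAL 22

struct pci_raw_ops {
	int (*read)(unsigned int domain, unsigned int bus, unsigned int devfn,
		    int reg, int len, unsigned int *val);
};

static struct pci_raw_ops *raw_pci_ops;	/* port I/O, domain 0 only      */
static struct pci_raw_ops *raw_pci_ext_ops;	/* MMCONFIG; NULL with nommconf */

static int raw_pci_read(unsigned int domain, unsigned int bus,
			unsigned int devfn, int reg, int len, unsigned int *val)
{
	if (domain == 0 && reg < 256 && raw_pci_ops)
		return raw_pci_ops->read(domain, bus, devfn, reg, len, val);
	if (raw_pci_ext_ops)
		return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
	return -EINVAL;
}

int main(void)
{
	unsigned int val;

	/* 10000:0b:00.0 from the dmesg above: domain 0x10000, bus 0x0b,
	 * reading the 2-byte vendor ID at config offset 0 */
	int ret = raw_pci_read(0x10000, 0x0b, 0x00, 0x00, 2, &val);

	printf("config read returned %d\n", ret);	/* prints -22 (-EINVAL) */
	return 0;
}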


> 
> He is currently working on email client setup to answer himself.
> 
> Thanks,
> Mariusz
> 
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Bug 215525] New: HotPlug does not work on upstream kernel 5.17.0-rc1
  2022-01-27 14:46   ` Mariusz Tkaczyk
  2022-01-27 20:47     ` Jonathan Derrick
@ 2022-01-27 22:31     ` Jonathan Derrick
  2022-01-28  2:52     ` Bjorn Helgaas
  2 siblings, 0 replies; 20+ messages in thread
From: Jonathan Derrick @ 2022-01-27 22:31 UTC (permalink / raw)
  To: Mariusz Tkaczyk, Bjorn Helgaas
  Cc: linux-pci, Blazej Kucman, Hans de Goede, Lukas Wunner,
	Naveen Naidu, Keith Busch, Nirmal Patel


On 1/27/2022 7:46 AM, Mariusz Tkaczyk wrote:
> On Mon, 24 Jan 2022 15:46:35 -0600
> Bjorn Helgaas <helgaas@kernel.org> wrote:
> 
>> [+cc linux-pci, Hans, Lukas, Naveen, Keith, Nirmal, Jonathan]
>>
>> On Mon, Jan 24, 2022 at 11:46:14AM +0000,
>> bugzilla-daemon@bugzilla.kernel.org wrote:
>>> https://bugzilla.kernel.org/show_bug.cgi?id=215525
>>>
>>>              Bug ID: 215525
>>>             Summary: HotPlug does not work on upstream kernel 5.17.0-rc1
>>>             Product: Drivers
>>>             Version: 2.5
>>>      Kernel Version: 5.17.0-rc1 upstream
>>>            Hardware: x86-64
>>>                  OS: Linux
>>>                Tree: Mainline
>>>              Status: NEW
>>>            Severity: normal
>>>            Priority: P1
>>>           Component: PCI
>>>            Assignee: drivers_pci@kernel-bugs.osdl.org
>>>            Reporter: blazej.kucman@intel.com
>>>          Regression: No
>>>
>>> Created attachment 300308
>>>    -->
>>> https://bugzilla.kernel.org/attachment.cgi?id=300308&action=edit
>>> dmesg
>>>
>>> While testing on latest upstream
>>> kernel(https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/)
>>> we noticed that with the merge commit
>>> (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d0a231f01e5b25bacd23e6edc7c979a18a517b2b)
>>> hotplug and hotunplug of nvme drives stopped working.
>>>
>>> Rescan PCI does not help.
>>> echo "1" > /sys/bus/pci/rescan
>>>
>>> Issue does not reproduce on a kernel built on an antecedent
>>> commit(88db8458086b1dcf20b56682504bdb34d2bca0e2).
>>>
>>>
>>> During hot-remove device does not disappear, however when we try to
>>> do I/O on the disk then there is an I/O error, and the device
>>> disappears.
>>>
>>> Before I/O no logs regarding the disk appeared in the dmesg, only
>>> after I/O the entries appeared like below:
>>> [  177.943703] nvme nvme5: controller is down; will reset:
>>> CSTS=0xffffffff, PCI_STATUS=0xffff
>>> [  177.971661] nvme 10000:0b:00.0: can't change power state from
>>> D3cold to D0 (config space inaccessible)
>>> [  177.981121] pcieport 10000:00:02.0: can't derive routing for PCI
>>> INT A [  177.987749] nvme 10000:0b:00.0: PCI INT A: no GSI
>>> [  177.992633] nvme nvme5: Removing after probe failure status: -19
>>> [  178.004633] nvme5n1: detected capacity change from 83984375 to 0
>>> [  178.004677] I/O error, dev nvme5n1, sector 0 op 0x0:(READ) flags
>>> 0x0 phys_seg 1 prio class 0
>>>
>>>
>>> OS: RHEL 8.4 GA
>>> Platform: Intel Purley
>>>
>>> The logs are collected on a non-recent upstream kernel, but a issue
>>> also occurs on the newest upstream
>>> kernel(dd81e1c7d5fb126e5fbc5c9e334d7b3ec29a16a0)
>>
>> Apparently worked immediately before merging the PCI changes for
>> v5.17 and failed immediately after:
>>
>>    good: 88db8458086b ("Merge tag 'exfat-for-5.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat")
>>    bad:  d0a231f01e5b ("Merge tag 'pci-v5.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci")
>>
>> Only three commits touch pciehp:
>>
>>    085a9f43433f ("PCI: pciehp: Use down_read/write_nested(reset_lock) to fix lockdep errors")
>>    23584c1ed3e1 ("PCI: pciehp: Fix infinite loop in IRQ handler upon power fault")
>>    a3b0f10db148 ("PCI: pciehp: Use PCI_POSSIBLE_ERROR() to check config reads")
>>
>> None seems obviously related to me.  Blazej, could you try setting
>> CONFIG_DYNAMIC_DEBUG=y and booting with 'dyndbg="file pciehp* +p"' to
>> enable more debug messages?
>>
> 
> Hi Bjorn,
> 
> Thanks for your suggestions. Blazej did some tests and results were
> inconclusive. He tested it on two same platforms. On the first one it
> didn't work, even if he reverted all suggested patches. On the second
> one hotplugs always worked.
> 
> He noticed that on first platform where issue has been found initally,
> there was boot parameter "pci=nommconf". After adding this parameter
> on the second platform, hotplugs stopped working too.
> 
> Tested on tag pci-v5.17-changes. He have CONFIG_HOTPLUG_PCI_PCIE
> and CONFIG_DYNAMIC_DEBUG enabled in config. He also attached two dmesg
> logs to bugzilla with boot parameter 'dyndbg="file pciehp* +p" as
> requested. One with "pci=nommconf" and one without.
> 
> Issue seems to related to "pci=nommconf" and it is probably caused
> by change outside pciehp.

Could it be related to this?

int raw_pci_read(unsigned int domain, unsigned int bus, unsigned int devfn, int reg, int len, u32 *val)
{
	if (domain == 0 && reg < 256 && raw_pci_ops)
		return raw_pci_ops->read(domain, bus, devfn, reg, len, val);
	if (raw_pci_ext_ops)
		return raw_pci_ext_ops->read(domain, bus, devfn, reg, len, val);
	return -EINVAL;
}

It looks like raw_pci_ext_ops won't be set with nommconf, and VMD subdevice domain will be > 0.


> 
> He is currently working on email client setup to answer himself.
> 
> Thanks,
> Mariusz
> 
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Bug 215525] New: HotPlug does not work on upstream kernel 5.17.0-rc1
  2022-01-27 14:46   ` Mariusz Tkaczyk
  2022-01-27 20:47     ` Jonathan Derrick
  2022-01-27 22:31     ` Jonathan Derrick
@ 2022-01-28  2:52     ` Bjorn Helgaas
  2022-01-28  8:29       ` Mariusz Tkaczyk
  2 siblings, 1 reply; 20+ messages in thread
From: Bjorn Helgaas @ 2022-01-28  2:52 UTC (permalink / raw)
  To: Mariusz Tkaczyk
  Cc: linux-pci, Blazej Kucman, Hans de Goede, Lukas Wunner,
	Naveen Naidu, Keith Busch, Nirmal Patel, Jonathan Derrick

On Thu, Jan 27, 2022 at 03:46:15PM +0100, Mariusz Tkaczyk wrote:
> ...
> Thanks for your suggestions. Blazej did some tests and results were
> inconclusive. He tested it on two same platforms. On the first one it
> didn't work, even if he reverted all suggested patches. On the second
> one hotplugs always worked.
> 
> He noticed that on first platform where issue has been found initally,
> there was boot parameter "pci=nommconf". After adding this parameter
> on the second platform, hotplugs stopped working too.
> 
> Tested on tag pci-v5.17-changes. He have CONFIG_HOTPLUG_PCI_PCIE
> and CONFIG_DYNAMIC_DEBUG enabled in config. He also attached two dmesg
> logs to bugzilla with boot parameter 'dyndbg="file pciehp* +p" as
> requested. One with "pci=nommconf" and one without.
> 
> Issue seems to related to "pci=nommconf" and it is probably caused
> by change outside pciehp.

Maybe I'm missing something.  If I understand correctly, the problem
has nothing to do with the kernel version (correct me if I'm wrong!)

PCIe native hotplug doesn't work when booted with "pci=nommconf".
When using "pci=nommconf", obviously we can't access the extended PCI
config space (offset 0x100-0xfff), so none of the extended
capabilities are available.

In that case, we don't even ask the platform for control of PCIe
hotplug via _OSC.  From the dmesg diff from normal (working) to
"pci=nommconf" (not working):

  -Command line: BOOT_IMAGE=/boot/vmlinuz-smp ...
  +Command line: BOOT_IMAGE=/boot/vmlinuz-smp pci=nommconf ...
  ...
  -acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
  -acpi PNP0A08:00: _OSC: platform does not support [AER LTR]
  -acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug PME PCIeCapability]
  +acpi PNP0A08:00: _OSC: OS supports [ASPM ClockPM Segments MSI HPX-Type3]
  +acpi PNP0A08:00: _OSC: not requesting OS control; OS requires [ExtendedConfig ASPM ClockPM MSI]
  +acpi PNP0A08:00: MMCONFIG is disabled, can't access extended PCI configuration space under this bridge.
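
The "not requesting OS control" line is the key one.  Roughly (paraphrasing
the negotiation in drivers/acpi/pci_root.c; exact flag names and flow may
differ by kernel version), the OS only asks for PCIe feature control such as
PCIeHotplug when it can advertise a minimum set of capabilities, and
ExtendedConfig is part of that set:

#define OSC_PCI_EXT_CONFIG_SUPPORT	0x00000001
#define OSC_PCI_ASPM_SUPPORT		0x00000002
#define OSC_PCI_CLOCK_PM_SUPPORT	0x00000004
#define OSC_PCI_MSI_SUPPORT		0x00000010

#define ACPI_PCIE_REQ_SUPPORT	(OSC_PCI_EXT_CONFIG_SUPPORT | OSC_PCI_ASPM_SUPPORT | \
				 OSC_PCI_CLOCK_PM_SUPPORT | OSC_PCI_MSI_SUPPORT)

/* Sketch: with "pci=nommconf" pci_ext_cfg_avail() is false, so
 * OSC_PCI_EXT_CONFIG_SUPPORT is never added to "support", this check
 * fails, and the kernel never requests PCIeHotplug control at all --
 * leaving pciehp with nothing to drive.
 */
static int may_request_pcie_control(unsigned int support)
{
	return (support & ACPI_PCIE_REQ_SUPPORT) == ACPI_PCIE_REQ_SUPPORT;
}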

Why are you using "pci=nommconf"?  As far as I know, there's no reason
to use that except to work around some kind of defect.

Bjorn

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Bug 215525] New: HotPlug does not work on upstream kernel 5.17.0-rc1
  2022-01-28  2:52     ` Bjorn Helgaas
@ 2022-01-28  8:29       ` Mariusz Tkaczyk
  2022-01-28 13:08         ` Bjorn Helgaas
  0 siblings, 1 reply; 20+ messages in thread
From: Mariusz Tkaczyk @ 2022-01-28  8:29 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, Blazej Kucman, Hans de Goede, Lukas Wunner,
	Naveen Naidu, Keith Busch, Nirmal Patel, Jonathan Derrick

On Thu, 27 Jan 2022 20:52:12 -0600
Bjorn Helgaas <helgaas@kernel.org> wrote:

> On Thu, Jan 27, 2022 at 03:46:15PM +0100, Mariusz Tkaczyk wrote:
> > ...
> > Thanks for your suggestions. Blazej did some tests and results were
> > inconclusive. He tested it on two same platforms. On the first one
> > it didn't work, even if he reverted all suggested patches. On the
> > second one hotplugs always worked.
> > 
> > He noticed that on first platform where issue has been found
> > initally, there was boot parameter "pci=nommconf". After adding
> > this parameter on the second platform, hotplugs stopped working too.
> > 
> > Tested on tag pci-v5.17-changes. He have CONFIG_HOTPLUG_PCI_PCIE
> > and CONFIG_DYNAMIC_DEBUG enabled in config. He also attached two
> > dmesg logs to bugzilla with boot parameter 'dyndbg="file pciehp*
> > +p" as requested. One with "pci=nommconf" and one without.
> > 
> > Issue seems to related to "pci=nommconf" and it is probably caused
> > by change outside pciehp.  
> 
> Maybe I'm missing something.  If I understand correctly, the problem
> has nothing to do with the kernel version (correct me if I'm wrong!)
> 
Hi Bjorn,

The problem appeared right after the merge commit, so it is some kind
of regression.

> PCIe native hotplug doesn't work when booted with "pci=nommconf".
> When using "pci=nommconf", obviously we can't access the extended PCI
> config space (offset 0x100-0xfff), so none of the extended
> capabilities are available.
> 
> In that case, we don't even ask the platform for control of PCIe
> hotplug via _OSC.  From the dmesg diff from normal (working) to
> "pci=nommconf" (not working):
> 
>   -Command line: BOOT_IMAGE=/boot/vmlinuz-smp ...
>   +Command line: BOOT_IMAGE=/boot/vmlinuz-smp pci=nommconf ...
>   ...
>   -acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
>   -acpi PNP0A08:00: _OSC: platform does not support [AER LTR]
>   -acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug PME PCIeCapability]
>   +acpi PNP0A08:00: _OSC: OS supports [ASPM ClockPM Segments MSI HPX-Type3]
>   +acpi PNP0A08:00: _OSC: not requesting OS control; OS requires [ExtendedConfig ASPM ClockPM MSI]
>   +acpi PNP0A08:00: MMCONFIG is disabled, can't access extended PCI configuration space under this bridge.
> 

So, it shouldn't have worked for years, yet it only broke recently;
that is my only objection. Could you explain why it was working before?
According to what you say, it shouldn't have been. We are using the VMD
driver; does that matter?

I already saw Jonathan's finding; we can check it. But if nommconf is
fundamentally incompatible with hotplug, is that worth pursuing? Maybe
we should accept the regression as the desired behavior.

> Why are you using "pci=nommconf"?  As far as I know, there's no reason
> to use that except to work around some kind of defect.

It was added a long time ago, when it was useful, and it keeps coming
back. We definitely need to get rid of it.

Thanks,
Mariusz


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Bug 215525] New: HotPlug does not work on upstream kernel 5.17.0-rc1
  2022-01-28  8:29       ` Mariusz Tkaczyk
@ 2022-01-28 13:08         ` Bjorn Helgaas
  2022-01-28 13:49           ` Kai-Heng Feng
  0 siblings, 1 reply; 20+ messages in thread
From: Bjorn Helgaas @ 2022-01-28 13:08 UTC (permalink / raw)
  To: Mariusz Tkaczyk
  Cc: linux-pci, Blazej Kucman, Hans de Goede, Lukas Wunner,
	Naveen Naidu, Keith Busch, Nirmal Patel, Jonathan Derrick,
	Kai-Heng Feng

[+cc Kai-Heng]

On Fri, Jan 28, 2022 at 09:29:31AM +0100, Mariusz Tkaczyk wrote:
> On Thu, 27 Jan 2022 20:52:12 -0600
> Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Thu, Jan 27, 2022 at 03:46:15PM +0100, Mariusz Tkaczyk wrote:
> > > ...
> > > Thanks for your suggestions. Blazej did some tests and results were
> > > inconclusive. He tested it on two same platforms. On the first one
> > > it didn't work, even if he reverted all suggested patches. On the
> > > second one hotplugs always worked.
> > > 
> > > He noticed that on first platform where issue has been found
> > > initally, there was boot parameter "pci=nommconf". After adding
> > > this parameter on the second platform, hotplugs stopped working too.
> > > 
> > > Tested on tag pci-v5.17-changes. He have CONFIG_HOTPLUG_PCI_PCIE
> > > and CONFIG_DYNAMIC_DEBUG enabled in config. He also attached two
> > > dmesg logs to bugzilla with boot parameter 'dyndbg="file pciehp*
> > > +p" as requested. One with "pci=nommconf" and one without.
> > > 
> > > Issue seems to related to "pci=nommconf" and it is probably caused
> > > by change outside pciehp.  
> > 
> > Maybe I'm missing something.  If I understand correctly, the problem
> > has nothing to do with the kernel version (correct me if I'm wrong!)
> 
> The problem occurred after the merge commit. It is some kind of
> regression.

The bug report doesn't yet contain the evidence showing this.  It only
contains dmesg logs with "pci=nommconf" where pciehp doesn't work
(which is the expected behavior) and a log without "pci=nommconf"
where pciehp does work (which is again the expected behavior).

> > PCIe native hotplug doesn't work when booted with "pci=nommconf".
> > When using "pci=nommconf", obviously we can't access the extended PCI
> > config space (offset 0x100-0xfff), so none of the extended
> > capabilities are available.
> > 
> > In that case, we don't even ask the platform for control of PCIe
> > hotplug via _OSC.  From the dmesg diff from normal (working) to
> > "pci=nommconf" (not working):
> > 
> >   -Command line: BOOT_IMAGE=/boot/vmlinuz-smp ...
> >   +Command line: BOOT_IMAGE=/boot/vmlinuz-smp pci=nommconf ...
> >   ...
> >   -acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
> >   -acpi PNP0A08:00: _OSC: platform does not support [AER LTR]
> >   -acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug PME PCIeCapability]
> >   +acpi PNP0A08:00: _OSC: OS supports [ASPM ClockPM Segments MSI HPX-Type3]
> >   +acpi PNP0A08:00: _OSC: not requesting OS control; OS requires [ExtendedConfig ASPM ClockPM MSI]
> >   +acpi PNP0A08:00: MMCONFIG is disabled, can't access extended PCI configuration space under this bridge.
> 
> So, it shouldn't work from years but it has been broken recently, that
> is the only objection I have. Could you tell why it was working?
> According to your words- it shouldn't. We are using VMD driver, is that
> matter?

04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features") looks like
it could be related.  Try reverting that commit and see whether it
makes a difference.

Bjorn

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Bug 215525] New: HotPlug does not work on upstream kernel 5.17.0-rc1
  2022-01-28 13:08         ` Bjorn Helgaas
@ 2022-01-28 13:49           ` Kai-Heng Feng
  2022-01-28 14:03             ` Bjorn Helgaas
  0 siblings, 1 reply; 20+ messages in thread
From: Kai-Heng Feng @ 2022-01-28 13:49 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Mariusz Tkaczyk, linux-pci, Blazej Kucman, Hans de Goede,
	Lukas Wunner, Naveen Naidu, Keith Busch, Nirmal Patel,
	Jonathan Derrick

On Fri, Jan 28, 2022 at 9:08 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> [+cc Kai-Heng]
>
> On Fri, Jan 28, 2022 at 09:29:31AM +0100, Mariusz Tkaczyk wrote:
> > On Thu, 27 Jan 2022 20:52:12 -0600
> > Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > On Thu, Jan 27, 2022 at 03:46:15PM +0100, Mariusz Tkaczyk wrote:
> > > > ...
> > > > Thanks for your suggestions. Blazej did some tests and results were
> > > > inconclusive. He tested it on two same platforms. On the first one
> > > > it didn't work, even if he reverted all suggested patches. On the
> > > > second one hotplugs always worked.
> > > >
> > > > He noticed that on first platform where issue has been found
> > > > initally, there was boot parameter "pci=nommconf". After adding
> > > > this parameter on the second platform, hotplugs stopped working too.
> > > >
> > > > Tested on tag pci-v5.17-changes. He have CONFIG_HOTPLUG_PCI_PCIE
> > > > and CONFIG_DYNAMIC_DEBUG enabled in config. He also attached two
> > > > dmesg logs to bugzilla with boot parameter 'dyndbg="file pciehp*
> > > > +p" as requested. One with "pci=nommconf" and one without.
> > > >
> > > > Issue seems to related to "pci=nommconf" and it is probably caused
> > > > by change outside pciehp.
> > >
> > > Maybe I'm missing something.  If I understand correctly, the problem
> > > has nothing to do with the kernel version (correct me if I'm wrong!)
> >
> > The problem occurred after the merge commit. It is some kind of
> > regression.
>
> The bug report doesn't yet contain the evidence showing this.  It only
> contains dmesg logs with "pci=nommconf" where pciehp doesn't work
> (which is the expected behavior) and a log without "pci=nommconf"
> where pciehp does work (which is again the expected behavior).
>
> > > PCIe native hotplug doesn't work when booted with "pci=nommconf".
> > > When using "pci=nommconf", obviously we can't access the extended PCI
> > > config space (offset 0x100-0xfff), so none of the extended
> > > capabilities are available.
> > >
> > > In that case, we don't even ask the platform for control of PCIe
> > > hotplug via _OSC.  From the dmesg diff from normal (working) to
> > > "pci=nommconf" (not working):
> > >
> > >   -Command line: BOOT_IMAGE=/boot/vmlinuz-smp ...
> > >   +Command line: BOOT_IMAGE=/boot/vmlinuz-smp pci=nommconf ...
> > >   ...
> > >   -acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
> > >   -acpi PNP0A08:00: _OSC: platform does not support [AER LTR]
> > >   -acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug PME PCIeCapability]
> > >   +acpi PNP0A08:00: _OSC: OS supports [ASPM ClockPM Segments MSI HPX-Type3]
> > >   +acpi PNP0A08:00: _OSC: not requesting OS control; OS requires [ExtendedConfig ASPM ClockPM MSI]
> > >   +acpi PNP0A08:00: MMCONFIG is disabled, can't access extended PCI configuration space under this bridge.
> >
> > So, it shouldn't work from years but it has been broken recently, that
> > is the only objection I have. Could you tell why it was working?
> > According to your words- it shouldn't. We are using VMD driver, is that
> > matter?
>
> 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features") looks like
> a it could be related.  Try reverting that commit and see whether it
> makes a difference.

The affected NVMe is indeed behind a VMD domain, so I think the commit
can make a difference.

Does VMD behave differently on laptops and servers?
Anyway, I agree that the issue really lies in "pci=nommconf".

Kai-Heng

>
> Bjorn

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Bug 215525] New: HotPlug does not work on upstream kernel 5.17.0-rc1
  2022-01-28 13:49           ` Kai-Heng Feng
@ 2022-01-28 14:03             ` Bjorn Helgaas
  2022-02-02 15:48               ` Blazej Kucman
  0 siblings, 1 reply; 20+ messages in thread
From: Bjorn Helgaas @ 2022-01-28 14:03 UTC (permalink / raw)
  To: Kai-Heng Feng
  Cc: Mariusz Tkaczyk, linux-pci, Blazej Kucman, Hans de Goede,
	Lukas Wunner, Naveen Naidu, Keith Busch, Nirmal Patel,
	Jonathan Derrick

On Fri, Jan 28, 2022 at 09:49:34PM +0800, Kai-Heng Feng wrote:
> On Fri, Jan 28, 2022 at 9:08 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Fri, Jan 28, 2022 at 09:29:31AM +0100, Mariusz Tkaczyk wrote:
> > > On Thu, 27 Jan 2022 20:52:12 -0600
> > > Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > On Thu, Jan 27, 2022 at 03:46:15PM +0100, Mariusz Tkaczyk wrote:
> > > > > ...
> > > > > Thanks for your suggestions. Blazej did some tests and results were
> > > > > inconclusive. He tested it on two same platforms. On the first one
> > > > > it didn't work, even if he reverted all suggested patches. On the
> > > > > second one hotplugs always worked.
> > > > >
> > > > > He noticed that on first platform where issue has been found
> > > > > initally, there was boot parameter "pci=nommconf". After adding
> > > > > this parameter on the second platform, hotplugs stopped working too.
> > > > >
> > > > > Tested on tag pci-v5.17-changes. He have CONFIG_HOTPLUG_PCI_PCIE
> > > > > and CONFIG_DYNAMIC_DEBUG enabled in config. He also attached two
> > > > > dmesg logs to bugzilla with boot parameter 'dyndbg="file pciehp*
> > > > > +p" as requested. One with "pci=nommconf" and one without.
> > > > >
> > > > > Issue seems to related to "pci=nommconf" and it is probably caused
> > > > > by change outside pciehp.
> > > >
> > > > Maybe I'm missing something.  If I understand correctly, the problem
> > > > has nothing to do with the kernel version (correct me if I'm wrong!)
> > >
> > > The problem occurred after the merge commit. It is some kind of
> > > regression.
> >
> > The bug report doesn't yet contain the evidence showing this.  It only
> > contains dmesg logs with "pci=nommconf" where pciehp doesn't work
> > (which is the expected behavior) and a log without "pci=nommconf"
> > where pciehp does work (which is again the expected behavior).
> >
> > > > PCIe native hotplug doesn't work when booted with "pci=nommconf".
> > > > When using "pci=nommconf", obviously we can't access the extended PCI
> > > > config space (offset 0x100-0xfff), so none of the extended
> > > > capabilities are available.
> > > >
> > > > In that case, we don't even ask the platform for control of PCIe
> > > > hotplug via _OSC.  From the dmesg diff from normal (working) to
> > > > "pci=nommconf" (not working):
> > > >
> > > >   -Command line: BOOT_IMAGE=/boot/vmlinuz-smp ...
> > > >   +Command line: BOOT_IMAGE=/boot/vmlinuz-smp pci=nommconf ...
> > > >   ...
> > > >   -acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
> > > >   -acpi PNP0A08:00: _OSC: platform does not support [AER LTR]
> > > >   -acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug PME PCIeCapability]
> > > >   +acpi PNP0A08:00: _OSC: OS supports [ASPM ClockPM Segments MSI HPX-Type3]
> > > >   +acpi PNP0A08:00: _OSC: not requesting OS control; OS requires [ExtendedConfig ASPM ClockPM MSI]
> > > >   +acpi PNP0A08:00: MMCONFIG is disabled, can't access extended PCI configuration space under this bridge.
> > >
> > > So, it shouldn't work from years but it has been broken recently, that
> > > is the only objection I have. Could you tell why it was working?
> > > According to your words- it shouldn't. We are using VMD driver, is that
> > > matter?
> >
> > 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features") looks like
> > a it could be related.  Try reverting that commit and see whether it
> > makes a difference.
> 
> The affected NVMe is indeed behind VMD domain, so I think the commit
> can make a difference.
> 
> Does VMD behave differently on laptops and servers?
> Anyway, I agree that the issue really lies in "pci=nommconf".

Oh, I have a guess:

  - With "pci=nommconf", prior to v5.17-rc1, pciehp did not work in
    general, but *did* work for NVMe behind a VMD.  As of v5.17-rc1,
    pciehp no longer works for NVMe behind VMD.

  - Without "pci=nommconf", pciehp works as expected for all devices
    including NVMe behind VMD, both before and after v5.17-rc1.

Is that what you're observing?

If so, I doubt there's anything to fix other than getting rid of
"pci=nommconf".

Bjorn

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Bug 215525] New: HotPlug does not work on upstream kernel 5.17.0-rc1
  2022-01-28 14:03             ` Bjorn Helgaas
@ 2022-02-02 15:48               ` Blazej Kucman
  2022-02-02 16:43                 ` Bjorn Helgaas
  0 siblings, 1 reply; 20+ messages in thread
From: Blazej Kucman @ 2022-02-02 15:48 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Kai-Heng Feng, Mariusz Tkaczyk, linux-pci, Blazej Kucman,
	Hans de Goede, Lukas Wunner, Naveen Naidu, Keith Busch,
	Nirmal Patel, Jonathan Derrick

On Fri, 28 Jan 2022 08:03:28 -0600
Bjorn Helgaas <helgaas@kernel.org> wrote:

> On Fri, Jan 28, 2022 at 09:49:34PM +0800, Kai-Heng Feng wrote:
> > On Fri, Jan 28, 2022 at 9:08 PM Bjorn Helgaas <helgaas@kernel.org>
> > wrote:  
> > > On Fri, Jan 28, 2022 at 09:29:31AM +0100, Mariusz Tkaczyk wrote:  
> > > > On Thu, 27 Jan 2022 20:52:12 -0600
> > > > Bjorn Helgaas <helgaas@kernel.org> wrote:  
> > > > > On Thu, Jan 27, 2022 at 03:46:15PM +0100, Mariusz Tkaczyk
> > > > > wrote:  
> > > > > > ...
> > > > > > Thanks for your suggestions. Blazej did some tests and
> > > > > > results were inconclusive. He tested it on two same
> > > > > > platforms. On the first one it didn't work, even if he
> > > > > > reverted all suggested patches. On the second one hotplugs
> > > > > > always worked.
> > > > > >
> > > > > > He noticed that on first platform where issue has been found
> > > > > > initally, there was boot parameter "pci=nommconf". After
> > > > > > adding this parameter on the second platform, hotplugs
> > > > > > stopped working too.
> > > > > >
> > > > > > Tested on tag pci-v5.17-changes. He have
> > > > > > CONFIG_HOTPLUG_PCI_PCIE and CONFIG_DYNAMIC_DEBUG enabled in
> > > > > > config. He also attached two dmesg logs to bugzilla with
> > > > > > boot parameter 'dyndbg="file pciehp* +p" as requested. One
> > > > > > with "pci=nommconf" and one without.
> > > > > >
> > > > > > Issue seems to related to "pci=nommconf" and it is probably
> > > > > > caused by change outside pciehp.  
> > > > >
> > > > > Maybe I'm missing something.  If I understand correctly, the
> > > > > problem has nothing to do with the kernel version (correct me
> > > > > if I'm wrong!)  
> > > >
> > > > The problem occurred after the merge commit. It is some kind of
> > > > regression.  
> > >
> > > The bug report doesn't yet contain the evidence showing this.  It
> > > only contains dmesg logs with "pci=nommconf" where pciehp doesn't
> > > work (which is the expected behavior) and a log without
> > > "pci=nommconf" where pciehp does work (which is again the
> > > expected behavior). 
> > > > > PCIe native hotplug doesn't work when booted with
> > > > > "pci=nommconf". When using "pci=nommconf", obviously we can't
> > > > > access the extended PCI config space (offset 0x100-0xfff), so
> > > > > none of the extended capabilities are available.
> > > > >
> > > > > In that case, we don't even ask the platform for control of
> > > > > PCIe hotplug via _OSC.  From the dmesg diff from normal
> > > > > (working) to "pci=nommconf" (not working):
> > > > >
> > > > >   -Command line: BOOT_IMAGE=/boot/vmlinuz-smp ...
> > > > >   +Command line: BOOT_IMAGE=/boot/vmlinuz-smp pci=nommconf ...
> > > > >   ...
> > > > >   -acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM ClockPM Segments MSI HPX-Type3]
> > > > >   -acpi PNP0A08:00: _OSC: platform does not support [AER LTR]
> > > > >   -acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug PME PCIeCapability]
> > > > >   +acpi PNP0A08:00: _OSC: OS supports [ASPM ClockPM Segments MSI HPX-Type3]
> > > > >   +acpi PNP0A08:00: _OSC: not requesting OS control; OS requires [ExtendedConfig ASPM ClockPM MSI]
> > > > >   +acpi PNP0A08:00: MMCONFIG is disabled, can't access extended PCI configuration space under this bridge.
> > > >
> > > > So, it shouldn't work from years but it has been broken
> > > > recently, that is the only objection I have. Could you tell why
> > > > it was working? According to your words- it shouldn't. We are
> > > > using VMD driver, is that matter?  
> > >
> > > 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features") looks
> > > like a it could be related.  Try reverting that commit and see
> > > whether it makes a difference.  
> > 
> > The affected NVMe is indeed behind VMD domain, so I think the commit
> > can make a difference.
> > 
> > Does VMD behave differently on laptops and servers?
> > Anyway, I agree that the issue really lies in "pci=nommconf".  
> 
> Oh, I have a guess:
> 
>   - With "pci=nommconf", prior to v5.17-rc1, pciehp did not work in
>     general, but *did* work for NVMe behind a VMD.  As of v5.17-rc1,
>     pciehp no longer works for NVMe behind VMD.
> 
>   - Without "pci=nommconf", pciehp works as expected for all devices
>     including NVMe behind VMD, both before and after v5.17-rc1.
> 
> Is that what you're observing?
> 
> If so, I doubt there's anything to fix other than getting rid of
> "pci=nommconf".
> 
> Bjorn

I hadn't tested with VMD disabled earlier. I have now verified it, and
my observations are as follows:

OS: RHEL 8.4
NO - hotplug not working
YES - hotplug working

pci=nommconf added:
+--------------+-------------------+---------------------+--------------+
|              | pci-v5.17-changes | revert-04b12ef163d1 | inbox kernel |
+--------------+-------------------+---------------------+--------------+
| VMD enabled  | NO                | YES                 | YES          |
+--------------+-------------------+---------------------+--------------+
| VMD disabled | NO                | NO                  | NO           |
+--------------+-------------------+---------------------+--------------+

without pci=nommconf:
+--------------+-------------------+---------------------+--------------+
|              | pci-v5.17-changes | revert-04b12ef163d1 | inbox kernel |
+--------------+-------------------+---------------------+--------------+
| VMD enabled  | YES               | YES                 | YES          |
+--------------+-------------------+---------------------+--------------+
| VMD disabled | YES               | YES                 | YES          |
+--------------+-------------------+---------------------+--------------+

So the results confirm your assumptions, but I also confirmed that
reverting 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features")
makes it work the same as the inbox kernel.
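
(For reference, a revert build like that can be reproduced roughly as
follows; the commands are illustrative rather than the exact ones used:)

  git checkout pci-v5.17-changes
  git revert 04b12ef163d1    # "PCI: vmd: Honor ACPI _OSC on PCIe features"
  make olddefconfig
  make -j"$(nproc)"
  make modules_install install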

We will drop the legacy parameter in our tests. Still, according to my
results there is a regression in VMD caused by commit 04b12ef163d1,
even if hotplug with that parameter does not work for NVMe outside VMD
anyway. Should it be fixed?

Thanks,
Blazej

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Bug 215525] New: HotPlug does not work on upstream kernel 5.17.0-rc1
  2022-02-02 15:48               ` Blazej Kucman
@ 2022-02-02 16:43                 ` Bjorn Helgaas
  2022-02-03  9:13                   ` Thorsten Leemhuis
  0 siblings, 1 reply; 20+ messages in thread
From: Bjorn Helgaas @ 2022-02-02 16:43 UTC (permalink / raw)
  To: Blazej Kucman
  Cc: Kai-Heng Feng, Mariusz Tkaczyk, linux-pci, Blazej Kucman,
	Hans de Goede, Lukas Wunner, Naveen Naidu, Keith Busch,
	Nirmal Patel, Jonathan Derrick

On Wed, Feb 02, 2022 at 04:48:01PM +0100, Blazej Kucman wrote:
> On Fri, 28 Jan 2022 08:03:28 -0600
> Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Fri, Jan 28, 2022 at 09:49:34PM +0800, Kai-Heng Feng wrote:
> > > On Fri, Jan 28, 2022 at 9:08 PM Bjorn Helgaas <helgaas@kernel.org>
> > > wrote:  
> > > > On Fri, Jan 28, 2022 at 09:29:31AM +0100, Mariusz Tkaczyk wrote:  
> > > > > On Thu, 27 Jan 2022 20:52:12 -0600
> > > > > Bjorn Helgaas <helgaas@kernel.org> wrote:  
> > > > > > On Thu, Jan 27, 2022 at 03:46:15PM +0100, Mariusz Tkaczyk
> > > > > > wrote:  
> > > > > > > ...
> > > > > > > Thanks for your suggestions. Blazej did some tests and
> > > > > > > results were inconclusive. He tested it on two same
> > > > > > > platforms. On the first one it didn't work, even if he
> > > > > > > reverted all suggested patches. On the second one hotplugs
> > > > > > > always worked.
> > > > > > >
> > > > > > > He noticed that on first platform where issue has been found
> > > > > > > initally, there was boot parameter "pci=nommconf". After
> > > > > > > adding this parameter on the second platform, hotplugs
> > > > > > > stopped working too.
> > > > > > >
> > > > > > > Tested on tag pci-v5.17-changes. He have
> > > > > > > CONFIG_HOTPLUG_PCI_PCIE and CONFIG_DYNAMIC_DEBUG enabled in
> > > > > > > config. He also attached two dmesg logs to bugzilla with
> > > > > > > boot parameter 'dyndbg="file pciehp* +p" as requested. One
> > > > > > > with "pci=nommconf" and one without.
> > > > > > >
> > > > > > > Issue seems to related to "pci=nommconf" and it is probably
> > > > > > > caused by change outside pciehp.  
> > > > > >
> > > > > > Maybe I'm missing something.  If I understand correctly, the
> > > > > > problem has nothing to do with the kernel version (correct me
> > > > > > if I'm wrong!)  
> > > > >
> > > > > The problem occurred after the merge commit. It is some kind of
> > > > > regression.  
> > > >
> > > > The bug report doesn't yet contain the evidence showing this.  It
> > > > only contains dmesg logs with "pci=nommconf" where pciehp doesn't
> > > > work (which is the expected behavior) and a log without
> > > > "pci=nommconf" where pciehp does work (which is again the
> > > > expected behavior). 
> > > > > > PCIe native hotplug doesn't work when booted with
> > > > > > "pci=nommconf". When using "pci=nommconf", obviously we can't
> > > > > > access the extended PCI config space (offset 0x100-0xfff), so
> > > > > > none of the extended capabilities are available.
> > > > > >
> > > > > > In that case, we don't even ask the platform for control of
> > > > > > PCIe hotplug via _OSC.  From the dmesg diff from normal
> > > > > > (working) to "pci=nommconf" (not working):
> > > > > >
> > > > > >   -Command line: BOOT_IMAGE=/boot/vmlinuz-smp ...
> > > > > >   +Command line: BOOT_IMAGE=/boot/vmlinuz-smp pci=nommconf ...
> > > > > >   ...
> > > > > >   -acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM
> > > > > > ClockPM Segments MSI HPX-Type3] -acpi PNP0A08:00: _OSC:
> > > > > > platform does not support [AER LTR] -acpi PNP0A08:00: _OSC:
> > > > > > OS now controls [PCIeHotplug PME PCIeCapability] +acpi
> > > > > > PNP0A08:00: _OSC: OS supports [ASPM ClockPM Segments MSI
> > > > > > HPX-Type3] +acpi PNP0A08:00: _OSC: not requesting OS control;
> > > > > > OS requires [ExtendedConfig ASPM ClockPM MSI] +acpi
> > > > > > PNP0A08:00: MMCONFIG is disabled, can't access extended PCI
> > > > > > configuration space under this bridge.  
> > > > >
> > > > > So, it shouldn't work from years but it has been broken
> > > > > recently, that is the only objection I have. Could you tell why
> > > > > it was working? According to your words- it shouldn't. We are
> > > > > using VMD driver, is that matter?  
> > > >
> > > > 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features") looks
> > > > like a it could be related.  Try reverting that commit and see
> > > > whether it makes a difference.  
> > > 
> > > The affected NVMe is indeed behind VMD domain, so I think the commit
> > > can make a difference.
> > > 
> > > Does VMD behave differently on laptops and servers?
> > > Anyway, I agree that the issue really lies in "pci=nommconf".  
> > 
> > Oh, I have a guess:
> > 
> >   - With "pci=nommconf", prior to v5.17-rc1, pciehp did not work in
> >     general, but *did* work for NVMe behind a VMD.  As of v5.17-rc1,
> >     pciehp no longer works for NVMe behind VMD.
> > 
> >   - Without "pci=nommconf", pciehp works as expected for all devices
> >     including NVMe behind VMD, both before and after v5.17-rc1.
> > 
> > Is that what you're observing?
> > 
> > If so, I doubt there's anything to fix other than getting rid of
> > "pci=nommconf".
> 
> I haven't tested with VMD disabled earlier. I verified it and my
> observations are as follows:
> 
> OS: RHEL 8.4
> NO - hotplug not working
> YES - hotplug working
> 
> pci=nommconf added:
> +--------------+-------------------+---------------------+--------------+
> |              | pci-v5.17-changes | revert-04b12ef163d1 | inbox kernel
> +--------------+-------------------+---------------------+--------------+
> | VMD enabled  | NO                | YES                 | YES         
> +--------------+-------------------+---------------------+--------------+
> | VMD disabled | NO                | NO                  | NO
> +--------------+-------------------+---------------------+--------------+
> 
> without pci=nommconf:
> +--------------+-------------------+---------------------+--------------+
> |              | pci-v5.17-changes | revert-04b12ef163d1 | inbox kernel
> +--------------+-------------------+---------------------+--------------+
> | VMD enabled  | YES               | YES                 | YES
> +--------------+-------------------+---------------------+--------------+
> | VMD disabled | YES               | YES                 | YES
> +--------------+-------------------+---------------------+--------------+
> 
> So, results confirmed your assumptions, but I also confirmed that
> revert of 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features")
> makes it to work as in inbox kernel.
> 
> We will drop the legacy parameter in our tests. According to my results
> there is a regression in VMD caused by: 04b12ef163d1 commit, even if it
> is not working for nvme anyway. Should it be fixed?

I don't know what the "inbox kernel" is.  I guess that's unmodified
RHEL 8.4?

And revert-04b12ef163d1 means the pci-v5.17-changes tag plus a revert
of 04b12ef163d1?

I think your "hotplug working" or "hotplug not working" notes refer
specifically to devices behind VMD, right?  They do not refer to
devices outside the VMD hierarchy?

So IIUC, the regression is that hotplug of devices behind VMD used to
work with "pci=nommconf", but it does not work after 04b12ef163d1.
IMO that does not need to be fixed because it was arguably a bug that
it *did* work before.

That said, I'm not 100% confident about 04b12ef163d1 because _OSC is a
way to negotiate ownership of things that could be owned either by
platform firmware or by the OS, and the commit log doesn't make it
clear that's the situation here.  It's more of a "the problem doesn't
happen when we do this" sort of commit log.

If there's anything more to do here, it would be helpful to attach
complete dmesg logs from the scenarios of interest to the bugzilla.
That will help remove ambiguity about what's being tested and what the
results are.
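
(For example, something along these lines would capture the needed
information; the file name is just an example:)

  # boot the scenario of interest with pciehp dynamic debug enabled,
  # i.e. with 'dyndbg="file pciehp* +p"' on the kernel command line,
  # reproduce the hotplug attempt, then save the full log:
  dmesg > dmesg-nommconf-vmd-enabled.txt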

Bjorn

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Bug 215525] New: HotPlug does not work on upstream kernel 5.17.0-rc1
  2022-02-02 16:43                 ` Bjorn Helgaas
@ 2022-02-03  9:13                   ` Thorsten Leemhuis
  2022-02-03 10:47                     ` Blazej Kucman
  0 siblings, 1 reply; 20+ messages in thread
From: Thorsten Leemhuis @ 2022-02-03  9:13 UTC (permalink / raw)
  To: Bjorn Helgaas, Blazej Kucman
  Cc: Kai-Heng Feng, Mariusz Tkaczyk, linux-pci, Blazej Kucman,
	Hans de Goede, Lukas Wunner, Naveen Naidu, Keith Busch,
	Nirmal Patel, Jonathan Derrick


Hi, this is your Linux kernel regression tracker speaking.

On 02.02.22 17:43, Bjorn Helgaas wrote:
> On Wed, Feb 02, 2022 at 04:48:01PM +0100, Blazej Kucman wrote:
>> On Fri, 28 Jan 2022 08:03:28 -0600
>> Bjorn Helgaas <helgaas@kernel.org> wrote:
>>> On Fri, Jan 28, 2022 at 09:49:34PM +0800, Kai-Heng Feng wrote:
>>>> On Fri, Jan 28, 2022 at 9:08 PM Bjorn Helgaas <helgaas@kernel.org>
>>>> wrote:  
>>>>> On Fri, Jan 28, 2022 at 09:29:31AM +0100, Mariusz Tkaczyk wrote:  
>>>>>> On Thu, 27 Jan 2022 20:52:12 -0600
>>>>>> Bjorn Helgaas <helgaas@kernel.org> wrote:  
>>>>>>> On Thu, Jan 27, 2022 at 03:46:15PM +0100, Mariusz Tkaczyk
>>>>>>> wrote:  
>>>>>>>> ...
>>>>>>>> Thanks for your suggestions. Blazej did some tests and
>>>>>>>> results were inconclusive. He tested it on two same
>>>>>>>> platforms. On the first one it didn't work, even if he
>>>>>>>> reverted all suggested patches. On the second one hotplugs
>>>>>>>> always worked.
>>>>>>>>
>>>>>>>> He noticed that on first platform where issue has been found
>>>>>>>> initally, there was boot parameter "pci=nommconf". After
>>>>>>>> adding this parameter on the second platform, hotplugs
>>>>>>>> stopped working too.
>>>>>>>>
>>>>>>>> Tested on tag pci-v5.17-changes. He have
>>>>>>>> CONFIG_HOTPLUG_PCI_PCIE and CONFIG_DYNAMIC_DEBUG enabled in
>>>>>>>> config. He also attached two dmesg logs to bugzilla with
>>>>>>>> boot parameter 'dyndbg="file pciehp* +p" as requested. One
>>>>>>>> with "pci=nommconf" and one without.
>>>>>>>>
>>>>>>>> Issue seems to related to "pci=nommconf" and it is probably
>>>>>>>> caused by change outside pciehp.  
>>>>>>>
>>>>>>> Maybe I'm missing something.  If I understand correctly, the
>>>>>>> problem has nothing to do with the kernel version (correct me
>>>>>>> if I'm wrong!)  
>>>>>>
>>>>>> The problem occurred after the merge commit. It is some kind of
>>>>>> regression.  
>>>>>
>>>>> The bug report doesn't yet contain the evidence showing this.  It
>>>>> only contains dmesg logs with "pci=nommconf" where pciehp doesn't
>>>>> work (which is the expected behavior) and a log without
>>>>> "pci=nommconf" where pciehp does work (which is again the
>>>>> expected behavior). 
>>>>>>> PCIe native hotplug doesn't work when booted with
>>>>>>> "pci=nommconf". When using "pci=nommconf", obviously we can't
>>>>>>> access the extended PCI config space (offset 0x100-0xfff), so
>>>>>>> none of the extended capabilities are available.
>>>>>>>
>>>>>>> In that case, we don't even ask the platform for control of
>>>>>>> PCIe hotplug via _OSC.  From the dmesg diff from normal
>>>>>>> (working) to "pci=nommconf" (not working):
>>>>>>>
>>>>>>>   -Command line: BOOT_IMAGE=/boot/vmlinuz-smp ...
>>>>>>>   +Command line: BOOT_IMAGE=/boot/vmlinuz-smp pci=nommconf ...
>>>>>>>   ...
>>>>>>>   -acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM
>>>>>>> ClockPM Segments MSI HPX-Type3] -acpi PNP0A08:00: _OSC:
>>>>>>> platform does not support [AER LTR] -acpi PNP0A08:00: _OSC:
>>>>>>> OS now controls [PCIeHotplug PME PCIeCapability] +acpi
>>>>>>> PNP0A08:00: _OSC: OS supports [ASPM ClockPM Segments MSI
>>>>>>> HPX-Type3] +acpi PNP0A08:00: _OSC: not requesting OS control;
>>>>>>> OS requires [ExtendedConfig ASPM ClockPM MSI] +acpi
>>>>>>> PNP0A08:00: MMCONFIG is disabled, can't access extended PCI
>>>>>>> configuration space under this bridge.  
>>>>>>
>>>>>> So, it shouldn't work from years but it has been broken
>>>>>> recently, that is the only objection I have. Could you tell why
>>>>>> it was working? According to your words- it shouldn't. We are
>>>>>> using VMD driver, is that matter?  
>>>>>
>>>>> 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features") looks
>>>>> like a it could be related.  Try reverting that commit and see
>>>>> whether it makes a difference.  
>>>>
>>>> The affected NVMe is indeed behind VMD domain, so I think the commit
>>>> can make a difference.
>>>>
>>>> Does VMD behave differently on laptops and servers?
>>>> Anyway, I agree that the issue really lies in "pci=nommconf".  
>>>
>>> Oh, I have a guess:
>>>
>>>   - With "pci=nommconf", prior to v5.17-rc1, pciehp did not work in
>>>     general, but *did* work for NVMe behind a VMD.  As of v5.17-rc1,
>>>     pciehp no longer works for NVMe behind VMD.
>>>
>>>   - Without "pci=nommconf", pciehp works as expected for all devices
>>>     including NVMe behind VMD, both before and after v5.17-rc1.
>>>
>>> Is that what you're observing?
>>>
>>> If so, I doubt there's anything to fix other than getting rid of
>>> "pci=nommconf".
>>
>> I haven't tested with VMD disabled earlier. I verified it and my
>> observations are as follows:
>>
>> OS: RHEL 8.4
>> NO - hotplug not working
>> YES - hotplug working
>>
>> pci=nommconf added:
>> +--------------+-------------------+---------------------+--------------+
>> |              | pci-v5.17-changes | revert-04b12ef163d1 | inbox kernel
>> +--------------+-------------------+---------------------+--------------+
>> | VMD enabled  | NO                | YES                 | YES         
>> +--------------+-------------------+---------------------+--------------+
>> | VMD disabled | NO                | NO                  | NO
>> +--------------+-------------------+---------------------+--------------+
>>
>> without pci=nommconf:
>> +--------------+-------------------+---------------------+--------------+
>> |              | pci-v5.17-changes | revert-04b12ef163d1 | inbox kernel
>> +--------------+-------------------+---------------------+--------------+
>> | VMD enabled  | YES               | YES                 | YES
>> +--------------+-------------------+---------------------+--------------+
>> | VMD disabled | YES               | YES                 | YES
>> +--------------+-------------------+---------------------+--------------+
>>
>> So, results confirmed your assumptions, but I also confirmed that
>> revert of 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features")
>> makes it to work as in inbox kernel.
>>
>> We will drop the legacy parameter in our tests. According to my results
>> there is a regression in VMD caused by: 04b12ef163d1 commit, even if it
>> is not working for nvme anyway. Should it be fixed?
> 
> I don't know what the "inbox kernel" is.  I guess that's unmodified
> RHEL 8.4?
> 
> And revert-04b12ef163d1 means the pci-v5.17-changes tag plus a revert
> of 04b12ef163d1?
> 
> I think your "hotplug working" or "hotplug not working" notes refer
> specifically to devices behind VMD, right?  They do not refer to
> devices outside the VMD hierarchy?
> 
> So IIUC, the regression is that hotplug of devices behind VMD used to
> work with "pci=nommconf", but it does not work after 04b12ef163d1.
> IMO that does not need to be fixed because it was arguably a bug that
> it *did* work before.

FWIW, that afaics does matter. To quote Linus:

```
Users are literally the _only_ thing that matters.

No amount of "you shouldn't have used this" or "that behavior was
undefined, it's your own fault your app broke" or "that used to work
simply because of a kernel bug" is at all relevant.
```

That quote is from:
https://lore.kernel.org/all/CAHk-=wiVi7mSrsMP=fLXQrXK_UimybW=ziLOwSzFTtoXUacWVQ@mail.gmail.com/

So from my point of view as a regression tracker I currently think this
still needs to be fixed.

> That said, I'm not 100% confident about 04b12ef163d1 because _OSC is a
> way to negotiate ownership of things that could be owned either by
> platform firmware or by the OS, and the commit log doesn't make it
> clear that's the situation here.  It's more of a "the problem doesn't
> happen when we do this" sort of commit log.
> 
> If there's anything more to do here, it would be helpful to attach
> complete dmesg logs from the scenarios of interest to the bugzilla.
> That will help remove ambiguity about what's being tested and what the
> results are.

Ciao, Thorsten (wearing his 'Linux kernel regression tracker' hat)

P.S.: As a Linux kernel regression tracker I get a lot of reports on
my table. I can only look briefly into most of them, so unfortunately
I will sometimes get things wrong or miss something important. I hope
that's not the case here; if you think it is, don't hesitate to tell
me about it in a public reply, that's in everyone's interest.

BTW, I have no personal interest in this issue, which is tracked using
regzbot, my Linux kernel regression tracking bot
(https://linux-regtracking.leemhuis.info/regzbot/). I'm only posting
this mail to get things rolling again and hence don't need to be CCed
on all further activities wrt this regression.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Bug 215525] New: HotPlug does not work on upstream kernel 5.17.0-rc1
  2022-02-03  9:13                   ` Thorsten Leemhuis
@ 2022-02-03 10:47                     ` Blazej Kucman
  2022-02-03 15:58                       ` Bjorn Helgaas
  0 siblings, 1 reply; 20+ messages in thread
From: Blazej Kucman @ 2022-02-03 10:47 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Kai-Heng Feng, Mariusz Tkaczyk, linux-pci, Blazej Kucman,
	Hans de Goede, Lukas Wunner, Naveen Naidu, Keith Busch,
	Nirmal Patel, Jonathan Derrick

On Thu, 3 Feb 2022 10:13:15 +0100
Thorsten Leemhuis <regressions@leemhuis.info> wrote:

> Hi, this is your Linux kernel regression tracker speaking.
> 
> On 02.02.22 17:43, Bjorn Helgaas wrote:
> > On Wed, Feb 02, 2022 at 04:48:01PM +0100, Blazej Kucman wrote:  
> >> On Fri, 28 Jan 2022 08:03:28 -0600
> >> Bjorn Helgaas <helgaas@kernel.org> wrote:  
> >>> On Fri, Jan 28, 2022 at 09:49:34PM +0800, Kai-Heng Feng wrote:  
> >>>> On Fri, Jan 28, 2022 at 9:08 PM Bjorn Helgaas
> >>>> <helgaas@kernel.org> wrote:    
> >>>>> On Fri, Jan 28, 2022 at 09:29:31AM +0100, Mariusz Tkaczyk
> >>>>> wrote:    
> >>>>>> On Thu, 27 Jan 2022 20:52:12 -0600
> >>>>>> Bjorn Helgaas <helgaas@kernel.org> wrote:    
> >>>>>>> On Thu, Jan 27, 2022 at 03:46:15PM +0100, Mariusz Tkaczyk
> >>>>>>> wrote:    
> >>>>>>>> ...
> >>>>>>>> Thanks for your suggestions. Blazej did some tests and
> >>>>>>>> results were inconclusive. He tested it on two same
> >>>>>>>> platforms. On the first one it didn't work, even if he
> >>>>>>>> reverted all suggested patches. On the second one hotplugs
> >>>>>>>> always worked.
> >>>>>>>>
> >>>>>>>> He noticed that on first platform where issue has been found
> >>>>>>>> initally, there was boot parameter "pci=nommconf". After
> >>>>>>>> adding this parameter on the second platform, hotplugs
> >>>>>>>> stopped working too.
> >>>>>>>>
> >>>>>>>> Tested on tag pci-v5.17-changes. He have
> >>>>>>>> CONFIG_HOTPLUG_PCI_PCIE and CONFIG_DYNAMIC_DEBUG enabled in
> >>>>>>>> config. He also attached two dmesg logs to bugzilla with
> >>>>>>>> boot parameter 'dyndbg="file pciehp* +p" as requested. One
> >>>>>>>> with "pci=nommconf" and one without.
> >>>>>>>>
> >>>>>>>> Issue seems to related to "pci=nommconf" and it is probably
> >>>>>>>> caused by change outside pciehp.    
> >>>>>>>
> >>>>>>> Maybe I'm missing something.  If I understand correctly, the
> >>>>>>> problem has nothing to do with the kernel version (correct me
> >>>>>>> if I'm wrong!)    
> >>>>>>
> >>>>>> The problem occurred after the merge commit. It is some kind of
> >>>>>> regression.    
> >>>>>
> >>>>> The bug report doesn't yet contain the evidence showing this.
> >>>>> It only contains dmesg logs with "pci=nommconf" where pciehp
> >>>>> doesn't work (which is the expected behavior) and a log without
> >>>>> "pci=nommconf" where pciehp does work (which is again the
> >>>>> expected behavior).   
> >>>>>>> PCIe native hotplug doesn't work when booted with
> >>>>>>> "pci=nommconf". When using "pci=nommconf", obviously we can't
> >>>>>>> access the extended PCI config space (offset 0x100-0xfff), so
> >>>>>>> none of the extended capabilities are available.
> >>>>>>>
> >>>>>>> In that case, we don't even ask the platform for control of
> >>>>>>> PCIe hotplug via _OSC.  From the dmesg diff from normal
> >>>>>>> (working) to "pci=nommconf" (not working):
> >>>>>>>
> >>>>>>>   -Command line: BOOT_IMAGE=/boot/vmlinuz-smp ...
> >>>>>>>   +Command line: BOOT_IMAGE=/boot/vmlinuz-smp pci=nommconf ...
> >>>>>>>   ...
> >>>>>>>   -acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM
> >>>>>>> ClockPM Segments MSI HPX-Type3] -acpi PNP0A08:00: _OSC:
> >>>>>>> platform does not support [AER LTR] -acpi PNP0A08:00: _OSC:
> >>>>>>> OS now controls [PCIeHotplug PME PCIeCapability] +acpi
> >>>>>>> PNP0A08:00: _OSC: OS supports [ASPM ClockPM Segments MSI
> >>>>>>> HPX-Type3] +acpi PNP0A08:00: _OSC: not requesting OS control;
> >>>>>>> OS requires [ExtendedConfig ASPM ClockPM MSI] +acpi
> >>>>>>> PNP0A08:00: MMCONFIG is disabled, can't access extended PCI
> >>>>>>> configuration space under this bridge.    
> >>>>>>
> >>>>>> So, it shouldn't work from years but it has been broken
> >>>>>> recently, that is the only objection I have. Could you tell why
> >>>>>> it was working? According to your words- it shouldn't. We are
> >>>>>> using VMD driver, is that matter?    
> >>>>>
> >>>>> 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features")
> >>>>> looks like a it could be related.  Try reverting that commit
> >>>>> and see whether it makes a difference.    
> >>>>
> >>>> The affected NVMe is indeed behind VMD domain, so I think the
> >>>> commit can make a difference.
> >>>>
> >>>> Does VMD behave differently on laptops and servers?
> >>>> Anyway, I agree that the issue really lies in "pci=nommconf".    
> >>>
> >>> Oh, I have a guess:
> >>>
> >>>   - With "pci=nommconf", prior to v5.17-rc1, pciehp did not work
> >>> in general, but *did* work for NVMe behind a VMD.  As of
> >>> v5.17-rc1, pciehp no longer works for NVMe behind VMD.
> >>>
> >>>   - Without "pci=nommconf", pciehp works as expected for all
> >>> devices including NVMe behind VMD, both before and after
> >>> v5.17-rc1.
> >>>
> >>> Is that what you're observing?
> >>>
> >>> If so, I doubt there's anything to fix other than getting rid of
> >>> "pci=nommconf".  
> >>
> >> I haven't tested with VMD disabled earlier. I verified it and my
> >> observations are as follows:
> >>
> >> OS: RHEL 8.4
> >> NO - hotplug not working
> >> YES - hotplug working
> >>
> >> pci=nommconf added:
> >> +--------------+-------------------+---------------------+--------------+
> >> |              | pci-v5.17-changes | revert-04b12ef163d1 | inbox
> >> kernel
> >> +--------------+-------------------+---------------------+--------------+
> >> | VMD enabled  | NO                | YES                 | YES
> >> +--------------+-------------------+---------------------+--------------+
> >> | VMD disabled | NO                | NO                  | NO
> >> +--------------+-------------------+---------------------+--------------+
> >>
> >> without pci=nommconf:
> >> +--------------+-------------------+---------------------+--------------+
> >> |              | pci-v5.17-changes | revert-04b12ef163d1 | inbox
> >> kernel
> >> +--------------+-------------------+---------------------+--------------+
> >> | VMD enabled  | YES               | YES                 | YES
> >> +--------------+-------------------+---------------------+--------------+
> >> | VMD disabled | YES               | YES                 | YES
> >> +--------------+-------------------+---------------------+--------------+
> >>
> >> So, results confirmed your assumptions, but I also confirmed that
> >> revert of 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe
> >> features") makes it to work as in inbox kernel.
> >>
> >> We will drop the legacy parameter in our tests. According to my
> >> results there is a regression in VMD caused by: 04b12ef163d1
> >> commit, even if it is not working for nvme anyway. Should it be
> >> fixed?  
> > 
> > I don't know what the "inbox kernel" is.  I guess that's unmodified
> > RHEL 8.4?

Yes


> > 
> > And revert-04b12ef163d1 means the pci-v5.17-changes tag plus a
> > revert of 04b12ef163d1?

Yes

> > 
> > I think your "hotplug working" or "hotplug not working" notes refer
> > specifically to devices behind VMD, right?  They do not refer to
> > devices outside the VMD hierarchy?

The "VMD disabled" row in the tables means that VMD is turned off in
UEFI and not exposed to the OS, so the NVMe device is not in the VMD
hierarchy.

"hotplug working" or "hotplug not working" refers only to the
hotplug behaviour.
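
(For reference, whether a given NVMe sits inside the VMD hierarchy can
be checked from the OS by looking at its PCI path, since VMD-owned
devices appear in a separate PCI domain. Device names below are only
examples:)

  readlink -f /sys/class/nvme/nvme0/device   # path contains the VMD domain (e.g. 10000:xx:xx.x) when behind VMD
  lspci -D | grep -i "Volume Management Device"   # shows the VMD controller itself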

> > 
> > So IIUC, the regression is that hotplug of devices behind VMD used
> > to work with "pci=nommconf", but it does not work after
> > 04b12ef163d1. IMO that does not need to be fixed because it was
> > arguably a bug that it *did* work before.  
> 
> FWIW, that afaics does matter. To quote Linus:
> 
> ```
> Users are literally the _only_ thing that matters.
> 
> No amount of "you shouldn't have used this" or "that behavior was
> undefined, it's your own fault your app broke" or "that used to work
> simply because of a kernel bug" is at all relevant.
> ```
> 
> That quote is from:
> https://lore.kernel.org/all/CAHk-=wiVi7mSrsMP=fLXQrXK_UimybW=ziLOwSzFTtoXUacWVQ@mail.gmail.com/
> 
> So from my point of view as a regression tracker I currently think
> this still needs to be fixed.
> 
> > That said, I'm not 100% confident about 04b12ef163d1 because _OSC
> > is a way to negotiate ownership of things that could be owned
> > either by platform firmware or by the OS, and the commit log
> > doesn't make it clear that's the situation here.  It's more of a
> > "the problem doesn't happen when we do this" sort of commit log.
> > 
> > If there's anything more to do here, it would be helpful to attach
> > complete dmesg logs from the scenarios of interest to the bugzilla.
> > That will help remove ambiguity about what's being tested and what
> > the results are.  
> 
> Ciao, Thorsten (wearing his 'Linux kernel regression tracker' hat)
> 
> P.S.: As a Linux kernel regression tracker I'm getting a lot of
> reports on my table. I can only look briefly into most of them.
> Unfortunately therefore I sometimes will get things wrong or miss
> something important. I hope that's not the case here; if you think it
> is, don't hesitate to tell me about it in a public reply, that's in
> everyone's interest.
> 
> BTW, I have no personal interest in this issue, which is tracked using
> regzbot, my Linux kernel regression tracking bot
> (https://linux-regtracking.leemhuis.info/regzbot/). I'm only posting
> this mail to get things rolling again and hence don't need to be CC on
> all further activities wrt to this regression.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Bug 215525] New: HotPlug does not work on upstream kernel 5.17.0-rc1
  2022-02-03 10:47                     ` Blazej Kucman
@ 2022-02-03 15:58                       ` Bjorn Helgaas
  2022-02-09 13:41                         ` Blazej Kucman
  0 siblings, 1 reply; 20+ messages in thread
From: Bjorn Helgaas @ 2022-02-03 15:58 UTC (permalink / raw)
  To: Blazej Kucman
  Cc: Kai-Heng Feng, Mariusz Tkaczyk, linux-pci, Blazej Kucman,
	Hans de Goede, Lukas Wunner, Naveen Naidu, Keith Busch,
	Nirmal Patel, Jonathan Derrick

On Thu, Feb 03, 2022 at 11:47:09AM +0100, Blazej Kucman wrote:
> > > On Wed, Feb 02, 2022 at 04:48:01PM +0100, Blazej Kucman wrote:  
> > >> On Fri, 28 Jan 2022 08:03:28 -0600
> > >> Bjorn Helgaas <helgaas@kernel.org> wrote:  
> > >>>> On Fri, Jan 28, 2022 at 9:08 PM Bjorn Helgaas
> > >>>> <helgaas@kernel.org> wrote:    

> > >>>>> 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features")
> > >>>>> looks like a it could be related.  Try reverting that commit
> > >>>>> and see whether it makes a difference.    
> > >>>>
> > >>>> The affected NVMe is indeed behind VMD domain, so I think the
> > >>>> commit can make a difference.
> > >>>>
> > >>>> Does VMD behave differently on laptops and servers?
> > >>>> Anyway, I agree that the issue really lies in "pci=nommconf".    
> > >>>
> > >>> Oh, I have a guess:
> > >>>
> > >>>   - With "pci=nommconf", prior to v5.17-rc1, pciehp did not work
> > >>> in general, but *did* work for NVMe behind a VMD.  As of
> > >>> v5.17-rc1, pciehp no longer works for NVMe behind VMD.
> > >>>
> > >>>   - Without "pci=nommconf", pciehp works as expected for all
> > >>> devices including NVMe behind VMD, both before and after
> > >>> v5.17-rc1.
> > >>>
> > >>> Is that what you're observing?
> > >>>
> > >>> If so, I doubt there's anything to fix other than getting rid of
> > >>> "pci=nommconf".  
> > >>
> > >> I haven't tested with VMD disabled earlier. I verified it and my
> > >> observations are as follows:
> > >>
> > >> OS: RHEL 8.4
> > >> NO - hotplug not working
> > >> YES - hotplug working
> > >>
> > >> pci=nommconf added:
> > >> +--------------+-------------------+---------------------+--------------+
> > >> |              | pci-v5.17-changes | revert-04b12ef163d1 | inbox
> > >> kernel
> > >> +--------------+-------------------+---------------------+--------------+
> > >> | VMD enabled  | NO                | YES                 | YES
> > >> +--------------+-------------------+---------------------+--------------+
> > >> | VMD disabled | NO                | NO                  | NO
> > >> +--------------+-------------------+---------------------+--------------+

OK, so the only possible problem case is booting with VMD enabled and
"pci=nommconf".  In that case, hotplug for devices below VMD worked
before 04b12ef163d1 and doesn't work after.

Your table doesn't show it, but hotplug for devices *not* behind VMD
should not work either before or after 04b12ef163d1 because Linux
doesn't request PCIe hotplug control when booting with "pci=nommconf".
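
(A quick way to see which features the OS was granted is to check the
_OSC lines in dmesg, for example:)

  dmesg | grep _OSC
  # with native hotplug granted, the host bridge should report something like:
  #   acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug PME PCIeCapability]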

Why were you testing with "pci=nommconf"?  Do you think anybody uses
that with VMD and NVMe?

Bjorn

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Bug 215525] New: HotPlug does not work on upstream kernel 5.17.0-rc1
  2022-02-03 15:58                       ` Bjorn Helgaas
@ 2022-02-09 13:41                         ` Blazej Kucman
  2022-02-09 21:02                           ` Bjorn Helgaas
  0 siblings, 1 reply; 20+ messages in thread
From: Blazej Kucman @ 2022-02-09 13:41 UTC (permalink / raw)
  To: Bjorn Helgaas, Nirmal Patel
  Cc: Kai-Heng Feng, Mariusz Tkaczyk, linux-pci, Blazej Kucman,
	Hans de Goede, Lukas Wunner, Naveen Naidu, Keith Busch,
	Jonathan Derrick

On Thu, 3 Feb 2022 09:58:04 -0600
Bjorn Helgaas <helgaas@kernel.org> wrote:

> On Thu, Feb 03, 2022 at 11:47:09AM +0100, Blazej Kucman wrote:
> > > > On Wed, Feb 02, 2022 at 04:48:01PM +0100, Blazej Kucman wrote:
> > > >   
> > > >> On Fri, 28 Jan 2022 08:03:28 -0600
> > > >> Bjorn Helgaas <helgaas@kernel.org> wrote:    
> > > >>>> On Fri, Jan 28, 2022 at 9:08 PM Bjorn Helgaas
> > > >>>> <helgaas@kernel.org> wrote:      
> 
> > > >>>>> 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features")
> > > >>>>> looks like a it could be related.  Try reverting that commit
> > > >>>>> and see whether it makes a difference.      
> > > >>>>
> > > >>>> The affected NVMe is indeed behind VMD domain, so I think the
> > > >>>> commit can make a difference.
> > > >>>>
> > > >>>> Does VMD behave differently on laptops and servers?
> > > >>>> Anyway, I agree that the issue really lies in
> > > >>>> "pci=nommconf".      
> > > >>>
> > > >>> Oh, I have a guess:
> > > >>>
> > > >>>   - With "pci=nommconf", prior to v5.17-rc1, pciehp did not
> > > >>> work in general, but *did* work for NVMe behind a VMD.  As of
> > > >>> v5.17-rc1, pciehp no longer works for NVMe behind VMD.
> > > >>>
> > > >>>   - Without "pci=nommconf", pciehp works as expected for all
> > > >>> devices including NVMe behind VMD, both before and after
> > > >>> v5.17-rc1.
> > > >>>
> > > >>> Is that what you're observing?
> > > >>>
> > > >>> If so, I doubt there's anything to fix other than getting rid
> > > >>> of "pci=nommconf".    
> > > >>
> > > >> I haven't tested with VMD disabled earlier. I verified it and
> > > >> my observations are as follows:
> > > >>
> > > >> OS: RHEL 8.4
> > > >> NO - hotplug not working
> > > >> YES - hotplug working
> > > >>
> > > >> pci=nommconf added:
> > > >> +--------------+-------------------+---------------------+--------------+
> > > >> |              | pci-v5.17-changes | revert-04b12ef163d1 |
> > > >> inbox kernel
> > > >> +--------------+-------------------+---------------------+--------------+
> > > >> | VMD enabled  | NO                | YES                 | YES
> > > >> +--------------+-------------------+---------------------+--------------+
> > > >> | VMD disabled | NO                | NO                  | NO
> > > >> +--------------+-------------------+---------------------+--------------+
> > > >>  
> 
> OK, so the only possible problem case is that booting with VMD enabled
> and "pci=nommconf".  In that case, hotplug for devices below VMD
> worked before 04b12ef163d1 and doesn't work after.
> 
> Your table doesn't show it, but hotplug for devices *not* behind VMD
> should not work either before or after 04b12ef163d1 because Linux
> doesn't request PCIe hotplug control when booting with "pci=nommconf".
> 
> Why were you testing with "pci=nommconf"?  Do you think anybody uses
> that with VMD and NVMe?

It was added a long time ago, when it was useful.

On our side we will drop the parameter, and that resolves the issue for us.
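
(Dropping it amounts to removing "pci=nommconf" from the kernel command
line; the commands below are one way to do that on RHEL and are only an
example:)

  cat /proc/cmdline                    # confirm the parameter is present
  # remove it from all installed kernels:
  grubby --update-kernel=ALL --remove-args="pci=nommconf"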

The bugzilla report can be closed if you don't consider it a regression.

Thanks,
Blazej


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Bug 215525] New: HotPlug does not work on upstream kernel 5.17.0-rc1
  2022-02-09 13:41                         ` Blazej Kucman
@ 2022-02-09 21:02                           ` Bjorn Helgaas
  2022-02-10 11:14                             ` Blazej Kucman
  0 siblings, 1 reply; 20+ messages in thread
From: Bjorn Helgaas @ 2022-02-09 21:02 UTC (permalink / raw)
  To: Blazej Kucman
  Cc: Nirmal Patel, Kai-Heng Feng, Mariusz Tkaczyk, linux-pci,
	Blazej Kucman, Hans de Goede, Lukas Wunner, Naveen Naidu,
	Keith Busch, Jonathan Derrick

On Wed, Feb 09, 2022 at 02:41:02PM +0100, Blazej Kucman wrote:
> On Thu, 3 Feb 2022 09:58:04 -0600
> Bjorn Helgaas <helgaas@kernel.org> wrote:

> > Why were you testing with "pci=nommconf"?  Do you think anybody uses
> > that with VMD and NVMe?
> 
> It was added long time ago when it was useful.

I'm curious about why it was useful.  It suggests a possible MMCONFIG
issue in firmware or in Linux.  If it was a Linux issue, ideally we
would fix that.  If it's a firmware issue, ideally we would work
around it or automatically turn on "pci=nommconf" so the user wouldn't
have to figure that out.
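
(If it ever needs to be debugged, a first step would be to check how
MMCONFIG/ECAM was set up on the affected platform, for example:)

  dmesg | grep -i -e MMCONFIG -e ECAM
  grep MMCONFIG /proc/iomem     # the ECAM region should be listed and reserved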

> Bugzilla report can be closed if you don't consider it as regression.

OK, I closed it with the details.  I'm not entirely convinced that
04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features") was the
right thing, but I'll pursue that elsewhere.

Thanks for your patience in working through this, and sorry for the
hassle it caused you.

Bjorn

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Bug 215525] New: HotPlug does not work on upstream kernel 5.17.0-rc1
  2022-02-09 21:02                           ` Bjorn Helgaas
@ 2022-02-10 11:14                             ` Blazej Kucman
  0 siblings, 0 replies; 20+ messages in thread
From: Blazej Kucman @ 2022-02-10 11:14 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Nirmal Patel, Kai-Heng Feng, Mariusz Tkaczyk, linux-pci,
	Blazej Kucman, Hans de Goede, Lukas Wunner, Naveen Naidu,
	Keith Busch, Jonathan Derrick

On Wed, 9 Feb 2022 15:02:18 -0600
Bjorn Helgaas <helgaas@kernel.org> wrote:

> On Wed, Feb 09, 2022 at 02:41:02PM +0100, Blazej Kucman wrote:
> > On Thu, 3 Feb 2022 09:58:04 -0600
> > Bjorn Helgaas <helgaas@kernel.org> wrote:  
> 
> > > Why were you testing with "pci=nommconf"?  Do you think anybody
> > > uses that with VMD and NVMe?  
> > 
> > It was added long time ago when it was useful.  
> 
> I'm curious about why it was useful.  It suggests a possible MMCONFIG
> issue in firmware or in Linux.  If it was a Linux issue, ideally we
> would fix that.  If it's a firmware issue, ideally we would work
> around it or automatically turn on "pci=nommconf" so the user wouldn't
> have to figure that out.
> 

The parameter was added many years ago (the project is more than 10
years old) and I don't know the reason. It was probably useful on
pre-production platforms, for software validation. It was added to our
internal BKMs (best known methods). It hadn't caused any real issue in
our workflow until now.

Unfortunately, I'm unable to answer more precisely.
We have scheduled steps to remove it permanently from our work
environment.

> > Bugzilla report can be closed if you don't consider it as
> > regression.  
> 
> OK, I closed it with the details.  I'm not entirely convinced that
> 04b12ef163d1 ("PCI: vmd: Honor ACPI _OSC on PCIe features") was the
> right thing, but I'll pursue that elsewhere.
> 
> Thanks for your patience in working through this, and sorry for the
> hassle it caused you.
> 
> Bjorn

Thanks,
Blazej

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2022-02-10 11:14 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <bug-215525-41252@https.bugzilla.kernel.org/>
2022-01-24 21:46 ` [Bug 215525] New: HotPlug does not work on upstream kernel 5.17.0-rc1 Bjorn Helgaas
2022-01-25  8:58   ` Hans de Goede
2022-01-25 15:33   ` Lukas Wunner
2022-01-26  7:31   ` Thorsten Leemhuis
2022-01-27 14:46   ` Mariusz Tkaczyk
2022-01-27 20:47     ` Jonathan Derrick
2022-01-27 22:31     ` Jonathan Derrick
2022-01-28  2:52     ` Bjorn Helgaas
2022-01-28  8:29       ` Mariusz Tkaczyk
2022-01-28 13:08         ` Bjorn Helgaas
2022-01-28 13:49           ` Kai-Heng Feng
2022-01-28 14:03             ` Bjorn Helgaas
2022-02-02 15:48               ` Blazej Kucman
2022-02-02 16:43                 ` Bjorn Helgaas
2022-02-03  9:13                   ` Thorsten Leemhuis
2022-02-03 10:47                     ` Blazej Kucman
2022-02-03 15:58                       ` Bjorn Helgaas
2022-02-09 13:41                         ` Blazej Kucman
2022-02-09 21:02                           ` Bjorn Helgaas
2022-02-10 11:14                             ` Blazej Kucman
