linux-pci.vger.kernel.org archive mirror
* PCIe hotplug resource issues with PEX switch (NVMe disks) on AMD Epyc system
@ 2019-12-13  8:35 Stefan Roese
  2019-12-13  9:00 ` Nicholas Johnson
  2019-12-16 23:37 ` Bjorn Helgaas
  0 siblings, 2 replies; 14+ messages in thread
From: Stefan Roese @ 2019-12-13  8:35 UTC (permalink / raw)
  To: linux-pci, Bjorn Helgaas, Mika Westerberg, Lukas Wunner,
	Sergey Miroshnichenko, Nicholas Johnson

Hi!

I am facing an issue with PCIe hotplug on an AMD Epyc-based system.
Our system is equipped with an HBA for NVMe SSDs that includes a PCIe
switch (Supermicro AOC-SLG3-4E2P) [1], and we would like to be able to
hot-plug NVMe disks.

Currently, I'm testing with v5.5.0-rc1 and series [2] applied. Here are
a few tests and the results so far. All tests were done with one Intel
NVMe SSD connected to one of the 4 NVMe ports of the HBA and the other
3 ports (currently) left unconnected:

a) Kernel Parameter "pci=pcie_bus_safe"
The resources of the 3 unused PCIe slots of the PEX switch are not
assigned in this test.

b) Kernel Parameter "pci=pcie_bus_safe,hpmemsize=0,hpiosize=0,hpmmiosize=1M,hpmmioprefsize=0"
With this test I restricted the resources of the HP slots to the
minimum. Still, this results in unassigned resources for the unused
PCIe slots of the PEX switch.

c) Kernel Parameter "pci=realloc,pcie_bus_safe,hpmemsize=0,hpiosize=0,hpmmiosize=1M,hpmmioprefsize=0"
Again, not all resources are assigned.

d) Kernel Parameter "pci=nocrs,realloc,pcie_bus_safe,hpmemsize=0,hpiosize=0,hpmmiosize=1M,hpmmioprefsize=0"
Now all requested resources are available for the HP PCIe slots of the
PEX switch, but the NVMe driver fails while probing. Debugging has
shown that reading from the BAR of the NVMe disk returns 0xffffffff.
Reading from the PLX PEX switch registers also returns 0xffffffff in
this case (this works, of course, without nocrs, when the BARs are
mapped at a different address).

Does anybody have a clue why access to the PEX switch and/or the NVMe
BAR does not work in the "nocrs" case? The BARs are located in the same
window that the BIOS provides in the ACPI list (but which is "ignored"
in this case) [3].

Or is it possible to get the HP resource mapping done correctly without
setting "nocrs" for our setup with the PCIe/NVMe switch?

I can provide all sorts of logs (dmesg, lspci, etc.) if needed - just
let me know.

Many thanks in advance,
Stefan

[1] https://www.supermicro.com/en/products/accessories/addon/AOC-SLG3-4E2P.php
[2] https://lkml.org/lkml/2019/12/9/388
[3]
[    0.701932] acpi PNP0A08:00: host bridge window [io  0x0cf8-0x0cff] (ignored)
[    0.701934] acpi PNP0A08:00: host bridge window [io  0x0000-0x02ff window] (ignored)
[    0.701935] acpi PNP0A08:00: host bridge window [io  0x0300-0x03af window] (ignored)
[    0.701936] acpi PNP0A08:00: host bridge window [io  0x03e0-0x0cf7 window] (ignored)
[    0.701937] acpi PNP0A08:00: host bridge window [io  0x03b0-0x03df window] (ignored)
[    0.701938] acpi PNP0A08:00: host bridge window [io  0x0d00-0x3fff window] (ignored)
[    0.701939] acpi PNP0A08:00: host bridge window [mem 0x000a0000-0x000bffff window] (ignored)
[    0.701939] acpi PNP0A08:00: host bridge window [mem 0x000c0000-0x000dffff window] (ignored)
[    0.701940] acpi PNP0A08:00: host bridge window [mem 0xec000000-0xefffffff window] (ignored)
[    0.701941] acpi PNP0A08:00: host bridge window [mem 0x182c8000000-0x1ffffffffff window] (ignored)
...
41:00.0 PCI bridge: PLX Technology, Inc. PEX 9733 33-lane, 9-port PCI Express Gen 3 (8.0 GT/s) Switch (rev b0) (prog-if 00 [Normal decode])
         Flags: bus master, fast devsel, latency 0, IRQ 47, NUMA node 2
         Memory at ec400000 (32-bit, non-prefetchable) [size=256K]
         Bus: primary=41, secondary=42, subordinate=47, sec-latency=0
         I/O behind bridge: None
         Memory behind bridge: ec000000-ec3fffff [size=4M]
         Prefetchable memory behind bridge: None
         Capabilities: <access denied>
         Kernel driver in use: pcieport
epyc@epyc-Super-Server:~/stefan$ sudo ./memtool md 0xec400000+0x10
ec400000: ffffffff ffffffff ffffffff ffffffff                ................
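
For reference, here is a minimal sketch of the kind of raw read a tool
like memtool performs (an illustration only, not memtool itself; it
assumes /dev/mem permits mapping this range, e.g. a kernel without
CONFIG_STRICT_DEVMEM restrictions):

/* Map a physical address and dump 32-bit words, roughly what
 * "memtool md 0xec400000+0x10" does. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	off_t phys = 0xec400000UL;	/* PLX upstream port BAR 0 */
	size_t len = 0x10;
	int fd = open("/dev/mem", O_RDONLY | O_SYNC);

	if (fd < 0) {
		perror("open /dev/mem");
		return 1;
	}

	volatile uint32_t *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, phys);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	for (size_t i = 0; i < len / 4; i++)
		printf("%08lx: %08x\n", (unsigned long)phys + 4 * i, p[i]);

	munmap((void *)p, len);
	close(fd);
	return 0;
}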

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: PCIe hotplug resource issues with PEX switch (NVMe disks) on AMD Epyc system
  2019-12-13  8:35 PCIe hotplug resource issues with PEX switch (NVMe disks) on AMD Epyc system Stefan Roese
@ 2019-12-13  9:00 ` Nicholas Johnson
       [not found]   ` <c9f154e5-4214-aa46-2ce2-443b508e1643@denx.de>
  2019-12-16 23:37 ` Bjorn Helgaas
  1 sibling, 1 reply; 14+ messages in thread
From: Nicholas Johnson @ 2019-12-13  9:00 UTC (permalink / raw)
  To: Stefan Roese
  Cc: linux-pci, Bjorn Helgaas, Mika Westerberg, Lukas Wunner,
	Sergey Miroshnichenko

On Fri, Dec 13, 2019 at 09:35:19AM +0100, Stefan Roese wrote:
> Hi!
Hi,
> 
> I am facing an issue with PCIe-Hotplug on an AMD Epyc based system.
> Our system is equipped with an HBA for NVMe SSDs incl. PCIe switch
> (Supermicro AOC-SLG3-4E2P) [1] and we would like to be able to hotplug
> NVMe disks.
> 
> Currently, I'm testing with v5.5.0-rc1 and series [2] applied. Here
> a few tests and results that I did so far. All tests were done with
> one Intel NVMe SSD connected to one of the 4 NVMe ports of the HBA
> and the other 3 ports (currently) left unconnected:
> 
> a) Kernel Parameter "pci=pcie_bus_safe"
> The resources of the 3 unused PCIe slots of the PEX switch are not
> assigned in this test.
> 
> b) Kernel Parameter "pci=pcie_bus_safe,hpmemsize=0,hpiosize=0,hpmmiosize=1M,hpmmioprefsize=0"
> With this test I restricted the resources of the HP slots to the
> minimum. Still this results in unassigned resourced for the unused
> PCIe slots of the PEX switch.
> 
> c) Kernel Parameter "pci=realloc,pcie_bus_safe,hpmemsize=0,hpiosize=0,hpmmiosize=1M,hpmmioprefsize=0"
> Again, not all resources are assigned.
> 
> d) Kernel Parameter "pci=nocrs,realloc,pcie_bus_safe,hpmemsize=0,hpiosize=0,hpmmiosize=1M,hpmmioprefsize=0"
> Now all requested resources are available for the HP PCIe slots of the
> PEX switch. But the NVMe driver fails while probing. Debugging has
> shown, that reading from the BAR of the NVMe disk returns 0xffffffff.
> Also reading from the PLX PEX switch registers returns 0xfffffff in this
> case (this works of course without nocrs, when the BARs are mapped at
> a different address).
> 
> Does anybody have a clue on why the access to the PEX switch and / or
> the NVMe BAR does not work in the "nocrs" case? The BARs are located in
> the same window that is provided by the BIOS in the ACPI list (but is
> "ignored" in this case) [3].
> 
> Or if it is possible to get the HP resource mapping done correctly without
> setting "nocrs" for our setup with the PCIe/NVMe switch?
> 
> I can provide all sorts of logs (dmegs, lspci etc) if needed - just let
> me know.
> 
> Many thanks in advance,
> Stefan
This will be a quick response for now. I will go into more depth 
tonight when I have more time.

What I have taken away from this is:

1. Epyc -> up to 4x PCIe root complexes, but from what I can gather, 
they are probably assigned to the same segment / domain, with 
non-overlapping bus numbers. Either way, multiple RCs may complicate 
using pci=nocrs and similar options. Unfortunately, I have not had the 
privilege of owning a system with multiple RCs, so I cannot be sure.

2. Not using Thunderbolt - the [2] patch series only really makes a 
difference with nested hotplug bridges, such as in Thunderbolt. It 
might help by not using additional resource lists, but I still do not 
think it will matter without nested hotplug bridges.

3. System not reallocating resources despite being told to -> is the 
ACPI _DSM method evaluating to zero? I experienced this recently with 
an Intel Ice Lake system. I booted the laptop at the retail store into 
Linux off a USB stick to find out about the Thunderbolt implementation. 
I dumped "sudo lspci -xxxx" and dmesg and analysed the results at home. 
I noticed it did not override the resources, and from examining the 
source code, it likely evaluated _DSM to 0, which may have overridden 
pci=realloc. Try modifying the source code to unconditionally apply 
realloc in drivers/pci/setup-bus.c (see the sketch after point 5) and 
see what happens. I have not bothered doing this myself and going back 
to the store to test this hypothesis.

4. It would be helpful if you attached the full dmesg and "sudo lspci 
-xxxx" output, which dumps the full PCI config space and allows us to 
run any lspci query from the file as if we were on your system. I will 
be able to tell a lot more after seeing that. Possibly do one run with 
no kernel parameters, and another set of results with all of the kernel 
parameters. Use hpmmiosize=64M and hpmmioprefsize=1G for it to be 
noticeable, I reckon. This will answer questions I have about which 
ports are hotplug bridges, among other things.

5. There is a good chance it will not even boot with acpi=off on 
kernels since around ~v5.3, but it is worth a shot there, also. Since a 
recent kernel, I have found that acpi=off only removes HyperThreading, 
and not all the physical cores like it used to. So there must have been 
a patch which allowed it to guess the MADT table information. I have 
not investigated. But now, some of my computers crash upon loading the 
kernel with acpi=off. It must get it wrong at times. What about 
pci=noacpi instead?
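
As a concrete illustration of point 3, here is a minimal sketch of what 
"unconditionally apply realloc" could look like (the helper name is 
from the v5.5-era drivers/pci/setup-bus.c and should be treated as an 
assumption; this is for debugging only, not a fix):

/* drivers/pci/setup-bus.c: force reallocation regardless of the
 * pci=realloc option or any firmware hint. */
static bool pci_realloc_enabled(enum enable_type enable)
{
	return true;	/* was: return enable >= user_enabled; */
}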

Sorry if I missed something you said.

Best of luck, and I am interested in looking into this further. :)

Kind regards,
Nicholas Johnson

> 
> [1] https://www.supermicro.com/en/products/accessories/addon/AOC-SLG3-4E2P.php
> [2] https://lkml.org/lkml/2019/12/9/388
> [3]
> [    0.701932] acpi PNP0A08:00: host bridge window [io  0x0cf8-0x0cff] (ignored)
> [    0.701934] acpi PNP0A08:00: host bridge window [io  0x0000-0x02ff window] (ignored)
> [    0.701935] acpi PNP0A08:00: host bridge window [io  0x0300-0x03af window] (ignored)
> [    0.701936] acpi PNP0A08:00: host bridge window [io  0x03e0-0x0cf7 window] (ignored)
> [    0.701937] acpi PNP0A08:00: host bridge window [io  0x03b0-0x03df window] (ignored)
> [    0.701938] acpi PNP0A08:00: host bridge window [io  0x0d00-0x3fff window] (ignored)
> [    0.701939] acpi PNP0A08:00: host bridge window [mem 0x000a0000-0x000bffff window] (ignored)
> [    0.701939] acpi PNP0A08:00: host bridge window [mem 0x000c0000-0x000dffff window] (ignored)
> [    0.701940] acpi PNP0A08:00: host bridge window [mem 0xec000000-0xefffffff window] (ignored)
> [    0.701941] acpi PNP0A08:00: host bridge window [mem 0x182c8000000-0x1ffffffffff window] (ignored)
> ...
> 41:00.0 PCI bridge: PLX Technology, Inc. PEX 9733 33-lane, 9-port PCI Express Gen 3 (8.0 GT/s) Switch (rev b0) (prog-if 00 [Normal decode])
>         Flags: bus master, fast devsel, latency 0, IRQ 47, NUMA node 2
>         Memory at ec400000 (32-bit, non-prefetchable) [size=256K]
>         Bus: primary=41, secondary=42, subordinate=47, sec-latency=0
>         I/O behind bridge: None
>         Memory behind bridge: ec000000-ec3fffff [size=4M]
>         Prefetchable memory behind bridge: None
>         Capabilities: <access denied>
>         Kernel driver in use: pcieport
> epyc@epyc-Super-Server:~/stefan$ sudo ./memtool md 0xec400000+0x10
> ec400000: ffffffff ffffffff ffffffff ffffffff                ................

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: PCIe hotplug resource issues with PEX switch (NVMe disks) on AMD Epyc system
       [not found]   ` <c9f154e5-4214-aa46-2ce2-443b508e1643@denx.de>
@ 2019-12-13 11:52     ` Nicholas Johnson
  2019-12-13 12:17       ` Stefan Roese
  2019-12-16  0:46       ` Bjorn Helgaas
  0 siblings, 2 replies; 14+ messages in thread
From: Nicholas Johnson @ 2019-12-13 11:52 UTC (permalink / raw)
  To: Stefan Roese
  Cc: linux-pci, Bjorn Helgaas, Mika Westerberg, Lukas Wunner,
	Sergey Miroshnichenko

On Fri, Dec 13, 2019 at 11:58:53AM +0100, Stefan Roese wrote:
> Hi Nicholas,
> 
> On 13.12.19 10:00, Nicholas Johnson wrote:
> > On Fri, Dec 13, 2019 at 09:35:19AM +0100, Stefan Roese wrote:
> > > Hi!
> > Hi,
> > > 
> > > I am facing an issue with PCIe-Hotplug on an AMD Epyc based system.
> > > Our system is equipped with an HBA for NVMe SSDs incl. PCIe switch
> > > (Supermicro AOC-SLG3-4E2P) [1] and we would like to be able to hotplug
> > > NVMe disks.
> > > 
> > > Currently, I'm testing with v5.5.0-rc1 and series [2] applied. Here
> > > a few tests and results that I did so far. All tests were done with
> > > one Intel NVMe SSD connected to one of the 4 NVMe ports of the HBA
> > > and the other 3 ports (currently) left unconnected:
> > > 
> > > a) Kernel Parameter "pci=pcie_bus_safe"
> > > The resources of the 3 unused PCIe slots of the PEX switch are not
> > > assigned in this test.
> > > 
> > > b) Kernel Parameter "pci=pcie_bus_safe,hpmemsize=0,hpiosize=0,hpmmiosize=1M,hpmmioprefsize=0"
> > > With this test I restricted the resources of the HP slots to the
> > > minimum. Still this results in unassigned resourced for the unused
> > > PCIe slots of the PEX switch.
> > > 
> > > c) Kernel Parameter "pci=realloc,pcie_bus_safe,hpmemsize=0,hpiosize=0,hpmmiosize=1M,hpmmioprefsize=0"
> > > Again, not all resources are assigned.
> > > 
> > > d) Kernel Parameter "pci=nocrs,realloc,pcie_bus_safe,hpmemsize=0,hpiosize=0,hpmmiosize=1M,hpmmioprefsize=0"
> > > Now all requested resources are available for the HP PCIe slots of the
> > > PEX switch. But the NVMe driver fails while probing. Debugging has
> > > shown, that reading from the BAR of the NVMe disk returns 0xffffffff.
> > > Also reading from the PLX PEX switch registers returns 0xfffffff in this
> > > case (this works of course without nocrs, when the BARs are mapped at
> > > a different address).
> > > 
> > > Does anybody have a clue on why the access to the PEX switch and / or
> > > the NVMe BAR does not work in the "nocrs" case? The BARs are located in
> > > the same window that is provided by the BIOS in the ACPI list (but is
> > > "ignored" in this case) [3].
> > > 
> > > Or if it is possible to get the HP resource mapping done correctly without
> > > setting "nocrs" for our setup with the PCIe/NVMe switch?
> > > 
> > > I can provide all sorts of logs (dmegs, lspci etc) if needed - just let
> > > me know.
> > > 
> > > Many thanks in advance,
> > > Stefan
> > This will be a quick response for now. I will get more in depth tonight
> > when I have more time.
> > 
> > What I have taken away from this is:
> > 
> > 1. Epyc -> Up to 4x PCIe Root Complexes, but from what I can gather,
> > they are probably assigned on the same segment / domain, unfortunately,
> > with non-overlapping bus numbers. Either way, multiple RCs may
> > complicate using pci=nocrs and others. Unfortunately, I have not had the
> > privilege of owning a system with multiple RCs, so I cannot be sure.
> > 
> > 2. Not using Thunderbolt - [2] patch series only really makes a
> > difference with nested hotplug bridges, such as in Thunderbolt.
> > Although, it might help by not using additional resource lists, but I
> > still do not think it will matter without nested hotplug bridges.
> 
> I was not sure about those patches but since they have been queued for
> 5.6, I included them in these tests. The results are similar (or even
> identical, I would need to re-run the test to be sure) without them.
> > 3. System not reallocating resources despite overridden -> is ACPI _DSM
> > method evaluating to zero?
> 
> Not sure if I follow you here. The kernel is reallocating the resources, or
> at least trying to, if requested to via bootargs (Tests c) and d)). I've
> attached the logs from all 4 tests in an archive [1]. It just fails to
> reallocate the resources in test case c) and even though it successfully
> reallocates the resources in test case d), the new addresses at the PEX
> switch and its ports "don't work".
It is unlikely to be the issue, but I thought it was worth a mention.

> 
> > I experienced this recently with an Intel Ice
> > Lake system. I booted the laptop at the retail store into Linux off a
> > USB to find out about the Thunderbolt implementation. I dumped "sudo
> > lspci -xxxx" and dmesg and analysed the results at home.
> 
> Very brave. ;)
It's a retail store with display models for people to play with. If I do 
not damage it (or pay for any damage caused) then I do not have anything 
to be afraid of.

> 
> > I noticed it
> > did not override the resources, and from examining the source code, it
> > likely evaluated _DSM to 0, which may have overridden pci=realloc. Try
> > modifying the source code to unconditionally apply realloc in
> > drivers/pci/setup-bus.c and see what happens. I have not bothered doing
> > this myself and going back to the store to try to test this hypothesis.
> 
> realloc is enabled via boot args and active in the kernel as you can see
> from the dmesg log [2].
> > 4. It would be helpful if you attached full dmesg and "sudo lspci -xxxx"
> > which dumps full PCI config, allowing us to run any lspci query as if we
> > were on your system, from the file. I will be able to tell a lot more
> > after seeing that. Possibly do one with no kernel parameters, and do
> > another set of results with all of the kernel parameters. Use
> > hpmmiosize=64M and hpmmioprefsize=1G for it to be noticeable, I reckon.
> > But this will answer questions I have about which ports are hotplug
> > bridges and other things.
> 
> Okay, I added the following test cases:
> 
> e) Kernel Parameter ""
> f) Kernel Parameter "pci=nocrs,realloc,hpmmiosize=64M,hpmmioprefsize=1G"
> 
> The logs are also included. Please let me know, if I should do any other
> tests and provide the logs.
> 
> > 5. There is a good chance it will not even boot since kernel since
> > around ~v5.3 with acpi=off but it is worth a shot there, also. Since a
> > recent kernel, I have found that acpi=off only removes HyperThreading,
> > and not all the physical cores like it used to. So there must have been
> > a patch which allowed it to guess the MADT table information. I have not
> > investigated. But now, some of my computers crash upon loading the
> > kernel with acpi=off. It must get it wrong at times.
> 
> Booting this 5.5 kernel with "acpi=off" increases the bootup time quite
> a bit. The resources are distributed behind the PLX switch (similar to
> using "pci=nocrs" but again accessing the BARs doesn't work (0xffffffff
> is read back).
It was only to see if ACPI was part of the issue. You would not run in 
production with it off.

> 
> > What about
> > pci=noacpi instead?
> 
> I also tested using pci=noacpi and it did not resolve the resource
> mapping problems.
> > Sorry if I missed something you said.
> > 
> > Best of luck, and I am interested into looking into this further. :)
> 
> Very much appreciated. :)
> 
> Thanks,
> Stefan
> 
> [1] logs.tar.bz2
> [2] 5.5.0-rc1-custom-test-c/dmesg.log

From the logs, it looks like MMIO_PREF was assigned 1G but not MMIO.

This looks tricky. Please revert my commit:
c13704f5685deb7d6eb21e293233e0901ed77377

And see if it is the problem. It is entirely possible, but because of 
the very old code and how there are multiple passes, it might be 
impossible to use realloc without side effects for somebody. If you fix 
it for one scenario, it is possible that there is another scenario for 
which it will break due to the change. The only way to make everything 
work is a near complete rewrite of drivers/pci/setup-bus.c and 
potentially others, something I am working on, but that is going to 
take a long time and is unlikely to ever be accepted.

Otherwise, it will take me a lot of grepping through dmesg to find the 
cause, which will take more time.

FYI, "lspci -vvv" is redundant because it can be produced from "lspci 
-xxxx" output.

A final note, Epyc CPUs can bifurcate x16 slots into x4/x4/x4/x4 in the 
BIOS setup, although you will probably not have the hotplug services 
provided by the PEX switch.

Kind regards,
Nicholas Johnson

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: PCIe hotplug resource issues with PEX switch (NVMe disks) on AMD Epyc system
  2019-12-13 11:52     ` Nicholas Johnson
@ 2019-12-13 12:17       ` Stefan Roese
  2019-12-15  3:16         ` Nicholas Johnson
  2019-12-16  0:46       ` Bjorn Helgaas
  1 sibling, 1 reply; 14+ messages in thread
From: Stefan Roese @ 2019-12-13 12:17 UTC (permalink / raw)
  To: Nicholas Johnson
  Cc: linux-pci, Bjorn Helgaas, Mika Westerberg, Lukas Wunner,
	Sergey Miroshnichenko

On 13.12.19 12:52, Nicholas Johnson wrote:
> On Fri, Dec 13, 2019 at 11:58:53AM +0100, Stefan Roese wrote:
>> Hi Nicholas,
>>
>> On 13.12.19 10:00, Nicholas Johnson wrote:
>>> On Fri, Dec 13, 2019 at 09:35:19AM +0100, Stefan Roese wrote:
>>>> Hi!
>>> Hi,
>>>>
>>>> I am facing an issue with PCIe-Hotplug on an AMD Epyc based system.
>>>> Our system is equipped with an HBA for NVMe SSDs incl. PCIe switch
>>>> (Supermicro AOC-SLG3-4E2P) [1] and we would like to be able to hotplug
>>>> NVMe disks.
>>>>
>>>> Currently, I'm testing with v5.5.0-rc1 and series [2] applied. Here
>>>> a few tests and results that I did so far. All tests were done with
>>>> one Intel NVMe SSD connected to one of the 4 NVMe ports of the HBA
>>>> and the other 3 ports (currently) left unconnected:
>>>>
>>>> a) Kernel Parameter "pci=pcie_bus_safe"
>>>> The resources of the 3 unused PCIe slots of the PEX switch are not
>>>> assigned in this test.
>>>>
>>>> b) Kernel Parameter "pci=pcie_bus_safe,hpmemsize=0,hpiosize=0,hpmmiosize=1M,hpmmioprefsize=0"
>>>> With this test I restricted the resources of the HP slots to the
>>>> minimum. Still this results in unassigned resourced for the unused
>>>> PCIe slots of the PEX switch.
>>>>
>>>> c) Kernel Parameter "pci=realloc,pcie_bus_safe,hpmemsize=0,hpiosize=0,hpmmiosize=1M,hpmmioprefsize=0"
>>>> Again, not all resources are assigned.
>>>>
>>>> d) Kernel Parameter "pci=nocrs,realloc,pcie_bus_safe,hpmemsize=0,hpiosize=0,hpmmiosize=1M,hpmmioprefsize=0"
>>>> Now all requested resources are available for the HP PCIe slots of the
>>>> PEX switch. But the NVMe driver fails while probing. Debugging has
>>>> shown, that reading from the BAR of the NVMe disk returns 0xffffffff.
>>>> Also reading from the PLX PEX switch registers returns 0xfffffff in this
>>>> case (this works of course without nocrs, when the BARs are mapped at
>>>> a different address).
>>>>
>>>> Does anybody have a clue on why the access to the PEX switch and / or
>>>> the NVMe BAR does not work in the "nocrs" case? The BARs are located in
>>>> the same window that is provided by the BIOS in the ACPI list (but is
>>>> "ignored" in this case) [3].
>>>>
>>>> Or if it is possible to get the HP resource mapping done correctly without
>>>> setting "nocrs" for our setup with the PCIe/NVMe switch?
>>>>
>>>> I can provide all sorts of logs (dmegs, lspci etc) if needed - just let
>>>> me know.
>>>>
>>>> Many thanks in advance,
>>>> Stefan
>>> This will be a quick response for now. I will get more in depth tonight
>>> when I have more time.
>>>
>>> What I have taken away from this is:
>>>
>>> 1. Epyc -> Up to 4x PCIe Root Complexes, but from what I can gather,
>>> they are probably assigned on the same segment / domain, unfortunately,
>>> with non-overlapping bus numbers. Either way, multiple RCs may
>>> complicate using pci=nocrs and others. Unfortunately, I have not had the
>>> privilege of owning a system with multiple RCs, so I cannot be sure.
>>>
>>> 2. Not using Thunderbolt - [2] patch series only really makes a
>>> difference with nested hotplug bridges, such as in Thunderbolt.
>>> Although, it might help by not using additional resource lists, but I
>>> still do not think it will matter without nested hotplug bridges.
>>
>> I was not sure about those patches but since they have been queued for
>> 5.6, I included them in these tests. The results are similar (or even
>> identical, I would need to re-run the test to be sure) without them.
>>> 3. System not reallocating resources despite overridden -> is ACPI _DSM
>>> method evaluating to zero?
>>
>> Not sure if I follow you here. The kernel is reallocating the resources, or
>> at least trying to, if requested to via bootargs (Tests c) and d)). I've
>> attached the logs from all 4 tests in an archive [1]. It just fails to
>> reallocate the resources in test case c) and even though it successfully
>> reallocates the resources in test case d), the new addresses at the PEX
>> switch and its ports "don't work".
> It is unlikely to be the issue, but I thought it was worth a mention.
> 
>>
>>> I experienced this recently with an Intel Ice
>>> Lake system. I booted the laptop at the retail store into Linux off a
>>> USB to find out about the Thunderbolt implementation. I dumped "sudo
>>> lspci -xxxx" and dmesg and analysed the results at home.
>>
>> Very brave. ;)
> It's a retail store with display models for people to play with. If I do
> not damage it (or pay for any damage caused) then I do not have anything
> to be afraid of.

Sure. I was referring to you being "brave" enough to do all this
analysing / debugging without having the system at hand for any further
tests. ;)
  
>>
>>> I noticed it
>>> did not override the resources, and from examining the source code, it
>>> likely evaluated _DSM to 0, which may have overridden pci=realloc. Try
>>> modifying the source code to unconditionally apply realloc in
>>> drivers/pci/setup-bus.c and see what happens. I have not bothered doing
>>> this myself and going back to the store to try to test this hypothesis.
>>
>> realloc is enabled via boot args and active in the kernel as you can see
>> from the dmesg log [2].
>>> 4. It would be helpful if you attached full dmesg and "sudo lspci -xxxx"
>>> which dumps full PCI config, allowing us to run any lspci query as if we
>>> were on your system, from the file. I will be able to tell a lot more
>>> after seeing that. Possibly do one with no kernel parameters, and do
>>> another set of results with all of the kernel parameters. Use
>>> hpmmiosize=64M and hpmmioprefsize=1G for it to be noticeable, I reckon.
>>> But this will answer questions I have about which ports are hotplug
>>> bridges and other things.
>>
>> Okay, I added the following test cases:
>>
>> e) Kernel Parameter ""
>> f) Kernel Parameter "pci=nocrs,realloc,hpmmiosize=64M,hpmmioprefsize=1G"
>>
>> The logs are also included. Please let me know, if I should do any other
>> tests and provide the logs.
>>
>>> 5. There is a good chance it will not even boot since kernel since
>>> around ~v5.3 with acpi=off but it is worth a shot there, also. Since a
>>> recent kernel, I have found that acpi=off only removes HyperThreading,
>>> and not all the physical cores like it used to. So there must have been
>>> a patch which allowed it to guess the MADT table information. I have not
>>> investigated. But now, some of my computers crash upon loading the
>>> kernel with acpi=off. It must get it wrong at times.
>>
>> Booting this 5.5 kernel with "acpi=off" increases the bootup time quite
>> a bit. The resources are distributed behind the PLX switch (similar to
>> using "pci=nocrs" but again accessing the BARs doesn't work (0xffffffff
>> is read back).
> It was only to see if ACPI was part of the issue. You would not run in
> production with it off.
> 
>>
>>> What about
>>> pci=noacpi instead?
>>
>> I also tested using pci=noacpi and it did not resolve the resource
>> mapping problems.
>>> Sorry if I missed something you said.
>>>
>>> Best of luck, and I am interested into looking into this further. :)
>>
>> Very much appreciated. :)
>>
>> Thanks,
>> Stefan
>>
>> [1] logs.tar.bz2
>> [2] 5.5.0-rc1-custom-test-c/dmesg.log
> 
>  From the logs, it looks like MMIO_PREF was assigned 1G but not MMIO.
> 
> This looks tricky. Please revert my commit:
> c13704f5685deb7d6eb21e293233e0901ed77377
> 
> And see if it is the problem.

I reverted this patch and did a few tests (some of my test cases). None
turned out differently than before. Either the resources are not mapped
completely, or they are mapped (with pci=nocrs) but not accessible.

> It is entirely possible, but because of
> the very old code and how there are multiple passes, it might be
> impossible to use realloc without side effects for somebody. If you fix
> it for one scenario, it is possible that there is another scenario for
> which it will break due to the change. The only way to make everything
> work is a near complete rewrite of drivers/pci/setup-bus.c and
> potentially others, something I am working on, but is going to take a
> long time. And unlikely to ever be accepted.

While working on this issue, I looked (again) at this resource
(re-)allocation code. It is really confusing (at least to me), and I
also think that it needs a "near complete rewrite".
  
> Otherwise, it will take me a lot of grepping through dmesg to find the
> cause, which will take more time.

Sure.
  
> FYI, "lspci -vvv" is redundant because it can be produced from "lspci
> -xxxx" output.

I know. It's mainly so that I can quickly see the PCI devices listed.
  
> A final note, Epyc CPUs can bifurcate x16 slots into x4/x4/x4/x4 in the
> BIOS setup, although you will probably not have the hotplug services
> provided by the PEX switch.

I think it should not matter for my current resource-assignment tests
how many PCIe lanes the PEX switch has connected to the PCI root port.
It's of course important for bandwidth, but that is a completely
different issue.

Thanks,
Stefan

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: PCIe hotplug resource issues with PEX switch (NVMe disks) on AMD Epyc system
  2019-12-13 12:17       ` Stefan Roese
@ 2019-12-15  3:16         ` Nicholas Johnson
  2019-12-16  6:48           ` Stefan Roese
  0 siblings, 1 reply; 14+ messages in thread
From: Nicholas Johnson @ 2019-12-15  3:16 UTC (permalink / raw)
  To: Stefan Roese
  Cc: linux-pci, Bjorn Helgaas, Mika Westerberg, Lukas Wunner,
	Sergey Miroshnichenko

Hi,

> >  From the logs, it looks like MMIO_PREF was assigned 1G but not MMIO.
> > 
> > This looks tricky. Please revert my commit:
> > c13704f5685deb7d6eb21e293233e0901ed77377
> > 
> > And see if it is the problem.
> 
> I reverted this patch and did a few test (some of my test cases). None
> turned out differently than before. Either the resources are not mapped
> completely or they are mapped  (with pci=nocrs) and not accessible.
> 
> > It is entirely possible, but because of
> > the very old code and how there are multiple passes, it might be
> > impossible to use realloc without side effects for somebody. If you fix
> > it for one scenario, it is possible that there is another scenario for
> > which it will break due to the change. The only way to make everything
> > work is a near complete rewrite of drivers/pci/setup-bus.c and
> > potentially others, something I am working on, but is going to take a
> > long time. And unlikely to ever be accepted.
> 
> While working on this issue, I looked (again) at this resource (re-)
> allocation code. This is really confusing (at least to me) and I also think
> that it needs a "near complete rewrite".
> > Otherwise, it will take me a lot of grepping through dmesg to find the
> > cause, which will take more time.
> 
> Sure.
> > FYI, "lspci -vvv" is redundant because it can be produced from "lspci
> > -xxxx" output.
> 
> I know. Its mainly for me to easily see the PCI devices listed quickly.
> > A final note, Epyc CPUs can bifurcate x16 slots into x4/x4/x4/x4 in the
> > BIOS setup, although you will probably not have the hotplug services
> > provided by the PEX switch.
> 
> I think it should not matter for my current test with resource assignment,
> how many PCIe lanes the PEX switch has connected to the PCI root port. Its
> of course important for the bandwidth, but this is a completely different
> issue.
I meant that you can connect 4x NVMe drives to a PCIe x16 slot with a 
cheap passive bifurcation riser. But it sounds like this card is useful 
because of its hotplug support.

I noticed that if you grep some of your dmesg logs for "add_size", you 
have some lines like this:
[    0.767652] pci 0000:42:04.0: bridge window [mem 0x00100000-0x000fffff 64bit pref] to [bus 44] add_size 200000 add_align 100000

I am not sure if these are the cause or a symptom of the problem, but I 
do not have any when assigning MMIO and MMIO_PREF for Thunderbolt 3.

I noticed you are using pci=hpmemsize in some of the tests. It should 
not be interfering because you put it first (it is overwritten by 
hpmmiosize and hpmmioprefsize). But I should point out that 
pci=hpmemsize=X is equivalent to pci=hpmmiosize=X,hpmmioprefsize=X so it 
is redundant. When I added hpmmiosize and hpmmioprefsize parameters to 
control them independently, I would have liked to have dropped 
hpmemsize, but needed to leave it around to not disrupt people who are 
already using it.
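
To illustrate the equivalence, here is an excerpt-style sketch of the 
relevant branches of pci_setup() in drivers/pci/pci.c (v5.5-era; the 
pci_hotplug_mmio*_size variable names are from memory and should be 
treated as an assumption):

	} else if (!strncmp(str, "hpmemsize=", 10)) {
		/* hpmemsize=X sets both hotplug MMIO windows */
		pci_hotplug_mmio_size = memparse(str + 10, &str);
		pci_hotplug_mmio_pref_size = pci_hotplug_mmio_size;
	} else if (!strncmp(str, "hpmmiosize=", 11)) {
		/* hpmmiosize / hpmmioprefsize control them independently */
		pci_hotplug_mmio_size = memparse(str + 11, &str);
	} else if (!strncmp(str, "hpmmioprefsize=", 15)) {
		pci_hotplug_mmio_pref_size = memparse(str + 15, &str);
	}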

Please try something like this, which I dug up from a very old attempt 
to overhaul drivers/pci/setup-bus.c that I was working on. It will 
release all the boot resources before the initial allocation, and should 
give the system a chance to cleanly assign all resources on the first 
pass / try. The allocation code works well until you use more than one 
pass - then things get very hairy. I just applied it to mine, and now
everything is assigned on the first pass, with not a single failure.

diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index 22aed6cdb..befaef6a8 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -1822,8 +1822,16 @@ void pci_assign_unassigned_root_bus_resources(struct pci_bus *bus)
 void __init pci_assign_unassigned_resources(void)
 {
 	struct pci_bus *root_bus;
+	struct pci_dev *dev;
 
 	list_for_each_entry(root_bus, &pci_root_buses, node) {
+		for_each_pci_bridge(dev, root_bus) {
+			pci_bridge_release_resources(dev->subordinate, IORESOURCE_IO);
+			pci_bridge_release_resources(dev->subordinate, IORESOURCE_MEM);
+			pci_bridge_release_resources(dev->subordinate, IORESOURCE_MEM_64);
+			pci_bridge_release_resources(dev->subordinate, IORESOURCE_MEM_64 | IORESOURCE_PREFETCH);
+		}
+
 		pci_assign_unassigned_root_bus_resources(root_bus);
 
 		/* Make sure the root bridge has a companion ACPI device */

Kind regards,
Nicholas

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: PCIe hotplug resource issues with PEX switch (NVMe disks) on AMD Epyc system
  2019-12-13 11:52     ` Nicholas Johnson
  2019-12-13 12:17       ` Stefan Roese
@ 2019-12-16  0:46       ` Bjorn Helgaas
  2019-12-16  6:50         ` Stefan Roese
  1 sibling, 1 reply; 14+ messages in thread
From: Bjorn Helgaas @ 2019-12-16  0:46 UTC (permalink / raw)
  To: Nicholas Johnson
  Cc: Stefan Roese, linux-pci, Bjorn Helgaas, Mika Westerberg,
	Lukas Wunner, Sergey Miroshnichenko

> > The logs are also included. Please let me know, if I should do any other
> > tests and provide the logs.

Please include these logs in your mail to the list or post them
someplace where everybody can see them.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: PCIe hotplug resource issues with PEX switch (NVMe disks) on AMD Epyc system
  2019-12-15  3:16         ` Nicholas Johnson
@ 2019-12-16  6:48           ` Stefan Roese
  0 siblings, 0 replies; 14+ messages in thread
From: Stefan Roese @ 2019-12-16  6:48 UTC (permalink / raw)
  To: Nicholas Johnson
  Cc: linux-pci, Bjorn Helgaas, Mika Westerberg, Lukas Wunner,
	Sergey Miroshnichenko

Hi Nicholas,

On 15.12.19 04:16, Nicholas Johnson wrote:
> Hi,
> 
>>>   From the logs, it looks like MMIO_PREF was assigned 1G but not MMIO.
>>>
>>> This looks tricky. Please revert my commit:
>>> c13704f5685deb7d6eb21e293233e0901ed77377
>>>
>>> And see if it is the problem.
>>
>> I reverted this patch and did a few test (some of my test cases). None
>> turned out differently than before. Either the resources are not mapped
>> completely or they are mapped  (with pci=nocrs) and not accessible.
>>
>>> It is entirely possible, but because of
>>> the very old code and how there are multiple passes, it might be
>>> impossible to use realloc without side effects for somebody. If you fix
>>> it for one scenario, it is possible that there is another scenario for
>>> which it will break due to the change. The only way to make everything
>>> work is a near complete rewrite of drivers/pci/setup-bus.c and
>>> potentially others, something I am working on, but is going to take a
>>> long time. And unlikely to ever be accepted.
>>
>> While working on this issue, I looked (again) at this resource (re-)
>> allocation code. This is really confusing (at least to me) and I also think
>> that it needs a "near complete rewrite".
>>> Otherwise, it will take me a lot of grepping through dmesg to find the
>>> cause, which will take more time.
>>
>> Sure.
>>> FYI, "lspci -vvv" is redundant because it can be produced from "lspci
>>> -xxxx" output.
>>
>> I know. Its mainly for me to easily see the PCI devices listed quickly.
>>> A final note, Epyc CPUs can bifurcate x16 slots into x4/x4/x4/x4 in the
>>> BIOS setup, although you will probably not have the hotplug services
>>> provided by the PEX switch.
>>
>> I think it should not matter for my current test with resource assignment,
>> how many PCIe lanes the PEX switch has connected to the PCI root port. Its
>> of course important for the bandwidth, but this is a completely different
>> issue.
> I meant that you can connect 4x NVMe drives to a PCIe x16 slot with a
> cheap passive bifurcation riser. But it sounds like this card is useful
> because of its hotplug support.
> 
> I noticed if you grep your some of your dmesg logs for "add_size", you
> have some lines like this:
> [    0.767652] pci 0000:42:04.0: bridge window [mem 0x00100000-0x000fffff 64bit pref] to [bus 44] add_size 200000 add_align 100000
> 
> I am not sure if these are the cause or a symptom of the problem, but I
> do not have any when assigning MMIO and MMIO_PREF for Thunderbolt 3.
> 
> I noticed you are using pci=hpmemsize in some of the tests. It should
> not be interfering because you put it first (it is overwritten by
> hpmmiosize and hpmmioprefsize). But I should point out that
> pci=hpmemsize=X is equivalent to pci=hpmmiosize=X,hpmmioprefsize=X so it
> is redundant. When I added hpmmiosize and hpmmioprefsize parameters to
> control them independently, I would have liked to have dropped
> hpmemsize, but needed to leave it around to not disrupt people who are
> already using it.

Thanks. I was aware of keeping the old notation to not break backwards
compatibility. I'll drop hpmemsize=X from now on.
  
> Please try something like this, which I dug up from a very old attempt
> to overhaul drivers/pci/setup-bus.c that I was working on. It will
> release all the boot resources before the initial allocation, and should
> give the system a chance to cleanly assign all resources on the first
> pass / try. The allocation code works well until you use more than one
> pass - then things get very hairy. I just applied it to mine, and now
> everything applies the first pass, with not a single failure to assign.

Do you have some hot-plug enabled slots in your system?
  
> diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
> index 22aed6cdb..befaef6a8 100644
> --- a/drivers/pci/setup-bus.c
> +++ b/drivers/pci/setup-bus.c
> @@ -1822,8 +1822,16 @@ void pci_assign_unassigned_root_bus_resources(struct pci_bus *bus)
>   void __init pci_assign_unassigned_resources(void)
>   {
>   	struct pci_bus *root_bus;
> +	struct pci_dev *dev;
>   
>   	list_for_each_entry(root_bus, &pci_root_buses, node) {
> +		for_each_pci_bridge(dev, root_bus) {
> +			pci_bridge_release_resources(dev->subordinate, IORESOURCE_IO);
> +			pci_bridge_release_resources(dev->subordinate, IORESOURCE_MEM);
> +			pci_bridge_release_resources(dev->subordinate, IORESOURCE_MEM_64);
> +			pci_bridge_release_resources(dev->subordinate, IORESOURCE_MEM_64 | IORESOURCE_PREFETCH);
> +		}
> +
>   		pci_assign_unassigned_root_bus_resources(root_bus);
>   
>   		/* Make sure the root bridge has a companion ACPI device */

Thanks. I've applied this patch to my tree, without reverting c13704f5.

Which parameters should I pass to the kernel? I tested with a few
variants, and most are not able to mount the rootfs (most likely the
SATA controller is not probed correctly). Here is the log from one
variant that did boot to the prompt, but the resources are not mapped
and NVMe is not probed because of this:

Test g: pci=realloc,pcie_bus_safe
https://filebin.ca/55U8waihXJVI/logs.tar.bz2

Is there another parameter set that I should test? I can also provide
the logs of the failing boot tests, since I have connected a serial
console to the system. Just let me know.

Thanks,
Stefan

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: PCIe hotplug resource issues with PEX switch (NVMe disks) on AMD Epyc system
  2019-12-16  0:46       ` Bjorn Helgaas
@ 2019-12-16  6:50         ` Stefan Roese
  2019-12-16 15:50           ` Keith Busch
  0 siblings, 1 reply; 14+ messages in thread
From: Stefan Roese @ 2019-12-16  6:50 UTC (permalink / raw)
  To: bjorn, Nicholas Johnson
  Cc: linux-pci, Bjorn Helgaas, Mika Westerberg, Lukas Wunner,
	Sergey Miroshnichenko

On 16.12.19 01:46, Bjorn Helgaas wrote:
>>> The logs are also included. Please let me know, if I should do any other
>>> tests and provide the logs.
> 
> Please include these logs in your mail to the list or post them
> someplace where everybody can see them.

Gladly. Please find the archive here:

https://filebin.ca/55U8waihXJVI/logs.tar.bz2

Thanks,
Stefan

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: PCIe hotplug resource issues with PEX switch (NVMe disks) on AMD Epyc system
  2019-12-16  6:50         ` Stefan Roese
@ 2019-12-16 15:50           ` Keith Busch
       [not found]             ` <f3a51108-10e4-f60d-de18-a12de85d07df@denx.de>
  0 siblings, 1 reply; 14+ messages in thread
From: Keith Busch @ 2019-12-16 15:50 UTC (permalink / raw)
  To: Stefan Roese
  Cc: bjorn, Nicholas Johnson, linux-pci, Bjorn Helgaas,
	Mika Westerberg, Lukas Wunner, Sergey Miroshnichenko

On Mon, Dec 16, 2019 at 07:50:20AM +0100, Stefan Roese wrote:
> On 16.12.19 01:46, Bjorn Helgaas wrote:
> > > > The logs are also included. Please let me know, if I should do any other
> > > > tests and provide the logs.
> > 
> > Please include these logs in your mail to the list or post them
> > someplace where everybody can see them.
> 
> Gladly. Please find the archive here:
> 
> https://filebin.ca/55U8waihXJVI/logs.tar.bz2

I can't access that. Could you paste directly into the email? I'm just
looking for 'dmesg' and 'lspci -vvv' right now, so trim to that if your
full capture is too long.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: PCIe hotplug resource issues with PEX switch (NVMe disks) on AMD Epyc system
       [not found]             ` <f3a51108-10e4-f60d-de18-a12de85d07df@denx.de>
@ 2019-12-16 19:32               ` Keith Busch
  0 siblings, 0 replies; 14+ messages in thread
From: Keith Busch @ 2019-12-16 19:32 UTC (permalink / raw)
  To: Stefan Roese
  Cc: bjorn, Nicholas Johnson, linux-pci, Bjorn Helgaas,
	Mika Westerberg, Lukas Wunner, Sergey Miroshnichenko

On Mon, Dec 16, 2019 at 06:34:22PM +0100, Stefan Roese wrote:
> Hi Keith,
> 
> On 16.12.19 16:50, Keith Busch wrote:
> > On Mon, Dec 16, 2019 at 07:50:20AM +0100, Stefan Roese wrote:
> > > On 16.12.19 01:46, Bjorn Helgaas wrote:
> > > > > > The logs are also included. Please let me know, if I should do any other
> > > > > > tests and provide the logs.
> > > > 
> > > > Please include these logs in your mail to the list or post them
> > > > someplace where everybody can see them.
> > > 
> > > Gladly. Please find the archive here:
> > > 
> > > https://filebin.ca/55U8waihXJVI/logs.tar.bz2
> > 
> > I can't access that. Could you paste directly into the email? I'm just
> > looking for 'dmesg' and 'lspci -vvv' right now, so trim to that if your
> > full capture is too long.
> 
> Sure, here is a try with inline logs (stripped down a bit). I didn't include
> all test versions for now, since this increases the mail size even more. Only
> tests a) ... d) are inlined here:

I think your platform BIOS simply doesn't support it. It does not
provision empty slots on its own, and it doesn't tolerate the OS
reassigning resources to them from what appear to be unassigned memory
windows. The platform may be using those memory windows for something
outside the kernel's visibility.

What happens if you boot the system with all slots populated? Do all
devices configure in that case, and if so, can you hot-swap them?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: PCIe hotplug resource issues with PEX switch (NVMe disks) on AMD Epyc system
  2019-12-13  8:35 PCIe hotplug resource issues with PEX switch (NVMe disks) on AMD Epyc system Stefan Roese
  2019-12-13  9:00 ` Nicholas Johnson
@ 2019-12-16 23:37 ` Bjorn Helgaas
  2019-12-17 13:54   ` Stefan Roese
  1 sibling, 1 reply; 14+ messages in thread
From: Bjorn Helgaas @ 2019-12-16 23:37 UTC (permalink / raw)
  To: Stefan Roese
  Cc: linux-pci, Mika Westerberg, Lukas Wunner, Sergey Miroshnichenko,
	Nicholas Johnson, Keith Busch

[+cc Keith]

On Fri, Dec 13, 2019 at 09:35:19AM +0100, Stefan Roese wrote:
> Hi!
> 
> I am facing an issue with PCIe-Hotplug on an AMD Epyc based system.
> Our system is equipped with an HBA for NVMe SSDs incl. PCIe switch
> (Supermicro AOC-SLG3-4E2P) [1] and we would like to be able to hotplug
> NVMe disks.

Your system has several host bridges.  The address space routed to
each host bridge is determined by firmware, and Linux has no support
for changing it.  Here's the space routed to the hierarchy containing
the NVMe devices:

  ACPI: PCI Root Bridge [S0D2] (domain 0000 [bus 40-5f])
  pci_bus 0000:40: root bus resource [mem 0xeb000000-0xeb5fffff window] 6MB
  pci_bus 0000:40: root bus resource [mem 0x7fc8000000-0xfcffffffff window] 501GB+
  pci_bus 0000:40: root bus resource [bus 40-5f]

Since you have several host bridges, using "pci=nocrs" is pretty much
guaranteed to fail if Linux changes any PCI address assignments.  It
makes Linux *ignore* the routing information from firmware, but it
doesn't *change* any of the routing.  That's why experiment (d) fails:
we assigned this space:

  pci 0000:44:00.0: BAR 0: assigned [mem 0xec000000-0xec003fff 64bit]

but according to the BIOS, the [mem 0xec000000-0xefffffff window] area
is routed to bus 00, not bus 40, so when we try to access that BAR, it
goes to bus 00 where nothing responds.
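
In kernel terms, the assigned BAR ends up outside every window that the
firmware routes to this root bus.  A rough sketch of such a check
(pci_bus_for_each_resource() and resource_contains() are existing
kernel helpers; the wrapper function itself is hypothetical, and with
pci=nocrs the root bus resources become the catch-all defaults, so the
check is only meaningful against the firmware _CRS windows):

/* Sketch: warn when a device BAR falls outside every window of its
 * root bus, i.e. outside everything firmware routes to this host
 * bridge. */
static void check_bar_routing(struct pci_dev *dev, int bar)
{
	struct resource *r = &dev->resource[bar];
	struct pci_bus *root = dev->bus;
	struct resource *win;
	int i;

	if (!r->flags || !(r->flags & IORESOURCE_MEM))
		return;

	while (root->parent)
		root = root->parent;

	pci_bus_for_each_resource(root, win, i) {
		if (win && resource_contains(win, r))
			return;		/* BAR lies in a window routed here */
	}

	pci_warn(dev, "BAR %d %pR is outside all root bus windows\n", bar, r);
}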

There are three devices on bus 40 that consume memory address space:

  40:03.1 Root Port to [bus 41-47]  [mem 0xeb400000-0xeb5fffff] 2MB
  40:07.1 Root Port to [bus 48]     [mem 0xeb200000-0xeb3fffff] 2MB
  40:08.1 Root Port to [bus 49]     [mem 0xeb000000-0xeb1fffff] 2MB

Bridges (including Root Ports and Switch Ports) consume memory address
space in 1MB chunks.  The devices on buses 48 and 49 need a little
over 1MB, so 40:07.1 and 40:08.1 need at least 2MB each.  There's only
6MB available, so that leaves 2MB for 40:03.1, which leads to the PLX
switch.

That 2MB of memory space is routed to the PLX Switch Upstream Port,
which has a BAR of its own that requires 256K, which leaves 1MB for it
to send to its Downstream Ports.

The Intel NVMe device only needs 16KB of memory space, but since the
Switch Port windows are a minimum of 1MB, only one port gets memory
space.
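
To make the arithmetic explicit, here is a small sketch that redoes the
same calculation (it only restates the numbers above; the 1MB figure is
the bridge memory window granularity):

#include <stdio.h>

#define SZ_256K	0x40000UL
#define SZ_1M	0x100000UL

int main(void)
{
	unsigned long bus40_win  = 6 * SZ_1M;	/* 0xeb000000-0xeb5fffff */
	unsigned long bus48_port = 2 * SZ_1M;	/* 40:07.1: >1MB of devices, 1MB granular */
	unsigned long bus49_port = 2 * SZ_1M;	/* 40:08.1: likewise */
	unsigned long plx_win    = bus40_win - bus48_port - bus49_port;	/* 40:03.1 */
	/* The PLX upstream port BAR (256K) comes out of that window;
	 * downstream port windows must be 1MB-sized/aligned, so round
	 * down what is left. */
	unsigned long downstream = (plx_win - SZ_256K) / SZ_1M * SZ_1M;

	printf("PLX window %luM, downstream space %luM -> %lu slot(s) get memory\n",
	       plx_win / SZ_1M, downstream / SZ_1M, downstream / SZ_1M);
	return 0;
}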

So with this configuration, I think you're stuck.  The only things I
can think of are:

  - Put the PLX switch in a different slot to see if BIOS will assign
    more space to it (the other host bridges have more space
    available).

  - Boot with all four PLX slots occupied by NVMe devices.  The BIOS
    may assign space to accommodate them all.  If it does, you should
    be able to hot-remove and add devices after boot.

  - Change Linux to use prefetchable space.  The Intel NVMe wants
    *non-prefetchable* space, but there's an implementation note in
    the spec (PCIe r5.0, sec 7.5.1.2.1) that says it should be safe to
    put it in prefetchable space in certain cases (entire path is
    PCIe, no PCI/PCI-X devices that do peer-to-peer reads, host bridge does
    no byte merging, etc).  The main problem is that we don't have a
    good way to identify these cases.

> Currently, I'm testing with v5.5.0-rc1 and series [2] applied. Here
> a few tests and results that I did so far. All tests were done with
> one Intel NVMe SSD connected to one of the 4 NVMe ports of the HBA
> and the other 3 ports (currently) left unconnected:
> 
> a) Kernel Parameter "pci=pcie_bus_safe"
> The resources of the 3 unused PCIe slots of the PEX switch are not
> assigned in this test.
> 
> b) Kernel Parameter "pci=pcie_bus_safe,hpmemsize=0,hpiosize=0,hpmmiosize=1M,hpmmioprefsize=0"
> With this test I restricted the resources of the HP slots to the
> minimum. Still this results in unassigned resourced for the unused
> PCIe slots of the PEX switch.
> 
> c) Kernel Parameter "pci=realloc,pcie_bus_safe,hpmemsize=0,hpiosize=0,hpmmiosize=1M,hpmmioprefsize=0"
> Again, not all resources are assigned.
> 
> d) Kernel Parameter "pci=nocrs,realloc,pcie_bus_safe,hpmemsize=0,hpiosize=0,hpmmiosize=1M,hpmmioprefsize=0"
> Now all requested resources are available for the HP PCIe slots of the
> PEX switch. But the NVMe driver fails while probing. Debugging has
> shown, that reading from the BAR of the NVMe disk returns 0xffffffff.
> Also reading from the PLX PEX switch registers returns 0xfffffff in this
> case (this works of course without nocrs, when the BARs are mapped at
> a different address).
> 
> Does anybody have a clue on why the access to the PEX switch and / or
> the NVMe BAR does not work in the "nocrs" case? The BARs are located in
> the same window that is provided by the BIOS in the ACPI list (but is
> "ignored" in this case) [3].
>
> Or if it is possible to get the HP resource mapping done correctly without
> setting "nocrs" for our setup with the PCIe/NVMe switch?
>
> [1] https://www.supermicro.com/en/products/accessories/addon/AOC-SLG3-4E2P.php
> [2] https://lkml.org/lkml/2019/12/9/388
> [3]
> [    0.701932] acpi PNP0A08:00: host bridge window [io  0x0cf8-0x0cff] (ignored)
> [    0.701934] acpi PNP0A08:00: host bridge window [io  0x0000-0x02ff window] (ignored)
> [    0.701935] acpi PNP0A08:00: host bridge window [io  0x0300-0x03af window] (ignored)
> [    0.701936] acpi PNP0A08:00: host bridge window [io  0x03e0-0x0cf7 window] (ignored)
> [    0.701937] acpi PNP0A08:00: host bridge window [io  0x03b0-0x03df window] (ignored)
> [    0.701938] acpi PNP0A08:00: host bridge window [io  0x0d00-0x3fff window] (ignored)
> [    0.701939] acpi PNP0A08:00: host bridge window [mem 0x000a0000-0x000bffff window] (ignored)
> [    0.701939] acpi PNP0A08:00: host bridge window [mem 0x000c0000-0x000dffff window] (ignored)
> [    0.701940] acpi PNP0A08:00: host bridge window [mem 0xec000000-0xefffffff window] (ignored)
> [    0.701941] acpi PNP0A08:00: host bridge window [mem 0x182c8000000-0x1ffffffffff window] (ignored)
> ...
> 41:00.0 PCI bridge: PLX Technology, Inc. PEX 9733 33-lane, 9-port PCI Express Gen 3 (8.0 GT/s) Switch (rev b0) (prog-if 00 [Normal decode])
>         Flags: bus master, fast devsel, latency 0, IRQ 47, NUMA node 2
>         Memory at ec400000 (32-bit, non-prefetchable) [size=256K]
>         Bus: primary=41, secondary=42, subordinate=47, sec-latency=0
>         I/O behind bridge: None
>         Memory behind bridge: ec000000-ec3fffff [size=4M]
>         Prefetchable memory behind bridge: None
>         Capabilities: <access denied>
>         Kernel driver in use: pcieport
> epyc@epyc-Super-Server:~/stefan$ sudo ./memtool md 0xec400000+0x10
> ec400000: ffffffff ffffffff ffffffff ffffffff                ................

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: PCIe hotplug resource issues with PEX switch (NVMe disks) on AMD Epyc system
  2019-12-16 23:37 ` Bjorn Helgaas
@ 2019-12-17 13:54   ` Stefan Roese
  2019-12-17 16:30     ` Keith Busch
  0 siblings, 1 reply; 14+ messages in thread
From: Stefan Roese @ 2019-12-17 13:54 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, Mika Westerberg, Lukas Wunner, Sergey Miroshnichenko,
	Nicholas Johnson, Keith Busch

Hi Bjorn,

On 17.12.19 00:37, Bjorn Helgaas wrote:
> [+cc Keith]
> 
> On Fri, Dec 13, 2019 at 09:35:19AM +0100, Stefan Roese wrote:
>> Hi!
>>
>> I am facing an issue with PCIe-Hotplug on an AMD Epyc based system.
>> Our system is equipped with an HBA for NVMe SSDs incl. PCIe switch
>> (Supermicro AOC-SLG3-4E2P) [1] and we would like to be able to hotplug
>> NVMe disks.
> 
> Your system has several host bridges.  The address space routed to
> each host bridge is determined by firmware, and Linux has no support
> for changing it.  Here's the space routed to the hierarchy containing
> the NVMe devices:
> 
>    ACPI: PCI Root Bridge [S0D2] (domain 0000 [bus 40-5f])
>    pci_bus 0000:40: root bus resource [mem 0xeb000000-0xeb5fffff window] 6MB
>    pci_bus 0000:40: root bus resource [mem 0x7fc8000000-0xfcffffffff window] 501GB+
>    pci_bus 0000:40: root bus resource [bus 40-5f]
> 
> Since you have several host bridges, using "pci=nocrs" is pretty much
> guaranteed to fail if Linux changes any PCI address assignments.  It
> makes Linux *ignore* the routing information from firmware, but it
> doesn't *change* any of the routing.  That's why experiment (d) fails:
> we assigned this space:
> 
>    pci 0000:44:00.0: BAR 0: assigned [mem 0xec000000-0xec003fff 64bit]
> 
> but according to the BIOS, the [mem 0xec000000-0xefffffff window] area
> is routed to bus 00, not bus 40, so when we try to access that BAR, it
> goes to bus 00 where nothing responds.

Thanks for your analysis. I totally missed the multiple-host-bridges
aspect here. This completely explains what's happening with "nocrs",
which can't be used on this platform because of this (without the
ability to change the routing in the PCI host bridges as well).
  
> There are three devices on bus 40 that consume memory address space:
> 
>    40:03.1 Root Port to [bus 41-47]  [mem 0xeb400000-0xeb5fffff] 2MB
>    40:07.1 Root Port to [bus 48]     [mem 0xeb200000-0xeb3fffff] 2MB
>    40:08.1 Root Port to [bus 49]     [mem 0xeb000000-0xeb1fffff] 2MB
> 
> Bridges (including Root Ports and Switch Ports) consume memory address
> space in 1MB chunks.  The devices on buses 48 and 49 need a little
> over 1MB, so 40:07.1 and 40:08.1 need at least 2MB each.  There's only
> 6MB available, so that leaves 2MB for 40:03.1, which leads to the PLX
> switch.
> 
> That 2MB of memory space is routed to the PLX Switch Upstream Port,
> which has a BAR of its own that requires 256K, which leaves 1MB for it
> to send to its Downstream Ports.
> 
> The Intel NVMe device only needs 16KB of memory space, but since the
> Switch Port windows are a minimum of 1MB, only one port gets memory
> space.
> 
> So with this configuration, I think you're stuck.  The only things I
> can think of are:
> 
>    - Put the PLX switch in a different slot to see if BIOS will assign
>      more space to it (the other host bridges have more space
>      available).

Thanks for this suggestion. Using a different slot (with more resources)
enables the resource assignment for all 4 HP slots of the PLX switch,
but only when I use this patch from Nicholas (and pci=realloc):

https://lore.kernel.org/linux-pci/20191216233759.GA249123@google.com/T/#mbb5abd0131f05dbd5030952f567b3e4ec92f2af4
  
>    - Boot with all four PLX slots occupied by NVMe devices.  The BIOS
>      may assign space to accommodate them all.  If it does, you should
>      be able to hot-remove and add devices after boot.

Unfortunately, that's not an option. We need to be able to boot with
e.g. one NVMe device and hot-plug one or more devices later.
  
>    - Change Linux to use prefetchable space.  The Intel NVMe wants
>      *non-prefetchable* space, but there's an implementation note in
>      the spec (PCIe r5.0, sec 7.5.1.2.1) that says it should be safe to
>      put it in prefetchable space in certain cases (entire path is
>      PCIe, no PCI/PCI-X devices to peer-to-peer reads, host bridge does
>      no byte merging, etc).  The main problem is that we don't have a
>      good way to identify these cases.

Thanks for this suggestion. I might look into this. Right now, I'm
experimenting with the "solution" mentioned above, which looks like
it solves our issues for now.

Thanks,
Stefan

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: PCIe hotplug resource issues with PEX switch (NVMe disks) on AMD Epyc system
  2019-12-17 13:54   ` Stefan Roese
@ 2019-12-17 16:30     ` Keith Busch
  2019-12-17 16:45       ` Stefan Roese
  0 siblings, 1 reply; 14+ messages in thread
From: Keith Busch @ 2019-12-17 16:30 UTC (permalink / raw)
  To: Stefan Roese
  Cc: Bjorn Helgaas, linux-pci, Mika Westerberg, Lukas Wunner,
	Sergey Miroshnichenko, Nicholas Johnson

On Tue, Dec 17, 2019 at 02:54:06PM +0100, Stefan Roese wrote:
> On 17.12.19 00:37, Bjorn Helgaas wrote:
> >    - Boot with all four PLX slots occupied by NVMe devices.  The BIOS
> >      may assign space to accommodate them all.  If it does, you should
> >      be able to hot-remove and add devices after boot.
> 
> Unfortunately, that's not an option. We need to be able to boot with
> e.g. one NVMe device and hot-plug one or more devices later.

That was also my suggestion, but not necessarily as a "solution". It's
just to see if it works, which might indicate what the kernel could do
differently.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: PCIe hotplug resource issues with PEX switch (NVMe disks) on AMD Epyc system
  2019-12-17 16:30     ` Keith Busch
@ 2019-12-17 16:45       ` Stefan Roese
  0 siblings, 0 replies; 14+ messages in thread
From: Stefan Roese @ 2019-12-17 16:45 UTC (permalink / raw)
  To: Keith Busch
  Cc: Bjorn Helgaas, linux-pci, Mika Westerberg, Lukas Wunner,
	Sergey Miroshnichenko, Nicholas Johnson

On 17.12.19 17:30, Keith Busch wrote:
> On Tue, Dec 17, 2019 at 02:54:06PM +0100, Stefan Roese wrote:
>> On 17.12.19 00:37, Bjorn Helgaas wrote:
>>>     - Boot with all four PLX slots occupied by NVMe devices.  The BIOS
>>>       may assign space to accommodate them all.  If it does, you should
>>>       be able to hot-remove and add devices after boot.
>>
>> Unfortunately, that's not an option. We need to be able to boot with
>> e.g. one NVMe device and hot-plug one or more devices later.
> 
> That was also my suggestion, but not necessarily as a "solution". It's
> just to see if it works, which might indicate what the kernel could do
> differently.

I see, thanks. Right now, I don't have enough NVMe devices available to
do such a test. I'll add it to my list of tests that should be done,
though.

Thanks,
Stefan

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2019-12-17 16:45 UTC | newest]

Thread overview: 14+ messages
2019-12-13  8:35 PCIe hotplug resource issues with PEX switch (NVMe disks) on AMD Epyc system Stefan Roese
2019-12-13  9:00 ` Nicholas Johnson
     [not found]   ` <c9f154e5-4214-aa46-2ce2-443b508e1643@denx.de>
2019-12-13 11:52     ` Nicholas Johnson
2019-12-13 12:17       ` Stefan Roese
2019-12-15  3:16         ` Nicholas Johnson
2019-12-16  6:48           ` Stefan Roese
2019-12-16  0:46       ` Bjorn Helgaas
2019-12-16  6:50         ` Stefan Roese
2019-12-16 15:50           ` Keith Busch
     [not found]             ` <f3a51108-10e4-f60d-de18-a12de85d07df@denx.de>
2019-12-16 19:32               ` Keith Busch
2019-12-16 23:37 ` Bjorn Helgaas
2019-12-17 13:54   ` Stefan Roese
2019-12-17 16:30     ` Keith Busch
2019-12-17 16:45       ` Stefan Roese
