* PCIe hotplug resource issues with PEX switch (NVMe disks) on AMD Epyc system
@ 2019-12-13  8:35 Stefan Roese
  2019-12-13  9:00 ` Nicholas Johnson
  2019-12-16 23:37 ` Bjorn Helgaas
  0 siblings, 2 replies; 14+ messages in thread

From: Stefan Roese @ 2019-12-13 8:35 UTC (permalink / raw)
To: linux-pci, Bjorn Helgaas, Mika Westerberg, Lukas Wunner, Sergey Miroshnichenko, Nicholas Johnson

Hi!

I am facing an issue with PCIe hotplug on an AMD Epyc based system. Our system is equipped with an HBA for NVMe SSDs incl. a PCIe switch (Supermicro AOC-SLG3-4E2P) [1] and we would like to be able to hotplug NVMe disks.

Currently, I'm testing with v5.5.0-rc1 and series [2] applied. Here are a few tests and results that I did so far. All tests were done with one Intel NVMe SSD connected to one of the 4 NVMe ports of the HBA and the other 3 ports (currently) left unconnected:

a) Kernel parameter "pci=pcie_bus_safe"
   The resources of the 3 unused PCIe slots of the PEX switch are not assigned in this test.

b) Kernel parameter "pci=pcie_bus_safe,hpmemsize=0,hpiosize=0,hpmmiosize=1M,hpmmioprefsize=0"
   With this test I restricted the resources of the HP slots to the minimum. Still this results in unassigned resources for the unused PCIe slots of the PEX switch.

c) Kernel parameter "pci=realloc,pcie_bus_safe,hpmemsize=0,hpiosize=0,hpmmiosize=1M,hpmmioprefsize=0"
   Again, not all resources are assigned.

d) Kernel parameter "pci=nocrs,realloc,pcie_bus_safe,hpmemsize=0,hpiosize=0,hpmmiosize=1M,hpmmioprefsize=0"
   Now all requested resources are available for the HP PCIe slots of the PEX switch, but the NVMe driver fails while probing. Debugging has shown that reading from the BAR of the NVMe disk returns 0xffffffff. Reading from the PLX PEX switch registers also returns 0xffffffff in this case (this works, of course, without nocrs, when the BARs are mapped at a different address).

Does anybody have a clue why the access to the PEX switch and/or the NVMe BAR does not work in the "nocrs" case? The BARs are located in the same window that is provided by the BIOS in the ACPI list (but is "ignored" in this case) [3].

Or is it possible to get the HP resource mapping done correctly without setting "nocrs" for our setup with the PCIe/NVMe switch?

I can provide all sorts of logs (dmesg, lspci etc.) if needed - just let me know.

Many thanks in advance,
Stefan

[1] https://www.supermicro.com/en/products/accessories/addon/AOC-SLG3-4E2P.php
[2] https://lkml.org/lkml/2019/12/9/388
[3]
[ 0.701932] acpi PNP0A08:00: host bridge window [io 0x0cf8-0x0cff] (ignored)
[ 0.701934] acpi PNP0A08:00: host bridge window [io 0x0000-0x02ff window] (ignored)
[ 0.701935] acpi PNP0A08:00: host bridge window [io 0x0300-0x03af window] (ignored)
[ 0.701936] acpi PNP0A08:00: host bridge window [io 0x03e0-0x0cf7 window] (ignored)
[ 0.701937] acpi PNP0A08:00: host bridge window [io 0x03b0-0x03df window] (ignored)
[ 0.701938] acpi PNP0A08:00: host bridge window [io 0x0d00-0x3fff window] (ignored)
[ 0.701939] acpi PNP0A08:00: host bridge window [mem 0x000a0000-0x000bffff window] (ignored)
[ 0.701939] acpi PNP0A08:00: host bridge window [mem 0x000c0000-0x000dffff window] (ignored)
[ 0.701940] acpi PNP0A08:00: host bridge window [mem 0xec000000-0xefffffff window] (ignored)
[ 0.701941] acpi PNP0A08:00: host bridge window [mem 0x182c8000000-0x1ffffffffff window] (ignored)
...
41:00.0 PCI bridge: PLX Technology, Inc. PEX 9733 33-lane, 9-port PCI Express Gen 3 (8.0 GT/s) Switch (rev b0) (prog-if 00 [Normal decode])
	Flags: bus master, fast devsel, latency 0, IRQ 47, NUMA node 2
	Memory at ec400000 (32-bit, non-prefetchable) [size=256K]
	Bus: primary=41, secondary=42, subordinate=47, sec-latency=0
	I/O behind bridge: None
	Memory behind bridge: ec000000-ec3fffff [size=4M]
	Prefetchable memory behind bridge: None
	Capabilities: <access denied>
	Kernel driver in use: pcieport

epyc@epyc-Super-Server:~/stefan$ sudo ./memtool md 0xec400000+0x10
ec400000: ffffffff ffffffff ffffffff ffffffff ................

^ permalink raw reply	[flat|nested] 14+ messages in thread
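A quick cross-check for this kind of 0xffffffff readback, sketched here with the 41:00.0 switch port from the lspci output above (the commands are illustrative and were not part of the original report): see whether config space is still reachable and where the BAR actually sits, e.g.

    sudo setpci -s 41:00.0 COMMAND
    sudo setpci -s 41:00.0 BASE_ADDRESS_0
    sudo lspci -s 41:00.0 -vv | grep -E 'Memory at|Memory behind bridge'

If the config reads themselves come back as ffff the device is unreachable; if config space looks sane but the BAR address lies outside every window listed for that root bridge in /proc/iomem, the MMIO access is presumably being routed to a different host bridge, which matches the nocrs failure described above.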
* Re: PCIe hotplug resource issues with PEX switch (NVMe disks) on AMD Epyc system 2019-12-13 8:35 PCIe hotplug resource issues with PEX switch (NVMe disks) on AMD Epyc system Stefan Roese @ 2019-12-13 9:00 ` Nicholas Johnson [not found] ` <c9f154e5-4214-aa46-2ce2-443b508e1643@denx.de> 2019-12-16 23:37 ` Bjorn Helgaas 1 sibling, 1 reply; 14+ messages in thread From: Nicholas Johnson @ 2019-12-13 9:00 UTC (permalink / raw) To: Stefan Roese Cc: linux-pci, Bjorn Helgaas, Mika Westerberg, Lukas Wunner, Sergey Miroshnichenko On Fri, Dec 13, 2019 at 09:35:19AM +0100, Stefan Roese wrote: > Hi! Hi, > > I am facing an issue with PCIe-Hotplug on an AMD Epyc based system. > Our system is equipped with an HBA for NVMe SSDs incl. PCIe switch > (Supermicro AOC-SLG3-4E2P) [1] and we would like to be able to hotplug > NVMe disks. > > Currently, I'm testing with v5.5.0-rc1 and series [2] applied. Here > a few tests and results that I did so far. All tests were done with > one Intel NVMe SSD connected to one of the 4 NVMe ports of the HBA > and the other 3 ports (currently) left unconnected: > > a) Kernel Parameter "pci=pcie_bus_safe" > The resources of the 3 unused PCIe slots of the PEX switch are not > assigned in this test. > > b) Kernel Parameter "pci=pcie_bus_safe,hpmemsize=0,hpiosize=0,hpmmiosize=1M,hpmmioprefsize=0" > With this test I restricted the resources of the HP slots to the > minimum. Still this results in unassigned resourced for the unused > PCIe slots of the PEX switch. > > c) Kernel Parameter "pci=realloc,pcie_bus_safe,hpmemsize=0,hpiosize=0,hpmmiosize=1M,hpmmioprefsize=0" > Again, not all resources are assigned. > > d) Kernel Parameter "pci=nocrs,realloc,pcie_bus_safe,hpmemsize=0,hpiosize=0,hpmmiosize=1M,hpmmioprefsize=0" > Now all requested resources are available for the HP PCIe slots of the > PEX switch. But the NVMe driver fails while probing. Debugging has > shown, that reading from the BAR of the NVMe disk returns 0xffffffff. > Also reading from the PLX PEX switch registers returns 0xfffffff in this > case (this works of course without nocrs, when the BARs are mapped at > a different address). > > Does anybody have a clue on why the access to the PEX switch and / or > the NVMe BAR does not work in the "nocrs" case? The BARs are located in > the same window that is provided by the BIOS in the ACPI list (but is > "ignored" in this case) [3]. > > Or if it is possible to get the HP resource mapping done correctly without > setting "nocrs" for our setup with the PCIe/NVMe switch? > > I can provide all sorts of logs (dmegs, lspci etc) if needed - just let > me know. > > Many thanks in advance, > Stefan This will be a quick response for now. I will get more in depth tonight when I have more time. What I have taken away from this is: 1. Epyc -> Up to 4x PCIe Root Complexes, but from what I can gather, they are probably assigned on the same segment / domain, unfortunately, with non-overlapping bus numbers. Either way, multiple RCs may complicate using pci=nocrs and others. Unfortunately, I have not had the privilege of owning a system with multiple RCs, so I cannot be sure. 2. Not using Thunderbolt - [2] patch series only really makes a difference with nested hotplug bridges, such as in Thunderbolt. Although, it might help by not using additional resource lists, but I still do not think it will matter without nested hotplug bridges. 3. System not reallocating resources despite overridden -> is ACPI _DSM method evaluating to zero? I experienced this recently with an Intel Ice Lake system. 
I booted the laptop at the retail store into Linux off a USB to find out about the Thunderbolt implementation. I dumped "sudo lspci -xxxx" and dmesg and analysed the results at home. I noticed it did not override the resources, and from examining the source code, it likely evaluated _DSM to 0, which may have overridden pci=realloc. Try modifying the source code to unconditionally apply realloc in drivers/pci/setup-bus.c and see what happens. I have not bothered doing this myself and going back to the store to test this hypothesis.

4. It would be helpful if you attached full dmesg and "sudo lspci -xxxx", which dumps the full PCI config, allowing us to run any lspci query from the file as if we were on your system. I will be able to tell a lot more after seeing that. Possibly do one with no kernel parameters, and another set of results with all of the kernel parameters. Use hpmmiosize=64M and hpmmioprefsize=1G for it to be noticeable, I reckon. This will answer questions I have about which ports are hotplug bridges, and other things.

5. There is a good chance it will not even boot with acpi=off on kernels since around ~v5.3, but it is worth a shot there, also. Since a recent kernel, I have found that acpi=off only removes HyperThreading, and not all the physical cores like it used to. So there must have been a patch which allowed it to guess the MADT table information. I have not investigated. But now, some of my computers crash upon loading the kernel with acpi=off. It must get it wrong at times. What about pci=noacpi instead?

Sorry if I missed something you said.

Best of luck, and I am interested in looking into this further. :)

Kind regards,
Nicholas Johnson

>
> [1] https://www.supermicro.com/en/products/accessories/addon/AOC-SLG3-4E2P.php
> [2] https://lkml.org/lkml/2019/12/9/388
> [3]
> [ 0.701932] acpi PNP0A08:00: host bridge window [io 0x0cf8-0x0cff] (ignored)
> [ 0.701934] acpi PNP0A08:00: host bridge window [io 0x0000-0x02ff window] (ignored)
> [ 0.701935] acpi PNP0A08:00: host bridge window [io 0x0300-0x03af window] (ignored)
> [ 0.701936] acpi PNP0A08:00: host bridge window [io 0x03e0-0x0cf7 window] (ignored)
> [ 0.701937] acpi PNP0A08:00: host bridge window [io 0x03b0-0x03df window] (ignored)
> [ 0.701938] acpi PNP0A08:00: host bridge window [io 0x0d00-0x3fff window] (ignored)
> [ 0.701939] acpi PNP0A08:00: host bridge window [mem 0x000a0000-0x000bffff window] (ignored)
> [ 0.701939] acpi PNP0A08:00: host bridge window [mem 0x000c0000-0x000dffff window] (ignored)
> [ 0.701940] acpi PNP0A08:00: host bridge window [mem 0xec000000-0xefffffff window] (ignored)
> [ 0.701941] acpi PNP0A08:00: host bridge window [mem 0x182c8000000-0x1ffffffffff window] (ignored)
> ...
> 41:00.0 PCI bridge: PLX Technology, Inc. PEX 9733 33-lane, 9-port PCI Express Gen 3 (8.0 GT/s) Switch (rev b0) (prog-if 00 [Normal decode])
> 	Flags: bus master, fast devsel, latency 0, IRQ 47, NUMA node 2
> 	Memory at ec400000 (32-bit, non-prefetchable) [size=256K]
> 	Bus: primary=41, secondary=42, subordinate=47, sec-latency=0
> 	I/O behind bridge: None
> 	Memory behind bridge: ec000000-ec3fffff [size=4M]
> 	Prefetchable memory behind bridge: None
> 	Capabilities: <access denied>
> 	Kernel driver in use: pcieport
> epyc@epyc-Super-Server:~/stefan$ sudo ./memtool md 0xec400000+0x10
> ec400000: ffffffff ffffffff ffffffff ffffffff ................

^ permalink raw reply	[flat|nested] 14+ messages in thread
[parent not found: <c9f154e5-4214-aa46-2ce2-443b508e1643@denx.de>]
* Re: PCIe hotplug resource issues with PEX switch (NVMe disks) on AMD Epyc system [not found] ` <c9f154e5-4214-aa46-2ce2-443b508e1643@denx.de> @ 2019-12-13 11:52 ` Nicholas Johnson 2019-12-13 12:17 ` Stefan Roese 2019-12-16 0:46 ` Bjorn Helgaas 0 siblings, 2 replies; 14+ messages in thread From: Nicholas Johnson @ 2019-12-13 11:52 UTC (permalink / raw) To: Stefan Roese Cc: linux-pci, Bjorn Helgaas, Mika Westerberg, Lukas Wunner, Sergey Miroshnichenko On Fri, Dec 13, 2019 at 11:58:53AM +0100, Stefan Roese wrote: > Hi Nicholas, > > On 13.12.19 10:00, Nicholas Johnson wrote: > > On Fri, Dec 13, 2019 at 09:35:19AM +0100, Stefan Roese wrote: > > > Hi! > > Hi, > > > > > > I am facing an issue with PCIe-Hotplug on an AMD Epyc based system. > > > Our system is equipped with an HBA for NVMe SSDs incl. PCIe switch > > > (Supermicro AOC-SLG3-4E2P) [1] and we would like to be able to hotplug > > > NVMe disks. > > > > > > Currently, I'm testing with v5.5.0-rc1 and series [2] applied. Here > > > a few tests and results that I did so far. All tests were done with > > > one Intel NVMe SSD connected to one of the 4 NVMe ports of the HBA > > > and the other 3 ports (currently) left unconnected: > > > > > > a) Kernel Parameter "pci=pcie_bus_safe" > > > The resources of the 3 unused PCIe slots of the PEX switch are not > > > assigned in this test. > > > > > > b) Kernel Parameter "pci=pcie_bus_safe,hpmemsize=0,hpiosize=0,hpmmiosize=1M,hpmmioprefsize=0" > > > With this test I restricted the resources of the HP slots to the > > > minimum. Still this results in unassigned resourced for the unused > > > PCIe slots of the PEX switch. > > > > > > c) Kernel Parameter "pci=realloc,pcie_bus_safe,hpmemsize=0,hpiosize=0,hpmmiosize=1M,hpmmioprefsize=0" > > > Again, not all resources are assigned. > > > > > > d) Kernel Parameter "pci=nocrs,realloc,pcie_bus_safe,hpmemsize=0,hpiosize=0,hpmmiosize=1M,hpmmioprefsize=0" > > > Now all requested resources are available for the HP PCIe slots of the > > > PEX switch. But the NVMe driver fails while probing. Debugging has > > > shown, that reading from the BAR of the NVMe disk returns 0xffffffff. > > > Also reading from the PLX PEX switch registers returns 0xfffffff in this > > > case (this works of course without nocrs, when the BARs are mapped at > > > a different address). > > > > > > Does anybody have a clue on why the access to the PEX switch and / or > > > the NVMe BAR does not work in the "nocrs" case? The BARs are located in > > > the same window that is provided by the BIOS in the ACPI list (but is > > > "ignored" in this case) [3]. > > > > > > Or if it is possible to get the HP resource mapping done correctly without > > > setting "nocrs" for our setup with the PCIe/NVMe switch? > > > > > > I can provide all sorts of logs (dmegs, lspci etc) if needed - just let > > > me know. > > > > > > Many thanks in advance, > > > Stefan > > This will be a quick response for now. I will get more in depth tonight > > when I have more time. > > > > What I have taken away from this is: > > > > 1. Epyc -> Up to 4x PCIe Root Complexes, but from what I can gather, > > they are probably assigned on the same segment / domain, unfortunately, > > with non-overlapping bus numbers. Either way, multiple RCs may > > complicate using pci=nocrs and others. Unfortunately, I have not had the > > privilege of owning a system with multiple RCs, so I cannot be sure. > > > > 2. Not using Thunderbolt - [2] patch series only really makes a > > difference with nested hotplug bridges, such as in Thunderbolt. 
> > Although, it might help by not using additional resource lists, but I > > still do not think it will matter without nested hotplug bridges. > > I was not sure about those patches but since they have been queued for > 5.6, I included them in these tests. The results are similar (or even > identical, I would need to re-run the test to be sure) without them. > > 3. System not reallocating resources despite overridden -> is ACPI _DSM > > method evaluating to zero? > > Not sure if I follow you here. The kernel is reallocating the resources, or > at least trying to, if requested to via bootargs (Tests c) and d)). I've > attached the logs from all 4 tests in an archive [1]. It just fails to > reallocate the resources in test case c) and even though it successfully > reallocates the resources in test case d), the new addresses at the PEX > switch and its ports "don't work". It is unlikely to be the issue, but I thought it was worth a mention. > > > I experienced this recently with an Intel Ice > > Lake system. I booted the laptop at the retail store into Linux off a > > USB to find out about the Thunderbolt implementation. I dumped "sudo > > lspci -xxxx" and dmesg and analysed the results at home. > > Very brave. ;) It's a retail store with display models for people to play with. If I do not damage it (or pay for any damage caused) then I do not have anything to be afraid of. > > > I noticed it > > did not override the resources, and from examining the source code, it > > likely evaluated _DSM to 0, which may have overridden pci=realloc. Try > > modifying the source code to unconditionally apply realloc in > > drivers/pci/setup-bus.c and see what happens. I have not bothered doing > > this myself and going back to the store to try to test this hypothesis. > > realloc is enabled via boot args and active in the kernel as you can see > from the dmesg log [2]. > > 4. It would be helpful if you attached full dmesg and "sudo lspci -xxxx" > > which dumps full PCI config, allowing us to run any lspci query as if we > > were on your system, from the file. I will be able to tell a lot more > > after seeing that. Possibly do one with no kernel parameters, and do > > another set of results with all of the kernel parameters. Use > > hpmmiosize=64M and hpmmioprefsize=1G for it to be noticeable, I reckon. > > But this will answer questions I have about which ports are hotplug > > bridges and other things. > > Okay, I added the following test cases: > > e) Kernel Parameter "" > f) Kernel Parameter "pci=nocrs,realloc,hpmmiosize=64M,hpmmioprefsize=1G" > > The logs are also included. Please let me know, if I should do any other > tests and provide the logs. > > > 5. There is a good chance it will not even boot since kernel since > > around ~v5.3 with acpi=off but it is worth a shot there, also. Since a > > recent kernel, I have found that acpi=off only removes HyperThreading, > > and not all the physical cores like it used to. So there must have been > > a patch which allowed it to guess the MADT table information. I have not > > investigated. But now, some of my computers crash upon loading the > > kernel with acpi=off. It must get it wrong at times. > > Booting this 5.5 kernel with "acpi=off" increases the bootup time quite > a bit. The resources are distributed behind the PLX switch (similar to > using "pci=nocrs" but again accessing the BARs doesn't work (0xffffffff > is read back). It was only to see if ACPI was part of the issue. You would not run in production with it off. 
> > > What about > > pci=noacpi instead? > > I also tested using pci=noacpi and it did not resolve the resource > mapping problems. > > Sorry if I missed something you said. > > > > Best of luck, and I am interested into looking into this further. :) > > Very much appreciated. :) > > Thanks, > Stefan > > [1] logs.tar.bz2 > [2] 5.5.0-rc1-custom-test-c/dmesg.log From the logs, it looks like MMIO_PREF was assigned 1G but not MMIO. This looks tricky. Please revert my commit: c13704f5685deb7d6eb21e293233e0901ed77377 And see if it is the problem. It is entirely possible, but because of the very old code and how there are multiple passes, it might be impossible to use realloc without side effects for somebody. If you fix it for one scenario, it is possible that there is another scenario for which it will break due to the change. The only way to make everything work is a near complete rewrite of drivers/pci/setup-bus.c and potentially others, something I am working on, but is going to take a long time. And unlikely to ever be accepted. Otherwise, it will take me a lot of grepping through dmesg to find the cause, which will take more time. FYI, "lspci -vvv" is redundant because it can be produced from "lspci -xxxx" output. A final note, Epyc CPUs can bifurcate x16 slots into x4/x4/x4/x4 in the BIOS setup, although you will probably not have the hotplug services provided by the PEX switch. Kind regards, Nicholas Johnson ^ permalink raw reply [flat|nested] 14+ messages in thread
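As an aside on the "lspci -xxxx" point above: such a dump can be replayed offline with lspci's -F option, which reads device config space from a file instead of the real hardware. A minimal example (the file name is arbitrary):

    sudo lspci -xxxx > lspci-xxxx.txt
    lspci -vvv -F lspci-xxxx.txt

This reconstructs the verbose view from the saved dump, assuming the dump was taken with enough privileges to read the full extended config space.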
* Re: PCIe hotplug resource issues with PEX switch (NVMe disks) on AMD Epyc system 2019-12-13 11:52 ` Nicholas Johnson @ 2019-12-13 12:17 ` Stefan Roese 2019-12-15 3:16 ` Nicholas Johnson 2019-12-16 0:46 ` Bjorn Helgaas 1 sibling, 1 reply; 14+ messages in thread From: Stefan Roese @ 2019-12-13 12:17 UTC (permalink / raw) To: Nicholas Johnson Cc: linux-pci, Bjorn Helgaas, Mika Westerberg, Lukas Wunner, Sergey Miroshnichenko On 13.12.19 12:52, Nicholas Johnson wrote: > On Fri, Dec 13, 2019 at 11:58:53AM +0100, Stefan Roese wrote: >> Hi Nicholas, >> >> On 13.12.19 10:00, Nicholas Johnson wrote: >>> On Fri, Dec 13, 2019 at 09:35:19AM +0100, Stefan Roese wrote: >>>> Hi! >>> Hi, >>>> >>>> I am facing an issue with PCIe-Hotplug on an AMD Epyc based system. >>>> Our system is equipped with an HBA for NVMe SSDs incl. PCIe switch >>>> (Supermicro AOC-SLG3-4E2P) [1] and we would like to be able to hotplug >>>> NVMe disks. >>>> >>>> Currently, I'm testing with v5.5.0-rc1 and series [2] applied. Here >>>> a few tests and results that I did so far. All tests were done with >>>> one Intel NVMe SSD connected to one of the 4 NVMe ports of the HBA >>>> and the other 3 ports (currently) left unconnected: >>>> >>>> a) Kernel Parameter "pci=pcie_bus_safe" >>>> The resources of the 3 unused PCIe slots of the PEX switch are not >>>> assigned in this test. >>>> >>>> b) Kernel Parameter "pci=pcie_bus_safe,hpmemsize=0,hpiosize=0,hpmmiosize=1M,hpmmioprefsize=0" >>>> With this test I restricted the resources of the HP slots to the >>>> minimum. Still this results in unassigned resourced for the unused >>>> PCIe slots of the PEX switch. >>>> >>>> c) Kernel Parameter "pci=realloc,pcie_bus_safe,hpmemsize=0,hpiosize=0,hpmmiosize=1M,hpmmioprefsize=0" >>>> Again, not all resources are assigned. >>>> >>>> d) Kernel Parameter "pci=nocrs,realloc,pcie_bus_safe,hpmemsize=0,hpiosize=0,hpmmiosize=1M,hpmmioprefsize=0" >>>> Now all requested resources are available for the HP PCIe slots of the >>>> PEX switch. But the NVMe driver fails while probing. Debugging has >>>> shown, that reading from the BAR of the NVMe disk returns 0xffffffff. >>>> Also reading from the PLX PEX switch registers returns 0xfffffff in this >>>> case (this works of course without nocrs, when the BARs are mapped at >>>> a different address). >>>> >>>> Does anybody have a clue on why the access to the PEX switch and / or >>>> the NVMe BAR does not work in the "nocrs" case? The BARs are located in >>>> the same window that is provided by the BIOS in the ACPI list (but is >>>> "ignored" in this case) [3]. >>>> >>>> Or if it is possible to get the HP resource mapping done correctly without >>>> setting "nocrs" for our setup with the PCIe/NVMe switch? >>>> >>>> I can provide all sorts of logs (dmegs, lspci etc) if needed - just let >>>> me know. >>>> >>>> Many thanks in advance, >>>> Stefan >>> This will be a quick response for now. I will get more in depth tonight >>> when I have more time. >>> >>> What I have taken away from this is: >>> >>> 1. Epyc -> Up to 4x PCIe Root Complexes, but from what I can gather, >>> they are probably assigned on the same segment / domain, unfortunately, >>> with non-overlapping bus numbers. Either way, multiple RCs may >>> complicate using pci=nocrs and others. Unfortunately, I have not had the >>> privilege of owning a system with multiple RCs, so I cannot be sure. >>> >>> 2. Not using Thunderbolt - [2] patch series only really makes a >>> difference with nested hotplug bridges, such as in Thunderbolt. 
>>> Although, it might help by not using additional resource lists, but I >>> still do not think it will matter without nested hotplug bridges. >> >> I was not sure about those patches but since they have been queued for >> 5.6, I included them in these tests. The results are similar (or even >> identical, I would need to re-run the test to be sure) without them. >>> 3. System not reallocating resources despite overridden -> is ACPI _DSM >>> method evaluating to zero? >> >> Not sure if I follow you here. The kernel is reallocating the resources, or >> at least trying to, if requested to via bootargs (Tests c) and d)). I've >> attached the logs from all 4 tests in an archive [1]. It just fails to >> reallocate the resources in test case c) and even though it successfully >> reallocates the resources in test case d), the new addresses at the PEX >> switch and its ports "don't work". > It is unlikely to be the issue, but I thought it was worth a mention. > >> >>> I experienced this recently with an Intel Ice >>> Lake system. I booted the laptop at the retail store into Linux off a >>> USB to find out about the Thunderbolt implementation. I dumped "sudo >>> lspci -xxxx" and dmesg and analysed the results at home. >> >> Very brave. ;) > It's a retail store with display models for people to play with. If I do > not damage it (or pay for any damage caused) then I do not have anything > to be afraid of. Sure. I was referring to you being "brave" to do all this analyzing / debugging without having the system at your hands while doing this for any further tests. ;) >> >>> I noticed it >>> did not override the resources, and from examining the source code, it >>> likely evaluated _DSM to 0, which may have overridden pci=realloc. Try >>> modifying the source code to unconditionally apply realloc in >>> drivers/pci/setup-bus.c and see what happens. I have not bothered doing >>> this myself and going back to the store to try to test this hypothesis. >> >> realloc is enabled via boot args and active in the kernel as you can see >> from the dmesg log [2]. >>> 4. It would be helpful if you attached full dmesg and "sudo lspci -xxxx" >>> which dumps full PCI config, allowing us to run any lspci query as if we >>> were on your system, from the file. I will be able to tell a lot more >>> after seeing that. Possibly do one with no kernel parameters, and do >>> another set of results with all of the kernel parameters. Use >>> hpmmiosize=64M and hpmmioprefsize=1G for it to be noticeable, I reckon. >>> But this will answer questions I have about which ports are hotplug >>> bridges and other things. >> >> Okay, I added the following test cases: >> >> e) Kernel Parameter "" >> f) Kernel Parameter "pci=nocrs,realloc,hpmmiosize=64M,hpmmioprefsize=1G" >> >> The logs are also included. Please let me know, if I should do any other >> tests and provide the logs. >> >>> 5. There is a good chance it will not even boot since kernel since >>> around ~v5.3 with acpi=off but it is worth a shot there, also. Since a >>> recent kernel, I have found that acpi=off only removes HyperThreading, >>> and not all the physical cores like it used to. So there must have been >>> a patch which allowed it to guess the MADT table information. I have not >>> investigated. But now, some of my computers crash upon loading the >>> kernel with acpi=off. It must get it wrong at times. >> >> Booting this 5.5 kernel with "acpi=off" increases the bootup time quite >> a bit. 
>> The resources are distributed behind the PLX switch (similar to using "pci=nocrs") but again accessing the BARs doesn't work (0xffffffff is read back).

> It was only to see if ACPI was part of the issue. You would not run in production with it off.

>>> What about pci=noacpi instead?
>>
>> I also tested using pci=noacpi and it did not resolve the resource mapping problems.
>>
>>> Sorry if I missed something you said.
>>>
>>> Best of luck, and I am interested in looking into this further. :)
>>
>> Very much appreciated. :)
>>
>> Thanks,
>> Stefan
>>
>> [1] logs.tar.bz2
>> [2] 5.5.0-rc1-custom-test-c/dmesg.log

> From the logs, it looks like MMIO_PREF was assigned 1G but not MMIO.
>
> This looks tricky. Please revert my commit:
> c13704f5685deb7d6eb21e293233e0901ed77377
>
> And see if it is the problem.

I reverted this patch and did a few tests (some of my test cases). None turned out differently than before. Either the resources are not mapped completely or they are mapped (with pci=nocrs) and not accessible.

> It is entirely possible, but because of the very old code and how there are multiple passes, it might be impossible to use realloc without side effects for somebody. If you fix it for one scenario, it is possible that there is another scenario for which it will break due to the change. The only way to make everything work is a near complete rewrite of drivers/pci/setup-bus.c and potentially others, something I am working on, but is going to take a long time. And unlikely to ever be accepted.

While working on this issue, I looked (again) at this resource (re-)allocation code. This is really confusing (at least to me) and I also think that it needs a "near complete rewrite".

> Otherwise, it will take me a lot of grepping through dmesg to find the cause, which will take more time.

Sure.

> FYI, "lspci -vvv" is redundant because it can be produced from "lspci -xxxx" output.

I know. It's mainly for me to easily see the PCI devices listed quickly.

> A final note, Epyc CPUs can bifurcate x16 slots into x4/x4/x4/x4 in the BIOS setup, although you will probably not have the hotplug services provided by the PEX switch.

I think it should not matter for my current tests with resource assignment how many PCIe lanes the PEX switch has connected to the PCI root port. It's of course important for the bandwidth, but this is a completely different issue.

Thanks,
Stefan

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: PCIe hotplug resource issues with PEX switch (NVMe disks) on AMD Epyc system
  2019-12-13 12:17 ` Stefan Roese
@ 2019-12-15  3:16 ` Nicholas Johnson
  2019-12-16  6:48 ` Stefan Roese
  0 siblings, 1 reply; 14+ messages in thread

From: Nicholas Johnson @ 2019-12-15 3:16 UTC (permalink / raw)
To: Stefan Roese
Cc: linux-pci, Bjorn Helgaas, Mika Westerberg, Lukas Wunner, Sergey Miroshnichenko

Hi,

> > From the logs, it looks like MMIO_PREF was assigned 1G but not MMIO.
> >
> > This looks tricky. Please revert my commit:
> > c13704f5685deb7d6eb21e293233e0901ed77377
> >
> > And see if it is the problem.
>
> I reverted this patch and did a few tests (some of my test cases). None turned out differently than before. Either the resources are not mapped completely or they are mapped (with pci=nocrs) and not accessible.
>
> > It is entirely possible, but because of the very old code and how there are multiple passes, it might be impossible to use realloc without side effects for somebody. If you fix it for one scenario, it is possible that there is another scenario for which it will break due to the change. The only way to make everything work is a near complete rewrite of drivers/pci/setup-bus.c and potentially others, something I am working on, but is going to take a long time. And unlikely to ever be accepted.
>
> While working on this issue, I looked (again) at this resource (re-)allocation code. This is really confusing (at least to me) and I also think that it needs a "near complete rewrite".
>
> > Otherwise, it will take me a lot of grepping through dmesg to find the cause, which will take more time.
>
> Sure.
>
> > FYI, "lspci -vvv" is redundant because it can be produced from "lspci -xxxx" output.
>
> I know. It's mainly for me to easily see the PCI devices listed quickly.
>
> > A final note, Epyc CPUs can bifurcate x16 slots into x4/x4/x4/x4 in the BIOS setup, although you will probably not have the hotplug services provided by the PEX switch.
>
> I think it should not matter for my current tests with resource assignment how many PCIe lanes the PEX switch has connected to the PCI root port. It's of course important for the bandwidth, but this is a completely different issue.

I meant that you can connect 4x NVMe drives to a PCIe x16 slot with a cheap passive bifurcation riser. But it sounds like this card is useful because of its hotplug support.

I noticed that if you grep some of your dmesg logs for "add_size", you have some lines like this:

[ 0.767652] pci 0000:42:04.0: bridge window [mem 0x00100000-0x000fffff 64bit pref] to [bus 44] add_size 200000 add_align 100000

I am not sure if these are the cause or a symptom of the problem, but I do not have any when assigning MMIO and MMIO_PREF for Thunderbolt 3.

I noticed you are using pci=hpmemsize in some of the tests. It should not be interfering because you put it first (it is overwritten by hpmmiosize and hpmmioprefsize). But I should point out that pci=hpmemsize=X is equivalent to pci=hpmmiosize=X,hpmmioprefsize=X, so it is redundant. When I added the hpmmiosize and hpmmioprefsize parameters to control them independently, I would have liked to have dropped hpmemsize, but needed to leave it around to not disrupt people who are already using it.

Please try something like this, which I dug up from a very old attempt to overhaul drivers/pci/setup-bus.c that I was working on.
It will release all the boot resources before the initial allocation, and should give the system a chance to cleanly assign all resources on the first pass / try. The allocation code works well until you use more than one pass - then things get very hairy. I just applied it to mine, and now everything is assigned on the first pass, with not a single failure to assign.

diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index 22aed6cdb..befaef6a8 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -1822,8 +1822,16 @@ void pci_assign_unassigned_root_bus_resources(struct pci_bus *bus)
 void __init pci_assign_unassigned_resources(void)
 {
 	struct pci_bus *root_bus;
+	struct pci_dev *dev;
 
 	list_for_each_entry(root_bus, &pci_root_buses, node) {
+		for_each_pci_bridge(dev, root_bus) {
+			pci_bridge_release_resources(dev->subordinate, IORESOURCE_IO);
+			pci_bridge_release_resources(dev->subordinate, IORESOURCE_MEM);
+			pci_bridge_release_resources(dev->subordinate, IORESOURCE_MEM_64);
+			pci_bridge_release_resources(dev->subordinate, IORESOURCE_MEM_64 | IORESOURCE_PREFETCH);
+		}
+
 		pci_assign_unassigned_root_bus_resources(root_bus);
 
 		/* Make sure the root bridge has a companion ACPI device */

Kind regards,
Nicholas

^ permalink raw reply related	[flat|nested] 14+ messages in thread
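One way to judge whether a pass like the one above actually assigned every window - an illustrative check, not part of the original exchange - is to search the boot log for the usual assignment-failure messages and then look at the switch downstream ports' windows, e.g.

    dmesg | grep -E 'no space for|failed to assign|can.t claim'
    sudo lspci -s 42: -vv | grep -E 'behind bridge'

A downstream port whose memory window shows up empty or disabled marks a slot that would not accept a hot-added NVMe device without further reassignment.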
* Re: PCIe hotplug resource issues with PEX switch (NVMe disks) on AMD Epyc system
  2019-12-15  3:16 ` Nicholas Johnson
@ 2019-12-16  6:48 ` Stefan Roese
  0 siblings, 0 replies; 14+ messages in thread

From: Stefan Roese @ 2019-12-16 6:48 UTC (permalink / raw)
To: Nicholas Johnson
Cc: linux-pci, Bjorn Helgaas, Mika Westerberg, Lukas Wunner, Sergey Miroshnichenko

Hi Nicholas,

On 15.12.19 04:16, Nicholas Johnson wrote:
> Hi,
>
>>> From the logs, it looks like MMIO_PREF was assigned 1G but not MMIO.
>>>
>>> This looks tricky. Please revert my commit:
>>> c13704f5685deb7d6eb21e293233e0901ed77377
>>>
>>> And see if it is the problem.
>>
>> I reverted this patch and did a few tests (some of my test cases). None turned out differently than before. Either the resources are not mapped completely or they are mapped (with pci=nocrs) and not accessible.
>>
>>> It is entirely possible, but because of the very old code and how there are multiple passes, it might be impossible to use realloc without side effects for somebody. If you fix it for one scenario, it is possible that there is another scenario for which it will break due to the change. The only way to make everything work is a near complete rewrite of drivers/pci/setup-bus.c and potentially others, something I am working on, but is going to take a long time. And unlikely to ever be accepted.
>>
>> While working on this issue, I looked (again) at this resource (re-)allocation code. This is really confusing (at least to me) and I also think that it needs a "near complete rewrite".
>>
>>> Otherwise, it will take me a lot of grepping through dmesg to find the cause, which will take more time.
>>
>> Sure.
>>
>>> FYI, "lspci -vvv" is redundant because it can be produced from "lspci -xxxx" output.
>>
>> I know. It's mainly for me to easily see the PCI devices listed quickly.
>>
>>> A final note, Epyc CPUs can bifurcate x16 slots into x4/x4/x4/x4 in the BIOS setup, although you will probably not have the hotplug services provided by the PEX switch.
>>
>> I think it should not matter for my current tests with resource assignment how many PCIe lanes the PEX switch has connected to the PCI root port. It's of course important for the bandwidth, but this is a completely different issue.
>
> I meant that you can connect 4x NVMe drives to a PCIe x16 slot with a cheap passive bifurcation riser. But it sounds like this card is useful because of its hotplug support.
>
> I noticed that if you grep some of your dmesg logs for "add_size", you have some lines like this:
> [ 0.767652] pci 0000:42:04.0: bridge window [mem 0x00100000-0x000fffff 64bit pref] to [bus 44] add_size 200000 add_align 100000
>
> I am not sure if these are the cause or a symptom of the problem, but I do not have any when assigning MMIO and MMIO_PREF for Thunderbolt 3.
>
> I noticed you are using pci=hpmemsize in some of the tests. It should not be interfering because you put it first (it is overwritten by hpmmiosize and hpmmioprefsize). But I should point out that pci=hpmemsize=X is equivalent to pci=hpmmiosize=X,hpmmioprefsize=X, so it is redundant. When I added the hpmmiosize and hpmmioprefsize parameters to control them independently, I would have liked to have dropped hpmemsize, but needed to leave it around to not disrupt people who are already using it.

Thanks. I was aware of keeping the old notation to not break backwards compatibility. I'll drop hpmemsize=X from now on.
> Please try something like this, which I dug up from a very old attempt > to overhaul drivers/pci/setup-bus.c that I was working on. It will > release all the boot resources before the initial allocation, and should > give the system a chance to cleanly assign all resources on the first > pass / try. The allocation code works well until you use more than one > pass - then things get very hairy. I just applied it to mine, and now > everything applies the first pass, with not a single failure to assign. Do you have some hot-plug enabled slots in your system? > diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c > index 22aed6cdb..befaef6a8 100644 > --- a/drivers/pci/setup-bus.c > +++ b/drivers/pci/setup-bus.c > @@ -1822,8 +1822,16 @@ void pci_assign_unassigned_root_bus_resources(struct pci_bus *bus) > void __init pci_assign_unassigned_resources(void) > { > struct pci_bus *root_bus; > + struct pci_dev *dev; > > list_for_each_entry(root_bus, &pci_root_buses, node) { > + for_each_pci_bridge(dev, root_bus) { > + pci_bridge_release_resources(dev->subordinate, IORESOURCE_IO); > + pci_bridge_release_resources(dev->subordinate, IORESOURCE_MEM); > + pci_bridge_release_resources(dev->subordinate, IORESOURCE_MEM_64); > + pci_bridge_release_resources(dev->subordinate, IORESOURCE_MEM_64 | IORESOURCE_PREFETCH); > + } > + > pci_assign_unassigned_root_bus_resources(root_bus); > > /* Make sure the root bridge has a companion ACPI device */ Thanks. I've applied this patch to my tree without c13704f5 reverted. Which parameters should I pass to the kernel? I tested with a few versions and most are not able to mount the rootfs (most likely SATA controller not probed correctly). Here is the log from one version that did boot to the prompt. But the resources are not mapped and NVMe is not probed because of this: Test g: pci=realloc,pcie_bus_safe https://filebin.ca/55U8waihXJVI/logs.tar.bz2 Is there another test parameter set that I should test? I can also provide the logs of the failing boot tests, since I have connected a serial console to the system. Just let me know. Thanks, Stefan ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: PCIe hotplug resource issues with PEX switch (NVMe disks) on AMD Epyc system 2019-12-13 11:52 ` Nicholas Johnson 2019-12-13 12:17 ` Stefan Roese @ 2019-12-16 0:46 ` Bjorn Helgaas 2019-12-16 6:50 ` Stefan Roese 1 sibling, 1 reply; 14+ messages in thread From: Bjorn Helgaas @ 2019-12-16 0:46 UTC (permalink / raw) To: Nicholas Johnson Cc: Stefan Roese, linux-pci, Bjorn Helgaas, Mika Westerberg, Lukas Wunner, Sergey Miroshnichenko > > The logs are also included. Please let me know, if I should do any other > > tests and provide the logs. Please include these logs in your mail to the list or post them someplace where everybody can see them. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: PCIe hotplug resource issues with PEX switch (NVMe disks) on AMD Epyc system 2019-12-16 0:46 ` Bjorn Helgaas @ 2019-12-16 6:50 ` Stefan Roese 2019-12-16 15:50 ` Keith Busch 0 siblings, 1 reply; 14+ messages in thread From: Stefan Roese @ 2019-12-16 6:50 UTC (permalink / raw) To: bjorn, Nicholas Johnson Cc: linux-pci, Bjorn Helgaas, Mika Westerberg, Lukas Wunner, Sergey Miroshnichenko On 16.12.19 01:46, Bjorn Helgaas wrote: >>> The logs are also included. Please let me know, if I should do any other >>> tests and provide the logs. > > Please include these logs in your mail to the list or post them > someplace where everybody can see them. Gladly. Please find the archive here: https://filebin.ca/55U8waihXJVI/logs.tar.bz2 Thanks, Stefan ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: PCIe hotplug resource issues with PEX switch (NVMe disks) on AMD Epyc system 2019-12-16 6:50 ` Stefan Roese @ 2019-12-16 15:50 ` Keith Busch [not found] ` <f3a51108-10e4-f60d-de18-a12de85d07df@denx.de> 0 siblings, 1 reply; 14+ messages in thread From: Keith Busch @ 2019-12-16 15:50 UTC (permalink / raw) To: Stefan Roese Cc: bjorn, Nicholas Johnson, linux-pci, Bjorn Helgaas, Mika Westerberg, Lukas Wunner, Sergey Miroshnichenko On Mon, Dec 16, 2019 at 07:50:20AM +0100, Stefan Roese wrote: > On 16.12.19 01:46, Bjorn Helgaas wrote: > > > > The logs are also included. Please let me know, if I should do any other > > > > tests and provide the logs. > > > > Please include these logs in your mail to the list or post them > > someplace where everybody can see them. > > Gladly. Please find the archive here: > > https://filebin.ca/55U8waihXJVI/logs.tar.bz2 I can't access that. Could you paste directly into the email? I'm just looking for 'dmesg' and 'lspci -vvv' right now, so trim to that if your full capture is too long. ^ permalink raw reply [flat|nested] 14+ messages in thread
[parent not found: <f3a51108-10e4-f60d-de18-a12de85d07df@denx.de>]
* Re: PCIe hotplug resource issues with PEX switch (NVMe disks) on AMD Epyc system
  [not found] ` <f3a51108-10e4-f60d-de18-a12de85d07df@denx.de>
@ 2019-12-16 19:32 ` Keith Busch
  0 siblings, 0 replies; 14+ messages in thread

From: Keith Busch @ 2019-12-16 19:32 UTC (permalink / raw)
To: Stefan Roese
Cc: bjorn, Nicholas Johnson, linux-pci, Bjorn Helgaas, Mika Westerberg, Lukas Wunner, Sergey Miroshnichenko

On Mon, Dec 16, 2019 at 06:34:22PM +0100, Stefan Roese wrote:
> Hi Keith,
>
> On 16.12.19 16:50, Keith Busch wrote:
> > On Mon, Dec 16, 2019 at 07:50:20AM +0100, Stefan Roese wrote:
> > > On 16.12.19 01:46, Bjorn Helgaas wrote:
> > > > > > The logs are also included. Please let me know, if I should do any other tests and provide the logs.
> > > >
> > > > Please include these logs in your mail to the list or post them someplace where everybody can see them.
> > >
> > > Gladly. Please find the archive here:
> > >
> > > https://filebin.ca/55U8waihXJVI/logs.tar.bz2
> >
> > I can't access that. Could you paste directly into the email? I'm just looking for 'dmesg' and 'lspci -vvv' right now, so trim to that if your full capture is too long.
>
> Sure, here is a try with inline logs (stripped down a bit). I didn't include all test versions for now, since this increases the mail size even more. Only tests a) ... d) are inlined here:

I think your platform BIOS simply doesn't support it. It does not provision empty slots on its own, and it doesn't tolerate the OS reassigning resources to them from what appears to be unassigned memory windows. The platform may be using those memory windows for something outside the kernel's visibility.

What happens if you boot the system with all slots populated? Do all devices configure in that case, and if so, can you hot-swap them?

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: PCIe hotplug resource issues with PEX switch (NVMe disks) on AMD Epyc system
  2019-12-13  8:35 PCIe hotplug resource issues with PEX switch (NVMe disks) on AMD Epyc system Stefan Roese
  2019-12-13  9:00 ` Nicholas Johnson
@ 2019-12-16 23:37 ` Bjorn Helgaas
  2019-12-17 13:54 ` Stefan Roese
  1 sibling, 1 reply; 14+ messages in thread

From: Bjorn Helgaas @ 2019-12-16 23:37 UTC (permalink / raw)
To: Stefan Roese
Cc: linux-pci, Mika Westerberg, Lukas Wunner, Sergey Miroshnichenko, Nicholas Johnson, Keith Busch

[+cc Keith]

On Fri, Dec 13, 2019 at 09:35:19AM +0100, Stefan Roese wrote:
> Hi!
>
> I am facing an issue with PCIe-Hotplug on an AMD Epyc based system. Our system is equipped with an HBA for NVMe SSDs incl. PCIe switch (Supermicro AOC-SLG3-4E2P) [1] and we would like to be able to hotplug NVMe disks.

Your system has several host bridges. The address space routed to each host bridge is determined by firmware, and Linux has no support for changing it. Here's the space routed to the hierarchy containing the NVMe devices:

  ACPI: PCI Root Bridge [S0D2] (domain 0000 [bus 40-5f])
  pci_bus 0000:40: root bus resource [mem 0xeb000000-0xeb5fffff window]       6MB
  pci_bus 0000:40: root bus resource [mem 0x7fc8000000-0xfcffffffff window]   501GB+
  pci_bus 0000:40: root bus resource [bus 40-5f]

Since you have several host bridges, using "pci=nocrs" is pretty much guaranteed to fail if Linux changes any PCI address assignments. It makes Linux *ignore* the routing information from firmware, but it doesn't *change* any of the routing. That's why experiment (d) fails: we assigned this space:

  pci 0000:44:00.0: BAR 0: assigned [mem 0xec000000-0xec003fff 64bit]

but according to the BIOS, the [mem 0xec000000-0xefffffff window] area is routed to bus 00, not bus 40, so when we try to access that BAR, it goes to bus 00 where nothing responds.

There are three devices on bus 40 that consume memory address space:

  40:03.1 Root Port to [bus 41-47]  [mem 0xeb400000-0xeb5fffff]  2MB
  40:07.1 Root Port to [bus 48]     [mem 0xeb200000-0xeb3fffff]  2MB
  40:08.1 Root Port to [bus 49]     [mem 0xeb000000-0xeb1fffff]  2MB

Bridges (including Root Ports and Switch Ports) consume memory address space in 1MB chunks. The devices on buses 48 and 49 need a little over 1MB, so 40:07.1 and 40:08.1 need at least 2MB each. There's only 6MB available, so that leaves 2MB for 40:03.1, which leads to the PLX switch.

That 2MB of memory space is routed to the PLX Switch Upstream Port, which has a BAR of its own that requires 256K, which leaves 1MB for it to send to its Downstream Ports.

The Intel NVMe device only needs 16KB of memory space, but since the Switch Port windows are a minimum of 1MB, only one port gets memory space.

So with this configuration, I think you're stuck. The only things I can think of are:

  - Put the PLX switch in a different slot to see if BIOS will assign more space to it (the other host bridges have more space available).

  - Boot with all four PLX slots occupied by NVMe devices. The BIOS may assign space to accommodate them all. If it does, you should be able to hot-remove and add devices after boot.

  - Change Linux to use prefetchable space. The Intel NVMe wants *non-prefetchable* space, but there's an implementation note in the spec (PCIe r5.0, sec 7.5.1.2.1) that says it should be safe to put it in prefetchable space in certain cases (entire path is PCIe, no PCI/PCI-X devices doing peer-to-peer reads, host bridge does no byte merging, etc). The main problem is that we don't have a good way to identify these cases.
> Currently, I'm testing with v5.5.0-rc1 and series [2] applied. Here > a few tests and results that I did so far. All tests were done with > one Intel NVMe SSD connected to one of the 4 NVMe ports of the HBA > and the other 3 ports (currently) left unconnected: > > a) Kernel Parameter "pci=pcie_bus_safe" > The resources of the 3 unused PCIe slots of the PEX switch are not > assigned in this test. > > b) Kernel Parameter "pci=pcie_bus_safe,hpmemsize=0,hpiosize=0,hpmmiosize=1M,hpmmioprefsize=0" > With this test I restricted the resources of the HP slots to the > minimum. Still this results in unassigned resourced for the unused > PCIe slots of the PEX switch. > > c) Kernel Parameter "pci=realloc,pcie_bus_safe,hpmemsize=0,hpiosize=0,hpmmiosize=1M,hpmmioprefsize=0" > Again, not all resources are assigned. > > d) Kernel Parameter "pci=nocrs,realloc,pcie_bus_safe,hpmemsize=0,hpiosize=0,hpmmiosize=1M,hpmmioprefsize=0" > Now all requested resources are available for the HP PCIe slots of the > PEX switch. But the NVMe driver fails while probing. Debugging has > shown, that reading from the BAR of the NVMe disk returns 0xffffffff. > Also reading from the PLX PEX switch registers returns 0xfffffff in this > case (this works of course without nocrs, when the BARs are mapped at > a different address). > > Does anybody have a clue on why the access to the PEX switch and / or > the NVMe BAR does not work in the "nocrs" case? The BARs are located in > the same window that is provided by the BIOS in the ACPI list (but is > "ignored" in this case) [3]. > > Or if it is possible to get the HP resource mapping done correctly without > setting "nocrs" for our setup with the PCIe/NVMe switch? > > [1] https://www.supermicro.com/en/products/accessories/addon/AOC-SLG3-4E2P.php > [2] https://lkml.org/lkml/2019/12/9/388 > [3] > [ 0.701932] acpi PNP0A08:00: host bridge window [io 0x0cf8-0x0cff] (ignored) > [ 0.701934] acpi PNP0A08:00: host bridge window [io 0x0000-0x02ff window] (ignored) > [ 0.701935] acpi PNP0A08:00: host bridge window [io 0x0300-0x03af window] (ignored) > [ 0.701936] acpi PNP0A08:00: host bridge window [io 0x03e0-0x0cf7 window] (ignored) > [ 0.701937] acpi PNP0A08:00: host bridge window [io 0x03b0-0x03df window] (ignored) > [ 0.701938] acpi PNP0A08:00: host bridge window [io 0x0d00-0x3fff window] (ignored) > [ 0.701939] acpi PNP0A08:00: host bridge window [mem 0x000a0000-0x000bffff window] (ignored) > [ 0.701939] acpi PNP0A08:00: host bridge window [mem 0x000c0000-0x000dffff window] (ignored) > [ 0.701940] acpi PNP0A08:00: host bridge window [mem 0xec000000-0xefffffff window] (ignored) > [ 0.701941] acpi PNP0A08:00: host bridge window [mem 0x182c8000000-0x1ffffffffff window] (ignored) > ... > 41:00.0 PCI bridge: PLX Technology, Inc. PEX 9733 33-lane, 9-port PCI Express Gen 3 (8.0 GT/s) Switch (rev b0) (prog-if 00 [Normal decode]) > Flags: bus master, fast devsel, latency 0, IRQ 47, NUMA node 2 > Memory at ec400000 (32-bit, non-prefetchable) [size=256K] > Bus: primary=41, secondary=42, subordinate=47, sec-latency=0 > I/O behind bridge: None > Memory behind bridge: ec000000-ec3fffff [size=4M] > Prefetchable memory behind bridge: None > Capabilities: <access denied> > Kernel driver in use: pcieport > epyc@epyc-Super-Server:~/stefan$ sudo ./memtool md 0xec400000+0x10 > ec400000: ffffffff ffffffff ffffffff ffffffff ................ ^ permalink raw reply [flat|nested] 14+ messages in thread
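The window arithmetic above can be checked on a running system with standard tools - a sketch for illustration, not commands taken from the thread:

    sudo cat /proc/iomem | grep -i 'PCI Bus 0000:4'
    sudo lspci -s 40:03.1 -vv | grep -E 'behind bridge'
    sudo lspci -s 42: -vv | grep -E 'behind bridge'

With roughly 6MB of non-prefetchable space on the 0000:40 root bridge, three root ports each taking a 1MB-aligned chunk, and the PLX upstream port BAR consuming 256K of the 2MB left for 40:03.1, only one 1MB downstream window remains, consistent with only one NVMe slot getting memory space.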
* Re: PCIe hotplug resource issues with PEX switch (NVMe disks) on AMD Epyc system
  2019-12-16 23:37 ` Bjorn Helgaas
@ 2019-12-17 13:54 ` Stefan Roese
  2019-12-17 16:30 ` Keith Busch
  0 siblings, 1 reply; 14+ messages in thread

From: Stefan Roese @ 2019-12-17 13:54 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: linux-pci, Mika Westerberg, Lukas Wunner, Sergey Miroshnichenko, Nicholas Johnson, Keith Busch

Hi Bjorn,

On 17.12.19 00:37, Bjorn Helgaas wrote:
> [+cc Keith]
>
> On Fri, Dec 13, 2019 at 09:35:19AM +0100, Stefan Roese wrote:
>> Hi!
>>
>> I am facing an issue with PCIe-Hotplug on an AMD Epyc based system. Our system is equipped with an HBA for NVMe SSDs incl. PCIe switch (Supermicro AOC-SLG3-4E2P) [1] and we would like to be able to hotplug NVMe disks.
>
> Your system has several host bridges. The address space routed to each host bridge is determined by firmware, and Linux has no support for changing it. Here's the space routed to the hierarchy containing the NVMe devices:
>
>   ACPI: PCI Root Bridge [S0D2] (domain 0000 [bus 40-5f])
>   pci_bus 0000:40: root bus resource [mem 0xeb000000-0xeb5fffff window]       6MB
>   pci_bus 0000:40: root bus resource [mem 0x7fc8000000-0xfcffffffff window]   501GB+
>   pci_bus 0000:40: root bus resource [bus 40-5f]
>
> Since you have several host bridges, using "pci=nocrs" is pretty much guaranteed to fail if Linux changes any PCI address assignments. It makes Linux *ignore* the routing information from firmware, but it doesn't *change* any of the routing. That's why experiment (d) fails: we assigned this space:
>
>   pci 0000:44:00.0: BAR 0: assigned [mem 0xec000000-0xec003fff 64bit]
>
> but according to the BIOS, the [mem 0xec000000-0xefffffff window] area is routed to bus 00, not bus 40, so when we try to access that BAR, it goes to bus 00 where nothing responds.

Thanks for your analysis. I totally missed this multiple host bridges aspect here. This completely explains what's happening with "nocrs", which can't be used on this platform because of this (without the ability to change the routing in the PCI host bridges as well).

> There are three devices on bus 40 that consume memory address space:
>
>   40:03.1 Root Port to [bus 41-47]  [mem 0xeb400000-0xeb5fffff]  2MB
>   40:07.1 Root Port to [bus 48]     [mem 0xeb200000-0xeb3fffff]  2MB
>   40:08.1 Root Port to [bus 49]     [mem 0xeb000000-0xeb1fffff]  2MB
>
> Bridges (including Root Ports and Switch Ports) consume memory address space in 1MB chunks. The devices on buses 48 and 49 need a little over 1MB, so 40:07.1 and 40:08.1 need at least 2MB each. There's only 6MB available, so that leaves 2MB for 40:03.1, which leads to the PLX switch.
>
> That 2MB of memory space is routed to the PLX Switch Upstream Port, which has a BAR of its own that requires 256K, which leaves 1MB for it to send to its Downstream Ports.
>
> The Intel NVMe device only needs 16KB of memory space, but since the Switch Port windows are a minimum of 1MB, only one port gets memory space.
>
> So with this configuration, I think you're stuck. The only things I can think of are:
>
>   - Put the PLX switch in a different slot to see if BIOS will assign more space to it (the other host bridges have more space available).

Thanks for this suggestion. Using a different slot (with more resources) enables the resource assignment for all 4 HP slots of the PLX switch.
Only when I use this patch from Nicholas though (and pci=realloc): https://lore.kernel.org/linux-pci/20191216233759.GA249123@google.com/T/#mbb5abd0131f05dbd5030952f567b3e4ec92f2af4 > - Boot with all four PLX slots occupied by NVMe devices. The BIOS > may assign space to accommodate them all. If it does, you should > be able to hot-remove and add devices after boot. Unfortunately, that's not an option. We need to be able to boot with e.g. one NVMe device and hot-plug one or more devices later. > - Change Linux to use prefetchable space. The Intel NVMe wants > *non-prefetchable* space, but there's an implementation note in > the spec (PCIe r5.0, sec 7.5.1.2.1) that says it should be safe to > put it in prefetchable space in certain cases (entire path is > PCIe, no PCI/PCI-X devices to peer-to-peer reads, host bridge does > no byte merging, etc). The main problem is that we don't have a > good way to identify these cases. Thanks for this suggestion. I might look into this. Right now, I'm experimenting with the "solution" mentioned above, which looks like it solves our issues for now. Thanks, Stefan ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: PCIe hotplug resource issues with PEX switch (NVMe disks) on AMD Epyc system 2019-12-17 13:54 ` Stefan Roese @ 2019-12-17 16:30 ` Keith Busch 2019-12-17 16:45 ` Stefan Roese 0 siblings, 1 reply; 14+ messages in thread From: Keith Busch @ 2019-12-17 16:30 UTC (permalink / raw) To: Stefan Roese Cc: Bjorn Helgaas, linux-pci, Mika Westerberg, Lukas Wunner, Sergey Miroshnichenko, Nicholas Johnson On Tue, Dec 17, 2019 at 02:54:06PM +0100, Stefan Roese wrote: > On 17.12.19 00:37, Bjorn Helgaas wrote: > > - Boot with all four PLX slots occupied by NVMe devices. The BIOS > > may assign space to accommodate them all. If it does, you should > > be able to hot-remove and add devices after boot. > > Unfortunately, that's not an option. We need to be able to boot with > e.g. one NVMe device and hot-plug one or more devices later. That was also my suggestion, but not necessarily as a "solution". It's just to see if it works, which might indicate what the kernel could do differently. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: PCIe hotplug resource issues with PEX switch (NVMe disks) on AMD Epyc system 2019-12-17 16:30 ` Keith Busch @ 2019-12-17 16:45 ` Stefan Roese 0 siblings, 0 replies; 14+ messages in thread From: Stefan Roese @ 2019-12-17 16:45 UTC (permalink / raw) To: Keith Busch Cc: Bjorn Helgaas, linux-pci, Mika Westerberg, Lukas Wunner, Sergey Miroshnichenko, Nicholas Johnson On 17.12.19 17:30, Keith Busch wrote: > On Tue, Dec 17, 2019 at 02:54:06PM +0100, Stefan Roese wrote: >> On 17.12.19 00:37, Bjorn Helgaas wrote: >>> - Boot with all four PLX slots occupied by NVMe devices. The BIOS >>> may assign space to accommodate them all. If it does, you should >>> be able to hot-remove and add devices after boot. >> >> Unfortunately, that's not an option. We need to be able to boot with >> e.g. one NVMe device and hot-plug one or more devices later. > > That was also my suggestion, but not necessarily as a "solution". It's > just to see if it works, which might indicate what the kernel could do > differently. I see, thanks. Right now, I don't have enough NVMe devices available to do such a test. I'll add it to my list for tests that should be done though. Thanks, Stefan ^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads: [~2019-12-17 16:45 UTC | newest]

Thread overview: 14+ messages
-- links below jump to the message on this page --
2019-12-13  8:35 PCIe hotplug resource issues with PEX switch (NVMe disks) on AMD Epyc system Stefan Roese
2019-12-13  9:00 ` Nicholas Johnson
[not found]      ` <c9f154e5-4214-aa46-2ce2-443b508e1643@denx.de>
2019-12-13 11:52 ` Nicholas Johnson
2019-12-13 12:17 ` Stefan Roese
2019-12-15  3:16 ` Nicholas Johnson
2019-12-16  6:48 ` Stefan Roese
2019-12-16  0:46 ` Bjorn Helgaas
2019-12-16  6:50 ` Stefan Roese
2019-12-16 15:50 ` Keith Busch
[not found]      ` <f3a51108-10e4-f60d-de18-a12de85d07df@denx.de>
2019-12-16 19:32 ` Keith Busch
2019-12-16 23:37 ` Bjorn Helgaas
2019-12-17 13:54 ` Stefan Roese
2019-12-17 16:30 ` Keith Busch
2019-12-17 16:45 ` Stefan Roese