* Question about reserved_regions w/ Intel IOMMU
  From: Alexander Duyck @ 2023-06-07 22:40 UTC
  To: LKML, open list:INTEL IOMMU (VT-d), linux-pci

I am running into a DMA issue that appears to be a conflict between
ACS and the IOMMU. As per the documentation I can find, the IOMMU is
supposed to create reserved regions for MSI and for the memory window
behind the root port. However, looking at reserved_regions I am not
seeing that; I only see the reservation for the MSI.

So for example, with an enabled NIC and the iommu enabled w/o passthru
I am seeing:

# cat /sys/bus/pci/devices/0000\:83\:00.0/iommu_group/reserved_regions
0x00000000fee00000 0x00000000feefffff msi

Shouldn't there also be a memory window for the region behind the root
port to prevent any possible peer-to-peer access?
* Re: Question about reserved_regions w/ Intel IOMMU
  From: Alexander Duyck @ 2023-06-07 23:03 UTC
  To: LKML, linux-pci, iommu

On Wed, Jun 7, 2023 at 3:40 PM Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
> I am running into a DMA issue that appears to be a conflict between
> ACS and the IOMMU. As per the documentation I can find, the IOMMU is
> supposed to create reserved regions for MSI and for the memory window
> behind the root port. However, looking at reserved_regions I am not
> seeing that; I only see the reservation for the MSI.
>
> So for example, with an enabled NIC and the iommu enabled w/o passthru
> I am seeing:
>
> # cat /sys/bus/pci/devices/0000\:83\:00.0/iommu_group/reserved_regions
> 0x00000000fee00000 0x00000000feefffff msi
>
> Shouldn't there also be a memory window for the region behind the root
> port to prevent any possible peer-to-peer access?

Since the iommu portion of the email bounced, I figured I would fix
that and provide some additional info.

I added some instrumentation to the kernel to dump the resources found
in iova_reserve_pci_windows(). From what I can tell it is finding the
correct resources for the Memory and Prefetchable regions behind the
root port. It seems to be calling reserve_iova(), which is successfully
allocating an iova to reserve the region.

However, still no luck on why it isn't showing up in reserved_regions.
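[Note: a minimal sketch of the kind of instrumentation described above,
based on the general shape of iova_reserve_pci_windows() in
drivers/iommu/dma-iommu.c around v5.12. Exact names and layout differ
between kernel versions; the dev_info() line is the only addition, and
the inbound dma_ranges handling of the real function is omitted.]

static int iova_reserve_pci_windows(struct pci_dev *dev,
				    struct iova_domain *iovad)
{
	struct pci_host_bridge *bridge = pci_find_host_bridge(dev->bus);
	struct resource_entry *window;
	unsigned long lo, hi;

	resource_list_for_each_entry(window, &bridge->windows) {
		if (resource_type(window->res) != IORESOURCE_MEM)
			continue;

		/* window->offset converts resource (CPU) to bus addresses */
		lo = iova_pfn(iovad, window->res->start - window->offset);
		hi = iova_pfn(iovad, window->res->end - window->offset);

		/* Instrumentation: dump each bridge window as it is reserved. */
		dev_info(&dev->dev, "reserving window %pR (iova pfn 0x%lx-0x%lx)\n",
			 window->res, lo, hi);

		reserve_iova(iovad, lo, hi);
	}

	/* ... inbound dma_ranges handling omitted in this sketch ... */
	return 0;
}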
* Re: Question about reserved_regions w/ Intel IOMMU
  From: Baolu Lu @ 2023-06-08  3:03 UTC
  To: Alexander Duyck, LKML, linux-pci, iommu; +Cc: baolu.lu

On 6/8/23 7:03 AM, Alexander Duyck wrote:
> On Wed, Jun 7, 2023 at 3:40 PM Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>>
>> I am running into a DMA issue that appears to be a conflict between
>> ACS and the IOMMU. As per the documentation I can find, the IOMMU is
>> supposed to create reserved regions for MSI and for the memory window
>> behind the root port. However, looking at reserved_regions I am not
>> seeing that; I only see the reservation for the MSI.
>>
>> So for example, with an enabled NIC and the iommu enabled w/o passthru
>> I am seeing:
>>
>> # cat /sys/bus/pci/devices/0000\:83\:00.0/iommu_group/reserved_regions
>> 0x00000000fee00000 0x00000000feefffff msi
>>
>> Shouldn't there also be a memory window for the region behind the root
>> port to prevent any possible peer-to-peer access?
>
> Since the iommu portion of the email bounced, I figured I would fix
> that and provide some additional info.
>
> I added some instrumentation to the kernel to dump the resources found
> in iova_reserve_pci_windows(). From what I can tell it is finding the
> correct resources for the Memory and Prefetchable regions behind the
> root port. It seems to be calling reserve_iova(), which is successfully
> allocating an iova to reserve the region.
>
> However, still no luck on why it isn't showing up in reserved_regions.

Perhaps I can ask the opposite question: why should it show up in
reserved_regions? Why should the iommu subsystem block any possible
peer-to-peer DMA access? Isn't that a decision for the device driver?

The iova_reserve_pci_windows() you've seen is for the kernel DMA
interfaces, which are not related to peer-to-peer accesses.

Best regards,
baolu
* Re: Question about reserved_regions w/ Intel IOMMU
  From: Alexander Duyck @ 2023-06-08 14:33 UTC
  To: Baolu Lu; +Cc: LKML, linux-pci, iommu

On Wed, Jun 7, 2023 at 8:05 PM Baolu Lu <baolu.lu@linux.intel.com> wrote:
>
> On 6/8/23 7:03 AM, Alexander Duyck wrote:
> > On Wed, Jun 7, 2023 at 3:40 PM Alexander Duyck
> > <alexander.duyck@gmail.com> wrote:
> <...>
> > However, still no luck on why it isn't showing up in reserved_regions.
>
> Perhaps I can ask the opposite question: why should it show up in
> reserved_regions? Why should the iommu subsystem block any possible
> peer-to-peer DMA access? Isn't that a decision for the device driver?
>
> The iova_reserve_pci_windows() you've seen is for the kernel DMA
> interfaces, which are not related to peer-to-peer accesses.

The problem is if the IOVA overlaps with the physical addresses of
other devices that can be routed to via ACS redirect. If ACS redirect
is enabled, a host IOVA could be directed to another device on the
switch instead. To prevent that we need to reserve those addresses to
avoid address space collisions.

From what I can tell it looks like the IOVA should be reserved, but I
don't see it showing up anywhere in reserved_regions. What I am
wondering is if iova_reserve_pci_windows() should be taking some steps
so that it will appear, or if intel_iommu_get_resv_regions() needs to
have some code similar to iova_reserve_pci_windows() to get the ranges
and verify they are reserved in the IOVA.
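[Note: a rough, untested sketch of the second idea floated above, i.e.
reporting the host-bridge MMIO windows from a driver's
.get_resv_regions() callback so they become visible in the
reserved_regions file. This is illustrative only (later in the thread
it is pointed out that a variant of this was tried upstream and
reverted). The helper name is hypothetical; signatures follow ~v5.12,
before iommu_alloc_resv_region() gained a gfp_t argument.]

static void resv_regions_add_pci_windows(struct device *dev,
					 struct list_head *head)
{
	struct pci_host_bridge *bridge;
	struct resource_entry *window;
	struct iommu_resv_region *region;

	if (!dev_is_pci(dev))
		return;

	bridge = pci_find_host_bridge(to_pci_dev(dev)->bus);
	resource_list_for_each_entry(window, &bridge->windows) {
		if (resource_type(window->res) != IORESOURCE_MEM)
			continue;

		/* window->offset converts resource (CPU) to bus addresses */
		region = iommu_alloc_resv_region(window->res->start - window->offset,
						 resource_size(window->res),
						 0, IOMMU_RESV_RESERVED);
		if (!region)
			return;
		list_add_tail(&region->list, head);
	}
}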
* Re: Question about reserved_regions w/ Intel IOMMU
  From: Ashok Raj @ 2023-06-08 15:38 UTC
  To: Alexander Duyck; +Cc: Baolu Lu, LKML, linux-pci, iommu, Ashok Raj

On Thu, Jun 08, 2023 at 07:33:31AM -0700, Alexander Duyck wrote:
> On Wed, Jun 7, 2023 at 8:05 PM Baolu Lu <baolu.lu@linux.intel.com> wrote:
> >
> > On 6/8/23 7:03 AM, Alexander Duyck wrote:
> <...>
> > Perhaps I can ask the opposite question: why should it show up in
> > reserved_regions? Why should the iommu subsystem block any possible
> > peer-to-peer DMA access? Isn't that a decision for the device driver?
> >
> > The iova_reserve_pci_windows() you've seen is for the kernel DMA
> > interfaces, which are not related to peer-to-peer accesses.
>
> The problem is if the IOVA overlaps with the physical addresses of
> other devices that can be routed to via ACS redirect. If ACS redirect
> is enabled, a host IOVA could be directed to another device on the
> switch instead. To prevent that we need to reserve those addresses to
> avoid address space collisions.

Any untranslated address from a device must be forwarded to the IOMMU
when ACS is enabled, correct? I guess if you want true p2p, then you
would need to map so that the HPA turns into the peer address... but
it's always a round trip to the IOMMU.

> From what I can tell it looks like the IOVA should be reserved, but I
> don't see it showing up anywhere in reserved_regions. What I am
> wondering is if iova_reserve_pci_windows() should be taking some steps
> so that it will appear, or if intel_iommu_get_resv_regions() needs to
> have some code similar to iova_reserve_pci_windows() to get the ranges
> and verify they are reserved in the IOVA.
* Re: Question about reserved_regions w/ Intel IOMMU
  From: Alexander Duyck @ 2023-06-08 17:10 UTC
  To: Ashok Raj; +Cc: Baolu Lu, LKML, linux-pci, iommu, Ashok Raj

On Thu, Jun 8, 2023 at 8:40 AM Ashok Raj <ashok_raj@linux.intel.com> wrote:
>
> On Thu, Jun 08, 2023 at 07:33:31AM -0700, Alexander Duyck wrote:
> > On Wed, Jun 7, 2023 at 8:05 PM Baolu Lu <baolu.lu@linux.intel.com> wrote:
> <...>
> > The problem is if the IOVA overlaps with the physical addresses of
> > other devices that can be routed to via ACS redirect. If ACS redirect
> > is enabled, a host IOVA could be directed to another device on the
> > switch instead. To prevent that we need to reserve those addresses to
> > avoid address space collisions.

Our test case is just to perform DMA to/from the host on one device on
a switch, and what we are seeing is that when we hit an IOVA that
matches up with the physical address of the neighboring device's BAR0,
we see an AER followed by a hot reset.

> Any untranslated address from a device must be forwarded to the IOMMU
> when ACS is enabled, correct? I guess if you want true p2p, then you
> would need to map so that the HPA turns into the peer address... but
> it's always a round trip to the IOMMU.

This assumes all parts are doing the Request Redirect "correctly". In
our case there is a PCIe switch we are trying to debug, and we have a
few working theories.

One concern I have is that the switch may be throwing an ACS violation
for us using an address that matches a neighboring device, instead of
redirecting it to the upstream port. If we pull the switch and just run
on the root complex the issue seems to be resolved, so I started poking
into the code, which led me to the documentation pointing out what is
supposed to be reserved based on the root complex and MSI regions.

As a part of going down that rabbit hole I realized that
reserved_regions seems to only list the MSI reservation. However, after
digging a bit deeper it seems like there is code to reserve the memory
behind the root complex in the IOVA, but it doesn't look like that is
visible anywhere, and that is the piece I am currently trying to sort
out. What I am working on is trying to figure out if the system that is
failing is actually reserving that memory region in the IOVA, or if
that is somehow not happening in our test setup.
* Re: Question about reserved_regions w/ Intel IOMMU
  From: Ashok Raj @ 2023-06-08 17:52 UTC
  To: Alexander Duyck; +Cc: Ashok Raj, Baolu Lu, LKML, linux-pci, iommu, Ashok Raj

On Thu, Jun 08, 2023 at 10:10:54AM -0700, Alexander Duyck wrote:
> On Thu, Jun 8, 2023 at 8:40 AM Ashok Raj <ashok_raj@linux.intel.com> wrote:
> <...>
> Our test case is just to perform DMA to/from the host on one device on
> a switch, and what we are seeing is that when we hit an IOVA that
> matches up with the physical address of the neighboring device's BAR0,
> we see an AER followed by a hot reset.

ACS is always confusing... Does your NIC have a device TLB (DTLB)?

If Request Redirect is set, and Egress is enabled, then all
transactions should go upstream to the root port -> IOMMU before being
served.

In my PCIe 6.0 spec it's in 6.12.3, "ACS Peer-to-Peer Control
Interactions".

And maybe lspci would show how things are set up in the switch?

> > Any untranslated address from a device must be forwarded to the IOMMU
> > when ACS is enabled, correct? I guess if you want true p2p, then you
> > would need to map so that the HPA turns into the peer address... but
> > it's always a round trip to the IOMMU.
>
> This assumes all parts are doing the Request Redirect "correctly". In
> our case there is a PCIe switch we are trying to debug, and we have a
> few working theories. One concern I have is that the switch may be
> throwing an ACS violation for us using an address that matches a
> neighboring device, instead of redirecting it to the upstream port. If
> we pull the switch and just run on the root complex the issue seems to
> be resolved, so I started poking into the code, which led me to the
> documentation pointing out what is supposed to be reserved based on
> the root complex and MSI regions.
>
> As a part of going down that rabbit hole I realized that
> reserved_regions seems to only list the MSI reservation. <...>

I suspect that with the IOMMU there is no need to punch holes like we
do for the MSI region. I vaguely recall we did that in very early IOMMU
code, but our knowledge of ACS was weak (not that it has improved :-)).

Knowing how the switch and root ports are set up for forwarding may
provide some clues. The easy option may be to forcibly add the range to
the reserved regions and see whether you still get the ACS violation.

Baolu might have some better ideas.

--
Cheers,
Ashok

Bike Shedding: (a.k.a. Parkinson's Law of Triviality) - When the
discussion on a topic is inversely proportionate to the gravity of the
topic.
* Re: Question about reserved_regions w/ Intel IOMMU
  From: Alexander Duyck @ 2023-06-08 18:15 UTC
  To: Ashok Raj; +Cc: Ashok Raj, Baolu Lu, LKML, linux-pci, iommu

On Thu, Jun 8, 2023 at 10:52 AM Ashok Raj <ashok.raj@intel.com> wrote:
>
> On Thu, Jun 08, 2023 at 10:10:54AM -0700, Alexander Duyck wrote:
> <...>
> > Our test case is just to perform DMA to/from the host on one device on
> > a switch, and what we are seeing is that when we hit an IOVA that
> > matches up with the physical address of the neighboring device's BAR0,
> > we see an AER followed by a hot reset.
>
> ACS is always confusing... Does your NIC have a device TLB (DTLB)?

No. It is using the IOMMU for all address translation. I am also
pushing back on the test being used. It is always possible they have
implemented something incorrectly and are overrunning a buffer going
into the reserved IOVA region, and the overlap is just a coincidence.

> If Request Redirect is set, and Egress is enabled, then all
> transactions should go upstream to the root port -> IOMMU before being
> served.
>
> In my PCIe 6.0 spec it's in 6.12.3, "ACS Peer-to-Peer Control
> Interactions".
>
> And maybe lspci would show how things are set up in the switch?

We were setting Request Redirect only, no Egress Control. I agree,
based on the config everything should just go upstream. However, if we
eliminate the switch or put things in passthrough mode the problem goes
away.

> I suspect that with the IOMMU there is no need to punch holes like we
> do for the MSI region. I vaguely recall we did that in very early IOMMU
> code, but our knowledge of ACS was weak (not that it has improved :-)).

The hole has to do mostly with avoiding any possibility of misrouting
things, or at least that was my understanding after reading it.

> Knowing how the switch and root ports are set up for forwarding may
> provide some clues. The easy option may be to forcibly add the range to
> the reserved regions and see whether you still get the ACS violation.
>
> Baolu might have some better ideas.

I'm working with the team that is hitting the issue to verify that now.
In theory the region should already be reserved, so I am working with
them to check that.

Thanks,

- Alex
* Re: Question about reserved_regions w/ Intel IOMMU
  From: Robin Murphy @ 2023-06-08 18:02 UTC
  To: Alexander Duyck, Ashok Raj; +Cc: Baolu Lu, LKML, linux-pci, iommu, Ashok Raj

On 2023-06-08 18:10, Alexander Duyck wrote:
> On Thu, Jun 8, 2023 at 8:40 AM Ashok Raj <ashok_raj@linux.intel.com> wrote:
>> <...>
>
> This assumes all parts are doing the Request Redirect "correctly". In
> our case there is a PCIe switch we are trying to debug, and we have a
> few working theories. One concern I have is that the switch may be
> throwing an ACS violation for us using an address that matches a
> neighboring device, instead of redirecting it to the upstream port. If
> we pull the switch and just run on the root complex the issue seems to
> be resolved, so I started poking into the code, which led me to the
> documentation pointing out what is supposed to be reserved based on
> the root complex and MSI regions.
>
> As a part of going down that rabbit hole I realized that
> reserved_regions seems to only list the MSI reservation. However, after
> digging a bit deeper it seems like there is code to reserve the memory
> behind the root complex in the IOVA, but it doesn't look like that is
> visible anywhere, and that is the piece I am currently trying to sort
> out. What I am working on is trying to figure out if the system that is
> failing is actually reserving that memory region in the IOVA, or if
> that is somehow not happening in our test setup.

How old is the kernel? Before 5.11, intel-iommu wasn't hooked up to
iommu-dma, so it didn't do quite the same thing: it only reserved
whatever specific PCI memory resources existed at boot, rather than the
whole window as iommu-dma does. Either way, ftrace on reserve_iova()
(or just whacking a print in there) should suffice to see what's
happened.

Robin.
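[Note: a minimal sketch of the "whack a print in there" option.
reserve_iova() lives in drivers/iommu/iova.c, and a single pr_info() at
its entry shows every range the DMA layer carves out of the IOVA space.
Alternatively, tracing the same function with ftrace (echo reserve_iova
> /sys/kernel/tracing/set_ftrace_filter, then echo function >
/sys/kernel/tracing/current_tracer) avoids a rebuild, though it does
not print the pfn arguments.]

struct iova *reserve_iova(struct iova_domain *iovad,
			  unsigned long pfn_lo, unsigned long pfn_hi)
{
	/* Instrumentation: log each reserved pfn range to dmesg. */
	pr_info("reserve_iova: pfn 0x%lx - 0x%lx\n", pfn_lo, pfn_hi);

	/* ... existing body unchanged ... */
}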
* Re: Question about reserved_regions w/ Intel IOMMU
  From: Alexander Duyck @ 2023-06-08 18:17 UTC
  To: Robin Murphy; +Cc: Ashok Raj, Baolu Lu, LKML, linux-pci, iommu, Ashok Raj

On Thu, Jun 8, 2023 at 11:02 AM Robin Murphy <robin.murphy@arm.com> wrote:
>
> On 2023-06-08 18:10, Alexander Duyck wrote:

<...>

> > As a part of going down that rabbit hole I realized that
> > reserved_regions seems to only list the MSI reservation. <...>
>
> How old is the kernel? Before 5.11, intel-iommu wasn't hooked up to
> iommu-dma, so it didn't do quite the same thing: it only reserved
> whatever specific PCI memory resources existed at boot, rather than the
> whole window as iommu-dma does. Either way, ftrace on reserve_iova()
> (or just whacking a print in there) should suffice to see what's
> happened.
>
> Robin.

We are working with a 5.12 kernel. I will do some digging. We may be
able to backport some fixes if needed.

Thanks,

- Alex
* Re: Question about reserved_regions w/ Intel IOMMU
  From: Robin Murphy @ 2023-06-08 15:28 UTC
  To: Baolu Lu, Alexander Duyck, LKML, linux-pci, iommu

On 2023-06-08 04:03, Baolu Lu wrote:
> On 6/8/23 7:03 AM, Alexander Duyck wrote:
> <...>
>
> Perhaps I can ask the opposite question: why should it show up in
> reserved_regions? Why should the iommu subsystem block any possible
> peer-to-peer DMA access? Isn't that a decision for the device driver?
>
> The iova_reserve_pci_windows() you've seen is for the kernel DMA
> interfaces, which are not related to peer-to-peer accesses.

Right, in general the IOMMU driver cannot be held responsible for
whatever might happen upstream of the IOMMU input. The DMA layer carves
PCI windows out of its IOVA space unconditionally because we know that
they *might* be problematic, and we don't have any specific constraints
on our IOVA layout, so it's no big deal to just sacrifice some space
for simplicity. We don't want to have to go digging any further into
bus-specific code to reason about whether the right ACS capabilities
are present and enabled everywhere to prevent direct P2P or not. Other
use-cases may have different requirements, though, so it's up to them
what they want to do.

It's conceptually pretty much the same as the case where the device (or
indeed a PCI host bridge or other interconnect segment in between) has
a constrained DMA address width: the device may not be able to access
all of the address space that the IOMMU provides, but the IOMMU itself
can't tell you that.

Thanks,
Robin.
* Re: Question about reserved_regions w/ Intel IOMMU
  From: Jason Gunthorpe @ 2023-06-13 15:54 UTC
  To: Robin Murphy; +Cc: Baolu Lu, Alexander Duyck, LKML, linux-pci, iommu

On Thu, Jun 08, 2023 at 04:28:24PM +0100, Robin Murphy wrote:
> > The iova_reserve_pci_windows() you've seen is for the kernel DMA
> > interfaces, which are not related to peer-to-peer accesses.
>
> Right, in general the IOMMU driver cannot be held responsible for
> whatever might happen upstream of the IOMMU input.

The driver yes, but..

> The DMA layer carves PCI windows out of its IOVA space
> unconditionally because we know that they *might* be problematic,
> and we don't have any specific constraints on our IOVA layout, so
> it's no big deal to just sacrifice some space for simplicity.

This is a problem for everything using UNMANAGED domains. If the iommu
API user picks an IOVA, it should be able to expect it to work. If the
interconnect fails to allow it to work, then this has to be
discoverable, otherwise UNMANAGED domains are not usable at all.

E.g. vfio and iommufd are also in trouble on these configurations. We
shouldn't expect every iommu user to fix this entirely on their own.

> We don't want to have to go digging any further into bus-specific
> code to reason about whether the right ACS capabilities are present
> and enabled everywhere to prevent direct P2P or not. Other use-cases
> may have different requirements, though, so it's up to them what
> they want to do.

I agree the dma-iommu stuff doesn't have to be as precise as other
places might want (but also wouldn't be harmed by being more precise).
But I can't think of any place that can just ignore this and still be
correct.

So, I think it makes sense that the iommu driver not be involved, but
IMHO the core code should have APIs to report IOVA that doesn't work,
and every user of UNMANAGED domains needs to check it. IOW, it should
probably come out of the existing reserved regions interface.

Jason
* RE: Question about reserved_regions w/ Intel IOMMU
  From: Tian, Kevin @ 2023-06-16  8:39 UTC
  To: Jason Gunthorpe, Robin Murphy, Alex Williamson; +Cc: Baolu Lu, Alexander Duyck, LKML, linux-pci, iommu

+Alex

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, June 13, 2023 11:54 PM
>
> <...>
>
> This is a problem for everything using UNMANAGED domains. If the iommu
> API user picks an IOVA, it should be able to expect it to work. If the
> interconnect fails to allow it to work, then this has to be
> discoverable, otherwise UNMANAGED domains are not usable at all.
>
> E.g. vfio and iommufd are also in trouble on these configurations.

If those PCI windows are problematic, e.g. due to ACS, the devices
belong to a single iommu group. If a vfio user opens all the devices in
that group then it can discover and reserve those windows in its IOVA
space.

The problem is that the user may not open all the devices, and then
currently there is no way for it to know the windows on those unopened
devices.

Curious why nobody complained about this gap before this thread...
* Re: Question about reserved_regions w/ Intel IOMMU
  From: Jason Gunthorpe @ 2023-06-16 12:20 UTC
  To: Tian, Kevin; +Cc: Robin Murphy, Alex Williamson, Baolu Lu, Alexander Duyck, LKML, linux-pci, iommu

On Fri, Jun 16, 2023 at 08:39:46AM +0000, Tian, Kevin wrote:
> <...>
>
> If those PCI windows are problematic, e.g. due to ACS, the devices
> belong to a single iommu group. If a vfio user opens all the devices in
> that group then it can discover and reserve those windows in its IOVA
> space.

How? We don't even exclude the single device's BAR if there is no ACS?

> The problem is that the user may not open all the devices, and then
> currently there is no way for it to know the windows on those unopened
> devices.
>
> Curious why nobody complained about this gap before this thread...

Probably because it only matters if you have a real PCIe switch in the
system, which is pretty rare.

Jason
* Re: Question about reserved_regions w/ Intel IOMMU
  From: Alexander Duyck @ 2023-06-16 15:27 UTC
  To: Jason Gunthorpe; +Cc: Tian, Kevin, Robin Murphy, Alex Williamson, Baolu Lu, LKML, linux-pci, iommu

On Fri, Jun 16, 2023 at 5:20 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Fri, Jun 16, 2023 at 08:39:46AM +0000, Tian, Kevin wrote:
> > <...>
> >
> > If those PCI windows are problematic, e.g. due to ACS, the devices
> > belong to a single iommu group. If a vfio user opens all the devices in
> > that group then it can discover and reserve those windows in its IOVA
> > space.
>
> How? We don't even exclude the single device's BAR if there is no ACS?

The issue here was defective ACS on a PCIe switch.

> > The problem is that the user may not open all the devices, and then
> > currently there is no way for it to know the windows on those unopened
> > devices.
> >
> > Curious why nobody complained about this gap before this thread...
>
> Probably because it only matters if you have a real PCIe switch in the
> system, which is pretty rare.

So just FYI, I am pretty sure we have a partitioned PCIe switch that
has FW issues. Specifically, it doesn't seem to be honoring the Request
Redirect bit, so requests that are supposed to go to the root
complex/IOMMU are getting redirected to an NVMe device on the same
physical PCIe switch. We are in the process of getting that sorted out
now, and are using the forcedac option in the meantime to keep the
IOMMU out of the 32-bit address space that was causing the issue.

The reason for my original request is more about the user experience of
trying to figure out what is reserved and what isn't. It seems like the
IOVA will have reservations that are not visible to the end user. When
I go looking through reserved_regions in sysfs it just lists the MSI
regions that are reserved, and maybe some regions such as the memory
for USB, while in reality we may be reserving IOVA regions in
iova_reserve_pci_windows() that will not be exposed without adding
probes or some printk debugging.
* Re: Question about reserved_regions w/ Intel IOMMU
  From: Robin Murphy @ 2023-06-16 16:34 UTC
  To: Alexander Duyck, Jason Gunthorpe; +Cc: Tian, Kevin, Alex Williamson, Baolu Lu, LKML, linux-pci, iommu

On 2023-06-16 16:27, Alexander Duyck wrote:
> On Fri, Jun 16, 2023 at 5:20 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> <...>
>
> So just FYI, I am pretty sure we have a partitioned PCIe switch that
> has FW issues. Specifically, it doesn't seem to be honoring the Request
> Redirect bit, so requests that are supposed to go to the root
> complex/IOMMU are getting redirected to an NVMe device on the same
> physical PCIe switch. We are in the process of getting that sorted out
> now, and are using the forcedac option in the meantime to keep the
> IOMMU out of the 32-bit address space that was causing the issue.
>
> The reason for my original request is more about the user experience of
> trying to figure out what is reserved and what isn't. It seems like the
> IOVA will have reservations that are not visible to the end user. When
> I go looking through reserved_regions in sysfs it just lists the MSI
> regions that are reserved, and maybe some regions such as the memory
> for USB, while in reality we may be reserving IOVA regions in
> iova_reserve_pci_windows() that will not be exposed without adding
> probes or some printk debugging.

lspci -vvv seems to have no problem telling me what PCI memory space is
assigned where, even as an unprivileged user, so surely it's available
to any VFIO user too? It is not necessarily useful for the IOMMU layer
to claim to userspace that an entire window is unusable if in fact
there's nothing in there that would be treated as a P2P address, in
which case it's actually fine. As I say, iommu-dma can make that
assumption for itself because iommu-dma doesn't need to maintain any
particular address space layout, but it could be overly restrictive for
a userspace process or VMM which does.

If the system has working ACS configured correctly, then this issue
should be moot; if it doesn't, then a VFIO user is going to get a whole
group of peer devices if they're getting anything at all, so it doesn't
seem entirely unreasonable to leave it up to them to check that all
those devices' resources play well with their expected memory map. And
the particular case of a system which claims to have working ACS but
doesn't, doesn't really seem to be something that can or should be
worked around from userspace; if that switch can't be fixed, it
probably wants an ACS quirk adding in the kernel.

Thanks,
Robin.
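[Note: a hypothetical sketch of what such a quirk might look like. The
vendor/device IDs below are made up, but the mechanism is the real
pci_dev_acs_enabled[] table in drivers/pci/quirks.c: pci_acs_enabled()
consults these quirks before trusting the device's advertised ACS
capability, so a quirk returning 0 tells the kernel the switch does NOT
actually provide the requested isolation, and the devices behind it
land in a single IOMMU group.]

/* Hypothetical: this switch advertises ACS but does not honor it. */
static int pci_quirk_broken_acs(struct pci_dev *dev, u16 acs_flags)
{
	return 0;	/* advertised ACS is not trustworthy */
}

static const struct pci_dev_acs_enabled {
	u16 vendor;
	u16 device;
	int (*acs_enabled)(struct pci_dev *dev, u16 acs_flags);
} pci_dev_acs_enabled[] = {
	/* ... existing entries ... */
	{ 0xabcd, 0x1234, pci_quirk_broken_acs },	/* hypothetical IDs */
	{ 0 }
};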
* Re: Question about reserved_regions w/ Intel IOMMU
  From: Jason Gunthorpe @ 2023-06-16 18:59 UTC
  To: Robin Murphy; +Cc: Alexander Duyck, Tian, Kevin, Alex Williamson, Baolu Lu, LKML, linux-pci, iommu

On Fri, Jun 16, 2023 at 05:34:53PM +0100, Robin Murphy wrote:
>
> If the system has working ACS configured correctly, then this issue
> should be moot;

Yes

> if it doesn't, then a VFIO user is going to get a whole group of
> peer devices if they're getting anything at all, so it doesn't seem
> entirely unreasonable to leave it up to them to check that all those
> devices' resources play well with their expected memory map.

I think the kernel should be helping here; "go figure it out from
lspci" is a very convoluted and obscure uAPI, and I don't see things
like DPDK actually doing that.

IMHO the uAPI expectation is that the kernel informs userspace what the
usable IOVA is. If bridge windows and lack of ACS are rendering address
space unusable, then VFIO/iommufd should report it as excluded as well.

If we are going to do that, then all UNMANAGED domain users should
follow the same logic.

We have probably avoided bug reports because of how rare it would be to
see a switch together with a scenario that uses UNMANAGED domains,
especially with ACS turned off. So it is a really narrow niche...
obscure enough that I'm not going to make patches :)

Jason
* Re: Question about reserved_regions w/ Intel IOMMU
  From: Robin Murphy @ 2023-06-19 10:20 UTC
  To: Jason Gunthorpe; +Cc: Alexander Duyck, Tian, Kevin, Alex Williamson, Baolu Lu, LKML, linux-pci, iommu

On 2023-06-16 19:59, Jason Gunthorpe wrote:
> On Fri, Jun 16, 2023 at 05:34:53PM +0100, Robin Murphy wrote:
> <...>
>
> IMHO the uAPI expectation is that the kernel informs userspace what the
> usable IOVA is. If bridge windows and lack of ACS are rendering address
> space unusable, then VFIO/iommufd should report it as excluded as well.
>
> If we are going to do that, then all UNMANAGED domain users should
> follow the same logic.
>
> We have probably avoided bug reports because of how rare it would be to
> see a switch together with a scenario that uses UNMANAGED domains,
> especially with ACS turned off. So it is a really narrow niche...
> obscure enough that I'm not going to make patches :)

The main thing is that we've already been round this once before: we
tried it 6 years ago and then reverted it a year later for causing more
problems than it solved:

https://lkml.org/lkml/2018/3/2/760

Thanks,
Robin.
* Re: Question about reserved_regions w/ Intel IOMMU
  From: Jason Gunthorpe @ 2023-06-19 14:02 UTC
  To: Robin Murphy; +Cc: Alexander Duyck, Tian, Kevin, Alex Williamson, Baolu Lu, LKML, linux-pci, iommu

On Mon, Jun 19, 2023 at 11:20:58AM +0100, Robin Murphy wrote:
> On 2023-06-16 19:59, Jason Gunthorpe wrote:
> <...>
>
> The main thing is that we've already been round this once before: we
> tried it 6 years ago and then reverted it a year later for causing more
> problems than it solved:
>
> https://lkml.org/lkml/2018/3/2/760

As I said earlier in this thread, if we do it for VFIO then the
calculation must be precise and consider bus details like ACS, etc.;
e.g. VFIO on an ACS system should not report any new regions.

It looks like that thread confirms we can't create reserved regions
which are wrong :)

I think Alex is saying the same things I'm saying in that thread too:

https://lore.kernel.org/all/20180226161310.061ce3a8@w520.home/

(b) is what the kernel should help prevent.

And it is clear there are today scenarios where a VFIO user will get
data loss because the reported valid IOVA from the kernel is incorrect.
Fixing this is hard, much harder than what commit 273df9635385
("iommu/dma: Make PCI window reservation generic") did.

Thanks,
Jason
* Re: Question about reserved_regions w/ Intel IOMMU 2023-06-19 14:02 ` Jason Gunthorpe @ 2023-06-20 14:57 ` Alexander Duyck 2023-06-20 16:55 ` Jason Gunthorpe 0 siblings, 1 reply; 25+ messages in thread From: Alexander Duyck @ 2023-06-20 14:57 UTC (permalink / raw) To: Jason Gunthorpe Cc: Robin Murphy, Tian, Kevin, Alex Williamson, Baolu Lu, LKML, linux-pci, iommu On Mon, Jun 19, 2023 at 7:02 AM Jason Gunthorpe <jgg@nvidia.com> wrote: > > On Mon, Jun 19, 2023 at 11:20:58AM +0100, Robin Murphy wrote: > > On 2023-06-16 19:59, Jason Gunthorpe wrote: > > > On Fri, Jun 16, 2023 at 05:34:53PM +0100, Robin Murphy wrote: > > > > > > > > If the system has working ACS configured correctly, then this issue should > > > > be moot; > > > > > > Yes > > > > > > > if it doesn't, then a VFIO user is going to get a whole group of > > > > peer devices if they're getting anything at all, so it doesn't seem entirely > > > > unreasonable to leave it up to them to check that all those devices' > > > > resources play well with their expected memory map. > > > > > > I think the kernel should be helping here.. 'go figure it out from > > > lspci' is a very convoluted and obscure uAPI, and I don't see things > > > like DPDK actually doing that. > > > > > > IMHO the uAPI expectation is that the kernel informs userspace what > > > the usable IOVA is; if bridge windows and lack of ACS are rendering > > > address space unusable then VFIO/iommufd should return it as excluded > > > as well. > > > > > > If we are going to do that then all UNMANAGED domain users should > > > follow the same logic. > > > > > > We probably have avoided bug reports because of how rare it would be > > > to see a switch and an UNMANAGED domain in use together - > > > especially with ACS turned off. > > > > > > So it is a really narrow niche.. Obscure enough that I'm not going to make > > > patches :) > > > > The main thing is that we've already been round this once before; we tried > > it 6 years ago and then reverted it a year later for causing more problems > > than it solved: > > As I said earlier in this thread, if we do it for VFIO then the > calculation must be precise and consider bus details like > ACS etc., e.g. VFIO on an ACS system should not report any new regions. > > It looks like that thread confirms we can't create reserved regions > which are wrong :) > > I think Alex is saying the same things I'm saying in that thread too: > > https://lore.kernel.org/all/20180226161310.061ce3a8@w520.home/ > > (b) is what the kernel should help prevent. > > And it is clear there are scenarios today where a VFIO user will get > data loss because the valid IOVA reported by the kernel is > incorrect. Fixing this is hard, much harder than what commit > 273df9635385 ("iommu/dma: Make PCI window reservation generic") does. I think this may have gone off down a rathole as my original question wasn't anything about adding extra reserved regions. It was about exposing what the IOVA is already reserving so it could be user visible. The issue was that the reservation(s) didn't appear in the reserved_regions sysfs file, and it required adding probes or printk debugging in order to figure out what is reserved and what is not. Specifically what I was trying to point out is that there are regions reserved in iova_reserve_pci_windows() that are not user/admin visible. The function reserve_iova doesn't do anything to track the reservation so that it can be recalled later for display.
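For concreteness, the kind of instrumentation this takes is roughly the following (a hypothetical sketch of such a debug helper, not the actual patch):

/* Hypothetical debug aid: print the host bridge windows that
 * iova_reserve_pci_windows() ends up handing to reserve_iova(). */
#include <linux/pci.h>
#include <linux/resource_ext.h>

static void debug_dump_pci_windows(struct pci_dev *pdev)
{
	struct pci_host_bridge *bridge = pci_find_host_bridge(pdev->bus);
	struct resource_entry *window;

	resource_list_for_each_entry(window, &bridge->windows) {
		if (resource_type(window->res) != IORESOURCE_MEM)
			continue;
		dev_info(&pdev->dev, "window reserved: %pR\n", window->res);
	}
}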
That lack of visibility made things harder to debug: I wasn't sure whether the addresses I was seeing were valid for the IOMMU or not, since I didn't know if they were supposed to be reserved, and the documentation I had found implied they were. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Question about reserved_regions w/ Intel IOMMU 2023-06-20 14:57 ` Alexander Duyck @ 2023-06-20 16:55 ` Jason Gunthorpe 2023-06-20 17:47 ` Alexander Duyck 0 siblings, 1 reply; 25+ messages in thread From: Jason Gunthorpe @ 2023-06-20 16:55 UTC (permalink / raw) To: Alexander Duyck Cc: Robin Murphy, Tian, Kevin, Alex Williamson, Baolu Lu, LKML, linux-pci, iommu On Tue, Jun 20, 2023 at 07:57:57AM -0700, Alexander Duyck wrote: > I think this may have gone off down a rathole as my original question > wasn't anything about adding extra reserved regions. It was about > exposing what the IOVA is already reserving so it could be user > visible. Your question points out that dma-iommu.c uses a different set of reserved regions than everything else, and its set is closer to functionally correct. IMHO the resolution to what you are talking about is not to add more debugging to dma-iommu but to make the set of reserved regions consistently correct for everyone, which will make them viewable in sysfs. Jason ^ permalink raw reply [flat|nested] 25+ messages in thread
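To make the suggestion concrete: the sysfs reserved_regions file is populated from each device's reserved-region list, which IOMMU drivers extend through their get_resv_regions() hook, so any window reported there becomes user-visible automatically. A hypothetical sketch of what reporting a bridge window that way could look like (the address range is made up; this is not a proposed patch):

#include <linux/iommu.h>

/* Hypothetical: report an upstream bridge window as a reserved region
 * so it shows up in iommu_group/reserved_regions alongside the MSI
 * entry.  The 0xe0000000/256MB window below is purely illustrative. */
static void example_get_resv_regions(struct device *dev,
				     struct list_head *head)
{
	struct iommu_resv_region *region;

	region = iommu_alloc_resv_region(0xe0000000, 0x10000000, 0,
					 IOMMU_RESV_RESERVED, GFP_KERNEL);
	if (region)
		list_add_tail(&region->list, head);
}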
* Re: Question about reserved_regions w/ Intel IOMMU 2023-06-20 16:55 ` Jason Gunthorpe @ 2023-06-20 17:47 ` Alexander Duyck 2023-06-21 11:30 ` Robin Murphy 0 siblings, 1 reply; 25+ messages in thread From: Alexander Duyck @ 2023-06-20 17:47 UTC (permalink / raw) To: Jason Gunthorpe Cc: Robin Murphy, Tian, Kevin, Alex Williamson, Baolu Lu, LKML, linux-pci, iommu On Tue, Jun 20, 2023 at 9:55 AM Jason Gunthorpe <jgg@nvidia.com> wrote: > > On Tue, Jun 20, 2023 at 07:57:57AM -0700, Alexander Duyck wrote: > > > I think this may have gone off down a rathole as my original question > > wasn't anything about adding extra reserved regions. It was about > > exposing what the IOVA is already reserving so it could be user > > visible. > > Your question points out that dma-iommu.c uses a different set of > reserved regions than everything else, and its set is closer to > functionally correct. > > IMHO the resolution to what you are talking about is not to add more > debugging to dma-iommu but to make the set of reserved regions > consistently correct for everyone, which will make them viewable in > sysfs. Okay, that makes sense to me, and I agree. A consistent set of reserved regions would make things easier to understand. If nothing else, my request would be to expose the IOVA reserved regions; then most likely the other ones could be deprecated, since they all seem to be consolidated in the IOVA layer anyway. Thanks, - Alex ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Question about reserved_regions w/ Intel IOMMU 2023-06-20 17:47 ` Alexander Duyck @ 2023-06-21 11:30 ` Robin Murphy 0 siblings, 0 replies; 25+ messages in thread From: Robin Murphy @ 2023-06-21 11:30 UTC (permalink / raw) To: Alexander Duyck, Jason Gunthorpe Cc: Tian, Kevin, Alex Williamson, Baolu Lu, LKML, linux-pci, iommu On 2023-06-20 18:47, Alexander Duyck wrote: > On Tue, Jun 20, 2023 at 9:55 AM Jason Gunthorpe <jgg@nvidia.com> wrote: >> >> On Tue, Jun 20, 2023 at 07:57:57AM -0700, Alexander Duyck wrote: >> >>> I think this may have gone off down a rathole as my original question >>> wasn't anything about adding extra reserved regions. It was about >>> exposing what the IOVA is already reserving so it could be user >>> visible. >> >> Your question points out that dma-iommu.c uses a different set of >> reserved regions than everything else, and its set is closer to >> functionally correct. >> >> IMHO the resolution to what you are talking about is not to add more >> debugging to dma-iommu but to make the set of reserved regions >> consistently correct for everyone, which will make them viewable in >> sysfs. > > Okay, that makes sense to me, and I agree. A consistent set > of reserved regions would make things easier to understand. It would also be wrong, unfortunately, because it's conflating multiple different things (there are overlapping notions of "reserve" at play here...). IOMMU API reserved regions are specific things that the IOMMU driver knows are special and all IOMMU domain users definitely need to be aware of. iommu-dma is merely one of those users; it is another layer on top of the API which manages its own IOVA space as it sees fit, just like VFIO or other IOMMU-aware drivers. It honours those reserved regions (via iommu_group_create_direct_mappings()), but it also carves out plenty of IOVA space which is probably perfectly usable - some of which is related to possible upstream bus constraints, to save the hassle of checking; some purely for its own convenience, like the page at IOVA 0 - but it still *doesn't* carve out more IOVA regions which are also unusable overall due to other upstream bus or endpoint constraints, since those are handled dynamically in its allocator instead (dma_mask, bus_dma_limit, etc.). > If > nothing else, my request would be to expose the IOVA reserved regions; > then most likely the other ones could be deprecated, since they > all seem to be consolidated in the IOVA layer anyway. FWIW there's no upstream provision for debugging iommu-dma from userspace since it's not something that anyone other than me has ever had any apparent need to do; you can get an idea of how long it's been since even I thought about it from when I seem to have given up rebasing my local patches for it [1] :) Thanks, Robin. [1] https://gitlab.arm.com/linux-arm/linux-rm/-/commits/iommu/misc/ ^ permalink raw reply [flat|nested] 25+ messages in thread
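To illustrate the distinction Robin draws, iommu-dma's own carve-outs look roughly like this when condensed from drivers/iommu/dma-iommu.c (simplified and renamed for illustration, not the verbatim upstream code):

#include <linux/iova.h>
#include <linux/minmax.h>
#include <linux/pci.h>
#include <linux/resource_ext.h>

static void example_iommu_dma_carveouts(struct pci_dev *pdev,
					struct iova_domain *iovad,
					dma_addr_t base, unsigned long order)
{
	struct pci_host_bridge *bridge = pci_find_host_bridge(pdev->bus);
	struct resource_entry *window;
	unsigned long lo, hi;

	/* convenience carve-out: never hand out the page at IOVA 0 */
	init_iova_domain(iovad, 1UL << order,
			 max_t(unsigned long, 1, base >> order));

	/* bus-constraint carve-out: skip the host bridge windows up
	 * front rather than checking them on every allocation */
	resource_list_for_each_entry(window, &bridge->windows) {
		if (resource_type(window->res) != IORESOURCE_MEM)
			continue;
		lo = iova_pfn(iovad, window->res->start - window->offset);
		hi = iova_pfn(iovad, window->res->end - window->offset);
		reserve_iova(iovad, lo, hi);
	}
}

None of these reservations flow through the IOMMU API's reserved-region lists, which is why they never show up in sysfs.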
* Re: Question about reserved_regions w/ Intel IOMMU 2023-06-16 15:27 ` Alexander Duyck 2023-06-16 16:34 ` Robin Murphy @ 2023-06-16 18:48 ` Jason Gunthorpe 1 sibling, 0 replies; 25+ messages in thread From: Jason Gunthorpe @ 2023-06-16 18:48 UTC (permalink / raw) To: Alexander Duyck Cc: Tian, Kevin, Robin Murphy, Alex Williamson, Baolu Lu, LKML, linux-pci, iommu On Fri, Jun 16, 2023 at 08:27:21AM -0700, Alexander Duyck wrote: > > > The problem is that the user may not open all the devices, and then > > > there is currently no way for it to know the windows on those > > > unopened devices. > > > > > > Curious why nobody complained about this gap before this thread... > > > > Probably because it only matters if you have a real PCIe switch in the > > system, which is pretty rare. > > So just FYI I am pretty sure we have a partitioned PCIe switch that > has FW issues. Yeah, that is pretty common :( But I think you've touched on a gap in the API. Jason ^ permalink raw reply [flat|nested] 25+ messages in thread
* RE: Question about reserved_regions w/ Intel IOMMU 2023-06-16 12:20 ` Jason Gunthorpe 2023-06-16 15:27 ` Alexander Duyck @ 2023-06-21 8:16 ` Tian, Kevin 1 sibling, 0 replies; 25+ messages in thread From: Tian, Kevin @ 2023-06-21 8:16 UTC (permalink / raw) To: Jason Gunthorpe Cc: Robin Murphy, Alex Williamson, Baolu Lu, Alexander Duyck, LKML, linux-pci, iommu > From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Friday, June 16, 2023 8:21 PM > > On Fri, Jun 16, 2023 at 08:39:46AM +0000, Tian, Kevin wrote: > > +Alex > > > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > Sent: Tuesday, June 13, 2023 11:54 PM > > > > > > On Thu, Jun 08, 2023 at 04:28:24PM +0100, Robin Murphy wrote: > > > > > > > > The iova_reserve_pci_windows() you've seen is for kernel DMA interfaces > > > > > which is not related to peer-to-peer accesses. > > > > > > > > Right, in general the IOMMU driver cannot be held responsible for > > > whatever > > > > might happen upstream of the IOMMU input. > > > > > > The driver yes, but.. > > > > > > > The DMA layer carves PCI windows out of its IOVA space > > > > unconditionally because we know that they *might* be problematic, > > > > and we don't have any specific constraints on our IOVA layout so > > > > it's no big deal to just sacrifice some space for simplicity. > > > > > > This is a problem for everything using UNMANAGED domains. If the iommu > > > API user picks an IOVA it should be able to expect it to work. If the > > > interconnect fails to allow it to work then this has to be discovered, > > > otherwise UNMANAGED domains are not usable at all. > > > > > > E.g. vfio and iommufd are also in trouble on these configurations. > > > > > > > If those PCI windows are problematic, e.g. due to ACS, they belong to > > a single iommu group. If a vfio user opens all the devices in that group > > then it can discover and reserve those windows in its IOVA space. > > How? We don't even exclude the single device's BAR if there is no ACS? I thought the initial vBAR value in vfio was copied from the physical BAR, so the user could check that value and skip those ranges. But that's informal, and it looks like today Qemu doesn't compose the GPA layout with any information from there. > > > The problem is that the user may not open all the devices, and then > > there is currently no way for it to know the windows on those > > unopened devices. > > > > Curious why nobody complained about this gap before this thread... > > Probably because it only matters if you have a real PCIe switch in the > system, which is pretty rare. > Multi-device groups might not be rare, given vfio has spent so much effort managing them. More likely, the virtual BIOS reserves a big enough hole in [3GB, 4GB] which happens to cover the (non-64-bit) physical BARs in the group, avoiding conflicts, e.g.:

c0000000-febfffff : PCI Bus 0000:00
  fd000000-fdffffff : 0000:00:01.0
    fd000000-fdffffff : bochs-drm
  fe000000-fe01ffff : 0000:00:02.0
  fe020000-fe02ffff : 0000:00:02.0
  fe030000-fe033fff : 0000:00:03.0
    fe030000-fe033fff : virtio-pci-modern
  feb80000-febbffff : 0000:00:03.0
  febd0000-febd0fff : 0000:00:01.0
    febd0000-febd0fff : bochs-drm
  febd1000-febd1fff : 0000:00:03.0
  febd2000-febd2fff : 0000:00:1f.2
    febd2000-febd2fff : ahci
fec00000-fec003ff : IOAPIC 0
fed00000-fed003ff : HPET 0
  fed00000-fed003ff : PNP0103:00
fed1c000-fed1ffff : Reserved
  fed1f410-fed1f414 : iTCO_wdt.0.auto
fed90000-fed90fff : dmar0
fee00000-fee00fff : Local APIC
feffc000-feffffff : Reserved
fffc0000-ffffffff : Reserved

^ permalink raw reply [flat|nested] 25+ messages in thread
end of thread. Thread overview: 25+ messages -- 2023-06-07 22:40 Question about reserved_regions w/ Intel IOMMU Alexander Duyck 2023-06-07 23:03 ` Alexander Duyck 2023-06-08 3:03 ` Baolu Lu 2023-06-08 14:33 ` Alexander Duyck 2023-06-08 15:38 ` Ashok Raj 2023-06-08 17:10 ` Alexander Duyck 2023-06-08 17:52 ` Ashok Raj 2023-06-08 18:15 ` Alexander Duyck 2023-06-08 18:02 ` Robin Murphy 2023-06-08 18:17 ` Alexander Duyck 2023-06-08 15:28 ` Robin Murphy 2023-06-13 15:54 ` Jason Gunthorpe 2023-06-16 8:39 ` Tian, Kevin 2023-06-16 12:20 ` Jason Gunthorpe 2023-06-16 15:27 ` Alexander Duyck 2023-06-16 16:34 ` Robin Murphy 2023-06-16 18:59 ` Jason Gunthorpe 2023-06-19 10:20 ` Robin Murphy 2023-06-19 14:02 ` Jason Gunthorpe 2023-06-20 14:57 ` Alexander Duyck 2023-06-20 16:55 ` Jason Gunthorpe 2023-06-20 17:47 ` Alexander Duyck 2023-06-21 11:30 ` Robin Murphy 2023-06-16 18:48 ` Jason Gunthorpe 2023-06-21 8:16 ` Tian, Kevin