* Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space [not found] ` <20190329104904.450fefef@x1.home> @ 2019-04-01 13:41 ` Singh, Brijesh 2019-07-23 17:26 ` Alex Williamson 0 siblings, 1 reply; 13+ messages in thread From: Singh, Brijesh @ 2019-04-01 13:41 UTC (permalink / raw) To: Alex Williamson, qemu-devel Cc: Singh, Brijesh, mst, marcel.apfelbaum, peterx, eric.auger, Suthikulpanit, Suravee Thanks for adding Alex. Adding Suravee. On 3/29/19 11:49 AM, Alex Williamson wrote: > [Cc +Brijesh] > > Hi Brijesh, will the change below require the IVRS to be updated to > include aliases for all BDF ranges behind a conventional bridge? I > think the Linux code handles this regardless of the firmware provided > aliases, but is it required per spec for the ACPI tables to include > bridge aliases? Thanks, > We do need to includes aliases in ACPI table. We need to populate the IVHD type 0x43 and 0x4 for alias range start and end. I believe host IVRS would contain similar information. Suravee, please correct me if I am missing something? thanks > Alex > > On Tue, 26 Mar 2019 16:55:19 -0600 > Alex Williamson <alex.williamson@redhat.com> wrote: > >> Conventional PCI buses pre-date requester IDs. An IOMMU cannot >> distinguish by devfn & bus between devices in a conventional PCI >> topology and therefore we cannot assign them separate AddressSpaces. >> By taking this requester ID aliasing into account, QEMU better matches >> the bare metal behavior and restrictions, and enables shared >> AddressSpace configurations that are otherwise not possible with >> guest IOMMU support. >> >> For the latter case, given any example where an IOMMU group on the >> host includes multiple devices: >> >> $ ls /sys/kernel/iommu_groups/1/devices/ >> 0000:00:01.0 0000:01:00.0 0000:01:00.1 >> >> If we incorporate a vIOMMU into the VM configuration, we're restricted >> that we can only assign one of the endpoints to the guest because a >> second endpoint will attempt to use a different AddressSpace. VFIO >> only supports IOMMU group level granularity at the container level, >> preventing this second endpoint from being assigned: >> >> qemu-system-x86_64 -machine q35... \ >> -device intel-iommu,intremap=on \ >> -device pcie-root-port,addr=1e.0,id=pcie.1 \ >> -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \ >> -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1 >> >> qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \ >> 0000:01:00.1: group 1 used in multiple address spaces >> >> However, when QEMU incorporates proper aliasing, we can make use of a >> PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that >> provides the downstream devices with the same AddressSpace, ex: >> >> qemu-system-x86_64 -machine q35... \ >> -device intel-iommu,intremap=on \ >> -device pcie-pci-bridge,addr=1e.0,id=pci.1 \ >> -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \ >> -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1 >> >> While the utility of this hack may be limited, this AddressSpace >> aliasing is the correct behavior for QEMU to emulate bare metal. >> >> Signed-off-by: Alex Williamson <alex.williamson@redhat.com> >> --- >> hw/pci/pci.c | 33 +++++++++++++++++++++++++++++++-- >> 1 file changed, 31 insertions(+), 2 deletions(-) >> >> diff --git a/hw/pci/pci.c b/hw/pci/pci.c >> index 35451c1e9987..38467e676f1f 100644 >> --- a/hw/pci/pci.c >> +++ b/hw/pci/pci.c >> @@ -2594,12 +2594,41 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev) >> { >> PCIBus *bus = pci_get_bus(dev); >> PCIBus *iommu_bus = bus; >> + uint8_t devfn = dev->devfn; >> >> while(iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) { >> - iommu_bus = pci_get_bus(iommu_bus->parent_dev); >> + PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev); >> + >> + /* >> + * Determine which requester ID alias should be used for the device >> + * based on the PCI topology. There are no requester IDs on convetional >> + * PCI buses, therefore we push the alias up to the parent on each non- >> + * express bus. Which alias we use depends on whether this is a legacy >> + * PCI bridge or PCIe-to-PCI/X bridge as in chapter 2.3 of the PCIe-to- >> + * PCI bridge spec. Note that we cannot use pci_requester_id() here >> + * because the resulting BDF depends on the secondary bridge register >> + * programming. We also cannot lookup the PCIBus from the bus number >> + * at this point for the iommu_fn. Also, requester_id_cache is the >> + * alias to the root bus, which is usually, but not necessarily always >> + * where we'll find our iommu_fn. >> + */ >> + if (!pci_bus_is_express(iommu_bus)) { >> + PCIDevice *parent = iommu_bus->parent_dev; >> + >> + if (pci_is_express(parent) && >> + pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) { >> + devfn = PCI_DEVFN(0, 0); >> + bus = iommu_bus; >> + } else { >> + devfn = parent->devfn; >> + bus = parent_bus; >> + } >> + } >> + >> + iommu_bus = parent_bus; >> } >> if (iommu_bus && iommu_bus->iommu_fn) { >> - return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, dev->devfn); >> + return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn); >> } >> return &address_space_memory; >> } >> > ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space 2019-04-01 13:41 ` [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space Singh, Brijesh @ 2019-07-23 17:26 ` Alex Williamson 2019-07-23 18:45 ` Michael S. Tsirkin 2019-07-24 7:14 ` Peter Xu 0 siblings, 2 replies; 13+ messages in thread From: Alex Williamson @ 2019-07-23 17:26 UTC (permalink / raw) To: Singh, Brijesh Cc: mst, qemu-devel, peterx, eric.auger, Suthikulpanit, Suravee On Mon, 1 Apr 2019 13:41:39 +0000 "Singh, Brijesh" <brijesh.singh@amd.com> wrote: > Thanks for adding Alex. > > Adding Suravee. > > > On 3/29/19 11:49 AM, Alex Williamson wrote: > > [Cc +Brijesh] > > > > Hi Brijesh, will the change below require the IVRS to be updated to > > include aliases for all BDF ranges behind a conventional bridge? I > > think the Linux code handles this regardless of the firmware provided > > aliases, but is it required per spec for the ACPI tables to include > > bridge aliases? Thanks, > > > > We do need to includes aliases in ACPI table. We need to populate the > IVHD type 0x43 and 0x4 for alias range start and end. I believe host > IVRS would contain similar information. > > Suravee, please correct me if I am missing something? I finally found some time to investigate this a little further, yes the types mentioned are correct for defining start and end of an alias range. The challenge here is that these entries require a DeviceID, which is defined as a BDF, AIUI. The IVRS is created in QEMU, but bus numbers are defined by the guest firmware, and potentially redefined by the guest OS. This makes it non-trivial to insert a few IVHDs into the IVRS to describe alias ranges. I'm wondering if the solution here is to define a new linker-loader command that would instruct the guest to write a bus number byte to a given offset for a described device. These commands would be inserted before the checksum command, such that these bus number updates are calculated as part of the checksum. I'm imagining the command format would need to be able to distinguish between the actual bus number of a described device, the secondary bus number of the device, and the subordinate bus number of the device. For describing the device, I'm envisioning stealing from the DMAR definition, which already includes a bus number invariant mechanism to describe a device, starting with a segment and root bus, follow a chain of devfns to get to the target device. Therefore the guest firmware would follow the path to the described device, pick the desired bus number, and write it to the indicated table offset. Does this seem like a reasonable approach? Better ideas? I'm not thrilled with the increased scope demanded by IVRS support, but so long as we have an AMD IOMMU model, I don't see how to avoid it. Thanks, Alex > > On Tue, 26 Mar 2019 16:55:19 -0600 > > Alex Williamson <alex.williamson@redhat.com> wrote: > > > >> Conventional PCI buses pre-date requester IDs. An IOMMU cannot > >> distinguish by devfn & bus between devices in a conventional PCI > >> topology and therefore we cannot assign them separate AddressSpaces. > >> By taking this requester ID aliasing into account, QEMU better matches > >> the bare metal behavior and restrictions, and enables shared > >> AddressSpace configurations that are otherwise not possible with > >> guest IOMMU support. > >> > >> For the latter case, given any example where an IOMMU group on the > >> host includes multiple devices: > >> > >> $ ls /sys/kernel/iommu_groups/1/devices/ > >> 0000:00:01.0 0000:01:00.0 0000:01:00.1 > >> > >> If we incorporate a vIOMMU into the VM configuration, we're restricted > >> that we can only assign one of the endpoints to the guest because a > >> second endpoint will attempt to use a different AddressSpace. VFIO > >> only supports IOMMU group level granularity at the container level, > >> preventing this second endpoint from being assigned: > >> > >> qemu-system-x86_64 -machine q35... \ > >> -device intel-iommu,intremap=on \ > >> -device pcie-root-port,addr=1e.0,id=pcie.1 \ > >> -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \ > >> -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1 > >> > >> qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \ > >> 0000:01:00.1: group 1 used in multiple address spaces > >> > >> However, when QEMU incorporates proper aliasing, we can make use of a > >> PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that > >> provides the downstream devices with the same AddressSpace, ex: > >> > >> qemu-system-x86_64 -machine q35... \ > >> -device intel-iommu,intremap=on \ > >> -device pcie-pci-bridge,addr=1e.0,id=pci.1 \ > >> -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \ > >> -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1 > >> > >> While the utility of this hack may be limited, this AddressSpace > >> aliasing is the correct behavior for QEMU to emulate bare metal. > >> > >> Signed-off-by: Alex Williamson <alex.williamson@redhat.com> > >> --- > >> hw/pci/pci.c | 33 +++++++++++++++++++++++++++++++-- > >> 1 file changed, 31 insertions(+), 2 deletions(-) > >> > >> diff --git a/hw/pci/pci.c b/hw/pci/pci.c > >> index 35451c1e9987..38467e676f1f 100644 > >> --- a/hw/pci/pci.c > >> +++ b/hw/pci/pci.c > >> @@ -2594,12 +2594,41 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev) > >> { > >> PCIBus *bus = pci_get_bus(dev); > >> PCIBus *iommu_bus = bus; > >> + uint8_t devfn = dev->devfn; > >> > >> while(iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) { > >> - iommu_bus = pci_get_bus(iommu_bus->parent_dev); > >> + PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev); > >> + > >> + /* > >> + * Determine which requester ID alias should be used for the device > >> + * based on the PCI topology. There are no requester IDs on convetional > >> + * PCI buses, therefore we push the alias up to the parent on each non- > >> + * express bus. Which alias we use depends on whether this is a legacy > >> + * PCI bridge or PCIe-to-PCI/X bridge as in chapter 2.3 of the PCIe-to- > >> + * PCI bridge spec. Note that we cannot use pci_requester_id() here > >> + * because the resulting BDF depends on the secondary bridge register > >> + * programming. We also cannot lookup the PCIBus from the bus number > >> + * at this point for the iommu_fn. Also, requester_id_cache is the > >> + * alias to the root bus, which is usually, but not necessarily always > >> + * where we'll find our iommu_fn. > >> + */ > >> + if (!pci_bus_is_express(iommu_bus)) { > >> + PCIDevice *parent = iommu_bus->parent_dev; > >> + > >> + if (pci_is_express(parent) && > >> + pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) { > >> + devfn = PCI_DEVFN(0, 0); > >> + bus = iommu_bus; > >> + } else { > >> + devfn = parent->devfn; > >> + bus = parent_bus; > >> + } > >> + } > >> + > >> + iommu_bus = parent_bus; > >> } > >> if (iommu_bus && iommu_bus->iommu_fn) { > >> - return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, dev->devfn); > >> + return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn); > >> } > >> return &address_space_memory; > >> } > >> > > ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space 2019-07-23 17:26 ` Alex Williamson @ 2019-07-23 18:45 ` Michael S. Tsirkin 2019-07-24 7:14 ` Peter Xu 1 sibling, 0 replies; 13+ messages in thread From: Michael S. Tsirkin @ 2019-07-23 18:45 UTC (permalink / raw) To: Alex Williamson Cc: Singh, Brijesh, Suthikulpanit, Suravee, qemu-devel, peterx, eric.auger On Tue, Jul 23, 2019 at 11:26:18AM -0600, Alex Williamson wrote: > On Mon, 1 Apr 2019 13:41:39 +0000 > "Singh, Brijesh" <brijesh.singh@amd.com> wrote: > > > Thanks for adding Alex. > > > > Adding Suravee. > > > > > > On 3/29/19 11:49 AM, Alex Williamson wrote: > > > [Cc +Brijesh] > > > > > > Hi Brijesh, will the change below require the IVRS to be updated to > > > include aliases for all BDF ranges behind a conventional bridge? I > > > think the Linux code handles this regardless of the firmware provided > > > aliases, but is it required per spec for the ACPI tables to include > > > bridge aliases? Thanks, > > > > > > > We do need to includes aliases in ACPI table. We need to populate the > > IVHD type 0x43 and 0x4 for alias range start and end. I believe host > > IVRS would contain similar information. > > > > Suravee, please correct me if I am missing something? > > I finally found some time to investigate this a little further, yes the > types mentioned are correct for defining start and end of an alias > range. The challenge here is that these entries require a DeviceID, > which is defined as a BDF, AIUI. The IVRS is created in QEMU, but bus > numbers are defined by the guest firmware, and potentially redefined by > the guest OS. This makes it non-trivial to insert a few IVHDs into the > IVRS to describe alias ranges. I'm wondering if the solution here is > to define a new linker-loader command that would instruct the guest to > write a bus number byte to a given offset for a described device. > These commands would be inserted before the checksum command, such that > these bus number updates are calculated as part of the checksum. > > I'm imagining the command format would need to be able to distinguish > between the actual bus number of a described device, the secondary bus > number of the device, and the subordinate bus number of the device. > For describing the device, I'm envisioning stealing from the DMAR > definition, which already includes a bus number invariant mechanism to > describe a device, starting with a segment and root bus, follow a chain > of devfns to get to the target device. Therefore the guest firmware > would follow the path to the described device, pick the desired bus > number, and write it to the indicated table offset. > > Does this seem like a reasonable approach? Better ideas? I'm not > thrilled with the increased scope demanded by IVRS support, but so long > as we have an AMD IOMMU model, I don't see how to avoid it. Thanks, > > Alex We generally just got the bus #s as programmed into virtual bridges/devices and had qemu program them into acpi tables. If guest os changes bus#s it's responsible for updating acpi tables :) > > > On Tue, 26 Mar 2019 16:55:19 -0600 > > > Alex Williamson <alex.williamson@redhat.com> wrote: > > > > > >> Conventional PCI buses pre-date requester IDs. An IOMMU cannot > > >> distinguish by devfn & bus between devices in a conventional PCI > > >> topology and therefore we cannot assign them separate AddressSpaces. > > >> By taking this requester ID aliasing into account, QEMU better matches > > >> the bare metal behavior and restrictions, and enables shared > > >> AddressSpace configurations that are otherwise not possible with > > >> guest IOMMU support. > > >> > > >> For the latter case, given any example where an IOMMU group on the > > >> host includes multiple devices: > > >> > > >> $ ls /sys/kernel/iommu_groups/1/devices/ > > >> 0000:00:01.0 0000:01:00.0 0000:01:00.1 > > >> > > >> If we incorporate a vIOMMU into the VM configuration, we're restricted > > >> that we can only assign one of the endpoints to the guest because a > > >> second endpoint will attempt to use a different AddressSpace. VFIO > > >> only supports IOMMU group level granularity at the container level, > > >> preventing this second endpoint from being assigned: > > >> > > >> qemu-system-x86_64 -machine q35... \ > > >> -device intel-iommu,intremap=on \ > > >> -device pcie-root-port,addr=1e.0,id=pcie.1 \ > > >> -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \ > > >> -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1 > > >> > > >> qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \ > > >> 0000:01:00.1: group 1 used in multiple address spaces > > >> > > >> However, when QEMU incorporates proper aliasing, we can make use of a > > >> PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that > > >> provides the downstream devices with the same AddressSpace, ex: > > >> > > >> qemu-system-x86_64 -machine q35... \ > > >> -device intel-iommu,intremap=on \ > > >> -device pcie-pci-bridge,addr=1e.0,id=pci.1 \ > > >> -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \ > > >> -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1 > > >> > > >> While the utility of this hack may be limited, this AddressSpace > > >> aliasing is the correct behavior for QEMU to emulate bare metal. > > >> > > >> Signed-off-by: Alex Williamson <alex.williamson@redhat.com> > > >> --- > > >> hw/pci/pci.c | 33 +++++++++++++++++++++++++++++++-- > > >> 1 file changed, 31 insertions(+), 2 deletions(-) > > >> > > >> diff --git a/hw/pci/pci.c b/hw/pci/pci.c > > >> index 35451c1e9987..38467e676f1f 100644 > > >> --- a/hw/pci/pci.c > > >> +++ b/hw/pci/pci.c > > >> @@ -2594,12 +2594,41 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev) > > >> { > > >> PCIBus *bus = pci_get_bus(dev); > > >> PCIBus *iommu_bus = bus; > > >> + uint8_t devfn = dev->devfn; > > >> > > >> while(iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) { > > >> - iommu_bus = pci_get_bus(iommu_bus->parent_dev); > > >> + PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev); > > >> + > > >> + /* > > >> + * Determine which requester ID alias should be used for the device > > >> + * based on the PCI topology. There are no requester IDs on convetional > > >> + * PCI buses, therefore we push the alias up to the parent on each non- > > >> + * express bus. Which alias we use depends on whether this is a legacy > > >> + * PCI bridge or PCIe-to-PCI/X bridge as in chapter 2.3 of the PCIe-to- > > >> + * PCI bridge spec. Note that we cannot use pci_requester_id() here > > >> + * because the resulting BDF depends on the secondary bridge register > > >> + * programming. We also cannot lookup the PCIBus from the bus number > > >> + * at this point for the iommu_fn. Also, requester_id_cache is the > > >> + * alias to the root bus, which is usually, but not necessarily always > > >> + * where we'll find our iommu_fn. > > >> + */ > > >> + if (!pci_bus_is_express(iommu_bus)) { > > >> + PCIDevice *parent = iommu_bus->parent_dev; > > >> + > > >> + if (pci_is_express(parent) && > > >> + pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) { > > >> + devfn = PCI_DEVFN(0, 0); > > >> + bus = iommu_bus; > > >> + } else { > > >> + devfn = parent->devfn; > > >> + bus = parent_bus; > > >> + } > > >> + } > > >> + > > >> + iommu_bus = parent_bus; > > >> } > > >> if (iommu_bus && iommu_bus->iommu_fn) { > > >> - return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, dev->devfn); > > >> + return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn); > > >> } > > >> return &address_space_memory; > > >> } > > >> > > > ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space 2019-07-23 17:26 ` Alex Williamson 2019-07-23 18:45 ` Michael S. Tsirkin @ 2019-07-24 7:14 ` Peter Xu 2019-07-24 9:39 ` Michael S. Tsirkin 1 sibling, 1 reply; 13+ messages in thread From: Peter Xu @ 2019-07-24 7:14 UTC (permalink / raw) To: Alex Williamson Cc: Singh, Brijesh, mst, qemu-devel, eric.auger, Suthikulpanit, Suravee On Tue, Jul 23, 2019 at 11:26:18AM -0600, Alex Williamson wrote: > > On 3/29/19 11:49 AM, Alex Williamson wrote: > > > [Cc +Brijesh] > > > > > > Hi Brijesh, will the change below require the IVRS to be updated to > > > include aliases for all BDF ranges behind a conventional bridge? I > > > think the Linux code handles this regardless of the firmware provided > > > aliases, but is it required per spec for the ACPI tables to include > > > bridge aliases? Thanks, > > > > > > > We do need to includes aliases in ACPI table. We need to populate the > > IVHD type 0x43 and 0x4 for alias range start and end. I believe host > > IVRS would contain similar information. > > > > Suravee, please correct me if I am missing something? > > I finally found some time to investigate this a little further, yes the > types mentioned are correct for defining start and end of an alias > range. The challenge here is that these entries require a DeviceID, > which is defined as a BDF, AIUI. The IVRS is created in QEMU, but bus > numbers are defined by the guest firmware, and potentially redefined by > the guest OS. This makes it non-trivial to insert a few IVHDs into the > IVRS to describe alias ranges. I'm wondering if the solution here is > to define a new linker-loader command that would instruct the guest to > write a bus number byte to a given offset for a described device. > These commands would be inserted before the checksum command, such that > these bus number updates are calculated as part of the checksum. > > I'm imagining the command format would need to be able to distinguish > between the actual bus number of a described device, the secondary bus > number of the device, and the subordinate bus number of the device. > For describing the device, I'm envisioning stealing from the DMAR > definition, which already includes a bus number invariant mechanism to > describe a device, starting with a segment and root bus, follow a chain > of devfns to get to the target device. Therefore the guest firmware > would follow the path to the described device, pick the desired bus > number, and write it to the indicated table offset. > > Does this seem like a reasonable approach? Better ideas? I'm not > thrilled with the increased scope demanded by IVRS support, but so long > as we have an AMD IOMMU model, I don't see how to avoid it. Thanks, I don't have a better idea yet, but just want to say that accidentally I was trying to look into this as well starting from this week and I'd say that's mostly what I thought about too (I was still reading a bit seabios when I saw this email)... so at least this idea makes sense to me. Would the guest OS still change the PCI bus number even after the firmware (BIOS/UEFI)? Could I ask in what case would that happen? Thanks, -- Peter Xu ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space 2019-07-24 7:14 ` Peter Xu @ 2019-07-24 9:39 ` Michael S. Tsirkin 2019-07-24 10:03 ` Peter Xu 0 siblings, 1 reply; 13+ messages in thread From: Michael S. Tsirkin @ 2019-07-24 9:39 UTC (permalink / raw) To: Peter Xu Cc: Singh, Brijesh, Suthikulpanit, Suravee, qemu-devel, eric.auger, Alex Williamson On Wed, Jul 24, 2019 at 03:14:39PM +0800, Peter Xu wrote: > On Tue, Jul 23, 2019 at 11:26:18AM -0600, Alex Williamson wrote: > > > On 3/29/19 11:49 AM, Alex Williamson wrote: > > > > [Cc +Brijesh] > > > > > > > > Hi Brijesh, will the change below require the IVRS to be updated to > > > > include aliases for all BDF ranges behind a conventional bridge? I > > > > think the Linux code handles this regardless of the firmware provided > > > > aliases, but is it required per spec for the ACPI tables to include > > > > bridge aliases? Thanks, > > > > > > > > > > We do need to includes aliases in ACPI table. We need to populate the > > > IVHD type 0x43 and 0x4 for alias range start and end. I believe host > > > IVRS would contain similar information. > > > > > > Suravee, please correct me if I am missing something? > > > > I finally found some time to investigate this a little further, yes the > > types mentioned are correct for defining start and end of an alias > > range. The challenge here is that these entries require a DeviceID, > > which is defined as a BDF, AIUI. The IVRS is created in QEMU, but bus > > numbers are defined by the guest firmware, and potentially redefined by > > the guest OS. This makes it non-trivial to insert a few IVHDs into the > > IVRS to describe alias ranges. I'm wondering if the solution here is > > to define a new linker-loader command that would instruct the guest to > > write a bus number byte to a given offset for a described device. > > These commands would be inserted before the checksum command, such that > > these bus number updates are calculated as part of the checksum. > > > > I'm imagining the command format would need to be able to distinguish > > between the actual bus number of a described device, the secondary bus > > number of the device, and the subordinate bus number of the device. > > For describing the device, I'm envisioning stealing from the DMAR > > definition, which already includes a bus number invariant mechanism to > > describe a device, starting with a segment and root bus, follow a chain > > of devfns to get to the target device. Therefore the guest firmware > > would follow the path to the described device, pick the desired bus > > number, and write it to the indicated table offset. > > > > Does this seem like a reasonable approach? Better ideas? I'm not > > thrilled with the increased scope demanded by IVRS support, but so long > > as we have an AMD IOMMU model, I don't see how to avoid it. Thanks, > > I don't have a better idea yet, but just want to say that accidentally > I was trying to look into this as well starting from this week and I'd > say that's mostly what I thought about too (I was still reading a bit > seabios when I saw this email)... so at least this idea makes sense to > me. > > Would the guest OS still change the PCI bus number even after the > firmware (BIOS/UEFI)? Could I ask in what case would that happen? > > Thanks, Guest OSes can in theory rebalance resources. Changing bus numbers would be useful if new bridges are added by hotplug. In practice at least Linux doesn't do the rebalancing. I think that if we start reporting PNP OS support in BIOS then windows might start doing that more aggressively. > -- > Peter Xu ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space 2019-07-24 9:39 ` Michael S. Tsirkin @ 2019-07-24 10:03 ` Peter Xu 2019-07-24 14:43 ` Alex Williamson 0 siblings, 1 reply; 13+ messages in thread From: Peter Xu @ 2019-07-24 10:03 UTC (permalink / raw) To: Michael S. Tsirkin Cc: Singh, Brijesh, Suthikulpanit, Suravee, qemu-devel, Peter Xu, eric.auger, Alex Williamson On Wed, Jul 24, 2019 at 05:39:22AM -0400, Michael S. Tsirkin wrote: > On Wed, Jul 24, 2019 at 03:14:39PM +0800, Peter Xu wrote: > > On Tue, Jul 23, 2019 at 11:26:18AM -0600, Alex Williamson wrote: > > > > On 3/29/19 11:49 AM, Alex Williamson wrote: > > > > > [Cc +Brijesh] > > > > > > > > > > Hi Brijesh, will the change below require the IVRS to be updated to > > > > > include aliases for all BDF ranges behind a conventional bridge? I > > > > > think the Linux code handles this regardless of the firmware provided > > > > > aliases, but is it required per spec for the ACPI tables to include > > > > > bridge aliases? Thanks, > > > > > > > > > > > > > We do need to includes aliases in ACPI table. We need to populate the > > > > IVHD type 0x43 and 0x4 for alias range start and end. I believe host > > > > IVRS would contain similar information. > > > > > > > > Suravee, please correct me if I am missing something? > > > > > > I finally found some time to investigate this a little further, yes the > > > types mentioned are correct for defining start and end of an alias > > > range. The challenge here is that these entries require a DeviceID, > > > which is defined as a BDF, AIUI. The IVRS is created in QEMU, but bus > > > numbers are defined by the guest firmware, and potentially redefined by > > > the guest OS. This makes it non-trivial to insert a few IVHDs into the > > > IVRS to describe alias ranges. I'm wondering if the solution here is > > > to define a new linker-loader command that would instruct the guest to > > > write a bus number byte to a given offset for a described device. > > > These commands would be inserted before the checksum command, such that > > > these bus number updates are calculated as part of the checksum. > > > > > > I'm imagining the command format would need to be able to distinguish > > > between the actual bus number of a described device, the secondary bus > > > number of the device, and the subordinate bus number of the device. > > > For describing the device, I'm envisioning stealing from the DMAR > > > definition, which already includes a bus number invariant mechanism to > > > describe a device, starting with a segment and root bus, follow a chain > > > of devfns to get to the target device. Therefore the guest firmware > > > would follow the path to the described device, pick the desired bus > > > number, and write it to the indicated table offset. > > > > > > Does this seem like a reasonable approach? Better ideas? I'm not > > > thrilled with the increased scope demanded by IVRS support, but so long > > > as we have an AMD IOMMU model, I don't see how to avoid it. Thanks, > > > > I don't have a better idea yet, but just want to say that accidentally > > I was trying to look into this as well starting from this week and I'd > > say that's mostly what I thought about too (I was still reading a bit > > seabios when I saw this email)... so at least this idea makes sense to > > me. > > > > Would the guest OS still change the PCI bus number even after the > > firmware (BIOS/UEFI)? Could I ask in what case would that happen? > > > > Thanks, > > Guest OSes can in theory rebalance resources. Changing bus numbers > would be useful if new bridges are added by hotplug. > In practice at least Linux doesn't do the rebalancing. > I think that if we start reporting PNP OS support in BIOS then windows > might start doing that more aggressively. It's surprising me a bit... IMHO if we allow the bus number to change then at least many scripts can even fail which might work before. E.g. , a very common script can run "lspci-like" program to list each device and then do "lspci-like -vvv" again upon the BDF it fetched from previous commands. Any kind of BDF caching would be invalid since that from either userspace or kernel. Also, obviously the data to be stored in IVRS is closely bound to how bus number is defined. Even if we can add a new linker-loader command to all the open firmwares like seabios or OVMF but still we can't do that to Windows (or, could we?...). Now one step back, I'm also curious on the reason behind on why AMD spec required the IVRS with BDF information, rather than the scope information like what Intel DMAR spec was asking for. Thanks, -- Peter Xu ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space 2019-07-24 10:03 ` Peter Xu @ 2019-07-24 14:43 ` Alex Williamson 2019-07-24 19:42 ` Alex Williamson 2019-07-25 6:37 ` Peter Xu 0 siblings, 2 replies; 13+ messages in thread From: Alex Williamson @ 2019-07-24 14:43 UTC (permalink / raw) To: Peter Xu Cc: Singh, Brijesh, Michael S. Tsirkin, qemu-devel, eric.auger, Suthikulpanit, Suravee On Wed, 24 Jul 2019 18:03:31 +0800 Peter Xu <zhexu@redhat.com> wrote: > On Wed, Jul 24, 2019 at 05:39:22AM -0400, Michael S. Tsirkin wrote: > > On Wed, Jul 24, 2019 at 03:14:39PM +0800, Peter Xu wrote: > > > On Tue, Jul 23, 2019 at 11:26:18AM -0600, Alex Williamson wrote: > > > > > On 3/29/19 11:49 AM, Alex Williamson wrote: > > > > > > [Cc +Brijesh] > > > > > > > > > > > > Hi Brijesh, will the change below require the IVRS to be updated to > > > > > > include aliases for all BDF ranges behind a conventional bridge? I > > > > > > think the Linux code handles this regardless of the firmware provided > > > > > > aliases, but is it required per spec for the ACPI tables to include > > > > > > bridge aliases? Thanks, > > > > > > > > > > > > > > > > We do need to includes aliases in ACPI table. We need to populate the > > > > > IVHD type 0x43 and 0x4 for alias range start and end. I believe host > > > > > IVRS would contain similar information. > > > > > > > > > > Suravee, please correct me if I am missing something? > > > > > > > > I finally found some time to investigate this a little further, yes the > > > > types mentioned are correct for defining start and end of an alias > > > > range. The challenge here is that these entries require a DeviceID, > > > > which is defined as a BDF, AIUI. The IVRS is created in QEMU, but bus > > > > numbers are defined by the guest firmware, and potentially redefined by > > > > the guest OS. This makes it non-trivial to insert a few IVHDs into the > > > > IVRS to describe alias ranges. I'm wondering if the solution here is > > > > to define a new linker-loader command that would instruct the guest to > > > > write a bus number byte to a given offset for a described device. > > > > These commands would be inserted before the checksum command, such that > > > > these bus number updates are calculated as part of the checksum. > > > > > > > > I'm imagining the command format would need to be able to distinguish > > > > between the actual bus number of a described device, the secondary bus > > > > number of the device, and the subordinate bus number of the device. > > > > For describing the device, I'm envisioning stealing from the DMAR > > > > definition, which already includes a bus number invariant mechanism to > > > > describe a device, starting with a segment and root bus, follow a chain > > > > of devfns to get to the target device. Therefore the guest firmware > > > > would follow the path to the described device, pick the desired bus > > > > number, and write it to the indicated table offset. > > > > > > > > Does this seem like a reasonable approach? Better ideas? I'm not > > > > thrilled with the increased scope demanded by IVRS support, but so long > > > > as we have an AMD IOMMU model, I don't see how to avoid it. Thanks, > > > > > > I don't have a better idea yet, but just want to say that accidentally > > > I was trying to look into this as well starting from this week and I'd > > > say that's mostly what I thought about too (I was still reading a bit > > > seabios when I saw this email)... so at least this idea makes sense to > > > me. > > > > > > Would the guest OS still change the PCI bus number even after the > > > firmware (BIOS/UEFI)? Could I ask in what case would that happen? > > > > > > Thanks, > > > > Guest OSes can in theory rebalance resources. Changing bus numbers > > would be useful if new bridges are added by hotplug. > > In practice at least Linux doesn't do the rebalancing. > > I think that if we start reporting PNP OS support in BIOS then windows > > might start doing that more aggressively. > > It's surprising me a bit... IMHO if we allow the bus number to change > then at least many scripts can even fail which might work before. > E.g. , a very common script can run "lspci-like" program to list each > device and then do "lspci-like -vvv" again upon the BDF it fetched > from previous commands. Any kind of BDF caching would be invalid > since that from either userspace or kernel. > > Also, obviously the data to be stored in IVRS is closely bound to how > bus number is defined. Even if we can add a new linker-loader command > to all the open firmwares like seabios or OVMF but still we can't do > that to Windows (or, could we?...). > > Now one step back, I'm also curious on the reason behind on why AMD > spec required the IVRS with BDF information, rather than the scope > information like what Intel DMAR spec was asking for. It's a deficiency of the IVRS spec, but it's really out of scope here. It's not the responsibility of the hypervisor to resolve this sort of design issue, we should simply maintain the bare metal behavior and the bare metal limitations of the design. Michael did invoke some interesting ideas regarding QEMU updating the IRVS table though. QEMU does know when bus apertures are programmed on devices and the config writes for these updates could trigger IVRS updates. I think we'd want to allow such updates only between machine reset and the guest firmware writing the table checksum. This reduces the scope of the necessary changes, though still feels a little messy to have these config writes making table updates. Another approach, and maybe what Michael was really suggesting, is that we essentially create the ACPI tables twice AFAICT. Once initially, then again via a select callback in fw_cfg. For SeaBIOS, it looks like this second generation would be created after the PCI bus has been enumerated and initialized. I've been trying to see if the same is likely for OVMF, though it's not clear to me that this is a reasonable ordering to rely on. It would be entirely reasonable that firmware could process ACPI tables in advance of enumerating PCI, even potentially as a prerequisite to enumerating PCI. So ultimately I'm not sure if there are valid ordering assumptions to use these callbacks this way, though I'd appreciate any further discussion. Thanks, Alex ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space 2019-07-24 14:43 ` Alex Williamson @ 2019-07-24 19:42 ` Alex Williamson 2019-07-25 14:34 ` Singh, Brijesh 2019-07-25 6:37 ` Peter Xu 1 sibling, 1 reply; 13+ messages in thread From: Alex Williamson @ 2019-07-24 19:42 UTC (permalink / raw) To: Peter Xu Cc: Singh, Brijesh, Michael S. Tsirkin, qemu-devel, eric.auger, Suthikulpanit, Suravee On Wed, 24 Jul 2019 08:43:55 -0600 Alex Williamson <alex.williamson@redhat.com> wrote: > On Wed, 24 Jul 2019 18:03:31 +0800 > Peter Xu <zhexu@redhat.com> wrote: > > > On Wed, Jul 24, 2019 at 05:39:22AM -0400, Michael S. Tsirkin wrote: > > > On Wed, Jul 24, 2019 at 03:14:39PM +0800, Peter Xu wrote: > > > > On Tue, Jul 23, 2019 at 11:26:18AM -0600, Alex Williamson wrote: > > > > > > On 3/29/19 11:49 AM, Alex Williamson wrote: > > > > > > > [Cc +Brijesh] > > > > > > > > > > > > > > Hi Brijesh, will the change below require the IVRS to be updated to > > > > > > > include aliases for all BDF ranges behind a conventional bridge? I > > > > > > > think the Linux code handles this regardless of the firmware provided > > > > > > > aliases, but is it required per spec for the ACPI tables to include > > > > > > > bridge aliases? Thanks, [snip] For a data point, I fired up an old 990FX system which includes a PCIe-to-PCI bridge and I added a plugin PCIe-to-PCI bridge just to keep things interesting... guess how many alias ranges are in the IVRS... Yep, just the one built into the motherboard: AMD-Vi: Using IVHD type 0x10 AMD-Vi: device: 00:00.2 cap: 0040 seg: 0 flags: 3e info 1300 AMD-Vi: mmio-addr: 00000000fec30000 AMD-Vi: DEV_SELECT_RANGE_START devid: 00:00.0 flags: 00 AMD-Vi: DEV_RANGE_END devid: 00:00.2 AMD-Vi: DEV_SELECT devid: 00:09.0 flags: 00 AMD-Vi: DEV_SELECT devid: 01:00.0 flags: 00 AMD-Vi: DEV_SELECT devid: 00:0a.0 flags: 00 AMD-Vi: DEV_SELECT devid: 02:00.0 flags: 00 AMD-Vi: DEV_SELECT devid: 00:11.0 flags: 00 AMD-Vi: DEV_SELECT_RANGE_START devid: 00:12.0 flags: 00 AMD-Vi: DEV_RANGE_END devid: 00:12.2 AMD-Vi: DEV_SELECT_RANGE_START devid: 00:13.0 flags: 00 AMD-Vi: DEV_RANGE_END devid: 00:13.2 AMD-Vi: DEV_SELECT devid: 00:14.0 flags: d7 AMD-Vi: DEV_SELECT devid: 00:14.2 flags: 00 AMD-Vi: DEV_SELECT devid: 00:14.3 flags: 00 AMD-Vi: DEV_SELECT devid: 00:14.4 flags: 00 AMD-Vi: DEV_ALIAS_RANGE devid: 03:00.0 flags: 00 devid_to: 00:14.4 AMD-Vi: DEV_RANGE_END devid: 03:1f.7 [Everything on bus 03:xx.x is aliased to device 00:14.4, the builtin PCIe-to-PCI bridge] AMD-Vi: DEV_SELECT devid: 00:14.5 flags: 00 AMD-Vi: DEV_SELECT devid: 00:15.0 flags: 00 AMD-Vi: DEV_SELECT_RANGE_START devid: 04:00.0 flags: 00 AMD-Vi: DEV_RANGE_END devid: 04:1f.7 AMD-Vi: DEV_SELECT devid: 00:15.1 flags: 00 AMD-Vi: DEV_SELECT_RANGE_START devid: 05:00.0 flags: 00 AMD-Vi: DEV_RANGE_END devid: 05:1f.7 AMD-Vi: DEV_SELECT devid: 00:15.2 flags: 00 AMD-Vi: DEV_SELECT_RANGE_START devid: 06:00.0 flags: 00 AMD-Vi: DEV_RANGE_END devid: 06:1f.7 AMD-Vi: DEV_SELECT devid: 00:15.3 flags: 00 AMD-Vi: DEV_SELECT_RANGE_START devid: 08:00.0 flags: 00 AMD-Vi: DEV_RANGE_END devid: 08:1f.7 AMD-Vi: DEV_SELECT_RANGE_START devid: 00:16.0 flags: 00 AMD-Vi: DEV_RANGE_END devid: 00:16.2 AMD-Vi: DEV_SPECIAL(IOAPIC[8]) devid: 00:14.0 AMD-Vi: DEV_SPECIAL(HPET[0]) devid: 00:14.0 AMD-Vi: DEV_SPECIAL(IOAPIC[8]) devid: 00:00.1 -[0000:00]-+-00.0 Advanced Micro Devices, Inc. [AMD/ATI] RD9x0/RX980 Host Bridge +-00.2 Advanced Micro Devices, Inc. [AMD/ATI] RD890S/RD990 I/O Memory Management Unit (IOMMU) +-09.0-[01]----00.0 Etron Technology, Inc. EJ168 USB 3.0 Host Controller +-0a.0-[02]----00.0 Marvell Technology Group Ltd. 88SE9172 SATA 6Gb/s Controller +-11.0 Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode] +-12.0 Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller +-12.2 Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller +-13.0 Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller +-13.2 Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller +-14.0 Advanced Micro Devices, Inc. [AMD/ATI] SBx00 SMBus Controller +-14.2 Advanced Micro Devices, Inc. [AMD/ATI] SBx00 Azalia (Intel HDA) +-14.3 Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 LPC host controller +-14.4-[03]--+-06.0 NVidia / SGS Thomson (Joint Venture) Riva128 | \-0e.0 VIA Technologies, Inc. VT6306/7/8 [Fire II(M)] IEEE 1394 OHCI Controller +-14.5 Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI2 Controller +-15.0-[04]----00.0 Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller +-15.1-[05]----00.0 Etron Technology, Inc. EJ168 USB 3.0 Host Controller +-15.2-[06-07]----00.0-[07]----00.0 Realtek Semiconductor Co., Ltd. Device 8149 +-15.3-[08]-- +-16.0 Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller +-16.2 Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller +-18.0 Advanced Micro Devices, Inc. [AMD] Family 10h Processor HyperTransport Configuration +-18.1 Advanced Micro Devices, Inc. [AMD] Family 10h Processor Address Map +-18.2 Advanced Micro Devices, Inc. [AMD] Family 10h Processor DRAM Controller +-18.3 Advanced Micro Devices, Inc. [AMD] Family 10h Processor Miscellaneous Control \-18.4 Advanced Micro Devices, Inc. [AMD] Family 10h Processor Link Control 00:14.4 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 PCI to PCI Bridge (rev 40) (prog-if 01 [Subtractive decode]) Bus: primary=00, secondary=03, subordinate=03, sec-latency=64 06:00.0 PCI bridge: ASMedia Technology Inc. ASM1083/1085 PCIe to PCI Bridge (rev 03) (prog-if 00 [Normal decode]) Bus: primary=06, secondary=07, subordinate=07, sec-latency=32 Capabilities: [80] Express (v1) PCI-Express to PCI/PCI-X Bridge, MSI 00 I realized as I was writing, that the alias caused by 06:00.0 would be devfn 00.0 on the secondary bus 07, ie. 07:00.0 would alias to 07:00.0, so technically there's really not an alias for this small case. So I replaced the NIC with this: +-15.2-[06-07]----00.0-[07]--+-00.0 NEC Corporation OHCI USB Controller +-00.1 NEC Corporation OHCI USB Controller \-00.2 NEC Corporation uPD72010x USB 2.0 Controller Now functions 07:00.[12] also alias to 07:00.0. The IVRS table is unaffected. I'm tempted to say that QEMU would be no worse than bare metal if we simply ignore IVHD device alias entries. I know that Linux will figure out the aliasing regardless of the IVRS. Will Windows guests? I'd love to hear from Bijesh or Suravee regarding the behavior above versus what AMD expects from system vendors. Thanks, Alex ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space 2019-07-24 19:42 ` Alex Williamson @ 2019-07-25 14:34 ` Singh, Brijesh 0 siblings, 0 replies; 13+ messages in thread From: Singh, Brijesh @ 2019-07-25 14:34 UTC (permalink / raw) To: Alex Williamson, Peter Xu Cc: Singh, Brijesh, Michael S. Tsirkin, qemu-devel, eric.auger, Suthikulpanit, Suravee On 7/24/19 2:42 PM, Alex Williamson wrote: > On Wed, 24 Jul 2019 08:43:55 -0600 > Alex Williamson <alex.williamson@redhat.com> wrote: > >> On Wed, 24 Jul 2019 18:03:31 +0800 >> Peter Xu <zhexu@redhat.com> wrote: >> >>> On Wed, Jul 24, 2019 at 05:39:22AM -0400, Michael S. Tsirkin wrote: >>>> On Wed, Jul 24, 2019 at 03:14:39PM +0800, Peter Xu wrote: >>>>> On Tue, Jul 23, 2019 at 11:26:18AM -0600, Alex Williamson wrote: >>>>>>> On 3/29/19 11:49 AM, Alex Williamson wrote: >>>>>>>> [Cc +Brijesh] >>>>>>>> >>>>>>>> Hi Brijesh, will the change below require the IVRS to be updated to >>>>>>>> include aliases for all BDF ranges behind a conventional bridge? I >>>>>>>> think the Linux code handles this regardless of the firmware provided >>>>>>>> aliases, but is it required per spec for the ACPI tables to include >>>>>>>> bridge aliases? Thanks, > [snip] > > For a data point, I fired up an old 990FX system which includes a > PCIe-to-PCI bridge and I added a plugin PCIe-to-PCI bridge just to keep > things interesting... guess how many alias ranges are in the IVRS... > Yep, just the one built into the motherboard: > > AMD-Vi: Using IVHD type 0x10 > AMD-Vi: device: 00:00.2 cap: 0040 seg: 0 flags: 3e info 1300 > AMD-Vi: mmio-addr: 00000000fec30000 > AMD-Vi: DEV_SELECT_RANGE_START devid: 00:00.0 flags: 00 > AMD-Vi: DEV_RANGE_END devid: 00:00.2 > AMD-Vi: DEV_SELECT devid: 00:09.0 flags: 00 > AMD-Vi: DEV_SELECT devid: 01:00.0 flags: 00 > AMD-Vi: DEV_SELECT devid: 00:0a.0 flags: 00 > AMD-Vi: DEV_SELECT devid: 02:00.0 flags: 00 > AMD-Vi: DEV_SELECT devid: 00:11.0 flags: 00 > AMD-Vi: DEV_SELECT_RANGE_START devid: 00:12.0 flags: 00 > AMD-Vi: DEV_RANGE_END devid: 00:12.2 > AMD-Vi: DEV_SELECT_RANGE_START devid: 00:13.0 flags: 00 > AMD-Vi: DEV_RANGE_END devid: 00:13.2 > AMD-Vi: DEV_SELECT devid: 00:14.0 flags: d7 > AMD-Vi: DEV_SELECT devid: 00:14.2 flags: 00 > AMD-Vi: DEV_SELECT devid: 00:14.3 flags: 00 > AMD-Vi: DEV_SELECT devid: 00:14.4 flags: 00 > AMD-Vi: DEV_ALIAS_RANGE devid: 03:00.0 flags: 00 devid_to: 00:14.4 > AMD-Vi: DEV_RANGE_END devid: 03:1f.7 > > [Everything on bus 03:xx.x is aliased to device 00:14.4, the builtin PCIe-to-PCI bridge] > > AMD-Vi: DEV_SELECT devid: 00:14.5 flags: 00 > AMD-Vi: DEV_SELECT devid: 00:15.0 flags: 00 > AMD-Vi: DEV_SELECT_RANGE_START devid: 04:00.0 flags: 00 > AMD-Vi: DEV_RANGE_END devid: 04:1f.7 > AMD-Vi: DEV_SELECT devid: 00:15.1 flags: 00 > AMD-Vi: DEV_SELECT_RANGE_START devid: 05:00.0 flags: 00 > AMD-Vi: DEV_RANGE_END devid: 05:1f.7 > AMD-Vi: DEV_SELECT devid: 00:15.2 flags: 00 > AMD-Vi: DEV_SELECT_RANGE_START devid: 06:00.0 flags: 00 > AMD-Vi: DEV_RANGE_END devid: 06:1f.7 > AMD-Vi: DEV_SELECT devid: 00:15.3 flags: 00 > AMD-Vi: DEV_SELECT_RANGE_START devid: 08:00.0 flags: 00 > AMD-Vi: DEV_RANGE_END devid: 08:1f.7 > AMD-Vi: DEV_SELECT_RANGE_START devid: 00:16.0 flags: 00 > AMD-Vi: DEV_RANGE_END devid: 00:16.2 > AMD-Vi: DEV_SPECIAL(IOAPIC[8]) devid: 00:14.0 > AMD-Vi: DEV_SPECIAL(HPET[0]) devid: 00:14.0 > AMD-Vi: DEV_SPECIAL(IOAPIC[8]) devid: 00:00.1 > > -[0000:00]-+-00.0 Advanced Micro Devices, Inc. [AMD/ATI] RD9x0/RX980 Host Bridge > +-00.2 Advanced Micro Devices, Inc. [AMD/ATI] RD890S/RD990 I/O Memory Management Unit (IOMMU) > +-09.0-[01]----00.0 Etron Technology, Inc. EJ168 USB 3.0 Host Controller > +-0a.0-[02]----00.0 Marvell Technology Group Ltd. 88SE9172 SATA 6Gb/s Controller > +-11.0 Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode] > +-12.0 Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller > +-12.2 Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller > +-13.0 Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller > +-13.2 Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller > +-14.0 Advanced Micro Devices, Inc. [AMD/ATI] SBx00 SMBus Controller > +-14.2 Advanced Micro Devices, Inc. [AMD/ATI] SBx00 Azalia (Intel HDA) > +-14.3 Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 LPC host controller > +-14.4-[03]--+-06.0 NVidia / SGS Thomson (Joint Venture) Riva128 > | \-0e.0 VIA Technologies, Inc. VT6306/7/8 [Fire II(M)] IEEE 1394 OHCI Controller > +-14.5 Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI2 Controller > +-15.0-[04]----00.0 Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller > +-15.1-[05]----00.0 Etron Technology, Inc. EJ168 USB 3.0 Host Controller > +-15.2-[06-07]----00.0-[07]----00.0 Realtek Semiconductor Co., Ltd. Device 8149 > +-15.3-[08]-- > +-16.0 Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller > +-16.2 Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller > +-18.0 Advanced Micro Devices, Inc. [AMD] Family 10h Processor HyperTransport Configuration > +-18.1 Advanced Micro Devices, Inc. [AMD] Family 10h Processor Address Map > +-18.2 Advanced Micro Devices, Inc. [AMD] Family 10h Processor DRAM Controller > +-18.3 Advanced Micro Devices, Inc. [AMD] Family 10h Processor Miscellaneous Control > \-18.4 Advanced Micro Devices, Inc. [AMD] Family 10h Processor Link Control > > 00:14.4 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 PCI to PCI Bridge (rev 40) (prog-if 01 [Subtractive decode]) > Bus: primary=00, secondary=03, subordinate=03, sec-latency=64 > > 06:00.0 PCI bridge: ASMedia Technology Inc. ASM1083/1085 PCIe to PCI Bridge (rev 03) (prog-if 00 [Normal decode]) > Bus: primary=06, secondary=07, subordinate=07, sec-latency=32 > Capabilities: [80] Express (v1) PCI-Express to PCI/PCI-X Bridge, MSI 00 > > I realized as I was writing, that the alias caused by 06:00.0 would be > devfn 00.0 on the secondary bus 07, ie. 07:00.0 would alias to > 07:00.0, so technically there's really not an alias for this small > case. So I replaced the NIC with this: > > +-15.2-[06-07]----00.0-[07]--+-00.0 NEC Corporation OHCI USB Controller > +-00.1 NEC Corporation OHCI USB Controller > \-00.2 NEC Corporation uPD72010x USB 2.0 Controller > > Now functions 07:00.[12] also alias to 07:00.0. The IVRS table is > unaffected. > > I'm tempted to say that QEMU would be no worse than bare metal if we > simply ignore IVHD device alias entries. I know that Linux will figure > out the aliasing regardless of the IVRS. Will Windows guests? I'd > love to hear from Bijesh or Suravee regarding the behavior above versus > what AMD expects from system vendors. Thanks, > Alex, I am not well versed on the expected IOMMU address space behavior yet, Suravee will know more about. I will ask him to take a look at this thread. thanks > Alex > ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space 2019-07-24 14:43 ` Alex Williamson 2019-07-24 19:42 ` Alex Williamson @ 2019-07-25 6:37 ` Peter Xu 2019-07-25 10:43 ` Michael S. Tsirkin 1 sibling, 1 reply; 13+ messages in thread From: Peter Xu @ 2019-07-25 6:37 UTC (permalink / raw) To: Alex Williamson Cc: Singh, Brijesh, Suthikulpanit, Suravee, Michael S. Tsirkin, qemu-devel, Peter Xu, eric.auger On Wed, Jul 24, 2019 at 08:43:55AM -0600, Alex Williamson wrote: > On Wed, 24 Jul 2019 18:03:31 +0800 > Peter Xu <zhexu@redhat.com> wrote: > > > On Wed, Jul 24, 2019 at 05:39:22AM -0400, Michael S. Tsirkin wrote: > > > On Wed, Jul 24, 2019 at 03:14:39PM +0800, Peter Xu wrote: > > > > On Tue, Jul 23, 2019 at 11:26:18AM -0600, Alex Williamson wrote: > > > > > > On 3/29/19 11:49 AM, Alex Williamson wrote: > > > > > > > [Cc +Brijesh] > > > > > > > > > > > > > > Hi Brijesh, will the change below require the IVRS to be updated to > > > > > > > include aliases for all BDF ranges behind a conventional bridge? I > > > > > > > think the Linux code handles this regardless of the firmware provided > > > > > > > aliases, but is it required per spec for the ACPI tables to include > > > > > > > bridge aliases? Thanks, > > > > > > > > > > > > > > > > > > > We do need to includes aliases in ACPI table. We need to populate the > > > > > > IVHD type 0x43 and 0x4 for alias range start and end. I believe host > > > > > > IVRS would contain similar information. > > > > > > > > > > > > Suravee, please correct me if I am missing something? > > > > > > > > > > I finally found some time to investigate this a little further, yes the > > > > > types mentioned are correct for defining start and end of an alias > > > > > range. The challenge here is that these entries require a DeviceID, > > > > > which is defined as a BDF, AIUI. The IVRS is created in QEMU, but bus > > > > > numbers are defined by the guest firmware, and potentially redefined by > > > > > the guest OS. This makes it non-trivial to insert a few IVHDs into the > > > > > IVRS to describe alias ranges. I'm wondering if the solution here is > > > > > to define a new linker-loader command that would instruct the guest to > > > > > write a bus number byte to a given offset for a described device. > > > > > These commands would be inserted before the checksum command, such that > > > > > these bus number updates are calculated as part of the checksum. > > > > > > > > > > I'm imagining the command format would need to be able to distinguish > > > > > between the actual bus number of a described device, the secondary bus > > > > > number of the device, and the subordinate bus number of the device. > > > > > For describing the device, I'm envisioning stealing from the DMAR > > > > > definition, which already includes a bus number invariant mechanism to > > > > > describe a device, starting with a segment and root bus, follow a chain > > > > > of devfns to get to the target device. Therefore the guest firmware > > > > > would follow the path to the described device, pick the desired bus > > > > > number, and write it to the indicated table offset. > > > > > > > > > > Does this seem like a reasonable approach? Better ideas? I'm not > > > > > thrilled with the increased scope demanded by IVRS support, but so long > > > > > as we have an AMD IOMMU model, I don't see how to avoid it. Thanks, > > > > > > > > I don't have a better idea yet, but just want to say that accidentally > > > > I was trying to look into this as well starting from this week and I'd > > > > say that's mostly what I thought about too (I was still reading a bit > > > > seabios when I saw this email)... so at least this idea makes sense to > > > > me. > > > > > > > > Would the guest OS still change the PCI bus number even after the > > > > firmware (BIOS/UEFI)? Could I ask in what case would that happen? > > > > > > > > Thanks, > > > > > > Guest OSes can in theory rebalance resources. Changing bus numbers > > > would be useful if new bridges are added by hotplug. > > > In practice at least Linux doesn't do the rebalancing. > > > I think that if we start reporting PNP OS support in BIOS then windows > > > might start doing that more aggressively. > > > > It's surprising me a bit... IMHO if we allow the bus number to change > > then at least many scripts can even fail which might work before. > > E.g. , a very common script can run "lspci-like" program to list each > > device and then do "lspci-like -vvv" again upon the BDF it fetched > > from previous commands. Any kind of BDF caching would be invalid > > since that from either userspace or kernel. > > > > Also, obviously the data to be stored in IVRS is closely bound to how > > bus number is defined. Even if we can add a new linker-loader command > > to all the open firmwares like seabios or OVMF but still we can't do > > that to Windows (or, could we?...). > > > > Now one step back, I'm also curious on the reason behind on why AMD > > spec required the IVRS with BDF information, rather than the scope > > information like what Intel DMAR spec was asking for. > > It's a deficiency of the IVRS spec, but it's really out of scope here. > It's not the responsibility of the hypervisor to resolve this sort of > design issue, we should simply maintain the bare metal behavior and the > bare metal limitations of the design. Yes this is a better perspective. It's not the first time I totally forgot to go back to reference the bare-metal as that's what we're emulating and normally that'll have very similar problem. And the "data point" email does give a proper supporting material for this. > > Michael did invoke some interesting ideas regarding QEMU updating the > IRVS table though. QEMU does know when bus apertures are programmed on > devices and the config writes for these updates could trigger IVRS > updates. I think we'd want to allow such updates only between machine > reset and the guest firmware writing the table checksum. This reduces > the scope of the necessary changes, though still feels a little messy > to have these config writes making table updates. > > Another approach, and maybe what Michael was really suggesting, is that > we essentially create the ACPI tables twice AFAICT. Once initially, > then again via a select callback in fw_cfg. For SeaBIOS, it looks like > this second generation would be created after the PCI bus has been > enumerated and initialized. I've been trying to see if the same is > likely for OVMF, though it's not clear to me that this is a reasonable > ordering to rely on. It would be entirely reasonable that firmware > could process ACPI tables in advance of enumerating PCI, even > potentially as a prerequisite to enumerating PCI. So ultimately I'm not > sure if there are valid ordering assumptions to use these callbacks > this way, though I'd appreciate any further discussion. Thanks, After re-read Michael's reply, I feel like what Michael suggested is that we can simply ignore the bus-number-change case by the guest OS for now, but I might be wrong. In all cases, this will be a problem only if "we still need to fill in the BDF information" somehow, while that statement seems to be questionable now. Thanks, -- Peter Xu ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space 2019-07-25 6:37 ` Peter Xu @ 2019-07-25 10:43 ` Michael S. Tsirkin 2019-07-25 14:00 ` Alex Williamson 0 siblings, 1 reply; 13+ messages in thread From: Michael S. Tsirkin @ 2019-07-25 10:43 UTC (permalink / raw) To: Peter Xu Cc: Singh, Brijesh, Suthikulpanit, Suravee, qemu-devel, eric.auger, Alex Williamson On Thu, Jul 25, 2019 at 02:37:33PM +0800, Peter Xu wrote: > After re-read Michael's reply, I feel like what Michael suggested is > that we can simply ignore the bus-number-change case by the guest OS > for now, but I might be wrong. That's what I suggested, yes. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space 2019-07-25 10:43 ` Michael S. Tsirkin @ 2019-07-25 14:00 ` Alex Williamson 2019-07-25 15:22 ` Michael S. Tsirkin 0 siblings, 1 reply; 13+ messages in thread From: Alex Williamson @ 2019-07-25 14:00 UTC (permalink / raw) To: Michael S. Tsirkin Cc: Singh, Brijesh, qemu-devel, Peter Xu, eric.auger, Suthikulpanit, Suravee On Thu, 25 Jul 2019 06:43:04 -0400 "Michael S. Tsirkin" <mst@redhat.com> wrote: > On Thu, Jul 25, 2019 at 02:37:33PM +0800, Peter Xu wrote: > > After re-read Michael's reply, I feel like what Michael suggested is > > that we can simply ignore the bus-number-change case by the guest OS > > for now, but I might be wrong. > That's what I suggested, yes. "by the guest OS", yes, that's the part that's beyond the scope of this effort. If we deem it necessary to support IVHD DMA alias though, it's the guest firmware that determines the initial bus numbers, which is part of the DeviceID used to defined IVHD entries and we would not be able to ignore that initial programming. Everything related to the guest OS renumber PCI buses is out of scope, the guest firmware programming initial bus numbers is not. Thanks, Alex ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space 2019-07-25 14:00 ` Alex Williamson @ 2019-07-25 15:22 ` Michael S. Tsirkin 0 siblings, 0 replies; 13+ messages in thread From: Michael S. Tsirkin @ 2019-07-25 15:22 UTC (permalink / raw) To: Alex Williamson Cc: Singh, Brijesh, qemu-devel, Peter Xu, eric.auger, Suthikulpanit, Suravee On Thu, Jul 25, 2019 at 08:00:23AM -0600, Alex Williamson wrote: > On Thu, 25 Jul 2019 06:43:04 -0400 > "Michael S. Tsirkin" <mst@redhat.com> wrote: > > > On Thu, Jul 25, 2019 at 02:37:33PM +0800, Peter Xu wrote: > > > After re-read Michael's reply, I feel like what Michael suggested is > > > that we can simply ignore the bus-number-change case by the guest OS > > > for now, but I might be wrong. > > That's what I suggested, yes. > > "by the guest OS", yes, that's the part that's beyond the scope of this > effort. If we deem it necessary to support IVHD DMA alias though, it's > the guest firmware that determines the initial bus numbers, which is > part of the DeviceID used to defined IVHD entries and we would not be > able to ignore that initial programming. Everything related to the > guest OS renumber PCI buses is out of scope, the guest firmware > programming initial bus numbers is not. Right. That's par for the course, we have same issues with MCFG and others. bios programs it and then we generate acpi based on that. >Thanks, > > Alex ^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2019-07-25 15:22 UTC | newest] Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- [not found] <155364082689.15803.7062874513041742278.stgit@gimli.home> [not found] ` <20190329104904.450fefef@x1.home> 2019-04-01 13:41 ` [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space Singh, Brijesh 2019-07-23 17:26 ` Alex Williamson 2019-07-23 18:45 ` Michael S. Tsirkin 2019-07-24 7:14 ` Peter Xu 2019-07-24 9:39 ` Michael S. Tsirkin 2019-07-24 10:03 ` Peter Xu 2019-07-24 14:43 ` Alex Williamson 2019-07-24 19:42 ` Alex Williamson 2019-07-25 14:34 ` Singh, Brijesh 2019-07-25 6:37 ` Peter Xu 2019-07-25 10:43 ` Michael S. Tsirkin 2019-07-25 14:00 ` Alex Williamson 2019-07-25 15:22 ` Michael S. Tsirkin
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.