Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space

All of lore.kernel.org
 help / color / mirror / Atom feed

* Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
       [not found] ` <20190329104904.450fefef@x1.home>
@ 2019-04-01 13:41   ` Singh, Brijesh
  2019-07-23 17:26     ` Alex Williamson
  0 siblings, 1 reply; 13+ messages in thread
From: Singh, Brijesh @ 2019-04-01 13:41 UTC (permalink / raw)
  To: Alex Williamson, qemu-devel
  Cc: Singh, Brijesh, mst, marcel.apfelbaum, peterx, eric.auger,
	Suthikulpanit, Suravee

Thanks for adding Alex.

Adding Suravee.


On 3/29/19 11:49 AM, Alex Williamson wrote:
> [Cc +Brijesh]
> 
> Hi Brijesh, will the change below require the IVRS to be updated to
> include aliases for all BDF ranges behind a conventional bridge?  I
> think the Linux code handles this regardless of the firmware provided
> aliases, but is it required per spec for the ACPI tables to include
> bridge aliases?  Thanks,
> 

We do need to includes aliases in ACPI table. We need to populate the
IVHD type 0x43 and 0x4 for alias range start and end. I believe host
IVRS would contain similar information.

Suravee, please correct me if I am missing something?

thanks

> Alex
> 
> On Tue, 26 Mar 2019 16:55:19 -0600
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
>> Conventional PCI buses pre-date requester IDs.  An IOMMU cannot
>> distinguish by devfn & bus between devices in a conventional PCI
>> topology and therefore we cannot assign them separate AddressSpaces.
>> By taking this requester ID aliasing into account, QEMU better matches
>> the bare metal behavior and restrictions, and enables shared
>> AddressSpace configurations that are otherwise not possible with
>> guest IOMMU support.
>>
>> For the latter case, given any example where an IOMMU group on the
>> host includes multiple devices:
>>
>>    $ ls  /sys/kernel/iommu_groups/1/devices/
>>    0000:00:01.0  0000:01:00.0  0000:01:00.1
>>
>> If we incorporate a vIOMMU into the VM configuration, we're restricted
>> that we can only assign one of the endpoints to the guest because a
>> second endpoint will attempt to use a different AddressSpace.  VFIO
>> only supports IOMMU group level granularity at the container level,
>> preventing this second endpoint from being assigned:
>>
>> qemu-system-x86_64 -machine q35... \
>>    -device intel-iommu,intremap=on \
>>    -device pcie-root-port,addr=1e.0,id=pcie.1 \
>>    -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \
>>    -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1
>>
>> qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \
>> 0000:01:00.1: group 1 used in multiple address spaces
>>
>> However, when QEMU incorporates proper aliasing, we can make use of a
>> PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that
>> provides the downstream devices with the same AddressSpace, ex:
>>
>> qemu-system-x86_64 -machine q35... \
>>    -device intel-iommu,intremap=on \
>>    -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
>>    -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
>>    -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1
>>
>> While the utility of this hack may be limited, this AddressSpace
>> aliasing is the correct behavior for QEMU to emulate bare metal.
>>
>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>> ---
>>   hw/pci/pci.c |   33 +++++++++++++++++++++++++++++++--
>>   1 file changed, 31 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
>> index 35451c1e9987..38467e676f1f 100644
>> --- a/hw/pci/pci.c
>> +++ b/hw/pci/pci.c
>> @@ -2594,12 +2594,41 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
>>   {
>>       PCIBus *bus = pci_get_bus(dev);
>>       PCIBus *iommu_bus = bus;
>> +    uint8_t devfn = dev->devfn;
>>   
>>       while(iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
>> -        iommu_bus = pci_get_bus(iommu_bus->parent_dev);
>> +        PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
>> +
>> +        /*
>> +         * Determine which requester ID alias should be used for the device
>> +         * based on the PCI topology.  There are no requester IDs on convetional
>> +         * PCI buses, therefore we push the alias up to the parent on each non-
>> +         * express bus.  Which alias we use depends on whether this is a legacy
>> +         * PCI bridge or PCIe-to-PCI/X bridge as in chapter 2.3 of the PCIe-to-
>> +         * PCI bridge spec.  Note that we cannot use pci_requester_id() here
>> +         * because the resulting BDF depends on the secondary bridge register
>> +         * programming.  We also cannot lookup the PCIBus from the bus number
>> +         * at this point for the iommu_fn.  Also, requester_id_cache is the
>> +         * alias to the root bus, which is usually, but not necessarily always
>> +         * where we'll find our iommu_fn.
>> +         */
>> +        if (!pci_bus_is_express(iommu_bus)) {
>> +            PCIDevice *parent = iommu_bus->parent_dev;
>> +
>> +            if (pci_is_express(parent) &&
>> +                pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) {
>> +                devfn = PCI_DEVFN(0, 0);
>> +                bus = iommu_bus;
>> +            } else {
>> +                devfn = parent->devfn;
>> +                bus = parent_bus;
>> +            }
>> +        }
>> +
>> +        iommu_bus = parent_bus;
>>       }
>>       if (iommu_bus && iommu_bus->iommu_fn) {
>> -        return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, dev->devfn);
>> +        return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
>>       }
>>       return &address_space_memory;
>>   }
>>
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
  2019-04-01 13:41   ` [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space Singh, Brijesh
@ 2019-07-23 17:26     ` Alex Williamson
  2019-07-23 18:45       ` Michael S. Tsirkin
  2019-07-24  7:14       ` Peter Xu
  0 siblings, 2 replies; 13+ messages in thread
From: Alex Williamson @ 2019-07-23 17:26 UTC (permalink / raw)
  To: Singh, Brijesh
  Cc: mst, qemu-devel, peterx, eric.auger, Suthikulpanit, Suravee

On Mon, 1 Apr 2019 13:41:39 +0000
"Singh, Brijesh" <brijesh.singh@amd.com> wrote:

> Thanks for adding Alex.
> 
> Adding Suravee.
> 
> 
> On 3/29/19 11:49 AM, Alex Williamson wrote:
> > [Cc +Brijesh]
> > 
> > Hi Brijesh, will the change below require the IVRS to be updated to
> > include aliases for all BDF ranges behind a conventional bridge?  I
> > think the Linux code handles this regardless of the firmware provided
> > aliases, but is it required per spec for the ACPI tables to include
> > bridge aliases?  Thanks,
> >   
> 
> We do need to includes aliases in ACPI table. We need to populate the
> IVHD type 0x43 and 0x4 for alias range start and end. I believe host
> IVRS would contain similar information.
> 
> Suravee, please correct me if I am missing something?

I finally found some time to investigate this a little further, yes the
types mentioned are correct for defining start and end of an alias
range.  The challenge here is that these entries require a DeviceID,
which is defined as a BDF, AIUI.  The IVRS is created in QEMU, but bus
numbers are defined by the guest firmware, and potentially redefined by
the guest OS.  This makes it non-trivial to insert a few IVHDs into the
IVRS to describe alias ranges.  I'm wondering if the solution here is
to define a new linker-loader command that would instruct the guest to
write a bus number byte to a given offset for a described device.
These commands would be inserted before the checksum command, such that
these bus number updates are calculated as part of the checksum.

I'm imagining the command format would need to be able to distinguish
between the actual bus number of a described device, the secondary bus
number of the device, and the subordinate bus number of the device.
For describing the device, I'm envisioning stealing from the DMAR
definition, which already includes a bus number invariant mechanism to
describe a device, starting with a segment and root bus, follow a chain
of devfns to get to the target device.  Therefore the guest firmware
would follow the path to the described device, pick the desired bus
number, and write it to the indicated table offset.

Does this seem like a reasonable approach?  Better ideas?  I'm not
thrilled with the increased scope demanded by IVRS support, but so long
as we have an AMD IOMMU model, I don't see how to avoid it.  Thanks,

Alex

> > On Tue, 26 Mar 2019 16:55:19 -0600
> > Alex Williamson <alex.williamson@redhat.com> wrote:
> >   
> >> Conventional PCI buses pre-date requester IDs.  An IOMMU cannot
> >> distinguish by devfn & bus between devices in a conventional PCI
> >> topology and therefore we cannot assign them separate AddressSpaces.
> >> By taking this requester ID aliasing into account, QEMU better matches
> >> the bare metal behavior and restrictions, and enables shared
> >> AddressSpace configurations that are otherwise not possible with
> >> guest IOMMU support.
> >>
> >> For the latter case, given any example where an IOMMU group on the
> >> host includes multiple devices:
> >>
> >>    $ ls  /sys/kernel/iommu_groups/1/devices/
> >>    0000:00:01.0  0000:01:00.0  0000:01:00.1
> >>
> >> If we incorporate a vIOMMU into the VM configuration, we're restricted
> >> that we can only assign one of the endpoints to the guest because a
> >> second endpoint will attempt to use a different AddressSpace.  VFIO
> >> only supports IOMMU group level granularity at the container level,
> >> preventing this second endpoint from being assigned:
> >>
> >> qemu-system-x86_64 -machine q35... \
> >>    -device intel-iommu,intremap=on \
> >>    -device pcie-root-port,addr=1e.0,id=pcie.1 \
> >>    -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \
> >>    -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1
> >>
> >> qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \
> >> 0000:01:00.1: group 1 used in multiple address spaces
> >>
> >> However, when QEMU incorporates proper aliasing, we can make use of a
> >> PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that
> >> provides the downstream devices with the same AddressSpace, ex:
> >>
> >> qemu-system-x86_64 -machine q35... \
> >>    -device intel-iommu,intremap=on \
> >>    -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
> >>    -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
> >>    -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1
> >>
> >> While the utility of this hack may be limited, this AddressSpace
> >> aliasing is the correct behavior for QEMU to emulate bare metal.
> >>
> >> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> >> ---
> >>   hw/pci/pci.c |   33 +++++++++++++++++++++++++++++++--
> >>   1 file changed, 31 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> >> index 35451c1e9987..38467e676f1f 100644
> >> --- a/hw/pci/pci.c
> >> +++ b/hw/pci/pci.c
> >> @@ -2594,12 +2594,41 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
> >>   {
> >>       PCIBus *bus = pci_get_bus(dev);
> >>       PCIBus *iommu_bus = bus;
> >> +    uint8_t devfn = dev->devfn;
> >>   
> >>       while(iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
> >> -        iommu_bus = pci_get_bus(iommu_bus->parent_dev);
> >> +        PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
> >> +
> >> +        /*
> >> +         * Determine which requester ID alias should be used for the device
> >> +         * based on the PCI topology.  There are no requester IDs on convetional
> >> +         * PCI buses, therefore we push the alias up to the parent on each non-
> >> +         * express bus.  Which alias we use depends on whether this is a legacy
> >> +         * PCI bridge or PCIe-to-PCI/X bridge as in chapter 2.3 of the PCIe-to-
> >> +         * PCI bridge spec.  Note that we cannot use pci_requester_id() here
> >> +         * because the resulting BDF depends on the secondary bridge register
> >> +         * programming.  We also cannot lookup the PCIBus from the bus number
> >> +         * at this point for the iommu_fn.  Also, requester_id_cache is the
> >> +         * alias to the root bus, which is usually, but not necessarily always
> >> +         * where we'll find our iommu_fn.
> >> +         */
> >> +        if (!pci_bus_is_express(iommu_bus)) {
> >> +            PCIDevice *parent = iommu_bus->parent_dev;
> >> +
> >> +            if (pci_is_express(parent) &&
> >> +                pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) {
> >> +                devfn = PCI_DEVFN(0, 0);
> >> +                bus = iommu_bus;
> >> +            } else {
> >> +                devfn = parent->devfn;
> >> +                bus = parent_bus;
> >> +            }
> >> +        }
> >> +
> >> +        iommu_bus = parent_bus;
> >>       }
> >>       if (iommu_bus && iommu_bus->iommu_fn) {
> >> -        return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, dev->devfn);
> >> +        return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
> >>       }
> >>       return &address_space_memory;
> >>   }
> >>  
> >   



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
  2019-07-23 17:26     ` Alex Williamson
@ 2019-07-23 18:45       ` Michael S. Tsirkin
  2019-07-24  7:14       ` Peter Xu
  1 sibling, 0 replies; 13+ messages in thread
From: Michael S. Tsirkin @ 2019-07-23 18:45 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Singh, Brijesh, Suthikulpanit, Suravee, qemu-devel, peterx, eric.auger

On Tue, Jul 23, 2019 at 11:26:18AM -0600, Alex Williamson wrote:
> On Mon, 1 Apr 2019 13:41:39 +0000
> "Singh, Brijesh" <brijesh.singh@amd.com> wrote:
> 
> > Thanks for adding Alex.
> > 
> > Adding Suravee.
> > 
> > 
> > On 3/29/19 11:49 AM, Alex Williamson wrote:
> > > [Cc +Brijesh]
> > > 
> > > Hi Brijesh, will the change below require the IVRS to be updated to
> > > include aliases for all BDF ranges behind a conventional bridge?  I
> > > think the Linux code handles this regardless of the firmware provided
> > > aliases, but is it required per spec for the ACPI tables to include
> > > bridge aliases?  Thanks,
> > >   
> > 
> > We do need to includes aliases in ACPI table. We need to populate the
> > IVHD type 0x43 and 0x4 for alias range start and end. I believe host
> > IVRS would contain similar information.
> > 
> > Suravee, please correct me if I am missing something?
> 
> I finally found some time to investigate this a little further, yes the
> types mentioned are correct for defining start and end of an alias
> range.  The challenge here is that these entries require a DeviceID,
> which is defined as a BDF, AIUI.  The IVRS is created in QEMU, but bus
> numbers are defined by the guest firmware, and potentially redefined by
> the guest OS.  This makes it non-trivial to insert a few IVHDs into the
> IVRS to describe alias ranges.  I'm wondering if the solution here is
> to define a new linker-loader command that would instruct the guest to
> write a bus number byte to a given offset for a described device.
> These commands would be inserted before the checksum command, such that
> these bus number updates are calculated as part of the checksum.
> 
> I'm imagining the command format would need to be able to distinguish
> between the actual bus number of a described device, the secondary bus
> number of the device, and the subordinate bus number of the device.
> For describing the device, I'm envisioning stealing from the DMAR
> definition, which already includes a bus number invariant mechanism to
> describe a device, starting with a segment and root bus, follow a chain
> of devfns to get to the target device.  Therefore the guest firmware
> would follow the path to the described device, pick the desired bus
> number, and write it to the indicated table offset.
> 
> Does this seem like a reasonable approach?  Better ideas?  I'm not
> thrilled with the increased scope demanded by IVRS support, but so long
> as we have an AMD IOMMU model, I don't see how to avoid it.  Thanks,
> 
> Alex

We generally just got the bus #s as programmed into virtual
bridges/devices and had qemu program them into acpi
tables.

If guest os changes bus#s it's responsible for updating acpi
tables :)


> > > On Tue, 26 Mar 2019 16:55:19 -0600
> > > Alex Williamson <alex.williamson@redhat.com> wrote:
> > >   
> > >> Conventional PCI buses pre-date requester IDs.  An IOMMU cannot
> > >> distinguish by devfn & bus between devices in a conventional PCI
> > >> topology and therefore we cannot assign them separate AddressSpaces.
> > >> By taking this requester ID aliasing into account, QEMU better matches
> > >> the bare metal behavior and restrictions, and enables shared
> > >> AddressSpace configurations that are otherwise not possible with
> > >> guest IOMMU support.
> > >>
> > >> For the latter case, given any example where an IOMMU group on the
> > >> host includes multiple devices:
> > >>
> > >>    $ ls  /sys/kernel/iommu_groups/1/devices/
> > >>    0000:00:01.0  0000:01:00.0  0000:01:00.1
> > >>
> > >> If we incorporate a vIOMMU into the VM configuration, we're restricted
> > >> that we can only assign one of the endpoints to the guest because a
> > >> second endpoint will attempt to use a different AddressSpace.  VFIO
> > >> only supports IOMMU group level granularity at the container level,
> > >> preventing this second endpoint from being assigned:
> > >>
> > >> qemu-system-x86_64 -machine q35... \
> > >>    -device intel-iommu,intremap=on \
> > >>    -device pcie-root-port,addr=1e.0,id=pcie.1 \
> > >>    -device vfio-pci,host=1:00.0,bus=pcie.1,addr=0.0,multifunction=on \
> > >>    -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1
> > >>
> > >> qemu-system-x86_64: -device vfio-pci,host=1:00.1,bus=pcie.1,addr=0.1: vfio \
> > >> 0000:01:00.1: group 1 used in multiple address spaces
> > >>
> > >> However, when QEMU incorporates proper aliasing, we can make use of a
> > >> PCIe-to-PCI bridge to mask the requester ID, resulting in a hack that
> > >> provides the downstream devices with the same AddressSpace, ex:
> > >>
> > >> qemu-system-x86_64 -machine q35... \
> > >>    -device intel-iommu,intremap=on \
> > >>    -device pcie-pci-bridge,addr=1e.0,id=pci.1 \
> > >>    -device vfio-pci,host=1:00.0,bus=pci.1,addr=1.0,multifunction=on \
> > >>    -device vfio-pci,host=1:00.1,bus=pci.1,addr=1.1
> > >>
> > >> While the utility of this hack may be limited, this AddressSpace
> > >> aliasing is the correct behavior for QEMU to emulate bare metal.
> > >>
> > >> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > >> ---
> > >>   hw/pci/pci.c |   33 +++++++++++++++++++++++++++++++--
> > >>   1 file changed, 31 insertions(+), 2 deletions(-)
> > >>
> > >> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> > >> index 35451c1e9987..38467e676f1f 100644
> > >> --- a/hw/pci/pci.c
> > >> +++ b/hw/pci/pci.c
> > >> @@ -2594,12 +2594,41 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
> > >>   {
> > >>       PCIBus *bus = pci_get_bus(dev);
> > >>       PCIBus *iommu_bus = bus;
> > >> +    uint8_t devfn = dev->devfn;
> > >>   
> > >>       while(iommu_bus && !iommu_bus->iommu_fn && iommu_bus->parent_dev) {
> > >> -        iommu_bus = pci_get_bus(iommu_bus->parent_dev);
> > >> +        PCIBus *parent_bus = pci_get_bus(iommu_bus->parent_dev);
> > >> +
> > >> +        /*
> > >> +         * Determine which requester ID alias should be used for the device
> > >> +         * based on the PCI topology.  There are no requester IDs on convetional
> > >> +         * PCI buses, therefore we push the alias up to the parent on each non-
> > >> +         * express bus.  Which alias we use depends on whether this is a legacy
> > >> +         * PCI bridge or PCIe-to-PCI/X bridge as in chapter 2.3 of the PCIe-to-
> > >> +         * PCI bridge spec.  Note that we cannot use pci_requester_id() here
> > >> +         * because the resulting BDF depends on the secondary bridge register
> > >> +         * programming.  We also cannot lookup the PCIBus from the bus number
> > >> +         * at this point for the iommu_fn.  Also, requester_id_cache is the
> > >> +         * alias to the root bus, which is usually, but not necessarily always
> > >> +         * where we'll find our iommu_fn.
> > >> +         */
> > >> +        if (!pci_bus_is_express(iommu_bus)) {
> > >> +            PCIDevice *parent = iommu_bus->parent_dev;
> > >> +
> > >> +            if (pci_is_express(parent) &&
> > >> +                pcie_cap_get_type(parent) == PCI_EXP_TYPE_PCI_BRIDGE) {
> > >> +                devfn = PCI_DEVFN(0, 0);
> > >> +                bus = iommu_bus;
> > >> +            } else {
> > >> +                devfn = parent->devfn;
> > >> +                bus = parent_bus;
> > >> +            }
> > >> +        }
> > >> +
> > >> +        iommu_bus = parent_bus;
> > >>       }
> > >>       if (iommu_bus && iommu_bus->iommu_fn) {
> > >> -        return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, dev->devfn);
> > >> +        return iommu_bus->iommu_fn(bus, iommu_bus->iommu_opaque, devfn);
> > >>       }
> > >>       return &address_space_memory;
> > >>   }
> > >>  
> > >   


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
  2019-07-23 17:26     ` Alex Williamson
  2019-07-23 18:45       ` Michael S. Tsirkin
@ 2019-07-24  7:14       ` Peter Xu
  2019-07-24  9:39         ` Michael S. Tsirkin
  1 sibling, 1 reply; 13+ messages in thread
From: Peter Xu @ 2019-07-24  7:14 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Singh, Brijesh, mst, qemu-devel, eric.auger, Suthikulpanit, Suravee

On Tue, Jul 23, 2019 at 11:26:18AM -0600, Alex Williamson wrote:
> > On 3/29/19 11:49 AM, Alex Williamson wrote:
> > > [Cc +Brijesh]
> > > 
> > > Hi Brijesh, will the change below require the IVRS to be updated to
> > > include aliases for all BDF ranges behind a conventional bridge?  I
> > > think the Linux code handles this regardless of the firmware provided
> > > aliases, but is it required per spec for the ACPI tables to include
> > > bridge aliases?  Thanks,
> > >   
> > 
> > We do need to includes aliases in ACPI table. We need to populate the
> > IVHD type 0x43 and 0x4 for alias range start and end. I believe host
> > IVRS would contain similar information.
> > 
> > Suravee, please correct me if I am missing something?
> 
> I finally found some time to investigate this a little further, yes the
> types mentioned are correct for defining start and end of an alias
> range.  The challenge here is that these entries require a DeviceID,
> which is defined as a BDF, AIUI.  The IVRS is created in QEMU, but bus
> numbers are defined by the guest firmware, and potentially redefined by
> the guest OS.  This makes it non-trivial to insert a few IVHDs into the
> IVRS to describe alias ranges.  I'm wondering if the solution here is
> to define a new linker-loader command that would instruct the guest to
> write a bus number byte to a given offset for a described device.
> These commands would be inserted before the checksum command, such that
> these bus number updates are calculated as part of the checksum.
> 
> I'm imagining the command format would need to be able to distinguish
> between the actual bus number of a described device, the secondary bus
> number of the device, and the subordinate bus number of the device.
> For describing the device, I'm envisioning stealing from the DMAR
> definition, which already includes a bus number invariant mechanism to
> describe a device, starting with a segment and root bus, follow a chain
> of devfns to get to the target device.  Therefore the guest firmware
> would follow the path to the described device, pick the desired bus
> number, and write it to the indicated table offset.
> 
> Does this seem like a reasonable approach?  Better ideas?  I'm not
> thrilled with the increased scope demanded by IVRS support, but so long
> as we have an AMD IOMMU model, I don't see how to avoid it.  Thanks,

I don't have a better idea yet, but just want to say that accidentally
I was trying to look into this as well starting from this week and I'd
say that's mostly what I thought about too (I was still reading a bit
seabios when I saw this email)... so at least this idea makes sense to
me.

Would the guest OS still change the PCI bus number even after the
firmware (BIOS/UEFI)?  Could I ask in what case would that happen?

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
  2019-07-24  7:14       ` Peter Xu
@ 2019-07-24  9:39         ` Michael S. Tsirkin
  2019-07-24 10:03           ` Peter Xu
  0 siblings, 1 reply; 13+ messages in thread
From: Michael S. Tsirkin @ 2019-07-24  9:39 UTC (permalink / raw)
  To: Peter Xu
  Cc: Singh, Brijesh, Suthikulpanit, Suravee, qemu-devel, eric.auger,
	Alex Williamson

On Wed, Jul 24, 2019 at 03:14:39PM +0800, Peter Xu wrote:
> On Tue, Jul 23, 2019 at 11:26:18AM -0600, Alex Williamson wrote:
> > > On 3/29/19 11:49 AM, Alex Williamson wrote:
> > > > [Cc +Brijesh]
> > > > 
> > > > Hi Brijesh, will the change below require the IVRS to be updated to
> > > > include aliases for all BDF ranges behind a conventional bridge?  I
> > > > think the Linux code handles this regardless of the firmware provided
> > > > aliases, but is it required per spec for the ACPI tables to include
> > > > bridge aliases?  Thanks,
> > > >   
> > > 
> > > We do need to includes aliases in ACPI table. We need to populate the
> > > IVHD type 0x43 and 0x4 for alias range start and end. I believe host
> > > IVRS would contain similar information.
> > > 
> > > Suravee, please correct me if I am missing something?
> > 
> > I finally found some time to investigate this a little further, yes the
> > types mentioned are correct for defining start and end of an alias
> > range.  The challenge here is that these entries require a DeviceID,
> > which is defined as a BDF, AIUI.  The IVRS is created in QEMU, but bus
> > numbers are defined by the guest firmware, and potentially redefined by
> > the guest OS.  This makes it non-trivial to insert a few IVHDs into the
> > IVRS to describe alias ranges.  I'm wondering if the solution here is
> > to define a new linker-loader command that would instruct the guest to
> > write a bus number byte to a given offset for a described device.
> > These commands would be inserted before the checksum command, such that
> > these bus number updates are calculated as part of the checksum.
> > 
> > I'm imagining the command format would need to be able to distinguish
> > between the actual bus number of a described device, the secondary bus
> > number of the device, and the subordinate bus number of the device.
> > For describing the device, I'm envisioning stealing from the DMAR
> > definition, which already includes a bus number invariant mechanism to
> > describe a device, starting with a segment and root bus, follow a chain
> > of devfns to get to the target device.  Therefore the guest firmware
> > would follow the path to the described device, pick the desired bus
> > number, and write it to the indicated table offset.
> > 
> > Does this seem like a reasonable approach?  Better ideas?  I'm not
> > thrilled with the increased scope demanded by IVRS support, but so long
> > as we have an AMD IOMMU model, I don't see how to avoid it.  Thanks,
> 
> I don't have a better idea yet, but just want to say that accidentally
> I was trying to look into this as well starting from this week and I'd
> say that's mostly what I thought about too (I was still reading a bit
> seabios when I saw this email)... so at least this idea makes sense to
> me.
> 
> Would the guest OS still change the PCI bus number even after the
> firmware (BIOS/UEFI)?  Could I ask in what case would that happen?
> 
> Thanks,

Guest OSes can in theory rebalance resources. Changing bus numbers
would be useful if new bridges are added by hotplug.
In practice at least Linux doesn't do the rebalancing.
I think that if we start reporting PNP OS support in BIOS then windows
might start doing that more aggressively.


> -- 
> Peter Xu


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
  2019-07-24  9:39         ` Michael S. Tsirkin
@ 2019-07-24 10:03           ` Peter Xu
  2019-07-24 14:43             ` Alex Williamson
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Xu @ 2019-07-24 10:03 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Singh, Brijesh, Suthikulpanit, Suravee, qemu-devel, Peter Xu,
	eric.auger, Alex Williamson

On Wed, Jul 24, 2019 at 05:39:22AM -0400, Michael S. Tsirkin wrote:
> On Wed, Jul 24, 2019 at 03:14:39PM +0800, Peter Xu wrote:
> > On Tue, Jul 23, 2019 at 11:26:18AM -0600, Alex Williamson wrote:
> > > > On 3/29/19 11:49 AM, Alex Williamson wrote:
> > > > > [Cc +Brijesh]
> > > > > 
> > > > > Hi Brijesh, will the change below require the IVRS to be updated to
> > > > > include aliases for all BDF ranges behind a conventional bridge?  I
> > > > > think the Linux code handles this regardless of the firmware provided
> > > > > aliases, but is it required per spec for the ACPI tables to include
> > > > > bridge aliases?  Thanks,
> > > > >   
> > > > 
> > > > We do need to includes aliases in ACPI table. We need to populate the
> > > > IVHD type 0x43 and 0x4 for alias range start and end. I believe host
> > > > IVRS would contain similar information.
> > > > 
> > > > Suravee, please correct me if I am missing something?
> > > 
> > > I finally found some time to investigate this a little further, yes the
> > > types mentioned are correct for defining start and end of an alias
> > > range.  The challenge here is that these entries require a DeviceID,
> > > which is defined as a BDF, AIUI.  The IVRS is created in QEMU, but bus
> > > numbers are defined by the guest firmware, and potentially redefined by
> > > the guest OS.  This makes it non-trivial to insert a few IVHDs into the
> > > IVRS to describe alias ranges.  I'm wondering if the solution here is
> > > to define a new linker-loader command that would instruct the guest to
> > > write a bus number byte to a given offset for a described device.
> > > These commands would be inserted before the checksum command, such that
> > > these bus number updates are calculated as part of the checksum.
> > > 
> > > I'm imagining the command format would need to be able to distinguish
> > > between the actual bus number of a described device, the secondary bus
> > > number of the device, and the subordinate bus number of the device.
> > > For describing the device, I'm envisioning stealing from the DMAR
> > > definition, which already includes a bus number invariant mechanism to
> > > describe a device, starting with a segment and root bus, follow a chain
> > > of devfns to get to the target device.  Therefore the guest firmware
> > > would follow the path to the described device, pick the desired bus
> > > number, and write it to the indicated table offset.
> > > 
> > > Does this seem like a reasonable approach?  Better ideas?  I'm not
> > > thrilled with the increased scope demanded by IVRS support, but so long
> > > as we have an AMD IOMMU model, I don't see how to avoid it.  Thanks,
> > 
> > I don't have a better idea yet, but just want to say that accidentally
> > I was trying to look into this as well starting from this week and I'd
> > say that's mostly what I thought about too (I was still reading a bit
> > seabios when I saw this email)... so at least this idea makes sense to
> > me.
> > 
> > Would the guest OS still change the PCI bus number even after the
> > firmware (BIOS/UEFI)?  Could I ask in what case would that happen?
> > 
> > Thanks,
> 
> Guest OSes can in theory rebalance resources. Changing bus numbers
> would be useful if new bridges are added by hotplug.
> In practice at least Linux doesn't do the rebalancing.
> I think that if we start reporting PNP OS support in BIOS then windows
> might start doing that more aggressively.

It's surprising me a bit...  IMHO if we allow the bus number to change
then at least many scripts can even fail which might work before.
E.g. , a very common script can run "lspci-like" program to list each
device and then do "lspci-like -vvv" again upon the BDF it fetched
from previous commands.  Any kind of BDF caching would be invalid
since that from either userspace or kernel.

Also, obviously the data to be stored in IVRS is closely bound to how
bus number is defined.  Even if we can add a new linker-loader command
to all the open firmwares like seabios or OVMF but still we can't do
that to Windows (or, could we?...).

Now one step back, I'm also curious on the reason behind on why AMD
spec required the IVRS with BDF information, rather than the scope
information like what Intel DMAR spec was asking for.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
  2019-07-24 10:03           ` Peter Xu
@ 2019-07-24 14:43             ` Alex Williamson
  2019-07-24 19:42               ` Alex Williamson
  2019-07-25  6:37               ` Peter Xu
  0 siblings, 2 replies; 13+ messages in thread
From: Alex Williamson @ 2019-07-24 14:43 UTC (permalink / raw)
  To: Peter Xu
  Cc: Singh, Brijesh, Michael S. Tsirkin, qemu-devel, eric.auger,
	Suthikulpanit, Suravee

On Wed, 24 Jul 2019 18:03:31 +0800
Peter Xu <zhexu@redhat.com> wrote:

> On Wed, Jul 24, 2019 at 05:39:22AM -0400, Michael S. Tsirkin wrote:
> > On Wed, Jul 24, 2019 at 03:14:39PM +0800, Peter Xu wrote:  
> > > On Tue, Jul 23, 2019 at 11:26:18AM -0600, Alex Williamson wrote:  
> > > > > On 3/29/19 11:49 AM, Alex Williamson wrote:  
> > > > > > [Cc +Brijesh]
> > > > > > 
> > > > > > Hi Brijesh, will the change below require the IVRS to be updated to
> > > > > > include aliases for all BDF ranges behind a conventional bridge?  I
> > > > > > think the Linux code handles this regardless of the firmware provided
> > > > > > aliases, but is it required per spec for the ACPI tables to include
> > > > > > bridge aliases?  Thanks,
> > > > > >     
> > > > > 
> > > > > We do need to includes aliases in ACPI table. We need to populate the
> > > > > IVHD type 0x43 and 0x4 for alias range start and end. I believe host
> > > > > IVRS would contain similar information.
> > > > > 
> > > > > Suravee, please correct me if I am missing something?  
> > > > 
> > > > I finally found some time to investigate this a little further, yes the
> > > > types mentioned are correct for defining start and end of an alias
> > > > range.  The challenge here is that these entries require a DeviceID,
> > > > which is defined as a BDF, AIUI.  The IVRS is created in QEMU, but bus
> > > > numbers are defined by the guest firmware, and potentially redefined by
> > > > the guest OS.  This makes it non-trivial to insert a few IVHDs into the
> > > > IVRS to describe alias ranges.  I'm wondering if the solution here is
> > > > to define a new linker-loader command that would instruct the guest to
> > > > write a bus number byte to a given offset for a described device.
> > > > These commands would be inserted before the checksum command, such that
> > > > these bus number updates are calculated as part of the checksum.
> > > > 
> > > > I'm imagining the command format would need to be able to distinguish
> > > > between the actual bus number of a described device, the secondary bus
> > > > number of the device, and the subordinate bus number of the device.
> > > > For describing the device, I'm envisioning stealing from the DMAR
> > > > definition, which already includes a bus number invariant mechanism to
> > > > describe a device, starting with a segment and root bus, follow a chain
> > > > of devfns to get to the target device.  Therefore the guest firmware
> > > > would follow the path to the described device, pick the desired bus
> > > > number, and write it to the indicated table offset.
> > > > 
> > > > Does this seem like a reasonable approach?  Better ideas?  I'm not
> > > > thrilled with the increased scope demanded by IVRS support, but so long
> > > > as we have an AMD IOMMU model, I don't see how to avoid it.  Thanks,  
> > > 
> > > I don't have a better idea yet, but just want to say that accidentally
> > > I was trying to look into this as well starting from this week and I'd
> > > say that's mostly what I thought about too (I was still reading a bit
> > > seabios when I saw this email)... so at least this idea makes sense to
> > > me.
> > > 
> > > Would the guest OS still change the PCI bus number even after the
> > > firmware (BIOS/UEFI)?  Could I ask in what case would that happen?
> > > 
> > > Thanks,  
> > 
> > Guest OSes can in theory rebalance resources. Changing bus numbers
> > would be useful if new bridges are added by hotplug.
> > In practice at least Linux doesn't do the rebalancing.
> > I think that if we start reporting PNP OS support in BIOS then windows
> > might start doing that more aggressively.  
> 
> It's surprising me a bit...  IMHO if we allow the bus number to change
> then at least many scripts can even fail which might work before.
> E.g. , a very common script can run "lspci-like" program to list each
> device and then do "lspci-like -vvv" again upon the BDF it fetched
> from previous commands.  Any kind of BDF caching would be invalid
> since that from either userspace or kernel.
> 
> Also, obviously the data to be stored in IVRS is closely bound to how
> bus number is defined.  Even if we can add a new linker-loader command
> to all the open firmwares like seabios or OVMF but still we can't do
> that to Windows (or, could we?...).
> 
> Now one step back, I'm also curious on the reason behind on why AMD
> spec required the IVRS with BDF information, rather than the scope
> information like what Intel DMAR spec was asking for.

It's a deficiency of the IVRS spec, but it's really out of scope here.
It's not the responsibility of the hypervisor to resolve this sort of
design issue, we should simply maintain the bare metal behavior and the
bare metal limitations of the design.

Michael did invoke some interesting ideas regarding QEMU updating the
IRVS table though.  QEMU does know when bus apertures are programmed on
devices and the config writes for these updates could trigger IVRS
updates.  I think we'd want to allow such updates only between machine
reset and the guest firmware writing the table checksum.  This reduces
the scope of the necessary changes, though still feels a little messy
to have these config writes making table updates.

Another approach, and maybe what Michael was really suggesting, is that
we essentially create the ACPI tables twice AFAICT.  Once initially,
then again via a select callback in fw_cfg.  For SeaBIOS, it looks like
this second generation would be created after the PCI bus has been
enumerated and initialized.  I've been trying to see if the same is
likely for OVMF, though it's not clear to me that this is a reasonable
ordering to rely on.  It would be entirely reasonable that firmware
could process ACPI tables in advance of enumerating PCI, even
potentially as a prerequisite to enumerating PCI.  So ultimately I'm not
sure if there are valid ordering assumptions to use these callbacks
this way, though I'd appreciate any further discussion.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
  2019-07-24 14:43             ` Alex Williamson
@ 2019-07-24 19:42               ` Alex Williamson
  2019-07-25 14:34                 ` Singh, Brijesh
  2019-07-25  6:37               ` Peter Xu
  1 sibling, 1 reply; 13+ messages in thread
From: Alex Williamson @ 2019-07-24 19:42 UTC (permalink / raw)
  To: Peter Xu
  Cc: Singh, Brijesh, Michael S. Tsirkin, qemu-devel, eric.auger,
	Suthikulpanit, Suravee

On Wed, 24 Jul 2019 08:43:55 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Wed, 24 Jul 2019 18:03:31 +0800
> Peter Xu <zhexu@redhat.com> wrote:
> 
> > On Wed, Jul 24, 2019 at 05:39:22AM -0400, Michael S. Tsirkin wrote:  
> > > On Wed, Jul 24, 2019 at 03:14:39PM +0800, Peter Xu wrote:    
> > > > On Tue, Jul 23, 2019 at 11:26:18AM -0600, Alex Williamson wrote:    
> > > > > > On 3/29/19 11:49 AM, Alex Williamson wrote:    
> > > > > > > [Cc +Brijesh]
> > > > > > > 
> > > > > > > Hi Brijesh, will the change below require the IVRS to be updated to
> > > > > > > include aliases for all BDF ranges behind a conventional bridge?  I
> > > > > > > think the Linux code handles this regardless of the firmware provided
> > > > > > > aliases, but is it required per spec for the ACPI tables to include
> > > > > > > bridge aliases?  Thanks,
[snip]

For a data point, I fired up an old 990FX system which includes a
PCIe-to-PCI bridge and I added a plugin PCIe-to-PCI bridge just to keep
things interesting... guess how many alias ranges are in the IVRS...
Yep, just the one built into the motherboard:

AMD-Vi: Using IVHD type 0x10
AMD-Vi: device: 00:00.2 cap: 0040 seg: 0 flags: 3e info 1300
AMD-Vi:        mmio-addr: 00000000fec30000
AMD-Vi:   DEV_SELECT_RANGE_START	 devid: 00:00.0 flags: 00
AMD-Vi:   DEV_RANGE_END		 devid: 00:00.2
AMD-Vi:   DEV_SELECT			 devid: 00:09.0 flags: 00
AMD-Vi:   DEV_SELECT			 devid: 01:00.0 flags: 00
AMD-Vi:   DEV_SELECT			 devid: 00:0a.0 flags: 00
AMD-Vi:   DEV_SELECT			 devid: 02:00.0 flags: 00
AMD-Vi:   DEV_SELECT			 devid: 00:11.0 flags: 00
AMD-Vi:   DEV_SELECT_RANGE_START	 devid: 00:12.0 flags: 00
AMD-Vi:   DEV_RANGE_END		 devid: 00:12.2
AMD-Vi:   DEV_SELECT_RANGE_START	 devid: 00:13.0 flags: 00
AMD-Vi:   DEV_RANGE_END		 devid: 00:13.2
AMD-Vi:   DEV_SELECT			 devid: 00:14.0 flags: d7
AMD-Vi:   DEV_SELECT			 devid: 00:14.2 flags: 00
AMD-Vi:   DEV_SELECT			 devid: 00:14.3 flags: 00
AMD-Vi:   DEV_SELECT			 devid: 00:14.4 flags: 00
AMD-Vi:   DEV_ALIAS_RANGE		 devid: 03:00.0 flags: 00 devid_to: 00:14.4
AMD-Vi:   DEV_RANGE_END		 devid: 03:1f.7

[Everything on bus 03:xx.x is aliased to device 00:14.4, the builtin PCIe-to-PCI bridge]

AMD-Vi:   DEV_SELECT			 devid: 00:14.5 flags: 00
AMD-Vi:   DEV_SELECT			 devid: 00:15.0 flags: 00
AMD-Vi:   DEV_SELECT_RANGE_START	 devid: 04:00.0 flags: 00
AMD-Vi:   DEV_RANGE_END		 devid: 04:1f.7
AMD-Vi:   DEV_SELECT			 devid: 00:15.1 flags: 00
AMD-Vi:   DEV_SELECT_RANGE_START	 devid: 05:00.0 flags: 00
AMD-Vi:   DEV_RANGE_END		 devid: 05:1f.7
AMD-Vi:   DEV_SELECT			 devid: 00:15.2 flags: 00
AMD-Vi:   DEV_SELECT_RANGE_START	 devid: 06:00.0 flags: 00
AMD-Vi:   DEV_RANGE_END		 devid: 06:1f.7
AMD-Vi:   DEV_SELECT			 devid: 00:15.3 flags: 00
AMD-Vi:   DEV_SELECT_RANGE_START	 devid: 08:00.0 flags: 00
AMD-Vi:   DEV_RANGE_END		 devid: 08:1f.7
AMD-Vi:   DEV_SELECT_RANGE_START	 devid: 00:16.0 flags: 00
AMD-Vi:   DEV_RANGE_END		 devid: 00:16.2
AMD-Vi:   DEV_SPECIAL(IOAPIC[8])		devid: 00:14.0
AMD-Vi:   DEV_SPECIAL(HPET[0])		devid: 00:14.0
AMD-Vi:   DEV_SPECIAL(IOAPIC[8])		devid: 00:00.1

-[0000:00]-+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] RD9x0/RX980 Host Bridge
           +-00.2  Advanced Micro Devices, Inc. [AMD/ATI] RD890S/RD990 I/O Memory Management Unit (IOMMU)
           +-09.0-[01]----00.0  Etron Technology, Inc. EJ168 USB 3.0 Host Controller
           +-0a.0-[02]----00.0  Marvell Technology Group Ltd. 88SE9172 SATA 6Gb/s Controller
           +-11.0  Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode]
           +-12.0  Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller
           +-12.2  Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller
           +-13.0  Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller
           +-13.2  Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller
           +-14.0  Advanced Micro Devices, Inc. [AMD/ATI] SBx00 SMBus Controller
           +-14.2  Advanced Micro Devices, Inc. [AMD/ATI] SBx00 Azalia (Intel HDA)
           +-14.3  Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 LPC host controller
           +-14.4-[03]--+-06.0  NVidia / SGS Thomson (Joint Venture) Riva128
           |            \-0e.0  VIA Technologies, Inc. VT6306/7/8 [Fire II(M)] IEEE 1394 OHCI Controller
           +-14.5  Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI2 Controller
           +-15.0-[04]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
           +-15.1-[05]----00.0  Etron Technology, Inc. EJ168 USB 3.0 Host Controller
           +-15.2-[06-07]----00.0-[07]----00.0  Realtek Semiconductor Co., Ltd. Device 8149
           +-15.3-[08]--
           +-16.0  Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller
           +-16.2  Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller
           +-18.0  Advanced Micro Devices, Inc. [AMD] Family 10h Processor HyperTransport Configuration
           +-18.1  Advanced Micro Devices, Inc. [AMD] Family 10h Processor Address Map
           +-18.2  Advanced Micro Devices, Inc. [AMD] Family 10h Processor DRAM Controller
           +-18.3  Advanced Micro Devices, Inc. [AMD] Family 10h Processor Miscellaneous Control
           \-18.4  Advanced Micro Devices, Inc. [AMD] Family 10h Processor Link Control

00:14.4 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 PCI to PCI Bridge (rev 40) (prog-if 01 [Subtractive decode])
	Bus: primary=00, secondary=03, subordinate=03, sec-latency=64

06:00.0 PCI bridge: ASMedia Technology Inc. ASM1083/1085 PCIe to PCI Bridge (rev 03) (prog-if 00 [Normal decode])
	Bus: primary=06, secondary=07, subordinate=07, sec-latency=32
	Capabilities: [80] Express (v1) PCI-Express to PCI/PCI-X Bridge, MSI 00

I realized as I was writing, that the alias caused by 06:00.0 would be
devfn 00.0 on the secondary bus 07, ie. 07:00.0 would alias to
07:00.0, so technically there's really not an alias for this small
case.  So I replaced the NIC with this:

           +-15.2-[06-07]----00.0-[07]--+-00.0  NEC Corporation OHCI USB Controller
                                        +-00.1  NEC Corporation OHCI USB Controller
                                        \-00.2  NEC Corporation uPD72010x USB 2.0 Controller

Now functions 07:00.[12] also alias to 07:00.0.  The IVRS table is
unaffected.

I'm tempted to say that QEMU would be no worse than bare metal if we
simply ignore IVHD device alias entries.  I know that Linux will figure
out the aliasing regardless of the IVRS.  Will Windows guests?  I'd
love to hear from Bijesh or Suravee regarding the behavior above versus
what AMD expects from system vendors.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
  2019-07-24 19:42               ` Alex Williamson
@ 2019-07-25 14:34                 ` Singh, Brijesh
  0 siblings, 0 replies; 13+ messages in thread
From: Singh, Brijesh @ 2019-07-25 14:34 UTC (permalink / raw)
  To: Alex Williamson, Peter Xu
  Cc: Singh, Brijesh, Michael S. Tsirkin, qemu-devel, eric.auger,
	Suthikulpanit, Suravee



On 7/24/19 2:42 PM, Alex Williamson wrote:
> On Wed, 24 Jul 2019 08:43:55 -0600
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
>> On Wed, 24 Jul 2019 18:03:31 +0800
>> Peter Xu <zhexu@redhat.com> wrote:
>>
>>> On Wed, Jul 24, 2019 at 05:39:22AM -0400, Michael S. Tsirkin wrote:
>>>> On Wed, Jul 24, 2019 at 03:14:39PM +0800, Peter Xu wrote:
>>>>> On Tue, Jul 23, 2019 at 11:26:18AM -0600, Alex Williamson wrote:
>>>>>>> On 3/29/19 11:49 AM, Alex Williamson wrote:
>>>>>>>> [Cc +Brijesh]
>>>>>>>>
>>>>>>>> Hi Brijesh, will the change below require the IVRS to be updated to
>>>>>>>> include aliases for all BDF ranges behind a conventional bridge?  I
>>>>>>>> think the Linux code handles this regardless of the firmware provided
>>>>>>>> aliases, but is it required per spec for the ACPI tables to include
>>>>>>>> bridge aliases?  Thanks,
> [snip]
> 
> For a data point, I fired up an old 990FX system which includes a
> PCIe-to-PCI bridge and I added a plugin PCIe-to-PCI bridge just to keep
> things interesting... guess how many alias ranges are in the IVRS...
> Yep, just the one built into the motherboard:
> 
> AMD-Vi: Using IVHD type 0x10
> AMD-Vi: device: 00:00.2 cap: 0040 seg: 0 flags: 3e info 1300
> AMD-Vi:        mmio-addr: 00000000fec30000
> AMD-Vi:   DEV_SELECT_RANGE_START	 devid: 00:00.0 flags: 00
> AMD-Vi:   DEV_RANGE_END		 devid: 00:00.2
> AMD-Vi:   DEV_SELECT			 devid: 00:09.0 flags: 00
> AMD-Vi:   DEV_SELECT			 devid: 01:00.0 flags: 00
> AMD-Vi:   DEV_SELECT			 devid: 00:0a.0 flags: 00
> AMD-Vi:   DEV_SELECT			 devid: 02:00.0 flags: 00
> AMD-Vi:   DEV_SELECT			 devid: 00:11.0 flags: 00
> AMD-Vi:   DEV_SELECT_RANGE_START	 devid: 00:12.0 flags: 00
> AMD-Vi:   DEV_RANGE_END		 devid: 00:12.2
> AMD-Vi:   DEV_SELECT_RANGE_START	 devid: 00:13.0 flags: 00
> AMD-Vi:   DEV_RANGE_END		 devid: 00:13.2
> AMD-Vi:   DEV_SELECT			 devid: 00:14.0 flags: d7
> AMD-Vi:   DEV_SELECT			 devid: 00:14.2 flags: 00
> AMD-Vi:   DEV_SELECT			 devid: 00:14.3 flags: 00
> AMD-Vi:   DEV_SELECT			 devid: 00:14.4 flags: 00
> AMD-Vi:   DEV_ALIAS_RANGE		 devid: 03:00.0 flags: 00 devid_to: 00:14.4
> AMD-Vi:   DEV_RANGE_END		 devid: 03:1f.7
> 
> [Everything on bus 03:xx.x is aliased to device 00:14.4, the builtin PCIe-to-PCI bridge]
> 
> AMD-Vi:   DEV_SELECT			 devid: 00:14.5 flags: 00
> AMD-Vi:   DEV_SELECT			 devid: 00:15.0 flags: 00
> AMD-Vi:   DEV_SELECT_RANGE_START	 devid: 04:00.0 flags: 00
> AMD-Vi:   DEV_RANGE_END		 devid: 04:1f.7
> AMD-Vi:   DEV_SELECT			 devid: 00:15.1 flags: 00
> AMD-Vi:   DEV_SELECT_RANGE_START	 devid: 05:00.0 flags: 00
> AMD-Vi:   DEV_RANGE_END		 devid: 05:1f.7
> AMD-Vi:   DEV_SELECT			 devid: 00:15.2 flags: 00
> AMD-Vi:   DEV_SELECT_RANGE_START	 devid: 06:00.0 flags: 00
> AMD-Vi:   DEV_RANGE_END		 devid: 06:1f.7
> AMD-Vi:   DEV_SELECT			 devid: 00:15.3 flags: 00
> AMD-Vi:   DEV_SELECT_RANGE_START	 devid: 08:00.0 flags: 00
> AMD-Vi:   DEV_RANGE_END		 devid: 08:1f.7
> AMD-Vi:   DEV_SELECT_RANGE_START	 devid: 00:16.0 flags: 00
> AMD-Vi:   DEV_RANGE_END		 devid: 00:16.2
> AMD-Vi:   DEV_SPECIAL(IOAPIC[8])		devid: 00:14.0
> AMD-Vi:   DEV_SPECIAL(HPET[0])		devid: 00:14.0
> AMD-Vi:   DEV_SPECIAL(IOAPIC[8])		devid: 00:00.1
> 
> -[0000:00]-+-00.0  Advanced Micro Devices, Inc. [AMD/ATI] RD9x0/RX980 Host Bridge
>             +-00.2  Advanced Micro Devices, Inc. [AMD/ATI] RD890S/RD990 I/O Memory Management Unit (IOMMU)
>             +-09.0-[01]----00.0  Etron Technology, Inc. EJ168 USB 3.0 Host Controller
>             +-0a.0-[02]----00.0  Marvell Technology Group Ltd. 88SE9172 SATA 6Gb/s Controller
>             +-11.0  Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode]
>             +-12.0  Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller
>             +-12.2  Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller
>             +-13.0  Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller
>             +-13.2  Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller
>             +-14.0  Advanced Micro Devices, Inc. [AMD/ATI] SBx00 SMBus Controller
>             +-14.2  Advanced Micro Devices, Inc. [AMD/ATI] SBx00 Azalia (Intel HDA)
>             +-14.3  Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 LPC host controller
>             +-14.4-[03]--+-06.0  NVidia / SGS Thomson (Joint Venture) Riva128
>             |            \-0e.0  VIA Technologies, Inc. VT6306/7/8 [Fire II(M)] IEEE 1394 OHCI Controller
>             +-14.5  Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI2 Controller
>             +-15.0-[04]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
>             +-15.1-[05]----00.0  Etron Technology, Inc. EJ168 USB 3.0 Host Controller
>             +-15.2-[06-07]----00.0-[07]----00.0  Realtek Semiconductor Co., Ltd. Device 8149
>             +-15.3-[08]--
>             +-16.0  Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller
>             +-16.2  Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller
>             +-18.0  Advanced Micro Devices, Inc. [AMD] Family 10h Processor HyperTransport Configuration
>             +-18.1  Advanced Micro Devices, Inc. [AMD] Family 10h Processor Address Map
>             +-18.2  Advanced Micro Devices, Inc. [AMD] Family 10h Processor DRAM Controller
>             +-18.3  Advanced Micro Devices, Inc. [AMD] Family 10h Processor Miscellaneous Control
>             \-18.4  Advanced Micro Devices, Inc. [AMD] Family 10h Processor Link Control
> 
> 00:14.4 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 PCI to PCI Bridge (rev 40) (prog-if 01 [Subtractive decode])
> 	Bus: primary=00, secondary=03, subordinate=03, sec-latency=64
> 
> 06:00.0 PCI bridge: ASMedia Technology Inc. ASM1083/1085 PCIe to PCI Bridge (rev 03) (prog-if 00 [Normal decode])
> 	Bus: primary=06, secondary=07, subordinate=07, sec-latency=32
> 	Capabilities: [80] Express (v1) PCI-Express to PCI/PCI-X Bridge, MSI 00
> 
> I realized as I was writing, that the alias caused by 06:00.0 would be
> devfn 00.0 on the secondary bus 07, ie. 07:00.0 would alias to
> 07:00.0, so technically there's really not an alias for this small
> case.  So I replaced the NIC with this:
> 
>             +-15.2-[06-07]----00.0-[07]--+-00.0  NEC Corporation OHCI USB Controller
>                                          +-00.1  NEC Corporation OHCI USB Controller
>                                          \-00.2  NEC Corporation uPD72010x USB 2.0 Controller
> 
> Now functions 07:00.[12] also alias to 07:00.0.  The IVRS table is
> unaffected.
> 
> I'm tempted to say that QEMU would be no worse than bare metal if we
> simply ignore IVHD device alias entries.  I know that Linux will figure
> out the aliasing regardless of the IVRS.  Will Windows guests?  I'd
> love to hear from Bijesh or Suravee regarding the behavior above versus
> what AMD expects from system vendors.  Thanks,
> 

Alex,

I am not well versed on the expected IOMMU address space behavior yet,
Suravee will know more about. I will ask him to take a look at this
thread.

thanks

> Alex
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
  2019-07-24 14:43             ` Alex Williamson
  2019-07-24 19:42               ` Alex Williamson
@ 2019-07-25  6:37               ` Peter Xu
  2019-07-25 10:43                 ` Michael S. Tsirkin
  1 sibling, 1 reply; 13+ messages in thread
From: Peter Xu @ 2019-07-25  6:37 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Singh, Brijesh, Suthikulpanit, Suravee, Michael S. Tsirkin,
	qemu-devel, Peter Xu, eric.auger

On Wed, Jul 24, 2019 at 08:43:55AM -0600, Alex Williamson wrote:
> On Wed, 24 Jul 2019 18:03:31 +0800
> Peter Xu <zhexu@redhat.com> wrote:
> 
> > On Wed, Jul 24, 2019 at 05:39:22AM -0400, Michael S. Tsirkin wrote:
> > > On Wed, Jul 24, 2019 at 03:14:39PM +0800, Peter Xu wrote:  
> > > > On Tue, Jul 23, 2019 at 11:26:18AM -0600, Alex Williamson wrote:  
> > > > > > On 3/29/19 11:49 AM, Alex Williamson wrote:  
> > > > > > > [Cc +Brijesh]
> > > > > > > 
> > > > > > > Hi Brijesh, will the change below require the IVRS to be updated to
> > > > > > > include aliases for all BDF ranges behind a conventional bridge?  I
> > > > > > > think the Linux code handles this regardless of the firmware provided
> > > > > > > aliases, but is it required per spec for the ACPI tables to include
> > > > > > > bridge aliases?  Thanks,
> > > > > > >     
> > > > > > 
> > > > > > We do need to includes aliases in ACPI table. We need to populate the
> > > > > > IVHD type 0x43 and 0x4 for alias range start and end. I believe host
> > > > > > IVRS would contain similar information.
> > > > > > 
> > > > > > Suravee, please correct me if I am missing something?  
> > > > > 
> > > > > I finally found some time to investigate this a little further, yes the
> > > > > types mentioned are correct for defining start and end of an alias
> > > > > range.  The challenge here is that these entries require a DeviceID,
> > > > > which is defined as a BDF, AIUI.  The IVRS is created in QEMU, but bus
> > > > > numbers are defined by the guest firmware, and potentially redefined by
> > > > > the guest OS.  This makes it non-trivial to insert a few IVHDs into the
> > > > > IVRS to describe alias ranges.  I'm wondering if the solution here is
> > > > > to define a new linker-loader command that would instruct the guest to
> > > > > write a bus number byte to a given offset for a described device.
> > > > > These commands would be inserted before the checksum command, such that
> > > > > these bus number updates are calculated as part of the checksum.
> > > > > 
> > > > > I'm imagining the command format would need to be able to distinguish
> > > > > between the actual bus number of a described device, the secondary bus
> > > > > number of the device, and the subordinate bus number of the device.
> > > > > For describing the device, I'm envisioning stealing from the DMAR
> > > > > definition, which already includes a bus number invariant mechanism to
> > > > > describe a device, starting with a segment and root bus, follow a chain
> > > > > of devfns to get to the target device.  Therefore the guest firmware
> > > > > would follow the path to the described device, pick the desired bus
> > > > > number, and write it to the indicated table offset.
> > > > > 
> > > > > Does this seem like a reasonable approach?  Better ideas?  I'm not
> > > > > thrilled with the increased scope demanded by IVRS support, but so long
> > > > > as we have an AMD IOMMU model, I don't see how to avoid it.  Thanks,  
> > > > 
> > > > I don't have a better idea yet, but just want to say that accidentally
> > > > I was trying to look into this as well starting from this week and I'd
> > > > say that's mostly what I thought about too (I was still reading a bit
> > > > seabios when I saw this email)... so at least this idea makes sense to
> > > > me.
> > > > 
> > > > Would the guest OS still change the PCI bus number even after the
> > > > firmware (BIOS/UEFI)?  Could I ask in what case would that happen?
> > > > 
> > > > Thanks,  
> > > 
> > > Guest OSes can in theory rebalance resources. Changing bus numbers
> > > would be useful if new bridges are added by hotplug.
> > > In practice at least Linux doesn't do the rebalancing.
> > > I think that if we start reporting PNP OS support in BIOS then windows
> > > might start doing that more aggressively.  
> > 
> > It's surprising me a bit...  IMHO if we allow the bus number to change
> > then at least many scripts can even fail which might work before.
> > E.g. , a very common script can run "lspci-like" program to list each
> > device and then do "lspci-like -vvv" again upon the BDF it fetched
> > from previous commands.  Any kind of BDF caching would be invalid
> > since that from either userspace or kernel.
> > 
> > Also, obviously the data to be stored in IVRS is closely bound to how
> > bus number is defined.  Even if we can add a new linker-loader command
> > to all the open firmwares like seabios or OVMF but still we can't do
> > that to Windows (or, could we?...).
> > 
> > Now one step back, I'm also curious on the reason behind on why AMD
> > spec required the IVRS with BDF information, rather than the scope
> > information like what Intel DMAR spec was asking for.
> 
> It's a deficiency of the IVRS spec, but it's really out of scope here.
> It's not the responsibility of the hypervisor to resolve this sort of
> design issue, we should simply maintain the bare metal behavior and the
> bare metal limitations of the design.

Yes this is a better perspective.  It's not the first time I totally
forgot to go back to reference the bare-metal as that's what we're
emulating and normally that'll have very similar problem.  And the
"data point" email does give a proper supporting material for this.

> 
> Michael did invoke some interesting ideas regarding QEMU updating the
> IRVS table though.  QEMU does know when bus apertures are programmed on
> devices and the config writes for these updates could trigger IVRS
> updates.  I think we'd want to allow such updates only between machine
> reset and the guest firmware writing the table checksum.  This reduces
> the scope of the necessary changes, though still feels a little messy
> to have these config writes making table updates.
> 
> Another approach, and maybe what Michael was really suggesting, is that
> we essentially create the ACPI tables twice AFAICT.  Once initially,
> then again via a select callback in fw_cfg.  For SeaBIOS, it looks like
> this second generation would be created after the PCI bus has been
> enumerated and initialized.  I've been trying to see if the same is
> likely for OVMF, though it's not clear to me that this is a reasonable
> ordering to rely on.  It would be entirely reasonable that firmware
> could process ACPI tables in advance of enumerating PCI, even
> potentially as a prerequisite to enumerating PCI.  So ultimately I'm not
> sure if there are valid ordering assumptions to use these callbacks
> this way, though I'd appreciate any further discussion.  Thanks,

After re-read Michael's reply, I feel like what Michael suggested is
that we can simply ignore the bus-number-change case by the guest OS
for now, but I might be wrong.  In all cases, this will be a problem
only if "we still need to fill in the BDF information" somehow, while
that statement seems to be questionable now.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
  2019-07-25  6:37               ` Peter Xu
@ 2019-07-25 10:43                 ` Michael S. Tsirkin
  2019-07-25 14:00                   ` Alex Williamson
  0 siblings, 1 reply; 13+ messages in thread
From: Michael S. Tsirkin @ 2019-07-25 10:43 UTC (permalink / raw)
  To: Peter Xu
  Cc: Singh, Brijesh, Suthikulpanit, Suravee, qemu-devel, eric.auger,
	Alex Williamson

On Thu, Jul 25, 2019 at 02:37:33PM +0800, Peter Xu wrote:
> After re-read Michael's reply, I feel like what Michael suggested is
> that we can simply ignore the bus-number-change case by the guest OS
> for now, but I might be wrong.
That's what I suggested, yes.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
  2019-07-25 10:43                 ` Michael S. Tsirkin
@ 2019-07-25 14:00                   ` Alex Williamson
  2019-07-25 15:22                     ` Michael S. Tsirkin
  0 siblings, 1 reply; 13+ messages in thread
From: Alex Williamson @ 2019-07-25 14:00 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Singh, Brijesh, qemu-devel, Peter Xu, eric.auger, Suthikulpanit, Suravee

On Thu, 25 Jul 2019 06:43:04 -0400
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Thu, Jul 25, 2019 at 02:37:33PM +0800, Peter Xu wrote:
> > After re-read Michael's reply, I feel like what Michael suggested is
> > that we can simply ignore the bus-number-change case by the guest OS
> > for now, but I might be wrong.  
> That's what I suggested, yes.

"by the guest OS", yes, that's the part that's beyond the scope of this
effort.  If we deem it necessary to support IVHD DMA alias though, it's
the guest firmware that determines the initial bus numbers, which is
part of the DeviceID used to defined IVHD entries and we would not be
able to ignore that initial programming.  Everything related to the
guest OS renumber PCI buses is out of scope, the guest firmware
programming initial bus numbers is not.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
  2019-07-25 14:00                   ` Alex Williamson
@ 2019-07-25 15:22                     ` Michael S. Tsirkin
  0 siblings, 0 replies; 13+ messages in thread
From: Michael S. Tsirkin @ 2019-07-25 15:22 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Singh, Brijesh, qemu-devel, Peter Xu, eric.auger, Suthikulpanit, Suravee

On Thu, Jul 25, 2019 at 08:00:23AM -0600, Alex Williamson wrote:
> On Thu, 25 Jul 2019 06:43:04 -0400
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Thu, Jul 25, 2019 at 02:37:33PM +0800, Peter Xu wrote:
> > > After re-read Michael's reply, I feel like what Michael suggested is
> > > that we can simply ignore the bus-number-change case by the guest OS
> > > for now, but I might be wrong.  
> > That's what I suggested, yes.
> 
> "by the guest OS", yes, that's the part that's beyond the scope of this
> effort.  If we deem it necessary to support IVHD DMA alias though, it's
> the guest firmware that determines the initial bus numbers, which is
> part of the DeviceID used to defined IVHD entries and we would not be
> able to ignore that initial programming.  Everything related to the
> guest OS renumber PCI buses is out of scope, the guest firmware
> programming initial bus numbers is not.  

Right. That's par for the course, we have same issues with MCFG
and others. bios programs it and then we generate acpi based on
that.

>Thanks,
> 
> Alex

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2019-07-25 15:22 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <155364082689.15803.7062874513041742278.stgit@gimli.home>
     [not found] ` <20190329104904.450fefef@x1.home>
2019-04-01 13:41   ` [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space Singh, Brijesh
2019-07-23 17:26     ` Alex Williamson
2019-07-23 18:45       ` Michael S. Tsirkin
2019-07-24  7:14       ` Peter Xu
2019-07-24  9:39         ` Michael S. Tsirkin
2019-07-24 10:03           ` Peter Xu
2019-07-24 14:43             ` Alex Williamson
2019-07-24 19:42               ` Alex Williamson
2019-07-25 14:34                 ` Singh, Brijesh
2019-07-25  6:37               ` Peter Xu
2019-07-25 10:43                 ` Michael S. Tsirkin
2019-07-25 14:00                   ` Alex Williamson
2019-07-25 15:22                     ` Michael S. Tsirkin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.