All of lore.kernel.org
 help / color / mirror / Atom feed
From: Peter Xu <zhexu@redhat.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: "Singh, Brijesh" <brijesh.singh@amd.com>,
	"Suthikulpanit, Suravee" <Suravee.Suthikulpanit@amd.com>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	Peter Xu <zhexu@redhat.com>,
	"eric.auger@redhat.com" <eric.auger@redhat.com>
Subject: Re: [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space
Date: Thu, 25 Jul 2019 14:37:33 +0800	[thread overview]
Message-ID: <20190725063733.GH14454@xz-x1> (raw)
In-Reply-To: <20190724084355.627d44cf@x1.home>

On Wed, Jul 24, 2019 at 08:43:55AM -0600, Alex Williamson wrote:
> On Wed, 24 Jul 2019 18:03:31 +0800
> Peter Xu <zhexu@redhat.com> wrote:
> 
> > On Wed, Jul 24, 2019 at 05:39:22AM -0400, Michael S. Tsirkin wrote:
> > > On Wed, Jul 24, 2019 at 03:14:39PM +0800, Peter Xu wrote:  
> > > > On Tue, Jul 23, 2019 at 11:26:18AM -0600, Alex Williamson wrote:  
> > > > > > On 3/29/19 11:49 AM, Alex Williamson wrote:  
> > > > > > > [Cc +Brijesh]
> > > > > > > 
> > > > > > > Hi Brijesh, will the change below require the IVRS to be updated to
> > > > > > > include aliases for all BDF ranges behind a conventional bridge?  I
> > > > > > > think the Linux code handles this regardless of the firmware provided
> > > > > > > aliases, but is it required per spec for the ACPI tables to include
> > > > > > > bridge aliases?  Thanks,
> > > > > > >     
> > > > > > 
> > > > > > We do need to includes aliases in ACPI table. We need to populate the
> > > > > > IVHD type 0x43 and 0x4 for alias range start and end. I believe host
> > > > > > IVRS would contain similar information.
> > > > > > 
> > > > > > Suravee, please correct me if I am missing something?  
> > > > > 
> > > > > I finally found some time to investigate this a little further, yes the
> > > > > types mentioned are correct for defining start and end of an alias
> > > > > range.  The challenge here is that these entries require a DeviceID,
> > > > > which is defined as a BDF, AIUI.  The IVRS is created in QEMU, but bus
> > > > > numbers are defined by the guest firmware, and potentially redefined by
> > > > > the guest OS.  This makes it non-trivial to insert a few IVHDs into the
> > > > > IVRS to describe alias ranges.  I'm wondering if the solution here is
> > > > > to define a new linker-loader command that would instruct the guest to
> > > > > write a bus number byte to a given offset for a described device.
> > > > > These commands would be inserted before the checksum command, such that
> > > > > these bus number updates are calculated as part of the checksum.
> > > > > 
> > > > > I'm imagining the command format would need to be able to distinguish
> > > > > between the actual bus number of a described device, the secondary bus
> > > > > number of the device, and the subordinate bus number of the device.
> > > > > For describing the device, I'm envisioning stealing from the DMAR
> > > > > definition, which already includes a bus number invariant mechanism to
> > > > > describe a device, starting with a segment and root bus, follow a chain
> > > > > of devfns to get to the target device.  Therefore the guest firmware
> > > > > would follow the path to the described device, pick the desired bus
> > > > > number, and write it to the indicated table offset.
> > > > > 
> > > > > Does this seem like a reasonable approach?  Better ideas?  I'm not
> > > > > thrilled with the increased scope demanded by IVRS support, but so long
> > > > > as we have an AMD IOMMU model, I don't see how to avoid it.  Thanks,  
> > > > 
> > > > I don't have a better idea yet, but just want to say that accidentally
> > > > I was trying to look into this as well starting from this week and I'd
> > > > say that's mostly what I thought about too (I was still reading a bit
> > > > seabios when I saw this email)... so at least this idea makes sense to
> > > > me.
> > > > 
> > > > Would the guest OS still change the PCI bus number even after the
> > > > firmware (BIOS/UEFI)?  Could I ask in what case would that happen?
> > > > 
> > > > Thanks,  
> > > 
> > > Guest OSes can in theory rebalance resources. Changing bus numbers
> > > would be useful if new bridges are added by hotplug.
> > > In practice at least Linux doesn't do the rebalancing.
> > > I think that if we start reporting PNP OS support in BIOS then windows
> > > might start doing that more aggressively.  
> > 
> > It's surprising me a bit...  IMHO if we allow the bus number to change
> > then at least many scripts can even fail which might work before.
> > E.g. , a very common script can run "lspci-like" program to list each
> > device and then do "lspci-like -vvv" again upon the BDF it fetched
> > from previous commands.  Any kind of BDF caching would be invalid
> > since that from either userspace or kernel.
> > 
> > Also, obviously the data to be stored in IVRS is closely bound to how
> > bus number is defined.  Even if we can add a new linker-loader command
> > to all the open firmwares like seabios or OVMF but still we can't do
> > that to Windows (or, could we?...).
> > 
> > Now one step back, I'm also curious on the reason behind on why AMD
> > spec required the IVRS with BDF information, rather than the scope
> > information like what Intel DMAR spec was asking for.
> 
> It's a deficiency of the IVRS spec, but it's really out of scope here.
> It's not the responsibility of the hypervisor to resolve this sort of
> design issue, we should simply maintain the bare metal behavior and the
> bare metal limitations of the design.

Yes this is a better perspective.  It's not the first time I totally
forgot to go back to reference the bare-metal as that's what we're
emulating and normally that'll have very similar problem.  And the
"data point" email does give a proper supporting material for this.

> 
> Michael did invoke some interesting ideas regarding QEMU updating the
> IRVS table though.  QEMU does know when bus apertures are programmed on
> devices and the config writes for these updates could trigger IVRS
> updates.  I think we'd want to allow such updates only between machine
> reset and the guest firmware writing the table checksum.  This reduces
> the scope of the necessary changes, though still feels a little messy
> to have these config writes making table updates.
> 
> Another approach, and maybe what Michael was really suggesting, is that
> we essentially create the ACPI tables twice AFAICT.  Once initially,
> then again via a select callback in fw_cfg.  For SeaBIOS, it looks like
> this second generation would be created after the PCI bus has been
> enumerated and initialized.  I've been trying to see if the same is
> likely for OVMF, though it's not clear to me that this is a reasonable
> ordering to rely on.  It would be entirely reasonable that firmware
> could process ACPI tables in advance of enumerating PCI, even
> potentially as a prerequisite to enumerating PCI.  So ultimately I'm not
> sure if there are valid ordering assumptions to use these callbacks
> this way, though I'd appreciate any further discussion.  Thanks,

After re-read Michael's reply, I feel like what Michael suggested is
that we can simply ignore the bus-number-change case by the guest OS
for now, but I might be wrong.  In all cases, this will be a problem
only if "we still need to fill in the BDF information" somehow, while
that statement seems to be questionable now.

Thanks,

-- 
Peter Xu


  parent reply	other threads:[~2019-07-25  6:37 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <155364082689.15803.7062874513041742278.stgit@gimli.home>
     [not found] ` <20190329104904.450fefef@x1.home>
2019-04-01 13:41   ` [Qemu-devel] [RFC PATCH] pci: Use PCI aliases when determining device IOMMU address space Singh, Brijesh
2019-07-23 17:26     ` Alex Williamson
2019-07-23 18:45       ` Michael S. Tsirkin
2019-07-24  7:14       ` Peter Xu
2019-07-24  9:39         ` Michael S. Tsirkin
2019-07-24 10:03           ` Peter Xu
2019-07-24 14:43             ` Alex Williamson
2019-07-24 19:42               ` Alex Williamson
2019-07-25 14:34                 ` Singh, Brijesh
2019-07-25  6:37               ` Peter Xu [this message]
2019-07-25 10:43                 ` Michael S. Tsirkin
2019-07-25 14:00                   ` Alex Williamson
2019-07-25 15:22                     ` Michael S. Tsirkin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190725063733.GH14454@xz-x1 \
    --to=zhexu@redhat.com \
    --cc=Suravee.Suthikulpanit@amd.com \
    --cc=alex.williamson@redhat.com \
    --cc=brijesh.singh@amd.com \
    --cc=eric.auger@redhat.com \
    --cc=mst@redhat.com \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.