All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] xen/mcfg: Call PHYSDEVOP_pci_mmcfg_reserved before PCI enumeration
@ 2015-09-15 23:29 Ed Swierk
  2015-09-21 15:58 ` Ed Swierk
  0 siblings, 1 reply; 23+ messages in thread
From: Ed Swierk @ 2015-09-15 23:29 UTC (permalink / raw)
  To: xen-devel; +Cc: Ed Swierk

On systems where the ACPI DSDT advertises the PCI MMCONFIG area but the
E820 table does not reserve it, it's up to Dom0 to inform Xen via
PHYSDEVOP_pci_mmcfg_reserved.  This needs to happen before Xen tries to
access extended capabilities like SRIOV in pci_add_device(), which is
triggered when Linux enumerates PCI devices in acpi_init().  Changing
xen_mcfg_late() to arch_initcall_sync ensures it gets called before
acpi_init(), but after pci_mmcfg_list is populated by pci_arch_init().

Without this change, Xen 4.4 and 4.5 emit WARN messsages from
msix_capability_init() when setting up Intel 82599 VFs, since vf_rlen has
not been initialized by pci_add_device().  And on Xen 4.5, Xen nukes the
DomU due to "Potentially insecure use of MSI-X" when the VF driver loads
in the DomU.  Both problems are fixed by this change.

Signed-off-by: Ed Swierk <eswierk@skyportsystems.com>
---
 drivers/xen/pci.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/xen/pci.c b/drivers/xen/pci.c
index 7494dbe..7b5bbdb 100644
--- a/drivers/xen/pci.c
+++ b/drivers/xen/pci.c
@@ -253,7 +253,9 @@ static int __init xen_mcfg_late(void)
 	return 0;
 }
 /*
- * Needs to be done after acpi_init which are subsys_initcall.
+ * Needs to be called after pci_arch_init (arch_initcall) populates
+ * pci_mmcfg_list, but before acpi_init (subsys_initcall) triggers
+ * pci_add_device() in Xen, since the latter probes extended capabilities.
  */
-subsys_initcall_sync(xen_mcfg_late);
+arch_initcall_sync(xen_mcfg_late);
 #endif
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH] xen/mcfg: Call PHYSDEVOP_pci_mmcfg_reserved before PCI enumeration
  2015-09-15 23:29 [PATCH] xen/mcfg: Call PHYSDEVOP_pci_mmcfg_reserved before PCI enumeration Ed Swierk
@ 2015-09-21 15:58 ` Ed Swierk
  2015-09-21 18:05   ` Konrad Rzeszutek Wilk
  2015-09-22  5:21   ` Jan Beulich
  0 siblings, 2 replies; 23+ messages in thread
From: Ed Swierk @ 2015-09-21 15:58 UTC (permalink / raw)
  To: xen-devel; +Cc: Ed Swierk

After testing this change on different platforms, I'm finding some
complications.

As I understand it, the BIOS is supposed to mark the MMCONFIG area
reserved in the E820 table no matter what. And if the ACPI DSDT
includes an MCFG record, then it should also include a PNP0C0x record
for the MMCONFIG area.

Some BIOSes do just one of these. For example, the Intel S2600GL/GZ
BIOS has the correct ACPI records, but does not have the area reserved
in the E820 table. Linux is forgiving enough to use the MMCONFIG area
in this case.

Other BIOSes do neither. For example, the Intel S2600WT BIOS doesn't
even include the PNP0C0x record in the ACPI table, when TXT is
enabled. On this platform the only way to get Linux to see extended
config space at all is by passing memmap=256M$0x80000000 on the kernel
command line.

My change works in the latter case, since the memmap parameter causes
it to add the MMCONFIG area in pci_arch_init(). But in the former
case, where the ACPI tables aren't parsed until acpi_init(),
xen_mcfg_late() must come later, unless you also use the memmap kernel
command line parameter.

The fundamental problem is that Xen tries to access extended config
space in pci_add_device(), before the Dom0 finally figures out where
MMCONFIG area is and makes the pci_mmcfg_reserved hypercall. The only
robust solution seems to be for Xen to defer extended config space
accesses. It's not clear to me how late is late enough, however.

--Ed


On Tue, Sep 15, 2015 at 4:29 PM, Ed Swierk <eswierk@skyportsystems.com> wrote:
> On systems where the ACPI DSDT advertises the PCI MMCONFIG area but the
> E820 table does not reserve it, it's up to Dom0 to inform Xen via
> PHYSDEVOP_pci_mmcfg_reserved.  This needs to happen before Xen tries to
> access extended capabilities like SRIOV in pci_add_device(), which is
> triggered when Linux enumerates PCI devices in acpi_init().  Changing
> xen_mcfg_late() to arch_initcall_sync ensures it gets called before
> acpi_init(), but after pci_mmcfg_list is populated by pci_arch_init().
>
> Without this change, Xen 4.4 and 4.5 emit WARN messsages from
> msix_capability_init() when setting up Intel 82599 VFs, since vf_rlen has
> not been initialized by pci_add_device().  And on Xen 4.5, Xen nukes the
> DomU due to "Potentially insecure use of MSI-X" when the VF driver loads
> in the DomU.  Both problems are fixed by this change.
>
> Signed-off-by: Ed Swierk <eswierk@skyportsystems.com>
> ---
>  drivers/xen/pci.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/xen/pci.c b/drivers/xen/pci.c
> index 7494dbe..7b5bbdb 100644
> --- a/drivers/xen/pci.c
> +++ b/drivers/xen/pci.c
> @@ -253,7 +253,9 @@ static int __init xen_mcfg_late(void)
>         return 0;
>  }
>  /*
> - * Needs to be done after acpi_init which are subsys_initcall.
> + * Needs to be called after pci_arch_init (arch_initcall) populates
> + * pci_mmcfg_list, but before acpi_init (subsys_initcall) triggers
> + * pci_add_device() in Xen, since the latter probes extended capabilities.
>   */
> -subsys_initcall_sync(xen_mcfg_late);
> +arch_initcall_sync(xen_mcfg_late);
>  #endif
> --
> 1.9.1
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] xen/mcfg: Call PHYSDEVOP_pci_mmcfg_reserved before PCI enumeration
  2015-09-21 15:58 ` Ed Swierk
@ 2015-09-21 18:05   ` Konrad Rzeszutek Wilk
  2015-09-22  5:23     ` Jan Beulich
  2015-09-22  5:21   ` Jan Beulich
  1 sibling, 1 reply; 23+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-09-21 18:05 UTC (permalink / raw)
  To: Ed Swierk; +Cc: xen-devel

On Mon, Sep 21, 2015 at 08:58:44AM -0700, Ed Swierk wrote:
> After testing this change on different platforms, I'm finding some
> complications.
> 
> As I understand it, the BIOS is supposed to mark the MMCONFIG area
> reserved in the E820 table no matter what. And if the ACPI DSDT
> includes an MCFG record, then it should also include a PNP0C0x record
> for the MMCONFIG area.
> 
> Some BIOSes do just one of these. For example, the Intel S2600GL/GZ
> BIOS has the correct ACPI records, but does not have the area reserved
> in the E820 table. Linux is forgiving enough to use the MMCONFIG area
> in this case.
> 
> Other BIOSes do neither. For example, the Intel S2600WT BIOS doesn't
> even include the PNP0C0x record in the ACPI table, when TXT is
> enabled. On this platform the only way to get Linux to see extended
> config space at all is by passing memmap=256M$0x80000000 on the kernel
> command line.
> 
> My change works in the latter case, since the memmap parameter causes
> it to add the MMCONFIG area in pci_arch_init(). But in the former
> case, where the ACPI tables aren't parsed until acpi_init(),
> xen_mcfg_late() must come later, unless you also use the memmap kernel
> command line parameter.
> 
> The fundamental problem is that Xen tries to access extended config
> space in pci_add_device(), before the Dom0 finally figures out where
> MMCONFIG area is and makes the pci_mmcfg_reserved hypercall. The only
> robust solution seems to be for Xen to defer extended config space
> accesses. It's not clear to me how late is late enough, however.

Or it can deal with PCI config spaces changing.

I wrote some patches (floating around ) where we would walk the full PCI
structure in the xen_mcfg_late() and go over all of them and do
the Xen pci_add_device hypercall.

It (the hypercall) would then need to update its own entries on the subsequent
call. However I never got it completletly working because of:
 - Linux can do 'pci=reassign_devices' which renumbers the bus addresses and
   Xen's idea of the "old" and "new" pci devices goes out the window.
   (for that to work I wrote an patchset that implements this functionality
   in Xen so Linux's pci=reassign_devices in effect will be a nop).
 - Never figured out how much data we should save in the Xen's
   struct pci_device to see if we are 'stale'. Looking back I think
   we just need to do the interogation of the PCI capabilities and see
   if they have somehow changed (the ones we care about).


> 
> --Ed
> 
> 
> On Tue, Sep 15, 2015 at 4:29 PM, Ed Swierk <eswierk@skyportsystems.com> wrote:
> > On systems where the ACPI DSDT advertises the PCI MMCONFIG area but the
> > E820 table does not reserve it, it's up to Dom0 to inform Xen via
> > PHYSDEVOP_pci_mmcfg_reserved.  This needs to happen before Xen tries to
> > access extended capabilities like SRIOV in pci_add_device(), which is
> > triggered when Linux enumerates PCI devices in acpi_init().  Changing
> > xen_mcfg_late() to arch_initcall_sync ensures it gets called before
> > acpi_init(), but after pci_mmcfg_list is populated by pci_arch_init().
> >
> > Without this change, Xen 4.4 and 4.5 emit WARN messsages from
> > msix_capability_init() when setting up Intel 82599 VFs, since vf_rlen has
> > not been initialized by pci_add_device().  And on Xen 4.5, Xen nukes the
> > DomU due to "Potentially insecure use of MSI-X" when the VF driver loads
> > in the DomU.  Both problems are fixed by this change.
> >
> > Signed-off-by: Ed Swierk <eswierk@skyportsystems.com>
> > ---
> >  drivers/xen/pci.c | 6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/xen/pci.c b/drivers/xen/pci.c
> > index 7494dbe..7b5bbdb 100644
> > --- a/drivers/xen/pci.c
> > +++ b/drivers/xen/pci.c
> > @@ -253,7 +253,9 @@ static int __init xen_mcfg_late(void)
> >         return 0;
> >  }
> >  /*
> > - * Needs to be done after acpi_init which are subsys_initcall.
> > + * Needs to be called after pci_arch_init (arch_initcall) populates
> > + * pci_mmcfg_list, but before acpi_init (subsys_initcall) triggers
> > + * pci_add_device() in Xen, since the latter probes extended capabilities.
> >   */
> > -subsys_initcall_sync(xen_mcfg_late);
> > +arch_initcall_sync(xen_mcfg_late);
> >  #endif
> > --
> > 1.9.1
> >
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] xen/mcfg: Call PHYSDEVOP_pci_mmcfg_reserved before PCI enumeration
  2015-09-21 15:58 ` Ed Swierk
  2015-09-21 18:05   ` Konrad Rzeszutek Wilk
@ 2015-09-22  5:21   ` Jan Beulich
  2015-09-22 12:35     ` Ed Swierk
  1 sibling, 1 reply; 23+ messages in thread
From: Jan Beulich @ 2015-09-22  5:21 UTC (permalink / raw)
  To: xen-devel, eswierk

>>> Ed Swierk <eswierk@skyportsystems.com> 09/21/15 6:01 PM >>>
>The fundamental problem is that Xen tries to access extended config
>space in pci_add_device(), before the Dom0 finally figures out where
>MMCONFIG area is and makes the pci_mmcfg_reserved hypercall. The only
>robust solution seems to be for Xen to defer extended config space
>accesses. It's not clear to me how late is late enough, however.

I don't follow: Surely Dom0 first establishes MCFG areas to be used, and
only then scans the buses for devices, resulting in them to be reported to
the hypervisor?

Jan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] xen/mcfg: Call PHYSDEVOP_pci_mmcfg_reserved before PCI enumeration
  2015-09-21 18:05   ` Konrad Rzeszutek Wilk
@ 2015-09-22  5:23     ` Jan Beulich
  2015-09-22 13:36       ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 23+ messages in thread
From: Jan Beulich @ 2015-09-22  5:23 UTC (permalink / raw)
  To: konrad.wilk; +Cc: eswierk, xen-devel

>>> Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> 09/21/15 8:06 PM >>>
>- Never figured out how much data we should save in the Xen's
>struct pci_device to see if we are 'stale'. Looking back I think
>we just need to do the interogation of the PCI capabilities and see
>if they have somehow changed (the ones we care about).

Didn't we settle on no data to be needed here at all, instead requiring Dom0
to report devices removed on buses about to be re-numbered, adding them
back after the re-numbering?

Jan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] xen/mcfg: Call PHYSDEVOP_pci_mmcfg_reserved before PCI enumeration
  2015-09-22  5:21   ` Jan Beulich
@ 2015-09-22 12:35     ` Ed Swierk
  2015-09-22 12:40       ` Jan Beulich
  2015-09-22 13:26       ` Ed Swierk
  0 siblings, 2 replies; 23+ messages in thread
From: Ed Swierk @ 2015-09-22 12:35 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On Mon, Sep 21, 2015 at 10:21 PM, Jan Beulich <jbeulich@suse.com> wrote:
> I don't follow: Surely Dom0 first establishes MCFG areas to be used, and
> only then scans the buses for devices, resulting in them to be reported to
> the hypervisor?

That seems like a reasonable expectation, but while Linux 4.2 finds
the mmcfg areas before scanning the PCI bus, it doesn't tell Xen about
them until later:

acpi_init() first calls pci_mmcfg_late_init(), which searches ACPI for
mmcfgs and checks that they are reserved in E820 or ACPI (vs
pci_mmcfg_early_init() which searches both ACPI and hardcoded host
bridges but checks only E820 for reservations); then calls
acpi_scan_init() which scans the buses and calls the pci_device_add
hypercall for each device. The pci_mmcfg_reserved hypercall is invoked
later by xen_mcfg_late().

So if the contract is that Dom0 tells Xen about mmcfgs before the
devices they cover, then Linux ought to call pci_mmcfg_reserved from
(or immediately after) both pci_mmcfg_early_init() and
pci_mmcfg_late_init().

--Ed

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] xen/mcfg: Call PHYSDEVOP_pci_mmcfg_reserved before PCI enumeration
  2015-09-22 12:35     ` Ed Swierk
@ 2015-09-22 12:40       ` Jan Beulich
  2015-09-22 13:26       ` Ed Swierk
  1 sibling, 0 replies; 23+ messages in thread
From: Jan Beulich @ 2015-09-22 12:40 UTC (permalink / raw)
  To: Ed Swierk; +Cc: xen-devel

>>> On 22.09.15 at 14:35, <eswierk@skyportsystems.com> wrote:
> On Mon, Sep 21, 2015 at 10:21 PM, Jan Beulich <jbeulich@suse.com> wrote:
>> I don't follow: Surely Dom0 first establishes MCFG areas to be used, and
>> only then scans the buses for devices, resulting in them to be reported to
>> the hypervisor?
> 
> That seems like a reasonable expectation, but while Linux 4.2 finds
> the mmcfg areas before scanning the PCI bus, it doesn't tell Xen about
> them until later:
> 
> acpi_init() first calls pci_mmcfg_late_init(), which searches ACPI for
> mmcfgs and checks that they are reserved in E820 or ACPI (vs
> pci_mmcfg_early_init() which searches both ACPI and hardcoded host
> bridges but checks only E820 for reservations); then calls
> acpi_scan_init() which scans the buses and calls the pci_device_add
> hypercall for each device. The pci_mmcfg_reserved hypercall is invoked
> later by xen_mcfg_late().
> 
> So if the contract is that Dom0 tells Xen about mmcfgs before the
> devices they cover, then Linux ought to call pci_mmcfg_reserved from
> (or immediately after) both pci_mmcfg_early_init() and
> pci_mmcfg_late_init().

Correct, albeit in reality whether Xen gets notified matters only when
a range is marked properly in ACPI, but is not reported through E820
(since the latter Xen has in hands for checking anyway).

Jan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] xen/mcfg: Call PHYSDEVOP_pci_mmcfg_reserved before PCI enumeration
  2015-09-22 12:35     ` Ed Swierk
  2015-09-22 12:40       ` Jan Beulich
@ 2015-09-22 13:26       ` Ed Swierk
  2015-09-22 13:36         ` Jan Beulich
  2015-09-22 13:39         ` Konrad Rzeszutek Wilk
  1 sibling, 2 replies; 23+ messages in thread
From: Ed Swierk @ 2015-09-22 13:26 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel

On Tue, Sep 22, 2015 at 5:35 AM, Ed Swierk <eswierk@skyportsystems.com> wrote:
> So if the contract is that Dom0 tells Xen about mmcfgs before the
> devices they cover, then Linux ought to call pci_mmcfg_reserved from
> (or immediately after) both pci_mmcfg_early_init() and
> pci_mmcfg_late_init().

Brainstorming possible approaches:

I don't see an obvious way to hook into those functions (or their
callers) without injecting Xen-specific code. Is there a precedent to
follow?

Alternatively, we could just call the equivalent of xen_mcfg_late()
from the existing xen_{add,remove}_device() notifiers. This would
generate a lot of useless pci_mmcfg_reserved hypercalls (number of
devices times number of mmcfg areas), but pci_mmcfg_arch_enable() in
Xen should happily ignore the redundant ones. The advantage of this
approach other than simplicity is that it makes the mmcfg -> device
setup ordering very explicit.

Any other ideas?

--Ed

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] xen/mcfg: Call PHYSDEVOP_pci_mmcfg_reserved before PCI enumeration
  2015-09-22 13:26       ` Ed Swierk
@ 2015-09-22 13:36         ` Jan Beulich
  2015-09-22 13:39         ` Konrad Rzeszutek Wilk
  1 sibling, 0 replies; 23+ messages in thread
From: Jan Beulich @ 2015-09-22 13:36 UTC (permalink / raw)
  To: Ed Swierk; +Cc: xen-devel

>>> On 22.09.15 at 15:26, <eswierk@skyportsystems.com> wrote:
> On Tue, Sep 22, 2015 at 5:35 AM, Ed Swierk <eswierk@skyportsystems.com> wrote:
>> So if the contract is that Dom0 tells Xen about mmcfgs before the
>> devices they cover, then Linux ought to call pci_mmcfg_reserved from
>> (or immediately after) both pci_mmcfg_early_init() and
>> pci_mmcfg_late_init().
> 
> Brainstorming possible approaches:
> 
> I don't see an obvious way to hook into those functions (or their
> callers) without injecting Xen-specific code. Is there a precedent to
> follow?
> 
> Alternatively, we could just call the equivalent of xen_mcfg_late()
> from the existing xen_{add,remove}_device() notifiers. This would
> generate a lot of useless pci_mmcfg_reserved hypercalls (number of
> devices times number of mmcfg areas), but pci_mmcfg_arch_enable() in
> Xen should happily ignore the redundant ones. The advantage of this
> approach other than simplicity is that it makes the mmcfg -> device
> setup ordering very explicit.

Well, I can see this as a reasonable approach to avoid adding Xen-
specific code to generic x86 code, but I don't like such redundancy.
Note however that the redundancy could be brought down: For one
you'd only need to report the MMCFG area covering the device in
question. And then the kernel could track which ones it reported
already.

Jan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] xen/mcfg: Call PHYSDEVOP_pci_mmcfg_reserved before PCI enumeration
  2015-09-22  5:23     ` Jan Beulich
@ 2015-09-22 13:36       ` Konrad Rzeszutek Wilk
  2015-09-22 13:57         ` Jan Beulich
  0 siblings, 1 reply; 23+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-09-22 13:36 UTC (permalink / raw)
  To: Jan Beulich; +Cc: eswierk, xen-devel

On Mon, Sep 21, 2015 at 11:23:03PM -0600, Jan Beulich wrote:
> >>> Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> 09/21/15 8:06 PM >>>
> >- Never figured out how much data we should save in the Xen's
> >struct pci_device to see if we are 'stale'. Looking back I think
> >we just need to do the interogation of the PCI capabilities and see
> >if they have somehow changed (the ones we care about).
> 
> Didn't we settle on no data to be needed here at all, instead requiring Dom0
> to report devices removed on buses about to be re-numbered, adding them
> back after the re-numbering?

Perhaps?

The Linux code constructs the structs for bus and pci devices (and dispatches
the hypercalls) as it walks the ACPI PCI bus. And if the reassign parameter was used
it then during this walk (buses first) alters the PCI_PRIMARY_BUS. Once it has
done that it restarts the walk and scans for the PCI devices.

What that means is that all the internal Linux PCI devices structure
devices are not available before this ACPI bus scan is done (while
the Xen PCI deviecs structures are available).

The best I could come up with is to do two loops:
 1) for 0:0:0 -> ff:ff:ff call PHYSDEVOP_pci_device_remove
    (so blow away what Xen has for its PCI devices.. except for the AMD IOMMU)
 2) list_for_each_pci_device PHYSDEVOP_pci_device_add (or other variants
    if VF).

But that is just a hack working around the Linux code.

My thinking was that why don't we just make PHYSDEVOP_pci_device_add be
capable of dealing with changes like these.

However, you  have also added the code to trap on PCI configuation access
we could also do some smarts when PCI_PRIMARY_BUS is modified (see
a88b72fddd046a0978242411276861039ec99ad0 x86/PCI: add config space abstract
write intercept logic). That would take care of it much easier I think?

> 
> Jan
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] xen/mcfg: Call PHYSDEVOP_pci_mmcfg_reserved before PCI enumeration
  2015-09-22 13:26       ` Ed Swierk
  2015-09-22 13:36         ` Jan Beulich
@ 2015-09-22 13:39         ` Konrad Rzeszutek Wilk
  2015-09-22 13:52           ` Jan Beulich
  1 sibling, 1 reply; 23+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-09-22 13:39 UTC (permalink / raw)
  To: Ed Swierk; +Cc: Jan Beulich, xen-devel

On Tue, Sep 22, 2015 at 06:26:11AM -0700, Ed Swierk wrote:
> On Tue, Sep 22, 2015 at 5:35 AM, Ed Swierk <eswierk@skyportsystems.com> wrote:
> > So if the contract is that Dom0 tells Xen about mmcfgs before the
> > devices they cover, then Linux ought to call pci_mmcfg_reserved from
> > (or immediately after) both pci_mmcfg_early_init() and
> > pci_mmcfg_late_init().
> 
> Brainstorming possible approaches:
> 
> I don't see an obvious way to hook into those functions (or their
> callers) without injecting Xen-specific code. Is there a precedent to
> follow?

No.
> 
> Alternatively, we could just call the equivalent of xen_mcfg_late()
> from the existing xen_{add,remove}_device() notifiers. This would
> generate a lot of useless pci_mmcfg_reserved hypercalls (number of
> devices times number of mmcfg areas), but pci_mmcfg_arch_enable() in
> Xen should happily ignore the redundant ones. The advantage of this
> approach other than simplicity is that it makes the mmcfg -> device
> setup ordering very explicit.

While that will update the Xen's view of MMCFG it won't update the
existing configuration that Xen has slurped up during bootup.

As in, you will still get warnings.
> 
> Any other ideas?

I like it - as it will update it right away. However we would need some
extra smarts in Xen to reconfigure its view of the PCI device now that the
extended configuration space has become available.

> 
> --Ed
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] xen/mcfg: Call PHYSDEVOP_pci_mmcfg_reserved before PCI enumeration
  2015-09-22 13:39         ` Konrad Rzeszutek Wilk
@ 2015-09-22 13:52           ` Jan Beulich
  2015-09-22 14:03             ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 23+ messages in thread
From: Jan Beulich @ 2015-09-22 13:52 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: Ed Swierk, xen-devel

>>> On 22.09.15 at 15:39, <konrad.wilk@oracle.com> wrote:
> On Tue, Sep 22, 2015 at 06:26:11AM -0700, Ed Swierk wrote:
>> Any other ideas?
> 
> I like it - as it will update it right away. However we would need some
> extra smarts in Xen to reconfigure its view of the PCI device now that the
> extended configuration space has become available.

What parts are you thinking of that would need updating (and
aren't getting updated already)?

Jan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] xen/mcfg: Call PHYSDEVOP_pci_mmcfg_reserved before PCI enumeration
  2015-09-22 13:36       ` Konrad Rzeszutek Wilk
@ 2015-09-22 13:57         ` Jan Beulich
  2015-09-22 14:09           ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 23+ messages in thread
From: Jan Beulich @ 2015-09-22 13:57 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: eswierk, xen-devel

>>> On 22.09.15 at 15:36, <konrad.wilk@oracle.com> wrote:
> The best I could come up with is to do two loops:
>  1) for 0:0:0 -> ff:ff:ff call PHYSDEVOP_pci_device_remove
>     (so blow away what Xen has for its PCI devices.. except for the AMD 
> IOMMU)
>  2) list_for_each_pci_device PHYSDEVOP_pci_device_add (or other variants
>     if VF).
> 
> But that is just a hack working around the Linux code.

The price of being non-intrusive on the Linux side. The above would
seem okay to me for the (rare?) reassign case. (Not sure why you
mean to exclude the AMD IOMMU there.)

> My thinking was that why don't we just make PHYSDEVOP_pci_device_add be
> capable of dealing with changes like these.

That would require it to be able to match up devices at their new
position (bus number) with its original instance (on the old bus).

> However, you  have also added the code to trap on PCI configuation access
> we could also do some smarts when PCI_PRIMARY_BUS is modified (see
> a88b72fddd046a0978242411276861039ec99ad0 x86/PCI: add config space abstract
> write intercept logic). That would take care of it much easier I think?

That's a nice idea actually.

Jan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] xen/mcfg: Call PHYSDEVOP_pci_mmcfg_reserved before PCI enumeration
  2015-09-22 13:52           ` Jan Beulich
@ 2015-09-22 14:03             ` Konrad Rzeszutek Wilk
  2015-09-22 14:11               ` Jan Beulich
  0 siblings, 1 reply; 23+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-09-22 14:03 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Ed Swierk, xen-devel

On Tue, Sep 22, 2015 at 07:52:19AM -0600, Jan Beulich wrote:
> >>> On 22.09.15 at 15:39, <konrad.wilk@oracle.com> wrote:
> > On Tue, Sep 22, 2015 at 06:26:11AM -0700, Ed Swierk wrote:
> >> Any other ideas?
> > 
> > I like it - as it will update it right away. However we would need some
> > extra smarts in Xen to reconfigure its view of the PCI device now that the
> > extended configuration space has become available.
> 
> What parts are you thinking of that would need updating (and
> aren't getting updated already)?

The VF data. As before this call Xen might have not been able to
get to the extended configuration.
> 
> Jan
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] xen/mcfg: Call PHYSDEVOP_pci_mmcfg_reserved before PCI enumeration
  2015-09-22 13:57         ` Jan Beulich
@ 2015-09-22 14:09           ` Konrad Rzeszutek Wilk
  2015-09-22 14:12             ` Jan Beulich
  0 siblings, 1 reply; 23+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-09-22 14:09 UTC (permalink / raw)
  To: Jan Beulich; +Cc: eswierk, xen-devel

On Tue, Sep 22, 2015 at 07:57:29AM -0600, Jan Beulich wrote:
> >>> On 22.09.15 at 15:36, <konrad.wilk@oracle.com> wrote:
> > The best I could come up with is to do two loops:
> >  1) for 0:0:0 -> ff:ff:ff call PHYSDEVOP_pci_device_remove
> >     (so blow away what Xen has for its PCI devices.. except for the AMD 
> > IOMMU)
> >  2) list_for_each_pci_device PHYSDEVOP_pci_device_add (or other variants
> >     if VF).
> > 
> > But that is just a hack working around the Linux code.
> 
> The price of being non-intrusive on the Linux side. The above would
> seem okay to me for the (rare?) reassign case. (Not sure why you
> mean to exclude the AMD IOMMU there.)

I think we hide it altogether from the dom0 - so when it does the
bus reassignment it does not see the AMD IOMMU.

My concern was that the bus number of the AMD IOMMU device may
overlap with other bus numbers - which got moved as a result
of the bridge expanding its subordinate bus numbers.

In other words, the AMD IOMMU may have been at bus 0xb, the
bridge in question at 0x01, with an PCI device at 0x02;
some devices between 0x03 down to 0x0a.

The bridge would now cover 0x01->0x04 (with the PCI device
still at 0x02). But the devices at the old 0x03->0x0a have now
been shifted to 0x04->0x0b. And the AMD IOMMU is at 0x0c.

But Xen has no clue about this and "loses" the AMD IOMMU and
starts writting configuration data to whatever poor device used to
be at bus 0x0b.
> 
> > My thinking was that why don't we just make PHYSDEVOP_pci_device_add be
> > capable of dealing with changes like these.
> 
> That would require it to be able to match up devices at their new
> position (bus number) with its original instance (on the old bus).

Exactly!
> 
> > However, you  have also added the code to trap on PCI configuation access
> > we could also do some smarts when PCI_PRIMARY_BUS is modified (see
> > a88b72fddd046a0978242411276861039ec99ad0 x86/PCI: add config space abstract
> > write intercept logic). That would take care of it much easier I think?
> 
> That's a nice idea actually.

Thank you for adding that code.
> 
> Jan
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] xen/mcfg: Call PHYSDEVOP_pci_mmcfg_reserved before PCI enumeration
  2015-09-22 14:03             ` Konrad Rzeszutek Wilk
@ 2015-09-22 14:11               ` Jan Beulich
  2015-09-22 14:37                 ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 23+ messages in thread
From: Jan Beulich @ 2015-09-22 14:11 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: Ed Swierk, xen-devel

>>> On 22.09.15 at 16:03, <konrad.wilk@oracle.com> wrote:
> On Tue, Sep 22, 2015 at 07:52:19AM -0600, Jan Beulich wrote:
>> >>> On 22.09.15 at 15:39, <konrad.wilk@oracle.com> wrote:
>> > On Tue, Sep 22, 2015 at 06:26:11AM -0700, Ed Swierk wrote:
>> >> Any other ideas?
>> > 
>> > I like it - as it will update it right away. However we would need some
>> > extra smarts in Xen to reconfigure its view of the PCI device now that the
>> > extended configuration space has become available.
>> 
>> What parts are you thinking of that would need updating (and
>> aren't getting updated already)?
> 
> The VF data. As before this call Xen might have not been able to
> get to the extended configuration.

I still don't understand: Afaics pci_add_device() updates everything
that may need updating already. And things appear to be working
fine with our kernel (where the MMCFG notification comes right out
of the x86 code earlier referred to), despite this meaning updates
to the data collected during early boot. I guess you'll need to be
more specific on what you see missing...

Jan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] xen/mcfg: Call PHYSDEVOP_pci_mmcfg_reserved before PCI enumeration
  2015-09-22 14:09           ` Konrad Rzeszutek Wilk
@ 2015-09-22 14:12             ` Jan Beulich
  2015-09-22 14:33               ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 23+ messages in thread
From: Jan Beulich @ 2015-09-22 14:12 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: eswierk, xen-devel

>>> On 22.09.15 at 16:09, <konrad.wilk@oracle.com> wrote:
> On Tue, Sep 22, 2015 at 07:57:29AM -0600, Jan Beulich wrote:
>> >>> On 22.09.15 at 15:36, <konrad.wilk@oracle.com> wrote:
>> > The best I could come up with is to do two loops:
>> >  1) for 0:0:0 -> ff:ff:ff call PHYSDEVOP_pci_device_remove
>> >     (so blow away what Xen has for its PCI devices.. except for the AMD 
>> > IOMMU)
>> >  2) list_for_each_pci_device PHYSDEVOP_pci_device_add (or other variants
>> >     if VF).
>> > 
>> > But that is just a hack working around the Linux code.
>> 
>> The price of being non-intrusive on the Linux side. The above would
>> seem okay to me for the (rare?) reassign case. (Not sure why you
>> mean to exclude the AMD IOMMU there.)
> 
> I think we hide it altogether from the dom0 - so when it does the
> bus reassignment it does not see the AMD IOMMU.
> 
> My concern was that the bus number of the AMD IOMMU device may
> overlap with other bus numbers - which got moved as a result
> of the bridge expanding its subordinate bus numbers.
> 
> In other words, the AMD IOMMU may have been at bus 0xb, the
> bridge in question at 0x01, with an PCI device at 0x02;
> some devices between 0x03 down to 0x0a.
> 
> The bridge would now cover 0x01->0x04 (with the PCI device
> still at 0x02). But the devices at the old 0x03->0x0a have now
> been shifted to 0x04->0x0b. And the AMD IOMMU is at 0x0c.
> 
> But Xen has no clue about this and "loses" the AMD IOMMU and
> starts writting configuration data to whatever poor device used to
> be at bus 0x0b.

And how would that issue be solved by leaving that device alone?

Jan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] xen/mcfg: Call PHYSDEVOP_pci_mmcfg_reserved before PCI enumeration
  2015-09-22 14:12             ` Jan Beulich
@ 2015-09-22 14:33               ` Konrad Rzeszutek Wilk
  2015-09-22 15:07                 ` Jan Beulich
  0 siblings, 1 reply; 23+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-09-22 14:33 UTC (permalink / raw)
  To: Jan Beulich; +Cc: eswierk, xen-devel

On Tue, Sep 22, 2015 at 08:12:56AM -0600, Jan Beulich wrote:
> >>> On 22.09.15 at 16:09, <konrad.wilk@oracle.com> wrote:
> > On Tue, Sep 22, 2015 at 07:57:29AM -0600, Jan Beulich wrote:
> >> >>> On 22.09.15 at 15:36, <konrad.wilk@oracle.com> wrote:
> >> > The best I could come up with is to do two loops:
> >> >  1) for 0:0:0 -> ff:ff:ff call PHYSDEVOP_pci_device_remove
> >> >     (so blow away what Xen has for its PCI devices.. except for the AMD 
> >> > IOMMU)
> >> >  2) list_for_each_pci_device PHYSDEVOP_pci_device_add (or other variants
> >> >     if VF).
> >> > 
> >> > But that is just a hack working around the Linux code.
> >> 
> >> The price of being non-intrusive on the Linux side. The above would
> >> seem okay to me for the (rare?) reassign case. (Not sure why you
> >> mean to exclude the AMD IOMMU there.)
> > 
> > I think we hide it altogether from the dom0 - so when it does the
> > bus reassignment it does not see the AMD IOMMU.
> > 
> > My concern was that the bus number of the AMD IOMMU device may
> > overlap with other bus numbers - which got moved as a result
> > of the bridge expanding its subordinate bus numbers.
> > 
> > In other words, the AMD IOMMU may have been at bus 0xb, the
> > bridge in question at 0x01, with an PCI device at 0x02;
> > some devices between 0x03 down to 0x0a.
> > 
> > The bridge would now cover 0x01->0x04 (with the PCI device
> > still at 0x02). But the devices at the old 0x03->0x0a have now
> > been shifted to 0x04->0x0b. And the AMD IOMMU is at 0x0c.
> > 
> > But Xen has no clue about this and "loses" the AMD IOMMU and
> > starts writting configuration data to whatever poor device used to
> > be at bus 0x0b.
> 
> And how would that issue be solved by leaving that device alone?

If it was exposed to dom0, it could make an PHYSDEVOP_pci_device_add
to add it back to Xen (as a new device at bus 0x0c).

But Xen wouldn't know what to do with it - it would still think that
the AMD IOMMU is at bus 0x0b.

I believe I am going on a tangent here - that is the MMCFG
conversation is about the extended configuration registers and how
to make Xen aware of them such that when the VF is added Xen will
be able to validate them. Ed's idea of making the MMCFG hypercall
when the PCI notification (with your suggestions) should take care
of it (I hope).

The issue I am waxing on about is just Linux reassigning the
PCI devices and what are some of the repercussions that stem from that.

I should start a new thread about it and propose some prototype
patch, once my TODO queue is a bit sane.


> 
> Jan
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] xen/mcfg: Call PHYSDEVOP_pci_mmcfg_reserved before PCI enumeration
  2015-09-22 14:11               ` Jan Beulich
@ 2015-09-22 14:37                 ` Konrad Rzeszutek Wilk
  2015-09-22 15:09                   ` Jan Beulich
  0 siblings, 1 reply; 23+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-09-22 14:37 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Ed Swierk, xen-devel

On Tue, Sep 22, 2015 at 08:11:41AM -0600, Jan Beulich wrote:
> >>> On 22.09.15 at 16:03, <konrad.wilk@oracle.com> wrote:
> > On Tue, Sep 22, 2015 at 07:52:19AM -0600, Jan Beulich wrote:
> >> >>> On 22.09.15 at 15:39, <konrad.wilk@oracle.com> wrote:
> >> > On Tue, Sep 22, 2015 at 06:26:11AM -0700, Ed Swierk wrote:
> >> >> Any other ideas?
> >> > 
> >> > I like it - as it will update it right away. However we would need some
> >> > extra smarts in Xen to reconfigure its view of the PCI device now that the
> >> > extended configuration space has become available.
> >> 
> >> What parts are you thinking of that would need updating (and
> >> aren't getting updated already)?
> > 
> > The VF data. As before this call Xen might have not been able to
> > get to the extended configuration.
> 
> I still don't understand: Afaics pci_add_device() updates everything
> that may need updating already. And things appear to be working
> fine with our kernel (where the MMCFG notification comes right out
> of the x86 code earlier referred to), despite this meaning updates
> to the data collected during early boot. I guess you'll need to be
> more specific on what you see missing...

This is all ancient recollection of what I had seen a year or so ago
so take it a with a cup of salt.

I have one of those Intel machines where the MMCFG is not marked
in the E820 but is in the ACPI and I remember getting frequent
WARN_ON from Xen in the msi.c code when doing passthrough on the VF.
I don't have the logs but my vague recollection was that Xen could
not validate the VF's MSI-X because at the time it gathered the
VF PCI device information the extended configurations were not
available. This was prior to the XSA120 discovery so ancient
code.

With Ed's solution (along with your optimizations you suggested)
this should work. I can certainly try it out on my rig.


> 
> Jan
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] xen/mcfg: Call PHYSDEVOP_pci_mmcfg_reserved before PCI enumeration
  2015-09-22 14:33               ` Konrad Rzeszutek Wilk
@ 2015-09-22 15:07                 ` Jan Beulich
  0 siblings, 0 replies; 23+ messages in thread
From: Jan Beulich @ 2015-09-22 15:07 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: eswierk, xen-devel

>>> On 22.09.15 at 16:33, <konrad.wilk@oracle.com> wrote:
> On Tue, Sep 22, 2015 at 08:12:56AM -0600, Jan Beulich wrote:
>> >>> On 22.09.15 at 16:09, <konrad.wilk@oracle.com> wrote:
>> > On Tue, Sep 22, 2015 at 07:57:29AM -0600, Jan Beulich wrote:
>> >> >>> On 22.09.15 at 15:36, <konrad.wilk@oracle.com> wrote:
>> >> > The best I could come up with is to do two loops:
>> >> >  1) for 0:0:0 -> ff:ff:ff call PHYSDEVOP_pci_device_remove
>> >> >     (so blow away what Xen has for its PCI devices.. except for the AMD 
>> >> > IOMMU)
>> >> >  2) list_for_each_pci_device PHYSDEVOP_pci_device_add (or other variants
>> >> >     if VF).
>> >> > 
>> >> > But that is just a hack working around the Linux code.
>> >> 
>> >> The price of being non-intrusive on the Linux side. The above would
>> >> seem okay to me for the (rare?) reassign case. (Not sure why you
>> >> mean to exclude the AMD IOMMU there.)
>> > 
>> > I think we hide it altogether from the dom0 - so when it does the
>> > bus reassignment it does not see the AMD IOMMU.
>> > 
>> > My concern was that the bus number of the AMD IOMMU device may
>> > overlap with other bus numbers - which got moved as a result
>> > of the bridge expanding its subordinate bus numbers.
>> > 
>> > In other words, the AMD IOMMU may have been at bus 0xb, the
>> > bridge in question at 0x01, with an PCI device at 0x02;
>> > some devices between 0x03 down to 0x0a.
>> > 
>> > The bridge would now cover 0x01->0x04 (with the PCI device
>> > still at 0x02). But the devices at the old 0x03->0x0a have now
>> > been shifted to 0x04->0x0b. And the AMD IOMMU is at 0x0c.
>> > 
>> > But Xen has no clue about this and "loses" the AMD IOMMU and
>> > starts writting configuration data to whatever poor device used to
>> > be at bus 0x0b.
>> 
>> And how would that issue be solved by leaving that device alone?
> 
> If it was exposed to dom0, it could make an PHYSDEVOP_pci_device_add
> to add it back to Xen (as a new device at bus 0x0c).
> 
> But Xen wouldn't know what to do with it - it would still think that
> the AMD IOMMU is at bus 0x0b.

Ah, now I see what you're getting at: It's the AMD IOMMU code itself
which would need to get notified. That's even more of an orthogonal
issue than the general mixing in of bus reassignment you allude to ...

> I believe I am going on a tangent here - that is the MMCFG
> conversation is about the extended configuration registers and how
> to make Xen aware of them such that when the VF is added Xen will
> be able to validate them. Ed's idea of making the MMCFG hypercall
> when the PCI notification (with your suggestions) should take care
> of it (I hope).
> 
> The issue I am waxing on about is just Linux reassigning the
> PCI devices and what are some of the repercussions that stem from that.

... here. (As a side note - I don't think we do or even reasonably
could hide that device from being found during a bus scan. All we
do is disallow it being played with by Dom0 or getting assigned to
a DomU.)

> I should start a new thread about it and propose some prototype
> patch, once my TODO queue is a bit sane.

Much appreciated, thanks.

Jan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] xen/mcfg: Call PHYSDEVOP_pci_mmcfg_reserved before PCI enumeration
  2015-09-22 14:37                 ` Konrad Rzeszutek Wilk
@ 2015-09-22 15:09                   ` Jan Beulich
  2015-09-22 20:17                     ` Ed Swierk
  0 siblings, 1 reply; 23+ messages in thread
From: Jan Beulich @ 2015-09-22 15:09 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: Ed Swierk, xen-devel

>>> On 22.09.15 at 16:37, <konrad.wilk@oracle.com> wrote:
> On Tue, Sep 22, 2015 at 08:11:41AM -0600, Jan Beulich wrote:
>> >>> On 22.09.15 at 16:03, <konrad.wilk@oracle.com> wrote:
>> > On Tue, Sep 22, 2015 at 07:52:19AM -0600, Jan Beulich wrote:
>> >> >>> On 22.09.15 at 15:39, <konrad.wilk@oracle.com> wrote:
>> >> > On Tue, Sep 22, 2015 at 06:26:11AM -0700, Ed Swierk wrote:
>> >> >> Any other ideas?
>> >> > 
>> >> > I like it - as it will update it right away. However we would need some
>> >> > extra smarts in Xen to reconfigure its view of the PCI device now that the
>> >> > extended configuration space has become available.
>> >> 
>> >> What parts are you thinking of that would need updating (and
>> >> aren't getting updated already)?
>> > 
>> > The VF data. As before this call Xen might have not been able to
>> > get to the extended configuration.
>> 
>> I still don't understand: Afaics pci_add_device() updates everything
>> that may need updating already. And things appear to be working
>> fine with our kernel (where the MMCFG notification comes right out
>> of the x86 code earlier referred to), despite this meaning updates
>> to the data collected during early boot. I guess you'll need to be
>> more specific on what you see missing...
> 
> This is all ancient recollection of what I had seen a year or so ago
> so take it a with a cup of salt.

Urgh - a whole cup...

> I have one of those Intel machines where the MMCFG is not marked
> in the E820 but is in the ACPI and I remember getting frequent
> WARN_ON from Xen in the msi.c code when doing passthrough on the VF.
> I don't have the logs but my vague recollection was that Xen could
> not validate the VF's MSI-X because at the time it gathered the
> VF PCI device information the extended configurations were not
> available. This was prior to the XSA120 discovery so ancient
> code.

Well, VFs should not even show up prior to MMCFG getting
announced.

Jan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] xen/mcfg: Call PHYSDEVOP_pci_mmcfg_reserved before PCI enumeration
  2015-09-22 15:09                   ` Jan Beulich
@ 2015-09-22 20:17                     ` Ed Swierk
  2015-09-23  0:53                       ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 23+ messages in thread
From: Ed Swierk @ 2015-09-22 20:17 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel


[-- Attachment #1.1: Type: text/plain, Size: 7456 bytes --]

RFC. Tested with 3.14.51 and Xen 4.5.1. I'll make a proper patch against a
more current kernel if we decide this is heading in the right direction.


    xen/mcfg: Notify Xen of PCI MMCONFIG area before adding device

    On systems where the ACPI DSDT advertises a PCI MMCONFIG area but the
E820
    table does not reserve it, it's up to Dom0 to inform Xen of MMCONFIG
areas
    via PHYSDEVOP_pci_mmcfg_reserved.  This needs to happen before Xen
tries to
    access extended capabilities like SRIOV in pci_add_device(), which is
    triggered by PHYSDEVOP_pci_device_add et al.

    Since both MMCONFIG discovery and PCI bus enumeration occur in
acpi_init(),
    calling PHYSDEVOP_pci_mmcfg_reserved cannot be delegated to a separate
    initcall.  Instead, it can be done in xen_pci_notifier() immediately
before
    calling PHYSDEVOP_pci_device_add.

    Without this change, Xen 4.4 and 4.5 emit WARN messsages from
    msix_capability_init() when setting up Intel 82599 VFs, since vf_rlen
has
    not been initialized by pci_add_device().  And on Xen 4.5, Xen nukes the
    DomU due to "Potentially insecure use of MSI-X" when the VF driver
loads in
    the DomU.  Both problems are fixed by this change.

    Signed-off-by: Ed Swierk <eswierk@skyportsystems.com>

diff --git a/drivers/xen/pci.c b/drivers/xen/pci.c
index dd9c249..47f6b45 100644
--- a/drivers/xen/pci.c
+++ b/drivers/xen/pci.c
@@ -26,8 +26,57 @@
 #include <asm/xen/hypervisor.h>
 #include <asm/xen/hypercall.h>
 #include "../pci/pci.h"
+
 #ifdef CONFIG_PCI_MMCONFIG
 #include <asm/pci_x86.h>
+#include <linux/list.h>
+
+/* pci_mmcfg_list entries that have already been reported to xen */
+LIST_HEAD(xen_pci_mmcfg_list);
+
+static void xen_report_pci_mmcfg_region(u16 segment, u8 bus)
+{
+ struct pci_mmcfg_region *cfg, *new;
+ struct physdev_pci_mmcfg_reserved r;
+ int rc;
+
+ if ((pci_probe & PCI_PROBE_MMCONF) == 0)
+ return;
+
+ cfg = pci_mmconfig_lookup(segment, bus);
+ if (!cfg)
+ return;
+
+ list_for_each_entry_rcu(new, &xen_pci_mmcfg_list, list) {
+ if (new->segment == cfg->segment &&
+    new->start_bus == cfg->start_bus &&
+    new->end_bus == cfg->end_bus)
+ return;
+ }
+
+ r.address = cfg->address;
+ r.segment = cfg->segment;
+ r.start_bus = cfg->start_bus;
+ r.end_bus = cfg->end_bus;
+ r.flags = XEN_PCI_MMCFG_RESERVED;
+ rc = HYPERVISOR_physdev_op(PHYSDEVOP_pci_mmcfg_reserved, &r);
+ if (rc != 0 && rc != -ENOSYS) {
+ pr_warn("Failed to report MMCONFIG reservation state for %s"
+ " to hypervisor (%d)\n", cfg->name, rc);
+ }
+
+ new = kmemdup(cfg, sizeof(*new), GFP_KERNEL);
+ if (new) {
+ list_add_tail_rcu(&new->list, &xen_pci_mmcfg_list);
+ } else {
+ pr_warn("Failed to allocate xen_pci_mmcfg_list entry\n");
+ }
+}
+
+#else
+
+static void xen_report_pci_mmcfg_region(u16 segment, u8 bus) { }
+
 #endif

 static bool __read_mostly pci_seg_supported = true;
@@ -86,6 +135,7 @@ static int xen_add_device(struct device *dev)
  }
 #endif /* CONFIG_ACPI */

+ xen_report_pci_mmcfg_region(add.seg, add.bus);
  r = HYPERVISOR_physdev_op(PHYSDEVOP_pci_device_add, &add);
  if (r != -ENOSYS)
  return r;
@@ -104,6 +154,7 @@ static int xen_add_device(struct device *dev)
  .physfn.devfn = physfn->devfn,
  };

+ xen_report_pci_mmcfg_region(0, manage_pci_ext.bus);
  r = HYPERVISOR_physdev_op(PHYSDEVOP_manage_pci_add_ext,
  &manage_pci_ext);
  }
@@ -115,6 +166,7 @@ static int xen_add_device(struct device *dev)
  .is_extfn = 1,
  };

+ xen_report_pci_mmcfg_region(0, manage_pci_ext.bus);
  r = HYPERVISOR_physdev_op(PHYSDEVOP_manage_pci_add_ext,
  &manage_pci_ext);
  } else {
@@ -123,6 +175,7 @@ static int xen_add_device(struct device *dev)
  .devfn = pci_dev->devfn,
  };

+ xen_report_pci_mmcfg_region(0, manage_pci.bus);
  r = HYPERVISOR_physdev_op(PHYSDEVOP_manage_pci_add,
  &manage_pci);
  }
@@ -142,6 +195,7 @@ static int xen_remove_device(struct device *dev)
  .devfn = pci_dev->devfn
  };

+ xen_report_pci_mmcfg_region(device.seg, device.bus);
  r = HYPERVISOR_physdev_op(PHYSDEVOP_pci_device_remove,
   &device);
  } else if (pci_domain_nr(pci_dev->bus))
@@ -152,6 +206,7 @@ static int xen_remove_device(struct device *dev)
  .devfn = pci_dev->devfn
  };

+ xen_report_pci_mmcfg_region(0, manage_pci.bus);
  r = HYPERVISOR_physdev_op(PHYSDEVOP_manage_pci_remove,
   &manage_pci);
  }
@@ -195,49 +250,3 @@ static int __init register_xen_pci_notifier(void)
 }

 arch_initcall(register_xen_pci_notifier);
-
-#ifdef CONFIG_PCI_MMCONFIG
-static int __init xen_mcfg_late(void)
-{
- struct pci_mmcfg_region *cfg;
- int rc;
-
- if (!xen_initial_domain())
- return 0;
-
- if ((pci_probe & PCI_PROBE_MMCONF) == 0)
- return 0;
-
- if (list_empty(&pci_mmcfg_list))
- return 0;
-
- /* Check whether they are in the right area. */
- list_for_each_entry(cfg, &pci_mmcfg_list, list) {
- struct physdev_pci_mmcfg_reserved r;
-
- r.address = cfg->address;
- r.segment = cfg->segment;
- r.start_bus = cfg->start_bus;
- r.end_bus = cfg->end_bus;
- r.flags = XEN_PCI_MMCFG_RESERVED;
-
- rc = HYPERVISOR_physdev_op(PHYSDEVOP_pci_mmcfg_reserved, &r);
- switch (rc) {
- case 0:
- case -ENOSYS:
- continue;
-
- default:
- pr_warn("Failed to report MMCONFIG reservation"
- " state for %s to hypervisor"
- " (%d)\n",
- cfg->name, rc);
- }
- }
- return 0;
-}
-/*
- * Needs to be done after acpi_init which are subsys_initcall.
- */
-subsys_initcall_sync(xen_mcfg_late);
-#endif


On Tue, Sep 22, 2015 at 8:09 AM, Jan Beulich <JBeulich@suse.com> wrote:

> >>> On 22.09.15 at 16:37, <konrad.wilk@oracle.com> wrote:
> > On Tue, Sep 22, 2015 at 08:11:41AM -0600, Jan Beulich wrote:
> >> >>> On 22.09.15 at 16:03, <konrad.wilk@oracle.com> wrote:
> >> > On Tue, Sep 22, 2015 at 07:52:19AM -0600, Jan Beulich wrote:
> >> >> >>> On 22.09.15 at 15:39, <konrad.wilk@oracle.com> wrote:
> >> >> > On Tue, Sep 22, 2015 at 06:26:11AM -0700, Ed Swierk wrote:
> >> >> >> Any other ideas?
> >> >> >
> >> >> > I like it - as it will update it right away. However we would need
> some
> >> >> > extra smarts in Xen to reconfigure its view of the PCI device now
> that the
> >> >> > extended configuration space has become available.
> >> >>
> >> >> What parts are you thinking of that would need updating (and
> >> >> aren't getting updated already)?
> >> >
> >> > The VF data. As before this call Xen might have not been able to
> >> > get to the extended configuration.
> >>
> >> I still don't understand: Afaics pci_add_device() updates everything
> >> that may need updating already. And things appear to be working
> >> fine with our kernel (where the MMCFG notification comes right out
> >> of the x86 code earlier referred to), despite this meaning updates
> >> to the data collected during early boot. I guess you'll need to be
> >> more specific on what you see missing...
> >
> > This is all ancient recollection of what I had seen a year or so ago
> > so take it a with a cup of salt.
>
> Urgh - a whole cup...
>
> > I have one of those Intel machines where the MMCFG is not marked
> > in the E820 but is in the ACPI and I remember getting frequent
> > WARN_ON from Xen in the msi.c code when doing passthrough on the VF.
> > I don't have the logs but my vague recollection was that Xen could
> > not validate the VF's MSI-X because at the time it gathered the
> > VF PCI device information the extended configurations were not
> > available. This was prior to the XSA120 discovery so ancient
> > code.
>
> Well, VFs should not even show up prior to MMCFG getting
> announced.
>
> Jan
>
>

[-- Attachment #1.2: Type: text/html, Size: 15377 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH] xen/mcfg: Call PHYSDEVOP_pci_mmcfg_reserved before PCI enumeration
  2015-09-22 20:17                     ` Ed Swierk
@ 2015-09-23  0:53                       ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 23+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-09-23  0:53 UTC (permalink / raw)
  To: Ed Swierk, david.vrabel; +Cc: Jan Beulich, xen-devel

On Tue, Sep 22, 2015 at 01:17:03PM -0700, Ed Swierk wrote:
> RFC. Tested with 3.14.51 and Xen 4.5.1. I'll make a proper patch against a
> more current kernel if we decide this is heading in the right direction.

CC-ing David as well
> 
> 
>     xen/mcfg: Notify Xen of PCI MMCONFIG area before adding device
> 
>     On systems where the ACPI DSDT advertises a PCI MMCONFIG area but the
> E820
>     table does not reserve it, it's up to Dom0 to inform Xen of MMCONFIG
> areas
>     via PHYSDEVOP_pci_mmcfg_reserved.  This needs to happen before Xen
> tries to
>     access extended capabilities like SRIOV in pci_add_device(), which is
>     triggered by PHYSDEVOP_pci_device_add et al.
> 
>     Since both MMCONFIG discovery and PCI bus enumeration occur in
> acpi_init(),
>     calling PHYSDEVOP_pci_mmcfg_reserved cannot be delegated to a separate
>     initcall.  Instead, it can be done in xen_pci_notifier() immediately
> before
>     calling PHYSDEVOP_pci_device_add.
> 
>     Without this change, Xen 4.4 and 4.5 emit WARN messsages from
>     msix_capability_init() when setting up Intel 82599 VFs, since vf_rlen
> has
>     not been initialized by pci_add_device().  And on Xen 4.5, Xen nukes the
>     DomU due to "Potentially insecure use of MSI-X" when the VF driver
> loads in
>     the DomU.  Both problems are fixed by this change.
> 
>     Signed-off-by: Ed Swierk <eswierk@skyportsystems.com>
> 
> diff --git a/drivers/xen/pci.c b/drivers/xen/pci.c
> index dd9c249..47f6b45 100644
> --- a/drivers/xen/pci.c
> +++ b/drivers/xen/pci.c
> @@ -26,8 +26,57 @@
>  #include <asm/xen/hypervisor.h>
>  #include <asm/xen/hypercall.h>
>  #include "../pci/pci.h"
> +
>  #ifdef CONFIG_PCI_MMCONFIG
>  #include <asm/pci_x86.h>
> +#include <linux/list.h>
> +
> +/* pci_mmcfg_list entries that have already been reported to xen */
> +LIST_HEAD(xen_pci_mmcfg_list);
> +
> +static void xen_report_pci_mmcfg_region(u16 segment, u8 bus)
> +{
> + struct pci_mmcfg_region *cfg, *new;
> + struct physdev_pci_mmcfg_reserved r;
> + int rc;
> +
> + if ((pci_probe & PCI_PROBE_MMCONF) == 0)
> + return;
> +
> + cfg = pci_mmconfig_lookup(segment, bus);
> + if (!cfg)
> + return;
> +
> + list_for_each_entry_rcu(new, &xen_pci_mmcfg_list, list) {
> + if (new->segment == cfg->segment &&
> +    new->start_bus == cfg->start_bus &&
> +    new->end_bus == cfg->end_bus)
> + return;
> + }
> +
> + r.address = cfg->address;
> + r.segment = cfg->segment;
> + r.start_bus = cfg->start_bus;
> + r.end_bus = cfg->end_bus;
> + r.flags = XEN_PCI_MMCFG_RESERVED;
> + rc = HYPERVISOR_physdev_op(PHYSDEVOP_pci_mmcfg_reserved, &r);
> + if (rc != 0 && rc != -ENOSYS) {
> + pr_warn("Failed to report MMCONFIG reservation state for %s"
> + " to hypervisor (%d)\n", cfg->name, rc);
> + }
> +
> + new = kmemdup(cfg, sizeof(*new), GFP_KERNEL);
> + if (new) {
> + list_add_tail_rcu(&new->list, &xen_pci_mmcfg_list);
> + } else {
> + pr_warn("Failed to allocate xen_pci_mmcfg_list entry\n");
> + }
> +}
> +
> +#else
> +
> +static void xen_report_pci_mmcfg_region(u16 segment, u8 bus) { }
> +
>  #endif
> 
>  static bool __read_mostly pci_seg_supported = true;
> @@ -86,6 +135,7 @@ static int xen_add_device(struct device *dev)
>   }
>  #endif /* CONFIG_ACPI */
> 
> + xen_report_pci_mmcfg_region(add.seg, add.bus);
>   r = HYPERVISOR_physdev_op(PHYSDEVOP_pci_device_add, &add);
>   if (r != -ENOSYS)
>   return r;
> @@ -104,6 +154,7 @@ static int xen_add_device(struct device *dev)
>   .physfn.devfn = physfn->devfn,
>   };
> 
> + xen_report_pci_mmcfg_region(0, manage_pci_ext.bus);
>   r = HYPERVISOR_physdev_op(PHYSDEVOP_manage_pci_add_ext,
>   &manage_pci_ext);
>   }
> @@ -115,6 +166,7 @@ static int xen_add_device(struct device *dev)
>   .is_extfn = 1,
>   };
> 
> + xen_report_pci_mmcfg_region(0, manage_pci_ext.bus);
>   r = HYPERVISOR_physdev_op(PHYSDEVOP_manage_pci_add_ext,
>   &manage_pci_ext);
>   } else {
> @@ -123,6 +175,7 @@ static int xen_add_device(struct device *dev)
>   .devfn = pci_dev->devfn,
>   };
> 
> + xen_report_pci_mmcfg_region(0, manage_pci.bus);
>   r = HYPERVISOR_physdev_op(PHYSDEVOP_manage_pci_add,
>   &manage_pci);
>   }
> @@ -142,6 +195,7 @@ static int xen_remove_device(struct device *dev)
>   .devfn = pci_dev->devfn
>   };
> 
> + xen_report_pci_mmcfg_region(device.seg, device.bus);
>   r = HYPERVISOR_physdev_op(PHYSDEVOP_pci_device_remove,
>    &device);
>   } else if (pci_domain_nr(pci_dev->bus))
> @@ -152,6 +206,7 @@ static int xen_remove_device(struct device *dev)
>   .devfn = pci_dev->devfn
>   };
> 
> + xen_report_pci_mmcfg_region(0, manage_pci.bus);
>   r = HYPERVISOR_physdev_op(PHYSDEVOP_manage_pci_remove,
>    &manage_pci);
>   }
> @@ -195,49 +250,3 @@ static int __init register_xen_pci_notifier(void)
>  }
> 
>  arch_initcall(register_xen_pci_notifier);
> -
> -#ifdef CONFIG_PCI_MMCONFIG
> -static int __init xen_mcfg_late(void)
> -{
> - struct pci_mmcfg_region *cfg;
> - int rc;
> -
> - if (!xen_initial_domain())
> - return 0;
> -
> - if ((pci_probe & PCI_PROBE_MMCONF) == 0)
> - return 0;
> -
> - if (list_empty(&pci_mmcfg_list))
> - return 0;
> -
> - /* Check whether they are in the right area. */
> - list_for_each_entry(cfg, &pci_mmcfg_list, list) {
> - struct physdev_pci_mmcfg_reserved r;
> -
> - r.address = cfg->address;
> - r.segment = cfg->segment;
> - r.start_bus = cfg->start_bus;
> - r.end_bus = cfg->end_bus;
> - r.flags = XEN_PCI_MMCFG_RESERVED;
> -
> - rc = HYPERVISOR_physdev_op(PHYSDEVOP_pci_mmcfg_reserved, &r);
> - switch (rc) {
> - case 0:
> - case -ENOSYS:
> - continue;
> -
> - default:
> - pr_warn("Failed to report MMCONFIG reservation"
> - " state for %s to hypervisor"
> - " (%d)\n",
> - cfg->name, rc);
> - }
> - }
> - return 0;
> -}
> -/*
> - * Needs to be done after acpi_init which are subsys_initcall.
> - */
> -subsys_initcall_sync(xen_mcfg_late);
> -#endif
> 
> 
> On Tue, Sep 22, 2015 at 8:09 AM, Jan Beulich <JBeulich@suse.com> wrote:
> 
> > >>> On 22.09.15 at 16:37, <konrad.wilk@oracle.com> wrote:
> > > On Tue, Sep 22, 2015 at 08:11:41AM -0600, Jan Beulich wrote:
> > >> >>> On 22.09.15 at 16:03, <konrad.wilk@oracle.com> wrote:
> > >> > On Tue, Sep 22, 2015 at 07:52:19AM -0600, Jan Beulich wrote:
> > >> >> >>> On 22.09.15 at 15:39, <konrad.wilk@oracle.com> wrote:
> > >> >> > On Tue, Sep 22, 2015 at 06:26:11AM -0700, Ed Swierk wrote:
> > >> >> >> Any other ideas?
> > >> >> >
> > >> >> > I like it - as it will update it right away. However we would need
> > some
> > >> >> > extra smarts in Xen to reconfigure its view of the PCI device now
> > that the
> > >> >> > extended configuration space has become available.
> > >> >>
> > >> >> What parts are you thinking of that would need updating (and
> > >> >> aren't getting updated already)?
> > >> >
> > >> > The VF data. As before this call Xen might have not been able to
> > >> > get to the extended configuration.
> > >>
> > >> I still don't understand: Afaics pci_add_device() updates everything
> > >> that may need updating already. And things appear to be working
> > >> fine with our kernel (where the MMCFG notification comes right out
> > >> of the x86 code earlier referred to), despite this meaning updates
> > >> to the data collected during early boot. I guess you'll need to be
> > >> more specific on what you see missing...
> > >
> > > This is all ancient recollection of what I had seen a year or so ago
> > > so take it a with a cup of salt.
> >
> > Urgh - a whole cup...
> >
> > > I have one of those Intel machines where the MMCFG is not marked
> > > in the E820 but is in the ACPI and I remember getting frequent
> > > WARN_ON from Xen in the msi.c code when doing passthrough on the VF.
> > > I don't have the logs but my vague recollection was that Xen could
> > > not validate the VF's MSI-X because at the time it gathered the
> > > VF PCI device information the extended configurations were not
> > > available. This was prior to the XSA120 discovery so ancient
> > > code.
> >
> > Well, VFs should not even show up prior to MMCFG getting
> > announced.
> >
> > Jan
> >
> >

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2015-09-23  0:53 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-15 23:29 [PATCH] xen/mcfg: Call PHYSDEVOP_pci_mmcfg_reserved before PCI enumeration Ed Swierk
2015-09-21 15:58 ` Ed Swierk
2015-09-21 18:05   ` Konrad Rzeszutek Wilk
2015-09-22  5:23     ` Jan Beulich
2015-09-22 13:36       ` Konrad Rzeszutek Wilk
2015-09-22 13:57         ` Jan Beulich
2015-09-22 14:09           ` Konrad Rzeszutek Wilk
2015-09-22 14:12             ` Jan Beulich
2015-09-22 14:33               ` Konrad Rzeszutek Wilk
2015-09-22 15:07                 ` Jan Beulich
2015-09-22  5:21   ` Jan Beulich
2015-09-22 12:35     ` Ed Swierk
2015-09-22 12:40       ` Jan Beulich
2015-09-22 13:26       ` Ed Swierk
2015-09-22 13:36         ` Jan Beulich
2015-09-22 13:39         ` Konrad Rzeszutek Wilk
2015-09-22 13:52           ` Jan Beulich
2015-09-22 14:03             ` Konrad Rzeszutek Wilk
2015-09-22 14:11               ` Jan Beulich
2015-09-22 14:37                 ` Konrad Rzeszutek Wilk
2015-09-22 15:09                   ` Jan Beulich
2015-09-22 20:17                     ` Ed Swierk
2015-09-23  0:53                       ` Konrad Rzeszutek Wilk

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.