* [DRAFT RFC] PVHv2 interaction with physical devices
@ 2016-11-09 15:59 Roger Pau Monné
  2016-11-09 18:45 ` Konrad Rzeszutek Wilk
  2016-11-09 18:51 ` Andrew Cooper
  0 siblings, 2 replies; 18+ messages in thread
From: Roger Pau Monné @ 2016-11-09 15:59 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Kelly, Julien Grall, Paul Durrant, Jan Beulich,
	Boris Ostrovsky, Zytaruk

Hello,

I'm attaching a draft of how a PVHv2 Dom0 is supposed to interact with 
physical devices, and what needs to be done inside of Xen in order to 
achieve it. Current draft is RFC because I'm quite sure I'm missing bits 
that should be written down here. So far I've tried to describe what my 
previous series attempted to do by adding a bunch of IO and memory space 
handlers.

Please note that this document only applies to PVHv2 Dom0, it is not 
applicable to untrusted domains that will need more handlers in order to 
secure Xen and other domains running on the same system. The idea is that 
this can be expanded to untrusted domains also in the long term, thus having 
a single set of IO and memory handlers for passed-through devices.

Roger.

---8<---

This document describes how a PVHv2 Dom0 is supposed to interact with physical
devices.

Architecture
============

Purpose
-------

Previous Dom0 implementations have always used PIRQs (physical interrupts
routed over event channels) in order to receive events from physical devices.
This prevents Dom0 from taking advantage of new hardware virtualization
features, like posted interrupts or a hardware virtualized local APIC. Also,
the current device memory management in the PVH Dom0 implementation is
lacking, and might not support devices that have memory BARs past the 4GB
boundary.

The new PVH implementation (PVHv2) should overcome the interrupt limitations by
providing the same interface that's used on bare metal (local and IO APICs),
thus allowing the usage of advanced hardware assisted virtualization
techniques. This also aligns with the trend in the hardware industry to
move part of the emulation into the silicon itself.

In order to improve the mapping of device memory areas, Xen will have to
know about those devices in advance (before Dom0 tries to interact with them)
so that the memory BARs can be properly mapped into the Dom0 memory map.

The following sections describe the proposed interface and implementation
of all the logic needed in order to achieve the functionality described
above.

MMIO areas
==========

Overview
--------

On x86 systems certain regions of memory might be used in order to manage
physical devices on the system. Access to these areas is critical for a
PVH Dom0 in order to operate properly. Unlike the previous PVH Dom0
implementation (PVHv1), which was set up with identity mappings of all the
holes and reserved regions found in the memory map, this new implementation
intends to map only what's actually needed by the Dom0.

Low 1MB
-------

When booted with a legacy BIOS, the low 1MB contains firmware related data
that should be identity mapped to the Dom0. This includes the EBDA, video
memory and possibly ROMs. All non-RAM regions below 1MB will be identity
mapped to the Dom0 so that it can access this data freely.

ACPI regions
------------

ACPI regions will be identity mapped to the Dom0; this covers regions with
types 3 (ACPI reclaimable) and 4 (ACPI NVS) in the e820 memory map. Also,
since some BIOSes report incorrect memory maps, the top-level tables
discovered by Xen (as listed in the {X/R}SDT) that are not in RAM regions
will be mapped to Dom0.
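
The following is a minimal sketch of how such an e820 walk could look. The
e820 entry layout shown and the map_identity_mmio() helper are illustrative
assumptions, not existing Xen interfaces:

```c
#include <stdint.h>

#define E820_ACPI 3                 /* ACPI reclaimable */
#define E820_NVS  4                 /* ACPI NVS */

struct e820_entry {
    uint64_t addr;
    uint64_t size;
    uint32_t type;
};

struct domain;

/* Assumed helper: identity map nr frames starting at gfn as p2m_mmio_direct. */
int map_identity_mmio(struct domain *d, uint64_t gfn, uint64_t nr);

static int map_acpi_regions(struct domain *d,
                            const struct e820_entry *map, unsigned int nr)
{
    unsigned int i;

    for ( i = 0; i < nr; i++ )
    {
        uint64_t start, end;
        int rc;

        if ( map[i].type != E820_ACPI && map[i].type != E820_NVS )
            continue;

        start = map[i].addr >> 12;                       /* first frame */
        end = (map[i].addr + map[i].size + 0xfff) >> 12; /* frame after last */

        rc = map_identity_mmio(d, start, end - start);
        if ( rc )
            return rc;
    }

    return 0;
}
```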

PCI memory BARs
---------------

PCI devices discovered by Xen will have their BARs scanned in order to detect
memory BARs, and those will be identity mapped to Dom0. Since BARs can be
freely moved by the Dom0 OS by writing to the appropriate PCI config space
register, Xen must trap those accesses, unmap the previous region and
map the new one as set by Dom0.
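
Below is a rough sketch of what the trap handler for a write to a 32-bit
memory BAR could do, assuming identity mappings (gfn == mfn) and 4KB pages.
The struct bar bookkeeping and the map_mmio()/unmap_mmio() helpers are
made-up names used only for illustration:

```c
#include <stdint.h>

struct domain;

struct bar {
    uint64_t addr;              /* current position of the BAR */
    uint64_t size;              /* size of the BAR, a power of two */
    int enabled;                /* is memory decoding enabled? */
};

/* Assumed helpers: add/remove identity p2m mappings of nr frames at gfn. */
int map_mmio(struct domain *d, uint64_t gfn, uint64_t nr);
int unmap_mmio(struct domain *d, uint64_t gfn, uint64_t nr);

/* Called when a write of val to a 32-bit memory BAR register is trapped.
 * Forwarding the write to the physical register is not shown here. */
static int bar_write(struct domain *d, struct bar *bar, uint32_t val)
{
    uint64_t new_addr = val & ~(bar->size - 1);
    uint64_t frames = bar->size >> 12;

    if ( new_addr == bar->addr )
        return 0;

    if ( bar->enabled )
    {
        /* Remove the mapping of the old position... */
        unmap_mmio(d, bar->addr >> 12, frames);
        /* ...and establish it at the position chosen by Dom0. */
        map_mmio(d, new_addr >> 12, frames);
    }

    bar->addr = new_addr;
    return 0;
}
```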

Limitations
-----------

 - Xen needs to be aware of any PCI device before Dom0 tries to interact with
   it, so that the MMIO regions are properly mapped.

Interrupt management
====================

Overview
--------

On x86 systems there are three different mechanisms that can be used in order
to deliver interrupts: IO APIC, MSI and MSI-X. Note that each device might
support several of these methods, but they are never active at the same time.

Legacy PCI interrupts
---------------------

The only way to deliver legacy PCI interrupts to PVHv2 guests is through the
IO APIC; PVHv2 domains don't have an emulated PIC. As a consequence the ACPI
_PIC method must be set to APIC mode by the Dom0 OS.

Xen will always provide a single IO APIC, whose number of pins will match the
number of possible GSIs of the underlying hardware. This is possible because
ACPI uses a system-wide cookie (the GSI) in order to name interrupts, so the
IO APIC device ID or pin number is not used in _PRT methods.

XXX: is it possible to have more than 256 GSIs?

The binding between the underlying physical interrupt and the emulated
interrupt is performed when unmasking an IO APIC pin, so writes to the
IOREDTBL registers that clear the mask bit will trigger this binding
and enable the interrupt.
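
A minimal sketch of the vIO-APIC write path described above follows. The
vioapic_pin structure and the bind_gsi_to_domain() helper are assumed names
used for illustration; only the mask-bit logic reflects the behaviour
described in this section:

```c
#include <stdint.h>

struct domain;

#define IOAPIC_REDIR_MASK (1u << 16)    /* mask bit, low dword of the RTE */

struct vioapic_pin {
    uint64_t rte;               /* cached redirection table entry */
    int bound;                  /* physical interrupt already bound? */
};

/* Assumed helper: route the physical GSI to the domain as the RTE describes. */
int bind_gsi_to_domain(struct domain *d, unsigned int gsi, uint64_t rte);

/* Called when a trapped write updates the low 32 bits of IOREDTBL[gsi]. */
static int vioapic_write_rte_low(struct domain *d, struct vioapic_pin *pin,
                                 unsigned int gsi, uint32_t val)
{
    pin->rte = (pin->rte & ~0xffffffffull) | val;

    /* The binding is established the first time the pin is unmasked. */
    if ( !(val & IOAPIC_REDIR_MASK) && !pin->bound )
    {
        int rc = bind_gsi_to_domain(d, gsi, pin->rte);

        if ( rc )
            return rc;

        pin->bound = 1;
    }

    return 0;
}
```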

MSI Interrupts
--------------

MSI interrupts are set up using the PCI config space, either via the IO ports
or the memory mapped configuration area. This means that both spaces should
be trapped by Xen, in order to detect accesses to these registers and
properly emulate them.

Since the offset of the MSI registers is not fixed, Xen has to query the
PCI configuration space in order to find the offset of the PCI_CAP_ID_MSI
capability, and then set up the correct traps, which also vary depending on
the capabilities of the device (a sketch of this capability lookup is given
after the register list below). The following list contains the set of MSI
registers that Xen will trap. Please take into account that some devices
might only implement a subset of those registers, so not all traps will
be used:

 - Message control register (offset 2): Xen traps accesses to this register,
   and stores the data written to it into an internal structure. When the OS
   sets the MSI enable bit (bit 0 of this register) Xen will set up the
   configured MSI interrupts and route them to the guest.

 - Message address register (offset 4): writes and reads to this register are
   trapped by Xen, and the value is stored into an internal structure. This is
   later used when MSI is enabled in order to configure the vectors injected
   into the guest. Writes to this register with MSI already enabled will cause
   a reconfiguration of the binding of interrupts to the guest.

 - Message data register (offset 8, or 12 if the message address is 64-bit):
   writes and reads to this register are trapped by Xen, and the value is
   stored into an internal structure. This is used when MSI is enabled in
   order to configure the vector where the guest expects to receive those
   interrupts. Writes to this register with MSI already enabled will cause a
   reconfiguration of the binding of interrupts to the guest.

 - Mask and pending bits: reads or writes to those registers are not trapped
   by Xen.
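
As a reference, this is a minimal sketch of the capability lookup mentioned
above, following the standard PCI capability list walk. The
pci_read8()/pci_read16() config space accessors are assumed helpers, not
existing Xen functions:

```c
#include <stdint.h>

#define PCI_STATUS          0x06
#define PCI_STATUS_CAP_LIST 0x10
#define PCI_CAPABILITY_LIST 0x34
#define PCI_CAP_ID_MSI      0x05

/* Assumed helpers: read 8/16 bits from the config space of device sbdf. */
uint8_t pci_read8(uint32_t sbdf, unsigned int reg);
uint16_t pci_read16(uint32_t sbdf, unsigned int reg);

/* Returns the config space offset of the MSI capability, or 0 if absent. */
static unsigned int find_msi_cap(uint32_t sbdf)
{
    unsigned int pos, ttl = 48;     /* bound the walk on malformed lists */

    if ( !(pci_read16(sbdf, PCI_STATUS) & PCI_STATUS_CAP_LIST) )
        return 0;

    pos = pci_read8(sbdf, PCI_CAPABILITY_LIST) & ~3u;
    while ( pos && ttl-- )
    {
        if ( pci_read8(sbdf, pos) == PCI_CAP_ID_MSI )
            return pos;

        pos = pci_read8(sbdf, pos + 1) & ~3u;   /* next capability pointer */
    }

    return 0;
}
```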

MSI-X Interrupts
----------------

MSI-X, in contrast with MSI, has part of its configuration registers in the
PCI configuration space, while the rest reside inside the memory BARs of the
device. So in this case Xen needs to set up traps for both the PCI
configuration space and two different memory regions. Xen has to query the
position of the MSI-X capability using the PCI_CAP_ID_MSIX, and set up a
handler in order to trap accesses to the different registers. Xen also has
to figure out the position of the MSI-X table and PBA, using the table BIR
and table offset, and the PBA BIR and PBA offset. Once those are known,
handlers should also be set up in order to trap accesses to those memory
regions.
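
The following sketch shows how the table and PBA positions could be derived
from the capability registers. The register offsets follow the PCI
specification, while pci_read32() and bar_addr() are assumed helpers used
only for illustration:

```c
#include <stdint.h>

#define PCI_MSIX_TABLE     4        /* table offset/BIR dword within the cap */
#define PCI_MSIX_PBA       8        /* PBA offset/BIR dword within the cap */
#define PCI_MSIX_BIR_MASK  0x7u

/* Assumed helpers: config space read and lookup of a BAR's physical address. */
uint32_t pci_read32(uint32_t sbdf, unsigned int reg);
uint64_t bar_addr(uint32_t sbdf, unsigned int bar);

struct msix_regions {
    uint64_t table;                 /* physical address of the MSI-X table */
    uint64_t pba;                   /* physical address of the PBA */
};

static void msix_locate(uint32_t sbdf, unsigned int msix_pos,
                        struct msix_regions *out)
{
    uint32_t table = pci_read32(sbdf, msix_pos + PCI_MSIX_TABLE);
    uint32_t pba = pci_read32(sbdf, msix_pos + PCI_MSIX_PBA);

    /* The low 3 bits select the BAR (BIR); the rest is an offset into it. */
    out->table = bar_addr(sbdf, table & PCI_MSIX_BIR_MASK) +
                 (table & ~PCI_MSIX_BIR_MASK);
    out->pba = bar_addr(sbdf, pba & PCI_MSIX_BIR_MASK) +
               (pba & ~PCI_MSIX_BIR_MASK);
}
```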

This is the list of MSI-X registers, located in the PCI configuration space,
that are used in order to manage MSI-X:

 - Message control: Xen should trap accesses to this register in order to
   detect changes to the MSI-X enable field (bit 15). Changes to this bit
   will trigger the setup of the configured MSI-X table entries. Writes
   to the function mask bit will be passed through to the underlying
   register.

 - Table offset, table BIR, PBA offset, PBA BIR: accesses to those registers
   are not trapped by Xen.

The following registers reside in memory, and are located through the Table
and PBA fields found in the PCI configuration space:

 - Message address and data: writes and reads to those registers are trapped
   by Xen, and the value is stored into an internal structure. This is later
   used by Xen in order to configure the interrupt injected to the guest.
   Writes to those registers with MSI-X already enabled will not cause a
   reconfiguration of the interrupt.

 - Vector control: writes and reads are trapped; clearing the mask bit (bit 0)
   will cause Xen to set up the configured interrupt if MSI-X is globally
   enabled in the message control field (see the sketch after this list).

 - Pending bits array: writes and reads to this register are not trapped by
   Xen.
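
This is a minimal sketch of the vector control handling described above. The
msix bookkeeping structures and the setup_msix_irq() helper are assumed
names; only the mask-bit transition check reflects the described behaviour:

```c
#include <stdint.h>

struct domain;

#define MSIX_VECTOR_CTRL_MASKBIT 0x1u

struct msix_entry {
    uint64_t addr;              /* stored message address */
    uint32_t data;              /* stored message data */
    uint32_t vctrl;             /* last vector control value written */
};

struct msix {
    int enabled;                /* MSI-X enable bit from message control */
    struct msix_entry *entries;
};

/* Assumed helper: bind the physical interrupt described by entry e. */
int setup_msix_irq(struct domain *d, const struct msix_entry *e);

/* Called on a trapped write to the vector control dword of table entry idx. */
static int msix_vector_ctrl_write(struct domain *d, struct msix *msix,
                                  unsigned int idx, uint32_t val)
{
    struct msix_entry *e = &msix->entries[idx];
    int was_masked = e->vctrl & MSIX_VECTOR_CTRL_MASKBIT;

    e->vctrl = val;

    /* Unmasking an entry with MSI-X globally enabled sets up the interrupt. */
    if ( was_masked && !(val & MSIX_VECTOR_CTRL_MASKBIT) && msix->enabled )
        return setup_msix_irq(d, e);

    return 0;
}
```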

Limitations
-----------

 - Since Dom0 is not able to parse dynamic ACPI tables, some UART devices
   might only function in polling mode: Xen will be unable to properly
   configure the interrupt pins without Dom0 collaboration, and the UART in
   use by Xen should be explicitly blacklisted from Dom0 access.



* Re: [DRAFT RFC] PVHv2 interaction with physical devices
  2016-11-09 15:59 [DRAFT RFC] PVHv2 interaction with physical devices Roger Pau Monné
@ 2016-11-09 18:45 ` Konrad Rzeszutek Wilk
  2016-11-10 10:39   ` Roger Pau Monné
  2016-11-09 18:51 ` Andrew Cooper
  1 sibling, 1 reply; 18+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-11-09 18:45 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Andrew Cooper, Kelly, Julien Grall, Paul Durrant, Jan Beulich,
	xen-devel, Boris Ostrovsky, Zytaruk

On Wed, Nov 09, 2016 at 04:59:12PM +0100, Roger Pau Monné wrote:
> Hello,
> 
> I'm attaching a draft of how a PVHv2 Dom0 is supposed to interact with 
> physical devices, and what needs to be done inside of Xen in order to 
> achieve it. Current draft is RFC because I'm quite sure I'm missing bits 
> that should be written down here. So far I've tried to describe what my 
> previous series attempted to do by adding a bunch of IO and memory space 
> handlers.
> 
> Please note that this document only applies to PVHv2 Dom0, it is not 
> applicable to untrusted domains that will need more handlers in order to 
> secure Xen and other domains running on the same system. The idea is that 
> this can be expanded to untrusted domains also in the long term, thus having 
> a single set of IO and memory handlers for passed-through devices.
> 
> Roger.
> 
> ---8<---
> 
> This document describes how a PVHv2 Dom0 is supposed to interact with physical
> devices.
> 
> Architecture
> ============
> 
> Purpose
> -------
> 
> Previous Dom0 implementations have always used PIRQs (physical interrupts
> routed over event channels) in order to receive events from physical devices.
> This prevents Dom0 form taking advantage of new hardware virtualization
> features, like posted interrupts or hardware virtualized local APIC. Also the
> current device memory management in the PVH Dom0 implementation is lacking,
> and might not support devices that have memory regions past the 4GB 
> boundary.

memory regions meaning BAR regions?

> 
> The new PVH implementation (PVHv2) should overcome the interrupt limitations by
> providing the same interface that's used on bare metal (a local and IO APICs)
> thus allowing the usage of advanced hardware assisted virtualization
> techniques. This also aligns with the trend on the hardware industry to
> move part of the emulation into the silicon itself.

What if the hardware PVH2 runs on does not have vAPIC?
> 
> In order to improve the mapping of device memory areas, Xen will have to
> know of those devices in advance (before Dom0 tries to interact with them)
> so that the memory BARs will be properly mapped into Dom0 memory map.

Oh, that is going to be a problem with SR-IOV. Those are created _after_
dom0 has booted. In fact they are done by the drivers themselves.

See xen_add_device in drivers/xen/pci.c how this is handled.

> 
> The following document describes the proposed interface and implementation
> of all the logic needed in order to achieve the functionality described 
> above.
> 
> MMIO areas
> ==========
> 
> Overview
> --------
> 
> On x86 systems certain regions of memory might be used in order to manage
> physical devices on the system. Access to this areas is critical for a
> PVH Dom0 in order to operate properly. Unlike previous PVH Dom0 implementation
> (PVHv1) that was setup with identity mappings of all the holes and reserved
> regions found in the memory map, this new implementation intents to map only
> what's actually needed by the Dom0.

And why was the previous approach not working?
> 
> Low 1MB
> -------
> 
> When booted with a legacy BIOS, the low 1MB contains firmware related data
> that should be identity mapped to the Dom0. This include the EBDA, video
> memory and possibly ROMs. All non RAM regions below 1MB will be identity
> mapped to the Dom0 so that it can access this data freely.
> 
> ACPI regions
> ------------
> 
> ACPI regions will be identity mapped to the Dom0, this implies regions with
> type 3 and 4 in the e820 memory map. Also, since some BIOS report incorrect
> memory maps, the top-level tables discovered by Xen (as listed in the
> {X/R}SDT) that are not on RAM regions will be mapped to Dom0.
> 
> PCI memory BARs
> ---------------
> 
> PCI devices discovered by Xen will have it's BARs scanned in order to detect
> memory BARs, and those will be identity mapped to Dom0. Since BARs can be
> freely moved by the Dom0 OS by writing to the appropriate PCI config space
> register, Xen must trap those accesses and unmap the previous region and
> map the new one as set by Dom0.

You can make that simpler - we have hypercalls to "notify" in Linux
when a device is changing. Those can provide that information as well.
(This is what PV dom0 does).

Also you are missing one important part - the MMCFG. That is required
for Xen to be able to poke at the PCI configuration spaces (above the 256).
And you can only get the MMCFG if the ACPI DSDT has been parsed.

So if you do the PCI bus scanning _before_ booting PVH dom0, you may
need to update your view of PCI devices after the MMCFG locations
have been provided to you.

> 
> Limitations
> -----------
> 
>  - Xen needs to be aware of any PCI device before Dom0 tries to interact with
>    it, so that the MMIO regions are properly mapped.
> 
> Interrupt management
> ====================
> 
> Overview
> --------
> 
> On x86 systems there are tree different mechanisms that can be used in order
> to deliver interrupts: IO APIC, MSI and MSI-X. Note that each device might
> support different methods, but those are never active at the same time.
> 
> Legacy PCI interrupts
> ---------------------
> 
> The only way to deliver legacy PCI interrupts to PVHv2 guests is using the
> IO APIC, PVHv2 domains don't have an emulated PIC. As a consequence the ACPI
> _PIC method must be set to APIC mode by the Dom0 OS.
> 
> Xen will always provide a single IO APIC, that will match the number of
> possible GSIs of the underlying hardware. This is possible because ACPI
> uses a system cookie in order to name interrupts, so the IO APIC device ID
> or pin number is not used in _PTR methods.

So the MADT that is presented to dom0 will be mangled? That is
where the IOAPIC information along with the number of GSIs is presented.
> 
> XXX: is it possible to have more than 256 GSIs?

Yeah. If you have enough of the IOAPICs you can have more than 256. But
I don't think any OS has taken that into account as the GSI value are
always uint8_t.

> 
> The binding between the underlying physical interrupt and the emulated
> interrupt is performed when unmasking an IO APIC PIN, so writes to the
> IOREDTBL registers that unset the mask bit will trigger this binding
> and enable the interrupt.
> 
> MSI Interrupts
> --------------
> 
> MSI interrupts are setup using the PCI config space, either the IO ports
> or the memory mapped configuration area. This means that both spaces should
> be trapped by Xen, in order to detect accesses to these registers and
> properly emulate them.
> 
> Since the offset of the MSI registers is not fixed, Xen has to query the
> PCI configuration space in order to find the offset of the PCI_CAP_ID_MSI,
> and then setup the correct traps, which also vary depending on the
> capabilities of the device. The following list contains the set of MSI
> registers that Xen will trap, please take into account that some devices
> might only implement a subset of those registers, so not all traps will
> be used:
> 
>  - Message control register (offset 2): Xen traps accesses to this register,
>    and stores the data written to it into an internal structure. When the OS
>    sets the MSI enable bit (offset 0) Xen will setup the configured MSI
>    interrupts and route them to the guest.
> 
>  - Message address register (offset 4): writes and reads to this register are
>    trapped by Xen, and the value is stored into an internal structure. This is
>    later used when MSI are enabled in order to configure the vectors injected
>    to the guest. Writes to this register with MSI already enabled will cause
>    a reconfiguration of the binding of interrupts to the guest.
> 
>  - Message data register (offset 8 or 12 if message address is 64bits): writes
>    and reads to this register are trapped by Xen, and the value is stored into
>    an internal structure. This is used when MSI are enabled in order to
>    configure the vector where the guests expects to receive those interrupts.
>    Writes to this register with MSI already enabled will cause a
>    reconfiguration of the binding of interrupts to the guest.
> 
>  - Mask and pending bits: reads or writes to those registers are not trapped
>    by Xen.
> 
> MSI-X Interrupts
> ----------------
> 
> MSI-X in contrast with MSI has part of the configuration registers in the
> PCI configuration space, while others reside inside of the memory BARs of the
> device. So in this case Xen needs to setup traps for both the PCI
> configuration space and two different memory regions. Xen has to query the
> position of the MSI-X capability using the PCI_CAP_ID_MSIX, and setup a
> handler in order to trap accesses to the different registers. Xen also has
> to figure out the position of the MSI-X table and PBA, using the table BIR
> and table offset, and the PBA BIR and PBA offset. Once those are known a
> handler should also be setup in order to trap accesses to those memory 
> regions.
> 
> This is the list of MSI-X registers that are used in order to manage MSI-X
> in the PCI configuration space:
> 
>  - Message control: Xen should trap accesses to this register in order to
>    detect changes to the MSI-X enable field (bit 15). Changes to this bit
>    will trigger the setup of the MSI-X table entries configured. Writes
>    to the function mask bit will be passed-through to the underlying
>    register.
> 
>  - Table offset, table BIR, PBA offset, PBA BIR: accesses to those registers
>    are not trapped by Xen.
> 
> The following registers reside in memory, and are pointed out by the Table and
> PBA fields found in the PCI configuration space:
> 
>  - Message address and data: writes and reads to those registers are trapped
>    by Xen, and the value is stored into an internal structure. This is later
>    used by Xen in order to configure the interrupt injected to the guest.
>    Writes to those registers with MSI-X already enabled will not cause a
>    reconfiguration of the interrupt.
> 
>  - Vector control: writes and reads are trapped, clearing the mask bit (bit 0)
>    will cause Xen to setup the configured interrupt if MSI-X is globally
>    enabled in the message control field.
> 
>  - Pending bits array: writes and reads to this register are not trapped by
>    Xen.
> 
> Limitations
> -----------
> 
>  - Due to the fact that Dom0 is not able to parse dynamic ACPI tables,
>    some UART devices might only function in polling mode, because Xen
>    will be unable to properly configure the interrupt pins without Dom0
>    collaboration, and the UART in use by Xen should be explicitly blacklisted
>    from Dom0 access.

By blacklisting the IO ports too?
> 


* Re: [DRAFT RFC] PVHv2 interaction with physical devices
  2016-11-09 15:59 [DRAFT RFC] PVHv2 interaction with physical devices Roger Pau Monné
  2016-11-09 18:45 ` Konrad Rzeszutek Wilk
@ 2016-11-09 18:51 ` Andrew Cooper
  2016-11-09 20:47   ` Pasi Kärkkäinen
  2016-11-10 10:54   ` Roger Pau Monné
  1 sibling, 2 replies; 18+ messages in thread
From: Andrew Cooper @ 2016-11-09 18:51 UTC (permalink / raw)
  To: Roger Pau Monné, xen-devel
  Cc: Kelly, Julien Grall, Paul Durrant, Jan Beulich, Zytaruk, Boris Ostrovsky

On 09/11/16 15:59, Roger Pau Monné wrote:
> Hello,
>
> I'm attaching a draft of how a PVHv2 Dom0 is supposed to interact with 
> physical devices, and what needs to be done inside of Xen in order to 
> achieve it. Current draft is RFC because I'm quite sure I'm missing bits 
> that should be written down here. So far I've tried to describe what my 
> previous series attempted to do by adding a bunch of IO and memory space 
> handlers.
>
> Please note that this document only applies to PVHv2 Dom0, it is not 
> applicable to untrusted domains that will need more handlers in order to 
> secure Xen and other domains running on the same system. The idea is that 
> this can be expanded to untrusted domains also in the long term, thus having 
> a single set of IO and memory handlers for passed-through devices.
>
> Roger.
>
> ---8<---
>
> This document describes how a PVHv2 Dom0 is supposed to interact with physical
> devices.
>
> Architecture
> ============
>
> Purpose
> -------
>
> Previous Dom0 implementations have always used PIRQs (physical interrupts
> routed over event channels) in order to receive events from physical devices.
> This prevents Dom0 form taking advantage of new hardware virtualization
> features, like posted interrupts or hardware virtualized local APIC. Also the
> current device memory management in the PVH Dom0 implementation is lacking,
> and might not support devices that have memory regions past the 4GB 
> boundary.
>
> The new PVH implementation (PVHv2) should overcome the interrupt limitations by
> providing the same interface that's used on bare metal (a local and IO APICs)
> thus allowing the usage of advanced hardware assisted virtualization
> techniques. This also aligns with the trend on the hardware industry to
> move part of the emulation into the silicon itself.

+10

>
> In order to improve the mapping of device memory areas, Xen will have to
> know of those devices in advance (before Dom0 tries to interact with them)
> so that the memory BARs will be properly mapped into Dom0 memory map.
>
> The following document describes the proposed interface and implementation
> of all the logic needed in order to achieve the functionality described 
> above.
>
> MMIO areas
> ==========
>
> Overview
> --------
>
> On x86 systems certain regions of memory might be used in order to manage
> physical devices on the system. Access to this areas is critical for a
> PVH Dom0 in order to operate properly. Unlike previous PVH Dom0 implementation
> (PVHv1) that was setup with identity mappings of all the holes and reserved
> regions found in the memory map, this new implementation intents to map only
> what's actually needed by the Dom0.
>
> Low 1MB
> -------
>
> When booted with a legacy BIOS, the low 1MB contains firmware related data
> that should be identity mapped to the Dom0. This include the EBDA, video
> memory and possibly ROMs. All non RAM regions below 1MB will be identity
> mapped to the Dom0 so that it can access this data freely.

Are you proposing a unilateral identity map of the first 1MB, or just
the interesting regions?

One thing to remember is the iBVT, for iscsi boot, which lives in
regular RAM and needs searching for.

>
> ACPI regions
> ------------
>
> ACPI regions will be identity mapped to the Dom0, this implies regions with
> type 3 and 4 in the e820 memory map. Also, since some BIOS report incorrect
> memory maps, the top-level tables discovered by Xen (as listed in the
> {X/R}SDT) that are not on RAM regions will be mapped to Dom0.
>
> PCI memory BARs
> ---------------
>
> PCI devices discovered by Xen will have it's BARs scanned in order to detect
> memory BARs, and those will be identity mapped to Dom0. Since BARs can be
> freely moved by the Dom0 OS by writing to the appropriate PCI config space
> register, Xen must trap those accesses and unmap the previous region and
> map the new one as set by Dom0.
>
> Limitations
> -----------
>
>  - Xen needs to be aware of any PCI device before Dom0 tries to interact with
>    it, so that the MMIO regions are properly mapped.
>
> Interrupt management
> ====================
>
> Overview
> --------
>
> On x86 systems there are tree different mechanisms that can be used in order
> to deliver interrupts: IO APIC, MSI and MSI-X. Note that each device might
> support different methods, but those are never active at the same time.
>
> Legacy PCI interrupts
> ---------------------
>
> The only way to deliver legacy PCI interrupts to PVHv2 guests is using the
> IO APIC, PVHv2 domains don't have an emulated PIC. As a consequence the ACPI
> _PIC method must be set to APIC mode by the Dom0 OS.
>
> Xen will always provide a single IO APIC, that will match the number of
> possible GSIs of the underlying hardware. This is possible because ACPI
> uses a system cookie in order to name interrupts, so the IO APIC device ID
> or pin number is not used in _PTR methods.
>
> XXX: is it possible to have more than 256 GSIs?

Yes.  There is no restriction on the number of IO-APIC in a system, and
no restriction on the number of PCI bridges these IO-APICs serve.

However, I would suggest it would be better to offer one a 1-to-1 view
of system IO-APICs to vIO-APICs in PVHv2 dom0, or the pin mappings are
going to get confused when reading the ACPI tables.

>
> The binding between the underlying physical interrupt and the emulated
> interrupt is performed when unmasking an IO APIC PIN, so writes to the
> IOREDTBL registers that unset the mask bit will trigger this binding
> and enable the interrupt.
>
> MSI Interrupts
> --------------
>
> MSI interrupts are setup using the PCI config space, either the IO ports
> or the memory mapped configuration area. This means that both spaces should
> be trapped by Xen, in order to detect accesses to these registers and
> properly emulate them.

cfc/cf8 need trapping unconditionally, and the MMCFG region can only be
intercepted in units of 4k.  As a result, Xen will unconditionally see
all config accesses anyway.

>
> Since the offset of the MSI registers is not fixed, Xen has to query the
> PCI configuration space in order to find the offset of the PCI_CAP_ID_MSI,
> and then setup the correct traps, which also vary depending on the
> capabilities of the device.

Although only once at start-of-day.  The layout of capabilities in
config space for a particular device is static.

>  The following list contains the set of MSI
> registers that Xen will trap, please take into account that some devices
> might only implement a subset of those registers, so not all traps will
> be used:
>
>  - Message control register (offset 2): Xen traps accesses to this register,
>    and stores the data written to it into an internal structure. When the OS
>    sets the MSI enable bit (offset 0) Xen will setup the configured MSI
>    interrupts and route them to the guest.
>
>  - Message address register (offset 4): writes and reads to this register are
>    trapped by Xen, and the value is stored into an internal structure. This is
>    later used when MSI are enabled in order to configure the vectors injected
>    to the guest. Writes to this register with MSI already enabled will cause
>    a reconfiguration of the binding of interrupts to the guest.
>
>  - Message data register (offset 8 or 12 if message address is 64bits): writes
>    and reads to this register are trapped by Xen, and the value is stored into
>    an internal structure. This is used when MSI are enabled in order to
>    configure the vector where the guests expects to receive those interrupts.
>    Writes to this register with MSI already enabled will cause a
>    reconfiguration of the binding of interrupts to the guest.
>
>  - Mask and pending bits: reads or writes to those registers are not trapped
>    by Xen.

These must be trapped.  In all cases, Xen must maintain the guests idea
of whether something is masked, and Xen's own idea.  This is necessary
for interrupt migration.

Having said that, the entire interrupt remapping subsystem in Xen is in
dire need of an overhaul.  It is terminally dumb and inefficient.  With
interrupt remapping enabled, Xen should never need to touch interrupt
sources for non-guest actions.

>
> MSI-X Interrupts
> ----------------
>
> MSI-X in contrast with MSI has part of the configuration registers in the
> PCI configuration space, while others reside inside of the memory BARs of the
> device. So in this case Xen needs to setup traps for both the PCI
> configuration space and two different memory regions. Xen has to query the
> position of the MSI-X capability using the PCI_CAP_ID_MSIX, and setup a
> handler in order to trap accesses to the different registers. Xen also has
> to figure out the position of the MSI-X table and PBA, using the table BIR
> and table offset, and the PBA BIR and PBA offset. Once those are known a
> handler should also be setup in order to trap accesses to those memory 
> regions.
>
> This is the list of MSI-X registers that are used in order to manage MSI-X
> in the PCI configuration space:
>
>  - Message control: Xen should trap accesses to this register in order to
>    detect changes to the MSI-X enable field (bit 15). Changes to this bit
>    will trigger the setup of the MSI-X table entries configured. Writes
>    to the function mask bit will be passed-through to the underlying
>    register.
>
>  - Table offset, table BIR, PBA offset, PBA BIR: accesses to those registers
>    are not trapped by Xen.

These will be trapped, but are read-only so Xen needn't do anything
exciting as part of emulation.

>
> The following registers reside in memory, and are pointed out by the Table and
> PBA fields found in the PCI configuration space:
>
>  - Message address and data: writes and reads to those registers are trapped
>    by Xen, and the value is stored into an internal structure. This is later
>    used by Xen in order to configure the interrupt injected to the guest.
>    Writes to those registers with MSI-X already enabled will not cause a
>    reconfiguration of the interrupt.
>
>  - Vector control: writes and reads are trapped, clearing the mask bit (bit 0)
>    will cause Xen to setup the configured interrupt if MSI-X is globally
>    enabled in the message control field.
>
>  - Pending bits array: writes and reads to this register are not trapped by
>    Xen.
>
> Limitations
> -----------
>
>  - Due to the fact that Dom0 is not able to parse dynamic ACPI tables,
>    some UART devices might only function in polling mode, because Xen
>    will be unable to properly configure the interrupt pins without Dom0
>    collaboration, and the UART in use by Xen should be explicitly blacklisted
>    from Dom0 access.

This reminds me that we need to include some HPET quirks in Xen as well.

There is an entire range of Nehalem era machines where Linux finds an
HPET in the IOH via quirks alone, and not via the ACPI tables, and
nothing in Xen currently knows to disallow this access.

~Andrew


* Re: [DRAFT RFC] PVHv2 interaction with physical devices
  2016-11-09 18:51 ` Andrew Cooper
@ 2016-11-09 20:47   ` Pasi Kärkkäinen
  2016-11-10 10:43     ` Andrew Cooper
  2016-11-10 10:54   ` Roger Pau Monné
  1 sibling, 1 reply; 18+ messages in thread
From: Pasi Kärkkäinen @ 2016-11-09 20:47 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Kelly, Julien Grall, Paul Durrant, Jan Beulich, xen-devel,
	Zytaruk, Boris Ostrovsky, Roger Pau Monné

On Wed, Nov 09, 2016 at 06:51:49PM +0000, Andrew Cooper wrote:
> >
> > Low 1MB
> > -------
> >
> > When booted with a legacy BIOS, the low 1MB contains firmware related data
> > that should be identity mapped to the Dom0. This include the EBDA, video
> > memory and possibly ROMs. All non RAM regions below 1MB will be identity
> > mapped to the Dom0 so that it can access this data freely.
> 
> Are you proposing a unilateral identity map of the first 1MB, or just
> the interesting regions?
> 
> One thing to remember is the iBVT, for iscsi boot, which lives in
> regular RAM and needs searching for.
> 

I think you mean iBFT = iSCSI Boot Firmware Table.


-- Pasi



* Re: [DRAFT RFC] PVHv2 interaction with physical devices
  2016-11-09 18:45 ` Konrad Rzeszutek Wilk
@ 2016-11-10 10:39   ` Roger Pau Monné
  2016-11-10 13:53     ` Konrad Rzeszutek Wilk
  2016-11-10 16:37     ` Jan Beulich
  0 siblings, 2 replies; 18+ messages in thread
From: Roger Pau Monné @ 2016-11-10 10:39 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Andrew Cooper, Kelly, Julien Grall, Paul Durrant, Jan Beulich,
	xen-devel, Boris Ostrovsky

On Wed, Nov 09, 2016 at 01:45:17PM -0500, Konrad Rzeszutek Wilk wrote:
> On Wed, Nov 09, 2016 at 04:59:12PM +0100, Roger Pau Monné wrote:
> > Hello,
> > 
> > I'm attaching a draft of how a PVHv2 Dom0 is supposed to interact with 
> > physical devices, and what needs to be done inside of Xen in order to 
> > achieve it. Current draft is RFC because I'm quite sure I'm missing bits 
> > that should be written down here. So far I've tried to describe what my 
> > previous series attempted to do by adding a bunch of IO and memory space 
> > handlers.
> > 
> > Please note that this document only applies to PVHv2 Dom0, it is not 
> > applicable to untrusted domains that will need more handlers in order to 
> > secure Xen and other domains running on the same system. The idea is that 
> > this can be expanded to untrusted domains also in the long term, thus having 
> > a single set of IO and memory handlers for passed-through devices.
> > 
> > Roger.
> > 
> > ---8<---
> > 
> > This document describes how a PVHv2 Dom0 is supposed to interact with physical
> > devices.
> > 
> > Architecture
> > ============
> > 
> > Purpose
> > -------
> > 
> > Previous Dom0 implementations have always used PIRQs (physical interrupts
> > routed over event channels) in order to receive events from physical devices.
> > This prevents Dom0 form taking advantage of new hardware virtualization
> > features, like posted interrupts or hardware virtualized local APIC. Also the
> > current device memory management in the PVH Dom0 implementation is lacking,
> > and might not support devices that have memory regions past the 4GB 
> > boundary.
> 
> memory regions meaning BAR regions?

Yes.
 
> > 
> > The new PVH implementation (PVHv2) should overcome the interrupt limitations by
> > providing the same interface that's used on bare metal (a local and IO APICs)
> > thus allowing the usage of advanced hardware assisted virtualization
> > techniques. This also aligns with the trend on the hardware industry to
> > move part of the emulation into the silicon itself.
> 
> What if the hardware PVH2 runs on does not have vAPIC?

The emulated local APIC provided by Xen will be used.

> > 
> > In order to improve the mapping of device memory areas, Xen will have to
> > know of those devices in advance (before Dom0 tries to interact with them)
> > so that the memory BARs will be properly mapped into Dom0 memory map.
> 
> Oh, that is going to be a problem with SR-IOV. Those are created _after_
> dom0 has booted. In fact they are done by the drivers themselves.
> 
> See xen_add_device in drivers/xen/pci.c how this is handled.

Is the process of creating those VFs something standard? (In the sense that 
it can be detected by Xen, and proper mappings established)

> > 
> > The following document describes the proposed interface and implementation
> > of all the logic needed in order to achieve the functionality described 
> > above.
> > 
> > MMIO areas
> > ==========
> > 
> > Overview
> > --------
> > 
> > On x86 systems certain regions of memory might be used in order to manage
> > physical devices on the system. Access to this areas is critical for a
> > PVH Dom0 in order to operate properly. Unlike previous PVH Dom0 implementation
> > (PVHv1) that was setup with identity mappings of all the holes and reserved
> > regions found in the memory map, this new implementation intents to map only
> > what's actually needed by the Dom0.
> 
> And why was the previous approach not working?

The previous PVHv1 implementation would only identity map holes and reserved 
areas in the guest memory map, or up to the 4GB boundary if the guest memory 
map is smaller than 4GB. If a device has a BAR past the 4GB boundary, for 
example, it would not be identity mapped in the p2m.

> > 
> > Low 1MB
> > -------
> > 
> > When booted with a legacy BIOS, the low 1MB contains firmware related data
> > that should be identity mapped to the Dom0. This include the EBDA, video
> > memory and possibly ROMs. All non RAM regions below 1MB will be identity
> > mapped to the Dom0 so that it can access this data freely.
> > 
> > ACPI regions
> > ------------
> > 
> > ACPI regions will be identity mapped to the Dom0, this implies regions with
> > type 3 and 4 in the e820 memory map. Also, since some BIOS report incorrect
> > memory maps, the top-level tables discovered by Xen (as listed in the
> > {X/R}SDT) that are not on RAM regions will be mapped to Dom0.
> > 
> > PCI memory BARs
> > ---------------
> > 
> > PCI devices discovered by Xen will have it's BARs scanned in order to detect
> > memory BARs, and those will be identity mapped to Dom0. Since BARs can be
> > freely moved by the Dom0 OS by writing to the appropriate PCI config space
> > register, Xen must trap those accesses and unmap the previous region and
> > map the new one as set by Dom0.
> 
> You can make that simpler - we have hypercalls to "notify" in Linux
> when a device is changing. Those can provide that information as well.
> (This is what PV dom0 does).
> 
> Also you are missing one important part - the MMCFG. That is required
> for Xen to be able to poke at the PCI configuration spaces (above the 256).
> And you can only get the MMCFG if the ACPI DSDT has been parsed.

Hm, I guess I'm missing something, but at least on my hardware Xen seems to 
be able to parse the MCFG ACPI table before Dom0 does anything with the 
DSDT:

(XEN) PCI: MCFG configuration 0: base f8000000 segment 0000 buses 00 - 3f
(XEN) PCI: MCFG area at f8000000 reserved in E820
(XEN) PCI: Using MCFG for segment 0000 bus 00-3f

> So if you do the PCI bus scanning _before_ booting PVH dom0, you may
> need to update your view of PCI devices after the MMCFG locations
> have been provided to you.

I'm not opposed to keeping the PHYSDEVOP_pci_mmcfg_reserved, but I have yet 
to see hardware where this is actually needed. Also, AFAICT, FreeBSD at 
least is only able to detect MMCFG regions present in the MCFG ACPI table:

http://fxr.watson.org/fxr/source/dev/acpica/acpi.c?im=excerp#L1861

> > 
> > Limitations
> > -----------
> > 
> >  - Xen needs to be aware of any PCI device before Dom0 tries to interact with
> >    it, so that the MMIO regions are properly mapped.
> > 
> > Interrupt management
> > ====================
> > 
> > Overview
> > --------
> > 
> > On x86 systems there are tree different mechanisms that can be used in order
> > to deliver interrupts: IO APIC, MSI and MSI-X. Note that each device might
> > support different methods, but those are never active at the same time.
> > 
> > Legacy PCI interrupts
> > ---------------------
> > 
> > The only way to deliver legacy PCI interrupts to PVHv2 guests is using the
> > IO APIC, PVHv2 domains don't have an emulated PIC. As a consequence the ACPI
> > _PIC method must be set to APIC mode by the Dom0 OS.
> > 
> > Xen will always provide a single IO APIC, that will match the number of
> > possible GSIs of the underlying hardware. This is possible because ACPI
> > uses a system cookie in order to name interrupts, so the IO APIC device ID
> > or pin number is not used in _PTR methods.
> 
> So the MADT that is presented to dom0 will be mangled? That is
> where the IOAPIC information along with the number of GSIs is presented.

Yes, the MADT presented to Dom0 is created by Xen; this is already part of 
my series, see patch:

https://lists.xenproject.org/archives/html/xen-devel/2016-10/msg02017.html

The IO APIC information is presented in the MADT IO APIC entries, while the 
total number of GSIs is calculated by the Dom0 by poking at how many pins 
each IO APIC has (this information is not directly fetched from ACPI).

> > 
> > XXX: is it possible to have more than 256 GSIs?
> 
> Yeah. If you have enough of the IOAPICs you can have more than 256. But
> I don't think any OS has taken that into account as the GSI value are
> always uint8_t.

Right, so AFAICT providing a single IO APIC with enough pins should be fine.

> > 
> > The binding between the underlying physical interrupt and the emulated
> > interrupt is performed when unmasking an IO APIC PIN, so writes to the
> > IOREDTBL registers that unset the mask bit will trigger this binding
> > and enable the interrupt.
> > 
> > MSI Interrupts
> > --------------
> > 
> > MSI interrupts are setup using the PCI config space, either the IO ports
> > or the memory mapped configuration area. This means that both spaces should
> > be trapped by Xen, in order to detect accesses to these registers and
> > properly emulate them.
> > 
> > Since the offset of the MSI registers is not fixed, Xen has to query the
> > PCI configuration space in order to find the offset of the PCI_CAP_ID_MSI,
> > and then setup the correct traps, which also vary depending on the
> > capabilities of the device. The following list contains the set of MSI
> > registers that Xen will trap, please take into account that some devices
> > might only implement a subset of those registers, so not all traps will
> > be used:
> > 
> >  - Message control register (offset 2): Xen traps accesses to this register,
> >    and stores the data written to it into an internal structure. When the OS
> >    sets the MSI enable bit (offset 0) Xen will setup the configured MSI
> >    interrupts and route them to the guest.
> > 
> >  - Message address register (offset 4): writes and reads to this register are
> >    trapped by Xen, and the value is stored into an internal structure. This is
> >    later used when MSI are enabled in order to configure the vectors injected
> >    to the guest. Writes to this register with MSI already enabled will cause
> >    a reconfiguration of the binding of interrupts to the guest.
> > 
> >  - Message data register (offset 8 or 12 if message address is 64bits): writes
> >    and reads to this register are trapped by Xen, and the value is stored into
> >    an internal structure. This is used when MSI are enabled in order to
> >    configure the vector where the guests expects to receive those interrupts.
> >    Writes to this register with MSI already enabled will cause a
> >    reconfiguration of the binding of interrupts to the guest.
> > 
> >  - Mask and pending bits: reads or writes to those registers are not trapped
> >    by Xen.
> > 
> > MSI-X Interrupts
> > ----------------
> > 
> > MSI-X in contrast with MSI has part of the configuration registers in the
> > PCI configuration space, while others reside inside of the memory BARs of the
> > device. So in this case Xen needs to setup traps for both the PCI
> > configuration space and two different memory regions. Xen has to query the
> > position of the MSI-X capability using the PCI_CAP_ID_MSIX, and setup a
> > handler in order to trap accesses to the different registers. Xen also has
> > to figure out the position of the MSI-X table and PBA, using the table BIR
> > and table offset, and the PBA BIR and PBA offset. Once those are known a
> > handler should also be setup in order to trap accesses to those memory 
> > regions.
> > 
> > This is the list of MSI-X registers that are used in order to manage MSI-X
> > in the PCI configuration space:
> > 
> >  - Message control: Xen should trap accesses to this register in order to
> >    detect changes to the MSI-X enable field (bit 15). Changes to this bit
> >    will trigger the setup of the MSI-X table entries configured. Writes
> >    to the function mask bit will be passed-through to the underlying
> >    register.
> > 
> >  - Table offset, table BIR, PBA offset, PBA BIR: accesses to those registers
> >    are not trapped by Xen.
> > 
> > The following registers reside in memory, and are pointed out by the Table and
> > PBA fields found in the PCI configuration space:
> > 
> >  - Message address and data: writes and reads to those registers are trapped
> >    by Xen, and the value is stored into an internal structure. This is later
> >    used by Xen in order to configure the interrupt injected to the guest.
> >    Writes to those registers with MSI-X already enabled will not cause a
> >    reconfiguration of the interrupt.
> > 
> >  - Vector control: writes and reads are trapped, clearing the mask bit (bit 0)
> >    will cause Xen to setup the configured interrupt if MSI-X is globally
> >    enabled in the message control field.
> > 
> >  - Pending bits array: writes and reads to this register are not trapped by
> >    Xen.
> > 
> > Limitations
> > -----------
> > 
> >  - Due to the fact that Dom0 is not able to parse dynamic ACPI tables,
> >    some UART devices might only function in polling mode, because Xen
> >    will be unable to properly configure the interrupt pins without Dom0
> >    collaboration, and the UART in use by Xen should be explicitly blacklisted
> >    from Dom0 access.
> 
> By blacklisting the IO ports too?

Well, I was planning to somehow use the STAO ACPI table, but I'm not really 
sure how Xen can blacklist a device without parsing the DSDT:

https://lists.xen.org/archives/html/xen-devel/2016-08/pdfYfOWKJ83jH.pdf

Since this table is under Xen's control, we could always make changes to it 
in order to suit our needs, although I'm not really sure how a device can be 
blacklisted without knowing its ACPI namespace path, and I don't know how 
to get that without parsing the DSDT.

Roger.


* Re: [DRAFT RFC] PVHv2 interaction with physical devices
  2016-11-09 20:47   ` Pasi Kärkkäinen
@ 2016-11-10 10:43     ` Andrew Cooper
  0 siblings, 0 replies; 18+ messages in thread
From: Andrew Cooper @ 2016-11-10 10:43 UTC (permalink / raw)
  To: Pasi Kärkkäinen
  Cc: Kelly, Julien Grall, Paul Durrant, Jan Beulich, xen-devel,
	Zytaruk, Boris Ostrovsky, Roger Pau Monné

On 09/11/16 20:47, Pasi Kärkkäinen wrote:
> On Wed, Nov 09, 2016 at 06:51:49PM +0000, Andrew Cooper wrote:
>>> Low 1MB
>>> -------
>>>
>>> When booted with a legacy BIOS, the low 1MB contains firmware related data
>>> that should be identity mapped to the Dom0. This include the EBDA, video
>>> memory and possibly ROMs. All non RAM regions below 1MB will be identity
>>> mapped to the Dom0 so that it can access this data freely.
>> Are you proposing a unilateral identity map of the first 1MB, or just
>> the interesting regions?
>>
>> One thing to remember is the iBVT, for iscsi boot, which lives in
>> regular RAM and needs searching for.
>>
> I think you mean iBFT = iSCSI Boot Firmware Table.

I did indeed.  Sorry - BVT is a commonly used internal acronym, and is
clearly ingrained into my muscle memory.

~Andrew


* Re: [DRAFT RFC] PVHv2 interaction with physical devices
  2016-11-09 18:51 ` Andrew Cooper
  2016-11-09 20:47   ` Pasi Kärkkäinen
@ 2016-11-10 10:54   ` Roger Pau Monné
  2016-11-10 11:23     ` Andrew Cooper
  1 sibling, 1 reply; 18+ messages in thread
From: Roger Pau Monné @ 2016-11-10 10:54 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Kelly, Julien Grall, Paul Durrant, Jan Beulich, xen-devel,
	Boris Ostrovsky, Zytaruk

On Wed, Nov 09, 2016 at 06:51:49PM +0000, Andrew Cooper wrote:
> On 09/11/16 15:59, Roger Pau Monné wrote:
> > Low 1MB
> > -------
> >
> > When booted with a legacy BIOS, the low 1MB contains firmware related data
> > that should be identity mapped to the Dom0. This include the EBDA, video
> > memory and possibly ROMs. All non RAM regions below 1MB will be identity
> > mapped to the Dom0 so that it can access this data freely.
> 
> Are you proposing a unilateral identity map of the first 1MB, or just
> the interesting regions?

The current approach identity maps the first 1MB except for RAM regions, 
which are instead populated in the p2m, with the data from the original pages 
copied over. This is done because the AP boot trampoline is placed in the RAM 
regions below 1MB, and the emulator is not able to execute code from pages 
marked as p2m_mmio_direct.
 
> One thing to remember is the iBVT, for iscsi boot, which lives in
> regular RAM and needs searching for.

And I guess this is not static data that just needs to be read by the OS? 
Then I will have to look into fixing the emulator to deal with 
p2m_mmio_direct regions.

> >
> > ACPI regions
> > ------------
> >
> > ACPI regions will be identity mapped to the Dom0, this implies regions with
> > type 3 and 4 in the e820 memory map. Also, since some BIOS report incorrect
> > memory maps, the top-level tables discovered by Xen (as listed in the
> > {X/R}SDT) that are not on RAM regions will be mapped to Dom0.
> >
> > PCI memory BARs
> > ---------------
> >
> > PCI devices discovered by Xen will have it's BARs scanned in order to detect
> > memory BARs, and those will be identity mapped to Dom0. Since BARs can be
> > freely moved by the Dom0 OS by writing to the appropriate PCI config space
> > register, Xen must trap those accesses and unmap the previous region and
> > map the new one as set by Dom0.
> >
> > Limitations
> > -----------
> >
> >  - Xen needs to be aware of any PCI device before Dom0 tries to interact with
> >    it, so that the MMIO regions are properly mapped.
> >
> > Interrupt management
> > ====================
> >
> > Overview
> > --------
> >
> > On x86 systems there are tree different mechanisms that can be used in order
> > to deliver interrupts: IO APIC, MSI and MSI-X. Note that each device might
> > support different methods, but those are never active at the same time.
> >
> > Legacy PCI interrupts
> > ---------------------
> >
> > The only way to deliver legacy PCI interrupts to PVHv2 guests is using the
> > IO APIC, PVHv2 domains don't have an emulated PIC. As a consequence the ACPI
> > _PIC method must be set to APIC mode by the Dom0 OS.
> >
> > Xen will always provide a single IO APIC, that will match the number of
> > possible GSIs of the underlying hardware. This is possible because ACPI
> > uses a system cookie in order to name interrupts, so the IO APIC device ID
> > or pin number is not used in _PTR methods.
> >
> > XXX: is it possible to have more than 256 GSIs?
> 
> Yes.  There is no restriction on the number of IO-APIC in a system, and
> no restriction on the number of PCI bridges these IO-APICs serve.
> 
> However, I would suggest it would be better to offer one a 1-to-1 view
> of system IO-APICs to vIO-APICs in PVHv2 dom0, or the pin mappings are
> going to get confused when reading the ACPI tables.

Hm, I've been searching for this, but it seems to me that ACPI tables will 
always use GSIs in APIC mode in order to describe interrupts, so it doesn't 
seem to matter whether those GSIs are scattered across multiple IO APICs or 
just a single one.

> >
> > The binding between the underlying physical interrupt and the emulated
> > interrupt is performed when unmasking an IO APIC PIN, so writes to the
> > IOREDTBL registers that unset the mask bit will trigger this binding
> > and enable the interrupt.
> >
> > MSI Interrupts
> > --------------
> >
> > MSI interrupts are setup using the PCI config space, either the IO ports
> > or the memory mapped configuration area. This means that both spaces should
> > be trapped by Xen, in order to detect accesses to these registers and
> > properly emulate them.
> 
> cfc/cf8 need trapping unconditionally, and the MMCFG region can only be
> intercepted in units of 4k.  As a result, Xen will unconditionally see
> all config accesses anyway.

Yes, that's right (however it might decide to just pass through some of 
them).

> >
> > Since the offset of the MSI registers is not fixed, Xen has to query the
> > PCI configuration space in order to find the offset of the PCI_CAP_ID_MSI,
> > and then setup the correct traps, which also vary depending on the
> > capabilities of the device.
> 
> Although only once at start-of-day.  The layout of capabilities in
> config space for a particular device is static.

Yes, the MSI capability offset is fetched at start of day and then 
stored.

> >  The following list contains the set of MSI
> > registers that Xen will trap, please take into account that some devices
> > might only implement a subset of those registers, so not all traps will
> > be used:
> >
> >  - Message control register (offset 2): Xen traps accesses to this register,
> >    and stores the data written to it into an internal structure. When the OS
> >    sets the MSI enable bit (offset 0) Xen will setup the configured MSI
> >    interrupts and route them to the guest.
> >
> >  - Message address register (offset 4): writes and reads to this register are
> >    trapped by Xen, and the value is stored into an internal structure. This is
> >    later used when MSI are enabled in order to configure the vectors injected
> >    to the guest. Writes to this register with MSI already enabled will cause
> >    a reconfiguration of the binding of interrupts to the guest.
> >
> >  - Message data register (offset 8 or 12 if message address is 64bits): writes
> >    and reads to this register are trapped by Xen, and the value is stored into
> >    an internal structure. This is used when MSI are enabled in order to
> >    configure the vector where the guests expects to receive those interrupts.
> >    Writes to this register with MSI already enabled will cause a
> >    reconfiguration of the binding of interrupts to the guest.
> >
> >  - Mask and pending bits: reads or writes to those registers are not trapped
> >    by Xen.
> 
> These must be trapped.  In all cases, Xen must maintain the guests idea
> of whether something is masked, and Xen's own idea.  This is necessary
> for interrupt migration.

Oh, so mask bits must be trapped and the interrupt masked using the Xen 
interrupt API then, noted.

> Having said that, the entire interrupt remapping subsystem in Xen is in
> dire need of an overhaul.  It is terminally dumb and inefficient.  With
> interrupt remapping enabled, Xen should never need to touch interrupt
> sources for non-guest actions.
> 
> >
> > MSI-X Interrupts
> > ----------------
> >
> > MSI-X in contrast with MSI has part of the configuration registers in the
> > PCI configuration space, while others reside inside of the memory BARs of the
> > device. So in this case Xen needs to setup traps for both the PCI
> > configuration space and two different memory regions. Xen has to query the
> > position of the MSI-X capability using the PCI_CAP_ID_MSIX, and setup a
> > handler in order to trap accesses to the different registers. Xen also has
> > to figure out the position of the MSI-X table and PBA, using the table BIR
> > and table offset, and the PBA BIR and PBA offset. Once those are known a
> > handler should also be setup in order to trap accesses to those memory 
> > regions.
> >
> > This is the list of MSI-X registers that are used in order to manage MSI-X
> > in the PCI configuration space:
> >
> >  - Message control: Xen should trap accesses to this register in order to
> >    detect changes to the MSI-X enable field (bit 15). Changes to this bit
> >    will trigger the setup of the MSI-X table entries configured. Writes
> >    to the function mask bit will be passed-through to the underlying
> >    register.
> >
> >  - Table offset, table BIR, PBA offset, PBA BIR: accesses to those registers
> >    are not trapped by Xen.
> 
> These will be trapped, but are read-only so Xen needn't do anything
> exciting as part of emulation.

Right, those are read-only.
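
For completeness, this is roughly how the trap location is derived from those
read-only registers (sketch only; pci_conf_read32() and bar_address() are
placeholders): the low 3 bits select the BAR (BIR), the rest is the offset
into that BAR. The PBA at capability offset +8 works the same way.

#include <stdint.h>

uint32_t pci_conf_read32(uint8_t bus, uint8_t dev, uint8_t func, uint16_t reg);
uint64_t bar_address(uint8_t bus, uint8_t dev, uint8_t func, unsigned int bar);

/* msix_cap is the config space offset of the MSI-X capability. */
static uint64_t msix_table_addr(uint8_t bus, uint8_t dev, uint8_t func,
                                uint8_t msix_cap)
{
    uint32_t table = pci_conf_read32(bus, dev, func, msix_cap + 4);
    unsigned int bir = table & 0x7;   /* which BAR holds the table */
    uint64_t offset  = table & ~0x7u; /* offset into that BAR */

    return bar_address(bus, dev, func, bir) + offset;
}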

> >
> > The following registers reside in memory, and are pointed out by the Table and
> > PBA fields found in the PCI configuration space:
> >
> >  - Message address and data: writes and reads to those registers are trapped
> >    by Xen, and the value is stored into an internal structure. This is later
> >    used by Xen in order to configure the interrupt injected to the guest.
> >    Writes to those registers with MSI-X already enabled will not cause a
> >    reconfiguration of the interrupt.
> >
> >  - Vector control: writes and reads are trapped, clearing the mask bit (bit 0)
> >    will cause Xen to setup the configured interrupt if MSI-X is globally
> >    enabled in the message control field.
> >
> >  - Pending bits array: writes and reads to this register are not trapped by
> >    Xen.
> >
> > Limitations
> > -----------
> >
> >  - Due to the fact that Dom0 is not able to parse dynamic ACPI tables,
> >    some UART devices might only function in polling mode, because Xen
> >    will be unable to properly configure the interrupt pins without Dom0
> >    collaboration, and the UART in use by Xen should be explicitly blacklisted
> >    from Dom0 access.
> 
> This reminds me that we need to include some HPET quirks in Xen as well.
> 
> There is an entire range of Nehalem era machines where Linux finds an
> HPET in the IOH via quirks alone, and not via the ACPI tables, and
> nothing in Xen currently knows to disallow this access.

Hm, if it's using quirks it's going to be hard to prevent this. At worst 
Linux is going to discover that the HPET is non-functional, at least I 
assume?

Roger.


* Re: [DRAFT RFC] PVHv2 interaction with physical devices
  2016-11-10 10:54   ` Roger Pau Monné
@ 2016-11-10 11:23     ` Andrew Cooper
  0 siblings, 0 replies; 18+ messages in thread
From: Andrew Cooper @ 2016-11-10 11:23 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Kelly, Julien Grall, Paul Durrant, Jan Beulich, xen-devel,
	Boris Ostrovsky, Zytaruk

On 10/11/16 10:54, Roger Pau Monné wrote:
> On Wed, Nov 09, 2016 at 06:51:49PM +0000, Andrew Cooper wrote:
>> On 09/11/16 15:59, Roger Pau Monné wrote:
>>> Low 1MB
>>> -------
>>>
>>> When booted with a legacy BIOS, the low 1MB contains firmware related data
>>> that should be identity mapped to the Dom0. This include the EBDA, video
>>> memory and possibly ROMs. All non RAM regions below 1MB will be identity
>>> mapped to the Dom0 so that it can access this data freely.
>> Are you proposing a unilateral identity map of the first 1MB, or just
>> the interesting regions?
> The current approach identity maps the first 1MB except for RAM regions, 
> which are populated in the p2m, and the data in the original pages is copied 
> over. This is done because the AP boot trampoline is placed in the RAM 
> regions below 1MB, and the emulator is not able to execute code from pages 
> marked as p2m_mmio_direct.
>  
>> One thing to remember is the iBFT, for iSCSI boot, which lives in
>> regular RAM and needs searching for.
> And I guess this is not static data that just needs to be read by the OS? 
> Then I will have to look into fixing the emulator to deal with 
> p2m_mmio_direct regions.

It lives in plain RAM, but is static iirc.  It should just need copying
into dom0's view.
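
So in rough pseudocode the overall low-1MB handling would be something like
this (illustrative only, the helpers are placeholders and not real Xen
functions): identity map the non-RAM regions, populate the RAM regions in the
p2m and copy their contents (AP trampoline area, iBFT, ...) across.

#include <stdint.h>

#define E820_RAM 1
struct e820entry { uint64_t start, end; uint32_t type; };

/* Placeholders for the real p2m operations. */
void p2m_identity_map(uint64_t first_gfn, uint64_t nr_pages);
void p2m_populate_and_copy(uint64_t first_gfn, uint64_t nr_pages);

static void dom0_setup_low_1mb(const struct e820entry *map, unsigned int nr)
{
    for (unsigned int i = 0; i < nr; i++) {
        uint64_t start = map[i].start, end = map[i].end;

        if (start >= 0x100000)          /* only the low 1MB is of interest */
            continue;
        if (end > 0x100000)
            end = 0x100000;

        if (map[i].type == E820_RAM)
            p2m_populate_and_copy(start >> 12, (end - start) >> 12);
        else
            p2m_identity_map(start >> 12, (end - start) >> 12);
    }
}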

>
>>> ACPI regions
>>> ------------
>>>
>>> ACPI regions will be identity mapped to the Dom0, this implies regions with
>>> type 3 and 4 in the e820 memory map. Also, since some BIOS report incorrect
>>> memory maps, the top-level tables discovered by Xen (as listed in the
>>> {X/R}SDT) that are not on RAM regions will be mapped to Dom0.
>>>
>>> PCI memory BARs
>>> ---------------
>>>
>>> PCI devices discovered by Xen will have their BARs scanned in order to detect
>>> memory BARs, and those will be identity mapped to Dom0. Since BARs can be
>>> freely moved by the Dom0 OS by writing to the appropriate PCI config space
>>> register, Xen must trap those accesses and unmap the previous region and
>>> map the new one as set by Dom0.
>>>
>>> Limitations
>>> -----------
>>>
>>>  - Xen needs to be aware of any PCI device before Dom0 tries to interact with
>>>    it, so that the MMIO regions are properly mapped.
>>>
>>> Interrupt management
>>> ====================
>>>
>>> Overview
>>> --------
>>>
>>> On x86 systems there are three different mechanisms that can be used in order
>>> to deliver interrupts: IO APIC, MSI and MSI-X. Note that each device might
>>> support different methods, but those are never active at the same time.
>>>
>>> Legacy PCI interrupts
>>> ---------------------
>>>
>>> The only way to deliver legacy PCI interrupts to PVHv2 guests is using the
>>> IO APIC, PVHv2 domains don't have an emulated PIC. As a consequence the ACPI
>>> _PIC method must be set to APIC mode by the Dom0 OS.
>>>
>>> Xen will always provide a single IO APIC that will match the number of
>>> possible GSIs of the underlying hardware. This is possible because ACPI
>>> uses a system-wide cookie (the GSI number) in order to name interrupts, so
>>> the IO APIC device ID or pin number is not used in _PRT methods.
>>>
>>> XXX: is it possible to have more than 256 GSIs?
>> Yes.  There is no restriction on the number of IO-APIC in a system, and
>> no restriction on the number of PCI bridges these IO-APICs serve.
>>
>> However, I would suggest it would be better to offer one a 1-to-1 view
>> of system IO-APICs to vIO-APICs in PVHv2 dom0, or the pin mappings are
>> going to get confused when reading the ACPI tables.
> Hm, I've been searching for this, but it seems to me that ACPI tables will 
> always use GSIs in APIC mode in order to describe interrupts, so it doesn't 
> seem to matter whether those GSIs are scattered across multiple IO APICs or 
> just a single one.

I will not be surprised if this plan turns out to cause problems.

Perhaps we can start out with just a single vIOAPIC and see if that
works in reality.

>
>>> The following registers reside in memory, and are pointed out by the Table and
>>> PBA fields found in the PCI configuration space:
>>>
>>>  - Message address and data: writes and reads to those registers are trapped
>>>    by Xen, and the value is stored into an internal structure. This is later
>>>    used by Xen in order to configure the interrupt injected to the guest.
>>>    Writes to those registers with MSI-X already enabled will not cause a
>>>    reconfiguration of the interrupt.
>>>
>>>  - Vector control: writes and reads are trapped, clearing the mask bit (bit 0)
>>>    will cause Xen to setup the configured interrupt if MSI-X is globally
>>>    enabled in the message control field.
>>>
>>>  - Pending bits array: writes and reads to this register are not trapped by
>>>    Xen.
>>>
>>> Limitations
>>> -----------
>>>
>>>  - Due to the fact that Dom0 is not able to parse dynamic ACPI tables,
>>>    some UART devices might only function in polling mode, because Xen
>>>    will be unable to properly configure the interrupt pins without Dom0
>>>    collaboration, and the UART in use by Xen should be explicitly blacklisted
>>>    from Dom0 access.
>> This reminds me that we need to include some HPET quirks in Xen as well.
>>
>> There is an entire range of Nehalem era machines where Linux finds an
>> HPET in the IOH via quirks alone, and not via the ACPI tables, and
>> nothing in Xen currently knows to disallow this access.
> Hm, if it's using quirks it's going to be hard to prevent this. At worst
> Linux is going to discover that the HPET is non-functional, at least I
> assume?

It is a PCI quirk on the southbridge to know how to find the system HPET
even though it isn't described in any ACPI tables.

As Xen doesn't know how to find this HPET and deny dom0 access to it,
dom0 finds it, disables legacy broadcast mode and reconfigures
interrupts behind Xen's back.  It also causes a hang during kexec
because the new kernel can't complete its timer calibration.
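
In case it helps, the shape of a matching quirk in Xen would be roughly the
following (purely illustrative; the device ID and register offset below are
placeholders, not the values the Linux quirk actually uses):

#include <stdint.h>

#define PCI_VENDOR_ID_INTEL   0x8086
#define IOH_DEVID_PLACEHOLDER 0xffff   /* placeholder, not a real device ID */

uint32_t pci_conf_read32(uint8_t bus, uint8_t dev, uint8_t func, uint16_t reg);
void dom0_deny_mmio(uint64_t base, uint64_t size);   /* placeholder */

static void quirk_hidden_hpet(uint8_t bus, uint8_t dev, uint8_t func)
{
    uint32_t id = pci_conf_read32(bus, dev, func, 0x00);

    if ((id & 0xffff) != PCI_VENDOR_ID_INTEL ||
        (id >> 16) != IOH_DEVID_PLACEHOLDER)
        return;

    /* 0xf0 is a stand-in for wherever the quirk reads the HPET base from. */
    uint64_t hpet_base = pci_conf_read32(bus, dev, func, 0xf0) & ~0xfffull;

    dom0_deny_mmio(hpet_base, 0x1000);  /* keep the page out of dom0's p2m */
}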

~Andrew


* Re: [DRAFT RFC] PVHv2 interaction with physical devices
  2016-11-10 10:39   ` Roger Pau Monné
@ 2016-11-10 13:53     ` Konrad Rzeszutek Wilk
  2016-11-10 15:20       ` Roger Pau Monné
  2016-11-10 16:37     ` Jan Beulich
  1 sibling, 1 reply; 18+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-11-10 13:53 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Andrew Cooper, Kelly, Julien Grall, Paul Durrant, Jan Beulich,
	xen-devel, Boris Ostrovsky

On Thu, Nov 10, 2016 at 11:39:08AM +0100, Roger Pau Monné wrote:
> On Wed, Nov 09, 2016 at 01:45:17PM -0500, Konrad Rzeszutek Wilk wrote:
> > On Wed, Nov 09, 2016 at 04:59:12PM +0100, Roger Pau Monné wrote:
> > > Hello,
> > > 
> > > I'm attaching a draft of how a PVHv2 Dom0 is supposed to interact with 
> > > physical devices, and what needs to be done inside of Xen in order to 
> > > achieve it. Current draft is RFC because I'm quite sure I'm missing bits 
> > > that should be written down here. So far I've tried to describe what my 
> > > previous series attempted to do by adding a bunch of IO and memory space 
> > > handlers.
> > > 
> > > Please note that this document only applies to PVHv2 Dom0, it is not 
> > > applicable to untrusted domains that will need more handlers in order to 
> > > secure Xen and other domains running on the same system. The idea is that 
> > > this can be expanded to untrusted domains also in the long term, thus having 
> > > a single set of IO and memory handlers for passed-through devices.
> > > 
> > > Roger.
> > > 
> > > ---8<---
> > > 
> > > This document describes how a PVHv2 Dom0 is supposed to interact with physical
> > > devices.
> > > 
> > > Architecture
> > > ============
> > > 
> > > Purpose
> > > -------
> > > 
> > > Previous Dom0 implementations have always used PIRQs (physical interrupts
> > > routed over event channels) in order to receive events from physical devices.
> > > This prevents Dom0 form taking advantage of new hardware virtualization
> > > features, like posted interrupts or hardware virtualized local APIC. Also the
> > > current device memory management in the PVH Dom0 implementation is lacking,
> > > and might not support devices that have memory regions past the 4GB 
> > > boundary.
> > 
> > memory regions meaning BAR regions?
> 
> Yes.
>  
> > > 
> > > The new PVH implementation (PVHv2) should overcome the interrupt limitations by
> > > providing the same interface that's used on bare metal (a local and IO APICs)
> > > thus allowing the usage of advanced hardware assisted virtualization
> > > techniques. This also aligns with the trend on the hardware industry to
> > > move part of the emulation into the silicon itself.
> > 
> > What if the hardware PVH2 runs on does not have vAPIC?
> 
> The emulated local APIC provided by Xen will be used.
> 
> > > 
> > > In order to improve the mapping of device memory areas, Xen will have to
> > > know of those devices in advance (before Dom0 tries to interact with them)
> > > so that the memory BARs will be properly mapped into Dom0 memory map.
> > 
> > Oh, that is going to be a problem with SR-IOV. Those are created _after_
> > dom0 has booted. In fact they are done by the drivers themselves.
> > 
> > See xen_add_device in drivers/xen/pci.c how this is handled.
> 
> Is the process of creating those VFs something standard? (In the sense that 
> it can be detected by Xen, and proper mappings established)

Yes and no.

You can read from the PCI configuration that the device (Physical
function) has SR-IOV. But that information may be in the extended
configuration registers so you need MCFG. Anyhow the only thing the PF
will tell you is the BAR regions they will occupy (since they
are behind the bridge) but not the BDFs:

        Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)
                IOVCap: Migration-, Interrupt Message Number: 000
                IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+
                IOVSta: Migration-
                Initial VFs: 8, Total VFs: 8, Number of VFs: 0, Function Dependency Link: 00
                VF offset: 128, stride: 2, Device ID: 10ca
                Supported Page Size: 00000553, System Page Size: 00000001
                Region 0: Memory at 00000000fbda0000 (64-bit, non-prefetchable)
                Region 3: Memory at 00000000fbd80000 (64-bit, non-prefetchable)
                VF Migration: offset: 00000000, BIR: 0
        Kernel driver in use: igb

And if I enable SR-IOV on the PF I get:

0a:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
0a:10.0 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
0a:10.2 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
0a:10.4 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
0a:10.6 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
0a:11.0 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
0a:11.2 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
0a:11.4 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)

-bash-4.1# lspci -s 0a:10.0 -v
0a:10.0 Ethernet controller: Intel Corporation 82576 Virtual Function
(rev 01)
        Subsystem: Super Micro Computer Inc Device 10c9
        Flags: bus master, fast devsel, latency 0
        [virtual] Memory at fbda0000 (64-bit, non-prefetchable) [size=16K]
        [virtual] Memory at fbd80000 (64-bit, non-prefetchable) [size=16K]
        Capabilities: [70] MSI-X: Enable+ Count=3 Masked-
        Capabilities: [a0] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
        Kernel driver in use: igbvf

-bash-4.1# lspci -s 0a:11.4 -v
0a:11.4 Ethernet controller: Intel Corporation 82576 Virtual Function
(rev 01)
        Subsystem: Super Micro Computer Inc Device 10c9
        Flags: bus master, fast devsel, latency 0
        [virtual] Memory at fbdb8000 (64-bit, non-prefetchable) [size=16K]
        [virtual] Memory at fbd98000 (64-bit, non-prefetchable) [size=16K]
        Capabilities: [70] MSI-X: Enable+ Count=3 Masked-
        Capabilities: [a0] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
        Kernel driver in use: igbvf


> 
> > > 
> > > The following document describes the proposed interface and implementation
> > > of all the logic needed in order to achieve the functionality described 
> > > above.
> > > 
> > > MMIO areas
> > > ==========
> > > 
> > > Overview
> > > --------
> > > 
> > > On x86 systems certain regions of memory might be used in order to manage
> > > physical devices on the system. Access to this areas is critical for a
> > > PVH Dom0 in order to operate properly. Unlike previous PVH Dom0 implementation
> > > (PVHv1) that was setup with identity mappings of all the holes and reserved
> > > regions found in the memory map, this new implementation intents to map only
> > > what's actually needed by the Dom0.
> > 
> > And why was the previous approach not working?
> 
> Previous PVHv1 implementation would only identity map holes and reserved 
> areas in the guest memory map, or up to the 4GB boundary if the guest memory 
> map is smaller than 4GB. If a device has a BAR past the 4GB boundary for 
> example, it would not be identity mapped in the p2m. 
> 
> > > 
> > > Low 1MB
> > > -------
> > > 
> > > When booted with a legacy BIOS, the low 1MB contains firmware related data
> > > that should be identity mapped to the Dom0. This include the EBDA, video
> > > memory and possibly ROMs. All non RAM regions below 1MB will be identity
> > > mapped to the Dom0 so that it can access this data freely.
> > > 
> > > ACPI regions
> > > ------------
> > > 
> > > ACPI regions will be identity mapped to the Dom0, this implies regions with
> > > type 3 and 4 in the e820 memory map. Also, since some BIOS report incorrect
> > > memory maps, the top-level tables discovered by Xen (as listed in the
> > > {X/R}SDT) that are not on RAM regions will be mapped to Dom0.
> > > 
> > > PCI memory BARs
> > > ---------------
> > > 
> > > PCI devices discovered by Xen will have their BARs scanned in order to detect
> > > memory BARs, and those will be identity mapped to Dom0. Since BARs can be
> > > freely moved by the Dom0 OS by writing to the appropriate PCI config space
> > > register, Xen must trap those accesses and unmap the previous region and
> > > map the new one as set by Dom0.
> > 
> > You can make that simpler - we have hypercalls to "notify" in Linux
> > when a device is changing. Those can provide that information as well.
> > (This is what PV dom0 does).
> > 
> > Also you are missing one important part - the MMCFG. That is required
> > for Xen to be able to poke at the PCI configuration spaces (above the 256).
> > And you can only get the MMCFG if the ACPI DSDT has been parsed.
> 
> Hm, I guess I'm missing something, but at least on my hardware Xen seems to 
> be able to parse the MCFG ACPI table before Dom0 does anything with the 
> DSDT:
> 
> (XEN) PCI: MCFG configuration 0: base f8000000 segment 0000 buses 00 - 3f
> (XEN) PCI: MCFG area at f8000000 reserved in E820
> (XEN) PCI: Using MCFG for segment 0000 bus 00-3f
> 
> > So if you do the PCI bus scanning _before_ booting PVH dom0, you may
> > need to update your view of PCI devices after the MMCFG locations
> > have been provided to you.
> 
> I'm not opposed to keep the PHYSDEVOP_pci_mmcfg_reserved, but I still have 
> to see hardware where this is actually needed. Also, AFAICT, FreeBSD at 
> least is only able to detect MMCFG regions present in the MCFG ACPI table:

There is some hardware out there (I think I saw this with an IBM HS-20,
but I can't recall the details). The specification says that the MCFG
_may_ be defined in the MADT, but is not guaranteed. Which means that it
can bubble via the ACPI DSDT code.


* Re: [DRAFT RFC] PVHv2 interaction with physical devices
  2016-11-10 13:53     ` Konrad Rzeszutek Wilk
@ 2016-11-10 15:20       ` Roger Pau Monné
  2016-11-10 17:21         ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 18+ messages in thread
From: Roger Pau Monné @ 2016-11-10 15:20 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Andrew Cooper, Kelly, Julien Grall, Paul Durrant, Jan Beulich,
	xen-devel, Boris Ostrovsky

On Thu, Nov 10, 2016 at 08:53:05AM -0500, Konrad Rzeszutek Wilk wrote:
> On Thu, Nov 10, 2016 at 11:39:08AM +0100, Roger Pau Monné wrote:
> > On Wed, Nov 09, 2016 at 01:45:17PM -0500, Konrad Rzeszutek Wilk wrote:
> > > On Wed, Nov 09, 2016 at 04:59:12PM +0100, Roger Pau Monné wrote:
> > > > In order to improve the mapping of device memory areas, Xen will have to
> > > > know of those devices in advance (before Dom0 tries to interact with them)
> > > > so that the memory BARs will be properly mapped into Dom0 memory map.
> > > 
> > > Oh, that is going to be a problem with SR-IOV. Those are created _after_
> > > dom0 has booted. In fact they are done by the drivers themselves.
> > > 
> > > See xen_add_device in drivers/xen/pci.c how this is handled.
> > 
> > Is the process of creating those VFs something standard? (In the sense that 
> > it can be detected by Xen, and proper mappings established)
> 
> Yes and no.
> 
> You can read from the PCI configuration that the device (Physical
> function) has SR-IOV. But that information may be in the extended
> configuration registers so you need MCFG. Anyhow the only thing the PF
> will tell you is the BAR regions they will occupy (since they
> are behind the bridge) but not the BDFs:

But just knowing the BARs' positions is enough for Xen to install the identity 
mappings AFAICT?

Or are there more BARs that will only appear after the SR-IOV functionality 
has been enabled?

From the documentation that I've found, if you detect that the device has 
PCI_EXT_CAP_ID_SRIOV, you can then read the BARs and map them into Dom0, but 
maybe I'm missing something (and I have not been able to test this, although 
my previous PVHv2 Dom0 series already contained code in order to perform 
this):

http://xenbits.xen.org/gitweb/?p=people/royger/xen.git;a=commit;h=260cfd1e96e56ab4b58a414d544d92a77e210050

>         Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)
>                 IOVCap: Migration-, Interrupt Message Number: 000
>                 IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+
>                 IOVSta: Migration-
>                 Initial VFs: 8, Total VFs: 8, Number of VFs: 0, Function Dependency Link: 00
>                 VF offset: 128, stride: 2, Device ID: 10ca
>                 Supported Page Size: 00000553, System Page Size: 00000001
>                 Region 0: Memory at 00000000fbda0000 (64-bit, non-prefetchable)
>                 Region 3: Memory at 00000000fbd80000 (64-bit, non-prefetchable)
>                 VF Migration: offset: 00000000, BIR: 0
>         Kernel driver in use: igb
> 
> And if I enable SR-IOV on the PF I get:
> 
> 0a:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
> 0a:10.0 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
> 0a:10.2 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
> 0a:10.4 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
> 0a:10.6 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
> 0a:11.0 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
> 0a:11.2 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
> 0a:11.4 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
> 
> -bash-4.1# lspci -s 0a:10.0 -v
> 0a:10.0 Ethernet controller: Intel Corporation 82576 Virtual Function
> (rev 01)
>         Subsystem: Super Micro Computer Inc Device 10c9
>         Flags: bus master, fast devsel, latency 0
>         [virtual] Memory at fbda0000 (64-bit, non-prefetchable) [size=16K]
>         [virtual] Memory at fbd80000 (64-bit, non-prefetchable) [size=16K]
>         Capabilities: [70] MSI-X: Enable+ Count=3 Masked-
>         Capabilities: [a0] Express Endpoint, MSI 00
>         Capabilities: [100] Advanced Error Reporting
>         Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
>         Kernel driver in use: igbvf
> 
> -bash-4.1# lspci -s 0a:11.4 -v
> 0a:11.4 Ethernet controller: Intel Corporation 82576 Virtual Function
> (rev 01)
>         Subsystem: Super Micro Computer Inc Device 10c9
>         Flags: bus master, fast devsel, latency 0
>         [virtual] Memory at fbdb8000 (64-bit, non-prefetchable) [size=16K]
>         [virtual] Memory at fbd98000 (64-bit, non-prefetchable) [size=16K]
>         Capabilities: [70] MSI-X: Enable+ Count=3 Masked-
>         Capabilities: [a0] Express Endpoint, MSI 00
>         Capabilities: [100] Advanced Error Reporting
>         Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
>         Kernel driver in use: igbvf

So it seems that the memory for individual VFs is taken from the BARs listed 
inside of PCI_EXT_CAP_ID_SRIOV.
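
And the per-VF resources follow directly from those fields: each VF's slice of
VF BARn starts at the base shown in the SR-IOV capability plus the VF index
times the per-VF size, and the VF's routing ID is the PF's plus "VF offset"
plus the VF index times "stride". Roughly (illustrative only, the struct and
helpers are made up):

#include <stdint.h>

struct sriov_info {
    uint16_t vf_offset;        /* "VF offset" field (128 in the example above) */
    uint16_t vf_stride;        /* "stride" field (2 in the example above) */
    uint64_t vf_bar_base[6];   /* VF BARs from the SR-IOV capability */
    uint64_t vf_bar_size[6];   /* size of a single VF's slice of each BAR */
};

/* Start of VF number 'vf' within VF BAR 'bar'. */
static uint64_t vf_bar_addr(const struct sriov_info *s, unsigned int bar,
                            unsigned int vf)
{
    return s->vf_bar_base[bar] + (uint64_t)vf * s->vf_bar_size[bar];
}

/* Routing ID (bus/dev/func) of VF number 'vf'. */
static uint16_t vf_rid(uint16_t pf_rid, const struct sriov_info *s,
                       unsigned int vf)
{
    return pf_rid + s->vf_offset + vf * s->vf_stride;
}

(E.g. 0a:11.4 above is VF 6: 0xfbda0000 + 6 * 16K = 0xfbdb8000, which matches
the lspci output.)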

> > > > PCI memory BARs
> > > > ---------------
> > > > 
> > > > PCI devices discovered by Xen will have their BARs scanned in order to detect
> > > > memory BARs, and those will be identity mapped to Dom0. Since BARs can be
> > > > freely moved by the Dom0 OS by writing to the appropriate PCI config space
> > > > register, Xen must trap those accesses and unmap the previous region and
> > > > map the new one as set by Dom0.
> > > 
> > > You can make that simpler - we have hypercalls to "notify" in Linux
> > > when a device is changing. Those can provide that information as well.
> > > (This is what PV dom0 does).
> > > 
> > > Also you are missing one important part - the MMCFG. That is required
> > > for Xen to be able to poke at the PCI configuration spaces (above the 256).
> > > And you can only get the MMCFG if the ACPI DSDT has been parsed.
> > 
> > Hm, I guess I'm missing something, but at least on my hardware Xen seems to 
> > be able to parse the MCFG ACPI table before Dom0 does anything with the 
> > DSDT:
> > 
> > (XEN) PCI: MCFG configuration 0: base f8000000 segment 0000 buses 00 - 3f
> > (XEN) PCI: MCFG area at f8000000 reserved in E820
> > (XEN) PCI: Using MCFG for segment 0000 bus 00-3f
> > 
> > > So if you do the PCI bus scanning _before_ booting PVH dom0, you may
> > > need to update your view of PCI devices after the MMCFG locations
> > > have been provided to you.
> > 
> > I'm not opposed to keep the PHYSDEVOP_pci_mmcfg_reserved, but I still have 
> > to see hardware where this is actually needed. Also, AFAICT, FreeBSD at 
> > least is only able to detect MMCFG regions present in the MCFG ACPI table:
> 
> There is some hardware out there (I think I saw this with an IBM HS-20,
> but I can't recall the details). The specification says that the MCFG
> _may_ be defined in the MADT, but is not guaranteed. Which means that it
> can bubble via the ACPI DSDT code.

Hm, MCFG is a top-level table on its own, and AFAIK not tied to the MADT in 
any way. I'm not opposed to introducing PHYSDEVOP_pci_mmcfg_reserved if it's 
really needed, but I won't do this blindly. We first need to know whether there 
are systems out there that don't report MMCFG areas in the MCFG ACPI table 
properly, and then whether those systems would actually be capable of 
running a PVH Dom0 (if they are as old as the IBM HS-20 they won't be capable of 
running a PVH Dom0 due to missing virtualization features anyway).

Roger.


* Re: [DRAFT RFC] PVHv2 interaction with physical devices
  2016-11-10 10:39   ` Roger Pau Monné
  2016-11-10 13:53     ` Konrad Rzeszutek Wilk
@ 2016-11-10 16:37     ` Jan Beulich
  2016-11-10 17:19       ` Konrad Rzeszutek Wilk
  2016-11-16 16:42       ` Roger Pau Monné
  1 sibling, 2 replies; 18+ messages in thread
From: Jan Beulich @ 2016-11-10 16:37 UTC (permalink / raw)
  To: Roger Pau Monné, Konrad Rzeszutek Wilk
  Cc: Andrew Cooper, Kelly, Julien Grall, PaulDurrant, xen-devel,
	Boris Ostrovsky

>>> On 10.11.16 at 11:39, <roger.pau@citrix.com> wrote:
> On Wed, Nov 09, 2016 at 01:45:17PM -0500, Konrad Rzeszutek Wilk wrote:
>> On Wed, Nov 09, 2016 at 04:59:12PM +0100, Roger Pau Monné wrote:
>> > PCI memory BARs
>> > ---------------
>> > 
>> > PCI devices discovered by Xen will have their BARs scanned in order to detect
>> > memory BARs, and those will be identity mapped to Dom0. Since BARs can be
>> > freely moved by the Dom0 OS by writing to the appropriate PCI config space
>> > register, Xen must trap those accesses and unmap the previous region and
>> > map the new one as set by Dom0.
>> 
>> You can make that simpler - we have hypercalls to "notify" in Linux
>> when a device is changing. Those can provide that information as well.
>> (This is what PV dom0 does).
>> 
>> Also you are missing one important part - the MMCFG. That is required
>> for Xen to be able to poke at the PCI configuration spaces (above the 256).
>> And you can only get the MMCFG if the ACPI DSDT has been parsed.
> 
> Hm, I guess I'm missing something, but at least on my hardware Xen seems to 
> be able to parse the MCFG ACPI table before Dom0 does anything with the 
> DSDT:
> 
> (XEN) PCI: MCFG configuration 0: base f8000000 segment 0000 buses 00 - 3f
> (XEN) PCI: MCFG area at f8000000 reserved in E820

This is the crucial line: To guard against broken firmware, we - just
like Linux - require that the area be reserved in at least one of E820
or ACPI resources. We can check E820 ourselves, but we need
Dom0's AML parser for the other mechanism.
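
The E820 half of that check is simple enough (minimal sketch only; the
ACPI-resources half needs Dom0's AML parser and is what the hypercall is for):

#include <stdbool.h>
#include <stdint.h>

#define E820_RESERVED 2
struct e820entry { uint64_t start, end; uint32_t type; };

/* True if [base, base + size) is fully covered by a reserved E820 entry. */
static bool mmcfg_reserved_in_e820(const struct e820entry *map, unsigned int nr,
                                   uint64_t base, uint64_t size)
{
    for (unsigned int i = 0; i < nr; i++)
        if (map[i].type == E820_RESERVED &&
            map[i].start <= base && base + size <= map[i].end)
            return true;

    return false;
}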

>> So if you do the PCI bus scanning _before_ booting PVH dom0, you may
>> need to update your view of PCI devices after the MMCFG locations
>> have been provided to you.
> 
> I'm not opposed to keep the PHYSDEVOP_pci_mmcfg_reserved, but I still have 
> to see hardware where this is actually needed. Also, AFAICT, FreeBSD at 
> least is only able to detect MMCFG regions present in the MCFG ACPI table:
> 
> http://fxr.watson.org/fxr/source/dev/acpica/acpi.c?im=excerp#L1861 

Iirc the spec mandates only segment 0 to be represented in the
static table. Other segments may (and likely will) only have their
data available in AML.

>> > XXX: is it possible to have more than 256 GSIs?
>> 
>> Yeah. If you have enough of the IOAPICs you can have more than 256. But
> >> I don't think any OS has taken that into account as the GSI values are
>> always uint8_t.
> 
> Right, so AFAICT providing a single IO APIC with enough pins should be fine.

No, let's not even start with such an approach. Having seen (not really
huge) systems with well beyond 100 GSIs, I don't think it makes sense
to try to (temporarily) ease our lives slightly by introducing an
implementation limit here.

Jan


* Re: [DRAFT RFC] PVHv2 interaction with physical devices
  2016-11-10 16:37     ` Jan Beulich
@ 2016-11-10 17:19       ` Konrad Rzeszutek Wilk
  2016-11-16 16:42       ` Roger Pau Monné
  1 sibling, 0 replies; 18+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-11-10 17:19 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Kelly, Julien Grall, PaulDurrant, xen-devel,
	Boris Ostrovsky, Roger Pau Monné

On Thu, Nov 10, 2016 at 09:37:19AM -0700, Jan Beulich wrote:
> >>> On 10.11.16 at 11:39, <roger.pau@citrix.com> wrote:
> > On Wed, Nov 09, 2016 at 01:45:17PM -0500, Konrad Rzeszutek Wilk wrote:
> >> On Wed, Nov 09, 2016 at 04:59:12PM +0100, Roger Pau Monné wrote:
> >> > PCI memory BARs
> >> > ---------------
> >> > 
> >> > PCI devices discovered by Xen will have their BARs scanned in order to detect
> >> > memory BARs, and those will be identity mapped to Dom0. Since BARs can be
> >> > freely moved by the Dom0 OS by writing to the appropriate PCI config space
> >> > register, Xen must trap those accesses and unmap the previous region and
> >> > map the new one as set by Dom0.
> >> 
> >> You can make that simpler - we have hypercalls to "notify" in Linux
> >> when a device is changing. Those can provide that information as well.
> >> (This is what PV dom0 does).
> >> 
> >> Also you are missing one important part - the MMCFG. That is required
> >> for Xen to be able to poke at the PCI configuration spaces (above the 256).
> >> And you can only get the MMCFG if the ACPI DSDT has been parsed.
> > 
> > Hm, I guess I'm missing something, but at least on my hardware Xen seems to 
> > be able to parse the MCFG ACPI table before Dom0 does anything with the 
> > DSDT:
> > 
> > (XEN) PCI: MCFG configuration 0: base f8000000 segment 0000 buses 00 - 3f
> > (XEN) PCI: MCFG area at f8000000 reserved in E820
> 
> This is the crucial line: To guard against broken firmware, we - just
> like Linux - require that the area be reserved in at least one of E820
> or ACPI resources. We can check E820 ourselves, but we need
> Dom0's AML parser for the other mechanism.

And in fact I do have such a box!

When it boots:
(XEN) PCI: MCFG configuration 0: base e0000000 segment 0000 buses 00 - 3f
(XEN) PCI: Not using MCFG for segment 0000 bus 00-3f

.. and then later:

[    3.880750] NetLabel:  unlabeled traffic allowed by default
(XEN) PCI: Using MCFG for segment 0000 bus 00-3f

(when it gets the hypercall)
This is an Intel DQ67SW with SWQ6710H.86A.0066.2012.1105.1504 BIOS.

It is a Sandy Bridge motherboard.
> 
> >> So if you do the PCI bus scanning _before_ booting PVH dom0, you may
> >> need to update your view of PCI devices after the MMCFG locations
> >> have been provided to you.
> > 
> > I'm not opposed to keep the PHYSDEVOP_pci_mmcfg_reserved, but I still have 
> > to see hardware where this is actually needed. Also, AFAICT, FreeBSD at 

Here is the spec:
http://ark.intel.com/products/51997/Intel-Desktop-Board-DQ67SW



* Re: [DRAFT RFC] PVHv2 interaction with physical devices
  2016-11-10 15:20       ` Roger Pau Monné
@ 2016-11-10 17:21         ` Konrad Rzeszutek Wilk
  2016-11-11 10:04           ` Jan Beulich
  0 siblings, 1 reply; 18+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-11-10 17:21 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Andrew Cooper, Kelly, Julien Grall, Paul Durrant, Jan Beulich,
	xen-devel, Boris Ostrovsky

On Thu, Nov 10, 2016 at 04:20:34PM +0100, Roger Pau Monné wrote:
> On Thu, Nov 10, 2016 at 08:53:05AM -0500, Konrad Rzeszutek Wilk wrote:
> > On Thu, Nov 10, 2016 at 11:39:08AM +0100, Roger Pau Monné wrote:
> > > On Wed, Nov 09, 2016 at 01:45:17PM -0500, Konrad Rzeszutek Wilk wrote:
> > > > On Wed, Nov 09, 2016 at 04:59:12PM +0100, Roger Pau Monné wrote:
> > > > > In order to improve the mapping of device memory areas, Xen will have to
> > > > > know of those devices in advance (before Dom0 tries to interact with them)
> > > > > so that the memory BARs will be properly mapped into Dom0 memory map.
> > > > 
> > > > Oh, that is going to be a problem with SR-IOV. Those are created _after_
> > > > dom0 has booted. In fact they are done by the drivers themselves.
> > > > 
> > > > See xen_add_device in drivers/xen/pci.c how this is handled.
> > > 
> > > Is the process of creating those VFs something standard? (In the sense that 
> > > it can be detected by Xen, and proper mappings established)
> > 
> > Yes and no.
> > 
> > You can read from the PCI configuration that the device (Physical
> > function) has SR-IOV. But that information may be in the extended
> > configuration registers so you need MCFG. Anyhow the only thing the PF
> > will tell you is the BAR regions they will occupy (since they
> > are behind the bridge) but not the BDFs:
> 
> But just knowing the BARs' positions is enough for Xen to install the identity 
> mappings AFAICT?
> 
> Or are there more BARs that will only appear after the SR-IOV functionality 
> has been enabled?
> 
> >From the documentation that I've found, if you detect that the device has 
> PCI_EXT_CAP_ID_SRIOV, you can then read the BARs and map them into Dom0, but 
> maybe I'm missing something (and I have not been able to test this, although 
> my previous PVHv2 Dom0 series already contained code in order to perform 
> this):
> 
> http://xenbits.xen.org/gitweb/?p=people/royger/xen.git;a=commit;h=260cfd1e96e56ab4b58a414d544d92a77e210050
> 
> >         Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)
> >                 IOVCap: Migration-, Interrupt Message Number: 000
> >                 IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+
> >                 IOVSta: Migration-
> >                 Initial VFs: 8, Total VFs: 8, Number of VFs: 0, Function Dependency Link: 00
> >                 VF offset: 128, stride: 2, Device ID: 10ca
> >                 Supported Page Size: 00000553, System Page Size: 00000001
> >                 Region 0: Memory at 00000000fbda0000 (64-bit, non-prefetchable)
> >                 Region 3: Memory at 00000000fbd80000 (64-bit, non-prefetchable)
> >                 VF Migration: offset: 00000000, BIR: 0
> >         Kernel driver in use: igb
> > 
> > And if I enable SR-IOV on the PF I get:
> > 
> > 0a:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
> > 0a:10.0 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
> > 0a:10.2 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
> > 0a:10.4 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
> > 0a:10.6 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
> > 0a:11.0 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
> > 0a:11.2 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
> > 0a:11.4 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
> > 
> > -bash-4.1# lspci -s 0a:10.0 -v
> > 0a:10.0 Ethernet controller: Intel Corporation 82576 Virtual Function
> > (rev 01)
> >         Subsystem: Super Micro Computer Inc Device 10c9
> >         Flags: bus master, fast devsel, latency 0
> >         [virtual] Memory at fbda0000 (64-bit, non-prefetchable) [size=16K]
> >         [virtual] Memory at fbd80000 (64-bit, non-prefetchable) [size=16K]
> >         Capabilities: [70] MSI-X: Enable+ Count=3 Masked-
> >         Capabilities: [a0] Express Endpoint, MSI 00
> >         Capabilities: [100] Advanced Error Reporting
> >         Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
> >         Kernel driver in use: igbvf
> > 
> > -bash-4.1# lspci -s 0a:11.4 -v
> > 0a:11.4 Ethernet controller: Intel Corporation 82576 Virtual Function
> > (rev 01)
> >         Subsystem: Super Micro Computer Inc Device 10c9
> >         Flags: bus master, fast devsel, latency 0
> >         [virtual] Memory at fbdb8000 (64-bit, non-prefetchable) [size=16K]
> >         [virtual] Memory at fbd98000 (64-bit, non-prefetchable) [size=16K]
> >         Capabilities: [70] MSI-X: Enable+ Count=3 Masked-
> >         Capabilities: [a0] Express Endpoint, MSI 00
> >         Capabilities: [100] Advanced Error Reporting
> >         Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
> >         Kernel driver in use: igbvf
> 
> So it seems that the memory for individual VFs is taken from the BARs listed 
> inside of PCI_EXT_CAP_ID_SRIOV.

Yup! I think that is right, as the BIOS also enables SR-IOV to figure out how
many bus addresses to reserve for the PCIe device - and then turns it off.
(I know this as I had a motherboard with a half-broken implementation that booted
into the OS with the VFs already there).


* Re: [DRAFT RFC] PVHv2 interaction with physical devices
  2016-11-10 17:21         ` Konrad Rzeszutek Wilk
@ 2016-11-11 10:04           ` Jan Beulich
  2016-11-16 16:49             ` Roger Pau Monné
  0 siblings, 1 reply; 18+ messages in thread
From: Jan Beulich @ 2016-11-11 10:04 UTC (permalink / raw)
  To: roger.pau
  Cc: Andrew Cooper, Kelly, Julien Grall, Paul Durrant, xen-devel,
	Boris Ostrovsky

>>> On 10.11.16 at 18:21, <konrad.wilk@oracle.com> wrote:
> On Thu, Nov 10, 2016 at 04:20:34PM +0100, Roger Pau Monné wrote:
>> > 0a:11.4 Ethernet controller: Intel Corporation 82576 Virtual Function
>> > (rev 01)
>> >         Subsystem: Super Micro Computer Inc Device 10c9
>> >         Flags: bus master, fast devsel, latency 0
>> >         [virtual] Memory at fbdb8000 (64-bit, non-prefetchable) [size=16K]
>> >         [virtual] Memory at fbd98000 (64-bit, non-prefetchable) [size=16K]
>> >         Capabilities: [70] MSI-X: Enable+ Count=3 Masked-
>> >         Capabilities: [a0] Express Endpoint, MSI 00
>> >         Capabilities: [100] Advanced Error Reporting
>> >         Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
>> >         Kernel driver in use: igbvf
>> 
>> So it seems that the memory for individual VFs is taken from the BARs listed 
>> inside of PCI_EXT_CAP_ID_SRIOV.
> 
> > Yup! I think that is right, as the BIOS also enables SR-IOV to figure out how
> > many bus addresses to reserve for the PCIe device - and then turns it off.
> > (I know this as I had a motherboard with a half-broken implementation that booted
> > into the OS with the VFs already there).

But remember that in the common case you won't be able to access
the SR-IOV capability structure before launching Dom0 (as being
located in extended config space).

Jan


* Re: [DRAFT RFC] PVHv2 interaction with physical devices
  2016-11-10 16:37     ` Jan Beulich
  2016-11-10 17:19       ` Konrad Rzeszutek Wilk
@ 2016-11-16 16:42       ` Roger Pau Monné
  2016-11-17 10:43         ` Jan Beulich
  1 sibling, 1 reply; 18+ messages in thread
From: Roger Pau Monné @ 2016-11-16 16:42 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Kelly, Julien Grall, PaulDurrant, xen-devel,
	Boris Ostrovsky

On Thu, Nov 10, 2016 at 09:37:19AM -0700, Jan Beulich wrote:
> >>> On 10.11.16 at 11:39, <roger.pau@citrix.com> wrote:
> > On Wed, Nov 09, 2016 at 01:45:17PM -0500, Konrad Rzeszutek Wilk wrote:
> >> On Wed, Nov 09, 2016 at 04:59:12PM +0100, Roger Pau Monné wrote:
> >> > PCI memory BARs
> >> > ---------------
> >> > 
> >> > PCI devices discovered by Xen will have their BARs scanned in order to detect
> >> > memory BARs, and those will be identity mapped to Dom0. Since BARs can be
> >> > freely moved by the Dom0 OS by writing to the appropriate PCI config space
> >> > register, Xen must trap those accesses and unmap the previous region and
> >> > map the new one as set by Dom0.
> >> 
> >> You can make that simpler - we have hypercalls to "notify" in Linux
> >> when a device is changing. Those can provide that information as well.
> >> (This is what PV dom0 does).
> >> 
> >> Also you are missing one important part - the MMCFG. That is required
> >> for Xen to be able to poke at the PCI configuration spaces (above the 256).
> >> And you can only get the MMCFG if the ACPI DSDT has been parsed.
> > 
> > Hm, I guess I'm missing something, but at least on my hardware Xen seems to 
> > be able to parse the MCFG ACPI table before Dom0 does anything with the 
> > DSDT:
> > 
> > (XEN) PCI: MCFG configuration 0: base f8000000 segment 0000 buses 00 - 3f
> > (XEN) PCI: MCFG area at f8000000 reserved in E820
> 
> This is the crucial line: To guard against broken firmware, we - just
> like Linux - require that the area be reserved in at least one of E820
> or ACPI resources. We can check E820 ourselves, but we need
> Dom0's AML parser for the other mechanism.
> 
> >> So if you do the PCI bus scanning _before_ booting PVH dom0, you may
> >> need to update your view of PCI devices after the MMCFG locations
> >> have been provided to you.
> > 
> > I'm not opposed to keep the PHYSDEVOP_pci_mmcfg_reserved, but I still have 
> > to see hardware where this is actually needed. Also, AFAICT, FreeBSD at 
> > least is only able to detect MMCFG regions present in the MCFG ACPI table:
> > 
> > http://fxr.watson.org/fxr/source/dev/acpica/acpi.c?im=excerp#L1861 
> 
> Iirc the spec mandates only segment 0 to be represented in the
> static table. Other segments may (and likely will) only have their
> data available in AML.

I don't mind leaving the PHYSDEVOP_pci_mmcfg_reserved hypercall, but it _must_ 
be issued before Dom0 tries to actually access the MCFG area, or else it won't 
be mapped into Dom0 p2m.
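
For reference, the Dom0 side of this is just the existing physdev op; something
along these lines (sketch only; the struct layout and op number are from
memory, so check the public headers before relying on them):

#include <stdint.h>

/* Layout as I remember it from xen/include/public/physdev.h. */
struct physdev_pci_mmcfg_reserved {
    uint64_t address;            /* base of the MMCFG area */
    uint16_t segment;
    uint8_t  start_bus, end_bus;
    uint32_t flags;
};

#define PHYSDEVOP_pci_mmcfg_reserved 24   /* from memory, double check */

/* Placeholder wrapper around the real hypercall mechanism. */
int hypercall_physdev_op(unsigned int cmd, void *arg);

static int report_mmcfg(uint64_t base, uint16_t seg, uint8_t first, uint8_t last)
{
    struct physdev_pci_mmcfg_reserved r = {
        .address = base, .segment = seg,
        .start_bus = first, .end_bus = last,
        .flags = 1, /* area is reserved per the firmware, iirc */
    };

    return hypercall_physdev_op(PHYSDEVOP_pci_mmcfg_reserved, &r);
}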

> >> > XXX: is it possible to have more than 256 GSIs?
> >> 
> >> Yeah. If you have enough of the IOAPICs you can have more than 256. But
> >> I don't think any OS has taken that into account as the GSI values are
> >> always uint8_t.
> > 
> > Right, so AFAICT providing a single IO APIC with enough pins should be fine.
> 
> No, let's not even start with such an approach. Having seen (not really
> huge) systems with well beyond 100 GSIs, I don't think it makes sense
> to try to (temporarily) ease our lives slightly by introducing an
> implementation limit here.

OK, the only limit I see here is that ACPI GSI numbers are encoded in a double 
word in _PRT objects, so there can theoretically be systems out there with up to 
2^32 GSIs. I very much doubt there are any systems out there with more than 256 
GSIs, but better safe than sorry.

I assume that temporarily limiting PVHv2 Dom0 support to systems with only 
1 IO APIC is not going to be accepted, right?

Roger.


* Re: [DRAFT RFC] PVHv2 interaction with physical devices
  2016-11-11 10:04           ` Jan Beulich
@ 2016-11-16 16:49             ` Roger Pau Monné
  2016-11-17 10:46               ` Jan Beulich
  0 siblings, 1 reply; 18+ messages in thread
From: Roger Pau Monné @ 2016-11-16 16:49 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Andrew Cooper, Kelly, Julien Grall, Paul Durrant, xen-devel,
	Boris Ostrovsky

On Fri, Nov 11, 2016 at 03:04:49AM -0700, Jan Beulich wrote:
> >>> On 10.11.16 at 18:21, <konrad.wilk@oracle.com> wrote:
> > On Thu, Nov 10, 2016 at 04:20:34PM +0100, Roger Pau Monné wrote:
> >> > 0a:11.4 Ethernet controller: Intel Corporation 82576 Virtual Function
> >> > (rev 01)
> >> >         Subsystem: Super Micro Computer Inc Device 10c9
> >> >         Flags: bus master, fast devsel, latency 0
> >> >         [virtual] Memory at fbdb8000 (64-bit, non-prefetchable) [size=16K]
> >> >         [virtual] Memory at fbd98000 (64-bit, non-prefetchable) [size=16K]
> >> >         Capabilities: [70] MSI-X: Enable+ Count=3 Masked-
> >> >         Capabilities: [a0] Express Endpoint, MSI 00
> >> >         Capabilities: [100] Advanced Error Reporting
> >> >         Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
> >> >         Kernel driver in use: igbvf
> >> 
> >> So it seems that the memory for individual VFs is taken from the BARs listed 
> >> inside of PCI_EXT_CAP_ID_SRIOV.
> > 
> > Yup! I think that is right as the BIOS also enable SR-IOV to figure out how
> > many bus addresses to reserve for the PCIe device - and then it turn it off.
> > (I know this as I had a motherboard with half-broken implemention that booted
> > in OS with VFs already there).
> 
> But remember that in the common case you won't be able to access
> the SR-IOV capability structure before launching Dom0 (as being
> located in extended config space).

Since we need PHYSDEVOP_pci_mmcfg_reserved anyway, the newly added bus will be 
scanned in order to find devices, and those SR-IOV BARs will then be added 
into the Dom0 memory map.

Roger.


* Re: [DRAFT RFC] PVHv2 interaction with physical devices
  2016-11-16 16:42       ` Roger Pau Monné
@ 2016-11-17 10:43         ` Jan Beulich
  0 siblings, 0 replies; 18+ messages in thread
From: Jan Beulich @ 2016-11-17 10:43 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Andrew Cooper, Kelly, Julien Grall, PaulDurrant, xen-devel,
	Boris Ostrovsky

>>> On 16.11.16 at 17:42, <roger.pau@citrix.com> wrote:
> I assume that temporarily limiting PVHv2 Dom0 support to systems with only 
> 1 IO APIC is not going to be accepted, right?

Well, as long as it's experimental (and the respective code clearly
marked with fixme annotations) that would be acceptable imo.

Jan



* Re: [DRAFT RFC] PVHv2 interaction with physical devices
  2016-11-16 16:49             ` Roger Pau Monné
@ 2016-11-17 10:46               ` Jan Beulich
  0 siblings, 0 replies; 18+ messages in thread
From: Jan Beulich @ 2016-11-17 10:46 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: Andrew Cooper, Kelly, Julien Grall, Paul Durrant, xen-devel,
	Boris Ostrovsky

>>> On 16.11.16 at 17:49, <roger.pau@citrix.com> wrote:
> On Fri, Nov 11, 2016 at 03:04:49AM -0700, Jan Beulich wrote:
>> >>> On 10.11.16 at 18:21, <konrad.wilk@oracle.com> wrote:
>> > On Thu, Nov 10, 2016 at 04:20:34PM +0100, Roger Pau Monné wrote:
>> >> > 0a:11.4 Ethernet controller: Intel Corporation 82576 Virtual Function
>> >> > (rev 01)
>> >> >         Subsystem: Super Micro Computer Inc Device 10c9
>> >> >         Flags: bus master, fast devsel, latency 0
>> >> >         [virtual] Memory at fbdb8000 (64-bit, non-prefetchable) [size=16K]
>> >> >         [virtual] Memory at fbd98000 (64-bit, non-prefetchable) [size=16K]
>> >> >         Capabilities: [70] MSI-X: Enable+ Count=3 Masked-
>> >> >         Capabilities: [a0] Express Endpoint, MSI 00
>> >> >         Capabilities: [100] Advanced Error Reporting
>> >> >         Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
>> >> >         Kernel driver in use: igbvf
>> >> 
>> >> So it seems that the memory for individual VFs is taken from the BARs listed 
>> >> inside of PCI_EXT_CAP_ID_SRIOV.
>> > 
>> > Yup! I think that is right, as the BIOS also enables SR-IOV to figure out how
>> > many bus addresses to reserve for the PCIe device - and then turns it off.
>> > (I know this as I had a motherboard with a half-broken implementation that booted
>> > into the OS with the VFs already there).
>> 
>> But remember that in the common case you won't be able to access
>> the SR-IOV capability structure before launching Dom0 (as being
>> located in extended config space).
> 
> Since we need PHYSDEVOP_pci_mmcfg_reserved anyway, the newly added bus will be 
> scanned in order to find devices, and those SR-IOV BARs will then be added 
> into the Dom0 memory map.

You mean you want to rescan bus ranges when that hypercall gets
issued, even when that range had been scanned already? Doable,
but you'll need to carefully handle possible changes you observe on
the 2nd scan compared to the 1st.
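
As a rough illustration of the "handle the differences" part (everything below
is made up for the example): remember what was mapped on the first pass and
only touch the p2m where the second pass disagrees.

#include <stdbool.h>
#include <stdint.h>

struct bar_state { uint64_t addr, size; bool valid; };

/* Placeholders for the real p2m operations. */
void p2m_identity_map(uint64_t first_gfn, uint64_t nr_pages);
void p2m_unmap(uint64_t first_gfn, uint64_t nr_pages);

/* Reconcile one BAR's recorded state with what the rescan found. */
static void sync_bar(struct bar_state *old, const struct bar_state *cur)
{
    bool moved = old->valid && cur->valid &&
                 (old->addr != cur->addr || old->size != cur->size);

    if (old->valid && (!cur->valid || moved))
        p2m_unmap(old->addr >> 12, old->size >> 12);        /* drop stale map */
    if (cur->valid && (!old->valid || moved))
        p2m_identity_map(cur->addr >> 12, cur->size >> 12); /* add new map */

    *old = *cur;
}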

Jan

