[DRAFT RFC] PVHv2 interaction with physical devices

* [DRAFT RFC] PVHv2 interaction with physical devices
@ 2016-11-09 15:59 Roger Pau Monné
  2016-11-09 18:45 ` Konrad Rzeszutek Wilk
  2016-11-09 18:51 ` Andrew Cooper
  0 siblings, 2 replies; 18+ messages in thread
From: Roger Pau Monné @ 2016-11-09 15:59 UTC (permalink / raw)
  To: xen-devel
  Cc: Andrew Cooper, Kelly, Julien Grall, Paul Durrant, Jan Beulich,
	Boris Ostrovsky, Zytaruk

Hello,

I'm attaching a draft of how a PVHv2 Dom0 is supposed to interact with 
physical devices, and what needs to be done inside of Xen in order to 
achieve it. Current draft is RFC because I'm quite sure I'm missing bits 
that should be written down here. So far I've tried to describe what my 
previous series attempted to do by adding a bunch of IO and memory space 
handlers.

Please note that this document only applies to PVHv2 Dom0, it is not 
applicable to untrusted domains that will need more handlers in order to 
secure Xen and other domains running on the same system. The idea is that 
this can be expanded to untrusted domains also in the long term, thus having 
a single set of IO and memory handlers for passed-through devices.

Roger.

---8<---

This document describes how a PVHv2 Dom0 is supposed to interact with physical
devices.

Architecture
============

Purpose
-------

Previous Dom0 implementations have always used PIRQs (physical interrupts
routed over event channels) in order to receive events from physical devices.
This prevents Dom0 form taking advantage of new hardware virtualization
features, like posted interrupts or hardware virtualized local APIC. Also the
current device memory management in the PVH Dom0 implementation is lacking,
and might not support devices that have memory regions past the 4GB 
boundary.

The new PVH implementation (PVHv2) should overcome the interrupt limitations by
providing the same interface that's used on bare metal (a local and IO APICs)
thus allowing the usage of advanced hardware assisted virtualization
techniques. This also aligns with the trend on the hardware industry to
move part of the emulation into the silicon itself.

In order to improve the mapping of device memory areas, Xen will have to
know of those devices in advance (before Dom0 tries to interact with them)
so that the memory BARs will be properly mapped into Dom0 memory map.

The following document describes the proposed interface and implementation
of all the logic needed in order to achieve the functionality described 
above.

MMIO areas
==========

Overview
--------

On x86 systems certain regions of memory might be used in order to manage
physical devices on the system. Access to this areas is critical for a
PVH Dom0 in order to operate properly. Unlike previous PVH Dom0 implementation
(PVHv1) that was setup with identity mappings of all the holes and reserved
regions found in the memory map, this new implementation intents to map only
what's actually needed by the Dom0.

Low 1MB
-------

When booted with a legacy BIOS, the low 1MB contains firmware related data
that should be identity mapped to the Dom0. This include the EBDA, video
memory and possibly ROMs. All non RAM regions below 1MB will be identity
mapped to the Dom0 so that it can access this data freely.

ACPI regions
------------

ACPI regions will be identity mapped to the Dom0, this implies regions with
type 3 and 4 in the e820 memory map. Also, since some BIOS report incorrect
memory maps, the top-level tables discovered by Xen (as listed in the
{X/R}SDT) that are not on RAM regions will be mapped to Dom0.

PCI memory BARs
---------------

PCI devices discovered by Xen will have it's BARs scanned in order to detect
memory BARs, and those will be identity mapped to Dom0. Since BARs can be
freely moved by the Dom0 OS by writing to the appropriate PCI config space
register, Xen must trap those accesses and unmap the previous region and
map the new one as set by Dom0.

Limitations
-----------

 - Xen needs to be aware of any PCI device before Dom0 tries to interact with
   it, so that the MMIO regions are properly mapped.

Interrupt management
====================

Overview
--------

On x86 systems there are tree different mechanisms that can be used in order
to deliver interrupts: IO APIC, MSI and MSI-X. Note that each device might
support different methods, but those are never active at the same time.

Legacy PCI interrupts
---------------------

The only way to deliver legacy PCI interrupts to PVHv2 guests is using the
IO APIC, PVHv2 domains don't have an emulated PIC. As a consequence the ACPI
_PIC method must be set to APIC mode by the Dom0 OS.

Xen will always provide a single IO APIC, that will match the number of
possible GSIs of the underlying hardware. This is possible because ACPI
uses a system cookie in order to name interrupts, so the IO APIC device ID
or pin number is not used in _PTR methods.

XXX: is it possible to have more than 256 GSIs?

The binding between the underlying physical interrupt and the emulated
interrupt is performed when unmasking an IO APIC PIN, so writes to the
IOREDTBL registers that unset the mask bit will trigger this binding
and enable the interrupt.

MSI Interrupts
--------------

MSI interrupts are setup using the PCI config space, either the IO ports
or the memory mapped configuration area. This means that both spaces should
be trapped by Xen, in order to detect accesses to these registers and
properly emulate them.

Since the offset of the MSI registers is not fixed, Xen has to query the
PCI configuration space in order to find the offset of the PCI_CAP_ID_MSI,
and then setup the correct traps, which also vary depending on the
capabilities of the device. The following list contains the set of MSI
registers that Xen will trap, please take into account that some devices
might only implement a subset of those registers, so not all traps will
be used:

 - Message control register (offset 2): Xen traps accesses to this register,
   and stores the data written to it into an internal structure. When the OS
   sets the MSI enable bit (offset 0) Xen will setup the configured MSI
   interrupts and route them to the guest.

 - Message address register (offset 4): writes and reads to this register are
   trapped by Xen, and the value is stored into an internal structure. This is
   later used when MSI are enabled in order to configure the vectors injected
   to the guest. Writes to this register with MSI already enabled will cause
   a reconfiguration of the binding of interrupts to the guest.

 - Message data register (offset 8 or 12 if message address is 64bits): writes
   and reads to this register are trapped by Xen, and the value is stored into
   an internal structure. This is used when MSI are enabled in order to
   configure the vector where the guests expects to receive those interrupts.
   Writes to this register with MSI already enabled will cause a
   reconfiguration of the binding of interrupts to the guest.

 - Mask and pending bits: reads or writes to those registers are not trapped
   by Xen.

MSI-X Interrupts
----------------

MSI-X in contrast with MSI has part of the configuration registers in the
PCI configuration space, while others reside inside of the memory BARs of the
device. So in this case Xen needs to setup traps for both the PCI
configuration space and two different memory regions. Xen has to query the
position of the MSI-X capability using the PCI_CAP_ID_MSIX, and setup a
handler in order to trap accesses to the different registers. Xen also has
to figure out the position of the MSI-X table and PBA, using the table BIR
and table offset, and the PBA BIR and PBA offset. Once those are known a
handler should also be setup in order to trap accesses to those memory 
regions.

This is the list of MSI-X registers that are used in order to manage MSI-X
in the PCI configuration space:

 - Message control: Xen should trap accesses to this register in order to
   detect changes to the MSI-X enable field (bit 15). Changes to this bit
   will trigger the setup of the MSI-X table entries configured. Writes
   to the function mask bit will be passed-through to the underlying
   register.

 - Table offset, table BIR, PBA offset, PBA BIR: accesses to those registers
   are not trapped by Xen.

The following registers reside in memory, and are pointed out by the Table and
PBA fields found in the PCI configuration space:

 - Message address and data: writes and reads to those registers are trapped
   by Xen, and the value is stored into an internal structure. This is later
   used by Xen in order to configure the interrupt injected to the guest.
   Writes to those registers with MSI-X already enabled will not cause a
   reconfiguration of the interrupt.

 - Vector control: writes and reads are trapped, clearing the mask bit (bit 0)
   will cause Xen to setup the configured interrupt if MSI-X is globally
   enabled in the message control field.

 - Pending bits array: writes and reads to this register are not trapped by
   Xen.

Limitations
-----------

 - Due to the fact that Dom0 is not able to parse dynamic ACPI tables,
   some UART devices might only function in polling mode, because Xen
   will be unable to properly configure the interrupt pins without Dom0
   collaboration, and the UART in use by Xen should be explicitly blacklisted
   from Dom0 access.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 18+ messages in thread