* [RFC] ARM PCI Passthrough design document
From: Julien Grall @ 2017-05-26 17:14 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: edgar.iglesias, punit.agrawal, Wei Chen, Steve Capper,
	Andre Przywara, manish.jaggi, Julien Grall, vikrams, okaya, Goel,
	Sameer, xen-devel, Dave P Martin, Vijaya Kumar K, roger.pau

Hi all,

The document below is an RFC version of a design proposal for PCI
Passthrough in Xen on ARM. It aims to describe, from a high-level perspective,
the interaction with the different subsystems and how guests will be able
to discover and access PCI devices.

Currently on ARM, Xen does not have any knowledge about PCI devices. This
means that the IOMMU and interrupt controllers (such as the ITS) requiring
specific configuration will not work with PCI, even for DOM0.

The PCI Passthrough work can be divided into 2 phases:
        * Phase 1: Register all PCI devices in Xen => will allow
                   to use ITS and SMMU with PCI in Xen
        * Phase 2: Assign devices to guests

This document aims to describe the 2 phases, but for now only phase
1 is fully described.


I think I was able to gather all of the feedback and come up with a solution
that will satisfy all the parties. The design document has changed quite a lot
compared to the early draft sent a few months ago. The major changes are:
	* Provide more details on how PCI works on ARM and the interactions with
	the MSI controller and the IOMMU
	* Provide details on the existing host bridge implementations
	* Give more explanation and justification of the approach chosen
	* Describe the hypercalls used and how they should be called

Feedback is welcome.

Cheers,

--------------------------------------------------------------------------------

% PCI pass-through support on ARM
% Julien Grall <julien.grall@linaro.org>
% Draft B

# Preface

This document aims to describe the components required to enable PCI
pass-through on ARM.

This is an early draft and some questions are still unanswered. When this is
the case, the text will contain XXX.

# Introduction

PCI pass-through allows the guest to receive full control of physical PCI
devices. This means the guest will have full and direct access to the PCI
device.

On ARM, Xen supports a kind of guest that exploits the hardware virtualization
support as much as possible. The guest relies on PV drivers only for I/O
(e.g block, network), and interrupts come through the virtualized interrupt
controller, therefore there are no big changes required within the kernel.

As a consequence, it would be possible to replace PV drivers by assigning real
devices to the guest for I/O access. Xen on ARM would therefore be able to
run unmodified operating systems.

To achieve this goal, it looks more sensible to go towards emulating the
host bridge (there will be more details later). A guest would be able to take
advantage of the firmware tables, obviating the need for a specific driver
for Xen.

Thus, in this document we follow the emulated host bridge approach.

# PCI terminology

Each PCI device under a host bridge is uniquely identified by its Requester ID
(AKA RID). A Requester ID is a triplet of Bus number, Device number, and
Function.

When the platform has multiple host bridges, the software can add a fourth
number called Segment (sometimes called Domain) to differentiate host bridges.
A PCI device will then be uniquely identified by segment:bus:device:function
(AKA SBDF).

So given a specific SBDF, it would be possible to find the host bridge and the
RID associated to a PCI device. The pair (host bridge, RID) will often be used
to find the relevant information for configuring the different subsystems (e.g
IOMMU, MSI controller). For convenience, the rest of the document will use
SBDF to refer to the pair (host bridge, RID).
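
For illustration, below is a minimal sketch of how an SBDF could be packed and
unpacked. The helper names are hypothetical; the layout follows the usual PCI
convention (bus: 8 bits, device: 5 bits, function: 3 bits), so the BDF portion
is exactly the 16-bit Requester ID.

    /* Illustrative sketch only, not an existing Xen interface. */
    #include <stdint.h>

    #define PCI_DEVFN(dev, fn)  ((((dev) & 0x1f) << 3) | ((fn) & 0x07))
    #define PCI_BDF(bus, devfn) ((((bus) & 0xff) << 8) | ((devfn) & 0xff))

    static inline uint32_t sbdf_pack(uint16_t seg, uint8_t bus, uint8_t devfn)
    {
        /* Segment in the top 16 bits, BDF (i.e the RID) in the bottom 16. */
        return ((uint32_t)seg << 16) | PCI_BDF(bus, devfn);
    }

    static inline uint16_t sbdf_to_rid(uint32_t sbdf)
    {
        return sbdf & 0xffff; /* the Requester ID is the BDF portion */
    }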

# PCI host bridge

A PCI host bridge enables data transfer between a host processor and PCI bus
based devices. The bridge is used to access the configuration space of each
PCI device and, on some platforms, may also act as an MSI controller.

## Initialization of the PCI host bridge

Whilst it would be expected that the bootloader takes care of initializing
the PCI host bridge, on some platforms it is done in the Operating System.

This may include enabling/configuring the clocks that could be shared among
multiple devices.

## Accessing PCI configuration space

Accessing the PCI configuration space can be divided into 2 categories:
    * Indirect access, where the configuration spaces are multiplexed. An
    example would be the legacy method on x86 (e.g 0xcf8 and 0xcfc). On ARM a
    similar method is used by the R-Car PCIe root complex (see [12]).
    * ECAM access, where each configuration space has its own address space.

Whilst ECAM is a standard, some PCI host bridges will require specific fiddling
when accessing the registers (see thunder-ecam [13]).

In most cases, accessing all the PCI configuration spaces under a given PCI
host bridge will be done the same way (i.e either indirect access or ECAM
access). However, there are a few cases, dependent on the PCI devices accessed,
which will use different methods (see thunder-pem [14]).
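
As an illustration of the ECAM case, the sketch below computes a configuration
space address from the base address found in the firmware tables. The offset
layout (1MB per bus, 32KB per device, 4KB per function) is the one defined by
the PCIe specification; the helpers themselves are hypothetical.

    /* Minimal ECAM access sketch, assuming a fully ECAM-compliant bridge. */
    #include <stdint.h>

    static inline uint64_t ecam_offset(uint8_t bus, uint8_t dev, uint8_t fn,
                                       uint16_t reg)
    {
        return ((uint64_t)bus << 20) | ((uint64_t)(dev & 0x1f) << 15) |
               ((uint64_t)(fn & 0x7) << 12) | (reg & 0xfff);
    }

    static inline uint32_t ecam_read32(volatile uint8_t *ecam_base, uint8_t bus,
                                       uint8_t dev, uint8_t fn, uint16_t reg)
    {
        /* ecam_base is the (mapped) base address described by the firmware. */
        return *(volatile uint32_t *)(ecam_base +
                                      ecam_offset(bus, dev, fn, reg));
    }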

## Generic host bridge

For the purpose of this document, the term "generic host bridge" will be used
to describe any ECAM-compliant host bridge whose initialization, if required,
has already been done by the firmware/bootloader.

# Interaction of the PCI subsystem with other subsystems

In order to have a PCI device fully working, Xen will need to configure
other subsystems such as the IOMMU and the Interrupt Controller.

The interaction expected between the PCI subsystem and the other subsystems is:
    * Add a device
    * Remove a device
    * Assign a device to a guest
    * Deassign a device from a guest

XXX: Detail the interaction when assigning/deassigning device

In the following subsections, the interactions will be briefly described from a
higher level perspective. However, implementation details such as callbacks,
structures, etc. are beyond the scope of this document.

## IOMMU

The IOMMU will be used to isolate the PCI device when accessing memory (e.g
DMA and MSI doorbells). Often the IOMMU will be configured using a MasterID
(aka StreamID for the ARM SMMU) that can be deduced from the SBDF with the help
of the firmware tables (see below).

Whilst in theory all the memory transactions issued by a PCI device should
go through the IOMMU, on certain platforms some of the memory transactions may
not reach the IOMMU because they are interpreted by the host bridge. For
instance, this could happen if the MSI doorbell is built into the PCI host
bridge or for P2P traffic. See [6] for more details.

XXX: I think this could be solved by using direct mapping (e.g GFN == MFN);
this would mean the guest memory layout would be similar to the host one when
PCI devices are passed through => Detail it.

## Interrupt controller

PCI supports three kinds of interrupts: legacy interrupts, MSI and MSI-X. On
ARM, legacy interrupts will be mapped to SPIs. MSI and MSI-X will write their
payload in a doorbell belonging to an MSI controller.

### Existing MSI controllers

In this section some of the existing controllers and their interaction with
the devices will be briefly described. More details can be found in the
respective specifications of each MSI controller.

MSIs can be distinguished by some combination of
    * the Doorbell
        It is the MMIO address written to. Devices may be configured by
        software to write to arbitrary doorbells which they can address.
        An MSI controller may feature a number of doorbells.
    * the Payload
        Devices may be configured to write an arbitrary payload chosen by
        software. MSI controllers may have restrictions on permitted payload.
        Xen will have to sanitize the payload unless it is known to be always
        safe.
    * Sideband information accompanying the write
        Typically this is neither configurable nor probeable, and depends on
        the path taken through the memory system (i.e it is a property of the
        combination of MSI controller and device rather than a property of
        either in isolation).
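
As a purely illustrative model (not an existing Xen structure), the three
attributes above could be captured as follows when recording the MSI
programmed into a device:

    #include <stdint.h>

    /* Hypothetical sketch: an MSI seen from the MSI controller's point of
     * view. The sideband information (here the SBDF) is supplied by the
     * hardware and cannot be forged by the guest. */
    struct msi_desc_model {
        uint64_t doorbell; /* MMIO address the device is programmed to write */
        uint32_t payload;  /* value written, sanitized by Xen if necessary */
        uint32_t sbdf;     /* sideband: identifies the originating device */
    };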

### GICv3/GICv4 ITS

The Interrupt Translation Service (ITS) is an MSI controller designed by ARM
and integrated in the GICv3/GICv4 interrupt controller. For the specification
see [GICV3]. Each MSI/MSI-X will be mapped to a new type of interrupt called
LPI. This interrupt will be configured by the software using a pair (DeviceID,
EventID).

A platform may have multiple ITS blocks (e.g one per NUMA node), each of them
belonging to an ITS group.

The DeviceID is a unique identifier within an ITS group for each MSI-capable
device that can be deduced from the RID with the help of the firmware tables
(see below).

The EventID is a unique identifier to distinguish the different events sent
by a device.

The MSI payload will only contain the EventID as the DeviceID will be added
afterwards by the hardware in a way that will prevent any tampering.
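
To make the pair explicit, here is a simplified software model (not the actual
ITS command interface) of the translation performed for each incoming MSI:

    #include <stdint.h>

    /* Simplified model: each (DeviceID, EventID) pair maps to one LPI. */
    struct its_map_entry {
        uint32_t deviceid; /* deduced from the RID (IORT or msi-map) */
        uint32_t eventid;  /* chosen by software, written by the device */
        uint32_t lpi;      /* LPI injected into the interrupt controller */
    };

    static uint32_t its_lookup_lpi(const struct its_map_entry *map,
                                   unsigned int nr, uint32_t deviceid,
                                   uint32_t eventid)
    {
        for (unsigned int i = 0; i < nr; i++)
            if (map[i].deviceid == deviceid && map[i].eventid == eventid)
                return map[i].lpi;
        return 0; /* unmapped events are discarded in this model */
    }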

The [SBSA] appendix I describes the set of rules for the integration of the
ITS that any compliant platform should follow. Some of the rules explain the
security implications of a misbehaving device. Following them ensures that a
guest will never be able to trigger an MSI on behalf of another guest.

XXX: The security implication is described in the [SBSA] but I haven't found
any similar wording in the GICv3 specification. It is unclear to me if
non-SBSA compliant platforms (e.g embedded) will follow those rules.

### GICv2m

The GICv2m is an extension of the GICv2 to convert MSI/MSI-X writes to unique
interrupts. The specification can be found in the [SBSA] appendix E.

Depending on the platform, the GICv2m will provide one or multiple instances
of register frames. Each frame is composed of a doorbell and is associated
with a set of SPIs that can be discovered by reading the register MSI_TYPER.

On an MSI write, the payload will contain the SPI ID to generate. Note that
on some platforms the MSI payload may contain an offset from the base SPI
rather than the SPI itself.

The frame will only generate an SPI if the written value corresponds to an SPI
allocated to the frame. Each VM should have exclusive access to a frame to
ensure isolation and prevent a guest OS from triggering an MSI on behalf of
another guest OS.
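
Below is a sketch of the check a GICv2m frame applies (and that Xen could
mirror when deciding whether a payload is safe), assuming the MSI_TYPER layout
used by the Linux GICv2m driver: base SPI in bits [25:16] and number of SPIs
in bits [9:0].

    #include <stdbool.h>
    #include <stdint.h>

    /* Returns true if 'payload' is an SPI allocated to the frame described
     * by 'msi_typer'. Payloads expressed as an offset from the base SPI
     * (used on some platforms) would need a prior conversion. */
    static bool gicv2m_payload_is_valid(uint32_t msi_typer, uint32_t payload)
    {
        uint32_t base = (msi_typer >> 16) & 0x3ff;
        uint32_t nr   = msi_typer & 0x3ff;

        return payload >= base && payload < base + nr;
    }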

XXX: Linux seems to consider GICv2m as unsafe by default. From my understanding,
it is still unclear how we should proceed on Xen, as GICv2m should be safe
as long as the frame is only accessed by one guest.

### Other MSI controllers

Servers compliant with SBSA level 1 and higher will have to use either the ITS
or the GICv2m. However, these are by no means the only MSI controllers
available. A hardware vendor may decide to use a custom MSI controller which
can be integrated in the PCI host bridge.

Whether it will be possible to securely write an MSI will depend on the
MSI controller implementation.

XXX: I am happy to give a brief explanation of more MSI controllers (such
as Xilinx and Renesas) if people think it is necessary.

This design document does not pertain to a specific MSI controller and will try
to be as agnostic as possible. When possible, it will give insight into how to
integrate an MSI controller.

# Information available in the firmware tables

## ACPI

### Host bridges

The static table MCFG (see 4.2 in [1]) will describe the host bridges available
at boot and supporting ECAM. Unfortunately, there are platforms out there
(see [2]) that re-use MCFG to describe host bridges that are not fully ECAM
compatible.

This means that Xen needs to account for possible quirks in the host bridge.
The Linux community are working on a patch series for this, see [2] and [3],
where quirks will be detected with:
    * OEM ID
    * OEM Table ID
    * OEM Revision
    * PCI Segment
    * PCI bus number range (wildcard allowed)

Based on what Linux is currently doing, there are two kinds of quirks:
    * Accesses to the configuration space of certain sizes are not allowed
    * A specific driver is necessary for driving the host bridge

The former is straightforward to solve but the latter will require more thought.
Instantiation of a specific driver for the host controller can be easily done
if Xen has the information to detect it. However, those drivers may require
resources described in ASL (see [4] for instance).

The number of platforms requiring a specific PCI host bridge driver is
currently limited. Whilst it is not possible to predict the future, upcoming
platforms are expected to have fully ECAM-compliant PCI host bridges.
Therefore, given that Xen does not have any ASL parser, the suggested approach
is to hardcode the missing values. This could be revisited in the future if
necessary.
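
For illustration, a quirk table in Xen could be matched on the keys listed
above along the lines of the sketch below; the structure and helper are
hypothetical, not the Linux implementation.

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    struct mcfg_quirk {
        char oem_id[7];        /* 6 characters from the MCFG header + NUL */
        char oem_table_id[9];  /* 8 characters + NUL */
        uint32_t oem_revision;
        uint16_t segment;
        uint8_t bus_start, bus_end; /* 0x00-0xff acts as a wildcard */
        const void *cfg_ops;   /* non-standard accessors/driver to use */
    };

    static bool mcfg_quirk_matches(const struct mcfg_quirk *q,
                                   const char *oem_id, const char *oem_table_id,
                                   uint32_t oem_revision, uint16_t segment,
                                   uint8_t bus_start, uint8_t bus_end)
    {
        return !strcmp(q->oem_id, oem_id) &&
               !strcmp(q->oem_table_id, oem_table_id) &&
               q->oem_revision == oem_revision &&
               q->segment == segment &&
               q->bus_start <= bus_start && q->bus_end >= bus_end;
    }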

### Finding information to configure IOMMU and MSI controller

The static table [IORT] will provide information that will help to deduce
data (such as MasterID and DeviceID) to configure both the IOMMU and the MSI
controller from a given SBDF.

### Finding which NUMA node a PCI device belongs to

On NUMA systems, the NUMA node associated with a PCI device can be found using
the _PXM method of the host bridge (?).

XXX: I am not entirely sure where the _PXM will be (i.e host bridge vs PCI
device).

## Device Tree

### Host bridges

Each Device Tree node associated with a host bridge will have at least the
following properties (see bindings in [8]):
    - device_type: will always be "pci".
    - compatible: a string indicating which driver to instantiate

The node may also contain optional properties such as:
    - linux,pci-domain: assigns a fixed segment number
    - bus-range: indicates the range of bus numbers supported

When the property linux,pci-domain is not present, the operating system would
have to allocate the segment number for each host bridge.

### Finding information to configure IOMMU and MSI controller

#### Configuring the IOMMU

The Device Tree provides a generic IOMMU binding (see [10]) which uses the
properties "iommu-map" and "iommu-map-mask" to describe the relationship
between a RID and a MasterID.

These properties will be present in the host bridge Device Tree node. From a
given SBDF, it will be possible to find the corresponding MasterID.

Note that the ARM SMMU also has a legacy binding (see [9]), but it does not
have a way to describe the relationship between RID and StreamID. Instead it
is assumed that StreamID == RID. This binding has now been deprecated in favor
of the generic IOMMU binding.
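
The translation described by the binding can be sketched as below (the types
and helper are illustrative, not an existing Xen interface); the "msi-map"
property described in the next section uses exactly the same scheme to produce
a DeviceID.

    #include <stdbool.h>
    #include <stdint.h>

    /* One (rid-base, output-base, length) entry of an "iommu-map". */
    struct iommu_map_entry {
        uint32_t rid_base;  /* first RID covered by this entry */
        uint32_t out_base;  /* MasterID corresponding to rid_base */
        uint32_t length;    /* number of RIDs covered */
    };

    static bool rid_to_masterid(const struct iommu_map_entry *map,
                                unsigned int nr, uint32_t map_mask,
                                uint32_t rid, uint32_t *masterid)
    {
        rid &= map_mask; /* "iommu-map-mask", all ones when absent */

        for (unsigned int i = 0; i < nr; i++) {
            if (rid >= map[i].rid_base &&
                rid < map[i].rid_base + map[i].length) {
                *masterid = map[i].out_base + (rid - map[i].rid_base);
                return true;
            }
        }
        return false; /* RID not described by the firmware tables */
    }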

#### Configuring the MSI controller

The relationship between the RID and data required to configure the MSI
controller (such as DeviceID) can be found using the property "msi-map"
(see [11]).

This property will be present in the host bridge Device Tree node. From a
given SBDF, it will be possible to find the corresponding DeviceID.

### Finding which NUMA node a PCI device belongs to

On NUMA systems, the NUMA node associated with a PCI device can be found using
the property "numa-node-id" (see [15]) present in the host bridge Device Tree
node.

# Discovering PCI devices

Whilst PCI devices are currently available in the hardware domain, the
hypervisor does not have any knowledge of them. The first step of supporting
PCI pass-through is to make Xen aware of the PCI devices.

Xen will require access to the PCI configuration space to retrieve information
about the PCI devices or to access it on behalf of the guest via the emulated
host bridge.

This means that Xen should be in charge of controlling the host bridge.
However, for some host controllers, this may be difficult to implement in Xen
because of dependencies on other components (e.g clocks, see more details in
the "PCI host bridge" section).

For this reason, the approach chosen in this document is to let the hardware
domain discover the host bridges, scan the PCI devices and then report
everything to Xen. This does not rule out the possibility of doing everything
without the help of the hardware domain in the future.

## Who is in charge of the host bridge?

Numerous host bridge implementations exist on ARM. Some of them require a
specific driver as they cannot be driven by a generic host bridge driver.
Porting those drivers may be complex due to dependencies on other components.

This could be seen as a signal to leave the host bridge drivers in the hardware
domain. Because Xen would need to access the configuration space, all the
accesses would have to be forwarded to the hardware domain, which in turn would
access the hardware.

In this design document, we are considering that the host bridge driver can
be ported to Xen. In case this is not possible, an interface to forward
configuration space accesses would need to be defined. The interface details
are out of scope.

## Discovering and registering host bridge

The approach taken in the document will require communication between Xen and
the hardware domain. In this case, they would need to agree on the segment
number associated with a host bridge. However, this number is not available in
the Device Tree case.

The hardware domain will register new host bridges using the existing hypercall
PHYSDEVOP_pci_mmcfg_reserved:

#define XEN_PCI_MMCFG_RESERVED 1

struct physdev_pci_mmcfg_reserved {
    /* IN */
    uint64_t    address;
    uint16_t    segment;
    /* Range of bus supported by the host bridge */
    uint8_t     start_bus;
    uint8_t     end_bus;

    uint32_t    flags;
};

Some of the host bridges may not have a separate configuration address space
region described in the firmware tables. To simplify the registration, the
field 'address' should contain the base address of one of the regions
described in the firmware tables:
    * For ACPI, it would be the base address specified in the MCFG or in the
    _CBA method.
    * For Device Tree, this would be any base address of a region
    specified in the "reg" property.

The field 'flags' is expected to have XEN_PCI_MMCFG_RESERVED set.

It is expected that this hypercall is called before any PCI device is
registered with Xen.

When the hardware domain is in charge of the host bridge, this hypercall will
be used to tell Xen about the existence of a host bridge in order to find the
associated information for configuring the MSI controller and the IOMMU.
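
As an illustration, the registration could be issued by the hardware domain
along the lines of the sketch below, assuming a Linux-style
HYPERVISOR_physdev_op() wrapper and the structure defined above; error
handling is omitted.

    /* Sketch: register one host bridge with Xen. */
    static int register_host_bridge(uint64_t cfg_base, uint16_t segment,
                                    uint8_t start_bus, uint8_t end_bus)
    {
        struct physdev_pci_mmcfg_reserved r = {
            .address   = cfg_base, /* base address from MCFG, _CBA or "reg" */
            .segment   = segment,
            .start_bus = start_bus,
            .end_bus   = end_bus,
            .flags     = XEN_PCI_MMCFG_RESERVED,
        };

        return HYPERVISOR_physdev_op(PHYSDEVOP_pci_mmcfg_reserved, &r);
    }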

## Discovering and registering PCI devices

The hardware domain will scan the host bridge to find the list of PCI devices
available and then report it to Xen using the existing hypercall
PHYSDEVOP_pci_device_add:

#define XEN_PCI_DEV_EXTFN   0x1
#define XEN_PCI_DEV_VIRTFN  0x2
#define XEN_PCI_DEV_PXM     0x4

struct physdev_pci_device_add {
    /* IN */
    uint16_t    seg;
    uint8_t     bus;
    uint8_t     devfn;
    uint32_t    flags;
    struct {
        uint8_t bus;
        uint8_t devfn;
    } physfn;
    /*
     * Optional parameters array.
     * First element ([0]) is PXM domain associated with the device (if
     * XEN_PCI_DEV_PXM is set)
     */
    uint32_t optarr[0];
};

When XEN_PCI_DEV_PXM is set in the field 'flags', optarr[0] will contain the
NUMA node ID associated with the device:
    * For ACPI, it would be the value returned by the method _PXM
    * For Device Tree, this would be the value found in the property "numa-node-id".
For more details see the section "Finding which NUMA node a PCI device belongs
to" in "ACPI" and "Device Tree".

XXX: I still don't fully understand how XEN_PCI_DEV_EXTFN and XEN_PCI_DEV_VIRTFN
will work. AFAICT, the former is used when the bus supports ARI and the only
usage is in the x86 IOMMU code. For the latter, this is related to SR-IOV but I
am not sure what devfn and physfn.devfn will correspond to.

Note that x86 currently provides two more hypercalls (PHYSDEVOP_manage_pci_add
and PHYSDEVOP_manage_pci_add_ext) to register PCI devices. However they are a
subset of the hypercall PHYSDEVOP_pci_device_add. Therefore, it is suggested
to leave them unimplemented on ARM.

## Removing PCI devices

The hardware domain will be in charge of telling Xen when a device has been
removed, using the existing hypercall PHYSDEVOP_pci_device_remove:

struct physdev_pci_device {
    /* IN */
    uint16_t    seg;
    uint8_t     bus;
    uint8_t     devfn;
};

Note that x86 currently provides one more hypercall (PHYSDEVOP_manage_pci_remove)
to remove PCI devices. However it does not allow passing a segment number.
Therefore it is suggested to leave it unimplemented on ARM.

# Glossary

ECAM: Enhanced Configuration Access Mechanism
SBDF: Segment Bus Device Function. The segment is a software concept.
MSI: Message Signaled Interrupt
MSI doorbell: MMIO address written to by a device to generate an MSI
SPI: Shared Peripheral Interrupt
LPI: Locality-specific Peripheral Interrupt
ITS: Interrupt Translation Service

# Specifications
[SBSA]  ARM-DEN-0029 v3.0
[GICV3] IHI0069C
[IORT]  DEN0049B

# Bibliography

[1] PCI firmware specification, rev 3.2
[2] https://www.spinics.net/lists/linux-pci/msg56715.html
[3] https://www.spinics.net/lists/linux-pci/msg56723.html
[4] https://www.spinics.net/lists/linux-pci/msg56728.html
[6] https://www.spinics.net/lists/kvm/msg140116.html
[7] http://www.firmware.org/1275/bindings/pci/pci2_1.pdf
[8] Documentation/devicetree/bindings/pci
[9] Documentation/devicetree/bindings/iommu/arm,smmu.txt
[10] Documentation/devicetree/bindings/pci/pci-iommu.txt
[11] Documentation/devicetree/bindings/pci/pci-msi.txt
[12] drivers/pci/host/pcie-rcar.c
[13] drivers/pci/host/pci-thunder-ecam.c
[14] drivers/pci/host/pci-thunder-pem.c
[15] Documentation/devicetree/bindings/numa.txt


* Re: [RFC] ARM PCI Passthrough design document
From: Manish Jaggi @ 2017-05-29  2:30 UTC (permalink / raw)
  To: Julien Grall, Stefano Stabellini
  Cc: edgar.iglesias, okaya, Wei Chen, Steve Capper, Andre Przywara,
	manish.jaggi, punit.agrawal, vikrams, Goel, Sameer, xen-devel,
	Dave P Martin, Vijaya Kumar K, roger.pau

Hi Julien,

On 5/26/2017 10:44 PM, Julien Grall wrote:
[...]
> To achieve this goal, it looks more sensible to go towards emulating the
> host bridge (there will be more details later).
IIUC this means that domU would have an emulated host bridge and dom0 
will see the actual host bridge?
[...]
> Whilst in theory, all the memory transactions issued by a PCI device should
> go through the IOMMU, on certain platforms some of the memory transaction may
> not reach the IOMMU because they are interpreted by the host bridge. For
> instance, this could happen if the MSI doorbell is built into the PCI host
> bridge or for P2P traffic. See [6] for more details.
>
> XXX: I think this could be solved by using direct mapping (e.g GFN == MFN),
> this would mean the guest memory layout would be similar to the host one when
> PCI devices will be pass-throughed => Detail it.
In the example given in the IORT spec, for pci devices not behind an SMMU,
how would the writes from the device be protected.

[...]
> This means that Xen needs to account for possible quirks in the host bridge.
> The Linux community are working on a patch series for this, see [2] and [3],
> where quirks will be detected with:
>      * OEM ID
>      * OEM Table ID
>      * OEM Revision
>      * PCI Segment
>      * PCI bus number range (wildcard allowed)
>
> Based on what Linux is currently doing, there are two kind of quirks:
>      * Accesses to the configuration space of certain sizes are not allowed
>      * A specific driver is necessary for driving the host bridge
>
> The former is straightforward to solve but the latter will require more thought.
> Instantiation of a specific driver for the host controller can be easily done
> if Xen has the information to detect it.
So Xen would parse the MCFG to find a hb, then map the config space in 
dom0 stage2 ?
and then provide the same MCFG to dom0?

[...]
> ## Discovering and registering host bridge
>
> The approach taken in the document will require communication between Xen and
> the hardware domain. In this case, they would need to agree on the segment
> number associated to an host bridge. However, this number is not available in
> the Device Tree case.
>
> The hardware domain will register new host bridges using the existing hypercall
> PHYSDEV_mmcfg_reserved:
>
> #define XEN_PCI_MMCFG_RESERVED 1
>
> struct physdev_pci_mmcfg_reserved {
>      /* IN */
>      uint64_t    address;
>      uint16_t    segment;
>      /* Range of bus supported by the host bridge */
>      uint8_t     start_bus;
>      uint8_t     end_bus;
>
>      uint32_t    flags;
> }
So this hypercall is not required for ACPI?
[...]
> ## Discovering and registering PCI devices
>
> The hardware domain will scan the host bridge to find the list of PCI devices
> available and then report it to Xen using the existing hypercall
> PHYSDEV_pci_device_add:
>
> #define XEN_PCI_DEV_EXTFN   0x1
> #define XEN_PCI_DEV_VIRTFN  0x2
> #define XEN_PCI_DEV_PXM     0x3
>
> struct physdev_pci_device_add {
>      /* IN */
>      uint16_t    seg;
>      uint8_t     bus;
>      uint8_t     devfn;
>      uint32_t    flags;
>      struct {
>          uint8_t bus;
>          uint8_t devfn;
>      } physfn;
>      /*
>       * Optional parameters array.
>       * First element ([0]) is PXM domain associated with the device (if
>       * XEN_PCI_DEV_PXM is set)
>       */
>      uint32_t optarr[0];
> }
For mapping the MMIO space of the device in Stage2, we need to add 
support in Xen / via a map hypercall in linux/drivers/xen/pci.c
[...]



* Re: [RFC] ARM PCI Passthrough design document
From: Julien Grall @ 2017-05-29 18:14 UTC (permalink / raw)
  To: Manish Jaggi, Stefano Stabellini
  Cc: edgar.iglesias, okaya, Wei Chen, Steve Capper, Andre Przywara,
	manish.jaggi, punit.agrawal, vikrams, Goel, Sameer, xen-devel,
	Dave P Martin, Vijaya Kumar K, roger.pau



On 05/29/2017 03:30 AM, Manish Jaggi wrote:
> Hi Julien,

Hello Manish,

> On 5/26/2017 10:44 PM, Julien Grall wrote:
>> PCI pass-through allows the guest to receive full control of physical PCI
>> devices. This means the guest will have full and direct access to the PCI
>> device.
>>
>> ARM is supporting a kind of guest that exploits as much as possible
>> virtualization support in hardware. The guest will rely on PV driver only
>> for IO (e.g block, network) and interrupts will come through the
>> virtualized
>> interrupt controller, therefore there are no big changes required
>> within the
>> kernel.
>>
>> As a consequence, it would be possible to replace PV drivers by
>> assigning real
>> devices to the guest for I/O access. Xen on ARM would therefore be
>> able to
>> run unmodified operating system.
>>
>> To achieve this goal, it looks more sensible to go towards emulating the
>> host bridge (there will be more details later).
> IIUC this means that domU would have an emulated host bridge and dom0
> will see the actual host bridge?

You don't want the hardware domain and Xen accessing the configuration
space at the same time. So if Xen is in charge of the host bridge, then
an emulated host bridge should be exposed to the hardware domain.

Although, this depends on who is in charge of the host bridge.
As you may have noticed, this design document is proposing two ways to
handle configuration space accesses. At the moment any generic host bridge
(see the definition in the design document) will be handled in Xen and
the hardware domain will have an emulated host bridge.

If your host bridge is not a generic one, then the hardware domain will
be in charge of the host bridge, and any configuration access from Xen
will be forwarded to the hardware domain.

At the moment, as part of the first implementation, we are only looking
to implement a generic host bridge in Xen. We will decide on a
case-by-case basis for all the other host bridges whether we want to have
the driver in Xen.

[...]

>> ## IOMMU
>>
>> The IOMMU will be used to isolate the PCI device when accessing the
>> memory (e.g
>> DMA and MSI Doorbells). Often the IOMMU will be configured using a
>> MasterID
>> (aka StreamID for ARM SMMU)  that can be deduced from the SBDF with
>> the help
>> of the firmware tables (see below).
>>
>> Whilst in theory, all the memory transactions issued by a PCI device
>> should
>> go through the IOMMU, on certain platforms some of the memory
>> transaction may
>> not reach the IOMMU because they are interpreted by the host bridge. For
>> instance, this could happen if the MSI doorbell is built into the PCI
>> host
>> bridge or for P2P traffic. See [6] for more details.
>>
>> XXX: I think this could be solved by using direct mapping (e.g GFN ==
>> MFN),
>> this would mean the guest memory layout would be similar to the host
>> one when
>> PCI devices will be pass-throughed => Detail it.
> In the example given in the IORT spec, for pci devices not behind an SMMU,
> how would the writes from the device be protected.

I realize the XXX paragraph is quite confusing. I am not trying to solve
the problem where PCI devices are not protected behind an SMMU but
platforms where some transactions (e.g P2P or MSI doorbell accesses)
by-pass the SMMU.

You may still want to allow PCI passthrough in that case, because you
know that P2P cannot be done (or potentially disabled) and MSI doorbell
access is protected (for instance a write in the ITS doorbell will be
tagged with the device by the hardware). In order to support such
platforms you need to direct map the doorbell (e.g GFN == MFN) and carve
out the P2P region from the guest memory map. Hence the suggestion to
re-use the host memory layout for the guest.

Note that it does not mean the RAM region will be direct mapped. It is
only there to ease carving out memory regions by-passed by the SMMU.

[...]

>> ## ACPI
>>
>> ### Host bridges
>>
>> The static table MCFG (see 4.2 in [1]) will describe the host bridges
>> available
>> at boot and supporting ECAM. Unfortunately, there are platforms out there
>> (see [2]) that re-use MCFG to describe host bridge that are not fully
>> ECAM
>> compatible.
>>
>> This means that Xen needs to account for possible quirks in the host
>> bridge.
>> The Linux community are working on a patch series for this, see [2]
>> and [3],
>> where quirks will be detected with:
>>      * OEM ID
>>      * OEM Table ID
>>      * OEM Revision
>>      * PCI Segment
>>      * PCI bus number range (wildcard allowed)
>>
>> Based on what Linux is currently doing, there are two kind of quirks:
>>      * Accesses to the configuration space of certain sizes are not
>> allowed
>>      * A specific driver is necessary for driving the host bridge
>>
>> The former is straightforward to solve but the latter will require
>> more thought.
>> Instantiation of a specific driver for the host controller can be
>> easily done
>> if Xen has the information to detect it.
> So Xen would parse the MCFG to find a hb, then map the config space in
> dom0 stage2 ?
> and then provide the same MCFG to dom0?

These are implementation details. I have been really careful so far to
leave the implementation open, as it does not matter at this stage how we
are going to implement it in Xen.

[...]

>> ## Discovering and registering host bridge
>>
>> The approach taken in the document will require communication between
>> Xen and
>> the hardware domain. In this case, they would need to agree on the
>> segment
>> number associated to an host bridge. However, this number is not
>> available in
>> the Device Tree case.
>>
>> The hardware domain will register new host bridges using the existing
>> hypercall
>> PHYSDEV_mmcfg_reserved:
>>
>> #define XEN_PCI_MMCFG_RESERVED 1
>>
>> struct physdev_pci_mmcfg_reserved {
>>      /* IN */
>>      uint64_t    address;
>>      uint16_t    segment;
>>      /* Range of bus supported by the host bridge */
>>      uint8_t     start_bus;
>>      uint8_t     end_bus;
>>
>>      uint32_t    flags;
>> }
> So this hypercall is not required for ACPI?

This is not DT specific, as even with ACPI there are platforms not fully
ECAM compliant. As I said above, we will need to decide whether we want
to support non-ECAM compliant host bridges (e.g host bridges that need a
specific driver) in Xen. Likely this will be decided on a case-by-case
basis.

[...]

>> ## Discovering and registering PCI devices
>>
>> The hardware domain will scan the host bridge to find the list of PCI
>> devices
>> available and then report it to Xen using the existing hypercall
>> PHYSDEV_pci_device_add:
>>
>> #define XEN_PCI_DEV_EXTFN   0x1
>> #define XEN_PCI_DEV_VIRTFN  0x2
>> #define XEN_PCI_DEV_PXM     0x3
>>
>> struct physdev_pci_device_add {
>>      /* IN */
>>      uint16_t    seg;
>>      uint8_t     bus;
>>      uint8_t     devfn;
>>      uint32_t    flags;
>>      struct {
>>          uint8_t bus;
>>          uint8_t devfn;
>>      } physfn;
>>      /*
>>       * Optional parameters array.
>>       * First element ([0]) is PXM domain associated with the device (if
>>       * XEN_PCI_DEV_PXM is set)
>>       */
>>      uint32_t optarr[0];
>> }
> For mapping the MMIO space of the device in Stage2, we need to add
> support in Xen / via a map hypercall in linux/drivers/xen/pci.c

Mapping MMIO space in stage-2 is not PCI specific and is already addressed
in Xen 4.9 (see commit 80f9c31 "xen/arm: acpi: Map MMIO on fault in
stage-2 page table for the hardware domain"). So I don't understand why
we should care about that here...

Regards,

-- 
Julien Grall


* Re: [RFC] ARM PCI Passthrough design document
From: Manish Jaggi @ 2017-05-30  5:53 UTC (permalink / raw)
  To: Julien Grall, Stefano Stabellini
  Cc: edgar.iglesias, okaya, Wei Chen, Steve Capper, Andre Przywara,
	manish.jaggi, punit.agrawal, vikrams, Goel, Sameer, xen-devel,
	Dave P Martin, Vijaya Kumar K, roger.pau

Hi Julien,

On 5/29/2017 11:44 PM, Julien Grall wrote:
>
>
> On 05/29/2017 03:30 AM, Manish Jaggi wrote:
>> Hi Julien,
>
> Hello Manish,
>
>> On 5/26/2017 10:44 PM, Julien Grall wrote:
>>> [...]
>>> To achieve this goal, it looks more sensible to go towards emulating 
>>> the
>>> host bridge (there will be more details later).
>> IIUC this means that domU would have an emulated host bridge and dom0
>> will see the actual host bridge?
>
> You don't want the hardware domain and Xen access the configuration 
> space at the same time. So if Xen is in charge of the host bridge, 
> then an emulated host bridge should be exposed to the hardware.
I believe in the x86 case dom0 and Xen do access the config space, in the
context of the PCI device add hypercall. That's when the pci_config_XXX
functions in Xen are called.
>
> Although, this is depending on who is in charge of the the host 
> bridge. As you may have noticed, this design document is proposing two 
> ways to handle configuration space access. At the moment any generic 
> host bridge (see the definition in the design document) will be 
> handled in Xen and the hardware domain will have an emulated host bridge.
>
So in the case of a generic host bridge, Xen will manage the config space
and provide an emulated interface to dom0, and accesses would be trapped
by Xen.
Essentially the goal is to scan all PCI devices and register them with
Xen (which in turn will configure the SMMU).
For a generic host bridge, this can be done either in dom0 or Xen. The
only doubt here is what extra benefit the emulated host bridge gives in
the case of dom0.

> If your host bridges is not a generic one, then the hardware domain 
> will be  in charge of the host bridges, any configuration access from 
> Xen will be forward to the hardware domain.
>
> At the moment, as part of the first implementation, we are only 
> looking to implement a generic host bridge in Xen. We will decide on 
> case by case basis for all the other host bridges whether we want to 
> have the driver in Xen.
agreed.
>
> [...]
>
>>> ## IOMMU
>>>
>>> The IOMMU will be used to isolate the PCI device when accessing the
>>> memory (e.g
>>> DMA and MSI Doorbells). Often the IOMMU will be configured using a
>>> MasterID
>>> (aka StreamID for ARM SMMU)  that can be deduced from the SBDF with
>>> the help
>>> of the firmware tables (see below).
>>>
>>> Whilst in theory, all the memory transactions issued by a PCI device
>>> should
>>> go through the IOMMU, on certain platforms some of the memory
>>> transaction may
>>> not reach the IOMMU because they are interpreted by the host bridge. 
>>> For
>>> instance, this could happen if the MSI doorbell is built into the PCI
>>> host
>>> bridge or for P2P traffic. See [6] for more details.
>>>
>>> XXX: I think this could be solved by using direct mapping (e.g GFN ==
>>> MFN),
>>> this would mean the guest memory layout would be similar to the host
>>> one when
>>> PCI devices will be pass-throughed => Detail it.
>> In the example given in the IORT spec, for pci devices not behind an 
>> SMMU,
>> how would the writes from the device be protected.
>
> I realize the XXX paragraph is quite confusing. I am not trying to 
> solve the problem where PCI devices are not protected behind an SMMU 
> but platform where some transactions (e.g P2P or MSI doorbell access) 
> are by-passing the SMMU.
>
> You may still want to allow PCI passthrough in that case, because you 
> know that P2P cannot be done (or potentially disabled) and MSI 
> doorbell access is protected (for instance a write in the ITS doorbell 
> will be tagged with the device by the hardware). In order to support 
> such platform you need to direct map the doorbel (e.g GFN == MFN) and 
> carve out the P2P region from the guest memory map. Hence the 
> suggestion to re-use the host memory layout for the guest.
>
> Note that it does not mean the RAM region will be direct mapped. It is 
> only there to ease carving out memory region by-passed by the SMMU.
>
> [...]
>
>>> ## ACPI
>>>
>>> ### Host bridges
>>>
>>> The static table MCFG (see 4.2 in [1]) will describe the host bridges
>>> available
>>> at boot and supporting ECAM. Unfortunately, there are platforms out 
>>> there
>>> (see [2]) that re-use MCFG to describe host bridge that are not fully
>>> ECAM
>>> compatible.
>>>
>>> This means that Xen needs to account for possible quirks in the host
>>> bridge.
>>> The Linux community are working on a patch series for this, see [2]
>>> and [3],
>>> where quirks will be detected with:
>>>      * OEM ID
>>>      * OEM Table ID
>>>      * OEM Revision
>>>      * PCI Segment
>>>      * PCI bus number range (wildcard allowed)
>>>
>>> Based on what Linux is currently doing, there are two kind of quirks:
>>>      * Accesses to the configuration space of certain sizes are not
>>> allowed
>>>      * A specific driver is necessary for driving the host bridge
>>>
>>> The former is straightforward to solve but the latter will require
>>> more thought.
>>> Instantiation of a specific driver for the host controller can be
>>> easily done
>>> if Xen has the information to detect it.
>> So Xen would parse the MCFG to find a hb, then map the config space in
>> dom0 stage2 ?
>> and then provide the same MCFG to dom0?
>
> This is implementation details. I have been really careful so far to 
> leave the implementation open as it does not matter at this stage how 
> we are going to implement it in Xen.
>
This matters in the case of stage-2 MMIO mappings, see below.
> [...]
>
>>> ## Discovering and registering host bridge
>>>
>>> The approach taken in the document will require communication between
>>> Xen and
>>> the hardware domain. In this case, they would need to agree on the
>>> segment
>>> number associated to an host bridge. However, this number is not
>>> available in
>>> the Device Tree case.
>>>
>>> The hardware domain will register new host bridges using the existing
>>> hypercall
>>> PHYSDEV_mmcfg_reserved:
>>>
>>> #define XEN_PCI_MMCFG_RESERVED 1
>>>
>>> struct physdev_pci_mmcfg_reserved {
>>>      /* IN */
>>>      uint64_t    address;
>>>      uint16_t    segment;
>>>      /* Range of bus supported by the host bridge */
>>>      uint8_t     start_bus;
>>>      uint8_t     end_bus;
>>>
>>>      uint32_t    flags;
>>> }
>> So this hypercall is not required for ACPI?
>
> This is not DT specific as even on ACPI there are platform not fully 
> ECAM compliant. As I said above, we will need to decide whether we 
> want to support non-ECAM compliant host bridges (e.g all host bridges 
> have a specific drivers) in Xen. Likely this will be on case by case 
> basis.
>
> [...]
>
>>> ## Discovering and registering PCI devices
>>>
>>> The hardware domain will scan the host bridge to find the list of PCI
>>> devices
>>> available and then report it to Xen using the existing hypercall
>>> PHYSDEV_pci_device_add:
>>>
>>> #define XEN_PCI_DEV_EXTFN   0x1
>>> #define XEN_PCI_DEV_VIRTFN  0x2
>>> #define XEN_PCI_DEV_PXM     0x3
>>>
>>> struct physdev_pci_device_add {
>>>      /* IN */
>>>      uint16_t    seg;
>>>      uint8_t     bus;
>>>      uint8_t     devfn;
>>>      uint32_t    flags;
>>>      struct {
>>>          uint8_t bus;
>>>          uint8_t devfn;
>>>      } physfn;
>>>      /*
>>>       * Optional parameters array.
>>>       * First element ([0]) is PXM domain associated with the device 
>>> (if
>>>       * XEN_PCI_DEV_PXM is set)
>>>       */
>>>      uint32_t optarr[0];
>>> }
>> For mapping the MMIO space of the device in Stage2, we need to add
>> support in Xen / via a map hypercall in linux/drivers/xen/pci.c
>
> Mapping MMIO space in stage-2 is not PCI specific and already 
> addressed in Xen 4.9 (see commit 80f9c31 "xen/arm: acpi: Map MMIO on 
> fault in stage-2 page table for the hardware domain"). So I don't 
> understand why we should care about that here...
>
This approach is ok, but IMHO we could have a more granular approach than 
trapping. For ACPI:
    - Xen parses the MCFG and can map the PCI host bridge (emulated / 
original) in stage-2 for dom0.
    - Device MMIO can be mapped in stage-2 alongside the pci_device_add 
call (see the sketch below).
What do you think?
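
As a rough illustration of that second item, here is a minimal sketch of how
a device's memory BARs could be mapped 1:1 into the hardware domain's stage-2
while handling PHYSDEVOP_pci_device_add. The helper pci_read_mem_bar() and
the overall flow are assumptions for illustration (only map_mmio_regions()
is meant to be Xen's existing stage-2 MMIO helper), not existing code:

/*
 * Hypothetical sketch: map a device's memory BARs (GFN == MFN) into the
 * hardware domain's stage-2 when the device is registered with Xen.
 */
static int map_device_bars_to_hwdom(struct domain *d, uint16_t seg,
                                    uint8_t bus, uint8_t devfn)
{
    unsigned int bar;

    for ( bar = 0; bar < 6; bar++ )
    {
        uint64_t addr, size;

        /* Assumed helper: base/size of memory BAR 'bar', skips I/O BARs. */
        if ( pci_read_mem_bar(seg, bus, devfn, bar, &addr, &size) || !size )
            continue;

        /* 1:1 mapping of the BAR region into dom0's p2m. */
        if ( map_mmio_regions(d, gaddr_to_gfn(addr), PFN_UP(size),
                              maddr_to_mfn(addr)) )
            return -EFAULT;
    }

    return 0;
}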

> Regards,
>



* Re: [RFC] ARM PCI Passthrough design document
  2017-05-26 17:14 [RFC] ARM PCI Passthrough design document Julien Grall
  2017-05-29  2:30 ` Manish Jaggi
@ 2017-05-30  7:40 ` Roger Pau Monné
  2017-05-30  9:54   ` Julien Grall
  2017-06-16  0:23 ` Stefano Stabellini
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 35+ messages in thread
From: Roger Pau Monné @ 2017-05-30  7:40 UTC (permalink / raw)
  To: Julien Grall
  Cc: edgar.iglesias, Stefano Stabellini, Wei Chen, Steve Capper,
	Andre Przywara, manish.jaggi, punit.agrawal, vikrams, okaya,
	Goel, Sameer, xen-devel, Dave P Martin, Vijaya Kumar K

On Fri, May 26, 2017 at 06:14:09PM +0100, Julien Grall wrote:
[...]
> ## Who is in charge of the host bridge?
> 
> There are numerous implementation of host bridges which exist on ARM. A part of
> them requires a specific driver as they cannot be driven by a generic host bridge
> driver. Porting those drivers may be complex due to dependencies on other
> components.
> 
> This would be seen as signal to leave the host bridge drivers in the hardware
> domain. Because Xen would need to access the configuration space, all the access
> would have to be forwarded to hardware domain which in turn will access the
> hardware.

IMHO this is much more complicated than it seems from the paragraph
above. There is currently no way for Xen to forward PCI config space
accesses to any other entity. The closest thing Xen has to this would
possibly be IOREQ servers, but then you have to take into account that
in order to forward PCI config space accesses to Dom0 you *might* have to
schedule Dom0 (ie: context switch to it), perform the access and
then context switch back to Xen and get the value. I don't think the
PCI code is prepared for such asynchronous accesses at all.

> In this design document, we are considering that the host bridge driver can
> be ported in Xen. In the case it is not possible, a interface to forward
> configuration space access would need to be defined. The interface details
> is out of scope.

I think that you have to state that either the driver is ported to Xen or
the bridge will not be supported. I don't think it's feasible to forward
PCI config space accesses from Xen to Dom0 at all.

Roger.


* Re: [RFC] ARM PCI Passthrough design document
  2017-05-29 18:14   ` Julien Grall
  2017-05-30  5:53     ` Manish Jaggi
@ 2017-05-30  7:53     ` Roger Pau Monné
  2017-05-30  9:42       ` Julien Grall
  1 sibling, 1 reply; 35+ messages in thread
From: Roger Pau Monné @ 2017-05-30  7:53 UTC (permalink / raw)
  To: Julien Grall
  Cc: edgar.iglesias, Stefano Stabellini, Wei Chen, Steve Capper,
	Manish Jaggi, manish.jaggi, punit.agrawal, vikrams, okaya, Goel,
	Sameer, Andre Przywara, xen-devel, Dave P Martin, Vijaya Kumar K

On Mon, May 29, 2017 at 07:14:55PM +0100, Julien Grall wrote:
> On 05/29/2017 03:30 AM, Manish Jaggi wrote:
> > On 5/26/2017 10:44 PM, Julien Grall wrote:
[...]
> > > ## Discovering and registering PCI devices
> > > 
> > > The hardware domain will scan the host bridge to find the list of PCI
> > > devices
> > > available and then report it to Xen using the existing hypercall
> > > PHYSDEV_pci_device_add:
> > > 
> > > #define XEN_PCI_DEV_EXTFN   0x1
> > > #define XEN_PCI_DEV_VIRTFN  0x2
> > > #define XEN_PCI_DEV_PXM     0x3
> > > 
> > > struct physdev_pci_device_add {
> > >      /* IN */
> > >      uint16_t    seg;
> > >      uint8_t     bus;
> > >      uint8_t     devfn;
> > >      uint32_t    flags;
> > >      struct {
> > >          uint8_t bus;
> > >          uint8_t devfn;
> > >      } physfn;
> > >      /*
> > >       * Optional parameters array.
> > >       * First element ([0]) is PXM domain associated with the device (if
> > >       * XEN_PCI_DEV_PXM is set)
> > >       */
> > >      uint32_t optarr[0];
> > > }
> > For mapping the MMIO space of the device in Stage2, we need to add
> > support in Xen / via a map hypercall in linux/drivers/xen/pci.c
> 
> Mapping MMIO space in stage-2 is not PCI specific and already addressed in
> Xen 4.9 (see commit 80f9c31 "xen/arm: acpi: Map MMIO on fault in stage-2
> page table for the hardware domain"). So I don't understand why we should
> care about that here...

I'm not sure what Manish means, but you should map the BARs of the
device when adding it to a domain. Doing the mapping on faults will work
with CPU accesses, but it's not going to work with SMMU faults, which are
asynchronous, and I don't think you can guarantee that the CPU is
always going to access the BARs before doing any DMA transactions to
them.
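
To map the BARs up front rather than on fault, their base and size need to
be known in advance; the PCI-standard way to size a BAR is the write-ones /
read-back sequence. A minimal sketch for a 32-bit memory BAR, assuming
config space accessors of the form pci_conf_read32/write32(seg, bus, dev,
func, reg); 64-bit and I/O BARs would need extra handling:

/* Sketch: size a 32-bit memory BAR by writing all ones and decoding the
 * writable bits. Not safe to do while the device is actively decoding. */
static uint32_t size_mem_bar32(unsigned int seg, unsigned int bus,
                               unsigned int dev, unsigned int func,
                               unsigned int bar_reg /* 0x10, 0x14, ... */)
{
    uint32_t orig = pci_conf_read32(seg, bus, dev, func, bar_reg);
    uint32_t mask;

    if ( orig & 1 )
        return 0; /* I/O BAR, not handled here */

    pci_conf_write32(seg, bus, dev, func, bar_reg, ~0u);
    mask = pci_conf_read32(seg, bus, dev, func, bar_reg) & ~0xfu;
    pci_conf_write32(seg, bus, dev, func, bar_reg, orig);

    return mask ? -mask : 0; /* size = two's complement of the writable bits */
}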

Note that Xen can also scan the bridge by itself and add the devices,
I'm not sure you need the PHYSDEV_pci_device_add hypercall.
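
For completeness, the kind of scan being referred to is the standard
vendor-ID probe over bus/device/function; a minimal sketch, again assuming a
pci_conf_read32(seg, bus, dev, func, reg)-style accessor, with bridge and
multifunction handling omitted:

/* Sketch: minimal enumeration of one bus using the vendor ID probe. */
static void scan_bus(unsigned int seg, unsigned int bus)
{
    unsigned int dev, func;

    for ( dev = 0; dev < 32; dev++ )
        for ( func = 0; func < 8; func++ )
        {
            uint32_t id = pci_conf_read32(seg, bus, dev, func, 0x00);

            if ( (id & 0xffff) == 0xffff )
                continue; /* no function here */

            /* Register seg:bus:dev.func with the PCI subsystem... */
        }
}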

Roger.


* Re: [RFC] ARM PCI Passthrough design document
  2017-05-30  5:53     ` Manish Jaggi
@ 2017-05-30  9:33       ` Julien Grall
  0 siblings, 0 replies; 35+ messages in thread
From: Julien Grall @ 2017-05-30  9:33 UTC (permalink / raw)
  To: Manish Jaggi, Stefano Stabellini
  Cc: edgar.iglesias, okaya, Wei Chen, Steve Capper, Andre Przywara,
	manish.jaggi, punit.agrawal, vikrams, Goel, Sameer, xen-devel,
	Dave P Martin, Vijaya Kumar K, roger.pau



On 30/05/17 06:53, Manish Jaggi wrote:
> Hi Julien,
>
> On 5/29/2017 11:44 PM, Julien Grall wrote:
>>
>>
>> On 05/29/2017 03:30 AM, Manish Jaggi wrote:
>>> Hi Julien,
>>
>> Hello Manish,
>>
>>> On 5/26/2017 10:44 PM, Julien Grall wrote:
>>>> PCI pass-through allows the guest to receive full control of
>>>> physical PCI
>>>> devices. This means the guest will have full and direct access to
>>>> the PCI
>>>> device.
>>>>
>>>> ARM is supporting a kind of guest that exploits as much as possible
>>>> virtualization support in hardware. The guest will rely on PV driver
>>>> only
>>>> for IO (e.g block, network) and interrupts will come through the
>>>> virtualized
>>>> interrupt controller, therefore there are no big changes required
>>>> within the
>>>> kernel.
>>>>
>>>> As a consequence, it would be possible to replace PV drivers by
>>>> assigning real
>>>> devices to the guest for I/O access. Xen on ARM would therefore be
>>>> able to
>>>> run unmodified operating system.
>>>>
>>>> To achieve this goal, it looks more sensible to go towards emulating
>>>> the
>>>> host bridge (there will be more details later).
>>> IIUC this means that domU would have an emulated host bridge and dom0
>>> will see the actual host bridge?
>>
>> You don't want the hardware domain and Xen access the configuration
>> space at the same time. So if Xen is in charge of the host bridge,
>> then an emulated host bridge should be exposed to the hardware.
> I believe in x86 case dom0 and Xen do access the config space. In the
> context of pci device add hypercall.
> Thats when the pci_config_XXX functions in xen are called.

I don't understand how this is related to what I said... If DOM0 has an 
emulated host bridge, it will not be possible for both of them to poke the 
real hardware at the same time, as only Xen would do hardware accesses.

>>
>> Although, this is depending on who is in charge of the the host
>> bridge. As you may have noticed, this design document is proposing two
>> ways to handle configuration space access. At the moment any generic
>> host bridge (see the definition in the design document) will be
>> handled in Xen and the hardware domain will have an emulated host bridge.
>>
> So in case of generic hb, xen will manage the config space and provide a
> emulated I/f to dom0, and accesses would be trapped by Xen.
> Essentially the goal is to scan all pci devices and register them with
> Xen (which in turn will configure the smmu).
> For a  generic hb, this can be done either in dom0/xen. The only doubt
> here is what extra benefit the emulated hb give in case of dom0.

Because then you don't have 2 entities accessing the hardware at the same 
time; you don't know how the hardware would behave in that case. You may 
also want to trap some registers for configuration. Note that this is what 
is already done on x86.
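
As a rough illustration of that kind of trapping, here is a sketch of an
emulated host bridge handler for the hardware domain that forwards most
accesses but filters writes to registers Xen wants to own. The function
names and the set of filtered registers are assumptions for illustration,
not a description of the existing x86 code:

/* Hypothetical sketch: hardware domain config space write through the
 * emulated host bridge. Reads would simply be forwarded to the hardware. */
static void hwdom_cfg_write32(unsigned int seg, unsigned int bus,
                              unsigned int dev, unsigned int func,
                              unsigned int reg, uint32_t val)
{
    switch ( reg & ~3u )
    {
    case 0x04:            /* Command/Status: Xen may want to own some bits */
    case 0x10 ... 0x24:   /* BARs: Xen decides where the MMIO regions live */
        /* Emulate: record the value, merge with what Xen wants to keep. */
        return;
    default:
        /* Anything Xen does not care about goes straight to the hardware. */
        pci_hw_write_config32(seg, bus, dev, func, reg, val);
    }
}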

[...]

>>> For mapping the MMIO space of the device in Stage2, we need to add
>>> support in Xen / via a map hypercall in linux/drivers/xen/pci.c
>>
>> Mapping MMIO space in stage-2 is not PCI specific and already
>> addressed in Xen 4.9 (see commit 80f9c31 "xen/arm: acpi: Map MMIO on
>> fault in stage-2 page table for the hardware domain"). So I don't
>> understand why we should care about that here...
>>
> This approach is ok.
> But we could have more granular approach than trapping IMHO.
> For ACPI
>    -xen parses MCFG and can map pci hb (emulated / original) in stage2
> for dom0
>    -device MMIO can be mapped in stage2 alongside pci_device_add call .
> What do you think?

There are plenty of ways to map MMIO today and again this is not related 
to this design document. It does not matter how you are going to map it 
(trapping, XENMEM_add_to_physmap, parsing MCFG, reading BARs...) at 
this stage.
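
For reference, the XENMEM_add_to_physmap route mentioned above could look
roughly like this from the hardware domain side. This is a sketch only,
assuming Linux's Xen interface headers and the ARM-specific
XENMAPSPACE_dev_mmio space; error handling and batching are omitted:

/* Sketch (Linux dom0 side): ask Xen to map one page of device MMIO,
 * machine frame 'mfn', at guest frame 'gfn' (1:1 if gfn == mfn). */
#include <xen/interface/xen.h>
#include <xen/interface/memory.h>
#include <asm/xen/hypercall.h>

static int map_dev_mmio_page(unsigned long gfn, unsigned long mfn)
{
    struct xen_add_to_physmap xatp = {
        .domid = DOMID_SELF,
        .space = XENMAPSPACE_dev_mmio,
        .idx   = mfn,  /* machine frame of the MMIO page */
        .gpfn  = gfn,  /* where it should appear in dom0 */
    };

    return HYPERVISOR_memory_op(XENMEM_add_to_physmap, &xatp);
}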

-- 
Julien Grall


* Re: [RFC] ARM PCI Passthrough design document
  2017-05-30  7:53     ` Roger Pau Monné
@ 2017-05-30  9:42       ` Julien Grall
  0 siblings, 0 replies; 35+ messages in thread
From: Julien Grall @ 2017-05-30  9:42 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: edgar.iglesias, Stefano Stabellini, Wei Chen, Steve Capper,
	Manish Jaggi, manish.jaggi, punit.agrawal, vikrams, okaya, Goel,
	Sameer, Andre Przywara, xen-devel, Dave P Martin, Vijaya Kumar K

Hi Roger,

On 30/05/17 08:53, Roger Pau Monné wrote:
> On Mon, May 29, 2017 at 07:14:55PM +0100, Julien Grall wrote:
>> On 05/29/2017 03:30 AM, Manish Jaggi wrote:
>>> On 5/26/2017 10:44 PM, Julien Grall wrote:
> [...]
>>>> ## Discovering and registering PCI devices
>>>>
>>>> The hardware domain will scan the host bridge to find the list of PCI
>>>> devices
>>>> available and then report it to Xen using the existing hypercall
>>>> PHYSDEV_pci_device_add:
>>>>
>>>> #define XEN_PCI_DEV_EXTFN   0x1
>>>> #define XEN_PCI_DEV_VIRTFN  0x2
>>>> #define XEN_PCI_DEV_PXM     0x3
>>>>
>>>> struct physdev_pci_device_add {
>>>>      /* IN */
>>>>      uint16_t    seg;
>>>>      uint8_t     bus;
>>>>      uint8_t     devfn;
>>>>      uint32_t    flags;
>>>>      struct {
>>>>          uint8_t bus;
>>>>          uint8_t devfn;
>>>>      } physfn;
>>>>      /*
>>>>       * Optional parameters array.
>>>>       * First element ([0]) is PXM domain associated with the device (if
>>>>       * XEN_PCI_DEV_PXM is set)
>>>>       */
>>>>      uint32_t optarr[0];
>>>> }
>>> For mapping the MMIO space of the device in Stage2, we need to add
>>> support in Xen / via a map hypercall in linux/drivers/xen/pci.c
>>
>> Mapping MMIO space in stage-2 is not PCI specific and already addressed in
>> Xen 4.9 (see commit 80f9c31 "xen/arm: acpi: Map MMIO on fault in stage-2
>> page table for the hardware domain"). So I don't understand why we should
>> care about that here...
>
> I'm not sure what Manish means, but you should map the BARs of the
> device when adding it to a domain.

This could be done when configuring the BARs. Today for DOM0, we rely 
either on trapping or XENMEM_add_to_physmap.

But I still don't understand why it matters so much for the design 
document. This is really an implementation detail.

> Doing mapping on faults will work
> with CPU accesses, but it's not going to work with SMMU faults, those are
> asynchronous, and I don't think you can guarantee that the CPU is
> always going to access the BARs before doing any DMA transactions to
> them.

Why would you do DMA using BARs? I thought DMA was only to/from memory?

>
> Note that Xen can also scan the bridge by itself and add the devices,
> I'm not sure you need the PHYSDEV_pci_device_add hypercall.

This should work today without any knowledge of PCI in Xen. I am not 
aware of any failures with the approach currently implemented. If you 
think it does not work, then please give a concrete example.

Cheers,

-- 
Julien Grall


* Re: [RFC] ARM PCI Passthrough design document
  2017-05-30  7:40 ` Roger Pau Monné
@ 2017-05-30  9:54   ` Julien Grall
  2017-06-16  0:31     ` Stefano Stabellini
  0 siblings, 1 reply; 35+ messages in thread
From: Julien Grall @ 2017-05-30  9:54 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: edgar.iglesias, Stefano Stabellini, Wei Chen, Steve Capper,
	Andre Przywara, manish.jaggi, punit.agrawal, vikrams, okaya,
	Goel, Sameer, xen-devel, Dave P Martin, Vijaya Kumar K

Hi Roger,

On 30/05/17 08:40, Roger Pau Monné wrote:
> On Fri, May 26, 2017 at 06:14:09PM +0100, Julien Grall wrote:
> [...]
>> ## Who is in charge of the host bridge?
>>
>> There are numerous implementation of host bridges which exist on ARM. A part of
>> them requires a specific driver as they cannot be driven by a generic host bridge
>> driver. Porting those drivers may be complex due to dependencies on other
>> components.
>>
>> This would be seen as signal to leave the host bridge drivers in the hardware
>> domain. Because Xen would need to access the configuration space, all the access
>> would have to be forwarded to hardware domain which in turn will access the
>> hardware.
>
> IMHO this is much more complicated that what seems from the paragraph
> above. There is currently no way for Xen to forward PCI config space
> accesses to any other entity. The closer Xen has to this would be
> IOREQ servers possibly, but then you have to take into account that
> in order to forward PCI config spaces to Dom0 you *might* have to
> schedule the Dom0 (ie: context switch to it), perform the access and
> then context switch back to Xen and get the value. I don't think the
> PCI code is prepared for such asynchronous accesses at all.

I don't see any issue with scheduling DOM0... it is configuration space 
access, not BAR access. It does not matter if it is slow. What matters 
here is to be able to use the host bridges and do PCI passthrough with Xen.

Also, the PCI code is currently x86 specific and not prepared for ARM. 
That does not mean we should not get the code in shape to support ARM ;).

>
>> In this design document, we are considering that the host bridge driver can
>> be ported in Xen. In the case it is not possible, a interface to forward
>> configuration space access would need to be defined. The interface details
>> is out of scope.
>
> I think that you have to state that the driver is ported to Xen or the
> bridge will not be supported. I don't think it's feasible to forward
> PCI config space access from Xen to Dom0 at all.

Rather than arguing that the code is not ready for that, I would have 
appreciated if you gave technical details on why it is not feasible.

I have already given insights quite a few times on why it might be difficult 
to port a host bridge to Xen:
	- How do you configure the clocks? What if they are shared?
	- How about host bridges using indirect access (e.g cf8-like)? What do 
you expose to DOM0?
	- ....

Such host bridges will end up pulling a lot of code into Xen and require 
more design work than finding a way to forward configuration space accesses 
out of Xen. Those boards exist and people are looking at using Xen + PCI 
passthrough. So saying they are not supported is not the right solution 
here.

Anyway, I mentioned it in the design document to open a discussion; it is 
not something I am going to focus on for a first version of PCI pass-through.

Cheers,

-- 
Julien Grall


* Re: [RFC] ARM PCI Passthrough design document
  2017-05-26 17:14 [RFC] ARM PCI Passthrough design document Julien Grall
  2017-05-29  2:30 ` Manish Jaggi
  2017-05-30  7:40 ` Roger Pau Monné
@ 2017-06-16  0:23 ` Stefano Stabellini
  2017-06-20  0:19 ` Vikram Sethi
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 35+ messages in thread
From: Stefano Stabellini @ 2017-06-16  0:23 UTC (permalink / raw)
  To: Julien Grall
  Cc: edgar.iglesias, Stefano Stabellini, Wei Chen, Steve Capper,
	Andre Przywara, manish.jaggi, punit.agrawal, vikrams, okaya,
	Goel, Sameer, xen-devel, Dave P Martin, Vijaya Kumar K,
	roger.pau

On Fri, 26 May 2017, Julien Grall wrote:
> Hi all,
> 
> The document below is an RFC version of a design proposal for PCI
> Passthrough in Xen on ARM. It aims to describe from an high level perspective
> the interaction with the different subsystems and how guest will be able
> to discover and access PCI.
> 
> Currently on ARM, Xen does not have any knowledge about PCI devices. This
> means that IOMMU and interrupt controller (such as ITS) requiring specific
> configuration will not work with PCI even with DOM0.
> 
> The PCI Passthrough work could be divided in 2 phases:
>         * Phase 1: Register all PCI devices in Xen => will allow
>                    to use ITS and SMMU with PCI in Xen
>         * Phase 2: Assign devices to guests
> 
> This document aims to describe the 2 phases, but for now only phase
> 1 is fully described.
> 
> 
> I think I was able to gather all of the feedbacks and come up with a solution
> that will satisfy all the parties. The design document has changed quite a lot
> compare to the early draft sent few months ago. The major changes are:
> 	* Provide more details how PCI works on ARM and the interactions with
> 	MSI controller and IOMMU
> 	* Provide details on the existing host bridge implementations
> 	* Give more explanation and justifications on the approach chosen 
> 	* Describing the hypercalls used and how they should be called
> 
> Feedbacks are welcomed.
> 
> Cheers,

Hi Julien,

I think this document is a very good first step in the right direction
and I fully agree with the approaches taken here.

I noticed a couple of grammar errors that I pointed out below.


> --------------------------------------------------------------------------------
> 
> % PCI pass-through support on ARM
> % Julien Grall <julien.grall@linaro.org>
> % Draft B
> 
> # Preface
> 
> This document aims to describe the components required to enable the PCI
> pass-through on ARM.
> 
> This is an early draft and some questions are still unanswered. When this is
> the case, the text will contain XXX.
> 
> # Introduction
> 
> PCI pass-through allows the guest to receive full control of physical PCI
> devices. This means the guest will have full and direct access to the PCI
> device.
> 
> ARM is supporting a kind of guest that exploits as much as possible
> virtualization support in hardware. The guest will rely on PV driver only
> for IO (e.g block, network) and interrupts will come through the virtualized
> interrupt controller, therefore there are no big changes required within the
> kernel.
> 
> As a consequence, it would be possible to replace PV drivers by assigning real
> devices to the guest for I/O access. Xen on ARM would therefore be able to
> run unmodified operating system.
> 
> To achieve this goal, it looks more sensible to go towards emulating the
> host bridge (there will be more details later). A guest would be able to take
> advantage of the firmware tables, obviating the need for a specific driver
> for Xen.
> 
> Thus, in this document we follow the emulated host bridge approach.
> 
> # PCI terminologies
> 
> Each PCI device under a host bridge is uniquely identified by its Requester ID
> (AKA RID). A Requester ID is a triplet of Bus number, Device number, and
> Function.
> 
> When the platform has multiple host bridges, the software can add a fourth
> number called Segment (sometimes called Domain) to differentiate host bridges.
> A PCI device will then uniquely by segment:bus:device:function (AKA SBDF).
> 
> So given a specific SBDF, it would be possible to find the host bridge and the
> RID associated to a PCI device. The pair (host bridge, RID) will often be used
> to find the relevant information for configuring the different subsystems (e.g
> IOMMU, MSI controller). For convenience, the rest of the document will use
> SBDF to refer to the pair (host bridge, RID).
> 
> # PCI host bridge
> 
> PCI host bridge enables data transfer between a host processor and PCI bus
> based devices. The bridge is used to access the configuration space of each
> PCI devices and, on some platform may also act as an MSI controller.
> 
> ## Initialization of the PCI host bridge
> 
> Whilst it would be expected that the bootloader takes care of initializing
> the PCI host bridge, on some platforms it is done in the Operating System.
> 
> This may include enabling/configuring the clocks that could be shared among
> multiple devices.
> 
> ## Accessing PCI configuration space
> 
> Accessing the PCI configuration space can be divided in 2 category:
>     * Indirect access, where the configuration spaces are multiplexed. An
>     example would be legacy method on x86 (e.g 0xcf8 and 0xcfc). On ARM a
>     similar method is used by PCIe RCar root complex (see [12]).
>     * ECAM access, each configuration space will have its own address space.
> 
> Whilst ECAM is a standard, some PCI host bridges will require specific fiddling
> when access the registers (see thunder-ecam [13]).
> 
> In most of the cases, accessing all the PCI configuration spaces under a
> given PCI host will be done the same way (i.e either indirect access or ECAM
> access). However, there are a few cases, dependent on the PCI devices accessed,
> which will use different methods (see thunder-pem [14]).
> 
> ## Generic host bridge
> 
> For the purpose of this document, the term "generic host bridge" will be used
> to describe any host bridge ECAM-compliant and the initialization, if required,
> will be already done by the firmware/bootloader.
> 
> # Interaction of the PCI subsystem with other subsystems
> 
> In order to have a PCI device fully working, Xen will need to configure
> other subsystems such as the IOMMU and the Interrupt Controller.
> 
> The interaction expected between the PCI subsystem and the other subsystems is:
>     * Add a device
>     * Remove a device
>     * Assign a device to a guest
>     * Deassign a device from a guest
> 
> XXX: Detail the interaction when assigning/deassigning device
> 
> In the following subsections, the interactions will be briefly described from a
> higher level perspective. However, implementation details such as callback,
> structure, etc... are beyond the scope of this document.
> 
> ## IOMMU
> 
> The IOMMU will be used to isolate the PCI device when accessing the memory (e.g
> DMA and MSI Doorbells). Often the IOMMU will be configured using a MasterID
> (aka StreamID for ARM SMMU)  that can be deduced from the SBDF with the help
> of the firmware tables (see below).
> 
> Whilst in theory, all the memory transactions issued by a PCI device should
> go through the IOMMU, on certain platforms some of the memory transaction may
> not reach the IOMMU because they are interpreted by the host bridge. For
> instance, this could happen if the MSI doorbell is built into the PCI host
> bridge or for P2P traffic. See [6] for more details.
> 
> XXX: I think this could be solved by using direct mapping (e.g GFN == MFN),
> this would mean the guest memory layout would be similar to the host one when
> PCI devices will be pass-throughed => Detail it.
> 
> ## Interrupt controller
> 
> PCI supports three kind of interrupts: legacy interrupt, MSI and MSI-X. On ARM,
> legacy interrupts will be mapped to SPIs. MSI and MSI-X will write their
> payload in a doorbell belonging to a MSI controller.
> 
> ### Existing MSI controllers
> 
> In this section some of the existing controllers and their interaction with
> the devices will be briefly described. More details can be found in the
> respective specifications of each MSI controller.
> 
> MSIs can be distinguished by some combination of
>     * the Doorbell
>         It is the MMIO address written to. Devices may be configured by
>         software to write to arbitrary doorbells which they can address.
>         An MSI controller may feature a number of doorbells.
>     * the Payload
>         Devices may be configured to write an arbitrary payload chosen by
>         software. MSI controllers may have restrictions on permitted payload.
>         Xen will have to sanitize the payload unless it is known to be always
>         safe.
>     * Sideband information accompanying the write
>         Typically this is neither configurable nor probeable, and depends on
>         the path taken through the memory system (i.e it is a property of the
>         combination of MSI controller and device rather than a property of
>         either in isolation).
> 
> ### GICv3/GICv4 ITS
> 
> The Interrupt Translation Service (ITS) is a MSI controller designed by ARM
> and integrated in the GICv3/GICv4 interrupt controller. For the specification
> see [GICV3]. Each MSI/MSI-X will be mapped to a new type of interrupt called
> LPI. This interrupt will be configured by the software using a pair (DeviceID,
> EventID).
> 
> A platform may have multiple ITS block (e.g one per NUMA node), each of them
> belong to an ITS group.
> 
> The DeviceID is a unique identifier with an ITS group for each MSI-capable
> device that can be deduced from the RID with the help of the firmware tables
> (see below).
> 
> The EventID is a unique identifier to distinguish different event sending
> by a device.
> 
> The MSI payload will only contain the EventID as the DeviceID will be added
> afterwards by the hardware in a way that will prevent any tampering.
> 
> The [SBSA] appendix I describes the set of rules for the integration of the
                      ^ redundant I


> ITS that any compliant platform should follow. Some of the rules will explain
> the security implication of a misbehaving devices. It ensures that a guest
> will never be able to trigger an MSI on behalf of another guest.
> 
> XXX: The security implication is described in the [SBSA] but I haven't found
> any similar working in the GICv3 specification. It is unclear to me if
> non-SBSA compliant platform (e.g embedded) will follow those rules.
> 
> ### GICv2m
> 
> The GICv2m is an extension of the GICv2 to convert MSI/MSI-X writes to unique
> interrupts. The specification can be found in the [SBSA] appendix E.
> 
> Depending on the platform, the GICv2m will provide one or multiple instance
> of register frames. Each frame is composed of a doorbell and associated to
> a set of SPIs that can be discovered by reading the register MSI_TYPER.
> 
> On an MSI write, the payload will contain the SPI ID to generate. Note that
> on some platform the MSI payload may contain an offset form the base SPI
> rather than the SPI itself.
> 
> The frame will only generate SPI if the written value corresponds to an SPI
> allocated to the frame. Each VM should have exclusity to the frame to ensure
                                               ^ exclusive access ?


> isolation and prevent a guest OS to trigger an MSI on-behalf of another guest
> OS.
> 
> XXX: Linux seems to consider GICv2m as unsafe by default. From my understanding,
> it is still unclear how we should proceed on Xen, as GICv2m should be safe
> as long as the frame is only accessed by one guest.

It seems to me that you are right


> ### Other MSI controllers
> 
> Servers compliant with SBSA level 1 and higher will have to use either ITS
> or GICv2m. However, it is by no means the only MSI controllers available.
> The hardware vendor may decide to use their custom MSI controller which can be
> integrated in the PCI host bridge.
> 
> Whether it will be possible to write securely an MSI will depend on the
> MSI controller implementations.
> 
> XXX: I am happy to give a brief explanation on more MSI controller (such
> as Xilinx and Renesas) if people think it is necessary.
> 
> This design document does not pertain to a specific MSI controller and will try
> to be as agnostic is possible. When possible, it will give insight how to
> integrate the MSI controller.
> 
> # Information available in the firmware tables
> 
> ## ACPI
> 
> ### Host bridges
> 
> The static table MCFG (see 4.2 in [1]) will describe the host bridges available
> at boot and supporting ECAM. Unfortunately, there are platforms out there
> (see [2]) that re-use MCFG to describe host bridge that are not fully ECAM
> compatible.
> 
> This means that Xen needs to account for possible quirks in the host bridge.
> The Linux community are working on a patch series for this, see [2] and [3],
> where quirks will be detected with:
>     * OEM ID
>     * OEM Table ID
>     * OEM Revision
>     * PCI Segment
>     * PCI bus number range (wildcard allowed)
> 
> Based on what Linux is currently doing, there are two kind of quirks:
>     * Accesses to the configuration space of certain sizes are not allowed
>     * A specific driver is necessary for driving the host bridge
> 
> The former is straightforward to solve but the latter will require more thought.
> Instantiation of a specific driver for the host controller can be easily done
> if Xen has the information to detect it. However, those drivers may require
> resources described in ASL (see [4] for instance).
> 
> The number of platforms requiring specific PCI host bridge driver is currently
> limited. Whilst it is not possible to predict the future, it will be expected
> upcoming platform to have fully ECAM compliant PCI host bridges. Therefore,
> given Xen does not have any ASL parser, the approach suggested is to hardcode
> the missing values. This could be revisit in the future if necessary.
> 
> ### Finding information to configure IOMMU and MSI controller
> 
> The static table [IORT] will provide information that will help to deduce
> data (such as MasterID and DeviceID) to configure both the IOMMU and the MSI
> controller from a given SBDF.
> 
> ## Finding which NUMA node a PCI device belongs to
> 
> On NUMA system, the NUMA node associated to a PCI device can be found using
> the _PXM method of the host bridge (?).
> 
> XXX: I am not entirely sure where the _PXM will be (i.e host bridge vs PCI
> device).
> 
> ## Device Tree
> 
> ### Host bridges
> 
> Each Device Tree node associated to a host bridge will have at least the
> following properties (see bindings in [8]):
>     - device_type: will always be "pci".
>     - compatible: a string indicating which driver to instanciate
> 
> The node may also contain optional properties such as:
>     - linux,pci-domain: assign a fix segment number
>     - bus-range: indicate the range of bus numbers supported
> 
> When the property linux,pci-domain is not present, the operating system would
> have to allocate the segment number for each host bridges.
> 
> ### Finding information to configure IOMMU and MSI controller
> 
> ### Configuring the IOMMU
> 
> The Device Treee provides a generic IOMMU bindings (see [10]) which uses the
> properties "iommu-map" and "iommu-map-mask" to described the relationship
> between RID and a MasterID.
> 
> These properties will be present in the host bridge Device Tree node. From a
> given SBDF, it will be possible to find the corresponding MasterID.
> 
> Note that the ARM SMMU also have a legacy binding (see [9]), but it does not
> have a way to describe the relationship between RID and StreamID. Instead it
> assumed that StreamID == RID. This binding has now been deprecated in favor
> of the generic IOMMU binding.
> 
> ### Configuring the MSI controller
> 
> The relationship between the RID and data required to configure the MSI
> controller (such as DeviceID) can be found using the property "msi-map"
> (see [11]).
> 
> This property will be present in the host bridge Device Tree node. From a
> given SBDF, it will be possible to find the corresponding MasterID.
> 
> ## Finding which NUMA node a PCI device belongs to
> 
> On NUMA system, the NUMA node associated to a PCI device can be found using
> the property "numa-node-id" (see [15]) presents in the host bridge Device Tree
> node.
> 
> # Discovering PCI devices
> 
> Whilst PCI devices are currently available in the hardware domain, the
> hypervisor does not have any knowledge of them. The first step of supporting
> PCI pass-through is to make Xen aware of the PCI devices.
> 
> Xen will require access to the PCI configuration space to retrieve information
> for the PCI devices or access it on behalf of the guest via the emulated
> host bridge.
> 
> This means that Xen should be in charge of controlling the host bridge. However,
> for some host controller, this may be difficult to implement in Xen because of
> depencencies on other components (e.g clocks, see more details in "PCI host
> bridge" section).
> 
> For this reason, the approach chosen in this document is to let the hardware
> domain to discover the host bridges, scan the PCI devices and then report
> everything to Xen. This does not rule out the possibility of doing everything
> without the help of the hardware domain in the future.
> 
> ## Who is in charge of the host bridge?
> 
> There are numerous implementation of host bridges which exist on ARM. A part of
> them requires a specific driver as they cannot be driven by a generic host bridge
> driver. Porting those drivers may be complex due to dependencies on other
> components.
> 
> This would be seen as signal to leave the host bridge drivers in the hardware
> domain. Because Xen would need to access the configuration space, all the access
> would have to be forwarded to hardware domain which in turn will access the
> hardware.
> 
> In this design document, we are considering that the host bridge driver can
> be ported in Xen. In the case it is not possible, a interface to forward
> configuration space access would need to be defined. The interface details
> is out of scope.
> 
> ## Discovering and registering host bridge
> 
> The approach taken in the document will require communication between Xen and
> the hardware domain. In this case, they would need to agree on the segment
> number associated to an host bridge. However, this number is not available in
> the Device Tree case.
> 
> The hardware domain will register new host bridges using the existing hypercall
> PHYSDEV_mmcfg_reserved:
> 
> #define XEN_PCI_MMCFG_RESERVED 1
> 
> struct physdev_pci_mmcfg_reserved {
>     /* IN */
>     uint64_t    address;
>     uint16_t    segment;
>     /* Range of bus supported by the host bridge */
>     uint8_t     start_bus;
>     uint8_t     end_bus;
> 
>     uint32_t    flags;
> }
> 
> Some of the host bridges may not have a separate configuration address space
> region described in the firmware tables. To simplify the registration, the
> field 'address' should contains the base address of one of the region
> described in the firmware tables.
>     * For ACPI, it would be the base address specified in the MCFG or in the
>     _CBA method.
>     * For Device Tree, this would be any base address of region
>     specified in the "reg" property.
> 
> The field 'flags' is expected to have XEN_PCI_MMCFG_RESERVED set.
> 
> It is expected that this hypercall is called before any PCI devices is
> registered to Xen.
> 
> When the hardware domain is in charge of the host bridge, this hypercall will
> be used to tell Xen the existence of an host bridge in order to find the
> associated information for configuring the MSI controller and the IOMMU.
> 
> ## Discovering and registering PCI devices
> 
> The hardware domain will scan the host bridge to find the list of PCI devices
> available and then report it to Xen using the existing hypercall
> PHYSDEV_pci_device_add:
> 
> #define XEN_PCI_DEV_EXTFN   0x1
> #define XEN_PCI_DEV_VIRTFN  0x2
> #define XEN_PCI_DEV_PXM     0x3
> 
> struct physdev_pci_device_add {
>     /* IN */
>     uint16_t    seg;
>     uint8_t     bus;
>     uint8_t     devfn;
>     uint32_t    flags;
>     struct {
>         uint8_t bus;
>         uint8_t devfn;
>     } physfn;
>     /*
>      * Optional parameters array.
>      * First element ([0]) is PXM domain associated with the device (if
>      * XEN_PCI_DEV_PXM is set)
>      */
>     uint32_t optarr[0];
> }
> 
> When XEN_PCI_DEV_PXM is set in the field 'flag', optarr[0] will contain the
> NUMA node ID associated with the device:
>     * For ACPI, it would be the value returned by the method _PXM
>     * For Device Tree, this would the value found in the property "numa-node-id".
> For more details see the section "Finding which NUMA node a PCI device belongs
> to" in "ACPI" and "Device Tree".
> 
> XXX: I still don't fully understand how XEN_PCI_DEV_EXTFN and XEN_PCI_DEV_VIRTFN
> wil work. AFAICT, the former is used with the bus support ARI and the only usage
> is in the x86 IOMMU code. For the latter, this is related to IOV but I am not
> sure what devfn and physfn.devfn will correspond too.
> 
> Note that x86 currently provides two more hypercalls (PHYSDEVOP_manage_pci_add
> and PHYSDEVOP_manage_pci_add_ext) to register PCI devices. However they are
> subset of the hypercall PHYSDEVOP_pci_device_add. Therefore, it is suggested
> to leave them unimplemented on ARM.
> 
> ## Removing PCI devices
> 
> The hardware domain will be in charge Xen a device has been removed using
> the existing hypercall PHYSDEV_pci_device_remove:
> 
> struct physdev_pci_device {
>     /* IN */
>     uint16_t    seg;
>     uint8_t     bus;
>     uint8_t     devfn;
> }
> 
> Note that x86 currently provide one more hypercall (PHYSDEVOP_manage_pci_remove)
> to remove PCI devices. However it does not allow to pass a segment number.
> Therefore it is suggested to leave unimplemented on ARM.
> 
> # Glossary
> 
> ECAM: Enhanced Configuration Mechanism
> SBDF: Segment Bus Device Function. The segment is a software concept.
> MSI: Message Signaled Interrupt
> MSI doorbell: MMIO address written to by a device to generate an MSI
> SPI: Shared Peripheral Interrupt
> LPI: Locality-specific Peripheral Interrupt
> ITS: Interrupt Translation Service
> 
> # Specifications
> [SBSA]  ARM-DEN-0029 v3.0
> [GICV3] IHI0069C
> [IORT]  DEN0049B
> 
> # Bibliography
> 
> [1] PCI firmware specification, rev 3.2
> [2] https://www.spinics.net/lists/linux-pci/msg56715.html
> [3] https://www.spinics.net/lists/linux-pci/msg56723.html
> [4] https://www.spinics.net/lists/linux-pci/msg56728.html
> [6] https://www.spinics.net/lists/kvm/msg140116.html
> [7] http://www.firmware.org/1275/bindings/pci/pci2_1.pdf
> [8] Documents/devicetree/bindings/pci
> [9] Documents/devicetree/bindings/iommu/arm,smmu.txt
> [10] Document/devicetree/bindings/pci/pci-iommu.txt
> [11] Documents/devicetree/bindings/pci/pci-msi.txt
> [12] drivers/pci/host/pcie-rcar.c
> [13] drivers/pci/host/pci-thunder-ecam.c
> [14] drivers/pci/host/pci-thunder-pem.c
> [15] Documents/devicetree/bindings/numa.txt
> 


* Re: [RFC] ARM PCI Passthrough design document
  2017-05-30  9:54   ` Julien Grall
@ 2017-06-16  0:31     ` Stefano Stabellini
  0 siblings, 0 replies; 35+ messages in thread
From: Stefano Stabellini @ 2017-06-16  0:31 UTC (permalink / raw)
  To: Julien Grall
  Cc: edgar.iglesias, Stefano Stabellini, Wei Chen, Steve Capper,
	Andre Przywara, manish.jaggi, punit.agrawal, vikrams, okaya,
	Goel, Sameer, xen-devel, Dave P Martin, Vijaya Kumar K,
	Roger Pau Monné

On Tue, 30 May 2017, Julien Grall wrote:
> > > In this design document, we are considering that the host bridge driver
> > > can
> > > be ported in Xen. In the case it is not possible, a interface to forward
> > > configuration space access would need to be defined. The interface details
> > > is out of scope.
> > 
> > I think that you have to state that the driver is ported to Xen or the
> > bridge will not be supported. I don't think it's feasible to forward
> > PCI config space access from Xen to Dom0 at all.

Easy to say, but in practice there might be boards that we want to
support which require complex configurations.

Obviously having to send PCI config space read/write requests from Xen
to Dom0 is ugly and slow and doesn't match the Xen architecture, but it
might be the only solution in these cases. This is ARM: the ecosystem
has a lot more variety compared to x86, so a single approach might simply
not be possible.

Another ugly (and fragile) idea to solve this problem would be to
initialize those difficult PCI host bridges in Dom0, then cede control
of them from Dom0 to Xen: I expect that once they are initialized, Xen
might be able to drive them more easily, without getting entangled with
clocks and regulators. 


> Rather than arguing on the code is not ready for that. I would have
> appreciated if you gave technical details on why it is not feasible.
> 
> I already gave quite a few times insights on why it might be difficult to port
> an host bridges in Xen.
> 	- How do you configure the clock? What if they are shared?
> 	- How about host bridges using indirect access (e.g cf8 like)? What
> you expose to DOM0?
> 	- ....
> 
> Such host bridges will end up to pull a lot of code in Xen and require more
> design than finding about a way to forward configuration space in Xen. Those
> boards exists and people are looking at using Xen + PCI passthrough. So saying
> they are not supported is not the right solution here.

I agree


> Anyway, I mentioned it in the design document to open a discussion and not
> something I am going to focus for a first version of PCI pass-through.

Indeed: we'll cross that bridge when we get to it.


* Re: [RFC] ARM PCI Passthrough design document
  2017-05-26 17:14 [RFC] ARM PCI Passthrough design document Julien Grall
                   ` (2 preceding siblings ...)
  2017-06-16  0:23 ` Stefano Stabellini
@ 2017-06-20  0:19 ` Vikram Sethi
  2017-06-28 15:22   ` Julien Grall
  2017-07-19 14:41 ` Notes from PCI Passthrough design discussion at Xen Summit Punit Agrawal
  2018-01-22 11:10 ` [RFC] ARM PCI Passthrough design document Manish Jaggi
  5 siblings, 1 reply; 35+ messages in thread
From: Vikram Sethi @ 2017-06-20  0:19 UTC (permalink / raw)
  To: Julien Grall, Stefano Stabellini
  Cc: edgar.iglesias, Sinan Kaya, Wei Chen, Steve Capper,
	Andre Przywara, manish.jaggi, punit.agrawal, Sameer Goel,
	xen-devel, Dave P Martin, Vijaya Kumar K, roger.pau

Hi Julien, 
Thanks for posting this. I think some additional topics need to be covered in the design document, in 3 main areas:

Hotplug: how will Xen support hotplug? Many root ports may require firmware hooks such as ACPI ASL to take care of platform-specific MMIO initialization on hotplug. Normally firmware (UEFI) would have done that platform-specific setup at boot.

AER: Will PCIe non-fatal and fatal errors (secondary bus reset for fatal) be recoverable in Xen?
Will drivers in the domains be notified about fatal errors so they can be quiesced before doing a secondary bus reset in Xen?
Will Xen support Firmware First error handling for AER? i.e. when the platform does Firmware First error handling and/or filtering of AER and sends the associated ACPI HEST logs to Xen.
How will AER notifications and logs be propagated to the domains: injected ACPI HEST?

PCIe DPC (Downstream Port Containment): will it be supported in Xen, with Xen registering for the DPC interrupt? When Xen brings the link back up, will it send a simulated hotplug event to dom0 to show the link is back up?

Thanks,
Vikram

-----Original Message-----
From: Julien Grall [mailto:julien.grall@linaro.org] 
Sent: Friday, May 26, 2017 12:14 PM
To: Stefano Stabellini <sstabellini@kernel.org>
Cc: Julien Grall <julien.grall@linaro.org>; xen-devel <xen-devel@lists.xenproject.org>; edgar.iglesias@xilinx.com; Steve Capper <Steve.Capper@arm.com>; punit.agrawal@arm.com; Wei Chen <Wei.Chen@arm.com>; Dave P Martin <Dave.Martin@arm.com>; Sameer Goel <sgoel@qti.qualcomm.com>; Sinan Kaya <okaya@qti.qualcomm.com>; Vikram Sethi <vikrams@qti.qualcomm.com>; roger.pau@citrix.com; manish.jaggi@caviumnetworks.com; Vijaya Kumar K <Vijaya.Kumar@caviumnetworks.com>; Andre Przywara <andre.przywara@arm.com>
Subject: [RFC] ARM PCI Passthrough design document

Hi all,

The document below is an RFC version of a design proposal for PCI Passthrough in Xen on ARM. It aims to describe from an high level perspective the interaction with the different subsystems and how guest will be able to discover and access PCI.

Currently on ARM, Xen does not have any knowledge about PCI devices. This means that IOMMU and interrupt controller (such as ITS) requiring specific configuration will not work with PCI even with DOM0.

The PCI Passthrough work could be divided in 2 phases:
        * Phase 1: Register all PCI devices in Xen => will allow
                   to use ITS and SMMU with PCI in Xen
        * Phase 2: Assign devices to guests

This document aims to describe the 2 phases, but for now only phase
1 is fully described.


I think I was able to gather all of the feedbacks and come up with a solution that will satisfy all the parties. The design document has changed quite a lot compare to the early draft sent few months ago. The major changes are:
	* Provide more details how PCI works on ARM and the interactions with
	MSI controller and IOMMU
	* Provide details on the existing host bridge implementations
	* Give more explanation and justifications on the approach chosen 
	* Describing the hypercalls used and how they should be called

Feedbacks are welcomed.

Cheers,

--------------------------------------------------------------------------------

% PCI pass-through support on ARM
% Julien Grall <julien.grall@linaro.org> % Draft B

# Preface

This document aims to describe the components required to enable the PCI pass-through on ARM.

This is an early draft and some questions are still unanswered. When this is the case, the text will contain XXX.

# Introduction

PCI pass-through allows the guest to receive full control of physical PCI devices. This means the guest will have full and direct access to the PCI device.

ARM is supporting a kind of guest that exploits as much as possible virtualization support in hardware. The guest will rely on PV driver only for IO (e.g block, network) and interrupts will come through the virtualized interrupt controller, therefore there are no big changes required within the kernel.

As a consequence, it would be possible to replace PV drivers by assigning real devices to the guest for I/O access. Xen on ARM would therefore be able to run unmodified operating system.

To achieve this goal, it looks more sensible to go towards emulating the host bridge (there will be more details later). A guest would be able to take advantage of the firmware tables, obviating the need for a specific driver for Xen.

Thus, in this document we follow the emulated host bridge approach.

# PCI terminologies

Each PCI device under a host bridge is uniquely identified by its Requester ID (AKA RID). A Requester ID is a triplet of Bus number, Device number, and Function.

When the platform has multiple host bridges, the software can add a fourth number called Segment (sometimes called Domain) to differentiate host bridges.
A PCI device will then uniquely by segment:bus:device:function (AKA SBDF).

So given a specific SBDF, it would be possible to find the host bridge and the RID associated to a PCI device. The pair (host bridge, RID) will often be used to find the relevant information for configuring the different subsystems (e.g IOMMU, MSI controller). For convenience, the rest of the document will use SBDF to refer to the pair (host bridge, RID).

# PCI host bridge

PCI host bridge enables data transfer between a host processor and PCI bus based devices. The bridge is used to access the configuration space of each PCI devices and, on some platform may also act as an MSI controller.

## Initialization of the PCI host bridge

Whilst it would be expected that the bootloader takes care of initializing the PCI host bridge, on some platforms it is done in the Operating System.

This may include enabling/configuring the clocks that could be shared among multiple devices.

## Accessing PCI configuration space

Accessing the PCI configuration space can be divided in 2 category:
    * Indirect access, where the configuration spaces are multiplexed. An
    example would be legacy method on x86 (e.g 0xcf8 and 0xcfc). On ARM a
    similar method is used by PCIe RCar root complex (see [12]).
    * ECAM access, each configuration space will have its own address space.

Whilst ECAM is a standard, some PCI host bridges will require specific fiddling when access the registers (see thunder-ecam [13]).

In most of the cases, accessing all the PCI configuration spaces under a given PCI host will be done the same way (i.e either indirect access or ECAM access). However, there are a few cases, dependent on the PCI devices accessed, which will use different methods (see thunder-pem [14]).

## Generic host bridge

For the purpose of this document, the term "generic host bridge" will be used to describe any host bridge ECAM-compliant and the initialization, if required, will be already done by the firmware/bootloader.

# Interaction of the PCI subsystem with other subsystems

In order to have a PCI device fully working, Xen will need to configure other subsystems such as the IOMMU and the Interrupt Controller.

The interaction expected between the PCI subsystem and the other subsystems is:
    * Add a device
    * Remove a device
    * Assign a device to a guest
    * Deassign a device from a guest

XXX: Detail the interaction when assigning/deassigning device

In the following subsections, the interactions will be briefly described from a higher level perspective. However, implementation details such as callback, structure, etc... are beyond the scope of this document.

## IOMMU

The IOMMU will be used to isolate the PCI device when accessing the memory (e.g DMA and MSI Doorbells). Often the IOMMU will be configured using a MasterID (aka StreamID for ARM SMMU)  that can be deduced from the SBDF with the help of the firmware tables (see below).

Whilst in theory all the memory transactions issued by a PCI device should go through the IOMMU, on certain platforms some of the memory transactions may not reach the IOMMU because they are interpreted by the host bridge. For instance, this could happen if the MSI doorbell is built into the PCI host bridge or for P2P traffic. See [6] for more details.

XXX: I think this could be solved by using direct mapping (e.g. GFN == MFN); this would mean the guest memory layout would be similar to the host one when PCI devices are passed through => Detail it.

## Interrupt controller

PCI supports three kinds of interrupts: legacy interrupts, MSI and MSI-X. On ARM, legacy interrupts will be mapped to SPIs. MSI and MSI-X will write their payload to a doorbell belonging to an MSI controller.

### Existing MSI controllers

In this section some of the existing controllers and their interaction with the devices will be briefly described. More details can be found in the respective specifications of each MSI controller.

MSIs can be distinguished by some combination of the following (summarized in the sketch after this list):
    * the Doorbell
        It is the MMIO address written to. Devices may be configured by
        software to write to arbitrary doorbells which they can address.
        An MSI controller may feature a number of doorbells.
    * the Payload
        Devices may be configured to write an arbitrary payload chosen by
        software. MSI controllers may have restrictions on permitted payload.
        Xen will have to sanitize the payload unless it is known to be always
        safe.
    * Sideband information accompanying the write
        Typically this is neither configurable nor probeable, and depends on
        the path taken through the memory system (i.e it is a property of the
        combination of MSI controller and device rather than a property of
        either in isolation).
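
The sketch below summarizes these three pieces of information as a single tuple. It is purely illustrative (not an existing Xen structure), but it is essentially what Xen would need to know about an MSI to decide whether a given write is safe:

/* Illustrative only: the tuple identifying an MSI from Xen's point of view. */
struct msi_route {
    uint64_t doorbell;   /* MMIO address the device is programmed to write */
    uint32_t payload;    /* data written; may need to be sanitized by Xen  */
    uint32_t sideband;   /* e.g. RID/DeviceID added by the interconnect,   */
                         /* not under software control                     */
};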

### GICv3/GICv4 ITS

The Interrupt Translation Service (ITS) is an MSI controller designed by ARM and integrated in the GICv3/GICv4 interrupt controller. For the specification see [GICV3]. Each MSI/MSI-X will be mapped to a new type of interrupt called LPI. This interrupt will be configured by the software using a pair (DeviceID, EventID).

A platform may have multiple ITS blocks (e.g. one per NUMA node), each of them belonging to an ITS group.

The DeviceID is a unique identifier within an ITS group for each MSI-capable device; it can be deduced from the RID with the help of the firmware tables (see below).

The EventID is a unique identifier to distinguish the different events sent by a device.

The MSI payload will only contain the EventID as the DeviceID will be added afterwards by the hardware in a way that will prevent any tampering.
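
Conceptually, the translation is a two-level lookup: the DeviceID selects a per-device Interrupt Translation Table (ITT) and the EventID indexes into it to obtain the LPI. The sketch below is only a software model of that behaviour (in hardware the ITS walks its own Device Table and ITTs in memory); the structure and function names are made up for illustration.

/* Simplified model of the ITS translation (DeviceID, EventID) -> LPI. */
struct its_device {
    uint32_t *itt;        /* per-device table: EventID -> LPI number */
    uint32_t  nr_events;  /* number of EventIDs allocated            */
};

static uint32_t its_translate(const struct its_device *device_table,
                              uint32_t deviceid, uint32_t eventid)
{
    const struct its_device *dev = &device_table[deviceid];

    if ( eventid >= dev->nr_events )
        return 0;   /* unmapped event: no LPI is generated */

    return dev->itt[eventid];
}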

The [SBSA] appendix I describes the set of rules for the integration of the ITS that any compliant platform should follow. Some of the rules explain the security implications of misbehaving devices. They ensure that a guest will never be able to trigger an MSI on behalf of another guest.

XXX: The security implication is described in the [SBSA] but I haven't found any similar wording in the GICv3 specification. It is unclear to me if non-SBSA compliant platforms (e.g. embedded) will follow those rules.

### GICv2m

The GICv2m is an extension of the GICv2 to convert MSI/MSI-X writes to unique interrupts. The specification can be found in the [SBSA] appendix E.

Depending on the platform, the GICv2m will provide one or multiple instances of register frames. Each frame is composed of a doorbell and associated with a set of SPIs that can be discovered by reading the register MSI_TYPER.

On an MSI write, the payload will contain the SPI ID to generate. Note that on some platforms the MSI payload may contain an offset from the base SPI rather than the SPI itself.

The frame will only generate an SPI if the written value corresponds to an SPI allocated to the frame. Each VM should have exclusive access to a frame to ensure isolation and prevent a guest OS from triggering an MSI on behalf of another guest OS.
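
As an illustration, the sketch below decodes a frame's MSI_TYPER and checks an incoming payload against the SPIs owned by the frame. The register offsets and field layout follow the GICv2m description in [SBSA] appendix E; the helper itself is illustrative and not existing Xen code. On platforms where the payload is an offset from the base SPI, the check would be against the number of SPIs only.

#define V2M_MSI_TYPER      0x008   /* frame register: base SPI and count */
#define V2M_MSI_SETSPI_NS  0x040   /* doorbell written by the devices    */

/* Sketch: is the MSI payload an SPI actually allocated to this frame? */
static bool v2m_payload_is_valid(uint32_t msi_typer, uint32_t payload)
{
    uint32_t base = (msi_typer >> 16) & 0x3ff;   /* first SPI of the frame */
    uint32_t nr   = msi_typer & 0x3ff;           /* number of SPIs         */

    return payload >= base && payload < base + nr;
}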

XXX: Linux seems to consider GICv2m as unsafe by default. From my understanding, it is still unclear how we should proceed on Xen, as GICv2m should be safe as long as the frame is only accessed by one guest.

### Other MSI controllers

Servers compliant with SBSA level 1 and higher will have to use either the ITS or GICv2m. However, they are by no means the only MSI controllers available.
A hardware vendor may decide to use a custom MSI controller which can be integrated in the PCI host bridge.

Whether it will be possible to write securely an MSI will depend on the MSI controller implementations.

XXX: I am happy to give a brief explanation on more MSI controller (such as Xilinx and Renesas) if people think it is necessary.

This design document does not pertain to a specific MSI controller and will try to be as agnostic as possible. When possible, it will give insight into how to integrate the MSI controller.

# Information available in the firmware tables

## ACPI

### Host bridges

The static table MCFG (see 4.2 in [1]) will describe the host bridges available at boot that support ECAM. Unfortunately, there are platforms out there (see [2]) that re-use MCFG to describe host bridges that are not fully ECAM compatible.

This means that Xen needs to account for possible quirks in the host bridge.
The Linux community is working on a patch series for this (see [2] and [3]), where quirks will be detected with:
    * OEM ID
    * OEM Table ID
    * OEM Revision
    * PCI Segment
    * PCI bus number range (wildcard allowed)

Based on what Linux is currently doing, there are two kinds of quirks:
    * Accesses to the configuration space of certain sizes are not allowed
    * A specific driver is necessary for driving the host bridge

The former is straightforward to solve but the latter will require more thought.
Instantiation of a specific driver for the host controller can be easily done if Xen has the information to detect it. However, those drivers may require resources described in ASL (see [4] for instance).

The number of platforms requiring a specific PCI host bridge driver is currently limited. Whilst it is not possible to predict the future, upcoming platforms are expected to have fully ECAM-compliant PCI host bridges. Therefore, given that Xen does not have any ASL parser, the approach suggested is to hardcode the missing values. This could be revisited in the future if necessary.
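
For illustration, hardcoding such quirks in Xen could look like the sketch below, matching on the same fields used by Linux. The structure and function names are made up for this example.

/* Illustrative quirk descriptor for non-compliant ECAM host bridges. */
struct mcfg_quirk {
    char     oem_id[7];          /* ACPI OEM ID (6 characters)       */
    char     oem_table_id[9];    /* ACPI OEM Table ID (8 characters) */
    uint32_t oem_revision;
    uint16_t segment;
    uint8_t  bus_start, bus_end; /* 0x00-0xff acts as a wildcard     */
};

static bool mcfg_quirk_matches(const struct mcfg_quirk *q,
                               const char *oem_id, const char *oem_table_id,
                               uint32_t oem_revision, uint16_t segment,
                               uint8_t bus)
{
    return !strncmp(q->oem_id, oem_id, 6) &&
           !strncmp(q->oem_table_id, oem_table_id, 8) &&
           q->oem_revision == oem_revision &&
           q->segment == segment &&
           bus >= q->bus_start && bus <= q->bus_end;
}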

### Finding information to configure IOMMU and MSI controller

The static table [IORT] will provide information that will help to deduce data (such as MasterID and DeviceID) to configure both the IOMMU and the MSI controller from a given SBDF.

## Finding which NUMA node a PCI device belongs to

On a NUMA system, the NUMA node associated with a PCI device can be found using the _PXM method of the host bridge (?).

XXX: I am not entirely sure where the _PXM will be (i.e host bridge vs PCI device).

## Device Tree

### Host bridges

Each Device Tree node associated to a host bridge will have at least the following properties (see bindings in [8]):
    - device_type: will always be "pci".
    - compatible: a string indicating which driver to instantiate

The node may also contain optional properties such as:
    - linux,pci-domain: assign a fixed segment number
    - bus-range: indicate the range of bus numbers supported

When the property linux,pci-domain is not present, the operating system will have to allocate a segment number for each host bridge.

### Finding information to configure IOMMU and MSI controller

#### Configuring the IOMMU

The Device Tree provides a generic IOMMU binding (see [10]) which uses the properties "iommu-map" and "iommu-map-mask" to describe the relationship between a RID and a MasterID.

These properties will be present in the host bridge Device Tree node. From a given SBDF, it will be possible to find the corresponding MasterID.

Note that the ARM SMMU also has a legacy binding (see [9]), but it does not have a way to describe the relationship between RID and StreamID. Instead it assumes that StreamID == RID. This binding has now been deprecated in favor of the generic IOMMU binding.

#### Configuring the MSI controller

The relationship between the RID and data required to configure the MSI controller (such as DeviceID) can be found using the property "msi-map"
(see [11]).

This property will be present in the host bridge Device Tree node. From a given SBDF, it will be possible to find the corresponding DeviceID.
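
Both "msi-map" and "iommu-map" use the same translation scheme: the RID is first masked (with "msi-map-mask"/"iommu-map-mask", defaulting to 0xffff) and then matched against a list of (rid-base, target, output-base, length) entries; the output is output-base + (rid - rid-base). Below is a sketch of that lookup, applying equally to deriving a MasterID or a DeviceID (illustrative, not existing Xen code).

/* One entry of an "iommu-map"/"msi-map" property, ignoring the phandle
 * to the target IOMMU/MSI controller for simplicity.
 */
struct rid_map_entry {
    uint32_t rid_base;      /* first RID covered by this entry            */
    uint32_t output_base;   /* MasterID (iommu-map) or DeviceID (msi-map) */
    uint32_t length;        /* number of consecutive RIDs covered         */
};

static int rid_to_output(const struct rid_map_entry *map, unsigned int nr,
                         uint32_t rid, uint32_t map_mask, uint32_t *out)
{
    uint32_t masked = rid & map_mask;
    unsigned int i;

    for ( i = 0; i < nr; i++ )
        if ( masked >= map[i].rid_base &&
             masked < map[i].rid_base + map[i].length )
        {
            *out = map[i].output_base + (masked - map[i].rid_base);
            return 0;
        }

    return -1;   /* no translation: the device cannot be configured */
}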

## Finding which NUMA node a PCI device belongs to

On a NUMA system, the NUMA node associated with a PCI device can be found using the property "numa-node-id" (see [15]) present in the host bridge Device Tree node.

# Discovering PCI devices

Whilst PCI devices are currently available in the hardware domain, the hypervisor does not have any knowledge of them. The first step of supporting PCI pass-through is to make Xen aware of the PCI devices.

Xen will require access to the PCI configuration space to retrieve information for the PCI devices or access it on behalf of the guest via the emulated host bridge.

This means that Xen should be in charge of controlling the host bridge. However, for some host controllers, this may be difficult to implement in Xen because of dependencies on other components (e.g. clocks, see more details in the "PCI host bridge" section).

For this reason, the approach chosen in this document is to let the hardware domain discover the host bridges, scan the PCI devices and then report everything to Xen. This does not rule out the possibility of doing everything without the help of the hardware domain in the future.

## Who is in charge of the host bridge?

Numerous host bridge implementations exist on ARM. Some of them require a specific driver as they cannot be driven by a generic host bridge driver. Porting those drivers may be complex due to dependencies on other components.

This could be seen as a signal to leave the host bridge drivers in the hardware domain. Because Xen would need to access the configuration space, all the accesses would have to be forwarded to the hardware domain, which in turn would access the hardware.

In this design document, we are considering that the host bridge driver can be ported to Xen. In the case it is not possible, an interface to forward configuration space accesses would need to be defined. The interface details are out of scope.

## Discovering and registering host bridge

The approach taken in the document will require communication between Xen and the hardware domain. In this case, they would need to agree on the segment number associated with a host bridge. However, this number is not available in the Device Tree case.

The hardware domain will register new host bridges using the existing hypercall
PHYSDEVOP_pci_mmcfg_reserved:

#define XEN_PCI_MMCFG_RESERVED 1

struct physdev_pci_mmcfg_reserved {
    /* IN */
    uint64_t    address;
    uint16_t    segment;
    /* Range of bus supported by the host bridge */
    uint8_t     start_bus;
    uint8_t     end_bus;

    uint32_t    flags;
};

Some of the host bridges may not have a separate configuration address space region described in the firmware tables. To simplify the registration, the field 'address' should contain the base address of one of the regions described in the firmware tables:
    * For ACPI, it would be the base address specified in the MCFG or in the
    _CBA method.
    * For Device Tree, this would be any base address of region
    specified in the "reg" property.

The field 'flags' is expected to have XEN_PCI_MMCFG_RESERVED set.

It is expected that this hypercall is called before any PCI device is registered to Xen.

When the hardware domain is in charge of the host bridge, this hypercall will be used to tell Xen about the existence of a host bridge in order to find the associated information for configuring the MSI controller and the IOMMU.
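
As an illustration, registering a host bridge from the hardware domain could look like the sketch below. It is modelled on how Linux already issues physdev hypercalls on x86 (HYPERVISOR_physdev_op() being the usual guest-side wrapper); whether the ARM hardware domain reuses this exact plumbing is an assumption of the example.

/* Sketch: hardware domain registering one host bridge with Xen. */
int register_host_bridge(uint64_t cfg_base, uint16_t segment,
                         uint8_t start_bus, uint8_t end_bus)
{
    struct physdev_pci_mmcfg_reserved r = {
        .address   = cfg_base,   /* a base address from MCFG/_CBA or "reg" */
        .segment   = segment,
        .start_bus = start_bus,
        .end_bus   = end_bus,
        .flags     = XEN_PCI_MMCFG_RESERVED,
    };

    return HYPERVISOR_physdev_op(PHYSDEVOP_pci_mmcfg_reserved, &r);
}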

## Discovering and registering PCI devices

The hardware domain will scan the host bridges to find the list of PCI devices available and then report them to Xen using the existing hypercall
PHYSDEVOP_pci_device_add:

#define XEN_PCI_DEV_EXTFN   0x1
#define XEN_PCI_DEV_VIRTFN  0x2
#define XEN_PCI_DEV_PXM     0x4

struct physdev_pci_device_add {
    /* IN */
    uint16_t    seg;
    uint8_t     bus;
    uint8_t     devfn;
    uint32_t    flags;
    struct {
        uint8_t bus;
        uint8_t devfn;
    } physfn;
    /*
     * Optional parameters array.
     * First element ([0]) is PXM domain associated with the device (if
     * XEN_PCI_DEV_PXM is set)
     */
    uint32_t optarr[0];
};

When XEN_PCI_DEV_PXM is set in the field 'flags', optarr[0] will contain the NUMA node ID associated with the device:
    * For ACPI, it would be the value returned by the method _PXM
    * For Device Tree, this would be the value found in the property "numa-node-id".
For more details see the section "Finding which NUMA node a PCI device belongs to" in "ACPI" and "Device Tree".

XXX: I still don't fully understand how XEN_PCI_DEV_EXTFN and XEN_PCI_DEV_VIRTFN will work. AFAICT, the former is used when the bus supports ARI and the only usage is in the x86 IOMMU code. For the latter, this is related to IOV but I am not sure what devfn and physfn.devfn will correspond to.

Note that x86 currently provides two more hypercalls (PHYSDEVOP_manage_pci_add and PHYSDEVOP_manage_pci_add_ext) to register PCI devices. However, they are a subset of the hypercall PHYSDEVOP_pci_device_add. Therefore, it is suggested to leave them unimplemented on ARM.
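
Similarly, reporting one discovered device with its NUMA node could look like the sketch below (modelled on the existing Linux usage of this hypercall on x86, where the PXM value is passed through optarr[0]; the wrapper structure is only there to provide storage for the optional array).

/* Sketch: hardware domain reporting a discovered PCI device to Xen. */
int report_pci_device(uint16_t seg, uint8_t bus, uint8_t devfn,
                      uint32_t numa_node)
{
    struct {
        struct physdev_pci_device_add add;
        uint32_t pxm;                  /* backs add.optarr[0]           */
    } args = {
        .add = {
            .seg   = seg,
            .bus   = bus,
            .devfn = devfn,
            .flags = XEN_PCI_DEV_PXM,  /* optarr[0] holds the NUMA node */
        },
        .pxm = numa_node,
    };

    return HYPERVISOR_physdev_op(PHYSDEVOP_pci_device_add, &args.add);
}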

## Removing PCI devices

The hardware domain will be in charge of telling Xen when a device has been removed, using the existing hypercall PHYSDEVOP_pci_device_remove:

struct physdev_pci_device {
    /* IN */
    uint16_t    seg;
    uint8_t     bus;
    uint8_t     devfn;
};

Note that x86 currently provides one more hypercall (PHYSDEVOP_manage_pci_remove) to remove PCI devices. However, it does not allow passing a segment number.
Therefore it is suggested to leave it unimplemented on ARM.

# Glossary

ECAM: Enhanced Configuration Access Mechanism
SBDF: Segment Bus Device Function. The segment is a software concept.
MSI: Message Signaled Interrupt
MSI doorbell: MMIO address written to by a device to generate an MSI
SPI: Shared Peripheral Interrupt
LPI: Locality-specific Peripheral Interrupt
ITS: Interrupt Translation Service

# Specifications
[SBSA]  ARM-DEN-0029 v3.0
[GICV3] IHI0069C
[IORT]  DEN0049B

# Bibliography

[1] PCI firmware specification, rev 3.2
[2] https://www.spinics.net/lists/linux-pci/msg56715.html
[3] https://www.spinics.net/lists/linux-pci/msg56723.html
[4] https://www.spinics.net/lists/linux-pci/msg56728.html
[6] https://www.spinics.net/lists/kvm/msg140116.html
[7] http://www.firmware.org/1275/bindings/pci/pci2_1.pdf
[8] Documentation/devicetree/bindings/pci
[9] Documentation/devicetree/bindings/iommu/arm,smmu.txt
[10] Documentation/devicetree/bindings/pci/pci-iommu.txt
[11] Documentation/devicetree/bindings/pci/pci-msi.txt
[12] drivers/pci/host/pcie-rcar.c
[13] drivers/pci/host/pci-thunder-ecam.c
[14] drivers/pci/host/pci-thunder-pem.c
[15] Documentation/devicetree/bindings/numa.txt

* Re: [RFC] ARM PCI Passthrough design document
  2017-06-20  0:19 ` Vikram Sethi
@ 2017-06-28 15:22   ` Julien Grall
  2017-06-29 15:17     ` Vikram Sethi
  2017-07-04  8:30     ` roger.pau
  0 siblings, 2 replies; 35+ messages in thread
From: Julien Grall @ 2017-06-28 15:22 UTC (permalink / raw)
  To: Vikram Sethi, Stefano Stabellini
  Cc: edgar.iglesias, Sinan Kaya, Wei Chen, Steve Capper,
	Andre Przywara, manish.jaggi, punit.agrawal, Sameer Goel,
	xen-devel, Dave P Martin, Vijaya Kumar K, roger.pau



On 20/06/17 01:19, Vikram Sethi wrote:
> Hi Julien,

Hi Vikram,

Thank you for your feedbacks.

> Thanks for posting this. I think some additional topics need to be covered in the design document, under 3 main topics:

I wanted to limit the scope of the PCI passthrough work to the strict 
minimum. I didn't consider hotplug and AER in the scope because they are 
optional features.

>
> Hotplug: how will Xen support hotplug? Many rootports may require firmware hooks such as ACPI ASL to take care of platform specific MMIO initialization on hotplug. Normally firmware (UEFI) would have done that platform specific setup at boot.

We don't have ASL support in Xen. So I would expect the hotplug to be 
handled by the hardware domain and then report it to Xen.

This would also fit quite well to the current design as the hardware 
domain will scan PCI devices at boot and then register them to Xen via 
an hypercall.

>
> AER: Will PCIe non-fatal and fatal errors (secondary bus reset for fatal) be recoverable in Xen?
> Will drivers in doms be notified about fatal errors so they can be quiesced before doing secondary bus reset in Xen?
> Will Xen support Firmware First Error handling for AER? i.e When platform does Firmware first error handling for AER and/or filtering of AER, sends associated ACPI HEST logs to Xen
> How will AER notification and logs be propagated to the doms: injected ACPI HEST?
>
> PCIe DPC (Downstream Port Containment): will it be supported in Xen, and Xen will register for DPC interrupt? When Xen brings the link back up will it send a simulated hotplug to dom0 to show link back up?

I don't feel it is necessary to look at AER for the first work of PCI 
passthrough. I consider it as a separate feature that could probably 
come with the RAS story.

At the moment, I don't know who is going to handle the error and even 
how they will be reported to the guest. But I don't think this will have 
any impact on our design choice here.

Let me know if you think it may have an impact.

Cheers,

-- 
Julien Grall


* Re: [RFC] ARM PCI Passthrough design document
  2017-06-28 15:22   ` Julien Grall
@ 2017-06-29 15:17     ` Vikram Sethi
  2017-07-03 14:35       ` Julien Grall
  2017-07-04  8:30     ` roger.pau
  1 sibling, 1 reply; 35+ messages in thread
From: Vikram Sethi @ 2017-06-29 15:17 UTC (permalink / raw)
  To: Julien Grall, Stefano Stabellini
  Cc: edgar.iglesias, Sinan Kaya, Wei Chen, Steve Capper,
	Andre Przywara, manish.jaggi, punit.agrawal, Sameer Goel,
	xen-devel, Dave P Martin, Vijaya Kumar K, roger.pau

Hi Julien, 
My thoughts are that while it is not essential to recover from AER and DPC initially, it is critical to at least take the slot offline and notify drivers so they quiesce.
Without this basic handling, it is possible to create backups in some hardware that result in CPU hangs for loads to adapter MMIO/cfg space and we don't want that.
i.e it is probably OK to lose the slot/adapter in initial implementation, but IMO it is NOT ok to crash/reboot the system by having watchdog kick in.
We do need to minimally describe what we will do with the AER and DPC interrupts: are they first handled by Xen and sent as "emulated" interrupt to owning domain?
Or are the interrupts ignored in initial implementation (not a good idea IMO)?

Hotplug also does not need to be solved right away. But we need to at least walk through the flows and convince ourselves we are not painting ourselves in a corner.
I will be in Budapest for Xen developer summit and we can walk through the ACPI hotplug flow and see how that *could* fit into proposed Xen design.

Thanks,
Vikram

-----Original Message-----
From: Julien Grall [mailto:julien.grall@linaro.org] 
Sent: Wednesday, June 28, 2017 10:23 AM
To: Vikram Sethi <vikrams@qti.qualcomm.com>; Stefano Stabellini <sstabellini@kernel.org>
Cc: xen-devel <xen-devel@lists.xenproject.org>; edgar.iglesias@xilinx.com; Steve Capper <Steve.Capper@arm.com>; punit.agrawal@arm.com; Wei Chen <Wei.Chen@arm.com>; Dave P Martin <Dave.Martin@arm.com>; Sameer Goel <sgoel@qti.qualcomm.com>; Sinan Kaya <okaya@qti.qualcomm.com>; roger.pau@citrix.com; manish.jaggi@caviumnetworks.com; Vijaya Kumar K <Vijaya.Kumar@caviumnetworks.com>; Andre Przywara <andre.przywara@arm.com>
Subject: Re: [RFC] ARM PCI Passthrough design document



On 20/06/17 01:19, Vikram Sethi wrote:
> Hi Julien,

Hi Vikram,

Thank you for your feedbacks.

> Thanks for posting this. I think some additional topics need to be covered in the design document, under 3 main topics:

I wanted to limit the scope of the PCI passthrough work to the strict minimum. I didn't consider hotplug and AER in the scope because it is optional feature.

>
> Hotplug: how will Xen support hotplug? Many rootports may require firmware hooks such as ACPI ASL to take care of platform specific MMIO initialization on hotplug. Normally firmware (UEFI) would have done that platform specific setup at boot.

We don't have ASL support in Xen. So I would expect the hotplug to be 
handled by the hardware domain and then report it to Xen.

This would also fit quite well to the current design as the hardware 
domain will scan PCI devices at boot and then register them to Xen via 
an hypercall.

>
> AER: Will PCIe non-fatal and fatal errors (secondary bus reset for fatal) be recoverable in Xen?
> Will drivers in doms be notified about fatal errors so they can be quiesced before doing secondary bus reset in Xen?
> Will Xen support Firmware First Error handling for AER? i.e When platform does Firmware first error handling for AER and/or filtering of AER, sends associated ACPI HEST logs to Xen
> How will AER notification and logs be propagated to the doms: injected ACPI HEST?
>
> PCIe DPC (Downstream Port Containment): will it be supported in Xen, and Xen will register for DPC interrupt? When Xen brings the link back up will it send a simulated hotplug to dom0 to show link back up?

I don't feel it is necessary to look at AER for the first work of PCI 
passthrough. I consider it as a separate feature that could probably 
come with the RAS story.

At the moment, I don't know who is going to handle the error and even 
how they will be reported to the guest. But I don't think this will have 
any impact on our design choice here.

Let me know if you think it may have an impact.

Cheers,

-- 
Julien Grall

* Re: [RFC] ARM PCI Passthrough design document
  2017-06-29 15:17     ` Vikram Sethi
@ 2017-07-03 14:35       ` Julien Grall
  0 siblings, 0 replies; 35+ messages in thread
From: Julien Grall @ 2017-07-03 14:35 UTC (permalink / raw)
  To: Vikram Sethi, Stefano Stabellini
  Cc: edgar.iglesias, Sinan Kaya, Wei Chen, Steve Capper,
	Andre Przywara, manish.jaggi, punit.agrawal, Sameer Goel,
	xen-devel, Dave P Martin, Vijaya Kumar K, roger.pau



On 29/06/17 16:17, Vikram Sethi wrote:
> Hi Julien,

Hi Vikram,

> My thoughts are that while it is not essential to recover from AER and DPC initially, it is critical to at least take the slot offline and notify drivers so they quiesce.
> Without this basic handling, it is possible to create backups in some hardware that result in CPU hangs for loads to adapter MMIO/cfg space and we don't want that.
> i.e it is probably OK to lose the slot/adapter in initial implementation, but IMO it is NOT ok to crash/reboot the system by having watchdog kick in.
> We do need to minimally describe what we will do with the AER and DPC interrupts: are they first handled by Xen and sent as "emulated" interrupt to owning domain?
> Or are the interrupts ignored in initial implementation (not a good idea IMO)?

I don't think it is possible to ask everything to be supported in the 
initial implementation. We have to draw a line so we can get a tech 
preview support in Xen as soon as possible.

At the moment, I am focusing on the foundation that will be required for 
all the boards. I have put them in my low priority tasks because AER, 
DPC, hotplug are optional features and hence not available everywhere.

Feel free to send me a proposal for the design document, patch series if 
you want them to be included in the initial implementation.

>
> Hotplug also does not need to be solved right away. But we need to at least walk through the flows and convince ourselves we are not painting ourselves in a corner.
> I will be in Budapest for Xen developer summit and we can walk through the ACPI hotplug flow and see how that *could* fit into proposed Xen design.

Glad to know that. Let's schedule some discussions during the summit.

Cheers,

-- 
Julien Grall


* Re: [RFC] ARM PCI Passthrough design document
  2017-06-28 15:22   ` Julien Grall
  2017-06-29 15:17     ` Vikram Sethi
@ 2017-07-04  8:30     ` roger.pau
  2017-07-06 20:55       ` Vikram Sethi
  1 sibling, 1 reply; 35+ messages in thread
From: roger.pau @ 2017-07-04  8:30 UTC (permalink / raw)
  To: Julien Grall
  Cc: edgar.iglesias, Stefano Stabellini, Wei Chen, Steve Capper,
	Andre Przywara, manish.jaggi, punit.agrawal, Vikram Sethi,
	Sinan Kaya, Sameer Goel, xen-devel, Dave P Martin,
	Vijaya Kumar K

Hello,

My 2cents on what are the plans on PVH/x86.

On Wed, Jun 28, 2017 at 04:22:48PM +0100, Julien Grall wrote:
> 
> 
> On 20/06/17 01:19, Vikram Sethi wrote:
> > Hi Julien,
> 
> Hi Vikram,
> 
> Thank you for your feedbacks.
> 
> > Thanks for posting this. I think some additional topics need to be covered in the design document, under 3 main topics:
> 
> I wanted to limit the scope of the PCI passthrough work to the strict
> minimum. I didn't consider hotplug and AER in the scope because it is
> optional feature.
> 
> > 
> > Hotplug: how will Xen support hotplug? Many rootports may require firmware hooks such as ACPI ASL to take care of platform specific MMIO initialization on hotplug. Normally firmware (UEFI) would have done that platform specific setup at boot.
> 
> We don't have ASL support in Xen. So I would expect the hotplug to be
> handled by the hardware domain and then report it to Xen.
> 
> This would also fit quite well to the current design as the hardware domain
> will scan PCI devices at boot and then register them to Xen via an
> hypercall.

Hotplug will be done using an hypercall. We already have them in place
for PV, and this is simply going to be reused:

Hotplug PCI devices:
PHYSDEVOP_manage_pci_add{_ext}

hotplug MMCFG (ECAM) regions:
PHYSDEVOP_pci_mmcfg_reserved

> > 
> > AER: Will PCIe non-fatal and fatal errors (secondary bus reset for fatal) be recoverable in Xen?
> > Will drivers in doms be notified about fatal errors so they can be quiesced before doing secondary bus reset in Xen?
> > Will Xen support Firmware First Error handling for AER? i.e When platform does Firmware first error handling for AER and/or filtering of AER, sends associated ACPI HEST logs to Xen
> > How will AER notification and logs be propagated to the doms: injected ACPI HEST?

Hm, I'm not sure I follow here, I don't see AER tied to ACPI. AER is a
PCIe capability, and according to the spec can be setup completely
independent to ACPI.

In any case, Xen can trap or hide the capability from guests, Xen
could possibly even emulate AER somehow if that's more suitable (ie:
guest sets up AER, Xen traps accesses to this capability and filters
the errors Xen wants to handle itself vs the errors that should be
propagated to the guest).

The biggest issue I see with AER (and DPC) is that it requires an
interrupt. So Xen would have to stole one (or more) interrupts from
the guest in order to make use of those capabilities if they are to be
exclusively managed by Xen. This could be done by simply telling the
guest the device has less MSI/MSI-X interrupts than it really has.

> > PCIe DPC (Downstream Port Containment): will it be supported in Xen, and Xen will register for DPC interrupt? When Xen brings the link back up will it send a simulated hotplug to dom0 to show link back up?
> 
> I don't feel it is necessary to look at AER for the first work of PCI
> passthrough. I consider it as a separate feature that could probably come
> with the RAS story.
> 
> At the moment, I don't know who is going to handle the error and even how
> they will be reported to the guest. But I don't think this will have any
> impact on our design choice here.
> 
> Let me know if you think it may have an impact.

As Julien said, I think that you probably know more about AER/DPC than
we do, so it would be good if you could go over the design document
and make sure that the current approach can work with the way you
intend to use AER/DPC.

Thanks, Roger.


* Re: [RFC] ARM PCI Passthrough design document
  2017-07-04  8:30     ` roger.pau
@ 2017-07-06 20:55       ` Vikram Sethi
  2017-07-07  8:49         ` Roger Pau Monné
  0 siblings, 1 reply; 35+ messages in thread
From: Vikram Sethi @ 2017-07-06 20:55 UTC (permalink / raw)
  To: roger.pau, 'Julien Grall'
  Cc: edgar.iglesias, 'Stefano Stabellini', 'Wei Chen',
	'Steve Capper', 'Andre Przywara',
	manish.jaggi, punit.agrawal, 'Vikram Sethi',
	'Sinan Kaya', 'Sameer Goel', 'xen-devel',
	'Dave P Martin', 'Vijaya Kumar K'

Hi Roger,
Thanks for your comments. My responses inline.


> -----Original Message-----
> From: Xen-devel [mailto:xen-devel-bounces@lists.xen.org] On Behalf Of
> roger.pau@citrix.com
> Sent: Tuesday, July 4, 2017 3:31 AM
> To: Julien Grall <julien.grall@linaro.org>
> Cc: edgar.iglesias@xilinx.com; Stefano Stabellini <sstabellini@kernel.org>; 
> Wei
> Chen <Wei.Chen@arm.com>; Steve Capper <Steve.Capper@arm.com>; Andre
> Przywara <andre.przywara@arm.com>; manish.jaggi@caviumnetworks.com;
> punit.agrawal@arm.com; Vikram Sethi <vikrams@qti.qualcomm.com>; Sinan
> Kaya <okaya@qti.qualcomm.com>; Sameer Goel <sgoel@qti.qualcomm.com>;
> xen-devel <xen-devel@lists.xenproject.org>; Dave P Martin
> <Dave.Martin@arm.com>; Vijaya Kumar K
> <Vijaya.Kumar@caviumnetworks.com>
> Subject: Re: [Xen-devel] [RFC] ARM PCI Passthrough design document
>
> Hello,
>
> My 2cents on what are the plans on PVH/x86.
>
> On Wed, Jun 28, 2017 at 04:22:48PM +0100, Julien Grall wrote:
> >
> >
> > On 20/06/17 01:19, Vikram Sethi wrote:
> > > Hi Julien,
> >
> > Hi Vikram,
> >
> > Thank you for your feedbacks.
> >
> > > Thanks for posting this. I think some additional topics need to be covered 
> > > in
> the design document, under 3 main topics:
> >
> > I wanted to limit the scope of the PCI passthrough work to the strict
> > minimum. I didn't consider hotplug and AER in the scope because it is
> > optional feature.
> >
> > >
> > > Hotplug: how will Xen support hotplug? Many rootports may require
> firmware hooks such as ACPI ASL to take care of platform specific MMIO
> initialization on hotplug. Normally firmware (UEFI) would have done that
> platform specific setup at boot.
> >
> > We don't have ASL support in Xen. So I would expect the hotplug to be
> > handled by the hardware domain and then report it to Xen.
> >
> > This would also fit quite well to the current design as the hardware
> > domain will scan PCI devices at boot and then register them to Xen via
> > an hypercall.
>
> Hotplug will be done using an hypercall. We already have them in place for PV,
> and this is simply going to be reused:
>
> Hotplug PCI devices:
> PHYSDEVOP_manage_pci_add{_ext}
>
> hotplug MMCFG (ECAM) regions:
> PHYSDEVOP_pci_mmcfg_reserved
>
> > >
> > > AER: Will PCIe non-fatal and fatal errors (secondary bus reset for fatal) 
> > > be
> recoverable in Xen?
> > > Will drivers in doms be notified about fatal errors so they can be 
> > > quiesced
> before doing secondary bus reset in Xen?
> > > Will Xen support Firmware First Error handling for AER? i.e When
> > > platform does Firmware first error handling for AER and/or filtering of 
> > > AER,
> sends associated ACPI HEST logs to Xen How will AER notification and logs be
> propagated to the doms: injected ACPI HEST?
>
> Hm, I'm not sure I follow here, I don't see AER tied to ACPI. AER is a PCIe
> capability, and according to the spec can be setup completely independent to
> ACPI.
>
True, it can be independent if not using firmware first AER handling (FFH). But 
Firmware tells the OS whether firmware first is in use.
If FFH is in use, the AER interrupt goes to firmware and then firmware processes 
the AER logs, filters errors, and sends a ACPI HEST log with the filtered AER 
regs to OS along with an ACPI event/interrupt. Kernel is not supposed to touch 
the AER registers directly in this case, but act on the register values in the 
HEST log.
http://elixir.free-electrons.com/linux/latest/source/drivers/pci/pcie/aer/aerdrv_acpi.c#L94
If Firmware is using FFH, Xen will get a HEST log with AER registers, and must 
parse those registers instead of reading AER config space.
After the AER registers have been parsed (either from HEST log or native Xen AER 
interrupt handler), at least for fatal errors, Xen needs to send notification to 
the DOM with the device passthrough so that it's driver(s) can be quiesced (via 
callbacks to dev->driver->err_handler->error_detected for linux) before hot 
reset/secondary bus reset.

Whether FFH is in use or not, Xen has 2 choices in how to present the error to 
doms for quiescing before secondary bus reset:
a. Send a HEST log and ACPI interrupt/event to dom if it booted ACPI dom and 
linux dom calls aer_recover_queue from ACPI ghes path 
http://elixir.free-electrons.com/linux/latest/source/drivers/pci/pcie/aer/aerdrv_core.c#L592
b. Present a Root port wired interrupt source in dom ACPI/DT, and inject that 
irq in the GIC LR registers. When dom kernel processes the interrupt and queries 
config space AER, Xen emulates the AER values it wants the dom to see (in FFH 
case based on register values in HEST), and if FFH was in use, not actually 
allow the dom to clear out the AER registers.

Option b is probably better/easier since it works for ACPI/DT dom.

In my view this is the basic AER error handling leaving the devices 
inaccessible.
To recover/resume the devices, the owning dom would need to signal Xen once all 
its driver(s) have quiesced, letting Xen know it is ok to do the secondary bus 
reset (for AER fatal errors). The best way to signal this would be to let the 
dom try to hit SBR in the Root port bridge control register in config space, and 
Xen traps that and actually does the BCR.SBR write.

Since Xen controls the ECAM config space access in Julien's proposed design, I 
don't see any fundamental issues with the above flow fitting into the design.

> In any case, Xen can trap or hide the capability from guests, Xen could 
> possibly
> even emulate AER somehow if that's more suitable (ie:
> guest sets up AER, Xen traps accesses to this capability and filters the 
> errors
> Xen wants to handle itself vs the errors that should be propagated to the
> guest).
>
> The biggest issue I see with AER (and DPC) is that it requires an interrupt. 
> So
> Xen would have to stole one (or more) interrupts from the guest in order to
> make use of those capabilities if they are to be exclusively managed by Xen.
> This could be done by simply telling the guest the device has less MSI/MSI-X
> interrupts than it really has.
>
> > > PCIe DPC (Downstream Port Containment): will it be supported in Xen, and
> Xen will register for DPC interrupt? When Xen brings the link back up will it 
> send
> a simulated hotplug to dom0 to show link back up?
> >
> > I don't feel it is necessary to look at AER for the first work of PCI
> > passthrough. I consider it as a separate feature that could probably
> > come with the RAS story.
> >
> > At the moment, I don't know who is going to handle the error and even
> > how they will be reported to the guest. But I don't think this will
> > have any impact on our design choice here.
> >
> > Let me know if you think it may have an impact.
>
> As Julien said, I think that you probably know more about AER/DPC than we do,
> so it would be good if you could go over the design document and mare sure
> that the current approach can work with the way you intend to use AER/DPC.
>

I think what I wrote above supplements the design, and I don't see any 
fundamental issue.
Let me know if you have any questions or concerns with proposed flow.

> Thanks, Roger.
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> https://lists.xen.org/xen-devel


Thanks,
Vikram
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, 
Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.




* Re: [RFC] ARM PCI Passthrough design document
  2017-07-06 20:55       ` Vikram Sethi
@ 2017-07-07  8:49         ` Roger Pau Monné
  2017-07-07 21:50           ` Stefano Stabellini
  0 siblings, 1 reply; 35+ messages in thread
From: Roger Pau Monné @ 2017-07-07  8:49 UTC (permalink / raw)
  To: Vikram Sethi
  Cc: edgar.iglesias, 'Stefano Stabellini', 'Wei Chen',
	'Steve Capper', 'Andre Przywara',
	manish.jaggi, 'Julien Grall', 'Vikram Sethi',
	punit.agrawal, 'Sameer Goel', 'xen-devel',
	'Sinan Kaya', 'Dave P Martin',
	'Vijaya Kumar K'

On Thu, Jul 06, 2017 at 03:55:28PM -0500, Vikram Sethi wrote:
> > > > AER: Will PCIe non-fatal and fatal errors (secondary bus reset for fatal) 
> > > > be
> > recoverable in Xen?
> > > > Will drivers in doms be notified about fatal errors so they can be 
> > > > quiesced
> > before doing secondary bus reset in Xen?
> > > > Will Xen support Firmware First Error handling for AER? i.e When
> > > > platform does Firmware first error handling for AER and/or filtering of 
> > > > AER,
> > sends associated ACPI HEST logs to Xen How will AER notification and logs be
> > propagated to the doms: injected ACPI HEST?
> >
> > Hm, I'm not sure I follow here, I don't see AER tied to ACPI. AER is a PCIe
> > capability, and according to the spec can be setup completely independent to
> > ACPI.
> >
> True, it can be independent if not using firmware first AER handling (FFH). But 
> Firmware tells the OS whether firmware first is in use.
> If FFH is in use, the AER interrupt goes to firmware and then firmware processes 

I'm sorry, but how is the firmware supposed to know which interrupt is
AER using? That's AFAIK setup in the PCI AER capabilities, and
depends on whether the OS configures the device to use MSI or MSI-X.

Is there some kind of side-band mechanism that delivers the AER
interrupt using a different method?

> the AER logs, filters errors, and sends a ACPI HEST log with the filtered AER 
> regs to OS along with an ACPI event/interrupt. Kernel is not supposed to touch 
> the AER registers directly in this case, but act on the register values in the 
> HEST log.
> http://elixir.free-electrons.com/linux/latest/source/drivers/pci/pcie/aer/aerdrv_acpi.c#L94

That's not a problem IMHO, Xen could even mask the AER capability from
the Dom0/guest completely if needed.

> If Firmware is using FFH, Xen will get a HEST log with AER registers, and must 
> parse those registers instead of reading AER config space.

Xen will not get an event, it's going to be delivered to Dom0 because
when using ACPI Dom0 is the OSPM (not Xen). I assume this event is
going to be notified by triggering an interrupt from the ACPI SCI?

> After the AER registers have been parsed (either from HEST log or native Xen AER 
> interrupt handler), at least for fatal errors, Xen needs to send notification to 
> the DOM with the device passthrough so that it's driver(s) can be quiesced (via 
> callbacks to dev->driver->err_handler->error_detected for linux) before hot 
> reset/secondary bus reset.

I don't think this is relevant/true given the statement above (Dom0
being OSPM and receiving the event).

> Whether FFH is in use or not, Xen has 2 choices in how to present the error to 
> doms for quiescing before secondary bus reset:

How is this secondary bus reset performed?

Is it something specific to each bridge or it's a standard
interface?

Can it be done directly by Dom0, or should it be done by Xen?

> a. Send a HEST log and ACPI interrupt/event to dom if it booted ACPI dom and 
> linux dom calls aer_recover_queue from ACPI ghes path 
> http://elixir.free-electrons.com/linux/latest/source/drivers/pci/pcie/aer/aerdrv_core.c#L592b. Present a Root port wired interrupt source in dom ACPI/DT, and inject that 
> irq in the GIC LR registers. When dom kernel processes the interrupt and queries 

You lost me here, I have no knowledge of ARM, and I don't know what
GIC LR is at all.

> config space AER, Xen emulates the AER values it wants the dom to see (in FFH 
> case based on register values in HEST), and if FFH was in use, not actually 
> allow the dom to clear out the AER registers.
> 
> Option b is probably better/easier since it works for ACPI/DT dom.

So as I understand it, the flow is the following:

1. Hardware generates an error.
2. This error triggers an interrupt that's delivered to Dom0 (either
   using an ACPI SCI or a specific AER MSI vector)
3. *Someone* has to do a secondary bus reset.

My question would be, who (either Xen or Dom0) should perform the bus
reset? (and why).

> In my view this is the basic AER error handling leaving the devices 
> inaccessible.
> To recover/resume the devices, the owning dom would need to signal Xen once all 
> its driver(s) have quiesced, letting Xen know it is ok to do the secondary bus 
> reset (for AER fatal errors). The best way to signal this would be to let the 
> dom try to hit SBR in the Root port bridge control register in config space, and 
> Xen traps that and actually does the BCR.SBR write.
>
> Since Xen controls the ECAM config space access in Julien's proposed design, I 
> don't see any fundamental issues with the above flow fitting into the design.

I think it's very hard for me (or Julien) to know exactly how all the
PCI capabilities behave and interact with other components (like
ACPI).

You seem to have a good amount of knowledge about this stuff, would
you mind writing your proposal as a diff to Julien's original
proposal, so that it can be properly reviewed and merged into the
design document?

Thanks, Roger.


* Re: [RFC] ARM PCI Passthrough design document
  2017-07-07  8:49         ` Roger Pau Monné
@ 2017-07-07 21:50           ` Stefano Stabellini
  2017-07-07 23:40             ` Vikram Sethi
  2017-07-08  7:34             ` Roger Pau Monné
  0 siblings, 2 replies; 35+ messages in thread
From: Stefano Stabellini @ 2017-07-07 21:50 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: edgar.iglesias, 'Stefano Stabellini',
	Vikram Sethi, 'Wei Chen', 'Steve Capper',
	'Andre Przywara', manish.jaggi, 'Julien Grall',
	'Vikram Sethi', punit.agrawal, 'Sameer Goel',
	'xen-devel', 'Sinan Kaya',
	'Dave P Martin', 'Vijaya Kumar K'

On Fri, 7 Jul 2017, Roger Pau Monné wrote:
> On Thu, Jul 06, 2017 at 03:55:28PM -0500, Vikram Sethi wrote:
> > > > > AER: Will PCIe non-fatal and fatal errors (secondary bus reset for fatal) 
> > > > > be
> > > recoverable in Xen?
> > > > > Will drivers in doms be notified about fatal errors so they can be 
> > > > > quiesced
> > > before doing secondary bus reset in Xen?
> > > > > Will Xen support Firmware First Error handling for AER? i.e When
> > > > > platform does Firmware first error handling for AER and/or filtering of 
> > > > > AER,
> > > sends associated ACPI HEST logs to Xen How will AER notification and logs be
> > > propagated to the doms: injected ACPI HEST?
> > >
> > > Hm, I'm not sure I follow here, I don't see AER tied to ACPI. AER is a PCIe
> > > capability, and according to the spec can be setup completely independent to
> > > ACPI.
> > >
> > True, it can be independent if not using firmware first AER handling (FFH). But 
> > Firmware tells the OS whether firmware first is in use.
> > If FFH is in use, the AER interrupt goes to firmware and then firmware processes 
> 
> I'm sorry, but how is the firmware supposed to know which interrupt is
> AER using? That's AFAIK setup in the PCI AER capabilities, and
> depends on whether the OS configures the device to use MSI or MSI-X.
> 
> Is there some kind of side-band mechanism that delivers the AER
> interrupt using a different method?
> 
> > the AER logs, filters errors, and sends a ACPI HEST log with the filtered AER 
> > regs to OS along with an ACPI event/interrupt. Kernel is not supposed to touch 
> > the AER registers directly in this case, but act on the register values in the 
> > HEST log.
> > http://elixir.free-electrons.com/linux/latest/source/drivers/pci/pcie/aer/aerdrv_acpi.c#L94
> 
> That's not a problem IMHO, Xen could even mask the AER capability from
> the Dom0/guest completely if needed.
> 
> > If Firmware is using FFH, Xen will get a HEST log with AER registers, and must 
> > parse those registers instead of reading AER config space.
> 
> Xen will not get an event, it's going to be delivered to Dom0 because
> when using ACPI Dom0 is the OSPM (not Xen). I assume this event is
> going to be notified by triggering an interrupt from the ACPI SCI?

It is still possible to get the event in Xen, either by having Dom0 tell
Xen about it, or by moving ACPI SCI handling into Xen. If we move ACPI SCI
handling in Xen, we could still forward a virtual SCI interrupt to Dom0
in cases where Xen decides that Dom0 should be the one handling the
event. In other cases, where Xen knows how to handle the event, then
nothing would be sent to Dom0. Would that work?


> > After the AER registers have been parsed (either from HEST log or native Xen AER 
> > interrupt handler), at least for fatal errors, Xen needs to send notification to 
> > the DOM with the device passthrough so that it's driver(s) can be quiesced (via 
> > callbacks to dev->driver->err_handler->error_detected for linux) before hot 
> > reset/secondary bus reset.
> 
> I don't think this is relevant/true given the statement above (Dom0
> being OSPM and receiving the event).
> 
> > Whether FFH is in use or not, Xen has 2 choices in how to present the error to 
> > doms for quiescing before secondary bus reset:
> 
> How is this secondary bus reset performed?

It is based on writing to PCI config space registers
(drivers/pci/pci.c:pci_reset_secondary_bus). If Xen is in charge of
ECAM, it shouldn't be an issue for Xen to do it.


> Is it something specific to each bridge or it's a standard
> interface?
> 
> Can it be done directly by Dom0, or should it be done by Xen?
> 
> > a. Send a HEST log and ACPI interrupt/event to dom if it booted ACPI dom and 
> > linux dom calls aer_recover_queue from ACPI ghes path 
> > http://elixir.free-electrons.com/linux/latest/source/drivers/pci/pcie/aer/aerdrv_core.c#L592b. Present a Root port wired interrupt source in dom ACPI/DT, and inject that 
> > irq in the GIC LR registers. When dom kernel processes the interrupt and queries 
> 
> You lost me here, I have no knowledge of ARM, and I don't know what
> GIC LR is at all.

GIC LRs are registers specific to the ARM Generic Interrupt Controller
that allow an hypervisor to inject interrupts into a guest.  Vikram is
saying that the irq could be injected into the guest.


> > config space AER, Xen emulates the AER values it wants the dom to see (in FFH 
> > case based on register values in HEST), and if FFH was in use, not actually 
> > allow the dom to clear out the AER registers.
> > 
> > Option b is probably better/easier since it works for ACPI/DT dom.
> 
> So as I understand it, the flow is the following:
> 
> 1. Hardware generates an error.
> 2. This error triggers an interrupt that's delivered to Dom0 (either
>    using an ACPI SCI or a specific AER MSI vector)
> 3. *Someone* has to do a secondary bus reset.
> 
> My question would be, who (either Xen or Dom0) should perform the bus
> reset? (and why).

I am interested in Vikram's reply, he knows more than me about this.
However, my gut feeling is that it's best to do it in Xen because
otherwise Xen might end up having to wait for Dom0 for the completion of
the reset. The operation is not short and it includes a couple of
sleeps: each sleep is an opportunity to trap into Xen again and risk
descheduling the Dom0 vcpu.


> > In my view this is the basic AER error handling leaving the devices 
> > inaccessible.
> > To recover/resume the devices, the owning dom would need to signal Xen once all 
> > its driver(s) have quiesced, letting Xen know it is ok to do the secondary bus 
> > reset (for AER fatal errors). The best way to signal this would be to let the 
> > dom try to hit SBR in the Root port bridge control register in config space, and 
> > Xen traps that and actually does the BCR.SBR write.
> >
> > Since Xen controls the ECAM config space access in Julien's proposed design, I 
> > don't see any fundamental issues with the above flow fitting into the design.
> 
> I think it's very hard for me (or Julien) to know exactly how all the
> PCI capabilities behave and interact with other components (like
> ACPI).
> 
> You seem to have a good amount of knowledge about this stuff, would
> you mind writing your proposal as a diff to Julien's original
> proposal, so that it can be properly reviewed and merged into the
> design document?


* Re: [RFC] ARM PCI Passthrough design document
  2017-07-07 21:50           ` Stefano Stabellini
@ 2017-07-07 23:40             ` Vikram Sethi
  2017-07-08  7:34             ` Roger Pau Monné
  1 sibling, 0 replies; 35+ messages in thread
From: Vikram Sethi @ 2017-07-07 23:40 UTC (permalink / raw)
  To: 'Stefano Stabellini', roger.pau
  Cc: edgar.iglesias, punit.agrawal, 'Wei Chen',
	'Steve Capper', 'Andre Przywara',
	manish.jaggi, 'Julien Grall', 'Vikram Sethi',
	'Sinan Kaya', 'Sameer Goel', 'xen-devel',
	'Dave P Martin', 'Vijaya Kumar K'




> -----Original Message-----
> From: Xen-devel [mailto:xen-devel-bounces@lists.xen.org] On Behalf Of
> Stefano Stabellini
> Sent: Friday, July 7, 2017 4:50 PM
> To: Roger Pau Monné <roger.pau@citrix.com>
> Cc: edgar.iglesias@xilinx.com; 'Stefano Stabellini' <sstabellini@kernel.org>;
> Vikram Sethi <vikrams@codeaurora.org>; 'Wei Chen' <Wei.Chen@arm.com>;
> 'Steve Capper' <Steve.Capper@arm.com>; 'Andre Przywara'
> <andre.przywara@arm.com>; manish.jaggi@caviumnetworks.com; 'Julien
> Grall' <julien.grall@linaro.org>; 'Vikram Sethi' <vikrams@qti.qualcomm.com>;
> punit.agrawal@arm.com; 'Sameer Goel' <sgoel@qti.qualcomm.com>; 'xen-
> devel' <xen-devel@lists.xenproject.org>; 'Sinan Kaya'
> <okaya@qti.qualcomm.com>; 'Dave P Martin' <Dave.Martin@arm.com>;
> 'Vijaya Kumar K' <Vijaya.Kumar@caviumnetworks.com>
> Subject: Re: [Xen-devel] [RFC] ARM PCI Passthrough design document
>
> On Fri, 7 Jul 2017, Roger Pau Monné wrote:
> > On Thu, Jul 06, 2017 at 03:55:28PM -0500, Vikram Sethi wrote:
> > > > > > AER: Will PCIe non-fatal and fatal errors (secondary bus reset
> > > > > > for fatal) be
> > > > recoverable in Xen?
> > > > > > Will drivers in doms be notified about fatal errors so they
> > > > > > can be quiesced
> > > > before doing secondary bus reset in Xen?
> > > > > > Will Xen support Firmware First Error handling for AER? i.e
> > > > > > When platform does Firmware first error handling for AER
> > > > > > and/or filtering of AER,
> > > > sends associated ACPI HEST logs to Xen How will AER notification
> > > > and logs be propagated to the doms: injected ACPI HEST?
> > > >
> > > > Hm, I'm not sure I follow here, I don't see AER tied to ACPI. AER
> > > > is a PCIe capability, and according to the spec can be setup
> > > > completely independent to ACPI.
> > > >
> > > True, it can be independent if not using firmware first AER handling
> > > (FFH). But Firmware tells the OS whether firmware first is in use.
> > > If FFH is in use, the AER interrupt goes to firmware and then
> > > firmware processes
> >
> > I'm sorry, but how is the firmware supposed to know which interrupt is
> > AER using? That's AFAIK setup in the PCI AER capabilities, and depends
> > on whether the OS configures the device to use MSI or MSI-X.
> >
> > Is there some kind of side-band mechanism that delivers the AER
> > interrupt using a different method?
> >
The AER interrupt is not generated by the device that sends the "AER message" to 
root port, it is from the root port aka "event collector" itself. i.e the 
endpoint/adapter sends an AER message to root port and root port sends interrupt 
to CPU
Firmware should just KNOW what the IRQ number for the root port is for AER when 
it is doing firmware first error handling (assuming the Root port generated a 
wired interrupt for AER).

The other part to this is, how do Firmware and OS exchange what is the 
event/interrupt number when FW sends the AER HEST log to this OS. This comes 
from ACPI GHES.
See 
http://elixir.free-electrons.com/linux/latest/source/drivers/acpi/apei/ghes.c#L954
There can be many possibilities such as SCI, IRQ/GSIV, GPIO event etc

> > > the AER logs, filters errors, and sends a ACPI HEST log with the
> > > filtered AER regs to OS along with an ACPI event/interrupt. Kernel
> > > is not supposed to touch the AER registers directly in this case,
> > > but act on the register values in the HEST log.
> > > http://elixir.free-electrons.com/linux/latest/source/drivers/pci/pci
> > > e/aer/aerdrv_acpi.c#L94
> >
> > That's not a problem IMHO, Xen could even mask the AER capability from
> > the Dom0/guest completely if needed.
> >
> > > If Firmware is using FFH, Xen will get a HEST log with AER
> > > registers, and must parse those registers instead of reading AER config
> space.
> >
> > Xen will not get an event, it's going to be delivered to Dom0 because
> > when using ACPI Dom0 is the OSPM (not Xen). I assume this event is
> > going to be notified by triggering an interrupt from the ACPI SCI?
>

See above. It is obtained from GHES and can be SCI, GSIV, GPIO signal etc.

> It is still possible to get the event in Xen, either by having Dom0 tell Xen 
> about
> it, or my moving ACPI SCI handling in Xen. If we move ACPI SCI handling in 
> Xen,
> we could still forward a virtual SCI interrupt to Dom0 in cases where Xen
> decides that Dom0 should be the one handling the event. In other cases,
> where Xen knows how to handle the event, then nothing would be sent to
> Dom0. Would that work?
>

It could work for GSIV/irq or SCI. But one of the possibilities is a ACPI 6.1 
GED interrupt (GED= generic event device, yes there are way too many acronyms in 
ACPI :) ) and this requires ASL to be run, so would need dom0.
See https://patchwork.kernel.org/patch/8115901/
>
> > > After the AER registers have been parsed (either from HEST log or
> > > native Xen AER interrupt handler), at least for fatal errors, Xen
> > > needs to send notification to the DOM with the device passthrough so
> > > that it's driver(s) can be quiesced (via callbacks to
> > > dev->driver->err_handler->error_detected for linux) before hot
> reset/secondary bus reset.
> >
> > I don't think this is relevant/true given the statement above (Dom0
> > being OSPM and receiving the event).
> >

Sure, if dom0 gets the AER interrupt or ACPI "event" for FFH, then there is no 
need to forward anything.

> > > Whether FFH is in use or not, Xen has 2 choices in how to present
> > > the error to doms for quiescing before secondary bus reset:
> >
> > How is this secondary bus reset performed?
>
> It is based on writing to PCI config space registers
> (drivers/pci/pci.c:pci_reset_secondary_bus). If Xen is in charge of ECAM, it
> shouldn't be an issue for Xen to do it.
>
>
> > Is it something specific to each bridge or it's a standard interface?
> >
> > Can it be done directly by Dom0, or should it be done by Xen?
> >

Triggering the secondary bus reset is straightforward: it is a PCI-defined bit
(SBR) in the root port's Bridge Control register.
It could be done in either Xen or dom0, but it probably makes sense to do it where
the config cycles are being "controlled" and by whoever is doing the PCI probe.
I had misunderstood Julien's design to mean PCI probing was being done by Xen,
but on a second read he's saying the dom0/hw domain does the PCI probe and
notifies Xen of the config.

BTW this does raise the question of who reads the root port's Access Control
Services (ACS) config space capability to decide what the safest "unit" of
assignment to doms is: Xen or dom0?
Clearly Xen should be the one deciding whether the root port (and any switches,
if present) supports ACS upstream forwarding etc., and whether it is safe to
assign just a function/VF or whether the entire PCI tree under the root port is
the "minimum" assignable entity.
For background see
http://vfio.blogspot.com/2014/08/iommu-groups-inside-and-out.html
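As an illustration of the kind of check involved (not from the design document;
the helper name is made up), the isolation-relevant ACS bits can be probed much
like Linux/VFIO does, using the standard register and flag names from
<uapi/linux/pci_regs.h>:

    #include <linux/pci.h>

    /* Does this root/downstream port isolate its subtree well enough for
     * per-function assignment? If not, the whole subtree below it is the
     * minimum assignable unit. */
    static bool port_isolates_downstream(struct pci_dev *port)
    {
        int pos = pci_find_ext_capability(port, PCI_EXT_CAP_ID_ACS);
        u16 ctrl;
        u16 want = PCI_ACS_SV | PCI_ACS_RR | PCI_ACS_CR | PCI_ACS_UF;

        if (!pos)
            return false;    /* no ACS capability at all */

        pci_read_config_word(port, pos + PCI_ACS_CTRL, &ctrl);
        return (ctrl & want) == want;
    }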
So is Xen issuing ECAM-based config cycles to the root port config space without
serialization with dom0, which can also issue root port and downstream config
accesses? I'm not sure whether this can be an issue or not.
The other thing that I haven't fully processed yet is: when dom0 sends
information piecemeal to Xen with "here's a root port with SBDF1, here's some
device with SBDF2", can Xen accurately reconstruct the entire PCI tree?
This is important because there could be PCIe switches under the root port which
may or may not support ACS, so Xen has to know where in the tree the "minimum
safe assignment" unit is.

> > > a. Send a HEST log and ACPI interrupt/event to dom if it booted ACPI
> > > dom and linux dom calls aer_recover_queue from ACPI ghes path
> > > http://elixir.free-electrons.com/linux/latest/source/drivers/pci/pcie/aer/aerdrv_core.c#L592
> > > b. Present a Root port wired interrupt source in dom ACPI/DT, and
> > > inject that irq in the GIC LR registers. When dom kernel processes
> > > the interrupt and queries
> >
> > You lost me here, I have no knowledge of ARM, and I don't know what
> > GIC LR is at all.
>
> GIC LRs are registers specific to the ARM Generic Interrupt Controller that 
> allow
> an hypervisor to inject interrupts into a guest.  Vikram is saying that the 
> irq
> could be injected into the guest.
>
>
> > > config space AER, Xen emulates the AER values it wants the dom to
> > > see (in FFH case based on register values in HEST), and if FFH was
> > > in use, not actually allow the dom to clear out the AER registers.
> > >
> > > Option b is probably better/easier since it works for ACPI/DT dom.
> >
> > So as I understand it, the flow is the following:
> >
> > 1. Hardware generates an error.
> > 2. This error triggers an interrupt that's delivered to Dom0 (either
> >    using an ACPI SCI or a specific AER MSI vector)
> > 3. *Someone* has to do a secondary bus reset.
> >
> > My question would be, who (either Xen or Dom0) should perform the bus
> > reset? (and why).
>
> I am interested in Vikram's reply, he knows more than me about this.
> However, my gut feeling is that it's best to do it in Xen because otherwise
> Xen might end up having to wait for Dom0 for the completion of the reset.
> The operation is now short and it includes a couple of sleeps: each sleep
> is an opportunity to trap into Xen again and risk descheduling the Dom0 vcpu.
>

Linux dom0 will attempt a secondary bus reset config access to the root port
anyway for fatal errors. I had earlier misunderstood that Xen would be trapping
and issuing all config cycles, but since dom0 is controlling the config space
access, we might as well let the dom0-issued SBR go through.
Yes, there is a write to assert the SBR bit in Bridge Control, a wait of a
millisecond or so, and another write to deassert it.
Is your concern that if dom0 gets descheduled between the assert and the deassert
the recovery is delayed? Yes, that is true, but it should be tolerable I think.
Since devices are getting reset and drivers reinitialized, there will always be
a hiccup/gap/some temporary disruption.
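For reference, a minimal sketch of that assert/wait/deassert sequence,
mirroring the pci_reset_secondary_bus() flow referenced earlier in the thread
(the delays are illustrative, not normative):

    #include <linux/pci.h>
    #include <linux/delay.h>

    /* Secondary bus reset via the SBR bit in the bridge's Bridge Control
     * register, as discussed above. */
    static void do_secondary_bus_reset(struct pci_dev *bridge)
    {
        u16 ctrl;

        pci_read_config_word(bridge, PCI_BRIDGE_CONTROL, &ctrl);

        ctrl |= PCI_BRIDGE_CTL_BUS_RESET;              /* assert SBR */
        pci_write_config_word(bridge, PCI_BRIDGE_CONTROL, ctrl);
        msleep(2);                                     /* keep reset asserted >= 1ms */

        ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;             /* deassert SBR */
        pci_write_config_word(bridge, PCI_BRIDGE_CONTROL, ctrl);
        msleep(1000);                                  /* give devices time to recover */
    }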

>
> > > In my view this is the basic AER error handling leaving the devices
> > > inaccessible.
> > > To recover/resume the devices, the owning dom would need to signal
> > > Xen once all its driver(s) have quiesced, letting Xen know it is ok
> > > to do the secondary bus reset (for AER fatal errors). The best way
> > > to signal this would be to let the dom try to hit SBR in the Root
> > > port bridge control register in config space, and Xen traps that and 
> > > actually
> > > does the BCR.SBR write.
> > >
> > > Since Xen controls the ECAM config space access in Julien's proposed
> > > design, I don't see any fundamental issues with the above flow fitting 
> > > into
> > > the design.
> >
> > I think it's very hard for me (or Julien) to know exactly how all the
> > PCI capabilities behave and interact with other components (like
> > ACPI).
> >
> > You seem to have a good amount of knowledge about this stuff, would
> > you mind writing your proposal as a diff to Julien's original
> > proposal, so that it can be properly reviewed and merged into the
> > design document?

Thanks,
Vikram
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, 
Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.





_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC] ARM PCI Passthrough design document
  2017-07-07 21:50           ` Stefano Stabellini
  2017-07-07 23:40             ` Vikram Sethi
@ 2017-07-08  7:34             ` Roger Pau Monné
  2018-01-19 10:34               ` Manish Jaggi
  1 sibling, 1 reply; 35+ messages in thread
From: Roger Pau Monné @ 2017-07-08  7:34 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: edgar.iglesias, punit.agrawal, Vikram Sethi, 'Wei Chen',
	'Steve Capper', 'Andre Przywara',
	manish.jaggi, 'Julien Grall', 'Vikram Sethi',
	'Sinan Kaya', 'Sameer Goel', 'xen-devel',
	'Dave P Martin', 'Vijaya Kumar K'

On Fri, Jul 07, 2017 at 02:50:01PM -0700, Stefano Stabellini wrote:
> On Fri, 7 Jul 2017, Roger Pau Monné wrote:
> > On Thu, Jul 06, 2017 at 03:55:28PM -0500, Vikram Sethi wrote:
> > > > > > AER: Will PCIe non-fatal and fatal errors (secondary bus reset for fatal) 
> > > > > > be
> > > > recoverable in Xen?
> > > > > > Will drivers in doms be notified about fatal errors so they can be 
> > > > > > quiesced
> > > > before doing secondary bus reset in Xen?
> > > > > > Will Xen support Firmware First Error handling for AER? i.e When
> > > > > > platform does Firmware first error handling for AER and/or filtering of 
> > > > > > AER,
> > > > sends associated ACPI HEST logs to Xen How will AER notification and logs be
> > > > propagated to the doms: injected ACPI HEST?
> > > >
> > > > Hm, I'm not sure I follow here, I don't see AER tied to ACPI. AER is a PCIe
> > > > capability, and according to the spec can be setup completely independent to
> > > > ACPI.
> > > >
> > > True, it can be independent if not using firmware first AER handling (FFH). But 
> > > Firmware tells the OS whether firmware first is in use.
> > > If FFH is in use, the AER interrupt goes to firmware and then firmware processes 
> > 
> > I'm sorry, but how is the firmware supposed to know which interrupt is
> > AER using? That's AFAIK setup in the PCI AER capabilities, and
> > depends on whether the OS configures the device to use MSI or MSI-X.
> > 
> > Is there some kind of side-band mechanism that delivers the AER
> > interrupt using a different method?
> > 
> > > the AER logs, filters errors, and sends a ACPI HEST log with the filtered AER 
> > > regs to OS along with an ACPI event/interrupt. Kernel is not supposed to touch 
> > > the AER registers directly in this case, but act on the register values in the 
> > > HEST log.
> > > http://elixir.free-electrons.com/linux/latest/source/drivers/pci/pcie/aer/aerdrv_acpi.c#L94
> > 
> > That's not a problem IMHO, Xen could even mask the AER capability from
> > the Dom0/guest completely if needed.
> > 
> > > If Firmware is using FFH, Xen will get a HEST log with AER registers, and must 
> > > parse those registers instead of reading AER config space.
> > 
> > Xen will not get an event, it's going to be delivered to Dom0 because
> > when using ACPI Dom0 is the OSPM (not Xen). I assume this event is
> > going to be notified by triggering an interrupt from the ACPI SCI?
> 
> It is still possible to get the event in Xen, either by having Dom0 tell
> Xen about it, or my moving ACPI SCI handling in Xen. If we move ACPI SCI
> handling in Xen, we could still forward a virtual SCI interrupt to Dom0
> in cases where Xen decides that Dom0 should be the one handling the
> event. In other cases, where Xen knows how to handle the event, then
> nothing would be sent to Dom0. Would that work?

Maybe that's different on ARM vs x86, but when receiving the SCI
interrupt the OSPM has to execute some AML in order to figure out
which event has triggered. Even if Xen can trap the SCI, it has no way
to execute AML, and that in any case can only be done by one entity,
the OSPM.

IMHO, for this to be viable Dom0 should notify the event to Xen.

Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Notes from PCI Passthrough design discussion at Xen Summit
  2017-05-26 17:14 [RFC] ARM PCI Passthrough design document Julien Grall
                   ` (3 preceding siblings ...)
  2017-06-20  0:19 ` Vikram Sethi
@ 2017-07-19 14:41 ` Punit Agrawal
  2017-07-20  3:54   ` Manish Jaggi
  2018-01-22 11:10 ` [RFC] ARM PCI Passthrough design document Manish Jaggi
  5 siblings, 1 reply; 35+ messages in thread
From: Punit Agrawal @ 2017-07-19 14:41 UTC (permalink / raw)
  To: Julien Grall
  Cc: edgar.iglesias, Stefano Stabellini, Jan Beulich, Wei Chen,
	Steve Capper, Andre Przywara, manish.jaggi, okaya, vikrams, Goel,
	Sameer, xen-devel, Dave P Martin, Vijaya Kumar K, roger.pau


I took some notes for the PCI Passthrough design discussion at Xen
Summit. Due to the wide range of topics covered, the notes got sparser
towards the end of the session. I've tried to attribute names against
comments but have very likely got things mixed up. Apologies in advance.

Although the session was well attended, some of the more active
discussions involved - Julien Grall, Stefano Stabillini, Roger Pau
Monné, Jan Beulich, Vikram Sethi. I'm sure I am missing some folks here.

Please do point out any mistakes I've made for the audience's benefit.

* Discovery of PCI hostbridges
  - Dom0 will be responsible for scanning the ECAM for devices and
    register them with Xen. This approach is chosen due to variety of
    non-standard PCI controllers on ARM platforms and the desire to
    not duplicate driver code between Linux and Xen.
  - Jan, Roger: Bus scan needs to happen before device discovery,
    otherwise there is a small window where Xen doesn't know which host
    bridge the device is registered on (as it'll likely only refer to the
    segment number).
  - Roger: Registering config space with Xen before device discovery
    will allow the hypervisor to set access traps for certain
    functionality as appropriate.
  - Jan: Xen and Dom0 have to agree on the PCI segment number mapping
    to host bridges. This is so that for future calls, Dom0 and
    hypervisor can communicate using sBDF without ambiguity. 
  - Julien: Dom0 will register the config space address and segment
    number. mcfg_add will be used to pass the segment to Xen (see the
    sketch after this list).
  - PCI segment - it's purely a software construct to identify
    different host bridges.
  - Some discussion on whether boot devices need to be on
    Segment 0. Technically, MCFG is only required to describe Segment
    0 - other host bridges can be described in AML.
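  For illustration, this is roughly how an x86 Linux Dom0 reports an
  ECAM/MCFG region to Xen today via the existing PHYSDEVOP_pci_mmcfg_reserved
  hypercall; the ARM flow discussed above would pass similar information
  (base address, segment, bus range), though the exact hypercall is still
  under discussion. The values below are hypothetical.

      #include <xen/interface/physdev.h>   /* struct physdev_pci_mmcfg_reserved */
      #include <asm/xen/hypercall.h>

      static int report_ecam_to_xen(void)
      {
          struct physdev_pci_mmcfg_reserved r = {
              .address   = 0x40000000,             /* ECAM base address (example) */
              .segment   = 0,                      /* PCI segment number */
              .start_bus = 0,
              .end_bus   = 255,
              .flags     = XEN_PCI_MMCFG_RESERVED, /* region reserved by firmware */
          };

          return HYPERVISOR_physdev_op(PHYSDEVOP_pci_mmcfg_reserved, &r);
      }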

* Configuration accesses for non-ECAM-compliant host bridges
  - Julien proposed these to be forwarded to Dom0 for handling.
  - Audience: What kind of non-compliance are we talking about? If
    they are simple, can they be implemented in Xen in a few lines of
    code?
  - A few different types
    - restrictions on access size, e.g., only certain sizes supported 
    - register multiplexing via a window; similar to the legacy x86 PCI
      access mechanism (contrast with the plain ECAM layout sketched
      after this list)
    - ECAM compliant but with special casing for different devices
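  A minimal sketch for contrast (standard PCIe ECAM layout, not from the
  notes themselves): with ECAM each function gets its own 4KiB config
  window, so a config access is a plain MMIO access at a computed offset,
  whereas the indirect schemes above multiplex accesses through an
  address/data window.

      /* ECAM offset: bus[27:20] | device[19:15] | function[14:12] | reg[11:0] */
      static inline void *ecam_cfg_addr(void *ecam_base, unsigned int bus,
                                        unsigned int dev, unsigned int fn,
                                        unsigned int reg)
      {
          return (char *)ecam_base +
                 ((bus << 20) | (dev << 15) | (fn << 12) | (reg & 0xfff));
      }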

* Support on 32bit platforms
  - Is there enough address space to map ECAM into Dom0? The maximum ECAM
    size is 256MB.

* PCI ACS support
  - Vikram: Xen needs to be aware of the PCI device topology to
    correctly setup device groups for passthrough
  - Jan, Roger: IIRC, Xen is already aware of the device topology,
    though it doesn't use ACS to work out which devices need to be
    passed to a guest as a group.
  - Stefano: There was support in xend (previous Xen toolstack) but the
    functionality has not yet been ported to libxl.

* Implementation milestones
  - Julien provided a summary of breakdown
    - M0 - design document, currently under discussion on xen-devel
    - M1 - PCI support in Xen
      - Xen aware of PCI devices (via Dom0 registration)
    - M2 - Guest PCIe passthrough
      - Julien: Some complexity in dealing with Legacy interrupts as they can be shared.
      - Roger: MSIs mandatory for PCIe. So legacy interrupts can be
        tackled at a later stage.
    - M3 - testing
      - Fuzzing. Jan: If implemented, it'll be better than what x86
        currently has.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Notes from PCI Passthrough design discussion at Xen Summit
  2017-07-19 14:41 ` Notes from PCI Passthrough design discussion at Xen Summit Punit Agrawal
@ 2017-07-20  3:54   ` Manish Jaggi
  2017-07-20  8:24     ` Roger Pau Monné
  0 siblings, 1 reply; 35+ messages in thread
From: Manish Jaggi @ 2017-07-20  3:54 UTC (permalink / raw)
  To: Punit Agrawal, Julien Grall
  Cc: edgar.iglesias, Stefano Stabellini, Jan Beulich, Wei Chen,
	Steve Capper, Andre Przywara, manish.jaggi, okaya, vikrams, Goel,
	Sameer, xen-devel, Dave P Martin, Vijaya Kumar K, roger.pau

Hi Punit,

On 7/19/2017 8:11 PM, Punit Agrawal wrote:
> I took some notes for the PCI Passthrough design discussion at Xen
> Summit. Due to the wide range of topics covered, the notes got sparser
> towards the end of the session. I've tried to attribute names against
> comments but have very likely got things mixed up. Apologies in advance.
I was curious whether any discussions happened on the RC Emu (config space
emulation) as per slide 18:
https://schd.ws/hosted_files/xendeveloperanddesignsummit2017/76/slides.pdf
> Although the session was well attended, some of the more active
> discussions involved - Julien Grall, Stefano Stabillini, Roger Pau
> Monné, Jan Beulich, Vikram Sethi. I'm sure I am missing some folks here.
>
> Please do point out any mistakes I've made for the audience's benefit.
>
> * Discovery of PCI hostbridges
>    - Dom0 will be responsible for scanning the ECAM for devices and
>      register them with Xen. This approach is chosen due to variety of
>      non-standard PCI controllers on ARM platforms and the desire to
>      not duplicate driver code between Linux and Xen.
>    - Jan, Roger: Bus scan needs to happer before device discovery
>      otherwise a small window where Xen doesn't know which host bridge
>      the device is registered on (as it'll likely only refer to the
>      segment number).
>    - Roger: Registering config space with Xen before device discovery
>      will allow the hypervisor to set access traps for certain
>      functionality as appropriate.
>    - Jan: Xen and Dom0 have to agree on the PCI segment number mapping
>      to host bridges. This is so that for future calls, Dom0 and
>      hypervisor can communicate using sBDF without ambiguity.
>    - Julien: Dom0 will register config space address and segment
>      number. mcfg_add will be used to pass the segment to Xen.
>    - PCI segment - it's purely a software construct so identify
>      different host bridges.
>    - Some discussion on whether boot devices need to be on
>      Segment 0. Technically, MCFG is only required to describe Segment
>      0 - other host bridges can be described in AML.
>
> * Configuration accesses for non-ecam compliant host bridge
>    - Julien proposed these to be forwarded to Dom0 for handling.
>    - Audience: What kind of non-compliance are we talking about? If
>      they are simple, can they be implemented in Xen in a few lines of
>      code?
>    - A few different types
>      - restrictions on access size, e.g., only certain sizes supported
>      - register multiplexing via a window; similar to legacy x86 PCI
>        access mechanism
>      - ECAM compliant but with special casing for different devices
>
> * Support on 32bit platforms
>    - Is there enough address space to map ECAM into Dom0. Maximum ECAM
>      size is 256MB.
>
> * PCI ACS support
>    - Vikram: Xen needs to be aware of the PCI device topology to
>      correctly setup device groups for passthrough
>    - Jan: Roger: IIRC, Xen is already aware of the device topology
>      thought it doesn't use ACS to work out which devices need to be
>      passed to guest as a group.
>    - Stefano: There was support in xend (previous Xen toolstack) but the
>      functionality has not yet been ported to libxl.
>
> * Implementation milestones
>    - Julien provided a summary of breakdown
>      - M0 - design document, currently under discussion on xen-devel
>      - M1 - PCI support in Xen
>        - Xen aware of PCI devices (via Dom0 registration)
>      - M2 - Guest PCIe passthrough
>        - Julien: Some complexity in dealing with Legacy interrupts as they can be shared.
>        - Roger: MSIs mandatory for PCIe. So legacy interrupts can be
>          tackled at a later stage.
>      - M3 - testing
>        - fuzzing. Jan: If implemented it'll be better than what x86
>          currently have.


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Notes from PCI Passthrough design discussion at Xen Summit
  2017-07-20  3:54   ` Manish Jaggi
@ 2017-07-20  8:24     ` Roger Pau Monné
  2017-07-20  9:32       ` Manish Jaggi
  0 siblings, 1 reply; 35+ messages in thread
From: Roger Pau Monné @ 2017-07-20  8:24 UTC (permalink / raw)
  To: Manish Jaggi
  Cc: edgar.iglesias, Stefano Stabellini, Jan Beulich, Wei Chen,
	Steve Capper, Andre Przywara, manish.jaggi, Punit Agrawal,
	Julien Grall, vikrams, okaya, Goel, Sameer, xen-devel,
	Dave P Martin, Vijaya Kumar K

On Thu, Jul 20, 2017 at 09:24:36AM +0530, Manish Jaggi wrote:
> Hi Punit,
> 
> On 7/19/2017 8:11 PM, Punit Agrawal wrote:
> > I took some notes for the PCI Passthrough design discussion at Xen
> > Summit. Due to the wide range of topics covered, the notes got sparser
> > towards the end of the session. I've tried to attribute names against
> > comments but have very likely got things mixed up. Apologies in advance.
> Was curious if any discussions happened on the RC Emu (config space
> emulation) as per slide 18
> https://schd.ws/hosted_files/xendeveloperanddesignsummit2017/76/slides.pdf

Part of this is already posted on the list (ATM for x86 only) but the
PCI specification (and therefore the config space emulation) is not
tied to any arch:

https://lists.xenproject.org/archives/html/xen-devel/2017-06/msg03698.html

Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Notes from PCI Passthrough design discussion at Xen Summit
  2017-07-20  8:24     ` Roger Pau Monné
@ 2017-07-20  9:32       ` Manish Jaggi
  2017-07-20 10:29         ` Roger Pau Monné
  2017-07-20 10:41         ` Julien Grall
  0 siblings, 2 replies; 35+ messages in thread
From: Manish Jaggi @ 2017-07-20  9:32 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: edgar.iglesias, Stefano Stabellini, Jan Beulich, Wei Chen,
	Steve Capper, Andre Przywara, manish.jaggi, Punit Agrawal,
	Julien Grall, vikrams, okaya, Goel, Sameer, xen-devel,
	Dave P Martin, Vijaya Kumar K

Hi Roger,

On 7/20/2017 1:54 PM, Roger Pau Monné wrote:
> On Thu, Jul 20, 2017 at 09:24:36AM +0530, Manish Jaggi wrote:
>> Hi Punit,
>>
>> On 7/19/2017 8:11 PM, Punit Agrawal wrote:
>>> I took some notes for the PCI Passthrough design discussion at Xen
>>> Summit. Due to the wide range of topics covered, the notes got sparser
>>> towards the end of the session. I've tried to attribute names against
>>> comments but have very likely got things mixed up. Apologies in advance.
>> Was curious if any discussions happened on the RC Emu (config space
>> emulation) as per slide 18
>> https://schd.ws/hosted_files/xendeveloperanddesignsummit2017/76/slides.pdf
> Part of this is already posted on the list (ATM for x86 only) but the
> PCI specification (and therefore the config space emulation) is not
> tied to any arch:
>
> https://lists.xenproject.org/archives/html/xen-devel/2017-06/msg03698.html
From the summary, I have a question on:
"
  - Roger: Registering config space with Xen before device discovery
   will allow the hypervisor to set access traps for certain
  functionality as appropriate"

Will the traps do emulation or something else?
Is the config space emulation only for DomU, or is it for Dom0 as well?
Slide 18 shows it only for DomU?

-manish

> Roger.


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Notes from PCI Passthrough design discussion at Xen Summit
  2017-07-20  9:32       ` Manish Jaggi
@ 2017-07-20 10:29         ` Roger Pau Monné
  2017-07-20 10:47           ` Julien Grall
  2017-07-20 11:02           ` Manish Jaggi
  2017-07-20 10:41         ` Julien Grall
  1 sibling, 2 replies; 35+ messages in thread
From: Roger Pau Monné @ 2017-07-20 10:29 UTC (permalink / raw)
  To: Manish Jaggi
  Cc: edgar.iglesias, Stefano Stabellini, Jan Beulich, Wei Chen,
	Steve Capper, Andre Przywara, manish.jaggi, Punit Agrawal,
	Julien Grall, vikrams, okaya, Goel, Sameer, xen-devel,
	Dave P Martin, Vijaya Kumar K

On Thu, Jul 20, 2017 at 03:02:19PM +0530, Manish Jaggi wrote:
> Hi Roger,
> 
> On 7/20/2017 1:54 PM, Roger Pau Monné wrote:
> > On Thu, Jul 20, 2017 at 09:24:36AM +0530, Manish Jaggi wrote:
> > > Hi Punit,
> > > 
> > > On 7/19/2017 8:11 PM, Punit Agrawal wrote:
> > > > I took some notes for the PCI Passthrough design discussion at Xen
> > > > Summit. Due to the wide range of topics covered, the notes got sparser
> > > > towards the end of the session. I've tried to attribute names against
> > > > comments but have very likely got things mixed up. Apologies in advance.
> > > Was curious if any discussions happened on the RC Emu (config space
> > > emulation) as per slide 18
> > > https://schd.ws/hosted_files/xendeveloperanddesignsummit2017/76/slides.pdf
> > Part of this is already posted on the list (ATM for x86 only) but the
> > PCI specification (and therefore the config space emulation) is not
> > tied to any arch:
> > 
> > https://lists.xenproject.org/archives/html/xen-devel/2017-06/msg03698.html
> From the summary, I have a  questions on
> "
>  - Roger: Registering config space with Xen before device discovery
>   will allow the hypervisor to set access traps for certain
>  functionality as appropriate"
> 
> Traps will do emulation or something else ?

Have you read the series?

What else could the traps do? I'm not sure I understand the question.

>  Is the config space emulation only for DomU or it for Dom0 as well ?

Again, have you read the series? This is explained in the cover letter
(0/9).

On x86 this is initially for Dom0 only, DomU will continue to use QEMU
until the implementation inside the hypervisor (vPCI) is complete
enough to handle DomU securely.

> Slide 18 shows only for DomU ?

ARM folks believe this is not needed for Dom0 in the ARM case. I don't
have an opinion; I know it's certainly mandatory for x86 PVH Dom0.

Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Notes from PCI Passthrough design discussion at Xen Summit
  2017-07-20  9:32       ` Manish Jaggi
  2017-07-20 10:29         ` Roger Pau Monné
@ 2017-07-20 10:41         ` Julien Grall
  2017-07-20 11:00           ` Manish Jaggi
  1 sibling, 1 reply; 35+ messages in thread
From: Julien Grall @ 2017-07-20 10:41 UTC (permalink / raw)
  To: Manish Jaggi, Roger Pau Monné
  Cc: edgar.iglesias, Stefano Stabellini, Jan Beulich, Wei Chen,
	Steve Capper, Andre Przywara, manish.jaggi, Punit Agrawal,
	vikrams, okaya, Goel, Sameer, xen-devel, Dave P Martin,
	Vijaya Kumar K



On 20/07/17 10:32, Manish Jaggi wrote:
> Hi Roger,
>
> On 7/20/2017 1:54 PM, Roger Pau Monné wrote:
>> On Thu, Jul 20, 2017 at 09:24:36AM +0530, Manish Jaggi wrote:
>>> Hi Punit,
>>>
>>> On 7/19/2017 8:11 PM, Punit Agrawal wrote:
>>>> I took some notes for the PCI Passthrough design discussion at Xen
>>>> Summit. Due to the wide range of topics covered, the notes got sparser
>>>> towards the end of the session. I've tried to attribute names against
>>>> comments but have very likely got things mixed up. Apologies in
>>>> advance.
>>> Was curious if any discussions happened on the RC Emu (config space
>>> emulation) as per slide 18
>>> https://schd.ws/hosted_files/xendeveloperanddesignsummit2017/76/slides.pdf
>>>
>> Part of this is already posted on the list (ATM for x86 only) but the
>> PCI specification (and therefore the config space emulation) is not
>> tied to any arch:
>>
>> https://lists.xenproject.org/archives/html/xen-devel/2017-06/msg03698.html
>>
> From the summary, I have a  questions on
> "
>  - Roger: Registering config space with Xen before device discovery
>   will allow the hypervisor to set access traps for certain
>  functionality as appropriate"
>
> Traps will do emulation or something else ?
>  Is the config space emulation only for DomU or it for Dom0 as well ?
> Slide 18 shows only for DomU ?

My slides are not meant to be read without the talk. In this particular 
case, this is only explaining how passthrough will work for DomU.

Roger's series is at the moment focusing on emulating a fully
ECAM-compliant hostbridge for the hardware domain. This is because Xen
and the hardware domain should not access the configuration space at the
same time. We may also perform some tasks (e.g. MSI mapping, memory
mapping) or sanitizing when the configuration space is updated by the
hardware domain.

Cheers,

-- 
Julien Grall

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Notes from PCI Passthrough design discussion at Xen Summit
  2017-07-20 10:29         ` Roger Pau Monné
@ 2017-07-20 10:47           ` Julien Grall
  2017-07-20 11:06             ` Roger Pau Monné
  2017-07-20 11:02           ` Manish Jaggi
  1 sibling, 1 reply; 35+ messages in thread
From: Julien Grall @ 2017-07-20 10:47 UTC (permalink / raw)
  To: Roger Pau Monné, Manish Jaggi
  Cc: edgar.iglesias, Stefano Stabellini, Jan Beulich, Wei Chen,
	Steve Capper, Andre Przywara, manish.jaggi, Punit Agrawal,
	vikrams, okaya, Goel, Sameer, xen-devel, Dave P Martin,
	Vijaya Kumar K



On 20/07/17 11:29, Roger Pau Monné wrote:
> On Thu, Jul 20, 2017 at 03:02:19PM +0530, Manish Jaggi wrote:
>> Hi Roger,
>>
>> On 7/20/2017 1:54 PM, Roger Pau Monné wrote:
>>> On Thu, Jul 20, 2017 at 09:24:36AM +0530, Manish Jaggi wrote:
>>>> Hi Punit,
>>>>
>>>> On 7/19/2017 8:11 PM, Punit Agrawal wrote:
>>>>> I took some notes for the PCI Passthrough design discussion at Xen
>>>>> Summit. Due to the wide range of topics covered, the notes got sparser
>>>>> towards the end of the session. I've tried to attribute names against
>>>>> comments but have very likely got things mixed up. Apologies in advance.
>>>> Was curious if any discussions happened on the RC Emu (config space
>>>> emulation) as per slide 18
>>>> https://schd.ws/hosted_files/xendeveloperanddesignsummit2017/76/slides.pdf
>>> Part of this is already posted on the list (ATM for x86 only) but the
>>> PCI specification (and therefore the config space emulation) is not
>>> tied to any arch:
>>>
>>> https://lists.xenproject.org/archives/html/xen-devel/2017-06/msg03698.html
>> From the summary, I have a  questions on
>> "
>>  - Roger: Registering config space with Xen before device discovery
>>   will allow the hypervisor to set access traps for certain
>>  functionality as appropriate"
>>
>> Traps will do emulation or something else ?
>
> Have you read the series?
>
> What else could the traps do? I'm not sure I understand the question.
>
>>  Is the config space emulation only for DomU or it for Dom0 as well ?
>
> Again, have you read the series? This is explained in the cover letter
> (0/9).
>
> On x86 this is initially for Dom0 only, DomU will continue to use QEMU
> until the implementation inside the hypervisor (vPCI) is complete
> enough to handle DomU securely.
>
>> Slide 18 shows only for DomU ?
>
> ARM folks believe this is not needed for Dom0 in the ARM case, I don't
> have an opinion, I know it's certainly mandatory for x86 PVH Dom0.

That was 8 months ago; you managed to convince me we should also trap
for DOM0 the last time we met at the Haymakers :).

Cheers,

-- 
Julien Grall

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Notes from PCI Passthrough design discussion at Xen Summit
  2017-07-20 10:41         ` Julien Grall
@ 2017-07-20 11:00           ` Manish Jaggi
  2017-07-20 12:24             ` Julien Grall
  0 siblings, 1 reply; 35+ messages in thread
From: Manish Jaggi @ 2017-07-20 11:00 UTC (permalink / raw)
  To: Julien Grall, Roger Pau Monné
  Cc: edgar.iglesias, Stefano Stabellini, Jan Beulich, Wei Chen,
	Steve Capper, Andre Przywara, manish.jaggi, Punit Agrawal,
	vikrams, okaya, Goel, Sameer, xen-devel, Dave P Martin,
	Vijaya Kumar K

HI Julien,

On 7/20/2017 4:11 PM, Julien Grall wrote:
>
>
> On 20/07/17 10:32, Manish Jaggi wrote:
>> Hi Roger,
>>
>> On 7/20/2017 1:54 PM, Roger Pau Monné wrote:
>>> On Thu, Jul 20, 2017 at 09:24:36AM +0530, Manish Jaggi wrote:
>>>> Hi Punit,
>>>>
>>>> On 7/19/2017 8:11 PM, Punit Agrawal wrote:
>>>>> I took some notes for the PCI Passthrough design discussion at Xen
>>>>> Summit. Due to the wide range of topics covered, the notes got 
>>>>> sparser
>>>>> towards the end of the session. I've tried to attribute names against
>>>>> comments but have very likely got things mixed up. Apologies in
>>>>> advance.
>>>> Was curious if any discussions happened on the RC Emu (config space
>>>> emulation) as per slide 18
>>>> https://schd.ws/hosted_files/xendeveloperanddesignsummit2017/76/slides.pdf 
>>>>
>>>>
>>> Part of this is already posted on the list (ATM for x86 only) but the
>>> PCI specification (and therefore the config space emulation) is not
>>> tied to any arch:
>>>
>>> https://lists.xenproject.org/archives/html/xen-devel/2017-06/msg03698.html 
>>>
>>>
>> From the summary, I have a  questions on
>> "
>>  - Roger: Registering config space with Xen before device discovery
>>   will allow the hypervisor to set access traps for certain
>>  functionality as appropriate"
>>
>> Traps will do emulation or something else ?
>>  Is the config space emulation only for DomU or it for Dom0 as well ?
>> Slide 18 shows only for DomU ?
>
> My slides are not meant to be read without the talk. In this 
> particular case, this is only explaining how passthrough will work for 
> DomU.
>
Thanks for the clarification.
Ah ok, the single slide created confusion; it would have been nice to add
one more describing Dom0 config access. I will wait for the video to be
posted.
> Roger series is at the moment focusing on emulating a fully ECAM 
> compliant hostbridge for the hardware domain. This is because Xen and 
> the hardware domain should not access the configuration space at the 
> same time. 
Yes, as discussed for this topic on the list a few weeks back.
> We may also perform some tasks (i.e MSI mapping, memory mapping) or 
> sanitizing when the configuration space is updated by the hardware 
> domain.
>
> Cheers,
>


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Notes from PCI Passthrough design discussion at Xen Summit
  2017-07-20 10:29         ` Roger Pau Monné
  2017-07-20 10:47           ` Julien Grall
@ 2017-07-20 11:02           ` Manish Jaggi
  1 sibling, 0 replies; 35+ messages in thread
From: Manish Jaggi @ 2017-07-20 11:02 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: edgar.iglesias, Stefano Stabellini, Jan Beulich, Wei Chen,
	Steve Capper, Andre Przywara, manish.jaggi, Punit Agrawal,
	Julien Grall, vikrams, okaya, Goel, Sameer, xen-devel,
	Dave P Martin, Vijaya Kumar K

Hi Roger,

On 7/20/2017 3:59 PM, Roger Pau Monné wrote:
> On Thu, Jul 20, 2017 at 03:02:19PM +0530, Manish Jaggi wrote:
>> Hi Roger,
>>
>> On 7/20/2017 1:54 PM, Roger Pau Monné wrote:
>>> On Thu, Jul 20, 2017 at 09:24:36AM +0530, Manish Jaggi wrote:
>>>> Hi Punit,
>>>>
>>>> On 7/19/2017 8:11 PM, Punit Agrawal wrote:
>>>>> I took some notes for the PCI Passthrough design discussion at Xen
>>>>> Summit. Due to the wide range of topics covered, the notes got sparser
>>>>> towards the end of the session. I've tried to attribute names against
>>>>> comments but have very likely got things mixed up. Apologies in advance.
>>>> Was curious if any discussions happened on the RC Emu (config space
>>>> emulation) as per slide 18
>>>> https://schd.ws/hosted_files/xendeveloperanddesignsummit2017/76/slides.pdf
>>> Part of this is already posted on the list (ATM for x86 only) but the
>>> PCI specification (and therefore the config space emulation) is not
>>> tied to any arch:
>>>
>>> https://lists.xenproject.org/archives/html/xen-devel/2017-06/msg03698.html
>>  From the summary, I have a  questions on
>> "
>>   - Roger: Registering config space with Xen before device discovery
>>    will allow the hypervisor to set access traps for certain
>>   functionality as appropriate"
>>
>> Traps will do emulation or something else ?
> Have you read the series?
>
> What else could the traps do? I'm not sure I understand the question.
>
>>   Is the config space emulation only for DomU or it for Dom0 as well ?
> Again, have you read the series? This is explained in the cover letter
> (0/9).
>
> On x86 this is initially for Dom0 only, DomU will continue to use QEMU
> until the implementation inside the hypervisor (vPCI) is complete
> enough to handle DomU securely.
>
>> Slide 18 shows only for DomU ?
> ARM folks believe this is not needed for Dom0 in the ARM case, I don't
> have an opinion, I know it's certainly mandatory for x86 PVH Dom0.
Julien clarified about Slide18.
> Roger.


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Notes from PCI Passthrough design discussion at Xen Summit
  2017-07-20 10:47           ` Julien Grall
@ 2017-07-20 11:06             ` Roger Pau Monné
  2017-07-20 11:52               ` Julien Grall
  0 siblings, 1 reply; 35+ messages in thread
From: Roger Pau Monné @ 2017-07-20 11:06 UTC (permalink / raw)
  To: Julien Grall
  Cc: edgar.iglesias, Stefano Stabellini, Jan Beulich, Wei Chen,
	Steve Capper, Manish Jaggi, manish.jaggi, Punit Agrawal, vikrams,
	okaya, Goel, Sameer, Andre Przywara, xen-devel, Dave P Martin,
	Vijaya Kumar K

On Thu, Jul 20, 2017 at 11:47:04AM +0100, Julien Grall wrote:
> > > Slide 18 shows only for DomU ?
> > 
> > ARM folks believe this is not needed for Dom0 in the ARM case, I don't
> > have an opinion, I know it's certainly mandatory for x86 PVH Dom0.
> 
> That was 8 months ago, you managed to convince me we should also trap for
> DOM0 last time we met at the Haymakers :).

Right, my bad. I was indeed confused. We spoke during the design
session about ARM probably not needing to trap MSI/MSI-X accesses (which
x86 must do).

Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Notes from PCI Passthrough design discussion at Xen Summit
  2017-07-20 11:06             ` Roger Pau Monné
@ 2017-07-20 11:52               ` Julien Grall
  0 siblings, 0 replies; 35+ messages in thread
From: Julien Grall @ 2017-07-20 11:52 UTC (permalink / raw)
  To: Roger Pau Monné
  Cc: edgar.iglesias, Stefano Stabellini, Jan Beulich, Wei Chen,
	Steve Capper, Manish Jaggi, manish.jaggi, Punit Agrawal, vikrams,
	okaya, Goel, Sameer, Andre Przywara, xen-devel, Dave P Martin,
	Vijaya Kumar K



On 20/07/17 12:06, Roger Pau Monné wrote:
> On Thu, Jul 20, 2017 at 11:47:04AM +0100, Julien Grall wrote:
>>>> Slide 18 shows only for DomU ?
>>>
>>> ARM folks believe this is not needed for Dom0 in the ARM case, I don't
>>> have an opinion, I know it's certainly mandatory for x86 PVH Dom0.
>>
>> That was 8 months ago, you managed to convince me we should also trap for
>> DOM0 last time we met at the Haymakers :).
>
> Right, my bad. I was indeed confused. We spoke during the design
> session about ARM not needing to trap MSI/MSI-X probably (which x86
> must do).

It will depend on the MSI controller. For the GICv3 ITS, Xen will not
need to trap them for Dom0 because we expose the same number of
controllers as the host and the MSIs will be configured directly via the
virtual interrupt controller.

This might be different for other controllers, but I haven't fully
looked at them yet.

Cheers,

-- 
Julien Grall

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Notes from PCI Passthrough design discussion at Xen Summit
  2017-07-20 11:00           ` Manish Jaggi
@ 2017-07-20 12:24             ` Julien Grall
  0 siblings, 0 replies; 35+ messages in thread
From: Julien Grall @ 2017-07-20 12:24 UTC (permalink / raw)
  To: Manish Jaggi, Roger Pau Monné
  Cc: edgar.iglesias, Stefano Stabellini, Jan Beulich, Wei Chen,
	Steve Capper, Andre Przywara, manish.jaggi, Punit Agrawal,
	vikrams, okaya, Goel, Sameer, xen-devel, Dave P Martin,
	Vijaya Kumar K



On 20/07/17 12:00, Manish Jaggi wrote:
> On 7/20/2017 4:11 PM, Julien Grall wrote:
>>
>>
>> On 20/07/17 10:32, Manish Jaggi wrote:
>>> Hi Roger,
>>>
>>> On 7/20/2017 1:54 PM, Roger Pau Monné wrote:
>>>> On Thu, Jul 20, 2017 at 09:24:36AM +0530, Manish Jaggi wrote:
>>>>> Hi Punit,
>>>>>
>>>>> On 7/19/2017 8:11 PM, Punit Agrawal wrote:
>>>>>> I took some notes for the PCI Passthrough design discussion at Xen
>>>>>> Summit. Due to the wide range of topics covered, the notes got
>>>>>> sparser
>>>>>> towards the end of the session. I've tried to attribute names against
>>>>>> comments but have very likely got things mixed up. Apologies in
>>>>>> advance.
>>>>> Was curious if any discussions happened on the RC Emu (config space
>>>>> emulation) as per slide 18
>>>>> https://schd.ws/hosted_files/xendeveloperanddesignsummit2017/76/slides.pdf
>>>>>
>>>>>
>>>> Part of this is already posted on the list (ATM for x86 only) but the
>>>> PCI specification (and therefore the config space emulation) is not
>>>> tied to any arch:
>>>>
>>>> https://lists.xenproject.org/archives/html/xen-devel/2017-06/msg03698.html
>>>>
>>>>
>>> From the summary, I have a  questions on
>>> "
>>>  - Roger: Registering config space with Xen before device discovery
>>>   will allow the hypervisor to set access traps for certain
>>>  functionality as appropriate"
>>>
>>> Traps will do emulation or something else ?
>>>  Is the config space emulation only for DomU or it for Dom0 as well ?
>>> Slide 18 shows only for DomU ?
>>
>> My slides are not meant to be read without the talk. In this
>> particular case, this is only explaining how passthrough will work for
>> DomU.
>>
> Thanks for clarification.
> Ah ok, The single slide created confusion, It would be nice if you have
> added one more describing dom0 config access. I will wait for the video
> to get posted.

Well as I said my slides are not meant to be used without the talk.

Now, if you want the longer story: the decision for Dom0 is more blurred.
As written in the design document and also reported in the notes from
Punit, supporting all the hostbridges in Xen may not be possible.

At the moment, we are thinking of only supporting fully ECAM-compliant
hostbridges in Xen (i.e. the ones not requiring a specific PCI hostbridge
driver). We might bend the rule on a case-by-case basis in the future.

The hostbridges not supported in Xen will be driven by the hardware
domain, so all configuration accesses will be forwarded to the hardware
domain. The way to communicate between Xen and the hardware domain is
still undecided and out of scope of this design document.

Cheers,

-- 
Julien Grall

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC] ARM PCI Passthrough design document
  2017-07-08  7:34             ` Roger Pau Monné
@ 2018-01-19 10:34               ` Manish Jaggi
  0 siblings, 0 replies; 35+ messages in thread
From: Manish Jaggi @ 2018-01-19 10:34 UTC (permalink / raw)
  To: Roger Pau Monné, Stefano Stabellini
  Cc: edgar.iglesias, punit.agrawal, Vikram Sethi, 'Wei Chen',
	'Steve Capper', 'Andre Przywara',
	manish.jaggi, 'Julien Grall', 'Vikram Sethi',
	'Sinan Kaya', 'Sameer Goel', 'xen-devel',
	'Dave P Martin', 'Vijaya Kumar K'

Hi Roger/Vikram/Stefano,


On 07/08/2017 01:04 PM, Roger Pau Monné wrote:
> On Fri, Jul 07, 2017 at 02:50:01PM -0700, Stefano Stabellini wrote:
>> On Fri, 7 Jul 2017, Roger Pau Monné wrote:
>>> On Thu, Jul 06, 2017 at 03:55:28PM -0500, Vikram Sethi wrote:
>>>>>>> AER: Will PCIe non-fatal and fatal errors (secondary bus reset for fatal)
>>>>>>> be
>>>>> recoverable in Xen?
>>>>>>> Will drivers in doms be notified about fatal errors so they can be
>>>>>>> quiesced
>>>>> before doing secondary bus reset in Xen?
>>>>>>> Will Xen support Firmware First Error handling for AER? i.e When
>>>>>>> platform does Firmware first error handling for AER and/or filtering of
>>>>>>> AER,
>>>>> sends associated ACPI HEST logs to Xen How will AER notification and logs be
>>>>> propagated to the doms: injected ACPI HEST?
>>>>>
>>>>> Hm, I'm not sure I follow here, I don't see AER tied to ACPI. AER is a PCIe
>>>>> capability, and according to the spec can be setup completely independent to
>>>>> ACPI.
>>>>>
>>>> True, it can be independent if not using firmware first AER handling (FFH). But
>>>> Firmware tells the OS whether firmware first is in use.
>>>> If FFH is in use, the AER interrupt goes to firmware and then firmware processes
>>> I'm sorry, but how is the firmware supposed to know which interrupt is
>>> AER using? That's AFAIK setup in the PCI AER capabilities, and
>>> depends on whether the OS configures the device to use MSI or MSI-X.
>>>
>>> Is there some kind of side-band mechanism that delivers the AER
>>> interrupt using a different method?
>>>
>>>> the AER logs, filters errors, and sends a ACPI HEST log with the filtered AER
>>>> regs to OS along with an ACPI event/interrupt. Kernel is not supposed to touch
>>>> the AER registers directly in this case, but act on the register values in the
>>>> HEST log.
>>>> http://elixir.free-electrons.com/linux/latest/source/drivers/pci/pcie/aer/aerdrv_acpi.c#L94
>>> That's not a problem IMHO, Xen could even mask the AER capability from
>>> the Dom0/guest completely if needed.
>>>
>>>> If Firmware is using FFH, Xen will get a HEST log with AER registers, and must
>>>> parse those registers instead of reading AER config space.
>>> Xen will not get an event, it's going to be delivered to Dom0 because
>>> when using ACPI Dom0 is the OSPM (not Xen). I assume this event is
>>> going to be notified by triggering an interrupt from the ACPI SCI?
>> It is still possible to get the event in Xen, either by having Dom0 tell
>> Xen about it, or my moving ACPI SCI handling in Xen. If we move ACPI SCI
>> handling in Xen, we could still forward a virtual SCI interrupt to Dom0
>> in cases where Xen decides that Dom0 should be the one handling the
>> event. In other cases, where Xen knows how to handle the event, then
>> nothing would be sent to Dom0. Would that work?
> Maybe that's different on ARM vs x86, but when receiving the SCI
> interrupt the OSPM has to execute some AML in order to figure out
> which event has triggered. Even if Xen can trap the SCI, it has no way
> to execute AML, and that in any case can only be done by one entity,
> the OSPM.
>
> IMHO, for this to be viable Dom0 should notify the event to Xen.
Any further update on this discussion ?
>
> Roger.


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC] ARM PCI Passthrough design document
  2017-05-26 17:14 [RFC] ARM PCI Passthrough design document Julien Grall
                   ` (4 preceding siblings ...)
  2017-07-19 14:41 ` Notes from PCI Passthrough design discussion at Xen Summit Punit Agrawal
@ 2018-01-22 11:10 ` Manish Jaggi
  5 siblings, 0 replies; 35+ messages in thread
From: Manish Jaggi @ 2018-01-22 11:10 UTC (permalink / raw)
  To: Julien Grall, Stefano Stabellini
  Cc: edgar.iglesias, okaya, Wei Chen, Steve Capper, Andre Przywara,
	manish.jaggi, punit.agrawal, vikrams, Goel, Sameer, xen-devel,
	Dave P Martin, Vijaya Kumar K, roger.pau



On 05/26/2017 10:44 PM, Julien Grall wrote:
> Hi all,
Hi Julien,

General consolidated comments first:

Review Comments:

a. The document talks about the high-level design and does not go into
implementation details and detailed code flows, so these are missing if
adding such detail is intended.

b. The document only covers PCI device assignment from the POV of the
hardware domain, but it does not talk about the high-level flow of
PHYSDEVOP_pci_device_add (a sketch of that hypercall is given after this
list of comments).

c. In the mail chain there was a discussion about Xen only touching the
config space. Can you add that discussion, and the one on config space
emulation, here?

d. Please resolve the sections marked as XXX in the document. We can
revisit this review after that.

e. Please provide separate flow descriptions for DT and ACPI; it will
help in understanding.

f. A general picture of how guest domain device assignment would work at
a high level. As you are covering it in phase 2, you can add more detail
later. This would really help complete the understanding of the design.
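For reference on comment (b), this is roughly how a Linux hardware domain
registers a discovered PCI device with Xen today via PHYSDEVOP_pci_device_add;
the ARM design would use the same or a similar hypercall. The SBDF values below
are hypothetical and only illustrate the information passed.

    #include <xen/interface/physdev.h>   /* struct physdev_pci_device_add */
    #include <asm/xen/hypercall.h>

    /* Example: register device 0000:01:00.0 (segment 0, bus 1, devfn 0). */
    static int register_device_with_xen(void)
    {
        struct physdev_pci_device_add add = {
            .seg   = 0,
            .bus   = 1,
            .devfn = 0,     /* device 0, function 0 */
            .flags = 0,     /* e.g. XEN_PCI_DEV_VIRTFN when registering a VF */
        };

        return HYPERVISOR_physdev_op(PHYSDEVOP_pci_device_add, &add);
    }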

Apart from that the document looks ok.

WBR
-Manish
>
> The document below is an RFC version of a design proposal for PCI
> Passthrough in Xen on ARM. It aims to describe from an high level perspective
> the interaction with the different subsystems and how guest will be able
> to discover and access PCI.
>
> Currently on ARM, Xen does not have any knowledge about PCI devices. This
> means that IOMMU and interrupt controller (such as ITS) requiring specific
> configuration will not work with PCI even with DOM0.
>
> The PCI Passthrough work could be divided in 2 phases:
>          * Phase 1: Register all PCI devices in Xen => will allow
>                     to use ITS and SMMU with PCI in Xen
>          * Phase 2: Assign devices to guests
>
> This document aims to describe the 2 phases, but for now only phase
> 1 is fully described.
>
>
> I think I was able to gather all of the feedbacks and come up with a solution
> that will satisfy all the parties. The design document has changed quite a lot
> compare to the early draft sent few months ago. The major changes are:
> 	* Provide more details how PCI works on ARM and the interactions with
> 	MSI controller and IOMMU
> 	* Provide details on the existing host bridge implementations
> 	* Give more explanation and justifications on the approach chosen
> 	* Describing the hypercalls used and how they should be called
>
> Feedbacks are welcomed.
>
> Cheers,
>
> --------------------------------------------------------------------------------
>
> % PCI pass-through support on ARM
> % Julien Grall <julien.grall@linaro.org>
> % Draft B
>
> # Preface
>
> This document aims to describe the components required to enable the PCI
> pass-through on ARM.
>
> This is an early draft and some questions are still unanswered. When this is
> the case, the text will contain XXX.
>
> # Introduction
>
> PCI pass-through allows the guest to receive full control of physical PCI
> devices. This means the guest will have full and direct access to the PCI
> device.
>
> ARM is supporting a kind of guest that exploits as much as possible
> virtualization support in hardware. The guest will rely on PV driver only
> for IO (e.g block, network) and interrupts will come through the virtualized
> interrupt controller, therefore there are no big changes required within the
> kernel.
>
> As a consequence, it would be possible to replace PV drivers by assigning real
> devices to the guest for I/O access. Xen on ARM would therefore be able to
> run unmodified operating system.
>
> To achieve this goal, it looks more sensible to go towards emulating the
> host bridge (there will be more details later). A guest would be able to take
> advantage of the firmware tables, obviating the need for a specific driver
> for Xen.
>
> Thus, in this document we follow the emulated host bridge approach.
>
> # PCI terminologies
>
> Each PCI device under a host bridge is uniquely identified by its Requester ID
> (AKA RID). A Requester ID is a triplet of Bus number, Device number, and
> Function.
>
> When the platform has multiple host bridges, the software can add a fourth
> number called Segment (sometimes called Domain) to differentiate host bridges.
> A PCI device will then uniquely by segment:bus:device:function (AKA SBDF).
>
> So given a specific SBDF, it would be possible to find the host bridge and the
> RID associated to a PCI device. The pair (host bridge, RID) will often be used
> to find the relevant information for configuring the different subsystems (e.g
> IOMMU, MSI controller). For convenience, the rest of the document will use
> SBDF to refer to the pair (host bridge, RID).
>
> # PCI host bridge
>
> A PCI host bridge enables data transfer between a host processor and PCI-bus
> based devices. The bridge is used to access the configuration space of each
> PCI device and, on some platforms, may also act as an MSI controller.
>
> ## Initialization of the PCI host bridge
>
> Whilst it would be expected that the bootloader takes care of initializing
> the PCI host bridge, on some platforms it is done in the Operating System.
>
> This may include enabling/configuring the clocks that could be shared among
> multiple devices.
>
> ## Accessing PCI configuration space
>
> Accessing the PCI configuration space can be divided into 2 categories:
>      * Indirect access, where the configuration spaces are multiplexed. An
>      example would be legacy method on x86 (e.g 0xcf8 and 0xcfc). On ARM a
>      similar method is used by PCIe RCar root complex (see [12]).
>      * ECAM access, each configuration space will have its own address space.
>
> Whilst ECAM is a standard, some PCI host bridges will require specific fiddling
> when accessing the registers (see thunder-ecam [13]).
>
> In most of the cases, accessing all the PCI configuration spaces under a
> given PCI host will be done the same way (i.e either indirect access or ECAM
> access). However, there are a few cases, dependent on the PCI devices accessed,
> which will use different methods (see thunder-pem [14]).
>
> ## Generic host bridge
>
> For the purpose of this document, the term "generic host bridge" will be used
> to describe any ECAM-compliant host bridge whose initialization, if required,
> has already been done by the firmware/bootloader.
>
> # Interaction of the PCI subsystem with other subsystems
>
> In order to have a PCI device fully working, Xen will need to configure
> other subsystems such as the IOMMU and the Interrupt Controller.
>
> The interaction expected between the PCI subsystem and the other subsystems is:
>      * Add a device
>      * Remove a device
>      * Assign a device to a guest
>      * Deassign a device from a guest
>
> XXX: Detail the interaction when assigning/deassigning device
>
> In the following subsections, the interactions will be briefly described from a
> higher level perspective. However, implementation details such as callback,
> structure, etc... are beyond the scope of this document.
>
> ## IOMMU
>
> The IOMMU will be used to isolate the PCI device when accessing the memory (e.g
> DMA and MSI Doorbells). Often the IOMMU will be configured using a MasterID
> (aka StreamID for ARM SMMU)  that can be deduced from the SBDF with the help
> of the firmware tables (see below).
>
> Whilst in theory, all the memory transactions issued by a PCI device should
> go through the IOMMU, on certain platforms some of the memory transaction may
> not reach the IOMMU because they are interpreted by the host bridge. For
> instance, this could happen if the MSI doorbell is built into the PCI host
> bridge or for P2P traffic. See [6] for more details.
>
> XXX: I think this could be solved by using direct mapping (e.g GFN == MFN),
> this would mean the guest memory layout would be similar to the host one when
> PCI devices will be pass-throughed => Detail it.
>
> ## Interrupt controller
>
> PCI supports three kinds of interrupts: legacy interrupt, MSI and MSI-X. On ARM,
> legacy interrupts will be mapped to SPIs. MSI and MSI-X will write their
> payload in a doorbell belonging to a MSI controller.
>
> ### Existing MSI controllers
>
> In this section some of the existing controllers and their interaction with
> the devices will be briefly described. More details can be found in the
> respective specifications of each MSI controller.
>
> MSIs can be distinguished by some combination of
>      * the Doorbell
>          It is the MMIO address written to. Devices may be configured by
>          software to write to arbitrary doorbells which they can address.
>          An MSI controller may feature a number of doorbells.
>      * the Payload
>          Devices may be configured to write an arbitrary payload chosen by
>          software. MSI controllers may have restrictions on permitted payload.
>          Xen will have to sanitize the payload unless it is known to be always
>          safe.
>      * Sideband information accompanying the write
>          Typically this is neither configurable nor probeable, and depends on
>          the path taken through the memory system (i.e it is a property of the
>          combination of MSI controller and device rather than a property of
>          either in isolation).
>
> ### GICv3/GICv4 ITS
>
> The Interrupt Translation Service (ITS) is a MSI controller designed by ARM
> and integrated in the GICv3/GICv4 interrupt controller. For the specification
> see [GICV3]. Each MSI/MSI-X will be mapped to a new type of interrupt called
> LPI. This interrupt will be configured by the software using a pair (DeviceID,
> EventID).
>
> A platform may have multiple ITS blocks (e.g. one per NUMA node), each of them
> belonging to an ITS group.
>
> The DeviceID is a unique identifier within an ITS group for each MSI-capable
> device and can be deduced from the RID with the help of the firmware tables
> (see below).
>
> The EventID is a unique identifier to distinguish the different events sent
> by a device.
>
> The MSI payload will only contain the EventID as the DeviceID will be added
> afterwards by the hardware in a way that will prevent any tampering.
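>
> For illustration, the translation performed by the ITS can be summarised as
> follows (conceptual sketch only; this is neither the hardware table layout
> nor existing Xen code):
>
> /*
>  * The DeviceID selects a per-device Interrupt Translation Table (ITT),
>  * the EventID indexes into it, and the entry gives the LPI to raise.
>  */
> struct itt_entry {
>     uint32_t lpi;          /* physical LPI configured by software */
>     uint32_t collection;   /* target redistributor (collection) */
> };
>
> static uint32_t its_translate(struct itt_entry **device_table,
>                               uint32_t device_id, uint32_t event_id)
> {
>     return device_table[device_id][event_id].lpi;
> }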
>
> Appendix I of the [SBSA] describes the set of rules for the integration of
> the ITS that any compliant platform should follow. Some of the rules explain
> the security implications of misbehaving devices. They ensure that a guest
> will never be able to trigger an MSI on behalf of another guest.
>
> XXX: The security implications are described in the [SBSA] but I haven't
> found any similar wording in the GICv3 specification. It is unclear to me
> whether non-SBSA-compliant platforms (e.g. embedded) will follow those rules.
>
> ### GICv2m
>
> The GICv2m is an extension of the GICv2 to convert MSI/MSI-X writes to unique
> interrupts. The specification can be found in the [SBSA] appendix E.
>
> Depending on the platform, the GICv2m will provide one or multiple instances
> of register frames. Each frame is composed of a doorbell and associated with
> a set of SPIs that can be discovered by reading the register MSI_TYPER.
>
> On an MSI write, the payload will contain the SPI ID to generate. Note that
> on some platforms the MSI payload may contain an offset from the base SPI
> rather than the SPI itself.
>
> The frame will only generate an SPI if the written value corresponds to an
> SPI allocated to the frame. Each VM should have exclusive access to a frame
> to ensure isolation and prevent a guest OS from triggering an MSI on behalf
> of another guest OS.
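>
> As an illustration of the sanitizing Xen may have to do when trapping the
> MSI configuration of a device, a minimal sketch follows (assuming the
> MSI_TYPER layout from [SBSA] appendix E, i.e. base SPI in bits [25:16] and
> number of SPIs in bits [9:0]; the helper name is hypothetical):
>
> #define MSI_TYPER_BASE_SPI(t)  (((t) >> 16) & 0x3ff)
> #define MSI_TYPER_NR_SPI(t)    ((t) & 0x3ff)
>
> /* Check that a payload only targets SPIs owned by the GICv2m frame. */
> static bool gicv2m_payload_is_valid(uint32_t typer, uint32_t payload)
> {
>     uint32_t base = MSI_TYPER_BASE_SPI(typer);
>     uint32_t nr   = MSI_TYPER_NR_SPI(typer);
>
>     return payload >= base && payload < base + nr;
> }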
>
> XXX: Linux seems to consider GICv2m as unsafe by default. From my understanding,
> it is still unclear how we should proceed on Xen, as GICv2m should be safe
> as long as the frame is only accessed by one guest.
>
> ### Other MSI controllers
>
> Servers compliant with SBSA level 1 and higher will have to use either the
> ITS or the GICv2m. However, these are by no means the only MSI controllers
> available. A hardware vendor may decide to use a custom MSI controller,
> which can be integrated into the PCI host bridge.
>
> Whether it will be possible to generate an MSI securely will depend on the
> MSI controller implementation.
>
> XXX: I am happy to give a brief explanation of more MSI controllers (such
> as Xilinx and Renesas) if people think it is necessary.
>
> This design document does not pertain to a specific MSI controller and will
> try to be as agnostic as possible. Where possible, it will give insight into
> how to integrate an MSI controller.
>
> # Information available in the firmware tables
>
> ## ACPI
>
> ### Host bridges
>
> The static table MCFG (see 4.2 in [1]) will describe the host bridges that
> are available at boot and support ECAM. Unfortunately, there are platforms
> out there (see [2]) that re-use the MCFG to describe host bridges that are
> not fully ECAM-compatible.
>
> This means that Xen needs to account for possible quirks in the host bridge.
> The Linux community is working on a patch series for this (see [2] and [3]),
> where quirks will be detected with:
>      * OEM ID
>      * OEM Table ID
>      * OEM Revision
>      * PCI Segment
>      * PCI bus number range (wildcard allowed)
>
> Based on what Linux is currently doing, there are two kinds of quirks:
>      * Accesses to the configuration space of certain sizes are not allowed
>      * A specific driver is necessary for driving the host bridge
>
> The former is straightforward to solve but the latter will require more thought.
> Instantiation of a specific driver for the host controller can be easily done
> if Xen has the information to detect it. However, those drivers may require
> resources described in ASL (see [4] for instance).
>
> The number of platforms requiring a specific PCI host bridge driver is
> currently limited. Whilst it is not possible to predict the future, upcoming
> platforms are expected to have fully ECAM-compliant PCI host bridges.
> Therefore, given that Xen does not have any ASL parser, the approach
> suggested is to hardcode the missing values. This could be revisited in the
> future if necessary.
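>
> For illustration, the quirk detection could boil down to a static table
> matched against the MCFG header, along the lines of the sketch below (names
> are hypothetical and do not reflect existing Xen code; standard headers are
> omitted for brevity):
>
> struct mcfg_quirk {
>     char     oem_id[6];          /* ACPI OEM ID */
>     char     oem_table_id[8];    /* ACPI OEM Table ID */
>     uint32_t oem_revision;
>     uint16_t segment;            /* PCI segment, or MCFG_SEG_ANY */
>     uint8_t  start_bus, end_bus; /* bus number range (wildcard allowed) */
>     const void *ops;             /* config accessors/driver to instantiate */
> };
>
> #define MCFG_SEG_ANY 0xffff
>
> static bool mcfg_quirk_match(const struct mcfg_quirk *q, const char *oem_id,
>                              const char *oem_table_id, uint32_t rev,
>                              uint16_t seg, uint8_t bus)
> {
>     return !memcmp(q->oem_id, oem_id, sizeof(q->oem_id)) &&
>            !memcmp(q->oem_table_id, oem_table_id, sizeof(q->oem_table_id)) &&
>            q->oem_revision == rev &&
>            (q->segment == MCFG_SEG_ANY || q->segment == seg) &&
>            bus >= q->start_bus && bus <= q->end_bus;
> }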
>
> ### Finding information to configure IOMMU and MSI controller
>
> The static table [IORT] will provide the information needed to deduce the
> data (such as the MasterID and DeviceID) used to configure both the IOMMU
> and the MSI controller from a given SBDF.
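>
> For illustration, each IORT node carries an array of ID mappings, and
> translating an input ID (e.g. a RID) into an output ID (e.g. a DeviceID or
> StreamID) is a simple offset within a range, roughly as sketched below
> (simplified; this is not the exact table layout defined by [IORT]):
>
> struct iort_id_mapping {
>     uint32_t input_base;     /* lowest input ID covered by the mapping */
>     uint32_t nr_ids;         /* size of the range */
>     uint32_t output_base;    /* output ID corresponding to input_base */
>     /* The output reference (target SMMU/ITS node) is omitted here. */
> };
>
> static bool iort_map_id(const struct iort_id_mapping *map, unsigned int nr,
>                         uint32_t input_id, uint32_t *output_id)
> {
>     unsigned int i;
>
>     for ( i = 0; i < nr; i++ )
>     {
>         if ( input_id >= map[i].input_base &&
>              input_id < map[i].input_base + map[i].nr_ids )
>         {
>             *output_id = map[i].output_base +
>                          (input_id - map[i].input_base);
>             return true;
>         }
>     }
>
>     return false;
> }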
>
> ### Finding which NUMA node a PCI device belongs to
>
> On NUMA systems, the NUMA node associated with a PCI device can be found
> using the _PXM method of the host bridge (?).
>
> XXX: I am not entirely sure where the _PXM will be (i.e host bridge vs PCI
> device).
>
> ## Device Tree
>
> ### Host bridges
>
> Each Device Tree node associated with a host bridge will have at least the
> following properties (see bindings in [8]):
>      - device_type: will always be "pci".
>      - compatible: a string indicating which driver to instantiate
>
> The node may also contain optional properties such as:
>      - linux,pci-domain: assigns a fixed segment number
>      - bus-range: indicates the range of bus numbers supported
>
> When the property linux,pci-domain is not present, the operating system will
> have to allocate a segment number for each host bridge.
>
> ### Finding information to configure IOMMU and MSI controller
>
> #### Configuring the IOMMU
>
> The Device Tree provides a generic IOMMU binding (see [10]) which uses the
> properties "iommu-map" and "iommu-map-mask" to describe the relationship
> between a RID and a MasterID.
>
> These properties will be present in the host bridge Device Tree node. From a
> given SBDF, it will be possible to find the corresponding MasterID.
>
> Note that the ARM SMMU also has a legacy binding (see [9]), but it does not
> have a way to describe the relationship between the RID and the StreamID.
> Instead it assumes that StreamID == RID. This binding has now been deprecated
> in favor of the generic IOMMU binding.
>
> #### Configuring the MSI controller
>
> The relationship between the RID and data required to configure the MSI
> controller (such as DeviceID) can be found using the property "msi-map"
> (see [11]).
>
> This property will be present in the host bridge Device Tree node. From a
> given SBDF, it will be possible to find the corresponding DeviceID.
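>
> Both "iommu-map"/"iommu-map-mask" and "msi-map"/"msi-map-mask" follow the
> same translation scheme. For illustration, a sketch of the lookup (mirroring
> the bindings in [10] and [11]; not existing Xen code):
>
> /*
>  * One entry of an "iommu-map"/"msi-map" property:
>  * (rid base, target, output base, length).
>  * The phandle to the target IOMMU/MSI controller is omitted here.
>  */
> struct rid_map_entry {
>     uint32_t rid_base;
>     uint32_t output_base;   /* MasterID (iommu-map) or DeviceID (msi-map) base */
>     uint32_t length;
> };
>
> static bool map_rid(const struct rid_map_entry *map, unsigned int nr,
>                     uint32_t map_mask, uint32_t rid, uint32_t *out)
> {
>     uint32_t masked = rid & map_mask;
>     unsigned int i;
>
>     for ( i = 0; i < nr; i++ )
>     {
>         if ( masked >= map[i].rid_base &&
>              masked < map[i].rid_base + map[i].length )
>         {
>             *out = map[i].output_base + (masked - map[i].rid_base);
>             return true;
>         }
>     }
>
>     return false;
> }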
>
> ### Finding which NUMA node a PCI device belongs to
>
> On NUMA systems, the NUMA node associated with a PCI device can be found
> using the property "numa-node-id" (see [15]) present in the host bridge
> Device Tree node.
>
> # Discovering PCI devices
>
> Whilst PCI devices are currently available in the hardware domain, the
> hypervisor does not have any knowledge of them. The first step of supporting
> PCI pass-through is to make Xen aware of the PCI devices.
>
> Xen will require access to the PCI configuration space to retrieve information
> for the PCI devices or access it on behalf of the guest via the emulated
> host bridge.
>
> This means that Xen should be in charge of controlling the host bridge.
> However, for some host controllers this may be difficult to implement in Xen
> because of dependencies on other components (e.g. clocks; see more details in
> the "PCI host bridge" section).
>
> For this reason, the approach chosen in this document is to let the hardware
> domain discover the host bridges, scan the PCI devices and then report
> everything to Xen. This does not rule out the possibility of doing everything
> without the help of the hardware domain in the future.
>
> ## Who is in charge of the host bridge?
>
> There are numerous host bridge implementations on ARM. Some of them require
> a specific driver as they cannot be driven by a generic host bridge driver.
> Porting those drivers may be complex due to dependencies on other components.
>
> This could be seen as a signal to leave the host bridge drivers in the
> hardware domain. Because Xen would need to access the configuration space,
> all accesses would have to be forwarded to the hardware domain, which in turn
> would access the hardware.
>
> In this design document, we are considering that the host bridge driver can
> be ported to Xen. In case this is not possible, an interface to forward
> configuration space accesses would need to be defined. The interface details
> are out of scope of this document.
>
> ## Discovering and registering host bridge
Please clarify whether this would be required both for ACPI and DT, or only
for DT.
>
> The approach taken in this document will require communication between Xen
> and the hardware domain. In this case, they would need to agree on the
> segment number associated with a host bridge. However, this number is not
> available in the Device Tree case.
>
> The hardware domain will register new host bridges using the existing
> hypercall PHYSDEVOP_pci_mmcfg_reserved:
>
> #define XEN_PCI_MMCFG_RESERVED 1
>
> struct physdev_pci_mmcfg_reserved {
>      /* IN */
>      uint64_t    address;
>      uint16_t    segment;
>      /* Range of bus supported by the host bridge */
>      uint8_t     start_bus;
>      uint8_t     end_bus;
>
>      uint32_t    flags;
> };
>
> Some of the host bridges may not have a separate configuration address space
> region described in the firmware tables. To simplify the registration, the
> field 'address' should contain the base address of one of the regions
> described in the firmware tables:
>      * For ACPI, it would be the base address specified in the MCFG or in the
>      _CBA method.
>      * For Device Tree, this would be any base address of region
>      specified in the "reg" property.
>
> The field 'flags' is expected to have XEN_PCI_MMCFG_RESERVED set.
>
> It is expected that this hypercall is called before any PCI device is
> registered with Xen.
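>
> For illustration, the registration from the hardware domain could look like
> the sketch below (assuming the Linux HYPERVISOR_physdev_op() wrapper;
> 'ecam_base' and 'rc' are placeholders):
>
> struct physdev_pci_mmcfg_reserved r = {
>     .address   = ecam_base,   /* base address from the MCFG/_CBA or "reg" */
>     .segment   = 0,
>     .start_bus = 0,
>     .end_bus   = 255,
>     .flags     = XEN_PCI_MMCFG_RESERVED,
> };
>
> rc = HYPERVISOR_physdev_op(PHYSDEVOP_pci_mmcfg_reserved, &r);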
>
> When the hardware domain is in charge of the host bridge, this hypercall will
> be used to tell Xen about the existence of a host bridge, in order to find
> the associated information for configuring the MSI controller and the IOMMU.
>
> ## Discovering and registering PCI devices
>
> The hardware domain will scan the host bridges to find the list of PCI
> devices available and then report it to Xen using the existing hypercall
> PHYSDEVOP_pci_device_add:
>
> #define XEN_PCI_DEV_EXTFN   0x1
> #define XEN_PCI_DEV_VIRTFN  0x2
> #define XEN_PCI_DEV_PXM     0x4
>
> struct physdev_pci_device_add {
>      /* IN */
>      uint16_t    seg;
>      uint8_t     bus;
>      uint8_t     devfn;
>      uint32_t    flags;
>      struct {
>          uint8_t bus;
>          uint8_t devfn;
>      } physfn;
>      /*
>       * Optional parameters array.
>       * First element ([0]) is PXM domain associated with the device (if
>       * XEN_PCI_DEV_PXM is set)
>       */
>      uint32_t optarr[0];
> };
>
> When XEN_PCI_DEV_PXM is set in the field 'flags', optarr[0] will contain the
> NUMA node ID associated with the device:
>      * For ACPI, it would be the value returned by the method _PXM
>      * For Device Tree, this would be the value found in the property "numa-node-id".
> For more details see the section "Finding which NUMA node a PCI device belongs
> to" in "ACPI" and "Device Tree".
>
> XXX: I still don't fully understand how XEN_PCI_DEV_EXTFN and XEN_PCI_DEV_VIRTFN
> will work. AFAICT, the former is used when the bus supports ARI and the only
> usage is in the x86 IOMMU code. For the latter, this is related to SR-IOV but
> I am not sure what devfn and physfn.devfn will correspond to.
>
> Note that x86 currently provides two more hypercalls (PHYSDEVOP_manage_pci_add
> and PHYSDEVOP_manage_pci_add_ext) to register PCI devices. However, they are a
> subset of the hypercall PHYSDEVOP_pci_device_add. Therefore, it is suggested
> to leave them unimplemented on ARM.
>
> ## Removing PCI devices
>
> The hardware domain will be in charge of telling Xen that a device has been
> removed, using the existing hypercall PHYSDEVOP_pci_device_remove:
>
> struct physdev_pci_device {
>      /* IN */
>      uint16_t    seg;
>      uint8_t     bus;
>      uint8_t     devfn;
> };
>
> Note that x86 currently provides one more hypercall (PHYSDEVOP_manage_pci_remove)
> to remove PCI devices. However, it does not allow passing a segment number.
> Therefore it is suggested to leave it unimplemented on ARM.
Please add a flow from the Linux hypercall down to the SMMU API calls. This
would make the picture clearer.
> # Glossary
>
> ECAM: Enhanced Configuration Access Mechanism
> SBDF: Segment Bus Device Function. The segment is a software concept.
> MSI: Message Signaled Interrupt
> MSI doorbell: MMIO address written to by a device to generate an MSI
> SPI: Shared Peripheral Interrupt
> LPI: Locality-specific Peripheral Interrupt
> ITS: Interrupt Translation Service
>
> # Specifications
> [SBSA]  ARM-DEN-0029 v3.0
> [GICV3] IHI0069C
> [IORT]  DEN0049B
>
> # Bibliography
>
> [1] PCI firmware specification, rev 3.2
> [2] https://www.spinics.net/lists/linux-pci/msg56715.html
> [3] https://www.spinics.net/lists/linux-pci/msg56723.html
> [4] https://www.spinics.net/lists/linux-pci/msg56728.html
> [6] https://www.spinics.net/lists/kvm/msg140116.html
> [7] http://www.firmware.org/1275/bindings/pci/pci2_1.pdf
> [8] Documentation/devicetree/bindings/pci
> [9] Documentation/devicetree/bindings/iommu/arm,smmu.txt
> [10] Documentation/devicetree/bindings/pci/pci-iommu.txt
> [11] Documentation/devicetree/bindings/pci/pci-msi.txt
> [12] drivers/pci/host/pcie-rcar.c
> [13] drivers/pci/host/pci-thunder-ecam.c
> [14] drivers/pci/host/pci-thunder-pem.c
> [15] Documentation/devicetree/bindings/numa.txt

