* [RFC 0/3] virtio-iommu: a paravirtualized IOMMU
@ 2017-04-07 19:17 Jean-Philippe Brucker
  2017-04-07 19:17 ` [RFC 1/3] virtio-iommu: firmware description of the virtual topology Jean-Philippe Brucker
                   ` (13 more replies)
  0 siblings, 14 replies; 99+ messages in thread
From: Jean-Philippe Brucker @ 2017-04-07 19:17 UTC (permalink / raw)
  To: iommu, kvm, virtualization, virtio-dev
  Cc: cdall, will.deacon, robin.murphy, lorenzo.pieralisi, joro, mst,
	jasowang, alex.williamson, marc.zyngier

This is the initial proposal for a paravirtualized IOMMU device using
virtio transport. It contains a description of the device, a Linux driver,
and a toy implementation in kvmtool. With this prototype, you can
translate DMA to guest memory from emulated (virtio) or passed-through
(VFIO) devices.

In its simplest form, implemented here, the device handles map/unmap
requests from the guest. Future extensions proposed in "RFC 3/3" should
allow binding page tables to devices.

A paravirtualized IOMMU has a number of advantages over a full emulation.
It is portable and could be reused on different architectures. It is
easier to implement than a full emulation, with less state tracking. It
might be more efficient in some cases, with fewer context switches to the
host and the possibility of in-kernel emulation.

When designing it and writing the kvmtool device, I considered two main
scenarios, illustrated below.

Scenario 1: a hardware device passed through twice via VFIO

   MEM____pIOMMU________PCI device________________________       HARDWARE
            |     (2b)                                    \
  ----------|-------------+-------------+------------------\-------------
            |             :     KVM     :                   \
            |             :             :                    \
       pIOMMU drv         :         _______virtio-iommu drv   \    KERNEL
            |             :        |    :          |           \
          VFIO            :        |    :        VFIO           \
            |             :        |    :          |             \
            |             :        |    :          |             /
  ----------|-------------+--------|----+----------|------------/--------
            |                      |    :          |           /
            | (1c)            (1b) |    :     (1a) |          / (2a)
            |                      |    :          |         /
            |                      |    :          |        /   USERSPACE
            |___virtio-iommu dev___|    :        net drv___/
                                        :
  --------------------------------------+--------------------------------
                 HOST                   :             GUEST

(1) a. Guest userspace is running a net driver (e.g. DPDK). It allocates a
       buffer with mmap, obtaining virtual address VA. It then sends a
       VFIO_IOMMU_MAP_DMA request (sketched after this list) to map VA to
       an IOVA (possibly VA=IOVA).
    b. The mapping request is relayed to the host through virtio
       (VIRTIO_IOMMU_T_MAP).
    c. The mapping request is relayed to the physical IOMMU through VFIO.

(2) a. The guest userspace driver can now instruct the device to directly
       access the buffer at IOVA.
    b. IOVA accesses from the device are translated into physical
       addresses by the IOMMU.
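
To make step (1a) concrete, here is a minimal sketch of the guest
userspace side, assuming a VFIO container fd that is already set up
(map_dma_buffer is a hypothetical helper, not part of this proposal):

	#include <stdint.h>
	#include <string.h>
	#include <sys/ioctl.h>
	#include <sys/mman.h>
	#include <linux/vfio.h>

	/* Allocate a buffer and map it for DMA at IOVA == VA (step 1a) */
	static void *map_dma_buffer(int container_fd, size_t size)
	{
		struct vfio_iommu_type1_dma_map map;
		void *va = mmap(NULL, size, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (va == MAP_FAILED)
			return NULL;

		memset(&map, 0, sizeof(map));
		map.argsz = sizeof(map);
		map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
		map.vaddr = (uintptr_t)va;
		map.iova  = (uintptr_t)va;	/* VA == IOVA */
		map.size  = size;

		/* Inside the guest, this ioctl becomes a VIRTIO_IOMMU_T_MAP
		 * request (1b), then a physical IOMMU mapping via the
		 * host's VFIO (1c). */
		if (ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map) < 0) {
			munmap(va, size);
			return NULL;
		}
		return va;
	}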

Scenario 2: a virtual net device behind a virtual IOMMU.

  MEM__pIOMMU___PCI device                                     HARDWARE
         |         |
  -------|---------|------+-------------+-------------------------------
         |         |      :     KVM     :
         |         |      :             :
    pIOMMU drv     |      :             :
             \     |      :      _____________virtio-net drv      KERNEL
              \_net drv   :     |       :          / (1a)
                   |      :     |       :         /
                  tap     :     |    ________virtio-iommu drv
                   |      :     |   |   : (1b)
  -----------------|------+-----|---|---+-------------------------------
                   |            |   |   :
                   |_virtio-net_|   |   :
                         / (2)      |   :
                        /           |   :                      USERSPACE
              virtio-iommu dev______|   :
                                        :
  --------------------------------------+-------------------------------
                 HOST                   :             GUEST

(1) a. The guest virtio-net driver maps the virtio ring and a buffer.
    b. The mapping requests are relayed to the host through virtio.
(2) The virtio-net device now needs to access any guest memory via the
    IOMMU.

Physical and virtual IOMMUs are completely dissociated. The net driver
maps its own buffers via the DMA/IOMMU API, and buffers are copied between
virtio-net and tap.


The description itself seemed too long for a single email, so I split it
into three documents, and will attach Linux and kvmtool patches to this
email.

	1. Firmware note,
	2. Device operations (draft for the virtio specification),
	3. Future work/possible improvements.

Just to be clear on the terms I'm using:

pIOMMU	physical IOMMU, controlling DMA accesses from physical devices
vIOMMU	virtual IOMMU (virtio-iommu), controlling DMA accesses from
	physical and virtual devices to guest memory.
GVA, GPA, HVA, HPA
	Guest/Host Virtual/Physical Address
IOVA	I/O Virtual Address, the address accessed by a device doing DMA
	through an IOMMU. In the context of a guest OS, IOVA is GVA.

Note: kvmtool is GPLv2. Linux patches are GPLv2, except for UAPI
virtio-iommu.h header, which is BSD 3-clause. For the time being, the
specification draft in RFC 2/3 is also BSD 3-clause.


This proposal may be inadvertently centered on ARM architectures at
times. Any feedback would be appreciated, especially regarding other IOMMU
architectures.

Thanks,
Jean-Philippe

* [RFC 1/3] virtio-iommu: firmware description of the virtual topology
  2017-04-07 19:17 [RFC 0/3] virtio-iommu: a paravirtualized IOMMU Jean-Philippe Brucker
@ 2017-04-07 19:17 ` Jean-Philippe Brucker
  2017-04-07 19:17 ` [RFC 2/3] virtio-iommu: device probing and operations Jean-Philippe Brucker
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 99+ messages in thread
From: Jean-Philippe Brucker @ 2017-04-07 19:17 UTC (permalink / raw)
  To: iommu, kvm, virtualization, virtio-dev
  Cc: cdall, lorenzo.pieralisi, mst, marc.zyngier, joro, will.deacon,
	robin.murphy

Unlike other virtio devices, the virtio-iommu doesn't work independently;
it is linked to other virtual or assigned devices. So before jumping into
device operations, we need to define a way for the guest to discover the
virtual IOMMU and the devices it translates.

The host must describe the relation between IOMMU and devices to the guest
using either device-tree or ACPI. The virtual IOMMU identifies each
virtual device with a 32-bit ID, which we will call "Device ID" in this
document. Device IDs are not necessarily unique system-wide, but they must
not overlap within a single virtual IOMMU. Device IDs of passed-through
devices do not need to match the IDs seen by the physical IOMMU.

The virtual IOMMU uses the virtio-mmio transport exclusively, not
virtio-pci, because with PCI the IOMMU interface would itself be an
endpoint, and existing firmware interfaces don't allow describing
IOMMU<->master relations between PCI endpoints.

The following diagram describes a situation where two virtual IOMMUs
translate traffic from devices in the system. vIOMMU 1 translates two PCI
domains, in which each function has a 16-bit requester ID. In order for
the vIOMMU to differentiate guest requests targeted at devices in each
domain, their Device ID ranges cannot overlap. vIOMMU 2 translates two PCI
domains and a collection of platform devices.

                       Device ID    Requester ID
                  /       0x0           0x0      \
                 /         |             |        PCI domain 1
                /      0xffff           0xffff   /
        vIOMMU 1
                \     0x10000           0x0      \
                 \         |             |        PCI domain 2
                  \   0x1ffff           0xffff   /

                  /       0x0                    \
                 /         |                      platform devices
                /      0x1fff                    /
        vIOMMU 2
                \      0x2000           0x0      \
                 \         |             |        PCI domain 3
                  \   0x11fff           0xffff   /

Device-tree already offers a way to describe the topology. Here's an
example description of vIOMMU 2 with its devices:

	/* The virtual IOMMU is described with a virtio-mmio node */
	viommu2: virtio@10000 {
		compatible = "virtio,mmio";
		reg = <0x10000 0x200>;
		dma-coherent;
		interrupts = <0x0 0x5 0x1>;
		
		#iommu-cells = <1>;
	};
	
	/* Some platform device has Device ID 0x5 */
	somedevice@20000 {
		...
		
		iommus = <&viommu2 0x5>;
	};
	
	/*
	 * PCI domain 3 is described by its host controller node, along
	 * with the complete relation to the IOMMU
	 */
	pci {
		...
		/* Linear map between RIDs and Device IDs for the whole bus */
		iommu-map = <0x0 &viommu2 0x2000 0x10000>;
	};

For more details, please refer to [DT-IOMMU].
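
The iommu-map property above encodes a linear translation <rid-base
iommu-phandle id-base length>. As an illustration only (the structure and
helper are hypothetical, not part of any binding), a guest could resolve
Device IDs like this:

	#include <stdint.h>
	#include <stddef.h>

	struct iommu_map_entry {
		uint32_t rid_base;	/* first requester ID in the range */
		uint32_t id_base;	/* Device ID corresponding to rid_base */
		uint32_t length;	/* number of IDs in the range */
	};

	/* Return the Device ID for a requester ID, or -1 if not translated */
	static int64_t rid_to_devid(const struct iommu_map_entry *map,
				    size_t nr_entries, uint32_t rid)
	{
		size_t i;

		for (i = 0; i < nr_entries; i++)
			if (rid - map[i].rid_base < map[i].length)
				return map[i].id_base + (rid - map[i].rid_base);
		return -1;
	}

With the iommu-map above, requester ID 0x42 in PCI domain 3 yields Device
ID 0x2042.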

For ACPI, we expect to add a new node type to the IO Remapping Table
specification [IORT], providing a similar mechanism for describing
translations via ACPI tables. The following is *not* a specification,
simply an example of what the node could be.

         Field      | Len.  | Off.  | Description
    ----------------|-------|-------|---------------------------------
     Type           | 1     | 0     | 5: paravirtualized IOMMU
     Length         | 2     | 1     | The length of the node.
     Revision       | 1     | 3     | 0
     Reserved       | 4     | 4     | Must be zero.
     Number of ID   | 4     | 8     |
       mappings     |       |       |
     Reference to   | 4     | 12    | Offset from the start of the
       ID Array     |       |       | IORT node to the start of its
                    |       |       | ID Array.
                    |       |       |
     Model          | 4     | 16    | 0: virtio-iommu
     Device object  | --    | 20    | ASCII Null terminated string
       name         |       |       | with the full path to the entry
                    |       |       | in the namespace for this IOMMU.
     Padding        | --    | --    | To keep 32-bit alignment and
                    |       |       | leave space for future models.
                    |       |       |
     Array of ID    |       |       |
       mappings     | 20xN  | --    | ID Array.

The OS parses the IORT table to build a map of ID relations between IOMMU
and devices. The ID Array is used to find the correspondence between IOMMU
IDs and PCI or platform devices. Later on, the virtio-iommu driver finds
the associated LNRO0005 descriptor via the "Device object name" field, and
probes the virtio device to find out more about its capabilities. Since
all properties of the IOMMU will be obtained during virtio probing, the
IORT node can stay simple.
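
For illustration, the node layout from the table could be expressed as the
following C structure (an example only, not a specification; the ID Array
entries would follow the existing IORT ID mapping format):

	struct iort_pviommu_node {
		uint8_t		type;		/* 5: paravirtualized IOMMU */
		uint16_t	length;		/* length of the node */
		uint8_t		revision;	/* 0 */
		uint32_t	reserved;	/* must be zero */
		uint32_t	nr_id_mappings;
		uint32_t	id_array_offset; /* from start of the node */
		uint32_t	model;		/* 0: virtio-iommu */
		char		name[];		/* NUL-terminated device
						   object name, padded to
						   4-byte alignment, then
						   the ID Array (20 bytes
						   per mapping) */
	} __attribute__((packed));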

[DT-IOMMU] https://www.kernel.org/doc/Documentation/devicetree/bindings/iommu/iommu.txt
           https://www.kernel.org/doc/Documentation/devicetree/bindings/pci/pci-iommu.txt

[IORT] IO Remapping Table, DEN0049B
       http://infocenter.arm.com/help/topic/com.arm.doc.den0049b/DEN0049B_IO_Remapping_Table.pdf


* [RFC 2/3] virtio-iommu: device probing and operations
  2017-04-07 19:17 [RFC 0/3] virtio-iommu: a paravirtualized IOMMU Jean-Philippe Brucker
  2017-04-07 19:17 ` [RFC 1/3] virtio-iommu: firmware description of the virtual topology Jean-Philippe Brucker
@ 2017-04-07 19:17 ` Jean-Philippe Brucker
  2017-04-18 10:26   ` Tian, Kevin
  2017-04-18 10:26   ` Tian, Kevin
  2017-04-07 19:17 ` Jean-Philippe Brucker
                   ` (11 subsequent siblings)
  13 siblings, 2 replies; 99+ messages in thread
From: Jean-Philippe Brucker @ 2017-04-07 19:17 UTC (permalink / raw)
  To: iommu, kvm, virtualization, virtio-dev
  Cc: cdall, will.deacon, robin.murphy, lorenzo.pieralisi, joro, mst,
	jasowang, alex.williamson, marc.zyngier

After the virtio-iommu device has been probed and the driver is aware of
the devices translated by the IOMMU, it can start sending requests to the
virtio-iommu device. The operations described here are deliberately
minimal, so vIOMMU devices can be as simple as possible to implement, and
can be extended with feature bits.

	I.   Overview
	II.  Feature bits
	III. Device configuration layout
	IV.  Device initialization
	V.   Device operations
	     1. Attach device
	     2. Detach device
	     3. Map region
	     4. Unmap region


  I. Overview
  ===========

Requests are small buffers added by the guest to the request virtqueue.
The guest can add a batch of them to the queue and send a notification
(kick) to the device to have all of them handled.

Here is an example flow:

* attach(address space, device), kick: create a new address space and
  attach a device to it
* map(address space, virt, phys, size, flags): create a mapping between a
  guest-virtual and a guest-physical address
* map, map, map, kick

* ... here the guest device can perform DMA to the freshly mapped memory

* unmap(address space, virt, size), unmap, kick
* detach(address space, device), kick

The following description attempts to use the same format as other virtio
devices. We won't go into the details of the virtio transport; please
refer to [VIRTIO-v1.0] for more information.

As a quick reminder, the virtio (1.0) transport can be described with the
following flow (the ring structures involved are sketched after the
steps):

                             HOST  :  GUEST
                     (3)           :
                    .----- [available ring] <-----. (2)
                   /               :               \
                  v   (4)          :          (1)   \
            [device] <--- [descriptor table] <---- [driver]
                  \                :                 ^
                   \               :                /
                (5) '-------> [used ring] ---------'
                                   :            (6)
                                   :

(1) The driver has a buffer with a payload to send via virtio. It writes
    the address and size of the buffer in a descriptor. It can chain N
    sub-buffers by writing N descriptors and linking them together. The
    first descriptor of the chain is referred to as the head.
(2) The driver queues the head index into the 'available' ring.
(3) The driver notifies the device. Since virtio-iommu uses MMIO,
    notification is done by writing to a doorbell address. KVM traps it
    and forwards the notification to the virtio device. The device
    dequeues the head index from the 'available' ring.
(4) The device reads all descriptors in the chain and handles the payload.
(5) The device writes the head index into the 'used' ring and sends a
    notification to the guest, by injecting an interrupt.
(6) The driver pops the head from the used ring, and optionally reads the
    buffers that were updated by the device.
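
For reference, the ring structures involved in these steps look like this
in virtio 1.0 (see [VIRTIO-v1.0]; step numbers in the comments refer to
the flow above):

	struct vring_desc {
		__le64 addr;	/* guest-physical address of the buffer (1) */
		__le32 len;	/* length of the buffer */
	#define VRING_DESC_F_NEXT	1	/* buffer continues in 'next' */
	#define VRING_DESC_F_WRITE	2	/* buffer is device-writeable */
		__le16 flags;
		__le16 next;	/* index of the next descriptor in the chain */
	};

	struct vring_avail {			/* written by the driver (2) */
		__le16 flags;
		__le16 idx;
		__le16 ring[];			/* head indexes of chains */
	};

	struct vring_used_elem {		/* written by the device (5) */
		__le32 id;	/* head index of a completed chain */
		__le32 len;	/* number of bytes written to the chain */
	};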


  II. Feature bits
  ================

VIRTIO_IOMMU_F_INPUT_RANGE (0)
 Available range of virtual addresses is described in input_range

VIRTIO_IOMMU_F_IOASID_BITS (1)
 The number of address spaces supported is described in ioasid_bits

VIRTIO_IOMMU_F_MAP_UNMAP (2)
 Map and unmap requests are available. This is here to allow a device or
 driver to only implement page-table sharing, once we introduce that
 feature. A device will be able to select only one of F_MAP_UNMAP or
 F_PT_SHARING. For the moment, this bit must always be set.

VIRTIO_IOMMU_F_BYPASS (3)
 When not attached to an address space, devices behind the IOMMU can
 access the guest-physical address space.

  III. Device configuration layout
  ================================

	struct virtio_iommu_config {
		u64 page_size_mask;
		struct virtio_iommu_range {
			u64 start;
			u64 end;
		} input_range;
		u8 ioasid_bits;
	};

  IV. Device initialization
  =========================

1. page_size_mask contains the bitmask of all page sizes that can be
   mapped. The least significant bit set defines the page granularity of
   IOMMU mappings. Other bits in the mask are hints describing page sizes
   that the IOMMU can merge into a single mapping (page blocks).

   There is no lower limit for the smallest page granularity supported by
   the IOMMU. It is legal for the driver to map one byte at a time if the
   device advertises it.

   page_size_mask must have at least one bit set. (A sketch following
   this list shows how a driver might consume these fields.)

2. If the VIRTIO_IOMMU_F_IOASID_BITS feature is negotiated, ioasid_bits
   contains the number of bits supported in an I/O Address Space ID, the
   identifier used in map/unmap requests. A value of 0 is valid, and means
   that a single address space is supported.

   If the feature is not negotiated, address space identifiers can use up
   to 32 bits.

3. If the VIRTIO_IOMMU_F_INPUT_RANGE feature is negotiated, input_range
   contains the virtual address range that the IOMMU is able to translate.
   Any mapping request to virtual addresses outside of this range will
   fail.

   If the feature is not negotiated, virtual mappings span the whole
   64-bit address space (start = 0, end = 0xffffffffffffffff).

4. If the VIRTIO_IOMMU_F_BYPASS feature is negotiated, devices behind the
   IOMMU not attached to an address space are allowed to access
   guest-physical addresses. Otherwise, accesses to guest-physical
   addresses may fault.
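
Putting the four steps together, a driver could consolidate the negotiated
parameters as follows (a sketch only; has_feature() and the config pointer
stand for the transport's feature and config accessors):

	u64 start = 0, end = ~0ULL;	/* defaults when not negotiated (3) */
	u32 ioasid_bits = 32;		/* default when not negotiated (2) */
	u64 granule;

	if (has_feature(dev, VIRTIO_IOMMU_F_INPUT_RANGE)) {
		start = config->input_range.start;
		end   = config->input_range.end;
	}
	if (has_feature(dev, VIRTIO_IOMMU_F_IOASID_BITS))
		ioasid_bits = config->ioasid_bits;

	/* The lowest bit set gives the page granularity (1). For example,
	 * a mask of 0x40201000 allows 4kB mappings, plus 2MB and 1GB
	 * blocks as hints. */
	granule = config->page_size_mask & ~(config->page_size_mask - 1);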


  V. Device operations
  ====================

The driver sends requests on the request virtqueue (0), notifies the
device and waits for the device to return the request with a status in the
used ring. All requests are split into two parts: one device-readable, one
device-writeable. Each request must therefore be described with at least
two descriptors, as illustrated below.

	31                       7      0
	+--------------------------------+ <------- RO descriptor
	|      0 (reserved)     |  type  |
	+--------------------------------+
	|                                |
	|            payload             |
	|                                | <------- WO descriptor
	+--------------------------------+
	|      0 (reserved)     | status |
	+--------------------------------+

	struct virtio_iommu_req_head {
		u8	type;
		u8	reserved[3];
	};

	struct virtio_iommu_req_tail {
		u8	status;
		u8	reserved[3];
	};

(Note on the format choice: this format forces the payload to be split in
two - one read-only buffer, one write-only. It is necessary and sufficient
for our purpose, and does not close the door to future extensions with
more complex requests, such as a WO field sandwiched between two RO ones.
With virtio 1.0 ring requirements, such a request would need to be
described by two chains of descriptors, which might be more complex to
implement efficiently, but still possible. Both devices and drivers must
assume that requests are segmented anyway.)

Type may be one of:

VIRTIO_IOMMU_T_ATTACH			1
VIRTIO_IOMMU_T_DETACH			2
VIRTIO_IOMMU_T_MAP			3
VIRTIO_IOMMU_T_UNMAP			4

A few general-purpose status codes are defined here. The driver must not
assume that a specific status will be returned for an invalid request.
Except for 0, which always means "success", these values are hints to make
troubleshooting easier.

VIRTIO_IOMMU_S_OK			0
 All good! Carry on.

VIRTIO_IOMMU_S_IOERR			1
 Virtio communication error 

VIRTIO_IOMMU_S_UNSUPP			2
 Unsupported request

VIRTIO_IOMMU_S_DEVERR			3
 Internal device error

VIRTIO_IOMMU_S_INVAL			4
 Invalid parameters

VIRTIO_IOMMU_S_RANGE			5
 Out-of-range parameters

VIRTIO_IOMMU_S_NOENT			6
 Entry not found

VIRTIO_IOMMU_S_FAULT			7
 Bad address


  1. Attach device
  ----------------

struct virtio_iommu_req_attach {
	le32	address_space;
	le32	device;
	le32	flags/reserved;
};

Attach a device to an address space. 'address_space' is an identifier
unique to the guest. If the address space doesn't exist in the IOMMU
device, it is created. 'device' is an identifier unique to the IOMMU. The
host communicates the unique device IDs to the guest during boot. The
method used to communicate these IDs is outside the scope of this
specification, but the following rules must apply:

* The device ID is unique from the IOMMU point of view. Multiple devices
  whose DMA transactions are not translated by the same IOMMU may have the
  same device ID. Devices whose DMA transactions may be translated by the
  same IOMMU must have different device IDs.

* Sometimes the host cannot completely isolate two devices from each
  other. For example on a legacy PCI bus, devices can snoop DMA
  transactions from their neighbours. In this case, the host must
  communicate to the guest that it cannot isolate these devices from each
  other. The method used to communicate this is outside the scope of this
  specification. The IOMMU device must ensure that devices that cannot be
  isolated by the host share the same address space.

Multiple devices may be added to the same address space. A device cannot
be attached to multiple address spaces (with the map/unmap interface, that
is; for SVM, see the page table and context table sharing proposal).

If the device is already attached to another address space 'old', it is
detached from the old one and attached to the new one. The device cannot
access mappings from the old address space after this request completes.

The device either returns VIRTIO_IOMMU_S_OK, or an error status. We
suggest the following error statuses, which would help debug the driver
(an example transaction layout follows the list):

NOENT: device not found.
RANGE: address space is outside the range allowed by ioasid_bits.
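
As an illustration, one ATTACH transaction could be laid out by the driver
as follows (a sketch using the structures defined earlier; the
descriptor-queueing calls themselves are left out):

	struct {				/* device-readable part */
		struct virtio_iommu_req_head	head;
		struct virtio_iommu_req_attach	attach;
	} req = {
		.head.type		= VIRTIO_IOMMU_T_ATTACH,
		.attach.address_space	= cpu_to_le32(1),
		.attach.device		= cpu_to_le32(0x2042),	/* from firmware */
	};
	struct virtio_iommu_req_tail tail;	/* device-writeable part */

	/* Descriptor 0: &req, sizeof(req), read-only.
	 * Descriptor 1: &tail, sizeof(tail), write-only.
	 * Kick, then wait for the head index in the used ring and check
	 * tail.status against the codes above. */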


  2. Detach device
  ----------------

struct virtio_iommu_req_detach {
	le32	device;
	le32	flags/reserved;
};

Detach a device from its address space. When this request completes, the
device cannot access any mapping from that address space anymore. If the
device isn't attached to any address space, the request returns
successfully.

After all devices have been successfully detached from an address space,
its ID can be reused by the driver for another address space.

NOENT: device not found.
INVAL: device wasn't attached to any address space (for devices that
       report this case instead of returning success).


  3. Map region
  -------------

struct virtio_iommu_req_map {
	le32	address_space;
	le64	phys_addr;
	le64	virt_addr;
	le64	size;
	le32	flags;
};

VIRTIO_IOMMU_MAP_F_READ		0x1
VIRTIO_IOMMU_MAP_F_WRITE	0x2
VIRTIO_IOMMU_MAP_F_EXEC		0x4

Map a range of virtually-contiguous addresses to a range of
physically-contiguous addresses. The size must always be a multiple of the
page granularity negotiated during initialization. Both phys_addr and
virt_addr must be aligned to the page granularity. The address space must
have been created with VIRTIO_IOMMU_T_ATTACH.

The range defined by (virt_addr, size) must be within the limits specified
by input_range. The range defined by (phys_addr, size) must be within the
guest-physical address space. This includes upper and lower limits, as
well as any carving of guest-physical addresses for use by the host (for
instance MSI doorbells). Guest physical boundaries are set by the host
using a firmware mechanism outside the scope of this specification.

(Note that this format prevents creating the identity mapping of the whole
address space, (0x0 - 0xfff....fff) -> (0x0 - 0xfff...fff), in a single
request, since it would result in a size of zero. Hopefully allowing
VIRTIO_IOMMU_F_BYPASS eliminates the need for issuing such a request. The
request would also be unlikely to conform to the physical range
restrictions from the previous paragraph.)

(Another note, on flags: it is unlikely that all possible combinations of
flags will be supported by the physical IOMMU. For instance, (W & !R) or
(E & W) might be invalid. I haven't taken the time to devise a clever way
to advertise supported and implicit (for instance "W implies R") flags or
combinations thereof for the moment, but I could at least try to research
common models. Keep in mind that we might soon want to add more flags,
such as privileged, device, transient, shared, etc., whatever these would
mean.)

This request is only available when VIRTIO_IOMMU_F_MAP_UNMAP has been
negotiated.

INVAL: invalid flags.
RANGE: virt_addr, phys_addr or range are not within the limits specified
       during negotiation; for instance, not aligned to the page
       granularity.
NOENT: address space not found.
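
A sketch of the sanity checks a device could perform before installing a
mapping, based on the parameters negotiated during initialization
(granule, input_start and input_end are the values derived in section IV;
all names are illustrative):

	static uint8_t check_map(const struct virtio_iommu_req_map *req)
	{
		uint64_t virt = le64_to_cpu(req->virt_addr);
		uint64_t phys = le64_to_cpu(req->phys_addr);
		uint64_t size = le64_to_cpu(req->size);
		uint32_t flags = le32_to_cpu(req->flags);

		if (flags & ~(VIRTIO_IOMMU_MAP_F_READ |
			      VIRTIO_IOMMU_MAP_F_WRITE |
			      VIRTIO_IOMMU_MAP_F_EXEC))
			return VIRTIO_IOMMU_S_INVAL;
		if (!size || (virt | phys | size) % granule)
			return VIRTIO_IOMMU_S_RANGE;
		if (virt < input_start || virt > input_end ||
		    size - 1 > input_end - virt)
			return VIRTIO_IOMMU_S_RANGE;	/* overflow-safe */
		return VIRTIO_IOMMU_S_OK;
	}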


  4. Unmap region
  ---------------

struct virtio_iommu_req_unmap {
	le32	address_space;
	le64	virt_addr;
	le64	size;
	le32	reserved;
};

Unmap a range of addresses mapped with VIRTIO_IOMMU_T_MAP. The range,
defined by virt_addr and size, must exactly cover one or more contiguous
mappings created with MAP requests. All mappings covered by the range are
removed. The driver should not send a request covering unmapped areas.

We define a mapping as a virtual region created with a single MAP request.
virt_addr must exactly match the start of an existing mapping. The end of
the range, (virt_addr + size - 1), must exactly match the end of an
existing mapping. The device must reject any request that would affect
only part of a mapping. If the requested range spills outside of mapped
regions, the device's behaviour is undefined.

These rules are illustrated with the following requests (with arguments
(va, size)), assuming each example sequence starts with a blank address
space. A device-side sketch follows the examples:

	map(0, 10)
	unmap(0, 10) -> allowed

	map(0, 5)
	map(5, 5)
	unmap(0, 10) -> allowed

	map(0, 10)
	unmap(0, 5) -> forbidden

	map(0, 10)
	unmap(0, 15) -> undefined

	map(0, 5)
	map(10, 5)
	unmap(0, 15) -> undefined
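
A device-side sketch of these rules, assuming mappings are kept in an
ordered set (mapping_find/mapping_next/mapping_remove are hypothetical
helpers; ranges spilling into holes are simply rejected here, although the
text above leaves that case undefined):

	/* Each mapping records [virt_start, virt_end] of one MAP request */
	static uint8_t viommu_unmap(struct address_space *as,
				    uint64_t virt, uint64_t size)
	{
		uint64_t end = virt + size - 1;
		struct mapping *m = mapping_find(as, virt);

		if (!m)
			return VIRTIO_IOMMU_S_FAULT;	/* mapping not found */
		if (m->virt_start != virt)
			return VIRTIO_IOMMU_S_RANGE;	/* would split it */

		while (m && m->virt_end <= end) {
			struct mapping *next = mapping_next(as, m);
			uint64_t m_end = m->virt_end;

			mapping_remove(as, m);
			if (m_end == end)
				return VIRTIO_IOMMU_S_OK;
			m = next;
		}
		return VIRTIO_IOMMU_S_RANGE;	/* ends mid-mapping or hole */
	}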

(Note: the semantics of unmap are chosen to be compatible with VFIO's
type1 v2 IOMMU API. This way a device serving as an intermediary between
guest and VFIO doesn't have to keep an internal tree of mappings. They are
a bit tighter than VFIO's, in that they don't allow an unmap to spill
outside mapped regions. Spilling is 'undefined' at the moment, because it
should work in most cases, but I don't know if it's worth the added
complexity in devices that are not simply transmitting requests to VFIO.
Splitting mappings won't ever be allowed, but see the relaxed proposal in
3/3 for more lenient semantics.)

This request is only available when VIRTIO_IOMMU_F_MAP_UNMAP has been
negotiated.

NOENT: address space not found.
FAULT: mapping not found.
RANGE: request would split a mapping.


[VIRTIO-v1.0] Virtual I/O Device (VIRTIO) Version 1.0.  03 December 2013.
              Committee Specification Draft 01 / Public Review Draft 01.
              http://docs.oasis-open.org/virtio/virtio/v1.0/csprd01/virtio-v1.0-csprd01.html

* [RFC 3/3] virtio-iommu: future work
  2017-04-07 19:17 [RFC 0/3] virtio-iommu: a paravirtualized IOMMU Jean-Philippe Brucker
                   ` (2 preceding siblings ...)
  2017-04-07 19:17 ` Jean-Philippe Brucker
@ 2017-04-07 19:17 ` Jean-Philippe Brucker
  2017-04-21  8:31   ` Tian, Kevin
                     ` (2 more replies)
  2017-04-07 19:17 ` Jean-Philippe Brucker
                   ` (9 subsequent siblings)
  13 siblings, 3 replies; 99+ messages in thread
From: Jean-Philippe Brucker @ 2017-04-07 19:17 UTC (permalink / raw)
  To: iommu, kvm, virtualization, virtio-dev
  Cc: cdall, will.deacon, robin.murphy, lorenzo.pieralisi, joro, mst,
	jasowang, alex.williamson, marc.zyngier

Here I propose a few ideas for extensions and optimizations. This is all
very exploratory; feel free to correct mistakes and suggest more things.

	I.   Linux host
	     1. vhost-iommu
	     2. VFIO nested translation
	II.  Page table sharing
	     1. Sharing IOMMU page tables
	     2. Sharing MMU page tables (SVM)
	     3. Fault reporting
	     4. Host implementation with VFIO
	III. Relaxed operations
	IV.  Misc


  I. Linux host
  =============

  1. vhost-iommu
  --------------

An advantage of virtualizing an IOMMU using virtio is that it allows
hoisting a lot of the emulation code into the kernel using vhost, avoiding
a return to userspace for each request. The mainline kernel already
implements vhost-net, vhost-scsi and vhost-vsock, and a lot of core code
could be reused.

Introducing vhost in a simplified scenario 1 (the guest userspace
pass-through is removed, as it is irrelevant to this example) gives us the
following:

  MEM____pIOMMU________PCI device____________                    HARDWARE
            |                                \
  ----------|-------------+-------------+-----\--------------------------
            |             :     KVM     :      \
       pIOMMU drv         :             :       \                  KERNEL
            |             :             :     net drv
          VFIO            :             :       /
            |             :             :      /
       vhost-iommu_________________________virtio-iommu-drv
                          :             :
  --------------------------------------+-------------------------------
                 HOST                   :             GUEST


Introducing vhost in scenario 2, userspace now only handles the device
initialisation part, and most runtime communication is handled in the
kernel:

  MEM__pIOMMU___PCI device                                     HARDWARE
         |         |
  -------|---------|------+-------------+-------------------------------
         |         |      :     KVM     :
    pIOMMU drv     |      :             :                         KERNEL
             \__net drv   :             :
                   |      :             :
                  tap     :             :
                   |      :             :
              _vhost-net________________________virtio-net drv
         (2) /            :             :           / (1a)
            /             :             :          /
   vhost-iommu________________________________virtio-iommu drv
                          :             : (1b)
  ------------------------+-------------+-------------------------------
                 HOST                   :             GUEST

(1) a. The guest virtio driver maps the ring and buffers.
    b. Map requests are relayed to the host the same way.
(2) To access any guest memory, vhost-net must query the IOMMU. We can
    reuse the existing TLB protocol for this. TLB commands are written to
    and read from the vhost-net fd.

As defined in Linux/include/uapi/linux/vhost.h, the vhost msg structure
has everything needed for map/unmap operations:

	struct vhost_iotlb_msg {
		__u64	iova;
		__u64	size;
		__u64	uaddr;
		__u8	perm; /* R/W */
		__u8	type;
	#define VHOST_IOTLB_MISS		1
	#define VHOST_IOTLB_UPDATE		2	/* MAP */
	#define VHOST_IOTLB_INVALIDATE		3	/* UNMAP */
	#define VHOST_IOTLB_ACCESS_FAIL		4
	};

	struct vhost_msg {
		int type;
		union {
			struct vhost_iotlb_msg iotlb;
			__u8 padding[64];
		};
	};

The vhost-iommu device associates a virtual device ID with a TLB fd. We
should be able to use the same commands for [vhost-net <-> virtio-iommu]
and [virtio-net <-> vhost-iommu] communication. A virtio-net device
would open a socketpair and hand one side to vhost-iommu.

If vhost_msg is ever used for a purpose other than TLB, we'll have some
trouble, as there will be multiple clients wanting to read/write the vhost
fd. A multicast transport method will be needed. Until then, this
can work.
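
For illustration, preloading one mapping down a TLB fd with the existing
protocol could look like this (a sketch; the message types and
VHOST_ACCESS_* permissions are those defined in linux/vhost.h):

	#include <stdint.h>
	#include <unistd.h>
	#include <linux/vhost.h>

	/* Push one mapping (VHOST_IOTLB_UPDATE) to a TLB fd */
	static int iotlb_update(int tlb_fd, uint64_t iova, uint64_t size,
				uint64_t uaddr)
	{
		struct vhost_msg msg = {
			.type = VHOST_IOTLB_MSG,
			.iotlb = {
				.iova  = iova,
				.size  = size,
				.uaddr = uaddr,	/* HVA from the mem table */
				.perm  = VHOST_ACCESS_RW,
				.type  = VHOST_IOTLB_UPDATE,
			},
		};

		return write(tlb_fd, &msg, sizeof(msg)) == sizeof(msg) ? 0 : -1;
	}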

Details of operations would be:

(1) Userspace sets up vhost-iommu as with other vhost devices, by using
standard vhost ioctls. Userspace starts by describing the system topology
via ioctl:

	ioctl(iommu_fd, VHOST_IOMMU_ADD_DEVICE, struct
	      vhost_iommu_add_device)

	#define VHOST_IOMMU_DEVICE_TYPE_VFIO
	#define VHOST_IOMMU_DEVICE_TYPE_TLB

	struct vhost_iommu_add_device {
		__u8 type;
		__u32 devid;
		union {
			struct vhost_iommu_device_vfio {
				int vfio_group_fd;
			} vfio;
			struct vhost_iommu_device_tlb {
				int fd;
			} tlb;
		};
	};

(2) VIRTIO_IOMMU_T_ATTACH(address space, devid)

vhost-iommu creates an address space if necessary, and finds the device
along with the relevant operations. If the type is VFIO, operations are
done on a container; otherwise they are done on single devices.

(3) VIRTIO_IOMMU_T_MAP(address space, virt, phys, size, flags)

Turn phys into an HVA using the vhost memory table.

- If type is TLB, either preload with VHOST_IOTLB_UPDATE or store the
  mapping locally and wait for the TLB to ask for it with a
  VHOST_IOTLB_MISS.
- If type is VFIO, turn it into a VFIO_IOMMU_MAP_DMA (might need to
  introduce a shortcut in the external user API of VFIO).

(4) VIRTIO_IOMMU_T_UNMAP(address space, virt, size)

- If type is TLB, send a VHOST_IOTLB_INVALIDATE.
- If type is VFIO, turn it into VFIO_IOMMU_UNMAP_DMA.

(5) VIRTIO_IOMMU_T_DETACH(address space, devid)

Undo whatever was done in (2).


  2. VFIO nested translation
  --------------------------

For my current kvmtool implementation, I am putting each VFIO group in a
different container during initialization. We cannot detach a group from a
container at runtime without first resetting all devices in that group, so
the best way to provide dynamic address spaces right now is one container
per group. The drawback is that we need to maintain multiple sets of page
tables even if the guest wants to put all devices in the same address
space. Another disadvantage is that when implementing bypass mode, we need
to map the whole address space at the beginning, then unmap everything on
attach. Adding nested support would be a nice way to provide dynamic
address spaces while keeping groups tied to a container at all times.

A physical IOMMU may offer nested translation. In this case, address
spaces are managed by two page directories instead of one. A guest-
virtual address is translated into a guest-physical one using what we'll
call here "stage-1" (s1) page tables, and the guest-physical address is
translated into a host-physical one using "stage-2" (s2) page tables.

                             s1      s2
                         GVA --> GPA --> HPA

There isn't a lot of support in Linux for nesting IOMMU page directories
at the moment (though SVM support is coming, see II). VFIO does have a
"nesting" IOMMU type, which doesn't mean much at the moment. The ARM SMMU
code uses it to decide whether to manage the container with s2 page tables
instead of s1, but even then we still only have a single stage, and it is
assumed that IOVA=GPA.

Another model that would help with dynamically changing address spaces is
nesting VFIO containers:

                           Parent  <---------- map/unmap
                          container
                         /   |     \
                        /   group   \
                     Child         Child  <--- map/unmap
                   container     container
                    |   |             |
                 group group        group

At the beginning all groups are attached to the parent container, and
there is no child container. Doing map/unmap on the parent container maps
stage-2 page tables (map GPA -> HVA and pin the pages -> HPA). Users
should be able to choose whether they want all devices attached to this
container to be able to access GPAs (bypass mode, as it currently is) or
simply to block all DMA (in which case there is no need to pin pages
here).

At some point the guest wants to create an address space and attach
devices to it. Using an ioctl (to be defined), we can derive a child
container from the parent container, and move groups from parent to child.

This returns a child fd. When the guest maps something in this new address
space, we can do a map ioctl on the child container, which maps stage-1
page tables (map GVA -> GPA).
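
In userspace this flow could look like the sketch below, where
VFIO_DERIVE_CONTAINER is a made-up name for the ioctl to be defined and
group movement is glossed over; the other two ioctls exist today.

	/* Sketch only: VFIO_DERIVE_CONTAINER is hypothetical, and error
	 * handling is omitted. */
	static int derive_and_map(int parent_fd, int group_fd,
				  __u64 gva, __u64 gpa, __u64 size)
	{
		int child_fd = ioctl(parent_fd, VFIO_DERIVE_CONTAINER);

		/* Move the group from the parent to the child container */
		ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &child_fd);

		/* Maps on the child now populate stage-1 (GVA -> GPA) */
		struct vfio_iommu_type1_dma_map map = {
			.argsz	= sizeof(map),
			.flags	= VFIO_DMA_MAP_FLAG_READ |
				  VFIO_DMA_MAP_FLAG_WRITE,
			.iova	= gva,
			.vaddr	= gpa,	/* interpreted as a GPA here */
			.size	= size,
		};

		return ioctl(child_fd, VFIO_IOMMU_MAP_DMA, &map);
	}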

A page table walk may access multiple levels of tables (pgd, p4d, pud,
pmd, pt). With nested translation, each access to a table during the
stage-1 walk requires a stage-2 walk. This makes a full translation costly
so it is preferable to use a single stage of translation when possible.
Folding two stages into one is simple with a single container, as shown in
the kvmtool example. The host keeps track of GPA->HVA mappings, so it can
fold the full GVA->HVA mapping before sending the VFIO request. With
nested containers however, the IOMMU driver would have to do the folding
work itself. Keeping a copy of the stage-2 mappings created on the parent
container, it would fold them into the actual stage-2 page tables when
receiving a map request on the child container (note that software folding
is not possible when the stage-1 pgd is managed by the guest, as described
in the next section).
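
A sketch of that folding step, assuming the host keeps the parent's
stage-2 mappings in a shadow structure; s2_lookup() and
iommu_map_single_stage() are invented names:

	/* Sketch: fold GVA -> GPA -> HVA before programming the pIOMMU */
	static int fold_and_map(struct child_container *child,
				u64 gva, u64 gpa, u64 size, u32 flags)
	{
		/* The HVA that the parent container recorded for this GPA */
		u64 hva = s2_lookup(child->parent, gpa);

		if (!hva)
			return -ENOENT;

		/* Program a single-stage GVA -> HVA (pinned HPA) mapping */
		return iommu_map_single_stage(child, gva, hva, size, flags);
	}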

I don't know if nested VFIO containers are a desirable feature at all. I
find the concept cute on paper, and it would make it easier for userspace
to juggle address spaces, but it might require some invasive changes
in VFIO, and people have been able to use the current API for IOMMU
virtualization so far.


  II. Page table sharing
  ======================

  1. Sharing IOMMU page tables
  ----------------------------

VIRTIO_IOMMU_F_PT_SHARING

This is independent of the nested mode described in I.2, but relies on a
similar feature in the physical IOMMU: having two stages of page tables,
one for the host and one for the guest.

When this is supported, the guest can manage its own s1 page directory, to
avoid sending MAP/UNMAP requests. Feature VIRTIO_IOMMU_F_PT_SHARING allows
a driver to give a page directory pointer (pgd) to the host and send
invalidations when removing or changing a mapping. In this mode, three
requests are used: probe, attach and invalidate. An address space cannot
use the MAP/UNMAP interface and PT_SHARING at the same time.

Device and driver first need to negotiate which page table format they
will be using. This depends on the physical IOMMU, so the request contains
a negotiation part to probe the device capabilities.

(1) Driver attaches devices to address spaces as usual, but a flag
    VIRTIO_IOMMU_ATTACH_F_PRIVATE (working title) tells the device not to
    create page tables for use with the MAP/UNMAP API. The driver intends
    to manage the address space itself.

(2) Driver sends a PROBE_TABLE request. It sets len > 0 with the size of
    pg_format array.

	VIRTIO_IOMMU_T_PROBE_TABLE

	struct virtio_iommu_req_probe_table {
		le32	address_space;
		le32	flags;
		le32	len;
	
		le32	nr_contexts;
		struct {
			le32	model;
			u8	format[64];
		} pg_format[len];
	};

Introducing a probe request is more flexible than advertising those
features in virtio config, because capabilities are dynamic, and depend on
which devices are attached to an address space. Within a single address
space, devices may support different numbers of contexts (PASIDs), and
some may not support recoverable faults.

(3) The device responds with success, filling pg_format with all page
    table formats implemented by the physical IOMMU. 'model' 0 is invalid,
    so the driver can initialize the array to 0 and deduce from there
    which entries have been filled by the device.
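
On the driver side, consuming (3) could look like the sketch below,
reusing viommu_send_req_sync() from the driver patch; the remaining
helpers are elided and error handling is minimal.

	/* Sketch: probe formats for an address space, then scan the
	 * zero-initialized pg_format array for entries the device filled. */
	static int viommu_probe_formats(struct viommu_dev *viommu, u32 as_id)
	{
		const size_t nr = 4;	/* arbitrary number of slots */
		struct virtio_iommu_req_probe_table *probe;
		size_t i;
		int ret;

		probe = kzalloc(sizeof(*probe) +
				nr * sizeof(probe->pg_format[0]), GFP_KERNEL);
		if (!probe)
			return -ENOMEM;

		probe->address_space	= cpu_to_le32(as_id);
		probe->len		= cpu_to_le32(nr);

		ret = viommu_send_req_sync(viommu, probe);

		for (i = 0; !ret && i < nr; i++) {
			if (!le32_to_cpu(probe->pg_format[i].model))
				break;	/* device filled no more entries */
			/* ... match model/format against io-pgtable ... */
		}

		kfree(probe);
		return ret;
	}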

Using a probe method seems preferable over trying to attach every possible
format until one sticks. For instance, with an ARM guest running on an x86
host, PROBE_TABLE would return the Intel IOMMU page table format, and the
guest could use that page table code to handle its mappings, hidden behind
the IOMMU API. This requires that the page-table code is reasonably
abstracted from the architecture, as is done with drivers/iommu/io-pgtable
(an x86 guest could use any format implemented by io-pgtable, for example.)

(4) If the driver is able to use this format, it sends the ATTACH_TABLE
    request.

	VIRTIO_IOMMU_T_ATTACH_TABLE

	struct virtio_iommu_req_attach_table {
		le32	address_space;
		le32	flags;
		le64	table;
	
		le32	nr_contexts;
		/* Page-table format description */
	
		le32	model;
		u8	config[64];
	};


    'table' is a pointer to the page directory. 'nr_contexts' isn't used
    here.

    For both ATTACH and PROBE, 'flags' are the following (and will be
    explained later):

	VIRTIO_IOMMU_ATTACH_TABLE_F_INDIRECT	(1 << 0)
	VIRTIO_IOMMU_ATTACH_TABLE_F_NATIVE	(1 << 1)
	VIRTIO_IOMMU_ATTACH_TABLE_F_FAULT	(1 << 2)

Now 'model' is a bit tricky. We need to specify all possible page table
formats and their parameters. I'm not well-versed in x86, s390 or other
IOMMUs, so I'll just focus on the ARM world for this example. We basically
have two page table models, with a multitude of configuration bits:

	* ARM LPAE
	* ARM short descriptor

We could define a high-level identifier per page-table model, such as:

	#define PG_TABLE_ARM	0x1
	#define PG_TABLE_X86	0x2
	...

And each model would define its own structure. On ARM 'format' could be a
simple u32 defining a variant, LPAE 32/64 or short descriptor. It could
also contain additional capabilities. Then depending on the variant,
'config' would be:

	struct pg_config_v7s {
		le32	tcr;
		le32	prrr;
		le32	nmrr;
		le32	asid;
	};
	
	struct pg_config_lpae {
		le64	tcr;
		le64	mair;
		le32	asid;
	
		/* And maybe TTB1? */
	};

	struct pg_config_arm {
		le32	variant;
		union ...;
	};

I am really uneasy with describing all those nasty architectural details
in the virtio-iommu specification. We certainly won't start describing the
content of tcr or mair bit-by-bit here, but just declaring these fields
might be sufficient.

(5) Once the table is attached, the driver can simply write the page
    tables and expect the physical IOMMU to observe the mappings without
    any additional request. When changing or removing a mapping, however,
    the driver must send an invalidate request.

	VIRTIO_IOMMU_T_INVALIDATE

	struct virtio_iommu_req_invalidate {
		le32	address_space;
		le32	context;
		le32	flags;
		le64	virt_addr;
		le64	range_size;
	
		u8	opaque[64];
	};

    'flags' may be:

    VIRTIO_IOMMU_INVALIDATE_T_VADDR: invalidate a single VA range
      from 'context' (context is 0 when !F_INDIRECT).

    And with context tables only (explained below):

    VIRTIO_IOMMU_INVALIDATE_T_SINGLE: invalidate all mappings from
      'context' (context is 0 when !F_INDIRECT). virt_addr and range_size
      are ignored.

    VIRTIO_IOMMU_INVALIDATE_T_TABLE: with F_INDIRECT, invalidate entries
      in the table that changed. The device reads the table again, compares
      it to previous values, and invalidates all mappings for contexts that
      changed. context, virt_addr and range_size are ignored.

IOMMUs may offer hints and quirks in their invalidation packets. The
opaque structure in invalidate would allow those to be transported. This
depends on the page table format and as with architectural page-table
definitions, I really don't want to have those details in the spec itself.
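
For example, an unmap in this mode could look like the following sketch,
where pgtable_clear() and the vdomain->pgtable field are invented
stand-ins for the io-pgtable operations; the request layout is the one
above.

	/* Sketch of an unmap with shared page tables */
	static int viommu_pt_unmap(struct viommu_domain *vdomain,
				   u64 iova, u64 size)
	{
		struct virtio_iommu_req_invalidate inv = {
			.address_space	= cpu_to_le32(vdomain->id),
			.flags		= cpu_to_le32(VIRTIO_IOMMU_INVALIDATE_T_VADDR),
			.virt_addr	= cpu_to_le64(iova),
			.range_size	= cpu_to_le64(size),
			/* context stays 0: no F_INDIRECT here */
		};

		/* Update the tables first so the device can't re-walk them */
		pgtable_clear(vdomain->pgtable, iova, size);

		return viommu_send_req_sync(vdomain->viommu, &inv);
	}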


  2. Sharing MMU page tables
  --------------------------

The guest can share process page-tables with the physical IOMMU. To do
that, it sends PROBE_TABLE with (F_INDIRECT | F_NATIVE | F_FAULT). The
page table format is implicit, so the pg_format array can be empty (unless
the guest wants to query some specific property, e.g. number of levels
supported by the pIOMMU?). If the host answers with success, the guest
can send its MMU page table details with ATTACH_TABLE and (F_NATIVE |
F_INDIRECT | F_FAULT) flags.

F_FAULT means that the host communicates page requests from the device to
the guest, and the guest can handle them by mapping the virtual address in
the fault to a page. It is only available with VIRTIO_IOMMU_F_FAULT_QUEUE (see
below.)

F_NATIVE means that the pIOMMU pgtable format is the same as guest MMU
pgtable format.

F_INDIRECT means that the 'table' pointer is a context table instead of a
page directory. Each slot in the context table points to a page directory:

                       64              2 1 0
          table ----> +---------------------+
                      |       pgd       |0|1|<--- context 0
                      |       ---       |0|0|<--- context 1
                      |       pgd       |0|1|
                      |       ---       |0|0|
                      |       ---       |0|0|
                      +---------------------+
                                         | \___Entry is valid
                                         |______reserved

Question: do we want per-context page table format, or can it stay global
for the whole indirect table?

Having a context table makes it possible to provide multiple address
spaces for a single device. In the simplest form, without F_INDIRECT, we
have a single address space per device, but some devices may implement
more, for instance devices with the PCI PASID extension.

A slot's position in the context table gives an ID, between 0 and
nr_contexts. The guest can use this ID to have the device target a
specific address space with DMA. The mechanism to do that is
device-specific. For a PCI device, the ID is a PASID, and PCI doesn't
define a specific way of using them for DMA, it's the device driver's
concern.
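
With the encoding drawn above (bit 0 valid, bit 1 reserved), installing a
pgd for a given context could be as simple as the following sketch; the
macro and helper names are illustrative.

	#define CTX_ENTRY_VALID		(1ULL << 0)

	static void ctx_table_set_pgd(__le64 *table, u32 ctx_id, u64 pgd)
	{
		/* The device targets this entry with ID (PASID) ctx_id */
		table[ctx_id] = cpu_to_le64((pgd & ~0x3ULL) | CTX_ENTRY_VALID);
	}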


  3. Fault reporting
  ------------------

VIRTIO_IOMMU_F_EVENT_QUEUE

With this feature, an event virtqueue (1) is available. For now it will
only be used for fault handling, but I'm calling it eventq so that other
asynchronous features can piggy-back on it. The device may report faults
and page requests by sending buffers via the used ring.

	#define VIRTIO_IOMMU_T_FAULT	0x05

	struct virtio_iommu_evt_fault {
		struct virtio_iommu_evt_head {
			u8 type;
			u8 reserved[3];
		};
	
		u32 address_space;
		u32 context;
	
		u64 vaddr;
		u32 flags;	/* Access details: R/W/X */
	
		/* In the reply: */
		u32 reply;	/* Fault handled, or failure */
		u64 paddr;
	};

Driver must send the reply via the request queue, with the fault status
in 'reply', and the mapped page in 'paddr' on success.

Existing fault handling interfaces such as PRI have a tag (PRG) that
identifies a page request (or group thereof) when sending a reply. I
wonder if this would be useful to us, but it seems like the
(address_space, context, vaddr) tuple is sufficient to identify a page
fault, provided the device doesn't send duplicate faults. Duplicate faults
could be required if they have a side effect, for instance implementing a
poor man's doorbell. If this is desirable, we could add a fault_id field.
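
A guest-side reply could then look like the sketch below; the status
values are invented names and viommu_send_req_sync() is the helper from
the driver patch.

	/* Sketch: fill 'reply' and 'paddr' in the fault buffer and send
	 * it back on the request queue, not the event queue. */
	static int viommu_fault_reply(struct viommu_dev *viommu,
				      struct virtio_iommu_evt_fault *evt,
				      bool handled, u64 paddr)
	{
		evt->reply = handled ? VIRTIO_IOMMU_FAULT_HANDLED
				     : VIRTIO_IOMMU_FAULT_FAILURE;
		evt->paddr = handled ? paddr : 0;

		return viommu_send_req_sync(viommu, evt);
	}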


  4. Host implementation with VFIO
  --------------------------------

The VFIO interface for sharing page tables is being worked on at the
moment by Intel. Other virtual IOMMU implementations will most likely let
the guest manage full context tables (PASID tables) itself, giving the
context table pointer to the pIOMMU via a VFIO ioctl.

For the architecture-agnostic virtio-iommu however, we shouldn't have to
implement all possible formats of context table (they are at least
different between ARM SMMU and Intel IOMMU, and will certainly be extended
in future physical IOMMU architectures.) In addition, most users might
only care about having one page directory per device, as SVM is a luxury
at the moment and few devices support it. For these reasons, we should
allow passing single page directories via VFIO, using structures very
similar to those described above, whilst reusing the VFIO channel
developed for Intel vIOMMU.

	* VFIO_SVM_INFO: probe page table formats
	* VFIO_SVM_BIND: set pgd and arch-specific configuration
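
The BIND argument could mirror virtio_iommu_req_attach_table above.
Nothing below is an existing VFIO ABI; it is only a sketch of the shape:

	/* Hypothetical ioctl argument, mirroring the attach_table request */
	struct vfio_svm_bind {
		__u32	argsz;
		__u32	flags;
		__u64	pgd;		/* single page directory */
		__u32	model;		/* page-table model, as probed */
		__u8	config[64];	/* arch-specific bits (tcr, mair, asid...) */
	};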

There is one inconvenience in letting the pIOMMU driver manage the guest's
context table. During a page table walk, the pIOMMU translates the context
table pointer using the stage-2 page tables. The context table must
therefore be mapped in guest-physical space by the pIOMMU driver. One
solution is to let the pIOMMU driver reserve some GPA space upfront using
the iommu and sysfs resv API [1]. The host would then carve that region
out of the guest-physical space using a firmware mechanism (for example DT
reserved-memory node).


  III. Relaxed operations
  =======================

VIRTIO_IOMMU_F_RELAXED

Adding an IOMMU dramatically reduces performance of a device, because
map/unmap operations are costly and produce a lot of TLB traffic. For
significant performance improvements, the device might allow the driver to
sacrifice safety for speed. In this mode, the driver does not need to send
UNMAP requests. The semantics of MAP change and are more complex to
implement. Given a MAP([start:end] -> phys, flags) request:

(1) If [start:end] isn't mapped, request succeeds as usual.
(2) If [start:end] overlaps an existing mapping [old_start:old_end], we
    unmap [max(start, old_start):min(end, old_end)] and replace it with
    [start:end].
(3) If [start:end] overlaps an existing mapping that matches the new map
    request exactly (same flags, same phys address), the old mapping is
    kept.

This squashing could be performed by the guest. The driver can catch unmap
requests from the DMA layer, and only relay map requests for (1) and (2).
A MAP request is therefore able to split and partially override an
existing mapping, which isn't allowed in non-relaxed mode. UNMAP requests
are unnecessary, but are now allowed to split or carve holes in mappings.
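 
In the guest driver, the decision could reuse the mappings tree from the
driver patch; viommu_find_mapping() and the stored 'flags' are
assumptions, and the rule numbers refer to the list above.

	/* Sketch: decide whether a relaxed MAP request must be sent */
	static bool viommu_relaxed_need_map(struct viommu_domain *vdomain,
					    u64 iova, phys_addr_t paddr,
					    u64 size, u32 flags)
	{
		struct viommu_mapping *old = viommu_find_mapping(vdomain, iova);

		/* (3): an identical mapping already exists, keep it */
		if (old && old->paddr == paddr && old->flags == flags &&
		    old->iova.start == iova && old->iova.last == iova + size - 1)
			return false;

		/* (1) or (2): send the MAP and let the device squash */
		return true;
	}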

In this model, a MAP request may take longer, but we may have a net gain
by removing a lot of redundant requests. Squashing series of map/unmap
performed by the guest for the same mapping improves temporal reuse of
IOVA mappings, which I can observe by simply dumping IOMMU activity of a
virtio device. It reduce the number of TLB invalidations to the strict
minimum while keeping correctness of DMA operations (provided the device
obeys its driver). There is a good read on the subject of optimistic
teardown in paper [2].

This model is completely unsafe. A stale DMA transaction might access a
page long after the device driver in the guest unmapped it and
decommissioned the page. The DMA transaction might hit a completely
different part of the system that is now reusing the page. Existing
relaxed implementations attempt to mitigate the risk by setting a timeout
on the teardown. Unmap requests from device drivers are not discarded
entirely, but buffered and sent at a later time. Paper [2] reports good
results with a 10ms delay.

We could add a way for device and driver to negotiate a vulnerability
window to mitigate the risk of DMA attacks. The driver might not accept a
window at all, since it requires more infrastructure to keep delayed
mappings. In my opinion, it should be made clear that regardless of the
duration of this window, any driver accepting F_RELAXED feature makes the
guest completely vulnerable, and the choice boils down to either isolation
or speed, not a bit of both.


  IV. Misc
  ========

I think we have enough to go on for a while. To improve MAP throughput, I
considered adding a MAP_SG request depending on a feature bit, with
variable size:

	struct virtio_iommu_req_map_sg {
		struct virtio_iommu_req_head;
		u32	address_space;
		u32	nr_elems;
		u64	virt_addr;
		u64	size;
		u64	phys_addr[nr_elems];
	};

Would create the following mappings:

	virt_addr		-> phys_addr[0]
	virt_addr + size	-> phys_addr[1]
	virt_addr + 2 * size	-> phys_addr[2]
	...
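
On the device side, unrolling the request is a simple loop; device_map()
stands in for the per-backend map operation.

	/* Sketch: expand MAP_SG into nr_elems mappings of equal size */
	for (i = 0; i < nr_elems; i++)
		device_map(as, virt_addr + i * size, phys_addr[i], size);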

This would avoid the overhead of multiple map commands. We could try to
find a more cunning format to compress virtually-contiguous mappings with
different (phys, size) pairs as well. But Linux drivers rarely prefer
map_sg() functions over regular map(), so I don't know if the whole map_sg
feature is worth the effort. All we would gain is a few bytes anyway.

My current map_sg implementation in the virtio-iommu driver adds a batch
of map requests to the queue and kicks the host once. That might be enough
of an optimization.


Another invasive optimization would be adding grouped requests. By adding
two flags in the header, L and G, we can group sequences of requests
together, and have one status at the end, either 0 if all requests in the
group succeeded, or the status of the first request that failed. This is
all in-order. Requests in a group follow each other; there is no sequence
identifier.

	                       ___ L: request is last in the group
	                      /  _ G: request is part of a group
	                     |  /
	                     v v
	31                   9 8 7      0
	+--------------------------------+ <------- RO descriptor
	|        res0       |0|1|  type  |
	+--------------------------------+
	|            payload             |
	+--------------------------------+
	|        res0       |0|1|  type  |
	+--------------------------------+
	|            payload             |
	+--------------------------------+
	|        res0       |0|1|  type  |
	+--------------------------------+
	|            payload             |
	+--------------------------------+
	|        res0       |1|1|  type  |
	+--------------------------------+
	|            payload             |
	+--------------------------------+ <------- WO descriptor
	|        res0           | status |
	+--------------------------------+

This adds some complexity on the device, since it must unroll whatever was
done by successful requests in a group as soon as one fails, and reject
all subsequent ones. A group of requests is an atomic operation. As with
map_sg, this change mostly serves to save space and virtio descriptors.
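
Following the layout above, G and L would take two of the previously
reserved bits next to 'type'; the names are invented.

	/* Bits 8 (G) and 9 (L) of the first word, per the diagram above */
	#define VIRTIO_IOMMU_REQ_F_GROUP	(1 << 8)	/* part of a group */
	#define VIRTIO_IOMMU_REQ_F_LAST		(1 << 9)	/* last in the group */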


[1] https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-kernel-iommu_groups
[2] vIOMMU: Efficient IOMMU Emulation
    N. Amit, M. Ben-Yehuda, D. Tsafrir, A. Schuster

^ permalink raw reply	[flat|nested] 99+ messages in thread

* [RFC PATCH linux] iommu: Add virtio-iommu driver
  2017-04-07 19:17 [RFC 0/3] virtio-iommu: a paravirtualized IOMMU Jean-Philippe Brucker
                   ` (4 preceding siblings ...)
  2017-04-07 19:17 ` Jean-Philippe Brucker
@ 2017-04-07 19:23 ` Jean-Philippe Brucker
  2017-06-16  8:48   ` [virtio-dev] " Bharat Bhushan
  2017-06-16  8:48   ` Bharat Bhushan
  2017-04-07 19:24 ` [RFC PATCH kvmtool 00/15] Add virtio-iommu Jean-Philippe Brucker
                   ` (7 subsequent siblings)
  13 siblings, 2 replies; 99+ messages in thread
From: Jean-Philippe Brucker @ 2017-04-07 19:23 UTC (permalink / raw)
  To: iommu, kvm, virtualization, virtio-dev
  Cc: cdall, lorenzo.pieralisi, mst, marc.zyngier, joro, will.deacon,
	robin.murphy

The virtio IOMMU is a para-virtualized device, allowing IOMMU requests
such as map/unmap to be sent over the virtio-mmio transport. This driver
should illustrate the initial proposal for virtio-iommu, which you
hopefully received with it. It handles attach, detach, map and unmap
requests.

The bulk of the code is to create requests and send them through virtio.
Implementing the IOMMU API is fairly straightforward since the
virtio-iommu MAP/UNMAP interface is almost identical. I threw in a custom
map_sg() function which takes up some space, but is optional. The core
function would send a sequence of map requests, waiting for a reply
between each mapping. This optimization avoids yielding to the host after
each map, and instead prepares a batch of requests in the virtio ring and
kicks the host once.

It must be applied on top of the probe deferral work for IOMMU, currently
under discussion. This makes it possible to dissociate early driver detection from
device probing: device-tree or ACPI is parsed early to find which devices
are translated by the IOMMU, but the IOMMU itself cannot be probed until
the core virtio module is loaded.

Enabling DEBUG makes it extremely verbose at the moment, but it should be
calmer in next versions.

Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
---
 drivers/iommu/Kconfig             |  11 +
 drivers/iommu/Makefile            |   1 +
 drivers/iommu/virtio-iommu.c      | 980 ++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/Kbuild         |   1 +
 include/uapi/linux/virtio_ids.h   |   1 +
 include/uapi/linux/virtio_iommu.h | 142 ++++++
 6 files changed, 1136 insertions(+)
 create mode 100644 drivers/iommu/virtio-iommu.c
 create mode 100644 include/uapi/linux/virtio_iommu.h

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 37e204f3d9be..8cd56ee9a93a 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -359,4 +359,15 @@ config MTK_IOMMU_V1
 
 	  if unsure, say N here.
 
+config VIRTIO_IOMMU
+	tristate "Virtio IOMMU driver"
+	depends on VIRTIO_MMIO
+	select IOMMU_API
+	select INTERVAL_TREE
+	select ARM_DMA_USE_IOMMU if ARM
+	help
+	  Para-virtualised IOMMU driver with virtio.
+
+	  Say Y here if you intend to run this kernel as a guest.
+
 endif # IOMMU_SUPPORT
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index 195f7b997d8e..1199d8475802 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -27,3 +27,4 @@ obj-$(CONFIG_TEGRA_IOMMU_SMMU) += tegra-smmu.o
 obj-$(CONFIG_EXYNOS_IOMMU) += exynos-iommu.o
 obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o
 obj-$(CONFIG_S390_IOMMU) += s390-iommu.o
+obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
new file mode 100644
index 000000000000..1cf4f57b7817
--- /dev/null
+++ b/drivers/iommu/virtio-iommu.c
@@ -0,0 +1,980 @@
+/*
+ * Virtio driver for the paravirtualized IOMMU
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) 2017 ARM Limited
+ *
+ * Author: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/amba/bus.h>
+#include <linux/delay.h>
+#include <linux/dma-iommu.h>
+#include <linux/freezer.h>
+#include <linux/interval_tree.h>
+#include <linux/iommu.h>
+#include <linux/module.h>
+#include <linux/of_iommu.h>
+#include <linux/of_platform.h>
+#include <linux/platform_device.h>
+#include <linux/virtio.h>
+#include <linux/virtio_config.h>
+#include <linux/virtio_ids.h>
+#include <linux/wait.h>
+
+#include <uapi/linux/virtio_iommu.h>
+
+struct viommu_dev {
+	struct iommu_device		iommu;
+	struct device			*dev;
+	struct virtio_device		*vdev;
+
+	struct virtqueue		*vq;
+	struct list_head		pending_requests;
+	/* Serialize anything touching the vq and the request list */
+	spinlock_t			vq_lock;
+
+	struct list_head		list;
+
+	/* Device configuration */
+	u64				pgsize_bitmap;
+	u64				aperture_start;
+	u64				aperture_end;
+};
+
+struct viommu_mapping {
+	phys_addr_t			paddr;
+	struct interval_tree_node	iova;
+};
+
+struct viommu_domain {
+	struct iommu_domain		domain;
+	struct viommu_dev		*viommu;
+	struct mutex			mutex;
+	u64				id;
+
+	spinlock_t			mappings_lock;
+	struct rb_root			mappings;
+
+	/* Number of devices attached to this domain */
+	unsigned long			attached;
+};
+
+struct viommu_endpoint {
+	struct viommu_dev		*viommu;
+	struct viommu_domain		*vdomain;
+};
+
+struct viommu_request {
+	struct scatterlist		head;
+	struct scatterlist		tail;
+
+	int				written;
+	struct list_head		list;
+};
+
+/* TODO: use an IDA */
+static atomic64_t viommu_domain_ids_gen;
+
+#define to_viommu_domain(domain) container_of(domain, struct viommu_domain, domain)
+
+/* Virtio transport */
+
+static int viommu_status_to_errno(u8 status)
+{
+	switch (status) {
+	case VIRTIO_IOMMU_S_OK:
+		return 0;
+	case VIRTIO_IOMMU_S_UNSUPP:
+		return -ENOSYS;
+	case VIRTIO_IOMMU_S_INVAL:
+		return -EINVAL;
+	case VIRTIO_IOMMU_S_RANGE:
+		return -ERANGE;
+	case VIRTIO_IOMMU_S_NOENT:
+		return -ENOENT;
+	case VIRTIO_IOMMU_S_FAULT:
+		return -EFAULT;
+	case VIRTIO_IOMMU_S_IOERR:
+	case VIRTIO_IOMMU_S_DEVERR:
+	default:
+		return -EIO;
+	}
+}
+
+static int viommu_get_req_size(struct virtio_iommu_req_head *req, size_t *head,
+			       size_t *tail)
+{
+	size_t size;
+	union virtio_iommu_req r;
+
+	*tail = sizeof(struct virtio_iommu_req_tail);
+
+	switch (req->type) {
+	case VIRTIO_IOMMU_T_ATTACH:
+		size = sizeof(r.attach);
+		break;
+	case VIRTIO_IOMMU_T_DETACH:
+		size = sizeof(r.detach);
+		break;
+	case VIRTIO_IOMMU_T_MAP:
+		size = sizeof(r.map);
+		break;
+	case VIRTIO_IOMMU_T_UNMAP:
+		size = sizeof(r.unmap);
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	*head = size - *tail;
+	return 0;
+}
+
+static int viommu_receive_resp(struct viommu_dev *viommu, int nr_expected)
+{
+	unsigned int len;
+	int nr_received = 0;
+	struct viommu_request *req, *pending, *next;
+
+	pending = list_first_entry_or_null(&viommu->pending_requests,
+					   struct viommu_request, list);
+	if (WARN_ON(!pending))
+		return 0;
+
+	while ((req = virtqueue_get_buf(viommu->vq, &len)) != NULL) {
+		if (req != pending) {
+			dev_warn(viommu->dev, "discarding stale request\n");
+			continue;
+		}
+
+		pending->written = len;
+
+		if (++nr_received == nr_expected) {
+			list_del(&pending->list);
+			/*
+			 * In an ideal world, we'd wake up the waiter for this
+			 * group of requests here. But everything is painfully
+			 * synchronous, so waiter is the caller.
+			 */
+			break;
+		}
+
+		next = list_next_entry(pending, list);
+		list_del(&pending->list);
+
+		if (WARN_ON(list_empty(&viommu->pending_requests)))
+			return 0;
+
+		pending = next;
+	}
+
+	return nr_received;
+}
+
+/* Must be called with vq_lock held */
+static int _viommu_send_reqs_sync(struct viommu_dev *viommu,
+				  struct viommu_request *req, int nr,
+				  int *nr_sent)
+{
+	int i, ret;
+	ktime_t timeout;
+	int nr_received = 0;
+	struct scatterlist *sg[2];
+	/*
+	 * FIXME: as it stands, 1s timeout per request. This is a voluntary
+	 * exaggeration because I have no idea how real our ktime is. Are we
+	 * using a RTC? Are we aware of steal time? I don't know much about
+	 * this, need to do some digging.
+	 */
+	unsigned long timeout_ms = 1000;
+
+	*nr_sent = 0;
+
+	for (i = 0; i < nr; i++, req++) {
+		/*
+		 * The backend will allocate one indirect descriptor for each
+		 * request, which allows to double the ring consumption, but
+		 * might be slower.
+		 */
+		req->written = 0;
+
+		sg[0] = &req->head;
+		sg[1] = &req->tail;
+
+		ret = virtqueue_add_sgs(viommu->vq, sg, 1, 1, req,
+					GFP_ATOMIC);
+		if (ret)
+			break;
+
+		list_add_tail(&req->list, &viommu->pending_requests);
+	}
+
+	if (i && !virtqueue_kick(viommu->vq))
+		return -EPIPE;
+
+	/*
+	 * Absolutely no wiggle room here. We're not allowed to sleep as callers
+	 * might be holding spinlocks, so we have to poll like savages until
+	 * something appears. Hopefully the host already handled the request
+	 * during the above kick and returned it to us.
+	 *
+	 * A nice improvement would be for the caller to tell us if we can sleep
+	 * whilst mapping, but this has to go through the IOMMU/DMA API.
+	 */
+	timeout = ktime_add_ms(ktime_get(), timeout_ms * i);
+	while (nr_received < i && ktime_before(ktime_get(), timeout)) {
+		nr_received += viommu_receive_resp(viommu, i - nr_received);
+		if (nr_received < i) {
+			/*
+			 * FIXME: what's a good way to yield to host? A second
+			 * virtqueue_kick won't have any effect since we haven't
+			 * added any descriptor.
+			 */
+			udelay(10);
+		}
+	}
+	dev_dbg(viommu->dev, "request took %lld us\n",
+		ktime_us_delta(ktime_get(), ktime_sub_ms(timeout, timeout_ms * i)));
+
+	if (nr_received != i)
+		ret = -ETIMEDOUT;
+
+	if (ret == -ENOSPC && nr_received)
+		/*
+		 * We've freed some space since virtio told us that the ring is
+		 * full, tell the caller to come back later (after releasing the
+		 * lock first, to be fair to other threads)
+		 */
+		ret = -EAGAIN;
+
+	*nr_sent = nr_received;
+
+	return ret;
+}
+
+/**
+ * viommu_send_reqs_sync - add a batch of requests, kick the host and wait for
+ *                         them to return
+ *
+ * @req: array of requests
+ * @nr: size of the array
+ * @nr_sent: contains the number of requests actually sent after this function
+ *           returns
+ *
+ * Return 0 on success, or an error if we failed to send some of the requests.
+ */
+static int viommu_send_reqs_sync(struct viommu_dev *viommu,
+				 struct viommu_request *req, int nr,
+				 int *nr_sent)
+{
+	int ret;
+	int sent = 0;
+	unsigned long flags;
+
+	*nr_sent = 0;
+	do {
+		spin_lock_irqsave(&viommu->vq_lock, flags);
+		ret = _viommu_send_reqs_sync(viommu, req, nr, &sent);
+		spin_unlock_irqrestore(&viommu->vq_lock, flags);
+
+		*nr_sent += sent;
+		req += sent;
+		nr -= sent;
+	} while (ret == -EAGAIN);
+
+	return ret;
+}
+
+/**
+ * viommu_send_req_sync - send one request and wait for reply
+ *
+ * @head_ptr: pointer to a virtio_iommu_req_* structure
+ *
+ * Returns 0 if the request was successful, or an error number otherwise. No
+ * distinction is done between transport and request errors.
+ */
+static int viommu_send_req_sync(struct viommu_dev *viommu, void *head_ptr)
+{
+	int ret;
+	int nr_sent;
+	struct viommu_request req;
+	size_t head_size, tail_size;
+	struct virtio_iommu_req_tail *tail;
+	struct virtio_iommu_req_head *head = head_ptr;
+
+	ret = viommu_get_req_size(head, &head_size, &tail_size);
+	if (ret)
+		return ret;
+
+	dev_dbg(viommu->dev, "Sending request 0x%x, %zu bytes\n", head->type,
+		head_size + tail_size);
+
+	tail = head_ptr + head_size;
+
+	sg_init_one(&req.head, head, head_size);
+	sg_init_one(&req.tail, tail, tail_size);
+
+	ret = viommu_send_reqs_sync(viommu, &req, 1, &nr_sent);
+	if (ret || !req.written || nr_sent != 1) {
+		dev_err(viommu->dev, "failed to send command\n");
+		return -EIO;
+	}
+
+	ret = viommu_status_to_errno(tail->status);
+
+	if (ret)
+		dev_dbg(viommu->dev, " completed with %d\n", ret);
+
+	return ret;
+}
+
+static int viommu_tlb_map(struct viommu_domain *vdomain, unsigned long iova,
+			  phys_addr_t paddr, size_t size)
+{
+	unsigned long flags;
+	struct viommu_mapping *mapping;
+
+	mapping = kzalloc(sizeof(*mapping), GFP_ATOMIC);
+	if (!mapping)
+		return -ENOMEM;
+
+	mapping->paddr = paddr;
+	mapping->iova.start = iova;
+	mapping->iova.last = iova + size - 1;
+
+	spin_lock_irqsave(&vdomain->mappings_lock, flags);
+	interval_tree_insert(&mapping->iova, &vdomain->mappings);
+	spin_unlock_irqrestore(&vdomain->mappings_lock, flags);
+
+	return 0;
+}
+
+static size_t viommu_tlb_unmap(struct viommu_domain *vdomain,
+			       unsigned long iova, size_t size)
+{
+	size_t unmapped = 0;
+	unsigned long flags;
+	unsigned long last = iova + size - 1;
+	struct viommu_mapping *mapping = NULL;
+	struct interval_tree_node *node, *next;
+
+	spin_lock_irqsave(&vdomain->mappings_lock, flags);
+	next = interval_tree_iter_first(&vdomain->mappings, iova, last);
+	while (next) {
+		node = next;
+		mapping = container_of(node, struct viommu_mapping, iova);
+
+		next = interval_tree_iter_next(node, iova, last);
+
+		/*
+		 * Note that for a partial range, this removes and counts the
+		 * full mapping, so that we never send split requests to the
+		 * device.
+		 */
+		unmapped += mapping->iova.last - mapping->iova.start + 1;
+
+		interval_tree_remove(node, &vdomain->mappings);
+		kfree(mapping);
+	}
+	spin_unlock_irqrestore(&vdomain->mappings_lock, flags);
+
+	return unmapped;
+}
+
+/* IOMMU API */
+
+static bool viommu_capable(enum iommu_cap cap)
+{
+	return false; /* :( */
+}
+
+static struct iommu_domain *viommu_domain_alloc(unsigned type)
+{
+	struct viommu_domain *vdomain;
+
+	if (type != IOMMU_DOMAIN_UNMANAGED && type != IOMMU_DOMAIN_DMA)
+		return NULL;
+
+	vdomain = kzalloc(sizeof(struct viommu_domain), GFP_KERNEL);
+	if (!vdomain)
+		return NULL;
+
+	vdomain->id = atomic64_inc_return_relaxed(&viommu_domain_ids_gen);
+
+	mutex_init(&vdomain->mutex);
+	spin_lock_init(&vdomain->mappings_lock);
+	vdomain->mappings = RB_ROOT;
+
+	pr_debug("alloc domain of type %d -> %llu\n", type, vdomain->id);
+
+	if (type == IOMMU_DOMAIN_DMA &&
+	    iommu_get_dma_cookie(&vdomain->domain)) {
+		kfree(vdomain);
+		return NULL;
+	}
+
+	return &vdomain->domain;
+}
+
+static void viommu_domain_free(struct iommu_domain *domain)
+{
+	struct viommu_domain *vdomain = to_viommu_domain(domain);
+
+	pr_debug("free domain %llu\n", vdomain->id);
+
+	iommu_put_dma_cookie(domain);
+
+	/* Free all remaining mappings (size 2^64) */
+	viommu_tlb_unmap(vdomain, 0, 0);
+
+	kfree(vdomain);
+}
+
+static int viommu_attach_dev(struct iommu_domain *domain, struct device *dev)
+{
+	int i;
+	int ret = 0;
+	struct iommu_fwspec *fwspec = dev->iommu_fwspec;
+	struct viommu_endpoint *vdev = fwspec->iommu_priv;
+	struct viommu_domain *vdomain = to_viommu_domain(domain);
+	struct virtio_iommu_req_attach req = {
+		.head.type	= VIRTIO_IOMMU_T_ATTACH,
+		.address_space	= cpu_to_le32(vdomain->id),
+	};
+
+	mutex_lock(&vdomain->mutex);
+	if (!vdomain->viommu) {
+		struct viommu_dev *viommu = vdev->viommu;
+
+		vdomain->viommu = viommu;
+
+		domain->pgsize_bitmap		= viommu->pgsize_bitmap;
+		domain->geometry.aperture_start	= viommu->aperture_start;
+		domain->geometry.aperture_end	= viommu->aperture_end;
+		domain->geometry.force_aperture	= true;
+
+	} else if (vdomain->viommu != vdev->viommu) {
+		dev_err(dev, "cannot attach to foreign VIOMMU\n");
+		ret = -EXDEV;
+	}
+	mutex_unlock(&vdomain->mutex);
+
+	if (ret)
+		return ret;
+
+	/*
+	 * When attaching the device to a new domain, it will be detached from
+	 * the old one and, if as a result the old domain isn't attached to
+	 * any device, all mappings are removed from the old domain and it is
+	 * freed. (Note that we can't use get_domain_for_dev here, it returns
+	 * the default domain during initial attach.)
+	 *
+	 * Take note of the device disappearing, so we can ignore unmap
+	 * requests on stale domains (that is, between this detach and the
+	 * upcoming free.)
+	 *
+	 * vdev->vdomain is protected by group->mutex
+	 */
+	if (vdev->vdomain) {
+		dev_dbg(dev, "detach from domain %llu\n", vdev->vdomain->id);
+		vdev->vdomain->attached--;
+	}
+
+	dev_dbg(dev, "attach to domain %llu\n", vdomain->id);
+
+	for (i = 0; i < fwspec->num_ids; i++) {
+		req.device = cpu_to_le32(fwspec->ids[i]);
+
+		ret = viommu_send_req_sync(vdomain->viommu, &req);
+		if (ret)
+			break;
+	}
+
+	vdomain->attached++;
+	vdev->vdomain = vdomain;
+
+	return ret;
+}
+
+static int viommu_map(struct iommu_domain *domain, unsigned long iova,
+		      phys_addr_t paddr, size_t size, int prot)
+{
+	int ret;
+	struct viommu_domain *vdomain = to_viommu_domain(domain);
+	struct virtio_iommu_req_map req = {
+		.head.type	= VIRTIO_IOMMU_T_MAP,
+		.address_space	= cpu_to_le32(vdomain->id),
+		.virt_addr	= cpu_to_le64(iova),
+		.phys_addr	= cpu_to_le64(paddr),
+		.size		= cpu_to_le64(size),
+	};
+
+	pr_debug("map %llu 0x%lx -> 0x%llx (%zu)\n", vdomain->id, iova,
+		 paddr, size);
+
+	if (!vdomain->attached)
+		return -ENODEV;
+
+	if (prot & IOMMU_READ)
+		req.flags |= cpu_to_le32(VIRTIO_IOMMU_MAP_F_READ);
+
+	if (prot & IOMMU_WRITE)
+		req.flags |= cpu_to_le32(VIRTIO_IOMMU_MAP_F_WRITE);
+
+	ret = viommu_tlb_map(vdomain, iova, paddr, size);
+	if (ret)
+		return ret;
+
+	ret = viommu_send_req_sync(vdomain->viommu, &req);
+	if (ret)
+		viommu_tlb_unmap(vdomain, iova, size);
+
+	return ret;
+}
+
+static size_t viommu_unmap(struct iommu_domain *domain, unsigned long iova,
+			   size_t size)
+{
+	int ret;
+	size_t unmapped;
+	struct viommu_domain *vdomain = to_viommu_domain(domain);
+	struct virtio_iommu_req_unmap req = {
+		.head.type	= VIRTIO_IOMMU_T_UNMAP,
+		.address_space	= cpu_to_le32(vdomain->id),
+		.virt_addr	= cpu_to_le64(iova),
+	};
+
+	pr_debug("unmap %llu 0x%lx (%zu)\n", vdomain->id, iova, size);
+
+	/* Callers may unmap after detach, but the device already took care of it. */
+	if (!vdomain->attached)
+		return size;
+
+	unmapped = viommu_tlb_unmap(vdomain, iova, size);
+	if (unmapped < size)
+		return 0;
+
+	req.size = cpu_to_le64(unmapped);
+
+	ret = viommu_send_req_sync(vdomain->viommu, &req);
+	if (ret)
+		return 0;
+
+	return unmapped;
+}
+
+static size_t viommu_map_sg(struct iommu_domain *domain, unsigned long iova,
+			    struct scatterlist *sg, unsigned int nents, int prot)
+{
+	int i, ret = 0;
+	int nr_sent;
+	size_t mapped;
+	size_t min_pagesz;
+	size_t total_size;
+	struct scatterlist *s;
+	unsigned int flags = 0;
+	unsigned long cur_iova;
+	unsigned long mapped_iova;
+	size_t head_size, tail_size;
+	struct viommu_request reqs[nents];
+	struct virtio_iommu_req_map map_reqs[nents];
+	struct viommu_domain *vdomain = to_viommu_domain(domain);
+
+	if (!vdomain->attached)
+		return 0;
+
+	pr_debug("map_sg %llu %u 0x%lx\n", vdomain->id, nents, iova);
+
+	if (prot & IOMMU_READ)
+		flags |= VIRTIO_IOMMU_MAP_F_READ;
+
+	if (prot & IOMMU_WRITE)
+		flags |= VIRTIO_IOMMU_MAP_F_WRITE;
+
+	min_pagesz = 1 << __ffs(domain->pgsize_bitmap);
+	tail_size = sizeof(struct virtio_iommu_req_tail);
+	head_size = sizeof(*map_reqs) - tail_size;
+
+	cur_iova = iova;
+
+	for_each_sg(sg, s, nents, i) {
+		size_t size = s->length;
+		phys_addr_t paddr = sg_phys(s);
+		void *tail = (void *)&map_reqs[i] + head_size;
+
+		if (!IS_ALIGNED(paddr | size, min_pagesz)) {
+			ret = -EFAULT;
+			break;
+		}
+
+		/* TODO: merge physically-contiguous mappings if any */
+		map_reqs[i] = (struct virtio_iommu_req_map) {
+			.head.type	= VIRTIO_IOMMU_T_MAP,
+			.address_space	= cpu_to_le32(vdomain->id),
+			.flags		= cpu_to_le32(flags),
+			.virt_addr	= cpu_to_le64(cur_iova),
+			.phys_addr	= cpu_to_le64(paddr),
+			.size		= cpu_to_le64(size),
+		};
+
+		ret = viommu_tlb_map(vdomain, cur_iova, paddr, size);
+		if (ret)
+			break;
+
+		sg_init_one(&reqs[i].head, &map_reqs[i], head_size);
+		sg_init_one(&reqs[i].tail, tail, tail_size);
+
+		cur_iova += size;
+	}
+
+	total_size = cur_iova - iova;
+
+	if (ret) {
+		viommu_tlb_unmap(vdomain, iova, total_size);
+		return 0;
+	}
+
+	ret = viommu_send_reqs_sync(vdomain->viommu, reqs, i, &nr_sent);
+
+	if (nr_sent != nents)
+		goto err_rollback;
+
+	for (i = 0; i < nents; i++) {
+		if (!reqs[i].written || map_reqs[i].tail.status)
+			goto err_rollback;
+	}
+
+	return total_size;
+
+err_rollback:
+	/*
+	 * Any request in the range might have failed. Unmap what was
+	 * successful.
+	 */
+	cur_iova = iova;
+	mapped_iova = iova;
+	mapped = 0;
+	for_each_sg(sg, s, nents, i) {
+		size_t size = s->length;
+
+		cur_iova += size;
+
+		if (!reqs[i].written || map_reqs[i].tail.status) {
+			if (mapped)
+				viommu_unmap(domain, mapped_iova, mapped);
+
+			mapped_iova = cur_iova;
+			mapped = 0;
+		} else {
+			mapped += size;
+		}
+	}
+
+	viommu_tlb_unmap(vdomain, iova, total_size);
+
+	return 0;
+}
+
+static phys_addr_t viommu_iova_to_phys(struct iommu_domain *domain,
+				       dma_addr_t iova)
+{
+	u64 paddr = 0;
+	unsigned long flags;
+	struct viommu_mapping *mapping;
+	struct interval_tree_node *node;
+	struct viommu_domain *vdomain = to_viommu_domain(domain);
+
+	spin_lock_irqsave(&vdomain->mappings_lock, flags);
+	node = interval_tree_iter_first(&vdomain->mappings, iova, iova);
+	if (node) {
+		mapping = container_of(node, struct viommu_mapping, iova);
+		paddr = mapping->paddr + (iova - mapping->iova.start);
+	}
+	spin_unlock_irqrestore(&vdomain->mappings_lock, flags);
+
+	pr_debug("iova_to_phys %llu 0x%llx->0x%llx\n", vdomain->id, iova,
+		 paddr);
+
+	return paddr;
+}
+
+static struct iommu_ops viommu_ops;
+static struct virtio_driver virtio_iommu_drv;
+
+static int viommu_match_node(struct device *dev, void *data)
+{
+	return dev->parent->fwnode == data;
+}
+
+static struct viommu_dev *viommu_get_by_fwnode(struct fwnode_handle *fwnode)
+{
+	struct device *dev = driver_find_device(&virtio_iommu_drv.driver, NULL,
+						fwnode, viommu_match_node);
+	put_device(dev);
+
+	return dev ? dev_to_virtio(dev)->priv : NULL;
+}
+
+static int viommu_add_device(struct device *dev)
+{
+	struct iommu_group *group;
+	struct viommu_endpoint *vdev;
+	struct viommu_dev *viommu = NULL;
+	struct iommu_fwspec *fwspec = dev->iommu_fwspec;
+
+	if (!fwspec || fwspec->ops != &viommu_ops)
+		return -ENODEV;
+
+	viommu = viommu_get_by_fwnode(fwspec->iommu_fwnode);
+	if (!viommu)
+		return -ENODEV;
+
+	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
+	if (!vdev)
+		return -ENOMEM;
+
+	vdev->viommu = viommu;
+	fwspec->iommu_priv = vdev;
+
+	/*
+	 * Last step creates a default domain and attaches to it. Everything
+	 * must be ready.
+	 */
+	group = iommu_group_get_for_dev(dev);
+
+	return PTR_ERR_OR_ZERO(group);
+}
+
+static void viommu_remove_device(struct device *dev)
+{
+	kfree(dev->iommu_fwspec->iommu_priv);
+}
+
+static struct iommu_group *
+viommu_device_group(struct device *dev)
+{
+	if (dev_is_pci(dev))
+		return pci_device_group(dev);
+	else
+		return generic_device_group(dev);
+}
+
+static int viommu_of_xlate(struct device *dev, struct of_phandle_args *args)
+{
+	u32 *id = args->args;
+
+	dev_dbg(dev, "of_xlate 0x%x\n", *id);
+	return iommu_fwspec_add_ids(dev, args->args, 1);
+}
+
+/*
+ * (Maybe) temporary hack for device pass-through into guest userspace. On ARM
+ * with an ITS, VFIO will look for a region in which to map the doorbell, even
+ * though the virtual doorbell is never written to by the device; instead the
+ * host injects interrupts directly. TODO: sort this out in VFIO.
+ */
+#define MSI_IOVA_BASE			0x8000000
+#define MSI_IOVA_LENGTH			0x100000
+
+static void viommu_get_resv_regions(struct device *dev, struct list_head *head)
+{
+	struct iommu_resv_region *region;
+	int prot = IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_MMIO;
+
+	region = iommu_alloc_resv_region(MSI_IOVA_BASE, MSI_IOVA_LENGTH, prot,
+					 IOMMU_RESV_MSI);
+	if (!region)
+		return;
+
+	list_add_tail(&region->list, head);
+}
+
+static void viommu_put_resv_regions(struct device *dev, struct list_head *head)
+{
+	struct iommu_resv_region *entry, *next;
+
+	list_for_each_entry_safe(entry, next, head, list)
+		kfree(entry);
+}
+
+static struct iommu_ops viommu_ops = {
+	.capable		= viommu_capable,
+	.domain_alloc		= viommu_domain_alloc,
+	.domain_free		= viommu_domain_free,
+	.attach_dev		= viommu_attach_dev,
+	.map			= viommu_map,
+	.unmap			= viommu_unmap,
+	.map_sg			= viommu_map_sg,
+	.iova_to_phys		= viommu_iova_to_phys,
+	.add_device		= viommu_add_device,
+	.remove_device		= viommu_remove_device,
+	.device_group		= viommu_device_group,
+	.of_xlate		= viommu_of_xlate,
+	.get_resv_regions	= viommu_get_resv_regions,
+	.put_resv_regions	= viommu_put_resv_regions,
+};
+
+static int viommu_init_vq(struct viommu_dev *viommu)
+{
+	struct virtio_device *vdev = dev_to_virtio(viommu->dev);
+	vq_callback_t *callback = NULL;
+	const char *name = "request";
+	int ret;
+
+	ret = vdev->config->find_vqs(vdev, 1, &viommu->vq, &callback,
+				     &name, NULL);
+	if (ret)
+		dev_err(viommu->dev, "cannot find VQ\n");
+
+	return ret;
+}
+
+static int viommu_probe(struct virtio_device *vdev)
+{
+	struct device *parent_dev = vdev->dev.parent;
+	struct viommu_dev *viommu = NULL;
+	struct device *dev = &vdev->dev;
+	int ret;
+
+	viommu = kzalloc(sizeof(*viommu), GFP_KERNEL);
+	if (!viommu)
+		return -ENOMEM;
+
+	spin_lock_init(&viommu->vq_lock);
+	INIT_LIST_HEAD(&viommu->pending_requests);
+	viommu->dev = dev;
+	viommu->vdev = vdev;
+
+	ret = viommu_init_vq(viommu);
+	if (ret)
+		goto err_free_viommu;
+
+	virtio_cread(vdev, struct virtio_iommu_config, page_sizes,
+		     &viommu->pgsize_bitmap);
+
+	viommu->aperture_end = -1UL;
+
+	virtio_cread_feature(vdev, VIRTIO_IOMMU_F_INPUT_RANGE,
+			     struct virtio_iommu_config, input_range.start,
+			     &viommu->aperture_start);
+
+	virtio_cread_feature(vdev, VIRTIO_IOMMU_F_INPUT_RANGE,
+			     struct virtio_iommu_config, input_range.end,
+			     &viommu->aperture_end);
+
+	if (!viommu->pgsize_bitmap) {
+		ret = -EINVAL;
+		goto err_free_viommu;
+	}
+
+	viommu_ops.pgsize_bitmap = viommu->pgsize_bitmap;
+
+	/*
+	 * Not strictly necessary; virtio would enable it later. This allows us
+	 * to start using the request queue early.
+	 */
+	virtio_device_ready(vdev);
+
+	ret = iommu_device_sysfs_add(&viommu->iommu, dev, NULL, "%s",
+				     virtio_bus_name(vdev));
+	if (ret)
+		goto err_free_viommu;
+
+	iommu_device_set_ops(&viommu->iommu, &viommu_ops);
+	iommu_device_set_fwnode(&viommu->iommu, parent_dev->fwnode);
+
+	iommu_device_register(&viommu->iommu);
+
+#ifdef CONFIG_PCI
+	if (pci_bus_type.iommu_ops != &viommu_ops) {
+		pci_request_acs();
+		ret = bus_set_iommu(&pci_bus_type, &viommu_ops);
+		if (ret)
+			goto err_unregister;
+	}
+#endif
+#ifdef CONFIG_ARM_AMBA
+	if (amba_bustype.iommu_ops != &viommu_ops) {
+		ret = bus_set_iommu(&amba_bustype, &viommu_ops);
+		if (ret)
+			goto err_unregister;
+	}
+#endif
+	if (platform_bus_type.iommu_ops != &viommu_ops) {
+		ret = bus_set_iommu(&platform_bus_type, &viommu_ops);
+		if (ret)
+			goto err_unregister;
+	}
+
+	vdev->priv = viommu;
+
+	dev_info(viommu->dev, "probe successful\n");
+
+	return 0;
+
+err_unregister:
+	iommu_device_unregister(&viommu->iommu);
+
+err_free_viommu:
+	kfree(viommu);
+
+	return ret;
+}
+
+static void viommu_remove(struct virtio_device *vdev)
+{
+	struct viommu_dev *viommu = vdev->priv;
+
+	iommu_device_unregister(&viommu->iommu);
+	kfree(viommu);
+
+	dev_info(&vdev->dev, "device removed\n");
+}
+
+static void viommu_config_changed(struct virtio_device *vdev)
+{
+	dev_warn(&vdev->dev, "config changed\n");
+}
+
+static unsigned int features[] = {
+	VIRTIO_IOMMU_F_INPUT_RANGE,
+};
+
+static struct virtio_device_id id_table[] = {
+	{ VIRTIO_ID_IOMMU, VIRTIO_DEV_ANY_ID },
+	{ 0 },
+};
+
+static struct virtio_driver virtio_iommu_drv = {
+	.driver.name		= KBUILD_MODNAME,
+	.driver.owner		= THIS_MODULE,
+	.id_table		= id_table,
+	.feature_table		= features,
+	.feature_table_size	= ARRAY_SIZE(features),
+	.probe			= viommu_probe,
+	.remove			= viommu_remove,
+	.config_changed		= viommu_config_changed,
+};
+
+module_virtio_driver(virtio_iommu_drv);
+
+IOMMU_OF_DECLARE(viommu, "virtio,mmio", NULL);
+
+MODULE_DESCRIPTION("virtio-iommu driver");
+MODULE_AUTHOR("Jean-Philippe Brucker <jean-philippe.brucker@arm.com>");
+MODULE_LICENSE("GPL v2");
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 1f25c86374ad..c0cb0f173258 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -467,6 +467,7 @@ header-y += virtio_console.h
 header-y += virtio_gpu.h
 header-y += virtio_ids.h
 header-y += virtio_input.h
+header-y += virtio_iommu.h
 header-y += virtio_mmio.h
 header-y += virtio_net.h
 header-y += virtio_pci.h
diff --git a/include/uapi/linux/virtio_ids.h b/include/uapi/linux/virtio_ids.h
index 6d5c3b2d4f4d..934ed3d3cd3f 100644
--- a/include/uapi/linux/virtio_ids.h
+++ b/include/uapi/linux/virtio_ids.h
@@ -43,5 +43,6 @@
 #define VIRTIO_ID_INPUT        18 /* virtio input */
 #define VIRTIO_ID_VSOCK        19 /* virtio vsock transport */
 #define VIRTIO_ID_CRYPTO       20 /* virtio crypto */
+#define VIRTIO_ID_IOMMU	    61216 /* virtio IOMMU (temporary) */
 
 #endif /* _LINUX_VIRTIO_IDS_H */
diff --git a/include/uapi/linux/virtio_iommu.h b/include/uapi/linux/virtio_iommu.h
new file mode 100644
index 000000000000..ec74c9a727d4
--- /dev/null
+++ b/include/uapi/linux/virtio_iommu.h
@@ -0,0 +1,135 @@
+/*
+ * Copyright (C) 2017 ARM Ltd.
+ *
+ * This header is BSD licensed so anyone can use the definitions
+ * to implement compatible drivers/servers:
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of ARM Ltd. nor the names of its contributors
+ *    may be used to endorse or promote products derived from this software
+ *    without specific prior written permission.
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+ * FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL IBM OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+ * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+ * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+#ifndef _UAPI_LINUX_VIRTIO_IOMMU_H
+#define _UAPI_LINUX_VIRTIO_IOMMU_H
+
+/* Feature bits */
+#define VIRTIO_IOMMU_F_INPUT_RANGE		0
+#define VIRTIO_IOMMU_F_IOASID_BITS		1
+#define VIRTIO_IOMMU_F_MAP_UNMAP		2
+#define VIRTIO_IOMMU_F_BYPASS			3
+
+struct virtio_iommu_config {
+	/* Supported page sizes */
+	__u64					page_sizes;
+	struct virtio_iommu_range {
+		__u64				start;
+		__u64				end;
+	} input_range;
+	__u8					ioasid_bits;
+} __attribute__((packed));
+
+/* Request types */
+#define VIRTIO_IOMMU_T_ATTACH			0x01
+#define VIRTIO_IOMMU_T_DETACH			0x02
+#define VIRTIO_IOMMU_T_MAP			0x03
+#define VIRTIO_IOMMU_T_UNMAP			0x04
+
+/* Status types */
+#define VIRTIO_IOMMU_S_OK			0x00
+#define VIRTIO_IOMMU_S_IOERR			0x01
+#define VIRTIO_IOMMU_S_UNSUPP			0x02
+#define VIRTIO_IOMMU_S_DEVERR			0x03
+#define VIRTIO_IOMMU_S_INVAL			0x04
+#define VIRTIO_IOMMU_S_RANGE			0x05
+#define VIRTIO_IOMMU_S_NOENT			0x06
+#define VIRTIO_IOMMU_S_FAULT			0x07
+
+struct virtio_iommu_req_head {
+	__u8					type;
+	__u8					reserved[3];
+} __attribute__((packed));
+
+struct virtio_iommu_req_tail {
+	__u8					status;
+	__u8					reserved[3];
+} __attribute__((packed));
+
+struct virtio_iommu_req_attach {
+	struct virtio_iommu_req_head		head;
+
+	__le32					address_space;
+	__le32					device;
+	__le32					reserved;
+
+	struct virtio_iommu_req_tail		tail;
+} __attribute__((packed));
+
+struct virtio_iommu_req_detach {
+	struct virtio_iommu_req_head		head;
+
+	__le32					device;
+	__le32					reserved;
+
+	struct virtio_iommu_req_tail		tail;
+} __attribute__((packed));
+
+#define VIRTIO_IOMMU_MAP_F_READ			(1 << 0)
+#define VIRTIO_IOMMU_MAP_F_WRITE		(1 << 1)
+#define VIRTIO_IOMMU_MAP_F_EXEC			(1 << 2)
+
+#define VIRTIO_IOMMU_MAP_F_MASK			(VIRTIO_IOMMU_MAP_F_READ |	\
+						 VIRTIO_IOMMU_MAP_F_WRITE |	\
+						 VIRTIO_IOMMU_MAP_F_EXEC)
+
+struct virtio_iommu_req_map {
+	struct virtio_iommu_req_head		head;
+
+	__le32					address_space;
+	__le32					flags;
+	__le64					virt_addr;
+	__le64					phys_addr;
+	__le64					size;
+
+	struct virtio_iommu_req_tail		tail;
+} __attribute__((packed));
+
+struct virtio_iommu_req_unmap {
+	struct virtio_iommu_req_head		head;
+
+	__le32					address_space;
+	__le32					flags;
+	__le64					virt_addr;
+	__le64					size;
+
+	struct virtio_iommu_req_tail		tail;
+} __attribute__((packed));
+
+union virtio_iommu_req {
+	struct virtio_iommu_req_head		head;
+
+	struct virtio_iommu_req_attach		attach;
+	struct virtio_iommu_req_detach		detach;
+	struct virtio_iommu_req_map		map;
+	struct virtio_iommu_req_unmap		unmap;
+};
+
+#endif
-- 
2.12.1

^ permalink raw reply related	[flat|nested] 99+ messages in thread
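
To make the wire format above concrete: a user-space device or driver
implementation might build a MAP request as follows. This is a minimal
sketch, not part of the patch; it assumes glibc's <endian.h> for the
little-endian conversions, and build_map_req is a hypothetical helper.

	#include <endian.h>
	#include <stdint.h>
	#include <string.h>
	#include <linux/virtio_iommu.h>

	/* Fill a MAP request; the device writes tail.status on completion */
	static void build_map_req(struct virtio_iommu_req_map *req,
				  uint32_t ioasid, uint64_t iova,
				  uint64_t paddr, uint64_t size)
	{
		memset(req, 0, sizeof(*req));
		req->head.type	   = VIRTIO_IOMMU_T_MAP;
		req->address_space = htole32(ioasid);
		req->flags	   = htole32(VIRTIO_IOMMU_MAP_F_READ |
					     VIRTIO_IOMMU_MAP_F_WRITE);
		req->virt_addr	   = htole64(iova);
		req->phys_addr	   = htole64(paddr);
		req->size	   = htole64(size);
	}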

* [RFC PATCH kvmtool 00/15] Add virtio-iommu
  2017-04-07 19:17 [RFC 0/3] virtio-iommu: a paravirtualized IOMMU Jean-Philippe Brucker
                   ` (5 preceding siblings ...)
  2017-04-07 19:23 ` [RFC PATCH linux] iommu: Add virtio-iommu driver Jean-Philippe Brucker
@ 2017-04-07 19:24 ` Jean-Philippe Brucker
  2017-04-07 19:24   ` [RFC PATCH kvmtool 01/15] virtio: synchronize virtio-iommu headers with Linux Jean-Philippe Brucker
                     ` (31 more replies)
  2017-04-07 21:19 ` [RFC 0/3] virtio-iommu: a paravirtualized IOMMU Michael S. Tsirkin
                   ` (6 subsequent siblings)
  13 siblings, 32 replies; 99+ messages in thread
From: Jean-Philippe Brucker @ 2017-04-07 19:24 UTC (permalink / raw)
  To: iommu, kvm, virtualization, virtio-dev
  Cc: cdall, lorenzo.pieralisi, mst, marc.zyngier, joro, will.deacon,
	robin.murphy

Implement a virtio-iommu device and translate DMA traffic from vfio and virtio
devices. Virtio needed some rework to support scatter-gather accesses to vring
and buffers at page granularity. Patch 3 implements the actual virtio-iommu
device.

Adding --viommu on the command-line now inserts a virtual IOMMU in front
of all virtio and vfio devices:

	$ lkvm run -k Image --console virtio -p console=hvc0 \
	           --viommu --vfio 0 --vfio 4 --irqchip gicv3-its
	...
	[    2.998949] virtio_iommu virtio0: probe successful
	[    3.007739] virtio_iommu virtio1: probe successful
	...
	[    3.165023] iommu: Adding device 0000:00:00.0 to group 0
	[    3.536480] iommu: Adding device 10200.virtio to group 1
	[    3.553643] iommu: Adding device 10600.virtio to group 2
	[    3.570687] iommu: Adding device 10800.virtio to group 3
	[    3.627425] iommu: Adding device 10a00.virtio to group 4
	[    7.823689] iommu: Adding device 0000:00:01.0 to group 5
	...

Patches 13 and 14 add debug facilities. Some statistics are gathered for each
address space and can be queried via the debug builtin:

	$ lkvm debug -n guest-1210 --iommu stats
	iommu 0 "viommu-vfio"
	  kicks                 1255
	  requests              1256
	  ioas 1
	    maps                7
	    unmaps              4
	    resident            2101248
	  ioas 6
	    maps                623
	    unmaps              620
	    resident            16384
	iommu 1 "viommu-virtio"
	  kicks                 11426
	  requests              11431
	  ioas 2
	    maps                2836
	    unmaps              2835
	    resident            8192
	    accesses            2836
	...

This is based on the VFIO patchset[1], itself based on Andre's ITS work.
The VFIO bits have only been tested on a software model and are unlikely
to work on actual hardware, but I also tested virtio on an ARM Juno.

[1] http://www.spinics.net/lists/kvm/msg147624.html

Jean-Philippe Brucker (15):
  virtio: synchronize virtio-iommu headers with Linux
  FDT: (re)introduce a dynamic phandle allocator
  virtio: add virtio-iommu
  Add a simple IOMMU
  iommu: describe IOMMU topology in device-trees
  irq: register MSI doorbell addresses
  virtio: factor virtqueue initialization
  virtio: add vIOMMU instance for virtio devices
  virtio: access vring and buffers through IOMMU mappings
  virtio-pci: translate MSIs with the virtual IOMMU
  virtio: set VIRTIO_F_IOMMU_PLATFORM when necessary
  vfio: add support for virtual IOMMU
  virtio-iommu: debug via IPC
  virtio-iommu: implement basic debug commands
  virtio: use virtio-iommu when available

 Makefile                          |   3 +
 arm/gic.c                         |   4 +
 arm/include/arm-common/fdt-arch.h |   2 +-
 arm/pci.c                         |  49 ++-
 builtin-debug.c                   |   8 +-
 builtin-run.c                     |   2 +
 fdt.c                             |  35 ++
 include/kvm/builtin-debug.h       |   6 +
 include/kvm/devices.h             |   4 +
 include/kvm/fdt.h                 |  20 +
 include/kvm/iommu.h               | 105 +++++
 include/kvm/irq.h                 |   3 +
 include/kvm/kvm-config.h          |   1 +
 include/kvm/vfio.h                |   2 +
 include/kvm/virtio-iommu.h        |  15 +
 include/kvm/virtio-mmio.h         |   1 +
 include/kvm/virtio-pci.h          |   2 +
 include/kvm/virtio.h              | 137 +++++-
 include/linux/virtio_config.h     |  74 ++++
 include/linux/virtio_ids.h        |   4 +
 include/linux/virtio_iommu.h      | 135 ++++++
 iommu.c                           | 240 ++++++++++
 irq.c                             |  35 ++
 kvm-ipc.c                         |  43 +-
 mips/include/kvm/fdt-arch.h       |   2 +-
 powerpc/include/kvm/fdt-arch.h    |   2 +-
 vfio.c                            | 281 +++++++++++-
 virtio/9p.c                       |   7 +-
 virtio/balloon.c                  |   7 +-
 virtio/blk.c                      |  10 +-
 virtio/console.c                  |   7 +-
 virtio/core.c                     | 240 ++++++++--
 virtio/iommu.c                    | 902 ++++++++++++++++++++++++++++++++++++++
 virtio/mmio.c                     |  44 +-
 virtio/net.c                      |   8 +-
 virtio/pci.c                      |  61 ++-
 virtio/rng.c                      |   6 +-
 virtio/scsi.c                     |   6 +-
 x86/include/kvm/fdt-arch.h        |   2 +-
 39 files changed, 2382 insertions(+), 133 deletions(-)
 create mode 100644 fdt.c
 create mode 100644 include/kvm/iommu.h
 create mode 100644 include/kvm/virtio-iommu.h
 create mode 100644 include/linux/virtio_config.h
 create mode 100644 include/linux/virtio_iommu.h
 create mode 100644 iommu.c
 create mode 100644 virtio/iommu.c

-- 
2.12.1

^ permalink raw reply	[flat|nested] 99+ messages in thread

* [RFC PATCH kvmtool 01/15] virtio: synchronize virtio-iommu headers with Linux
  2017-04-07 19:24 ` [RFC PATCH kvmtool 00/15] Add virtio-iommu Jean-Philippe Brucker
  2017-04-07 19:24   ` [RFC PATCH kvmtool 01/15] virtio: synchronize virtio-iommu headers with Linux Jean-Philippe Brucker
@ 2017-04-07 19:24   ` Jean-Philippe Brucker
  2017-04-07 19:24   ` [RFC PATCH kvmtool 02/15] FDT: (re)introduce a dynamic phandle allocator Jean-Philippe Brucker
                     ` (29 subsequent siblings)
  31 siblings, 0 replies; 99+ messages in thread
From: Jean-Philippe Brucker @ 2017-04-07 19:24 UTC (permalink / raw)
  To: iommu, kvm, virtualization, virtio-dev
  Cc: cdall, will.deacon, robin.murphy, lorenzo.pieralisi, joro, mst,
	jasowang, alex.williamson, marc.zyngier

Pull virtio-iommu header (initial proposal) from Linux. Also add
virtio_config.h because it defines VIRTIO_F_IOMMU_PLATFORM, which I'm
going to need soon, and it's not provided by my toolchain.
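
(For illustration: since bit 33 lies beyond the first 32 feature bits,
testing it requires a 64-bit mask. A sketch, assuming a u64 `features`
word already read from the device:)

	bool use_iommu = features & (1ULL << VIRTIO_F_IOMMU_PLATFORM);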

Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
---
 include/linux/virtio_config.h |  74 ++++++++++++++++++++++
 include/linux/virtio_ids.h    |   4 ++
 include/linux/virtio_iommu.h  | 135 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 213 insertions(+)
 create mode 100644 include/linux/virtio_config.h
 create mode 100644 include/linux/virtio_iommu.h

diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
new file mode 100644
index 00000000..648b688f
--- /dev/null
+++ b/include/linux/virtio_config.h
@@ -0,0 +1,74 @@
+#ifndef _LINUX_VIRTIO_CONFIG_H
+#define _LINUX_VIRTIO_CONFIG_H
+/* This header, excluding the #ifdef __KERNEL__ part, is BSD licensed so
+ * anyone can use the definitions to implement compatible drivers/servers.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of IBM nor the names of its contributors
+ *    may be used to endorse or promote products derived from this software
+ *    without specific prior written permission.
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE. */
+
+/* Virtio devices use a standardized configuration space to define their
+ * features and pass configuration information, but each implementation can
+ * store and access that space differently. */
+#include <linux/types.h>
+
+/* Status byte for guest to report progress, and synchronize features. */
+/* We have seen device and processed generic fields (VIRTIO_CONFIG_F_VIRTIO) */
+#define VIRTIO_CONFIG_S_ACKNOWLEDGE	1
+/* We have found a driver for the device. */
+#define VIRTIO_CONFIG_S_DRIVER		2
+/* Driver has used its parts of the config, and is happy */
+#define VIRTIO_CONFIG_S_DRIVER_OK	4
+/* Driver has finished configuring features */
+#define VIRTIO_CONFIG_S_FEATURES_OK	8
+/* Device entered invalid state, driver must reset it */
+#define VIRTIO_CONFIG_S_NEEDS_RESET	0x40
+/* We've given up on this device. */
+#define VIRTIO_CONFIG_S_FAILED		0x80
+
+/* Some virtio feature bits (currently bits 28 through 32) are reserved for the
+ * transport being used (eg. virtio_ring), the rest are per-device feature
+ * bits. */
+#define VIRTIO_TRANSPORT_F_START	28
+#define VIRTIO_TRANSPORT_F_END		34
+
+#ifndef VIRTIO_CONFIG_NO_LEGACY
+/* Do we get callbacks when the ring is completely used, even if we've
+ * suppressed them? */
+#define VIRTIO_F_NOTIFY_ON_EMPTY	24
+
+/* Can the device handle any descriptor layout? */
+#define VIRTIO_F_ANY_LAYOUT		27
+#endif /* VIRTIO_CONFIG_NO_LEGACY */
+
+/* v1.0 compliant. */
+#define VIRTIO_F_VERSION_1		32
+
+/*
+ * If clear - device has the IOMMU bypass quirk feature.
+ * If set - use platform tools to detect the IOMMU.
+ *
+ * Note the reverse polarity (compared to most other features),
+ * this is for compatibility with legacy systems.
+ */
+#define VIRTIO_F_IOMMU_PLATFORM		33
+#endif /* _LINUX_VIRTIO_CONFIG_H */
diff --git a/include/linux/virtio_ids.h b/include/linux/virtio_ids.h
index 5f60aa4b..934ed3d3 100644
--- a/include/linux/virtio_ids.h
+++ b/include/linux/virtio_ids.h
@@ -39,6 +39,10 @@
 #define VIRTIO_ID_9P		9 /* 9p virtio console */
 #define VIRTIO_ID_RPROC_SERIAL 11 /* virtio remoteproc serial link */
 #define VIRTIO_ID_CAIF	       12 /* Virtio caif */
+#define VIRTIO_ID_GPU          16 /* virtio GPU */
 #define VIRTIO_ID_INPUT        18 /* virtio input */
+#define VIRTIO_ID_VSOCK        19 /* virtio vsock transport */
+#define VIRTIO_ID_CRYPTO       20 /* virtio crypto */
+#define VIRTIO_ID_IOMMU	    61216 /* virtio IOMMU (temporary) */
 
 #endif /* _LINUX_VIRTIO_IDS_H */
diff --git a/include/linux/virtio_iommu.h b/include/linux/virtio_iommu.h
new file mode 100644
index 00000000..beb21d44
--- /dev/null
+++ b/include/linux/virtio_iommu.h
@@ -0,0 +1,135 @@
+/*
+ * Copyright (C) 2017 ARM Ltd.
+ *
+ * This header is BSD licensed so anyone can use the definitions
+ * to implement compatible drivers/servers:
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of ARM Ltd. nor the names of its contributors
+ *    may be used to endorse or promote products derived from this software
+ *    without specific prior written permission.
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
+ * FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL IBM OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+ * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+ * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+#ifndef _LINUX_VIRTIO_IOMMU_H
+#define _LINUX_VIRTIO_IOMMU_H
+
+/* Feature bits */
+#define VIRTIO_IOMMU_F_INPUT_RANGE		0
+#define VIRTIO_IOMMU_F_IOASID_BITS		1
+#define VIRTIO_IOMMU_F_MAP_UNMAP		2
+#define VIRTIO_IOMMU_F_BYPASS			3
+
+struct virtio_iommu_config {
+	/* Supported page sizes */
+	__u64					page_sizes;
+	struct virtio_iommu_range {
+		__u64				start;
+		__u64				end;
+	} input_range;
+	__u8					ioasid_bits;
+} __attribute__((packed));
+
+/* Request types */
+#define VIRTIO_IOMMU_T_ATTACH			0x01
+#define VIRTIO_IOMMU_T_DETACH			0x02
+#define VIRTIO_IOMMU_T_MAP			0x03
+#define VIRTIO_IOMMU_T_UNMAP			0x04
+
+/* Status types */
+#define VIRTIO_IOMMU_S_OK			0x00
+#define VIRTIO_IOMMU_S_IOERR			0x01
+#define VIRTIO_IOMMU_S_UNSUPP			0x02
+#define VIRTIO_IOMMU_S_DEVERR			0x03
+#define VIRTIO_IOMMU_S_INVAL			0x04
+#define VIRTIO_IOMMU_S_RANGE			0x05
+#define VIRTIO_IOMMU_S_NOENT			0x06
+#define VIRTIO_IOMMU_S_FAULT			0x07
+
+struct virtio_iommu_req_head {
+	__u8					type;
+	__u8					reserved[3];
+} __attribute__((packed));
+
+struct virtio_iommu_req_tail {
+	__u8					status;
+	__u8					reserved[3];
+} __attribute__((packed));
+
+struct virtio_iommu_req_attach {
+	struct virtio_iommu_req_head		head;
+
+	__le32					address_space;
+	__le32					device;
+	__le32					reserved;
+
+	struct virtio_iommu_req_tail		tail;
+} __attribute__((packed));
+
+struct virtio_iommu_req_detach {
+	struct virtio_iommu_req_head		head;
+
+	__le32					device;
+	__le32					reserved;
+
+	struct virtio_iommu_req_tail		tail;
+} __attribute__((packed));
+
+#define VIRTIO_IOMMU_MAP_F_READ			(1 << 0)
+#define VIRTIO_IOMMU_MAP_F_WRITE		(1 << 1)
+#define VIRTIO_IOMMU_MAP_F_EXEC			(1 << 2)
+
+#define VIRTIO_IOMMU_MAP_F_MASK			(VIRTIO_IOMMU_MAP_F_READ |	\
+						 VIRTIO_IOMMU_MAP_F_WRITE |	\
+						 VIRTIO_IOMMU_MAP_F_EXEC)
+
+struct virtio_iommu_req_map {
+	struct virtio_iommu_req_head		head;
+
+	__le32					address_space;
+	__le32					flags;
+	__le64					virt_addr;
+	__le64					phys_addr;
+	__le64					size;
+
+	struct virtio_iommu_req_tail		tail;
+} __attribute__((packed));
+
+struct virtio_iommu_req_unmap {
+	struct virtio_iommu_req_head		head;
+
+	__le32					address_space;
+	__le32					flags;
+	__le64					virt_addr;
+	__le64					size;
+
+	struct virtio_iommu_req_tail		tail;
+} __attribute__((packed));
+
+union virtio_iommu_req {
+	struct virtio_iommu_req_head		head;
+
+	struct virtio_iommu_req_attach		attach;
+	struct virtio_iommu_req_detach		detach;
+	struct virtio_iommu_req_map		map;
+	struct virtio_iommu_req_unmap		unmap;
+};
+
+#endif
-- 
2.12.1

^ permalink raw reply related	[flat|nested] 99+ messages in thread
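
Since driver and device must agree on the exact byte layout, compile-time
size checks are a cheap safeguard. A sketch, not part of the patch; the
expected sizes follow from the packed definitions above:

	_Static_assert(sizeof(struct virtio_iommu_req_head) == 4, "head");
	_Static_assert(sizeof(struct virtio_iommu_req_tail) == 4, "tail");
	_Static_assert(sizeof(struct virtio_iommu_req_attach) == 20, "attach");
	_Static_assert(sizeof(struct virtio_iommu_req_detach) == 16, "detach");
	_Static_assert(sizeof(struct virtio_iommu_req_map) == 40, "map");
	_Static_assert(sizeof(struct virtio_iommu_req_unmap) == 32, "unmap");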

* [RFC PATCH kvmtool 02/15] FDT: (re)introduce a dynamic phandle allocator
  2017-04-07 19:24 ` [RFC PATCH kvmtool 00/15] Add virtio-iommu Jean-Philippe Brucker
                     ` (2 preceding siblings ...)
  2017-04-07 19:24   ` [RFC PATCH kvmtool 02/15] FDT: (re)introduce a dynamic phandle allocator Jean-Philippe Brucker
@ 2017-04-07 19:24   ` Jean-Philippe Brucker
  2017-04-07 19:24   ` [RFC PATCH kvmtool 03/15] virtio: add virtio-iommu Jean-Philippe Brucker
                     ` (27 subsequent siblings)
  31 siblings, 0 replies; 99+ messages in thread
From: Jean-Philippe Brucker @ 2017-04-07 19:24 UTC (permalink / raw)
  To: iommu, kvm, virtualization, virtio-dev
  Cc: cdall, will.deacon, robin.murphy, lorenzo.pieralisi, joro, mst,
	jasowang, alex.williamson, marc.zyngier

The phandle allocator was removed because static values were sufficient
for creating a common irqchip. Now that multiple virtual IOMMUs are added
to the device-tree, phandles need to be allocated dynamically. Add a
simple allocator that returns values above the static ones.
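
For illustration, a caller creating a device-tree node could use the
allocator like this (a sketch, not part of this patch, assuming kvmtool's
_FDT() error-checking wrapper and libfdt's sequential-write API):

	u32 phandle = fdt_alloc_phandle();

	_FDT(fdt_begin_node(fdt, "viommu"));
	_FDT(fdt_property_cell(fdt, "phandle", phandle));
	_FDT(fdt_end_node(fdt));

	/* Other nodes can now reference this IOMMU by its phandle */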

Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
---
 Makefile                          |  1 +
 arm/include/arm-common/fdt-arch.h |  2 +-
 fdt.c                             | 15 +++++++++++++++
 include/kvm/fdt.h                 | 13 +++++++++++++
 mips/include/kvm/fdt-arch.h       |  2 +-
 powerpc/include/kvm/fdt-arch.h    |  2 +-
 x86/include/kvm/fdt-arch.h        |  2 +-
 7 files changed, 33 insertions(+), 4 deletions(-)
 create mode 100644 fdt.c

diff --git a/Makefile b/Makefile
index 6d5f5d9d..3e21c597 100644
--- a/Makefile
+++ b/Makefile
@@ -303,6 +303,7 @@ ifeq (y,$(ARCH_WANT_LIBFDT))
 		CFLAGS_STATOPT	+= -DCONFIG_HAS_LIBFDT
 		LIBS_DYNOPT	+= -lfdt
 		LIBS_STATOPT	+= -lfdt
+		OBJS		+= fdt.o
 	endif
 endif
 
diff --git a/arm/include/arm-common/fdt-arch.h b/arm/include/arm-common/fdt-arch.h
index 60c2d406..ed4ff3d4 100644
--- a/arm/include/arm-common/fdt-arch.h
+++ b/arm/include/arm-common/fdt-arch.h
@@ -1,6 +1,6 @@
 #ifndef ARM__FDT_H
 #define ARM__FDT_H
 
-enum phandles {PHANDLE_RESERVED = 0, PHANDLE_GIC, PHANDLE_MSI, PHANDLES_MAX};
+enum phandles {PHANDLE_RESERVED = 0, PHANDLE_GIC, PHANDLE_MSI, ARCH_PHANDLES_MAX};
 
 #endif /* ARM__FDT_H */
diff --git a/fdt.c b/fdt.c
new file mode 100644
index 00000000..6db03d4e
--- /dev/null
+++ b/fdt.c
@@ -0,0 +1,15 @@
+/*
+ * Commonly used FDT functions.
+ */
+
+#include "kvm/fdt.h"
+
+static u32 next_phandle = PHANDLE_RESERVED;
+
+u32 fdt_alloc_phandle(void)
+{
+	if (next_phandle == PHANDLE_RESERVED)
+		next_phandle = ARCH_PHANDLES_MAX;
+
+	return next_phandle++;
+}
diff --git a/include/kvm/fdt.h b/include/kvm/fdt.h
index beadc7f3..503887f9 100644
--- a/include/kvm/fdt.h
+++ b/include/kvm/fdt.h
@@ -35,4 +35,17 @@ enum irq_type {
 		}							\
 	} while (0)
 
+#ifdef CONFIG_HAS_LIBFDT
+
+u32 fdt_alloc_phandle(void);
+
+#else
+
+static inline u32 fdt_alloc_phandle(void)
+{
+	return PHANDLE_RESERVED;
+}
+
+#endif /* CONFIG_HAS_LIBFDT */
+
 #endif /* KVM__FDT_H */
diff --git a/mips/include/kvm/fdt-arch.h b/mips/include/kvm/fdt-arch.h
index b0302457..3d004117 100644
--- a/mips/include/kvm/fdt-arch.h
+++ b/mips/include/kvm/fdt-arch.h
@@ -1,6 +1,6 @@
 #ifndef KVM__KVM_FDT_H
 #define KVM__KVM_FDT_H
 
-enum phandles {PHANDLE_RESERVED = 0, PHANDLES_MAX};
+enum phandles {PHANDLE_RESERVED = 0, ARCH_PHANDLES_MAX};
 
 #endif /* KVM__KVM_FDT_H */
diff --git a/powerpc/include/kvm/fdt-arch.h b/powerpc/include/kvm/fdt-arch.h
index d48c0554..4ae4d3a0 100644
--- a/powerpc/include/kvm/fdt-arch.h
+++ b/powerpc/include/kvm/fdt-arch.h
@@ -1,6 +1,6 @@
 #ifndef KVM__KVM_FDT_H
 #define KVM__KVM_FDT_H
 
-enum phandles {PHANDLE_RESERVED = 0, PHANDLE_XICP, PHANDLES_MAX};
+enum phandles {PHANDLE_RESERVED = 0, PHANDLE_XICP, ARCH_PHANDLES_MAX};
 
 #endif /* KVM__KVM_FDT_H */
diff --git a/x86/include/kvm/fdt-arch.h b/x86/include/kvm/fdt-arch.h
index eebd73f9..aba06ad8 100644
--- a/x86/include/kvm/fdt-arch.h
+++ b/x86/include/kvm/fdt-arch.h
@@ -1,6 +1,6 @@
 #ifndef X86__FDT_ARCH_H
 #define X86__FDT_ARCH_H
 
-enum phandles {PHANDLE_RESERVED = 0, PHANDLES_MAX};
+enum phandles {PHANDLE_RESERVED = 0, ARCH_PHANDLES_MAX};
 
 #endif /* KVM__KVM_FDT_H */
-- 
2.12.1

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [RFC PATCH kvmtool 03/15] virtio: add virtio-iommu
  2017-04-07 19:24 ` [RFC PATCH kvmtool 00/15] Add virtio-iommu Jean-Philippe Brucker
                     ` (3 preceding siblings ...)
  2017-04-07 19:24   ` Jean-Philippe Brucker
@ 2017-04-07 19:24   ` Jean-Philippe Brucker
  2017-04-07 19:24   ` Jean-Philippe Brucker
                     ` (26 subsequent siblings)
  31 siblings, 0 replies; 99+ messages in thread
From: Jean-Philippe Brucker @ 2017-04-07 19:24 UTC (permalink / raw)
  To: iommu, kvm, virtualization, virtio-dev
  Cc: cdall, will.deacon, robin.murphy, lorenzo.pieralisi, joro, mst,
	jasowang, alex.williamson, marc.zyngier

Implement a simple para-virtualized IOMMU for handling device address
spaces in guests.

Four operations are implemented:
* attach/detach: guest creates an address space, identified by a unique
  ID (IOASID), and attaches the device to it.
* map/unmap: guest creates a GVA->GPA mapping in an address space. Devices
  attached to this address space can then access the GVA.

Each subsystem can register its own IOMMU by calling register/unregister.
A unique device-tree phandle is allocated for each IOMMU. The IOMMU
receives commands from the driver through the virtqueue, and has a set of
callbacks for each device, allowing different map/unmap operations to be
implemented for passed-through and emulated devices. Note that a single
virtual IOMMU per guest would be enough; this multi-instance model is only
here for experimentation, and allows different subsystems to offer
different vIOMMU features.
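
To sketch the registration flow (hypothetical subsystem code:
viommu_register(), struct iommu_properties, struct iommu_ops and the
device_header fields are the ones added by this patch; the my_*
callbacks are placeholders following the iommu_ops signatures):

  static struct iommu_properties my_props = {
  	.name			= "viommu-example",
  	.input_addr_size	= 48,	/* 48-bit IOVA space */
  };

  /* viommu_alloc_ioas() rejects an iommu_ops with a missing callback. */
  static struct iommu_ops my_ops = {
  	.get_properties		= my_get_properties,
  	.alloc_address_space	= my_alloc_address_space,
  	.free_address_space	= my_free_address_space,
  	.attach			= my_attach,
  	.detach			= my_detach,
  	.map			= my_map,
  	.unmap			= my_unmap,
  };

  /* In the subsystem's init code: */
  void *viommu = viommu_register(kvm, &my_props);
  if (viommu)
  	dev_hdr->iommu_ops = &my_ops;	/* device is now translated */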

Add a global --viommu parameter to enable the virtual IOMMU.
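
For example (lkvm being kvmtool's launcher, with the usual run options):

  $ lkvm run -k bzImage --viommu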

Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
---
 Makefile                   |   1 +
 builtin-run.c              |   2 +
 include/kvm/devices.h      |   4 +
 include/kvm/iommu.h        |  64 +++++
 include/kvm/kvm-config.h   |   1 +
 include/kvm/virtio-iommu.h |  10 +
 virtio/iommu.c             | 628 +++++++++++++++++++++++++++++++++++++++++++++
 virtio/mmio.c              |  11 +
 8 files changed, 721 insertions(+)
 create mode 100644 include/kvm/iommu.h
 create mode 100644 include/kvm/virtio-iommu.h
 create mode 100644 virtio/iommu.c

diff --git a/Makefile b/Makefile
index 3e21c597..67953870 100644
--- a/Makefile
+++ b/Makefile
@@ -68,6 +68,7 @@ OBJS	+= virtio/net.o
 OBJS	+= virtio/rng.o
 OBJS    += virtio/balloon.o
 OBJS	+= virtio/pci.o
+OBJS	+= virtio/iommu.o
 OBJS	+= disk/blk.o
 OBJS	+= disk/qcow.o
 OBJS	+= disk/raw.o
diff --git a/builtin-run.c b/builtin-run.c
index b4790ebc..7535b531 100644
--- a/builtin-run.c
+++ b/builtin-run.c
@@ -113,6 +113,8 @@ void kvm_run_set_wrapper_sandbox(void)
 	OPT_BOOLEAN('\0', "sdl", &(cfg)->sdl, "Enable SDL framebuffer"),\
 	OPT_BOOLEAN('\0', "rng", &(cfg)->virtio_rng, "Enable virtio"	\
 			" Random Number Generator"),			\
+	OPT_BOOLEAN('\0', "viommu", &(cfg)->viommu,			\
+			"Enable virtio IOMMU"),				\
 	OPT_CALLBACK('\0', "9p", NULL, "dir_to_share,tag_name",		\
 		     "Enable virtio 9p to share files between host and"	\
 		     " guest", virtio_9p_rootdir_parser, kvm),		\
diff --git a/include/kvm/devices.h b/include/kvm/devices.h
index 405f1952..70a00c5b 100644
--- a/include/kvm/devices.h
+++ b/include/kvm/devices.h
@@ -11,11 +11,15 @@ enum device_bus_type {
 	DEVICE_BUS_MAX,
 };
 
+struct iommu_ops;
+
 struct device_header {
 	enum device_bus_type	bus_type;
 	void			*data;
 	int			dev_num;
 	struct rb_node		node;
+	struct iommu_ops	*iommu_ops;
+	void			*iommu_data;
 };
 
 int device__register(struct device_header *dev);
diff --git a/include/kvm/iommu.h b/include/kvm/iommu.h
new file mode 100644
index 00000000..925e1993
--- /dev/null
+++ b/include/kvm/iommu.h
@@ -0,0 +1,64 @@
+#ifndef KVM_IOMMU_H
+#define KVM_IOMMU_H
+
+#include <stdlib.h>
+
+#include "devices.h"
+
+#define IOMMU_PROT_NONE		0x0
+#define IOMMU_PROT_READ		0x1
+#define IOMMU_PROT_WRITE	0x2
+#define IOMMU_PROT_EXEC		0x4
+
+struct iommu_ops {
+	const struct iommu_properties *(*get_properties)(struct device_header *);
+
+	void *(*alloc_address_space)(struct device_header *);
+	void (*free_address_space)(void *);
+
+	int (*attach)(void *, struct device_header *, int flags);
+	int (*detach)(void *, struct device_header *);
+	int (*map)(void *, u64 virt_addr, u64 phys_addr, u64 size, int prot);
+	int (*unmap)(void *, u64 virt_addr, u64 size, int flags);
+};
+
+struct iommu_properties {
+	const char			*name;
+	u32				phandle;
+
+	size_t				input_addr_size;
+	u64				pgsize_mask;
+};
+
+/*
+ * All devices presented to the system have a device ID, that allows the IOMMU
+ * to identify them. Since multiple buses can share an IOMMU, this device ID
+ * must be unique system-wide. We define it here as:
+ *
+ *	(bus_type << 16) + dev_num
+ *
+ * Where dev_num is the device number on the bus as allocated by devices.c
+ *
+ * TODO: enforce this limit, by checking that the device number allocator
+ * doesn't overflow BUS_SIZE.
+ */
+
+#define BUS_SIZE 0x10000
+
+static inline long device_to_iommu_id(struct device_header *dev)
+{
+	return dev->bus_type * BUS_SIZE + dev->dev_num;
+}
+
+#define iommu_id_to_bus(device_id)	((device_id) / BUS_SIZE)
+#define iommu_id_to_devnum(device_id)	((device_id) % BUS_SIZE)
+
+static inline struct device_header *iommu_get_device(u32 device_id)
+{
+	enum device_bus_type bus = iommu_id_to_bus(device_id);
+	u32 dev_num = iommu_id_to_devnum(device_id);
+
+	return device__find_dev(bus, dev_num);
+}
+
+#endif /* KVM_IOMMU_H */
diff --git a/include/kvm/kvm-config.h b/include/kvm/kvm-config.h
index 62dc6a2f..9678065b 100644
--- a/include/kvm/kvm-config.h
+++ b/include/kvm/kvm-config.h
@@ -60,6 +60,7 @@ struct kvm_config {
 	bool no_dhcp;
 	bool ioport_debug;
 	bool mmio_debug;
+	bool viommu;
 };
 
 #endif
diff --git a/include/kvm/virtio-iommu.h b/include/kvm/virtio-iommu.h
new file mode 100644
index 00000000..5532c82b
--- /dev/null
+++ b/include/kvm/virtio-iommu.h
@@ -0,0 +1,10 @@
+#ifndef KVM_VIRTIO_IOMMU_H
+#define KVM_VIRTIO_IOMMU_H
+
+#include "virtio.h"
+
+const struct iommu_properties *viommu_get_properties(void *dev);
+void *viommu_register(struct kvm *kvm, struct iommu_properties *props);
+void viommu_unregister(struct kvm *kvm, void *cookie);
+
+#endif
diff --git a/virtio/iommu.c b/virtio/iommu.c
new file mode 100644
index 00000000..c72e7322
--- /dev/null
+++ b/virtio/iommu.c
@@ -0,0 +1,628 @@
+#include <errno.h>
+#include <stdbool.h>
+
+#include <linux/compiler.h>
+
+#include <linux/bitops.h>
+#include <linux/byteorder.h>
+#include <linux/err.h>
+#include <linux/list.h>
+#include <linux/types.h>
+#include <linux/virtio_ids.h>
+#include <linux/virtio_iommu.h>
+
+#include "kvm/guest_compat.h"
+#include "kvm/iommu.h"
+#include "kvm/threadpool.h"
+#include "kvm/virtio.h"
+#include "kvm/virtio-iommu.h"
+
+/* Default queue size; also bounds the iovec array in viommu_command() */
+#define VIOMMU_DEFAULT_QUEUE_SIZE	256
+
+struct viommu_endpoint {
+	struct device_header		*dev;
+	struct viommu_ioas		*ioas;
+	struct list_head		list;
+};
+
+struct viommu_ioas {
+	u32				id;
+
+	struct mutex			devices_mutex;
+	struct list_head		devices;
+	size_t				nr_devices;
+	struct rb_node			node;
+
+	struct iommu_ops		*ops;
+	void				*priv;
+};
+
+struct viommu_dev {
+	struct virtio_device		vdev;
+	struct virtio_iommu_config	config;
+
+	const struct iommu_properties	*properties;
+
+	struct virt_queue		vq;
+	size_t				queue_size;
+	struct thread_pool__job		job;
+
+	struct rb_root			address_spaces;
+	struct kvm			*kvm;
+};
+
+static int compat_id = -1;
+
+static struct viommu_ioas *viommu_find_ioas(struct viommu_dev *viommu,
+					    u32 ioasid)
+{
+	struct rb_node *node;
+	struct viommu_ioas *ioas;
+
+	node = viommu->address_spaces.rb_node;
+	while (node) {
+		ioas = container_of(node, struct viommu_ioas, node);
+		if (ioas->id > ioasid)
+			node = node->rb_left;
+		else if (ioas->id < ioasid)
+			node = node->rb_right;
+		else
+			return ioas;
+	}
+
+	return NULL;
+}
+
+static struct viommu_ioas *viommu_alloc_ioas(struct viommu_dev *viommu,
+					     struct device_header *device,
+					     u32 ioasid)
+{
+	struct rb_node **node, *parent = NULL;
+	struct viommu_ioas *new_ioas, *ioas;
+	struct iommu_ops *ops = device->iommu_ops;
+
+	if (!ops || !ops->get_properties || !ops->alloc_address_space ||
+	    !ops->free_address_space || !ops->attach || !ops->detach ||
+	    !ops->map || !ops->unmap) {
+		/* Catch programming mistakes early */
+		pr_err("Invalid IOMMU ops");
+		return NULL;
+	}
+
+	new_ioas = calloc(1, sizeof(*new_ioas));
+	if (!new_ioas)
+		return NULL;
+
+	INIT_LIST_HEAD(&new_ioas->devices);
+	mutex_init(&new_ioas->devices_mutex);
+	new_ioas->id		= ioasid;
+	new_ioas->ops		= ops;
+	new_ioas->priv		= ops->alloc_address_space(device);
+
+	/* A NULL priv pointer is valid. */
+
+	node = &viommu->address_spaces.rb_node;
+	while (*node) {
+		ioas = container_of(*node, struct viommu_ioas, node);
+		parent = *node;
+
+		if (ioas->id > ioasid) {
+			node = &((*node)->rb_left);
+		} else if (ioas->id < ioasid) {
+			node = &((*node)->rb_right);
+		} else {
+			pr_err("IOAS exists!");
+			free(new_ioas);
+			return NULL;
+		}
+	}
+
+	rb_link_node(&new_ioas->node, parent, node);
+	rb_insert_color(&new_ioas->node, &viommu->address_spaces);
+
+	return new_ioas;
+}
+
+static void viommu_free_ioas(struct viommu_dev *viommu,
+			     struct viommu_ioas *ioas)
+{
+	if (ioas->priv)
+		ioas->ops->free_address_space(ioas->priv);
+
+	rb_erase(&ioas->node, &viommu->address_spaces);
+	free(ioas);
+}
+
+static int viommu_ioas_add_device(struct viommu_ioas *ioas,
+				  struct viommu_endpoint *vdev)
+{
+	mutex_lock(&ioas->devices_mutex);
+	list_add_tail(&vdev->list, &ioas->devices);
+	ioas->nr_devices++;
+	vdev->ioas = ioas;
+	mutex_unlock(&ioas->devices_mutex);
+
+	return 0;
+}
+
+static int viommu_ioas_del_device(struct viommu_ioas *ioas,
+				  struct viommu_endpoint *vdev)
+{
+	mutex_lock(&ioas->devices_mutex);
+	list_del(&vdev->list);
+	ioas->nr_devices--;
+	vdev->ioas = NULL;
+	mutex_unlock(&ioas->devices_mutex);
+
+	return 0;
+}
+
+static struct viommu_endpoint *viommu_alloc_device(struct device_header *device)
+{
+	struct viommu_endpoint *vdev = calloc(1, sizeof(*vdev));
+
+	if (!vdev)
+		return NULL;
+
+	device->iommu_data = vdev;
+	vdev->dev = device;
+
+	return vdev;
+}
+
+static int viommu_detach_device(struct viommu_dev *viommu,
+				struct viommu_endpoint *vdev)
+{
+	int ret;
+	struct viommu_ioas *ioas = vdev->ioas;
+	struct device_header *device = vdev->dev;
+
+	if (!ioas)
+		return -EINVAL;
+
+	pr_debug("detaching device %#lx from IOAS %u",
+		 device_to_iommu_id(device), ioas->id);
+
+	ret = device->iommu_ops->detach(ioas->priv, device);
+	if (!ret)
+		ret = viommu_ioas_del_device(ioas, vdev);
+
+	if (!ioas->nr_devices)
+		viommu_free_ioas(viommu, ioas);
+
+	return ret;
+}
+
+static int viommu_handle_attach(struct viommu_dev *viommu,
+				struct virtio_iommu_req_attach *attach)
+{
+	int ret;
+	struct viommu_ioas *ioas;
+	struct device_header *device;
+	struct viommu_endpoint *vdev;
+
+	u32 device_id	= le32_to_cpu(attach->device);
+	u32 ioasid	= le32_to_cpu(attach->address_space);
+
+	device = iommu_get_device(device_id);
+	if (IS_ERR_OR_NULL(device)) {
+		pr_err("could not find device %#x", device_id);
+		return -ENODEV;
+	}
+
+	pr_debug("attaching device %#x to IOAS %u", device_id, ioasid);
+
+	vdev = device->iommu_data;
+	if (!vdev) {
+		vdev = viommu_alloc_device(device);
+		if (!vdev)
+			return -ENOMEM;
+	}
+
+	ioas = viommu_find_ioas(viommu, ioasid);
+	if (!ioas) {
+		ioas = viommu_alloc_ioas(viommu, device, ioasid);
+		if (!ioas)
+			return -ENOMEM;
+	} else if (ioas->ops->map != device->iommu_ops->map ||
+		   ioas->ops->unmap != device->iommu_ops->unmap) {
+		return -EINVAL;
+	}
+
+	if (vdev->ioas) {
+		ret = viommu_detach_device(viommu, vdev);
+		if (ret)
+			return ret;
+	}
+
+	ret = device->iommu_ops->attach(ioas->priv, device, 0);
+	if (!ret)
+		ret = viommu_ioas_add_device(ioas, vdev);
+
+	if (ret && ioas->nr_devices == 0)
+		viommu_free_ioas(viommu, ioas);
+
+	return ret;
+}
+
+static int viommu_handle_detach(struct viommu_dev *viommu,
+				struct virtio_iommu_req_detach *detach)
+{
+	struct device_header *device;
+	struct viommu_endpoint *vdev;
+
+	u32 device_id	= le32_to_cpu(detach->device);
+
+	device = iommu_get_device(device_id);
+	if (IS_ERR_OR_NULL(device)) {
+		pr_err("could not find device %#x", device_id);
+		return -ENODEV;
+	}
+
+	vdev = device->iommu_data;
+	if (!vdev)
+		return -ENODEV;
+
+	return viommu_detach_device(viommu, vdev);
+}
+
+static int viommu_handle_map(struct viommu_dev *viommu,
+			     struct virtio_iommu_req_map *map)
+{
+	int prot = 0;
+	struct viommu_ioas *ioas;
+
+	u32 ioasid	= le32_to_cpu(map->address_space);
+	u64 virt_addr	= le64_to_cpu(map->virt_addr);
+	u64 phys_addr	= le64_to_cpu(map->phys_addr);
+	u64 size	= le64_to_cpu(map->size);
+	u32 flags	= le32_to_cpu(map->flags);
+
+	ioas = viommu_find_ioas(viommu, ioasid);
+	if (!ioas) {
+		pr_err("could not find address space %u", ioasid);
+		return -ESRCH;
+	}
+
+	if (flags & ~VIRTIO_IOMMU_MAP_F_MASK)
+		return -EINVAL;
+
+	if (flags & VIRTIO_IOMMU_MAP_F_READ)
+		prot |= IOMMU_PROT_READ;
+
+	if (flags & VIRTIO_IOMMU_MAP_F_WRITE)
+		prot |= IOMMU_PROT_WRITE;
+
+	if (flags & VIRTIO_IOMMU_MAP_F_EXEC)
+		prot |= IOMMU_PROT_EXEC;
+
+	pr_debug("map %#llx -> %#llx (%llu) to IOAS %u", virt_addr,
+		 phys_addr, size, ioasid);
+
+	return ioas->ops->map(ioas->priv, virt_addr, phys_addr, size, prot);
+}
+
+static int viommu_handle_unmap(struct viommu_dev *viommu,
+			       struct virtio_iommu_req_unmap *unmap)
+{
+	struct viommu_ioas *ioas;
+
+	u32 ioasid	= le32_to_cpu(unmap->address_space);
+	u64 virt_addr	= le64_to_cpu(unmap->virt_addr);
+	u64 size	= le64_to_cpu(unmap->size);
+
+	ioas = viommu_find_ioas(viommu, ioasid);
+	if (!ioas) {
+		pr_err("could not find address space %u", ioasid);
+		return -ESRCH;
+	}
+
+	pr_debug("unmap %#llx (%llu) from IOAS %u", virt_addr, size,
+		 ioasid);
+
+	return ioas->ops->unmap(ioas->priv, virt_addr, size, 0);
+}
+
+static size_t viommu_get_req_len(union virtio_iommu_req *req)
+{
+	switch (req->head.type) {
+	case VIRTIO_IOMMU_T_ATTACH:
+		return sizeof(req->attach);
+	case VIRTIO_IOMMU_T_DETACH:
+		return sizeof(req->detach);
+	case VIRTIO_IOMMU_T_MAP:
+		return sizeof(req->map);
+	case VIRTIO_IOMMU_T_UNMAP:
+		return sizeof(req->unmap);
+	default:
+		pr_err("unknown request type %x", req->head.type);
+		return 0;
+	}
+}
+
+static int viommu_errno_to_status(int err)
+{
+	switch (err) {
+	case 0:
+		return VIRTIO_IOMMU_S_OK;
+	case EIO:
+		return VIRTIO_IOMMU_S_IOERR;
+	case ENOSYS:
+		return VIRTIO_IOMMU_S_UNSUPP;
+	case ERANGE:
+		return VIRTIO_IOMMU_S_RANGE;
+	case EFAULT:
+		return VIRTIO_IOMMU_S_FAULT;
+	case EINVAL:
+		return VIRTIO_IOMMU_S_INVAL;
+	case ENOENT:
+	case ENODEV:
+	case ESRCH:
+		return VIRTIO_IOMMU_S_NOENT;
+	case ENOMEM:
+	case ENOSPC:
+	default:
+		return VIRTIO_IOMMU_S_DEVERR;
+	}
+}
+
+static ssize_t viommu_dispatch_commands(struct viommu_dev *viommu,
+					struct iovec *iov, int nr_in, int nr_out)
+{
+	u32 op;
+	int i, ret;
+	ssize_t written_len = 0;
+	size_t len, expected_len;
+	union virtio_iommu_req *req;
+	struct virtio_iommu_req_tail *tail;
+
+	/*
+	 * Are we picking up in the middle of a request buffer? Keep a running
+	 * count.
+	 *
+	 * Here we assume that a request is always made of two descriptors, a
+	 * head and a tail. TODO: get rid of framing assumptions by keeping
+	 * track of request fragments.
+	 */
+	static bool is_head = true;
+	static int cur_status = 0;
+
+	for (i = 0; i < nr_in + nr_out; i++, is_head = !is_head) {
+		len = iov[i].iov_len;
+		if (is_head && len < sizeof(req->head)) {
+			pr_err("invalid command length (%zu)", len);
+			cur_status = EIO;
+			continue;
+		} else if (!is_head && len < sizeof(*tail)) {
+			pr_err("invalid tail length (%zu)", len);
+			cur_status = 0;
+			continue;
+		}
+
+		if (!is_head) {
+			int status = viommu_errno_to_status(cur_status);
+
+			tail = iov[i].iov_base;
+			tail->status = cpu_to_le32(status);
+			written_len += sizeof(tail->status);
+			cur_status = 0;
+			continue;
+		}
+
+		req = iov[i].iov_base;
+		op = req->head.type;
+		expected_len = viommu_get_req_len(req) - sizeof(*tail);
+		if (expected_len != len) {
+			pr_err("invalid command %x length (%zu != %zu)", op,
+			       len, expected_len);
+			cur_status = EIO;
+			continue;
+		}
+
+		switch (op) {
+		case VIRTIO_IOMMU_T_ATTACH:
+			ret = viommu_handle_attach(viommu, &req->attach);
+			break;
+
+		case VIRTIO_IOMMU_T_DETACH:
+			ret = viommu_handle_detach(viommu, &req->detach);
+			break;
+
+		case VIRTIO_IOMMU_T_MAP:
+			ret = viommu_handle_map(viommu, &req->map);
+			break;
+
+		case VIRTIO_IOMMU_T_UNMAP:
+			ret = viommu_handle_unmap(viommu, &req->unmap);
+			break;
+
+		default:
+			pr_err("unhandled command %x", op);
+			ret = -ENOSYS;
+		}
+
+		if (ret)
+			cur_status = -ret;
+	}
+
+	return written_len;
+}
+
+static void viommu_command(struct kvm *kvm, void *dev)
+{
+	int len;
+	u16 head;
+	u16 out, in;
+
+	struct virt_queue *vq;
+	struct viommu_dev *viommu = dev;
+	struct iovec iov[VIOMMU_DEFAULT_QUEUE_SIZE];
+
+	vq = &viommu->vq;
+
+	while (virt_queue__available(vq)) {
+		head = virt_queue__get_iov(vq, iov, &out, &in, kvm);
+
+		len = viommu_dispatch_commands(viommu, iov, in, out);
+		if (len < 0) {
+			/* Critical error, abort everything */
+			pr_err("failed to dispatch viommu command");
+			return;
+		}
+
+		virt_queue__set_used_elem(vq, head, len);
+	}
+
+	if (virtio_queue__should_signal(vq))
+		viommu->vdev.ops->signal_vq(kvm, &viommu->vdev, 0);
+}
+
+/* Virtio API */
+static u8 *viommu_get_config(struct kvm *kvm, void *dev)
+{
+	struct viommu_dev *viommu = dev;
+
+	return (u8 *)&viommu->config;
+}
+
+static u32 viommu_get_host_features(struct kvm *kvm, void *dev)
+{
+	return 1ULL << VIRTIO_RING_F_EVENT_IDX
+	     | 1ULL << VIRTIO_RING_F_INDIRECT_DESC
+	     | 1ULL << VIRTIO_IOMMU_F_INPUT_RANGE;
+}
+
+static void viommu_set_guest_features(struct kvm *kvm, void *dev, u32 features)
+{
+}
+
+static int viommu_init_vq(struct kvm *kvm, void *dev, u32 vq, u32 page_size,
+			  u32 align, u32 pfn)
+{
+	void *ptr;
+	struct virt_queue *queue;
+	struct viommu_dev *viommu = dev;
+
+	if (vq != 0)
+		return -ENODEV;
+
+	compat__remove_message(compat_id);
+
+	queue = &viommu->vq;
+	queue->pfn = pfn;
+	ptr = virtio_get_vq(kvm, queue->pfn, page_size);
+
+	vring_init(&queue->vring, viommu->queue_size, ptr, align);
+	virtio_init_device_vq(&viommu->vdev, queue);
+
+	thread_pool__init_job(&viommu->job, kvm, viommu_command, viommu);
+
+	return 0;
+}
+
+static int viommu_get_pfn_vq(struct kvm *kvm, void *dev, u32 vq)
+{
+	struct viommu_dev *viommu = dev;
+
+	return viommu->vq.pfn;
+}
+
+static int viommu_get_size_vq(struct kvm *kvm, void *dev, u32 vq)
+{
+	struct viommu_dev *viommu = dev;
+
+	return viommu->queue_size;
+}
+
+static int viommu_set_size_vq(struct kvm *kvm, void *dev, u32 vq, int size)
+{
+	struct viommu_dev *viommu = dev;
+
+	if (viommu->vq.pfn)
+		/* Already init, can't resize */
+		return viommu->queue_size;
+
+	viommu->queue_size = size;
+
+	return size;
+}
+
+static int viommu_notify_vq(struct kvm *kvm, void *dev, u32 vq)
+{
+	struct viommu_dev *viommu = dev;
+
+	thread_pool__do_job(&viommu->job);
+
+	return 0;
+}
+
+static void viommu_notify_vq_gsi(struct kvm *kvm, void *dev, u32 vq, u32 gsi)
+{
+	/* TODO: when implementing vhost */
+}
+
+static void viommu_notify_vq_eventfd(struct kvm *kvm, void *dev, u32 vq, u32 fd)
+{
+	/* TODO: when implementing vhost */
+}
+
+static struct virtio_ops iommu_dev_virtio_ops = {
+	.get_config		= viommu_get_config,
+	.get_host_features	= viommu_get_host_features,
+	.set_guest_features	= viommu_set_guest_features,
+	.init_vq		= viommu_init_vq,
+	.get_pfn_vq		= viommu_get_pfn_vq,
+	.get_size_vq		= viommu_get_size_vq,
+	.set_size_vq		= viommu_set_size_vq,
+	.notify_vq		= viommu_notify_vq,
+	.notify_vq_gsi		= viommu_notify_vq_gsi,
+	.notify_vq_eventfd	= viommu_notify_vq_eventfd,
+};
+
+const struct iommu_properties *viommu_get_properties(void *dev)
+{
+	struct viommu_dev *viommu = dev;
+
+	return viommu->properties;
+}
+
+void *viommu_register(struct kvm *kvm, struct iommu_properties *props)
+{
+	struct viommu_dev *viommu;
+	u64 pgsize_mask = ~(PAGE_SIZE - 1);
+
+	if (!kvm->cfg.viommu)
+		return NULL;
+
+	props->phandle = fdt_alloc_phandle();
+
+	viommu = calloc(1, sizeof(struct viommu_dev));
+	if (!viommu)
+		return NULL;
+
+	viommu->queue_size		= VIOMMU_DEFAULT_QUEUE_SIZE;
+	viommu->address_spaces		= (struct rb_root)RB_ROOT;
+	viommu->properties		= props;
+
+	viommu->config.page_sizes	= props->pgsize_mask ?: pgsize_mask;
+	viommu->config.input_range.end	= props->input_addr_size % BITS_PER_LONG ?
+					  (1UL << props->input_addr_size) - 1 :
+					  -1UL;
+
+	if (virtio_init(kvm, viommu, &viommu->vdev, &iommu_dev_virtio_ops,
+			VIRTIO_MMIO, 0, VIRTIO_ID_IOMMU, 0)) {
+		free(viommu);
+		return NULL;
+	}
+
+	pr_info("Loaded virtual IOMMU %s", props->name);
+
+	if (compat_id == -1)
+		compat_id = virtio_compat_add_message("virtio-iommu",
+						      "CONFIG_VIRTIO_IOMMU");
+
+	return viommu;
+}
+
+void viommu_unregister(struct kvm *kvm, void *viommu)
+{
+	free(viommu);
+}
diff --git a/virtio/mmio.c b/virtio/mmio.c
index f0af4bd1..b3dea51a 100644
--- a/virtio/mmio.c
+++ b/virtio/mmio.c
@@ -1,14 +1,17 @@
 #include "kvm/devices.h"
 #include "kvm/virtio-mmio.h"
 #include "kvm/ioeventfd.h"
+#include "kvm/iommu.h"
 #include "kvm/ioport.h"
 #include "kvm/virtio.h"
+#include "kvm/virtio-iommu.h"
 #include "kvm/kvm.h"
 #include "kvm/kvm-cpu.h"
 #include "kvm/irq.h"
 #include "kvm/fdt.h"
 
 #include <linux/virtio_mmio.h>
+#include <linux/virtio_ids.h>
 #include <string.h>
 
 static u32 virtio_mmio_io_space_blocks = KVM_VIRTIO_MMIO_AREA;
@@ -237,6 +240,7 @@ void generate_virtio_mmio_fdt_node(void *fdt,
 							     u8 irq,
 							     enum irq_type))
 {
+	const struct iommu_properties *props;
 	char dev_name[DEVICE_NAME_MAX_LEN];
 	struct virtio_mmio *vmmio = container_of(dev_hdr,
 						 struct virtio_mmio,
@@ -254,6 +258,13 @@ void generate_virtio_mmio_fdt_node(void *fdt,
 	_FDT(fdt_property(fdt, "reg", reg_prop, sizeof(reg_prop)));
 	_FDT(fdt_property(fdt, "dma-coherent", NULL, 0));
 	generate_irq_prop(fdt, vmmio->irq, IRQ_TYPE_EDGE_RISING);
+
+	if (vmmio->hdr.device_id == VIRTIO_ID_IOMMU) {
+		props = viommu_get_properties(vmmio->dev);
+		_FDT(fdt_property_cell(fdt, "phandle", props->phandle));
+		_FDT(fdt_property_cell(fdt, "#iommu-cells", 1));
+	}
+
 	_FDT(fdt_end_node(fdt));
 }
 #else
-- 
2.12.1

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [RFC PATCH kvmtool 04/15] Add a simple IOMMU
  2017-04-07 19:24 ` [RFC PATCH kvmtool 00/15] Add virtio-iommu Jean-Philippe Brucker
                     ` (5 preceding siblings ...)
  2017-04-07 19:24   ` Jean-Philippe Brucker
@ 2017-04-07 19:24   ` Jean-Philippe Brucker
  2017-04-07 19:24   ` Jean-Philippe Brucker
                     ` (24 subsequent siblings)
  31 siblings, 0 replies; 99+ messages in thread
From: Jean-Philippe Brucker @ 2017-04-07 19:24 UTC (permalink / raw)
  To: iommu, kvm, virtualization, virtio-dev
  Cc: cdall, will.deacon, robin.murphy, lorenzo.pieralisi, joro, mst,
	jasowang, alex.williamson, marc.zyngier

Add an rb-tree-based IOMMU with support for map, unmap and access
operations. It will be used to store mappings for virtio devices and MSI
doorbells. If needed, it could also be extended with a TLB implementation.
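
As a usage sketch (a hypothetical caller; the functions and the
IOMMU_PROT_* flags are the ones declared in this patch), a read that may
span several mappings is completed by calling iommu_access() in a loop:

  static int read_through_iommu(void *as, u64 iova, size_t size)
  {
  	size_t accessed;
  	u64 paddr;

  	while (size) {
  		paddr = iommu_access(as, iova, size, &accessed,
  				     IOMMU_PROT_READ);
  		if (!paddr)
  			return -EFAULT;	/* unmapped or bad permissions */

  		/* ... read 'accessed' bytes of guest memory at paddr ... */

  		iova += accessed;
  		size -= accessed;
  	}

  	return 0;
  }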

Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
---
 Makefile            |   1 +
 include/kvm/iommu.h |   9 +++
 iommu.c             | 162 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 172 insertions(+)
 create mode 100644 iommu.c

diff --git a/Makefile b/Makefile
index 67953870..0c369206 100644
--- a/Makefile
+++ b/Makefile
@@ -73,6 +73,7 @@ OBJS	+= disk/blk.o
 OBJS	+= disk/qcow.o
 OBJS	+= disk/raw.o
 OBJS	+= ioeventfd.o
+OBJS	+= iommu.o
 OBJS	+= net/uip/core.o
 OBJS	+= net/uip/arp.o
 OBJS	+= net/uip/icmp.o
diff --git a/include/kvm/iommu.h b/include/kvm/iommu.h
index 925e1993..4164ba20 100644
--- a/include/kvm/iommu.h
+++ b/include/kvm/iommu.h
@@ -61,4 +61,13 @@ static inline struct device_header *iommu_get_device(u32 device_id)
 	return device__find_dev(bus, dev_num);
 }
 
+void *iommu_alloc_address_space(struct device_header *dev);
+void iommu_free_address_space(void *address_space);
+
+int iommu_map(void *address_space, u64 virt_addr, u64 phys_addr, u64 size,
+	      int prot);
+int iommu_unmap(void *address_space, u64 virt_addr, u64 size, int flags);
+u64 iommu_access(void *address_space, u64 addr, size_t size, size_t *out_size,
+		 int prot);
+
 #endif /* KVM_IOMMU_H */
diff --git a/iommu.c b/iommu.c
new file mode 100644
index 00000000..0a662404
--- /dev/null
+++ b/iommu.c
@@ -0,0 +1,162 @@
+/*
+ * Implement basic IOMMU operations - map, unmap and translate
+ */
+#include <errno.h>
+
+#include "kvm/iommu.h"
+#include "kvm/kvm.h"
+#include "kvm/mutex.h"
+#include "kvm/rbtree-interval.h"
+
+struct iommu_mapping {
+	struct rb_int_node	iova_range;
+	u64			phys;
+	int			prot;
+};
+
+struct iommu_ioas {
+	struct rb_root		mappings;
+	struct mutex		mutex;
+};
+
+void *iommu_alloc_address_space(struct device_header *unused)
+{
+	struct iommu_ioas *ioas = calloc(1, sizeof(*ioas));
+
+	if (!ioas)
+		return NULL;
+
+	ioas->mappings = (struct rb_root)RB_ROOT;
+	mutex_init(&ioas->mutex);
+
+	return ioas;
+}
+
+void iommu_free_address_space(void *address_space)
+{
+	struct iommu_ioas *ioas = address_space;
+	struct rb_int_node *int_node;
+	struct rb_node *node, *next;
+	struct iommu_mapping *map;
+
+	/* A postorder traversal frees leaves first. */
+	node = rb_first_postorder(&ioas->mappings);
+	while (node) {
+		next = rb_next_postorder(node);
+
+		int_node = rb_int(node);
+		map = container_of(int_node, struct iommu_mapping, iova_range);
+		free(map);
+
+		node = next;
+	}
+
+	free(ioas);
+}
+
+int iommu_map(void *address_space, u64 virt_addr, u64 phys_addr,
+	      u64 size, int prot)
+{
+	struct iommu_ioas *ioas = address_space;
+	struct iommu_mapping *map;
+
+	if (!ioas)
+		return -ENODEV;
+
+	map = malloc(sizeof(struct iommu_mapping));
+	if (!map)
+		return -ENOMEM;
+
+	map->phys = phys_addr;
+	map->iova_range = RB_INT_INIT(virt_addr, virt_addr + size - 1);
+	map->prot = prot;
+
+	mutex_lock(&ioas->mutex);
+	rb_int_insert(&ioas->mappings, &map->iova_range);
+	mutex_unlock(&ioas->mutex);
+
+	return 0;
+}
+
+int iommu_unmap(void *address_space, u64 virt_addr, u64 size, int flags)
+{
+	int ret = 0;
+	struct rb_int_node *node;
+	struct iommu_mapping *map;
+	struct iommu_ioas *ioas = address_space;
+
+	if (!ioas)
+		return -ENODEV;
+
+	mutex_lock(&ioas->mutex);
+	node = rb_int_search_single(&ioas->mappings, virt_addr);
+	while (node && size) {
+		struct rb_node *next = rb_next(&node->node);
+		size_t node_size = node->high - node->low + 1;
+		map = container_of(node, struct iommu_mapping, iova_range);
+
+		if (node_size > size) {
+			pr_debug("cannot split mapping");
+			ret = -EINVAL;
+			break;
+		}
+
+		size -= node_size;
+		virt_addr += node_size;
+
+		rb_erase(&node->node, &ioas->mappings);
+		free(map);
+		node = next ? container_of(next, struct rb_int_node, node) : NULL;
+	}
+
+	if (size && !ret) {
+		pr_debug("mapping not found");
+		ret = -ENXIO;
+	}
+	mutex_unlock(&ioas->mutex);
+
+	return ret;
+}
+
+/*
+ * Translate a virtual address into a physical one. Perform an access of @size
+ * bytes with protection @prot. If @addr isn't mapped in @address_space, return
+ * 0. If the permissions of the mapping don't match, return 0. If the access
+ * range specified by (addr, size) spans multiple mappings, only access the
+ * first mapping and return the accessed size in @out_size. It is up to the
+ * caller to complete the access by calling the function again on the remaining
+ * range. Subsequent accesses are not guaranteed to succeed.
+ */
+u64 iommu_access(void *address_space, u64 addr, size_t size, size_t *out_size,
+		 int prot)
+{
+	struct iommu_ioas *ioas = address_space;
+	struct iommu_mapping *map;
+	struct rb_int_node *node;
+	u64 out_addr = 0;
+
+	mutex_lock(&ioas->mutex);
+	node = rb_int_search_single(&ioas->mappings, addr);
+	if (!node) {
+		pr_err("fault at IOVA %#llx %zu", addr, size);
+		errno = EFAULT;
+		goto out_unlock; /* Segv incoming */
+	}
+
+	map = container_of(node, struct iommu_mapping, iova_range);
+	if (prot & ~map->prot) {
+		pr_err("permission fault at IOVA %#llx", addr);
+		errno = EPERM;
+		goto out_unlock;
+	}
+
+	out_addr = map->phys + (addr - node->low);
+	*out_size = min_t(size_t, node->high - addr + 1, size);
+
+	pr_debug("access %llx %zu/%zu %x -> %#llx", addr, *out_size, size,
+		 prot, out_addr);
+out_unlock:
+	mutex_unlock(&ioas->mutex);
+
+	return out_addr;
+}
-- 
2.12.1

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [RFC PATCH kvmtool 05/15] iommu: describe IOMMU topology in device-trees
  2017-04-07 19:24 ` [RFC PATCH kvmtool 00/15] Add virtio-iommu Jean-Philippe Brucker
                     ` (8 preceding siblings ...)
  2017-04-07 19:24   ` [RFC PATCH kvmtool 05/15] iommu: describe IOMMU topology in device-trees Jean-Philippe Brucker
@ 2017-04-07 19:24   ` Jean-Philippe Brucker
  2017-04-07 19:24   ` [RFC PATCH kvmtool 06/15] irq: register MSI doorbell addresses Jean-Philippe Brucker
                     ` (21 subsequent siblings)
  31 siblings, 0 replies; 99+ messages in thread
From: Jean-Philippe Brucker @ 2017-04-07 19:24 UTC (permalink / raw)
  To: iommu, kvm, virtualization, virtio-dev
  Cc: cdall, will.deacon, robin.murphy, lorenzo.pieralisi, joro, mst,
	jasowang, alex.williamson, marc.zyngier

Add an "iommu-map" property to the PCI host controller, describing which
iommus translate which devices. We describe individual devices in
iommu-map, not ranges. This patch is incompatible with current mainline
Linux, which requires *all* devices under a host controller to be
described by the iommu-map property when present. Unfortunately all PCI
devices in kvmtool are under the same root complex, and we have to omit
RIDs of devices that aren't behind the virtual IOMMU in iommu-map. Fixing
this either requires a simple patch in Linux, or to implement multiple
host controllers in kvmtool.

Add an "iommus" property to plaform devices that are behind an iommu.

Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
---
 arm/pci.c         | 49 ++++++++++++++++++++++++++++++++++++++++++++++++-
 fdt.c             | 20 ++++++++++++++++++++
 include/kvm/fdt.h |  7 +++++++
 virtio/mmio.c     |  1 +
 4 files changed, 76 insertions(+), 1 deletion(-)

diff --git a/arm/pci.c b/arm/pci.c
index 557cfa98..968cbf5b 100644
--- a/arm/pci.c
+++ b/arm/pci.c
@@ -1,9 +1,11 @@
 #include "kvm/devices.h"
 #include "kvm/fdt.h"
+#include "kvm/iommu.h"
 #include "kvm/kvm.h"
 #include "kvm/of_pci.h"
 #include "kvm/pci.h"
 #include "kvm/util.h"
+#include "kvm/virtio-iommu.h"
 
 #include "arm-common/pci.h"
 
@@ -24,11 +26,20 @@ struct of_interrupt_map_entry {
 	struct of_gic_irq		gic_irq;
 } __attribute__((packed));
 
+struct of_iommu_map_entry {
+	u32				rid_base;
+	u32				iommu_phandle;
+	u32				iommu_base;
+	u32				length;
+} __attribute__((packed));
+
 void pci__generate_fdt_nodes(void *fdt)
 {
 	struct device_header *dev_hdr;
 	struct of_interrupt_map_entry irq_map[OF_PCI_IRQ_MAP_MAX];
-	unsigned nentries = 0;
+	struct of_iommu_map_entry *iommu_map;
+	unsigned nentries = 0, ntranslated = 0;
+	unsigned i;
 	/* Bus range */
 	u32 bus_range[] = { cpu_to_fdt32(0), cpu_to_fdt32(1), };
 	/* Configuration Space */
@@ -99,6 +110,9 @@ void pci__generate_fdt_nodes(void *fdt)
 			},
 		};
 
+		if (dev_hdr->iommu_ops)
+			ntranslated++;
+
 		nentries++;
 		dev_hdr = device__next_dev(dev_hdr);
 	}
@@ -121,5 +135,38 @@ void pci__generate_fdt_nodes(void *fdt)
 				  sizeof(irq_mask)));
 	}
 
+	if (ntranslated) {
+		const struct iommu_properties *props;
+
+		iommu_map = malloc(ntranslated * sizeof(struct of_iommu_map_entry));
+		if (!iommu_map) {
+			pr_err("cannot allocate iommu_map.");
+			return;
+		}
+
+		dev_hdr = device__first_dev(DEVICE_BUS_PCI);
+		for (i = 0; i < ntranslated; dev_hdr = device__next_dev(dev_hdr)) {
+			struct of_iommu_map_entry *entry = &iommu_map[i];
+
+			if (!dev_hdr->iommu_ops)
+				continue;
+
+			props = dev_hdr->iommu_ops->get_properties(dev_hdr);
+
+			*entry = (struct of_iommu_map_entry) {
+				.rid_base	= cpu_to_fdt32(dev_hdr->dev_num << 3),
+				.iommu_phandle	= cpu_to_fdt32(props->phandle),
+				.iommu_base	= cpu_to_fdt32(device_to_iommu_id(dev_hdr)),
+				.length		= cpu_to_fdt32(1 << 3),
+			};
+
+			i++;
+		}
+
+		_FDT(fdt_property(fdt, "iommu-map", iommu_map,
+				  ntranslated * sizeof(struct of_iommu_map_entry)));
+		free(iommu_map);
+	}
+
 	_FDT(fdt_end_node(fdt));
 }
diff --git a/fdt.c b/fdt.c
index 6db03d4e..15d7bb29 100644
--- a/fdt.c
+++ b/fdt.c
@@ -2,7 +2,10 @@
  * Commonly used FDT functions.
  */
 
+#include "kvm/devices.h"
 #include "kvm/fdt.h"
+#include "kvm/iommu.h"
+#include "kvm/util.h"
 
 static u32 next_phandle = PHANDLE_RESERVED;
 
@@ -13,3 +16,20 @@ u32 fdt_alloc_phandle(void)
 
 	return next_phandle++;
 }
+
+void fdt_generate_iommus_prop(void *fdt, struct device_header *dev_hdr)
+{
+	const struct iommu_properties *props;
+
+	if (!dev_hdr->iommu_ops)
+		return;
+
+	props = dev_hdr->iommu_ops->get_properties(dev_hdr);
+
+	u32 iommus[] = {
+		cpu_to_fdt32(props->phandle),
+		cpu_to_fdt32(device_to_iommu_id(dev_hdr)),
+	};
+
+	_FDT(fdt_property(fdt, "iommus", iommus, sizeof(iommus)));
+}
diff --git a/include/kvm/fdt.h b/include/kvm/fdt.h
index 503887f9..c64fe8a3 100644
--- a/include/kvm/fdt.h
+++ b/include/kvm/fdt.h
@@ -37,7 +37,10 @@ enum irq_type {
 
 #ifdef CONFIG_HAS_LIBFDT
 
+struct device_header;
+
 u32 fdt_alloc_phandle(void);
+void fdt_generate_iommus_prop(void *fdt, struct device_header *dev);
 
 #else
 
@@ -46,6 +49,10 @@ static inline u32 fdt_alloc_phandle(void)
 	return PHANDLE_RESERVED;
 }
 
+static inline void fdt_generate_iommus_prop(void *fdt, struct device_header *dev)
+{
+}
+
 #endif /* CONFIG_HAS_LIBFDT */
 
 #endif /* KVM__FDT_H */
diff --git a/virtio/mmio.c b/virtio/mmio.c
index b3dea51a..16b44fbb 100644
--- a/virtio/mmio.c
+++ b/virtio/mmio.c
@@ -258,6 +258,7 @@ void generate_virtio_mmio_fdt_node(void *fdt,
 	_FDT(fdt_property(fdt, "reg", reg_prop, sizeof(reg_prop)));
 	_FDT(fdt_property(fdt, "dma-coherent", NULL, 0));
 	generate_irq_prop(fdt, vmmio->irq, IRQ_TYPE_EDGE_RISING);
+	fdt_generate_iommus_prop(fdt, dev_hdr);
 
 	if (vmmio->hdr.device_id == VIRTIO_ID_IOMMU) {
 		props = viommu_get_properties(vmmio->dev);
-- 
2.12.1
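
For illustration, the properties generated above could look like this in
the guest's device tree (a sketch; phandles, unit addresses and RID
values depend on the VM configuration):

    pci {
            /* PCI device 00:02.0: RIDs 0x10-0x17 -> vIOMMU IDs 0x10-0x17 */
            iommu-map = <0x10 &viommu 0x10 0x8>;
    };

    virtio@10200 {
            /* Two cells: IOMMU phandle, then device ID within that IOMMU */
            iommus = <&viommu 0x4>;
    };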

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [RFC PATCH kvmtool 06/15] irq: register MSI doorbell addresses
  2017-04-07 19:24 ` [RFC PATCH kvmtool 00/15] Add virtio-iommu Jean-Philippe Brucker
                     ` (10 preceding siblings ...)
  2017-04-07 19:24   ` [RFC PATCH kvmtool 06/15] irq: register MSI doorbell addresses Jean-Philippe Brucker
@ 2017-04-07 19:24   ` Jean-Philippe Brucker
  2017-04-07 19:24   ` [RFC PATCH kvmtool 07/15] virtio: factor virtqueue initialization Jean-Philippe Brucker
                     ` (19 subsequent siblings)
  31 siblings, 0 replies; 99+ messages in thread
From: Jean-Philippe Brucker @ 2017-04-07 19:24 UTC (permalink / raw)
  To: iommu, kvm, virtualization, virtio-dev
  Cc: cdall, will.deacon, robin.murphy, lorenzo.pieralisi, joro, mst,
	jasowang, alex.williamson, marc.zyngier

For passed-through devices behind a vIOMMU, we'll need to translate writes
to MSI vectors. Let the IRQ code register MSI doorbells, and add a simple
way for other subsystems to check whether an address is a doorbell.

Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
---
 arm/gic.c         |  4 ++++
 include/kvm/irq.h |  3 +++
 irq.c             | 35 +++++++++++++++++++++++++++++++++++
 3 files changed, 42 insertions(+)

diff --git a/arm/gic.c b/arm/gic.c
index bf7a22a9..c708031e 100644
--- a/arm/gic.c
+++ b/arm/gic.c
@@ -108,6 +108,10 @@ static int gic__create_its_frame(struct kvm *kvm, u64 its_frame_addr)
 	};
 	int err;
 
+	err = irq__add_msi_doorbell(kvm, its_frame_addr, KVM_VGIC_V3_ITS_SIZE);
+	if (err)
+		return err;
+
 	err = ioctl(kvm->vm_fd, KVM_CREATE_DEVICE, &its_device);
 	if (err) {
 		fprintf(stderr,
diff --git a/include/kvm/irq.h b/include/kvm/irq.h
index a188a870..2a59257e 100644
--- a/include/kvm/irq.h
+++ b/include/kvm/irq.h
@@ -24,6 +24,9 @@ int irq__allocate_routing_entry(void);
 int irq__add_msix_route(struct kvm *kvm, struct msi_msg *msg, u32 device_id);
 void irq__update_msix_route(struct kvm *kvm, u32 gsi, struct msi_msg *msg);
 
+int irq__add_msi_doorbell(struct kvm *kvm, u64 addr, u64 size);
+bool irq__addr_is_msi_doorbell(struct kvm *kvm, u64 addr);
+
 /*
  * The function takes two eventfd arguments, trigger_fd and resample_fd. If
  * resample_fd is <= 0, resampling is disabled and the IRQ is edge-triggered
diff --git a/irq.c b/irq.c
index a4ef75e4..a04f4d37 100644
--- a/irq.c
+++ b/irq.c
@@ -8,6 +8,14 @@
 #include "kvm/irq.h"
 #include "kvm/kvm-arch.h"
 
+struct kvm_msi_doorbell_region {
+	u64			start;
+	u64			end;
+	struct list_head	head;
+};
+
+static LIST_HEAD(msi_doorbells);
+
 static u8 next_line = KVM_IRQ_OFFSET;
 static int allocated_gsis = 0;
 
@@ -147,6 +155,33 @@ void irq__update_msix_route(struct kvm *kvm, u32 gsi, struct msi_msg *msg)
 		die_perror("KVM_SET_GSI_ROUTING");
 }
 
+int irq__add_msi_doorbell(struct kvm *kvm, u64 addr, u64 size)
+{
+	struct kvm_msi_doorbell_region *doorbell = malloc(sizeof(*doorbell));
+
+	if (!doorbell)
+		return -ENOMEM;
+
+	doorbell->start = addr;
+	doorbell->end = addr + size - 1;
+
+	list_add(&doorbell->head, &msi_doorbells);
+
+	return 0;
+}
+
+bool irq__addr_is_msi_doorbell(struct kvm *kvm, u64 addr)
+{
+	struct kvm_msi_doorbell_region *doorbell;
+
+	list_for_each_entry(doorbell, &msi_doorbells, head) {
+		if (addr >= doorbell->start && addr <= doorbell->end)
+			return true;
+	}
+
+	return false;
+}
+
 int irq__common_add_irqfd(struct kvm *kvm, unsigned int gsi, int trigger_fd,
 			   int resample_fd)
 {
-- 
2.12.1
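
A consumer of this interface might use it when handling a MAP request
(a sketch, not part of the patch; viommu_handle_map() is invented for
the example):

    static int viommu_handle_map(struct kvm *kvm, void *address_space,
                                 u64 iova, u64 phys, u64 size, int prot)
    {
            /*
             * Writes to an MSI doorbell must not be remapped like
             * ordinary memory; they are translated separately.
             */
            if (irq__addr_is_msi_doorbell(kvm, phys))
                    return 0;

            return iommu_map(address_space, iova, phys, size, prot);
    }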

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [RFC PATCH kvmtool 07/15] virtio: factor virtqueue initialization
  2017-04-07 19:24 ` [RFC PATCH kvmtool 00/15] Add virtio-iommu Jean-Philippe Brucker
                     ` (11 preceding siblings ...)
  2017-04-07 19:24   ` Jean-Philippe Brucker
@ 2017-04-07 19:24   ` Jean-Philippe Brucker
  2017-04-07 19:24   ` Jean-Philippe Brucker
                     ` (18 subsequent siblings)
  31 siblings, 0 replies; 99+ messages in thread
From: Jean-Philippe Brucker @ 2017-04-07 19:24 UTC (permalink / raw)
  To: iommu, kvm, virtualization, virtio-dev
  Cc: cdall, will.deacon, robin.murphy, lorenzo.pieralisi, joro, mst,
	jasowang, alex.williamson, marc.zyngier

All virtio devices perform the same few operations when initializing
their virtqueues. Move these operations into the virtio core, since vring
initialization will become more complex once we implement a virtual IOMMU.

Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
---
 include/kvm/virtio.h | 16 +++++++++-------
 virtio/9p.c          |  7 ++-----
 virtio/balloon.c     |  7 +++----
 virtio/blk.c         | 10 ++--------
 virtio/console.c     |  7 ++-----
 virtio/iommu.c       | 10 ++--------
 virtio/net.c         |  8 ++------
 virtio/rng.c         |  6 ++----
 virtio/scsi.c        |  6 ++----
 9 files changed, 26 insertions(+), 51 deletions(-)

diff --git a/include/kvm/virtio.h b/include/kvm/virtio.h
index 00a791ac..24c0c487 100644
--- a/include/kvm/virtio.h
+++ b/include/kvm/virtio.h
@@ -169,15 +169,17 @@ int virtio_init(struct kvm *kvm, void *dev, struct virtio_device *vdev,
 int virtio_compat_add_message(const char *device, const char *config);
 const char* virtio_trans_name(enum virtio_trans trans);
 
-static inline void *virtio_get_vq(struct kvm *kvm, u32 pfn, u32 page_size)
+static inline void virtio_init_device_vq(struct kvm *kvm,
+					 struct virtio_device *vdev,
+					 struct virt_queue *vq, size_t nr_descs,
+					 u32 page_size, u32 align, u32 pfn)
 {
-	return guest_flat_to_host(kvm, (u64)pfn * page_size);
-}
+	void *p		= guest_flat_to_host(kvm, (u64)pfn * page_size);
 
-static inline void virtio_init_device_vq(struct virtio_device *vdev,
-					 struct virt_queue *vq)
-{
-	vq->endian = vdev->endian;
+	vq->endian	= vdev->endian;
+	vq->pfn		= pfn;
+
+	vring_init(&vq->vring, nr_descs, p, align);
 }
 
 #endif /* KVM__VIRTIO_H */
diff --git a/virtio/9p.c b/virtio/9p.c
index 69fdc4be..acd09bdd 100644
--- a/virtio/9p.c
+++ b/virtio/9p.c
@@ -1388,17 +1388,14 @@ static int init_vq(struct kvm *kvm, void *dev, u32 vq, u32 page_size, u32 align,
 	struct p9_dev *p9dev = dev;
 	struct p9_dev_job *job;
 	struct virt_queue *queue;
-	void *p;
 
 	compat__remove_message(compat_id);
 
 	queue		= &p9dev->vqs[vq];
-	queue->pfn	= pfn;
-	p		= virtio_get_vq(kvm, queue->pfn, page_size);
 	job		= &p9dev->jobs[vq];
 
-	vring_init(&queue->vring, VIRTQUEUE_NUM, p, align);
-	virtio_init_device_vq(&p9dev->vdev, queue);
+	virtio_init_device_vq(kvm, &p9dev->vdev, queue, VIRTQUEUE_NUM,
+			      page_size, align, pfn);
 
 	*job		= (struct p9_dev_job) {
 		.vq		= queue,
diff --git a/virtio/balloon.c b/virtio/balloon.c
index 9564aa39..9182cae6 100644
--- a/virtio/balloon.c
+++ b/virtio/balloon.c
@@ -198,16 +198,15 @@ static int init_vq(struct kvm *kvm, void *dev, u32 vq, u32 page_size, u32 align,
 {
 	struct bln_dev *bdev = dev;
 	struct virt_queue *queue;
-	void *p;
 
 	compat__remove_message(compat_id);
 
 	queue		= &bdev->vqs[vq];
-	queue->pfn	= pfn;
-	p		= virtio_get_vq(kvm, queue->pfn, page_size);
+
+	virtio_init_device_vq(kvm, &bdev->vdev, queue, VIRTIO_BLN_QUEUE_SIZE,
+			      page_size, align, pfn);
 
 	thread_pool__init_job(&bdev->jobs[vq], kvm, virtio_bln_do_io, queue);
-	vring_init(&queue->vring, VIRTIO_BLN_QUEUE_SIZE, p, align);
 
 	return 0;
 }
diff --git a/virtio/blk.c b/virtio/blk.c
index c485e4fc..8c6e59ba 100644
--- a/virtio/blk.c
+++ b/virtio/blk.c
@@ -178,17 +178,11 @@ static int init_vq(struct kvm *kvm, void *dev, u32 vq, u32 page_size, u32 align,
 		   u32 pfn)
 {
 	struct blk_dev *bdev = dev;
-	struct virt_queue *queue;
-	void *p;
 
 	compat__remove_message(compat_id);
 
-	queue		= &bdev->vqs[vq];
-	queue->pfn	= pfn;
-	p		= virtio_get_vq(kvm, queue->pfn, page_size);
-
-	vring_init(&queue->vring, VIRTIO_BLK_QUEUE_SIZE, p, align);
-	virtio_init_device_vq(&bdev->vdev, queue);
+	virtio_init_device_vq(kvm, &bdev->vdev, &bdev->vqs[vq],
+			      VIRTIO_BLK_QUEUE_SIZE, page_size, align, pfn);
 
 	return 0;
 }
diff --git a/virtio/console.c b/virtio/console.c
index f1c0a190..610962c4 100644
--- a/virtio/console.c
+++ b/virtio/console.c
@@ -143,18 +143,15 @@ static int init_vq(struct kvm *kvm, void *dev, u32 vq, u32 page_size, u32 align,
 		   u32 pfn)
 {
 	struct virt_queue *queue;
-	void *p;
 
 	BUG_ON(vq >= VIRTIO_CONSOLE_NUM_QUEUES);
 
 	compat__remove_message(compat_id);
 
 	queue		= &cdev.vqs[vq];
-	queue->pfn	= pfn;
-	p		= virtio_get_vq(kvm, queue->pfn, page_size);
 
-	vring_init(&queue->vring, VIRTIO_CONSOLE_QUEUE_SIZE, p, align);
-	virtio_init_device_vq(&cdev.vdev, queue);
+	virtio_init_device_vq(kvm, &cdev.vdev, queue, VIRTIO_CONSOLE_QUEUE_SIZE,
+			      page_size, align, pfn);
 
 	if (vq == VIRTIO_CONSOLE_TX_QUEUE) {
 		thread_pool__init_job(&cdev.jobs[vq], kvm, virtio_console_handle_callback, queue);
diff --git a/virtio/iommu.c b/virtio/iommu.c
index c72e7322..2e5a23ee 100644
--- a/virtio/iommu.c
+++ b/virtio/iommu.c
@@ -497,8 +497,6 @@ static void viommu_set_guest_features(struct kvm *kvm, void *dev, u32 features)
 static int viommu_init_vq(struct kvm *kvm, void *dev, u32 vq, u32 page_size,
 			  u32 align, u32 pfn)
 {
-	void *ptr;
-	struct virt_queue *queue;
 	struct viommu_dev *viommu = dev;
 
 	if (vq != 0)
@@ -506,12 +504,8 @@ static int viommu_init_vq(struct kvm *kvm, void *dev, u32 vq, u32 page_size,
 
 	compat__remove_message(compat_id);
 
-	queue = &viommu->vq;
-	queue->pfn = pfn;
-	ptr = virtio_get_vq(kvm, queue->pfn, page_size);
-
-	vring_init(&queue->vring, viommu->queue_size, ptr, align);
-	virtio_init_device_vq(&viommu->vdev, queue);
+	virtio_init_device_vq(kvm, &viommu->vdev, &viommu->vq,
+			      viommu->queue_size, page_size, align, pfn);
 
 	thread_pool__init_job(&viommu->job, kvm, viommu_command, viommu);
 
diff --git a/virtio/net.c b/virtio/net.c
index 529b4111..957cca09 100644
--- a/virtio/net.c
+++ b/virtio/net.c
@@ -505,17 +505,13 @@ static int init_vq(struct kvm *kvm, void *dev, u32 vq, u32 page_size, u32 align,
 	struct vhost_vring_addr addr;
 	struct net_dev *ndev = dev;
 	struct virt_queue *queue;
-	void *p;
 	int r;
 
 	compat__remove_message(compat_id);
 
 	queue		= &ndev->vqs[vq];
-	queue->pfn	= pfn;
-	p		= virtio_get_vq(kvm, queue->pfn, page_size);
-
-	vring_init(&queue->vring, VIRTIO_NET_QUEUE_SIZE, p, align);
-	virtio_init_device_vq(&ndev->vdev, queue);
+	virtio_init_device_vq(kvm, &ndev->vdev, queue, VIRTIO_NET_QUEUE_SIZE,
+			      page_size, align, pfn);
 
 	mutex_init(&ndev->io_lock[vq]);
 	pthread_cond_init(&ndev->io_cond[vq], NULL);
diff --git a/virtio/rng.c b/virtio/rng.c
index 9b9e1283..5f525540 100644
--- a/virtio/rng.c
+++ b/virtio/rng.c
@@ -92,17 +92,15 @@ static int init_vq(struct kvm *kvm, void *dev, u32 vq, u32 page_size, u32 align,
 	struct rng_dev *rdev = dev;
 	struct virt_queue *queue;
 	struct rng_dev_job *job;
-	void *p;
 
 	compat__remove_message(compat_id);
 
 	queue		= &rdev->vqs[vq];
-	queue->pfn	= pfn;
-	p		= virtio_get_vq(kvm, queue->pfn, page_size);
 
 	job = &rdev->jobs[vq];
 
-	vring_init(&queue->vring, VIRTIO_RNG_QUEUE_SIZE, p, align);
+	virtio_init_device_vq(kvm, &rdev->vdev, queue, VIRTIO_RNG_QUEUE_SIZE,
+			      page_size, align, pfn);
 
 	*job = (struct rng_dev_job) {
 		.vq	= queue,
diff --git a/virtio/scsi.c b/virtio/scsi.c
index a429ac85..e0fd85f6 100644
--- a/virtio/scsi.c
+++ b/virtio/scsi.c
@@ -57,16 +57,14 @@ static int init_vq(struct kvm *kvm, void *dev, u32 vq, u32 page_size, u32 align,
 	struct vhost_vring_addr addr;
 	struct scsi_dev *sdev = dev;
 	struct virt_queue *queue;
-	void *p;
 	int r;
 
 	compat__remove_message(compat_id);
 
 	queue		= &sdev->vqs[vq];
-	queue->pfn	= pfn;
-	p		= virtio_get_vq(kvm, queue->pfn, page_size);
 
-	vring_init(&queue->vring, VIRTIO_SCSI_QUEUE_SIZE, p, align);
+	virtio_init_device_vq(kvm, &sdev->vdev, queue, VIRTIO_SCSI_QUEUE_SIZE,
+			      page_size, align, pfn);
 
 	if (sdev->vhost_fd == 0)
 		return 0;
-- 
2.12.1
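
For reference, vring_init() lays the descriptor table, available ring
and used ring out in a single guest-contiguous region, roughly as below
(adapted from linux/virtio_ring.h; the exact definition varies between
versions):

    static inline void vring_init(struct vring *vr, unsigned int num,
                                  void *p, unsigned long align)
    {
            vr->num = num;
            vr->desc = p;
            vr->avail = (struct vring_avail *)((char *)p +
                                num * sizeof(struct vring_desc));
            vr->used = (void *)(((unsigned long)&vr->avail->ring[num] +
                                sizeof(u16) + align - 1) & ~(align - 1));
    }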

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [RFC PATCH kvmtool 08/15] virtio: add vIOMMU instance for virtio devices
  2017-04-07 19:24 ` [RFC PATCH kvmtool 00/15] Add virtio-iommu Jean-Philippe Brucker
                     ` (13 preceding siblings ...)
  2017-04-07 19:24   ` Jean-Philippe Brucker
@ 2017-04-07 19:24   ` Jean-Philippe Brucker
  2017-04-07 19:24   ` Jean-Philippe Brucker
                     ` (16 subsequent siblings)
  31 siblings, 0 replies; 99+ messages in thread
From: Jean-Philippe Brucker @ 2017-04-07 19:24 UTC (permalink / raw)
  To: iommu, kvm, virtualization, virtio-dev
  Cc: cdall, will.deacon, robin.murphy, lorenzo.pieralisi, joro, mst,
	jasowang, alex.williamson, marc.zyngier

Virtio devices can now opt in to using an IOMMU, by setting the use_iommu
field. None of this works yet, since virtio devices still access memory
linearly; a subsequent patch implements scatter-gather accesses.

Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
---
 include/kvm/virtio-mmio.h |  1 +
 include/kvm/virtio-pci.h  |  1 +
 include/kvm/virtio.h      | 13 ++++++++++++
 virtio/core.c             | 52 +++++++++++++++++++++++++++++++++++++++++++++++
 virtio/mmio.c             | 27 ++++++++++++++++++++++++
 virtio/pci.c              | 26 ++++++++++++++++++++++++
 6 files changed, 120 insertions(+)

diff --git a/include/kvm/virtio-mmio.h b/include/kvm/virtio-mmio.h
index 835f421b..c25a4fd7 100644
--- a/include/kvm/virtio-mmio.h
+++ b/include/kvm/virtio-mmio.h
@@ -44,6 +44,7 @@ struct virtio_mmio_hdr {
 struct virtio_mmio {
 	u32			addr;
 	void			*dev;
+	struct virtio_device	*vdev;
 	struct kvm		*kvm;
 	u8			irq;
 	struct virtio_mmio_hdr	hdr;
diff --git a/include/kvm/virtio-pci.h b/include/kvm/virtio-pci.h
index b70cadd8..26772f74 100644
--- a/include/kvm/virtio-pci.h
+++ b/include/kvm/virtio-pci.h
@@ -22,6 +22,7 @@ struct virtio_pci {
 	struct pci_device_header pci_hdr;
 	struct device_header	dev_hdr;
 	void			*dev;
+	struct virtio_device	*vdev;
 	struct kvm		*kvm;
 
 	u16			port_addr;
diff --git a/include/kvm/virtio.h b/include/kvm/virtio.h
index 24c0c487..9f2ff237 100644
--- a/include/kvm/virtio.h
+++ b/include/kvm/virtio.h
@@ -9,6 +9,7 @@
 #include <linux/types.h>
 #include <sys/uio.h>
 
+#include "kvm/iommu.h"
 #include "kvm/kvm.h"
 
 #define VIRTIO_IRQ_LOW		0
@@ -137,10 +138,12 @@ enum virtio_trans {
 };
 
 struct virtio_device {
+	bool			use_iommu;
 	bool			use_vhost;
 	void			*virtio;
 	struct virtio_ops	*ops;
 	u16			endian;
+	void			*iotlb;
 };
 
 struct virtio_ops {
@@ -182,4 +185,14 @@ static inline void virtio_init_device_vq(struct kvm *kvm,
 	vring_init(&vq->vring, nr_descs, p, align);
 }
 
+/*
+ * These are callbacks for IOMMU operations on virtio devices. They are not
+ * operations on the virtio-iommu device. Confusing, I know.
+ */
+const struct iommu_properties *
+virtio__iommu_get_properties(struct device_header *dev);
+
+int virtio__iommu_attach(void *, struct virtio_device *vdev, int flags);
+int virtio__iommu_detach(void *, struct virtio_device *vdev);
+
 #endif /* KVM__VIRTIO_H */
diff --git a/virtio/core.c b/virtio/core.c
index d6ac289d..32bd4ebc 100644
--- a/virtio/core.c
+++ b/virtio/core.c
@@ -6,11 +6,16 @@
 #include "kvm/guest_compat.h"
 #include "kvm/barrier.h"
 #include "kvm/virtio.h"
+#include "kvm/virtio-iommu.h"
 #include "kvm/virtio-pci.h"
 #include "kvm/virtio-mmio.h"
 #include "kvm/util.h"
 #include "kvm/kvm.h"
 
+static void *iommu = NULL;
+static struct iommu_properties iommu_props = {
+	.name		= "viommu-virtio",
+};
 
 const char* virtio_trans_name(enum virtio_trans trans)
 {
@@ -198,6 +203,41 @@ bool virtio_queue__should_signal(struct virt_queue *vq)
 	return false;
 }
 
+const struct iommu_properties *
+virtio__iommu_get_properties(struct device_header *dev)
+{
+	return &iommu_props;
+}
+
+int virtio__iommu_attach(void *priv, struct virtio_device *vdev, int flags)
+{
+	struct virtio_tlb *iotlb = priv;
+
+	if (!iotlb)
+		return -ENOMEM;
+
+	if (vdev->iotlb) {
+		pr_err("device already attached");
+		return -EINVAL;
+	}
+
+	vdev->iotlb = iotlb;
+
+	return 0;
+}
+
+int virtio__iommu_detach(void *priv, struct virtio_device *vdev)
+{
+	if (vdev->iotlb != priv) {
+		pr_err("wrong iotlb"); /* bug */
+		return -EINVAL;
+	}
+
+	vdev->iotlb = NULL;
+
+	return 0;
+}
+
 int virtio_init(struct kvm *kvm, void *dev, struct virtio_device *vdev,
 		struct virtio_ops *ops, enum virtio_trans trans,
 		int device_id, int subsys_id, int class)
@@ -233,6 +273,18 @@ int virtio_init(struct kvm *kvm, void *dev, struct virtio_device *vdev,
 		return -1;
 	};
 
+	if (!iommu && vdev->use_iommu) {
+		iommu_props.pgsize_mask = ~(PAGE_SIZE - 1);
+		/*
+		 * With legacy MMIO, we only have 32 bits to hold the vring PFN.
+		 * This limits the IOVA size to (32 + 12) = 44 bits, when using
+		 * 4k pages.
+		 */
+		iommu_props.input_addr_size = 44;
+		iommu = viommu_register(kvm, &iommu_props);
+	}
+
+
 	return 0;
 }
 
diff --git a/virtio/mmio.c b/virtio/mmio.c
index 16b44fbb..24a14a71 100644
--- a/virtio/mmio.c
+++ b/virtio/mmio.c
@@ -1,4 +1,5 @@
 #include "kvm/devices.h"
+#include "kvm/virtio-iommu.h"
 #include "kvm/virtio-mmio.h"
 #include "kvm/ioeventfd.h"
 #include "kvm/iommu.h"
@@ -286,6 +287,30 @@ void virtio_mmio_assign_irq(struct device_header *dev_hdr)
 	vmmio->irq = irq__alloc_line();
 }
 
+#define mmio_dev_to_virtio(dev_hdr)					\
+	container_of(dev_hdr, struct virtio_mmio, dev_hdr)->vdev
+
+static int virtio_mmio_iommu_attach(void *priv, struct device_header *dev_hdr,
+				    int flags)
+{
+	return virtio__iommu_attach(priv, mmio_dev_to_virtio(dev_hdr), flags);
+}
+
+static int virtio_mmio_iommu_detach(void *priv, struct device_header *dev_hdr)
+{
+	return virtio__iommu_detach(priv, mmio_dev_to_virtio(dev_hdr));
+}
+
+static struct iommu_ops virtio_mmio_iommu_ops = {
+	.get_properties		= virtio__iommu_get_properties,
+	.alloc_address_space	= iommu_alloc_address_space,
+	.free_address_space	= iommu_free_address_space,
+	.attach			= virtio_mmio_iommu_attach,
+	.detach			= virtio_mmio_iommu_detach,
+	.map			= iommu_map,
+	.unmap			= iommu_unmap,
+};
+
 int virtio_mmio_init(struct kvm *kvm, void *dev, struct virtio_device *vdev,
 		     int device_id, int subsys_id, int class)
 {
@@ -294,6 +319,7 @@ int virtio_mmio_init(struct kvm *kvm, void *dev, struct virtio_device *vdev,
 	vmmio->addr	= virtio_mmio_get_io_space_block(VIRTIO_MMIO_IO_SIZE);
 	vmmio->kvm	= kvm;
 	vmmio->dev	= dev;
+	vmmio->vdev	= vdev;
 
 	kvm__register_mmio(kvm, vmmio->addr, VIRTIO_MMIO_IO_SIZE,
 			   false, virtio_mmio_mmio_callback, vdev);
@@ -309,6 +335,7 @@ int virtio_mmio_init(struct kvm *kvm, void *dev, struct virtio_device *vdev,
 	vmmio->dev_hdr = (struct device_header) {
 		.bus_type	= DEVICE_BUS_MMIO,
 		.data		= generate_virtio_mmio_fdt_node,
+		.iommu_ops	= vdev->use_iommu ? &virtio_mmio_iommu_ops : NULL,
 	};
 
 	device__register(&vmmio->dev_hdr);
diff --git a/virtio/pci.c b/virtio/pci.c
index b6ef389e..674d5143 100644
--- a/virtio/pci.c
+++ b/virtio/pci.c
@@ -408,6 +408,30 @@ static void virtio_pci__io_mmio_callback(struct kvm_cpu *vcpu,
 	kvm__emulate_io(vcpu, port, data, direction, len, 1);
 }
 
+#define pci_dev_to_virtio(dev_hdr)				\
+	(container_of(dev_hdr, struct virtio_pci, dev_hdr)->vdev)
+
+static int virtio_pci_iommu_attach(void *priv, struct device_header *dev_hdr,
+				   int flags)
+{
+	return virtio__iommu_attach(priv, pci_dev_to_virtio(dev_hdr), flags);
+}
+
+static int virtio_pci_iommu_detach(void *priv, struct device_header *dev_hdr)
+{
+	return virtio__iommu_detach(priv, pci_dev_to_virtio(dev_hdr));
+}
+
+static struct iommu_ops virtio_pci_iommu_ops = {
+	.get_properties		= virtio__iommu_get_properties,
+	.alloc_address_space	= iommu_alloc_address_space,
+	.free_address_space	= iommu_free_address_space,
+	.attach			= virtio_pci_iommu_attach,
+	.detach			= virtio_pci_iommu_detach,
+	.map			= iommu_map,
+	.unmap			= iommu_unmap,
+};
+
 int virtio_pci__init(struct kvm *kvm, void *dev, struct virtio_device *vdev,
 		     int device_id, int subsys_id, int class)
 {
@@ -416,6 +440,7 @@ int virtio_pci__init(struct kvm *kvm, void *dev, struct virtio_device *vdev,
 
 	vpci->kvm = kvm;
 	vpci->dev = dev;
+	vpci->vdev = vdev;
 
 	r = ioport__register(kvm, IOPORT_EMPTY, &virtio_pci__io_ops, IOPORT_SIZE, vdev);
 	if (r < 0)
@@ -461,6 +486,7 @@ int virtio_pci__init(struct kvm *kvm, void *dev, struct virtio_device *vdev,
 	vpci->dev_hdr = (struct device_header) {
 		.bus_type		= DEVICE_BUS_PCI,
 		.data			= &vpci->pci_hdr,
+		.iommu_ops		= vdev->use_iommu ? &virtio_pci_iommu_ops : NULL,
 	};
 
 	vpci->pci_hdr.msix.cap = PCI_CAP_ID_MSIX;
-- 
2.12.1
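
A device opts in before registering itself with virtio_init(), whose
signature appears in the diff above. For example (a sketch: the helper,
the transport constant and the ID/class arguments are illustrative):

    static int virtio_net__init_iommu(struct kvm *kvm, struct net_dev *ndev)
    {
            /* Request translation by the vIOMMU for this device */
            ndev->vdev.use_iommu = true;

            return virtio_init(kvm, ndev, &ndev->vdev, &net_dev_virtio_ops,
                               VIRTIO_MMIO, VIRTIO_ID_NET, VIRTIO_ID_NET,
                               PCI_CLASS_NET);
    }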

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [RFC PATCH kvmtool 09/15] virtio: access vring and buffers through IOMMU mappings
  2017-04-07 19:24 ` [RFC PATCH kvmtool 00/15] Add virtio-iommu Jean-Philippe Brucker
                     ` (16 preceding siblings ...)
  2017-04-07 19:24   ` [RFC PATCH kvmtool 09/15] virtio: access vring and buffers through IOMMU mappings Jean-Philippe Brucker
@ 2017-04-07 19:24   ` Jean-Philippe Brucker
  2017-04-07 19:24   ` [RFC PATCH kvmtool 10/15] virtio-pci: translate MSIs with the virtual IOMMU Jean-Philippe Brucker
                     ` (13 subsequent siblings)
  31 siblings, 0 replies; 99+ messages in thread
From: Jean-Philippe Brucker @ 2017-04-07 19:24 UTC (permalink / raw)
  To: iommu, kvm, virtualization, virtio-dev
  Cc: cdall, will.deacon, robin.murphy, lorenzo.pieralisi, joro, mst,
	jasowang, alex.williamson, marc.zyngier

Teach the virtio core how to access scattered vring structures. When
presenting a virtual IOMMU to the guest in front of virtio devices, the
virtio ring and buffers will be scattered across discontiguous guest-
physical pages. The device has to translate all IOVAs to host-virtual
addresses and gather the pages before accessing any structure.

Buffers described by vring.desc are already returned to the device via an
iovec. We simply have to fill them at a finer granularity and hope that:

1. The driver doesn't provide too many descriptors at a time, since the
   iovec is only as big as the number of descriptors and an overflow is now
   possible.

2. The device doesn't make assumptions about message framing from vectors
   (i.e. a message may now span more vectors than before). Such assumptions
   are forbidden by virtio 1.0 (and by legacy with ANY_LAYOUT), but our
   virtio-net, for instance, assumes that the first vector always contains
   a full vnet header. In practice it's fine, but it remains extremely fragile.

For accessing vring and indirect descriptor tables, we now allocate an
iovec describing the IOMMU mappings of the structure, and make all
accesses via this iovec.

                                  ***

A more elegant way to do it would be to create a subprocess per
address-space, and remap fragments of guest memory in a contiguous manner:

                                .---- virtio-blk process
                               /
           viommu process ----+------ virtio-net process
                               \
                                '---- some other device

(0) Initially, parent forks for each emulated device. Each child reserves
    a large chunk of virtual memory with mmap (base), representing the
    IOVA space, but doesn't populate it.
(1) virtio-dev wants to access guest memory, for instance read the vring.
    It sends a TLB miss for an IOVA to the parent via pipe or socket.
(2) Parent viommu checks its translation table, and returns an offset in
    guest memory.
(3) Child does a mmap in its IOVA space, using the fd that backs guest
    memory: mmap(base + iova, pgsize, SHARED|FIXED, fd, offset)

This would be really cool, but I suspect it adds a lot of complexity,
since it's not clear which devices are entirely self-contained and which
need to access parent memory. So stay with scatter-gather accesses for
now.
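
To make the translation concrete, here is how the virtio_access_sg()
helper added below resolves a pointer into a vring scattered over two
pages (an illustration with made-up addresses):

    /* IOVA 0x10000 maps to two discontiguous host-virtual fragments */
    struct iovec vring_sg[2] = {
            { .iov_base = (void *)0x7f0000000000, .iov_len = 0x1000 },
            { .iov_base = (void *)0x7f8000000000, .iov_len = 0x1000 },
    };
    void *base = (void *)0x10000;       /* IOVA of vring.desc */
    void *ptr  = (char *)base + 0x1800; /* 0x800 bytes into the 2nd page */

    /* Skips iov[0] (0x1000 bytes), returns 0x7f8000000000 + 0x800 */
    void *hva = virtio_access_sg(vring_sg, 2, base, ptr);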

Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
---
 include/kvm/virtio.h | 108 +++++++++++++++++++++++++++++--
 virtio/core.c        | 179 ++++++++++++++++++++++++++++++++++++++++++---------
 2 files changed, 252 insertions(+), 35 deletions(-)

diff --git a/include/kvm/virtio.h b/include/kvm/virtio.h
index 9f2ff237..cdc960cd 100644
--- a/include/kvm/virtio.h
+++ b/include/kvm/virtio.h
@@ -29,12 +29,16 @@
 
 struct virt_queue {
 	struct vring	vring;
+	struct iovec	*vring_sg;
+	size_t		vring_nr_sg;
 	u32		pfn;
 	/* The last_avail_idx field is an index to ->ring of struct vring_avail.
 	   It's where we assume the next request index is at.  */
 	u16		last_avail_idx;
 	u16		last_used_signalled;
 	u16		endian;
+
+	struct virtio_device *vdev;
 };
 
 /*
@@ -96,26 +100,91 @@ static inline __u64 __virtio_h2g_u64(u16 endian, __u64 val)
 
 #endif
 
+void *virtio_guest_access(struct kvm *kvm, struct virtio_device *vdev,
+			  u64 addr, size_t size, size_t *out_size, int prot);
+int virtio_populate_sg(struct kvm *kvm, struct virtio_device *vdev, u64 addr,
+		       size_t size, int prot, u16 cur_sg, u16 max_sg,
+		       struct iovec iov[]);
+
+/*
+ * Access element in a virtio structure. If @iov is NULL, access is linear and
+ * @ptr represents a Host-Virtual Address (HVA).
+ *
+ * Otherwise, the structure is scattered in the guest-physical space, and is
+ * made virtually-contiguous by the virtual IOMMU. @iov describes the
+ * structure's IOVA->HVA fragments, @base is the IOVA of the structure, and @ptr
+ * an IOVA inside the structure. @max is the number of elements in @iov.
+ *
+ *                                        HVA
+ *                      IOVA      .----> +---+ iov[0].base
+ *              @base-> +---+ ----'      |   |
+ *                      |   |            +---+
+ *                      +---+ ----.      :   :
+ *                      |   |     '----> +---+ iov[1].base
+ *               @ptr-> |   |            |   |
+ *                      +---+            |   |--> out
+ *                                       +---+
+ */
+static void *virtio_access_sg(struct iovec *iov, int max, void *base, void *ptr)
+{
+	int i;
+	size_t off = ptr - base;
+
+	if (!iov)
+		return ptr;
+
+	for (i = 0; i < max; i++) {
+		size_t sz = iov[i].iov_len;
+		if (off < sz)
+			return iov[i].iov_base + off;
+		off -= sz;
+	}
+
+	pr_err("virtio_access_sg overflow");
+	return NULL;
+}
+
+/*
+ * We only implement legacy vhost, so vring is a single virtually-contiguous
+ * structure starting at the descriptor table. Differentiation of accesses
+ * allows to ease a future move to virtio 1.0.
+ */
+#define vring_access_avail(vq, ptr)	\
+	virtio_access_sg(vq->vring_sg, vq->vring_nr_sg, vq->vring.desc, ptr)
+#define vring_access_desc(vq, ptr)	\
+	virtio_access_sg(vq->vring_sg, vq->vring_nr_sg, vq->vring.desc, ptr)
+#define vring_access_used(vq, ptr)	\
+	virtio_access_sg(vq->vring_sg, vq->vring_nr_sg, vq->vring.desc, ptr)
+
 static inline u16 virt_queue__pop(struct virt_queue *queue)
 {
+	void *ptr;
 	__u16 guest_idx;
 
-	guest_idx = queue->vring.avail->ring[queue->last_avail_idx++ % queue->vring.num];
+	ptr = &queue->vring.avail->ring[queue->last_avail_idx++ % queue->vring.num];
+	guest_idx = *(u16 *)vring_access_avail(queue, ptr);
+
 	return virtio_guest_to_host_u16(queue, guest_idx);
 }
 
 static inline struct vring_desc *virt_queue__get_desc(struct virt_queue *queue, u16 desc_ndx)
 {
-	return &queue->vring.desc[desc_ndx];
+	return vring_access_desc(queue, &queue->vring.desc[desc_ndx]);
 }
 
 static inline bool virt_queue__available(struct virt_queue *vq)
 {
+	u16 *evt, *idx;
+
 	if (!vq->vring.avail)
 		return 0;
 
-	vring_avail_event(&vq->vring) = virtio_host_to_guest_u16(vq, vq->last_avail_idx);
-	return virtio_guest_to_host_u16(vq, vq->vring.avail->idx) != vq->last_avail_idx;
+	/* Disgusting casts under the hood: &(*&used[size]) */
+	evt = vring_access_used(vq, &vring_avail_event(&vq->vring));
+	idx = vring_access_avail(vq, &vq->vring.avail->idx);
+
+	*evt = virtio_host_to_guest_u16(vq, vq->last_avail_idx);
+	return virtio_guest_to_host_u16(vq, *idx) != vq->last_avail_idx;
 }
 
 void virt_queue__used_idx_advance(struct virt_queue *queue, u16 jump);
@@ -177,10 +246,39 @@ static inline void virtio_init_device_vq(struct kvm *kvm,
 					 struct virt_queue *vq, size_t nr_descs,
 					 u32 page_size, u32 align, u32 pfn)
 {
-	void *p		= guest_flat_to_host(kvm, (u64)pfn * page_size);
+	void *p;
 
 	vq->endian	= vdev->endian;
 	vq->pfn		= pfn;
+	vq->vdev	= vdev;
+	vq->vring_sg	= NULL;
+
+	if (vdev->iotlb) {
+		u64 addr = (u64)pfn * page_size;
+		size_t size = vring_size(nr_descs, align);
+		/* Our IOMMU maps at PAGE_SIZE granularity */
+		size_t nr_sg = size / PAGE_SIZE;
+		int flags = IOMMU_PROT_READ | IOMMU_PROT_WRITE;
+
+		vq->vring_sg = calloc(nr_sg, sizeof(struct iovec));
+		if (!vq->vring_sg) {
+			pr_err("could not allocate vring_sg");
+			return; /* Explode later. */
+		}
+
+		vq->vring_nr_sg = virtio_populate_sg(kvm, vdev, addr, size,
+						     flags, 0, nr_sg,
+						     vq->vring_sg);
+		if (!vq->vring_nr_sg) {
+			pr_err("could not map vring");
+			free(vq->vring_sg);
+			/* Don't leave a dangling pointer around */
+			vq->vring_sg = NULL;
+		}
+
+		/* vring is described with its IOVA */
+		p = (void *)addr;
+	} else {
+		p = guest_flat_to_host(kvm, (u64)pfn * page_size);
+	}
 
 	vring_init(&vq->vring, nr_descs, p, align);
 }
diff --git a/virtio/core.c b/virtio/core.c
index 32bd4ebc..ba35e5f1 100644
--- a/virtio/core.c
+++ b/virtio/core.c
@@ -28,7 +28,8 @@ const char* virtio_trans_name(enum virtio_trans trans)
 
 void virt_queue__used_idx_advance(struct virt_queue *queue, u16 jump)
 {
-	u16 idx = virtio_guest_to_host_u16(queue, queue->vring.used->idx);
+	u16 *ptr = vring_access_used(queue, &queue->vring.used->idx);
+	u16 idx = virtio_guest_to_host_u16(queue, *ptr);
 
 	/*
 	 * Use wmb to assure that used elem was updated with head and len.
@@ -37,7 +38,7 @@ void virt_queue__used_idx_advance(struct virt_queue *queue, u16 jump)
 	 */
 	wmb();
 	idx += jump;
-	queue->vring.used->idx = virtio_host_to_guest_u16(queue, idx);
+	*ptr = virtio_host_to_guest_u16(queue, idx);
 
 	/*
 	 * Use wmb to assure used idx has been increased before we signal the guest.
@@ -52,10 +53,12 @@ virt_queue__set_used_elem_no_update(struct virt_queue *queue, u32 head,
 				    u32 len, u16 offset)
 {
 	struct vring_used_elem *used_elem;
-	u16 idx = virtio_guest_to_host_u16(queue, queue->vring.used->idx);
+	u16 *ptr = vring_access_used(queue, &queue->vring.used->idx);
+	u16 idx = virtio_guest_to_host_u16(queue, *ptr);
 
-	idx += offset;
-	used_elem	= &queue->vring.used->ring[idx % queue->vring.num];
+	idx = (idx + offset) % queue->vring.num;
+
+	used_elem	= vring_access_used(queue, &queue->vring.used->ring[idx]);
 	used_elem->id	= virtio_host_to_guest_u32(queue, head);
 	used_elem->len	= virtio_host_to_guest_u32(queue, len);
 
@@ -84,16 +87,17 @@ static inline bool virt_desc__test_flag(struct virt_queue *vq,
  * at the end.
  */
 static unsigned next_desc(struct virt_queue *vq, struct vring_desc *desc,
-			  unsigned int i, unsigned int max)
+			  unsigned int max)
 {
 	unsigned int next;
 
 	/* If this descriptor says it doesn't chain, we're done. */
-	if (!virt_desc__test_flag(vq, &desc[i], VRING_DESC_F_NEXT))
+	if (!virt_desc__test_flag(vq, desc, VRING_DESC_F_NEXT))
 		return max;
 
+	next = virtio_guest_to_host_u16(vq, desc->next);
 	/* Check they're not leading us off end of descriptors. */
-	next = virtio_guest_to_host_u16(vq, desc[i].next);
+	next = min(next, max);
 	/* Make sure compiler knows to grab that: we don't want it changing! */
 	wmb();
 
@@ -102,32 +106,76 @@ static unsigned next_desc(struct virt_queue *vq, struct vring_desc *desc,
 
 u16 virt_queue__get_head_iov(struct virt_queue *vq, struct iovec iov[], u16 *out, u16 *in, u16 head, struct kvm *kvm)
 {
-	struct vring_desc *desc;
+	struct vring_desc *desc_base, *desc;
+	bool indirect, is_write;
+	struct iovec *desc_sg;
+	size_t len, nr_sg;
+	u64 addr;
 	u16 idx;
 	u16 max;
 
 	idx = head;
 	*out = *in = 0;
 	max = vq->vring.num;
-	desc = vq->vring.desc;
+	desc_base = vq->vring.desc;
+	desc_sg = vq->vring_sg;
+	nr_sg = vq->vring_nr_sg;
+
+	desc = vring_access_desc(vq, &desc_base[idx]);
+	indirect = virt_desc__test_flag(vq, desc, VRING_DESC_F_INDIRECT);
+	if (indirect) {
+		len = virtio_guest_to_host_u32(vq, desc->len);
+		max = len / sizeof(struct vring_desc);
+		addr = virtio_guest_to_host_u64(vq, desc->addr);
+		if (desc_sg) {
+			desc_sg = calloc(len / PAGE_SIZE + 1, sizeof(struct iovec));
+			if (!desc_sg)
+				return 0;
+
+			nr_sg = virtio_populate_sg(kvm, vq->vdev, addr, len,
+						   IOMMU_PROT_READ, 0, max,
+						   desc_sg);
+			if (!nr_sg) {
+				pr_err("failed to populate indirect table");
+				free(desc_sg);
+				return 0;
+			}
+
+			desc_base = (void *)addr;
+		} else {
+			desc_base = guest_flat_to_host(kvm, addr);
+		}
 
-	if (virt_desc__test_flag(vq, &desc[idx], VRING_DESC_F_INDIRECT)) {
-		max = virtio_guest_to_host_u32(vq, desc[idx].len) / sizeof(struct vring_desc);
-		desc = guest_flat_to_host(kvm, virtio_guest_to_host_u64(vq, desc[idx].addr));
 		idx = 0;
 	}
 
 	do {
+		u16 nr_io;
+
+		desc = virtio_access_sg(desc_sg, nr_sg, desc_base, &desc_base[idx]);
+		is_write = virt_desc__test_flag(vq, desc, VRING_DESC_F_WRITE);
+
 		/* Grab the first descriptor, and check it's OK. */
-		iov[*out + *in].iov_len = virtio_guest_to_host_u32(vq, desc[idx].len);
-		iov[*out + *in].iov_base = guest_flat_to_host(kvm,
-							      virtio_guest_to_host_u64(vq, desc[idx].addr));
+		len = virtio_guest_to_host_u32(vq, desc->len);
+		addr = virtio_guest_to_host_u64(vq, desc->addr);
+
+		/*
+		 * Dodgy assumption alert: the device provides vring.num iovecs.
+		 * True in practice, but it is not obligated to do so.
+		 */
+		nr_io = virtio_populate_sg(kvm, vq->vdev, addr, len, is_write ?
+					   IOMMU_PROT_WRITE : IOMMU_PROT_READ,
+					   *out + *in, vq->vring.num, iov);
+
 		/* If this is an input descriptor, increment that count. */
-		if (virt_desc__test_flag(vq, &desc[idx], VRING_DESC_F_WRITE))
-			(*in)++;
+		if (is_write)
+			(*in) += nr_io;
 		else
-			(*out)++;
-	} while ((idx = next_desc(vq, desc, idx, max)) != max);
+			(*out) += nr_io;
+	} while ((idx = next_desc(vq, desc, max)) != max);
+
+	if (indirect && desc_sg)
+		free(desc_sg);
 
 	return head;
 }
@@ -147,23 +195,35 @@ u16 virt_queue__get_inout_iov(struct kvm *kvm, struct virt_queue *queue,
 			      u16 *in, u16 *out)
 {
 	struct vring_desc *desc;
+	struct iovec *iov;
 	u16 head, idx;
+	bool is_write;
+	size_t len;
+	u64 addr;
+	int prot;
+	u16 *cur;
 
 	idx = head = virt_queue__pop(queue);
 	*out = *in = 0;
 	do {
-		u64 addr;
 		desc = virt_queue__get_desc(queue, idx);
+		is_write = virt_desc__test_flag(queue, desc, VRING_DESC_F_WRITE);
+		len = virtio_guest_to_host_u32(queue, desc->len);
 		addr = virtio_guest_to_host_u64(queue, desc->addr);
-		if (virt_desc__test_flag(queue, desc, VRING_DESC_F_WRITE)) {
-			in_iov[*in].iov_base = guest_flat_to_host(kvm, addr);
-			in_iov[*in].iov_len = virtio_guest_to_host_u32(queue, desc->len);
-			(*in)++;
+		if (is_write) {
+			prot = IOMMU_PROT_WRITE;
+			iov = in_iov;
+			cur = in;
 		} else {
-			out_iov[*out].iov_base = guest_flat_to_host(kvm, addr);
-			out_iov[*out].iov_len = virtio_guest_to_host_u32(queue, desc->len);
-			(*out)++;
+			prot = IOMMU_PROT_READ;
+			iov = out_iov;
+			cur = out;
 		}
+
+		/* Dodgy assumption alert: the device provides vring.num iovecs */
+		*cur += virtio_populate_sg(kvm, queue->vdev, addr, len, prot,
+					   *cur, queue->vring.num, iov);
+
 		if (virt_desc__test_flag(queue, desc, VRING_DESC_F_NEXT))
 			idx = virtio_guest_to_host_u16(queue, desc->next);
 		else
@@ -191,9 +251,12 @@ bool virtio_queue__should_signal(struct virt_queue *vq)
 {
 	u16 old_idx, new_idx, event_idx;
 
+	u16 *new_ptr	= vring_access_used(vq, &vq->vring.used->idx);
+	u16 *event_ptr	= vring_access_avail(vq, &vring_used_event(&vq->vring));
+
 	old_idx		= vq->last_used_signalled;
-	new_idx		= virtio_guest_to_host_u16(vq, vq->vring.used->idx);
-	event_idx	= virtio_guest_to_host_u16(vq, vring_used_event(&vq->vring));
+	new_idx		= virtio_guest_to_host_u16(vq, *new_ptr);
+	event_idx	= virtio_guest_to_host_u16(vq, *event_ptr);
 
 	if (vring_need_event(event_idx, new_idx, old_idx)) {
 		vq->last_used_signalled = new_idx;
@@ -238,6 +301,62 @@ int virtio__iommu_detach(void *priv, struct virtio_device *vdev)
 	return 0;
 }
 
+void *virtio_guest_access(struct kvm *kvm, struct virtio_device *vdev,
+			  u64 addr, size_t size, size_t *out_size, int prot)
+{
+	u64 paddr;
+
+	if (!vdev->iotlb) {
+		*out_size = size;
+		paddr = addr;
+	} else {
+		paddr = iommu_access(vdev->iotlb, addr, size, out_size, prot);
+	}
+
+	return guest_flat_to_host(kvm, paddr);
+}
+
+/*
+ * Fill @iov starting at index @cur_vec with translations of the (@addr, @size)
+ * range. If @vdev doesn't have a tlb, fill a single vector with the
+ * corresponding HVA. Otherwise, fill vectors with IOVA->GPA->HVA
+ * translations. Since the IOVA range may span multiple IOMMU mappings,
+ * multiple vectors may be needed. @nr_vec is the size of the @iov array.
+ */
+int virtio_populate_sg(struct kvm *kvm, struct virtio_device *vdev, u64 addr,
+		       size_t size, int prot, u16 cur_vec, u16 nr_vec,
+		       struct iovec iov[])
+{
+	void *ptr;
+	int vec = cur_vec;
+	size_t consumed = 0;
+
+	while (size > 0 && vec < nr_vec) {
+		ptr = virtio_guest_access(kvm, vdev, addr, size, &consumed,
+					  prot);
+		if (!ptr)
+			break;
+
+		iov[vec].iov_len = consumed;
+		iov[vec].iov_base = ptr;
+
+		size -= consumed;
+		addr += consumed;
+		vec++;
+	}
+
+	if (cur_vec == nr_vec && size)
+		/*
+		 * This is bad. Devices used to offer as many iovecs as vring
+		 * descriptors, so there was no chance of filling up the array.
+		 * But with the IOMMU, buffers may be fragmented and use
+		 * multiple iovecs per descriptor.
+		 */
+		pr_err("reached end of iovec, incomplete buffer");
+
+	return vec - cur_vec;
+}
+
 int virtio_init(struct kvm *kvm, void *dev, struct virtio_device *vdev,
 		struct virtio_ops *ops, enum virtio_trans trans,
 		int device_id, int subsys_id, int class)
-- 
2.12.1

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [RFC PATCH kvmtool 10/15] virtio-pci: translate MSIs with the virtual IOMMU
  2017-04-07 19:24 ` [RFC PATCH kvmtool 00/15] Add virtio-iommu Jean-Philippe Brucker
                     ` (17 preceding siblings ...)
  2017-04-07 19:24   ` Jean-Philippe Brucker
@ 2017-04-07 19:24   ` Jean-Philippe Brucker
  2017-04-07 19:24   ` Jean-Philippe Brucker
                     ` (12 subsequent siblings)
  31 siblings, 0 replies; 99+ messages in thread
From: Jean-Philippe Brucker @ 2017-04-07 19:24 UTC (permalink / raw)
  To: iommu, kvm, virtualization, virtio-dev
  Cc: cdall, will.deacon, robin.murphy, lorenzo.pieralisi, joro, mst,
	jasowang, alex.williamson, marc.zyngier

When the virtio device is behind a virtual IOMMU, the doorbell address
written into the MSI-X table by the guest is an IOVA, not a physical one.
When injecting an MSI, KVM needs a physical address to recognize the
doorbell and the associated IRQ chip. Translate the address given by the
guest into a physical one, and store it in a secondary table for easy
access.
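
As a toy illustration of the fixup (all addresses made up, types
simplified): if the guest writes IOVA 0x1_0000_0040 into the MSI-X entry
and the vIOMMU maps it to the 0x0802_0000 doorbell page, the stored
message must become 0x0802_0040:

  #include <stdint.h>

  struct msi_msg { uint32_t address_lo, address_hi, data; };

  static void translate_doorbell(struct msi_msg *msg)
  {
  	/* Recombine the 32-bit halves into the 64-bit IOVA */
  	uint64_t addr = ((uint64_t)msg->address_hi << 32) | msg->address_lo;

  	/* Stand-in for the iommu_access() lookup in the patch below */
  	addr = addr - 0x100000000ULL + 0x08020000ULL;

  	/* Store the physical address KVM needs to recognize the doorbell */
  	msg->address_lo = addr & 0xffffffff;
  	msg->address_hi = addr >> 32;
  }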

Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
---
 include/kvm/iommu.h      |  4 ++++
 include/kvm/virtio-pci.h |  1 +
 iommu.c                  | 23 +++++++++++++++++++++++
 virtio/pci.c             | 33 ++++++++++++++++++++++++---------
 4 files changed, 52 insertions(+), 9 deletions(-)

diff --git a/include/kvm/iommu.h b/include/kvm/iommu.h
index 4164ba20..8f87ce5a 100644
--- a/include/kvm/iommu.h
+++ b/include/kvm/iommu.h
@@ -70,4 +70,8 @@ int iommu_unmap(void *address_space, u64 virt_addr, u64 size, int flags);
 u64 iommu_access(void *address_space, u64 addr, size_t size, size_t *out_size,
 		 int prot);
 
+struct msi_msg;
+
+int iommu_translate_msi(void *address_space, struct msi_msg *msi);
+
 #endif /* KVM_IOMMU_H */
diff --git a/include/kvm/virtio-pci.h b/include/kvm/virtio-pci.h
index 26772f74..cb5225d6 100644
--- a/include/kvm/virtio-pci.h
+++ b/include/kvm/virtio-pci.h
@@ -47,6 +47,7 @@ struct virtio_pci {
 	u32			msix_io_block;
 	u64			msix_pba;
 	struct msix_table	msix_table[VIRTIO_PCI_MAX_VQ + VIRTIO_PCI_MAX_CONFIG];
+	struct msi_msg		msix_msgs[VIRTIO_PCI_MAX_VQ + VIRTIO_PCI_MAX_CONFIG];
 
 	/* virtio queue */
 	u16			queue_selector;
diff --git a/iommu.c b/iommu.c
index 0a662404..c10a3f0b 100644
--- a/iommu.c
+++ b/iommu.c
@@ -5,6 +5,7 @@
 
 #include "kvm/iommu.h"
 #include "kvm/kvm.h"
+#include "kvm/msi.h"
 #include "kvm/mutex.h"
 #include "kvm/rbtree-interval.h"
 
@@ -160,3 +161,25 @@ out_unlock:
 
 	return out_addr;
 }
+
+int iommu_translate_msi(void *address_space, struct msi_msg *msg)
+{
+	size_t size = 4, out_size;
+	u64 addr = ((u64)msg->address_hi << 32) | msg->address_lo;
+
+	if (!address_space)
+		return 0;
+
+	addr = iommu_access(address_space, addr, size, &out_size,
+			    IOMMU_PROT_WRITE);
+
+	if (!addr || out_size != size) {
+		pr_err("could not translate MSI doorbell");
+		return -EFAULT;
+	}
+
+	msg->address_lo = addr & 0xffffffff;
+	msg->address_hi = addr >> 32;
+
+	return 0;
+}
diff --git a/virtio/pci.c b/virtio/pci.c
index 674d5143..88b1a129 100644
--- a/virtio/pci.c
+++ b/virtio/pci.c
@@ -156,6 +156,7 @@ static void update_msix_map(struct virtio_pci *vpci,
 			    struct msix_table *msix_entry, u32 vecnum)
 {
 	u32 gsi, i;
+	struct msi_msg *msg;
 
 	/* Find the GSI number used for that vector */
 	if (vecnum == vpci->config_vector) {
@@ -172,14 +173,20 @@ static void update_msix_map(struct virtio_pci *vpci,
 	if (gsi == 0)
 		return;
 
-	msix_entry = &msix_entry[vecnum];
-	irq__update_msix_route(vpci->kvm, gsi, &msix_entry->msg);
+	msg = &vpci->msix_msgs[vecnum];
+	*msg = msix_entry[vecnum].msg;
+
+	if (iommu_translate_msi(vpci->vdev->iotlb, msg))
+		return;
+
+	irq__update_msix_route(vpci->kvm, gsi, msg);
 }
 
 static bool virtio_pci__specific_io_out(struct kvm *kvm, struct virtio_device *vdev, u16 port,
 					void *data, int size, int offset)
 {
 	struct virtio_pci *vpci = vdev->virtio;
+	struct msi_msg *msg;
 	u32 config_offset, vec;
 	int gsi;
 	int type = virtio__get_dev_specific_field(offset - 20, virtio_pci__msix_enabled(vpci),
@@ -191,8 +198,12 @@ static bool virtio_pci__specific_io_out(struct kvm *kvm, struct virtio_device *v
 			if (vec == VIRTIO_MSI_NO_VECTOR)
 				break;
 
-			gsi = irq__add_msix_route(kvm,
-						  &vpci->msix_table[vec].msg,
+			msg = &vpci->msix_msgs[vec];
+			*msg = vpci->msix_table[vec].msg;
+			if (iommu_translate_msi(vdev->iotlb, msg))
+				break;
+
+			gsi = irq__add_msix_route(kvm, msg,
 						  vpci->dev_hdr.dev_num << 3);
 			if (gsi >= 0) {
 				vpci->config_gsi = gsi;
@@ -210,8 +221,12 @@ static bool virtio_pci__specific_io_out(struct kvm *kvm, struct virtio_device *v
 			if (vec == VIRTIO_MSI_NO_VECTOR)
 				break;
 
-			gsi = irq__add_msix_route(kvm,
-						  &vpci->msix_table[vec].msg,
+			msg = &vpci->msix_msgs[vec];
+			*msg = vpci->msix_table[vec].msg;
+			if (iommu_translate_msi(vdev->iotlb, msg))
+				break;
+
+			gsi = irq__add_msix_route(kvm, msg,
 						  vpci->dev_hdr.dev_num << 3);
 			if (gsi < 0) {
 				if (gsi == -ENXIO &&
@@ -328,9 +343,9 @@ static void virtio_pci__signal_msi(struct kvm *kvm, struct virtio_pci *vpci,
 {
 	static int needs_devid = 0;
 	struct kvm_msi msi = {
-		.address_lo = vpci->msix_table[vec].msg.address_lo,
-		.address_hi = vpci->msix_table[vec].msg.address_hi,
-		.data = vpci->msix_table[vec].msg.data,
+		.address_lo = vpci->msix_msgs[vec].address_lo,
+		.address_hi = vpci->msix_msgs[vec].address_hi,
+		.data = vpci->msix_msgs[vec].data,
 	};
 
 	if (needs_devid == 0) {
-- 
2.12.1

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [RFC PATCH kvmtool 11/15] virtio: set VIRTIO_F_IOMMU_PLATFORM when necessary
  2017-04-07 19:24 ` [RFC PATCH kvmtool 00/15] Add virtio-iommu Jean-Philippe Brucker
                     ` (20 preceding siblings ...)
  2017-04-07 19:24   ` [RFC PATCH kvmtool 11/15] virtio: set VIRTIO_F_IOMMU_PLATFORM when necessary Jean-Philippe Brucker
@ 2017-04-07 19:24   ` Jean-Philippe Brucker
  2017-04-07 19:24   ` [RFC PATCH kvmtool 12/15] vfio: add support for virtual IOMMU Jean-Philippe Brucker
                     ` (9 subsequent siblings)
  31 siblings, 0 replies; 99+ messages in thread
From: Jean-Philippe Brucker @ 2017-04-07 19:24 UTC (permalink / raw)
  To: iommu, kvm, virtualization, virtio-dev
  Cc: cdall, will.deacon, robin.murphy, lorenzo.pieralisi, joro, mst,
	jasowang, alex.williamson, marc.zyngier

Pass the VIRTIO_F_IOMMU_PLATFORM feature bit to tell the guest when a
device is behind an IOMMU.

Other feature bits in virtio do not depend on the device type and could be
factored the same way. For instance our vring implementation always
supports indirect descriptors (VIRTIO_RING_F_INDIRECT_DESC), so we could
advertise it for all devices at once (only net, scsi and blk at the
moment). However, this might modify guest behaviour: in Linux, whenever
the driver attempts to add a chain of descriptors, it will allocate an
indirect table and use a single ring descriptor instead, which might
slightly reduce performance. Cowardly ignore this.

VIRTIO_RING_F_EVENT_IDX is another vring feature, but that one requires
the device to call virtio_queue__should_signal before signalling the
guest. Arguably we could factor all calls to signal_vq, but let's keep
this patch simple.
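
For reference, the factoring amounts to a single core-owned helper merged
into each transport's feature read. A simplified sketch (the struct and
the feature mask value are placeholders for this note; the real code is
in the hunks below):

  #include <stdint.h>
  #include <stdbool.h>

  #define F_IOMMU_PLATFORM	(1u << 1)	/* placeholder mask */

  struct vdev { bool use_iommu; uint32_t dev_features; };

  static uint32_t host_features(const struct vdev *vdev)
  {
  	/* Per-device bits, plus bits owned by the virtio core */
  	return vdev->dev_features |
  	       (vdev->use_iommu ? F_IOMMU_PLATFORM : 0);
  }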

Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
---
 include/kvm/virtio.h | 2 ++
 virtio/core.c        | 6 ++++++
 virtio/mmio.c        | 4 +++-
 virtio/pci.c         | 1 +
 4 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/include/kvm/virtio.h b/include/kvm/virtio.h
index cdc960cd..97bd5bdb 100644
--- a/include/kvm/virtio.h
+++ b/include/kvm/virtio.h
@@ -293,4 +293,6 @@ virtio__iommu_get_properties(struct device_header *dev);
 int virtio__iommu_attach(void *, struct virtio_device *vdev, int flags);
 int virtio__iommu_detach(void *, struct virtio_device *vdev);
 
+u32 virtio_get_common_features(struct kvm *kvm, struct virtio_device *vdev);
+
 #endif /* KVM__VIRTIO_H */
diff --git a/virtio/core.c b/virtio/core.c
index ba35e5f1..66e0cecb 100644
--- a/virtio/core.c
+++ b/virtio/core.c
@@ -1,3 +1,4 @@
+#include <linux/virtio_config.h>
 #include <linux/virtio_ring.h>
 #include <linux/types.h>
 #include <sys/uio.h>
@@ -266,6 +267,11 @@ bool virtio_queue__should_signal(struct virt_queue *vq)
 	return false;
 }
 
+u32 virtio_get_common_features(struct kvm *kvm, struct virtio_device *vdev)
+{
+	return vdev->use_iommu ? VIRTIO_F_IOMMU_PLATFORM : 0;
+}
+
 const struct iommu_properties *
 virtio__iommu_get_properties(struct device_header *dev)
 {
diff --git a/virtio/mmio.c b/virtio/mmio.c
index 24a14a71..699d4403 100644
--- a/virtio/mmio.c
+++ b/virtio/mmio.c
@@ -127,9 +127,11 @@ static void virtio_mmio_config_in(struct kvm_cpu *vcpu,
 		ioport__write32(data, *(u32 *)(((void *)&vmmio->hdr) + addr));
 		break;
 	case VIRTIO_MMIO_HOST_FEATURES:
-		if (vmmio->hdr.host_features_sel == 0)
+		if (vmmio->hdr.host_features_sel == 0) {
 			val = vdev->ops->get_host_features(vmmio->kvm,
 							   vmmio->dev);
+			val |= virtio_get_common_features(vmmio->kvm, vdev);
+		}
 		ioport__write32(data, val);
 		break;
 	case VIRTIO_MMIO_QUEUE_PFN:
diff --git a/virtio/pci.c b/virtio/pci.c
index 88b1a129..c9f0e558 100644
--- a/virtio/pci.c
+++ b/virtio/pci.c
@@ -126,6 +126,7 @@ static bool virtio_pci__io_in(struct ioport *ioport, struct kvm_cpu *vcpu, u16 p
 	switch (offset) {
 	case VIRTIO_PCI_HOST_FEATURES:
 		val = vdev->ops->get_host_features(kvm, vpci->dev);
+		val |= virtio_get_common_features(kvm, vdev);
 		ioport__write32(data, val);
 		break;
 	case VIRTIO_PCI_QUEUE_PFN:
-- 
2.12.1

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [RFC PATCH kvmtool 12/15] vfio: add support for virtual IOMMU
  2017-04-07 19:24 ` [RFC PATCH kvmtool 00/15] Add virtio-iommu Jean-Philippe Brucker
                     ` (22 preceding siblings ...)
  2017-04-07 19:24   ` [RFC PATCH kvmtool 12/15] vfio: add support for virtual IOMMU Jean-Philippe Brucker
@ 2017-04-07 19:24   ` Jean-Philippe Brucker
  2017-04-07 19:24   ` [RFC PATCH kvmtool 13/15] virtio-iommu: debug via IPC Jean-Philippe Brucker
                     ` (7 subsequent siblings)
  31 siblings, 0 replies; 99+ messages in thread
From: Jean-Philippe Brucker @ 2017-04-07 19:24 UTC (permalink / raw)
  To: iommu, kvm, virtualization, virtio-dev
  Cc: cdall, will.deacon, robin.murphy, lorenzo.pieralisi, joro, mst,
	jasowang, alex.williamson, marc.zyngier

Currently all passed-through devices must access the same guest-physical
address space. Register an IOMMU to offer individual address spaces to
devices. We do this by allocating one container per group and adding
mappings on demand.

Since the guest cannot access a device unless the device is attached to a
container, and we cannot change containers at runtime without resetting
the device, this implementation is limited. To implement bypass mode, we
would need to map the whole of guest physical memory first, then unmap
everything when the device is attached to a new address space (see the
sketch below). It is also not possible for several devices to share an
address space; they all have separate page tables.
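
A minimal sketch of the bypass-mode setup described above, assuming a
single contiguous guest RAM region (ram_start/ram_size/ram_hva are
assumptions for the sketch, not kvmtool variables):

  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/vfio.h>

  /* Identity-map all of guest RAM (IOVA == GPA) into the container,
   * before the guest attaches the device to a real address space. */
  static int map_bypass(int container_fd, uint64_t ram_start,
  		      uint64_t ram_size, void *ram_hva)
  {
  	struct vfio_iommu_type1_dma_map map = {
  		.argsz	= sizeof(map),
  		.flags	= VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
  		.vaddr	= (uintptr_t)ram_hva,
  		.iova	= ram_start,
  		.size	= ram_size,
  	};

  	return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
  }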

Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
---
 include/kvm/iommu.h |   6 ++
 include/kvm/vfio.h  |   2 +
 iommu.c             |   7 +-
 vfio.c              | 281 ++++++++++++++++++++++++++++++++++++++++++++++++----
 4 files changed, 273 insertions(+), 23 deletions(-)

diff --git a/include/kvm/iommu.h b/include/kvm/iommu.h
index 8f87ce5a..45a20f3b 100644
--- a/include/kvm/iommu.h
+++ b/include/kvm/iommu.h
@@ -10,6 +10,12 @@
 #define IOMMU_PROT_WRITE	0x2
 #define IOMMU_PROT_EXEC		0x4
 
+/*
+ * Test if mapping is present. If not, return an error but do not report it to
+ * stderr
+ */
+#define IOMMU_UNMAP_SILENT	0x1
+
 struct iommu_ops {
 	const struct iommu_properties *(*get_properties)(struct device_header *);
 
diff --git a/include/kvm/vfio.h b/include/kvm/vfio.h
index 71dfa8f7..84126eb9 100644
--- a/include/kvm/vfio.h
+++ b/include/kvm/vfio.h
@@ -55,6 +55,7 @@ struct vfio_device {
 	struct device_header		dev_hdr;
 
 	int				fd;
+	struct vfio_group		*group;
 	struct vfio_device_info		info;
 	struct vfio_irq_info		irq_info;
 	struct vfio_region		*regions;
@@ -65,6 +66,7 @@ struct vfio_device {
 struct vfio_group {
 	unsigned long			id; /* iommu_group number in sysfs */
 	int				fd;
+	struct vfio_guest_container	*container;
 };
 
 int vfio_group_parser(const struct option *opt, const char *arg, int unset);
diff --git a/iommu.c b/iommu.c
index c10a3f0b..2220e4b2 100644
--- a/iommu.c
+++ b/iommu.c
@@ -85,6 +85,7 @@ int iommu_unmap(void *address_space, u64 virt_addr, u64 size, int flags)
 	struct rb_int_node *node;
 	struct iommu_mapping *map;
 	struct iommu_ioas *ioas = address_space;
+	bool silent = flags & IOMMU_UNMAP_SILENT;
 
 	if (!ioas)
 		return -ENODEV;
@@ -97,7 +98,8 @@ int iommu_unmap(void *address_space, u64 virt_addr, u64 size, int flags)
 		map = container_of(node, struct iommu_mapping, iova_range);
 
 		if (node_size > size) {
-			pr_debug("cannot split mapping");
+			if (!silent)
+				pr_debug("cannot split mapping");
 			ret = -EINVAL;
 			break;
 		}
@@ -111,7 +113,8 @@ int iommu_unmap(void *address_space, u64 virt_addr, u64 size, int flags)
 	}
 
 	if (size && !ret) {
-		pr_debug("mapping not found");
+		if (!silent)
+			pr_debug("mapping not found");
 		ret = -ENXIO;
 	}
 	mutex_unlock(&ioas->mutex);
diff --git a/vfio.c b/vfio.c
index f4fd4090..406d0781 100644
--- a/vfio.c
+++ b/vfio.c
@@ -1,10 +1,13 @@
+#include "kvm/iommu.h"
 #include "kvm/irq.h"
 #include "kvm/kvm.h"
 #include "kvm/kvm-cpu.h"
 #include "kvm/pci.h"
 #include "kvm/util.h"
 #include "kvm/vfio.h"
+#include "kvm/virtio-iommu.h"
 
+#include <linux/bitops.h>
 #include <linux/kvm.h>
 #include <linux/pci_regs.h>
 
@@ -25,7 +28,16 @@ struct vfio_irq_eventfd {
 	int			fd;
 };
 
-static int vfio_container;
+struct vfio_guest_container {
+	struct kvm		*kvm;
+	int			fd;
+
+	void			*msi_doorbells;
+};
+
+static void *viommu = NULL;
+
+static int vfio_host_container;
 
 int vfio_group_parser(const struct option *opt, const char *arg, int unset)
 {
@@ -43,6 +55,7 @@ int vfio_group_parser(const struct option *opt, const char *arg, int unset)
 
 	cur = strtok(buf, ",");
 	group->id = strtoul(cur, NULL, 0);
+	group->container = NULL;
 
 	kvm->cfg.num_vfio_groups = ++idx;
 	free(buf);
@@ -68,11 +81,13 @@ static void vfio_pci_msix_pba_access(struct kvm_cpu *vcpu, u64 addr, u8 *data,
 static void vfio_pci_msix_table_access(struct kvm_cpu *vcpu, u64 addr, u8 *data,
 				       u32 len, u8 is_write, void *ptr)
 {
+	struct msi_msg msg;
 	struct kvm *kvm = vcpu->kvm;
 	struct vfio_pci_device *pdev = ptr;
 	struct vfio_pci_msix_entry *entry;
 	struct vfio_pci_msix_table *table = &pdev->msix_table;
 	struct vfio_device *device = container_of(pdev, struct vfio_device, pci);
+	struct vfio_guest_container *container = device->group->container;
 
 	u64 offset = addr - table->guest_phys_addr;
 
@@ -88,11 +103,16 @@ static void vfio_pci_msix_table_access(struct kvm_cpu *vcpu, u64 addr, u8 *data,
 
 	memcpy((void *)&entry->config + field, data, len);
 
-	if (field != PCI_MSIX_ENTRY_VECTOR_CTRL)
+	if (field != PCI_MSIX_ENTRY_VECTOR_CTRL || entry->config.ctrl & 1)
+		return;
+
+	msg = entry->config.msg;
+
+	if (container && iommu_translate_msi(container->msi_doorbells, &msg))
 		return;
 
 	if (entry->gsi < 0) {
-		int ret = irq__add_msix_route(kvm, &entry->config.msg,
+		int ret = irq__add_msix_route(kvm, &msg,
 					      device->dev_hdr.dev_num << 3);
 		if (ret < 0) {
 			pr_err("cannot create MSI-X route");
@@ -111,7 +131,7 @@ static void vfio_pci_msix_table_access(struct kvm_cpu *vcpu, u64 addr, u8 *data,
 		return;
 	}
 
-	irq__update_msix_route(kvm, entry->gsi, &entry->config.msg);
+	irq__update_msix_route(kvm, entry->gsi, &msg);
 }
 
 static void vfio_pci_msi_write(struct kvm *kvm, struct vfio_device *device,
@@ -122,6 +142,7 @@ static void vfio_pci_msi_write(struct kvm *kvm, struct vfio_device *device,
 	struct msi_msg msi;
 	struct vfio_pci_msix_entry *entry;
 	struct vfio_pci_device *pdev = &device->pci;
+	struct vfio_guest_container *container = device->group->container;
 	struct msi_cap_64 *msi_cap_64 = (void *)&pdev->hdr + pdev->msi.pos;
 
 	/* Only modify routes when guest sets the enable bit */
@@ -144,6 +165,9 @@ static void vfio_pci_msi_write(struct kvm *kvm, struct vfio_device *device,
 		msi.data = msi_cap_32->data;
 	}
 
+	if (container && iommu_translate_msi(container->msi_doorbells, &msi))
+		return;
+
 	for (i = 0; i < nr_vectors; i++) {
 		u32 devid = device->dev_hdr.dev_num << 3;
 
@@ -870,6 +894,154 @@ static int vfio_configure_dev_irqs(struct kvm *kvm, struct vfio_device *device)
 	return ret;
 }
 
+static struct iommu_properties vfio_viommu_props = {
+	.name				= "viommu-vfio",
+
+	.input_addr_size		= 64,
+};
+
+static const struct iommu_properties *
+vfio_viommu_get_properties(struct device_header *dev)
+{
+	return &vfio_viommu_props;
+}
+
+static void *vfio_viommu_alloc(struct device_header *dev_hdr)
+{
+	struct vfio_device *vdev = container_of(dev_hdr, struct vfio_device,
+						dev_hdr);
+	struct vfio_guest_container *container = vdev->group->container;
+
+	container->msi_doorbells = iommu_alloc_address_space(NULL);
+	if (!container->msi_doorbells) {
+		pr_err("Failed to create MSI address space");
+		return NULL;
+	}
+
+	return container;
+}
+
+static void vfio_viommu_free(void *priv)
+{
+	struct vfio_guest_container *container = priv;
+
+	/* Half the address space */
+	size_t size = 1UL << (BITS_PER_LONG - 1);
+	unsigned long virt_addr = 0;
+	int i;
+
+	/*
+	 * Remove all mappings in two passes, since the full 64-bit input
+	 * range doesn't fit in unmap.size
+	 */
+	for (i = 0; i < 2; i++, virt_addr += size) {
+		struct vfio_iommu_type1_dma_unmap unmap = {
+			.argsz	= sizeof(unmap),
+			.iova	= virt_addr,
+			.size	= size,
+		};
+
+		/* Tear down all guest mappings in this half */
+		ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap);
+	}
+
+	iommu_free_address_space(container->msi_doorbells);
+	container->msi_doorbells = NULL;
+}
+
+static int vfio_viommu_attach(void *priv, struct device_header *dev_hdr, int flags)
+{
+	struct vfio_guest_container *container = priv;
+	struct vfio_device *vdev = container_of(dev_hdr, struct vfio_device,
+						dev_hdr);
+
+	if (!container)
+		return -ENODEV;
+
+	if (container->fd != vdev->group->container->fd)
+		/*
+		 * TODO: We don't support multiple devices in the same address
+		 * space at the moment. It should be easy to implement: create
+		 * an address space structure that holds multiple container
+		 * fds and multiplexes map/unmap requests (see the sketch
+		 * after this function).
+		 */
+		return -EINVAL;
+
+	return 0;
+}
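+
+/*
+ * Hypothetical sketch of the multi-container approach described in the
+ * TODO above -- not part of this patch. An address space would hold a
+ * list of containers, and each map request would be replayed on all of
+ * them:
+ *
+ *	static int vfio_viommu_map_all(struct list_head *containers,
+ *				       u64 virt_addr, u64 phys_addr,
+ *				       u64 size, int prot)
+ *	{
+ *		int ret;
+ *		struct vfio_guest_container *c;
+ *
+ *		list_for_each_entry(c, containers, list) {
+ *			ret = vfio_viommu_map(c, virt_addr, phys_addr,
+ *					      size, prot);
+ *			if (ret)
+ *				return ret;
+ *		}
+ *		return 0;
+ *	}
+ *
+ * (assumes a 'list' member added to struct vfio_guest_container)
+ */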
+
+static int vfio_viommu_detach(void *priv, struct device_header *dev_hdr)
+{
+	return 0;
+}
+
+static int vfio_viommu_map(void *priv, u64 virt_addr, u64 phys_addr, u64 size,
+			   int prot)
+{
+	int ret;
+	struct vfio_guest_container *container = priv;
+	struct vfio_iommu_type1_dma_map map = {
+		.argsz	= sizeof(map),
+		.iova	= virt_addr,
+		.size	= size,
+	};
+
+	map.vaddr = (u64)guest_flat_to_host(container->kvm, phys_addr);
+	if (!map.vaddr) {
+		if (irq__addr_is_msi_doorbell(container->kvm, phys_addr)) {
+			ret = iommu_map(container->msi_doorbells, virt_addr,
+					phys_addr, size, prot);
+			if (ret) {
+				pr_err("could not map MSI");
+				return ret;
+			}
+
+			/* TODO: silence guest_flat_to_host */
+			pr_info("Nevermind, all is well. Mapped MSI %llx->%llx",
+				virt_addr, phys_addr);
+			return 0;
+		} else {
+			return -ERANGE;
+		}
+	}
+
+	if (prot & IOMMU_PROT_READ)
+		map.flags |= VFIO_DMA_MAP_FLAG_READ;
+
+	if (prot & IOMMU_PROT_WRITE)
+		map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
+
+	if (prot & IOMMU_PROT_EXEC) {
+		pr_err("VFIO does not support PROT_EXEC");
+		return -ENOSYS;
+	}
+
+	return ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map);
+}
+
+static int vfio_viommu_unmap(void *priv, u64 virt_addr, u64 size, int flags)
+{
+	struct vfio_guest_container *container = priv;
+	struct vfio_iommu_type1_dma_unmap unmap = {
+		.argsz	= sizeof(unmap),
+		.iova	= virt_addr,
+		.size	= size,
+	};
+
+	if (!iommu_unmap(container->msi_doorbells, virt_addr, size,
+			 flags | IOMMU_UNMAP_SILENT))
+		return 0;
+
+	return ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap);
+}
+
+static struct iommu_ops vfio_iommu_ops = {
+	.get_properties		= vfio_viommu_get_properties,
+	.alloc_address_space	= vfio_viommu_alloc,
+	.free_address_space	= vfio_viommu_free,
+	.attach			= vfio_viommu_attach,
+	.detach			= vfio_viommu_detach,
+	.map			= vfio_viommu_map,
+	.unmap			= vfio_viommu_unmap,
+};
+
 static int vfio_configure_reserved_regions(struct kvm *kvm,
 					   struct vfio_group *group)
 {
@@ -912,6 +1084,8 @@ static int vfio_configure_device(struct kvm *kvm, struct vfio_group *group,
 		return -ENOMEM;
 	}
 
+	device->group = group;
+
 	device->fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, dirent->d_name);
 	if (device->fd < 0) {
 		pr_err("Failed to get FD for device %s in group %lu",
@@ -945,6 +1119,7 @@ static int vfio_configure_device(struct kvm *kvm, struct vfio_group *group,
 	device->dev_hdr = (struct device_header) {
 		.bus_type	= DEVICE_BUS_PCI,
 		.data		= &device->pci.hdr,
+		.iommu_ops	= viommu ? &vfio_iommu_ops : NULL,
 	};
 
 	ret = device__register(&device->dev_hdr);
@@ -1009,13 +1184,13 @@ static int vfio_configure_iommu_groups(struct kvm *kvm)
 /* TODO: this should be an arch callback, so arm can return HYP only if vsmmu */
 static int vfio_get_iommu_type(void)
 {
-	if (ioctl(vfio_container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_NESTING_IOMMU))
+	if (ioctl(vfio_host_container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_NESTING_IOMMU))
 		return VFIO_TYPE1_NESTING_IOMMU;
 
-	if (ioctl(vfio_container, VFIO_CHECK_EXTENSION, VFIO_TYPE1v2_IOMMU))
+	if (ioctl(vfio_host_container, VFIO_CHECK_EXTENSION, VFIO_TYPE1v2_IOMMU))
 		return VFIO_TYPE1v2_IOMMU;
 
-	if (ioctl(vfio_container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU))
+	if (ioctl(vfio_host_container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU))
 		return VFIO_TYPE1_IOMMU;
 
 	return -ENODEV;
@@ -1033,7 +1208,7 @@ static int vfio_map_mem_bank(struct kvm *kvm, struct kvm_mem_bank *bank, void *d
 	};
 
 	/* Map the guest memory for DMA (i.e. provide isolation) */
-	if (ioctl(vfio_container, VFIO_IOMMU_MAP_DMA, &dma_map)) {
+	if (ioctl(vfio_host_container, VFIO_IOMMU_MAP_DMA, &dma_map)) {
 		ret = -errno;
 		pr_err("Failed to map 0x%llx -> 0x%llx (%llu) for DMA",
 		       dma_map.iova, dma_map.vaddr, dma_map.size);
@@ -1050,14 +1225,15 @@ static int vfio_unmap_mem_bank(struct kvm *kvm, struct kvm_mem_bank *bank, void
 		.iova = bank->guest_phys_addr,
 	};
 
-	ioctl(vfio_container, VFIO_IOMMU_UNMAP_DMA, &dma_unmap);
+	ioctl(vfio_host_container, VFIO_IOMMU_UNMAP_DMA, &dma_unmap);
 
 	return 0;
 }
 
 static int vfio_group_init(struct kvm *kvm, struct vfio_group *group)
 {
-	int ret;
+	int ret = 0;
+	int container;
 	char group_node[VFIO_PATH_MAX_LEN];
 	struct vfio_group_status group_status = {
 		.argsz = sizeof(group_status),
@@ -1066,6 +1242,25 @@ static int vfio_group_init(struct kvm *kvm, struct vfio_group *group)
 	snprintf(group_node, VFIO_PATH_MAX_LEN, VFIO_DEV_DIR "/%lu",
 		 group->id);
 
+	if (kvm->cfg.viommu) {
+		container = open(VFIO_DEV_NODE, O_RDWR);
+		if (container < 0) {
+			ret = -errno;
+			pr_err("cannot initialize private container\n");
+			return ret;
+		}
+
+		group->container = malloc(sizeof(struct vfio_guest_container));
+		if (!group->container) {
+			close(container);
+			return -ENOMEM;
+		}
+
+		group->container->fd = container;
+		group->container->kvm = kvm;
+		group->container->msi_doorbells = NULL;
+	} else {
+		container = vfio_host_container;
+	}
+
 	group->fd = open(group_node, O_RDWR);
 	if (group->fd == -1) {
 		ret = -errno;
@@ -1085,29 +1280,52 @@ static int vfio_group_init(struct kvm *kvm, struct vfio_group *group)
 		return -EINVAL;
 	}
 
-	if (ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &vfio_container)) {
+	if (ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container)) {
 		ret = -errno;
 		pr_err("Failed to add IOMMU group %s to VFIO container",
 		       group_node);
 		return ret;
 	}
 
-	return 0;
+	if (container != vfio_host_container) {
+		struct vfio_iommu_type1_info info = {
+			.argsz = sizeof(info),
+		};
+
+		/* We really need v2 semantics for unmap-all */
+		ret = ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU);
+		if (ret) {
+			ret = -errno;
+			pr_err("Failed to set IOMMU");
+			return ret;
+		}
+
+		ret = ioctl(container, VFIO_IOMMU_GET_INFO, &info);
+		if (ret)
+			pr_err("Failed to get IOMMU info");
+		else if (info.flags & VFIO_IOMMU_INFO_PGSIZES)
+			vfio_viommu_props.pgsize_mask = info.iova_pgsizes;
+	}
+
+	return ret;
 }
 
-static int vfio_container_init(struct kvm *kvm)
+static int vfio_groups_init(struct kvm *kvm)
 {
 	int api, i, ret, iommu_type;
 
-	/* Create a container for our IOMMU groups */
-	vfio_container = open(VFIO_DEV_NODE, O_RDWR);
-	if (vfio_container == -1) {
+	/*
+	 * Create a container for our IOMMU groups. Even when using a viommu, we
+	 * still use this one for probing capabilities.
+	 */
+	vfio_host_container = open(VFIO_DEV_NODE, O_RDWR);
+	if (vfio_host_container == -1) {
 		ret = errno;
 		pr_err("Failed to open %s", VFIO_DEV_NODE);
 		return ret;
 	}
 
-	api = ioctl(vfio_container, VFIO_GET_API_VERSION);
+	api = ioctl(vfio_host_container, VFIO_GET_API_VERSION);
 	if (api != VFIO_API_VERSION) {
 		pr_err("Unknown VFIO API version %d", api);
 		return -ENODEV;
@@ -1119,15 +1337,20 @@ static int vfio_container_init(struct kvm *kvm)
 		return iommu_type;
 	}
 
-	/* Sanity check our groups and add them to the container */
 	for (i = 0; i < kvm->cfg.num_vfio_groups; ++i) {
 		ret = vfio_group_init(kvm, &kvm->cfg.vfio_group[i]);
 		if (ret)
 			return ret;
 	}
 
+	if (kvm->cfg.viommu) {
+		close(vfio_host_container);
+		vfio_host_container = -1;
+		return 0;
+	}
+
 	/* Finalise the container */
-	if (ioctl(vfio_container, VFIO_SET_IOMMU, iommu_type)) {
+	if (ioctl(vfio_host_container, VFIO_SET_IOMMU, iommu_type)) {
 		ret = -errno;
 		pr_err("Failed to set IOMMU type %d for VFIO container",
 		       iommu_type);
@@ -1147,10 +1370,16 @@ static int vfio__init(struct kvm *kvm)
 	if (!kvm->cfg.num_vfio_groups)
 		return 0;
 
-	ret = vfio_container_init(kvm);
+	ret = vfio_groups_init(kvm);
 	if (ret)
 		return ret;
 
+	if (kvm->cfg.viommu) {
+		viommu = viommu_register(kvm, &vfio_viommu_props);
+		if (!viommu)
+			pr_err("could not register viommu");
+	}
+
 	ret = vfio_configure_iommu_groups(kvm);
 	if (ret)
 		return ret;
@@ -1162,17 +1391,27 @@ dev_base_init(vfio__init);
 static int vfio__exit(struct kvm *kvm)
 {
 	int i, fd;
+	struct vfio_guest_container *container;
 
 	if (!kvm->cfg.num_vfio_groups)
 		return 0;
 
 	for (i = 0; i < kvm->cfg.num_vfio_groups; ++i) {
+		container = kvm->cfg.vfio_group[i].container;
 		fd = kvm->cfg.vfio_group[i].fd;
 		ioctl(fd, VFIO_GROUP_UNSET_CONTAINER);
 		close(fd);
+
+		if (container != NULL) {
+			close(container->fd);
+			free(container);
+		}
 	}
 
+	if (vfio_host_container == -1)
+		return 0;
+
 	kvm__for_each_mem_bank(kvm, KVM_MEM_TYPE_RAM, vfio_unmap_mem_bank, NULL);
-	return close(vfio_container);
+	return close(vfio_host_container);
 }
 dev_base_exit(vfio__exit);
-- 
2.12.1

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [RFC PATCH kvmtool 13/15] virtio-iommu: debug via IPC
  2017-04-07 19:24 ` [RFC PATCH kvmtool 00/15] Add virtio-iommu Jean-Philippe Brucker
                     ` (24 preceding siblings ...)
  2017-04-07 19:24   ` [RFC PATCH kvmtool 13/15] virtio-iommu: debug via IPC Jean-Philippe Brucker
@ 2017-04-07 19:24   ` Jean-Philippe Brucker
  2017-04-07 19:24   ` [RFC PATCH kvmtool 14/15] virtio-iommu: implement basic debug commands Jean-Philippe Brucker
                     ` (5 subsequent siblings)
  31 siblings, 0 replies; 99+ messages in thread
From: Jean-Philippe Brucker @ 2017-04-07 19:24 UTC (permalink / raw)
  To: iommu, kvm, virtualization, virtio-dev
  Cc: cdall, will.deacon, robin.murphy, lorenzo.pieralisi, joro, mst,
	jasowang, alex.williamson, marc.zyngier

Add a new parameter to lkvm debug, '-i' or '--iommu'. Commands will be
added later. For the moment, rework the debug builtin to share dump
facilities with the '-d'/'--dump' parameter.
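
For example, once the commands are implemented by a later patch, this
enables invocations such as:

    $ lkvm debug --name guest-1 --iommu "<command>"

(where 'guest-1' and '<command>' are placeholders.)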

Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
---
 builtin-debug.c             |  8 +++++++-
 include/kvm/builtin-debug.h |  6 ++++++
 include/kvm/iommu.h         |  5 +++++
 include/kvm/virtio-iommu.h  |  5 +++++
 kvm-ipc.c                   | 43 ++++++++++++++++++++++++-------------------
 virtio/iommu.c              | 14 ++++++++++++++
 6 files changed, 61 insertions(+), 20 deletions(-)

diff --git a/builtin-debug.c b/builtin-debug.c
index 4ae51d20..e39e2d09 100644
--- a/builtin-debug.c
+++ b/builtin-debug.c
@@ -5,6 +5,7 @@
 #include <kvm/parse-options.h>
 #include <kvm/kvm-ipc.h>
 #include <kvm/read-write.h>
+#include <kvm/virtio-iommu.h>
 
 #include <stdio.h>
 #include <string.h>
@@ -17,6 +18,7 @@ static int nmi = -1;
 static bool dump;
 static const char *instance_name;
 static const char *sysrq;
+static const char *iommu;
 
 static const char * const debug_usage[] = {
 	"lkvm debug [--all] [-n name] [-d] [-m vcpu]",
@@ -28,6 +30,7 @@ static const struct option debug_options[] = {
 	OPT_BOOLEAN('d', "dump", &dump, "Generate a debug dump from guest"),
 	OPT_INTEGER('m', "nmi", &nmi, "Generate NMI on VCPU"),
 	OPT_STRING('s', "sysrq", &sysrq, "sysrq", "Inject a sysrq"),
+	OPT_STRING('i', "iommu", &iommu, "params", "Debug virtual IOMMU"),
 	OPT_GROUP("Instance options:"),
 	OPT_BOOLEAN('a', "all", &all, "Debug all instances"),
 	OPT_STRING('n', "name", &instance_name, "name", "Instance name"),
@@ -68,11 +71,14 @@ static int do_debug(const char *name, int sock)
 		cmd.sysrq = sysrq[0];
 	}
 
+	if (iommu && !viommu_parse_debug_string(iommu, &cmd.iommu))
+		cmd.dbg_type |= KVM_DEBUG_CMD_TYPE_IOMMU;
+
 	r = kvm_ipc__send_msg(sock, KVM_IPC_DEBUG, sizeof(cmd), (u8 *)&cmd);
 	if (r < 0)
 		return r;
 
-	if (!dump)
+	if (!(cmd.dbg_type & KVM_DEBUG_CMD_DUMP_MASK))
 		return 0;
 
 	do {
diff --git a/include/kvm/builtin-debug.h b/include/kvm/builtin-debug.h
index efa02684..cd2155ae 100644
--- a/include/kvm/builtin-debug.h
+++ b/include/kvm/builtin-debug.h
@@ -2,16 +2,22 @@
 #define KVM__DEBUG_H
 
 #include <kvm/util.h>
+#include <kvm/iommu.h>
 #include <linux/types.h>
 
 #define KVM_DEBUG_CMD_TYPE_DUMP	(1 << 0)
 #define KVM_DEBUG_CMD_TYPE_NMI	(1 << 1)
 #define KVM_DEBUG_CMD_TYPE_SYSRQ (1 << 2)
+#define KVM_DEBUG_CMD_TYPE_IOMMU (1 << 3)
+
+#define KVM_DEBUG_CMD_DUMP_MASK \
+	(KVM_DEBUG_CMD_TYPE_IOMMU | KVM_DEBUG_CMD_TYPE_DUMP)
 
 struct debug_cmd_params {
 	u32 dbg_type;
 	u32 cpu;
 	char sysrq;
+	struct iommu_debug_params iommu;
 };
 
 int kvm_cmd_debug(int argc, const char **argv, const char *prefix);
diff --git a/include/kvm/iommu.h b/include/kvm/iommu.h
index 45a20f3b..60857fa5 100644
--- a/include/kvm/iommu.h
+++ b/include/kvm/iommu.h
@@ -1,6 +1,7 @@
 #ifndef KVM_IOMMU_H
 #define KVM_IOMMU_H
 
+#include <stdbool.h>
 #include <stdlib.h>
 
 #include "devices.h"
@@ -10,6 +11,10 @@
 #define IOMMU_PROT_WRITE	0x2
 #define IOMMU_PROT_EXEC		0x4
 
+struct iommu_debug_params {
+	bool				print_enabled;
+};
+
 /*
  * Test if mapping is present. If not, return an error but do not report it to
  * stderr
diff --git a/include/kvm/virtio-iommu.h b/include/kvm/virtio-iommu.h
index 5532c82b..c9e36fb6 100644
--- a/include/kvm/virtio-iommu.h
+++ b/include/kvm/virtio-iommu.h
@@ -7,4 +7,9 @@ const struct iommu_properties *viommu_get_properties(void *dev);
 void *viommu_register(struct kvm *kvm, struct iommu_properties *props);
 void viommu_unregister(struct kvm *kvm, void *cookie);
 
+struct iommu_debug_params;
+
+int viommu_parse_debug_string(const char *options, struct iommu_debug_params *);
+int viommu_debug(int fd, struct iommu_debug_params *);
+
 #endif
diff --git a/kvm-ipc.c b/kvm-ipc.c
index e07ad105..a8b56543 100644
--- a/kvm-ipc.c
+++ b/kvm-ipc.c
@@ -14,6 +14,7 @@
 #include "kvm/strbuf.h"
 #include "kvm/kvm-cpu.h"
 #include "kvm/8250-serial.h"
+#include "kvm/virtio-iommu.h"
 
 struct kvm_ipc_head {
 	u32 type;
@@ -424,31 +425,35 @@ static void handle_debug(struct kvm *kvm, int fd, u32 type, u32 len, u8 *msg)
 		pthread_kill(kvm->cpus[vcpu]->thread, SIGUSR1);
 	}
 
-	if (!(dbg_type & KVM_DEBUG_CMD_TYPE_DUMP))
-		return;
+	if (dbg_type & KVM_DEBUG_CMD_TYPE_IOMMU)
+		viommu_debug(fd, &params->iommu);
 
-	for (i = 0; i < kvm->nrcpus; i++) {
-		struct kvm_cpu *cpu = kvm->cpus[i];
+	if (dbg_type & KVM_DEBUG_CMD_TYPE_DUMP) {
+		for (i = 0; i < kvm->nrcpus; i++) {
+			struct kvm_cpu *cpu = kvm->cpus[i];
 
-		if (!cpu)
-			continue;
+			if (!cpu)
+				continue;
 
-		printout_done = 0;
+			printout_done = 0;
+
+			kvm_cpu__set_debug_fd(fd);
+			pthread_kill(cpu->thread, SIGUSR1);
+			/*
+			 * Wait for the vCPU to dump state before signalling
+			 * the next thread. Since this is debug code it does
+			 * not matter that we are burning CPU time a bit:
+			 */
+			while (!printout_done)
+				sleep(0);
+		}
 
-		kvm_cpu__set_debug_fd(fd);
-		pthread_kill(cpu->thread, SIGUSR1);
-		/*
-		 * Wait for the vCPU to dump state before signalling
-		 * the next thread. Since this is debug code it does
-		 * not matter that we are burning CPU time a bit:
-		 */
-		while (!printout_done)
-			sleep(0);
+		serial8250__inject_sysrq(kvm, 'p');
 	}
 
-	close(fd);
-
-	serial8250__inject_sysrq(kvm, 'p');
+	if (dbg_type & KVM_DEBUG_CMD_DUMP_MASK)
+		/* builtin-debug is reading, signal EOT */
+		close(fd);
 }
 
 int kvm_ipc__init(struct kvm *kvm)
diff --git a/virtio/iommu.c b/virtio/iommu.c
index 2e5a23ee..5973cef1 100644
--- a/virtio/iommu.c
+++ b/virtio/iommu.c
@@ -620,3 +620,17 @@ void viommu_unregister(struct kvm *kvm, void *viommu)
 {
 	free(viommu);
 }
+
+int viommu_parse_debug_string(const char *cmdline, struct iommu_debug_params *params)
+{
+	/* show instances numbers */
+	/* send command to instance */
+	/* - dump mappings */
+	/* - statistics */
+	return -ENOSYS;
+}
+
+int viommu_debug(int sock, struct iommu_debug_params *params)
+{
+	return -ENOSYS;
+}
-- 
2.12.1

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [RFC PATCH kvmtool 14/15] virtio-iommu: implement basic debug commands
  2017-04-07 19:24 ` [RFC PATCH kvmtool 00/15] Add virtio-iommu Jean-Philippe Brucker
                     ` (26 preceding siblings ...)
  2017-04-07 19:24   ` [RFC PATCH kvmtool 14/15] virtio-iommu: implement basic debug commands Jean-Philippe Brucker
@ 2017-04-07 19:24   ` Jean-Philippe Brucker
  2017-04-07 19:24   ` [RFC PATCH kvmtool 15/15] virtio: use virtio-iommu when available Jean-Philippe Brucker
                     ` (3 subsequent siblings)
  31 siblings, 0 replies; 99+ messages in thread
From: Jean-Philippe Brucker @ 2017-04-07 19:24 UTC (permalink / raw)
  To: iommu, kvm, virtualization, virtio-dev
  Cc: cdall, will.deacon, robin.murphy, lorenzo.pieralisi, joro, mst,
	jasowang, alex.williamson, marc.zyngier

Using debug printfs with the virtual IOMMU can be extremely verbose. To
ease debugging, add a few commands that can be sent via IPC. The command
format is "cmd [iommu [address_space]]" (or "cmd:[iommu:[address_space]]").

    $ lkvm debug -a -i list
    iommu 0 "viommu-vfio"
      ioas 1
        device 0x2                      # PCI bus
      ioas 2
        device 0x3
    iommu 1 "viommu-virtio"
      ioas 3
        device 0x10003                  # MMIO bus
      ioas 4
        device 0x6

    $ lkvm debug -a -i stats:0          # stats for iommu 0 only
    iommu 0 "viommu-virtio"
      kicks                 510         # virtio kicks from driver
      requests              510         # requests received
      ioas 3
        maps                1           # number of map requests
        unmaps              0           #     "    unmap   "
        resident            8192        # bytes currently mapped
        accesses            1           # number of device accesses
      ioas 4
        maps                290
        unmaps              4
        resident            1335296
        accesses            982

    $ lkvm debug -a -i "print 1, 2"     # Start debug print for
      ...                               # ioas 2 in iommu 1
      ...
      Info: VIOMMU map 0xffffffff000 -> 0x8f4e0000 (4096) to IOAS 2
      ...
    $ lkvm debug -a -i noprint          # Stop all debug print

We don't use atomics for statistics at the moment, since most of the
counters are never written concurrently. Only 'accesses' might be
incremented concurrently, so its value may be imprecise.
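
If precise values were ever needed, a relaxed atomic increment would
suffice. A hypothetical alternative (not part of this patch), using the
GCC/Clang __atomic builtins:

    /* writer side, in iommu_access() */
    __atomic_fetch_add(&ioas->stats.accesses, 1, __ATOMIC_RELAXED);

    /* reader side, when dumping statistics */
    u64 accesses = __atomic_load_n(&ioas->stats.accesses, __ATOMIC_RELAXED);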

Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
---
 include/kvm/iommu.h |  17 +++
 iommu.c             |  56 +++++++++-
 virtio/iommu.c      | 312 ++++++++++++++++++++++++++++++++++++++++++++++++----
 virtio/mmio.c       |   1 +
 virtio/pci.c        |   1 +
 5 files changed, 362 insertions(+), 25 deletions(-)

diff --git a/include/kvm/iommu.h b/include/kvm/iommu.h
index 60857fa5..70a09306 100644
--- a/include/kvm/iommu.h
+++ b/include/kvm/iommu.h
@@ -11,7 +11,20 @@
 #define IOMMU_PROT_WRITE	0x2
 #define IOMMU_PROT_EXEC		0x4
 
+enum iommu_debug_action {
+	IOMMU_DEBUG_LIST,
+	IOMMU_DEBUG_STATS,
+	IOMMU_DEBUG_SET_PRINT,
+	IOMMU_DEBUG_DUMP,
+
+	IOMMU_DEBUG_NUM_ACTIONS,
+};
+
+#define IOMMU_DEBUG_SELECTOR_INVALID	((unsigned int)-1)
+
 struct iommu_debug_params {
+	enum iommu_debug_action		action;
+	unsigned int			selector[2];
 	bool				print_enabled;
 };
 
@@ -31,6 +44,8 @@ struct iommu_ops {
 	int (*detach)(void *, struct device_header *);
 	int (*map)(void *, u64 virt_addr, u64 phys_addr, u64 size, int prot);
 	int (*unmap)(void *, u64 virt_addr, u64 size, int flags);
+
+	int (*debug_address_space)(void *, int fd, struct iommu_debug_params *);
 };
 
 struct iommu_properties {
@@ -74,6 +89,8 @@ static inline struct device_header *iommu_get_device(u32 device_id)
 
 void *iommu_alloc_address_space(struct device_header *dev);
 void iommu_free_address_space(void *address_space);
+int iommu_debug_address_space(void *address_space, int fd,
+			      struct iommu_debug_params *params);
 
 int iommu_map(void *address_space, u64 virt_addr, u64 phys_addr, u64 size,
 	      int prot);
diff --git a/iommu.c b/iommu.c
index 2220e4b2..bc9fc631 100644
--- a/iommu.c
+++ b/iommu.c
@@ -9,6 +9,10 @@
 #include "kvm/mutex.h"
 #include "kvm/rbtree-interval.h"
 
+struct iommu_ioas_stats {
+	u64			accesses;
+};
+
 struct iommu_mapping {
 	struct rb_int_node	iova_range;
 	u64			phys;
@@ -18,8 +22,31 @@ struct iommu_mapping {
 struct iommu_ioas {
 	struct rb_root		mappings;
 	struct mutex		mutex;
+
+	struct iommu_ioas_stats	stats;
+	bool			debug_enabled;
 };
 
+static void iommu_dump(struct iommu_ioas *ioas, int fd)
+{
+	struct rb_node *node;
+	struct iommu_mapping *map;
+
+	mutex_lock(&ioas->mutex);
+
+	dprintf(fd, "START IOMMU DUMP [[[\n"); /* You did ask for it. */
+	for (node = rb_first(&ioas->mappings); node; node = rb_next(node)) {
+		struct rb_int_node *int_node = rb_int(node);
+		map = container_of(int_node, struct iommu_mapping, iova_range);
+
+		dprintf(fd, "%#llx-%#llx -> %#llx %#x\n", int_node->low,
+			int_node->high, map->phys, map->prot);
+	}
+	dprintf(fd, "]]] END IOMMU DUMP\n");
+
+	mutex_unlock(&ioas->mutex);
+}
+
 void *iommu_alloc_address_space(struct device_header *unused)
 {
 	struct iommu_ioas *ioas = calloc(1, sizeof(*ioas));
@@ -33,6 +60,27 @@ void *iommu_alloc_address_space(struct device_header *unused)
 	return ioas;
 }
 
+int iommu_debug_address_space(void *address_space, int fd,
+			      struct iommu_debug_params *params)
+{
+	struct iommu_ioas *ioas = address_space;
+
+	switch (params->action) {
+	case IOMMU_DEBUG_STATS:
+		dprintf(fd, "    accesses            %llu\n", ioas->stats.accesses);
+		break;
+	case IOMMU_DEBUG_SET_PRINT:
+		ioas->debug_enabled = params->print_enabled;
+		break;
+	case IOMMU_DEBUG_DUMP:
+		iommu_dump(ioas, fd);
+		break;
+	default:
+		break;
+	}
+
+	return 0;
+}
+
 void iommu_free_address_space(void *address_space)
 {
 	struct iommu_ioas *ioas = address_space;
@@ -157,8 +205,12 @@ u64 iommu_access(void *address_space, u64 addr, size_t size, size_t *out_size,
 	out_addr = map->phys + (addr - node->low);
 	*out_size = min_t(size_t, node->high - addr + 1, size);
 
-	pr_debug("access %llx %zu/%zu %x -> %#llx", addr, *out_size, size,
-		 prot, out_addr);
+	if (ioas->debug_enabled)
+		pr_info("access %llx %zu/%zu %s%s -> %#llx", addr, *out_size,
+			size, prot & IOMMU_PROT_READ ? "R" : "",
+			prot & IOMMU_PROT_WRITE ? "W" : "", out_addr);
+
+	ioas->stats.accesses++;
 out_unlock:
 	mutex_unlock(&ioas->mutex);
 
diff --git a/virtio/iommu.c b/virtio/iommu.c
index 5973cef1..153b537a 100644
--- a/virtio/iommu.c
+++ b/virtio/iommu.c
@@ -20,6 +20,17 @@
 /* Max size */
 #define VIOMMU_DEFAULT_QUEUE_SIZE	256
 
+struct viommu_ioas_stats {
+	u64				map;
+	u64				unmap;
+	u64				resident;
+};
+
+struct viommu_stats {
+	u64				kicks;
+	u64				requests;
+};
+
 struct viommu_endpoint {
 	struct device_header		*dev;
 	struct viommu_ioas		*ioas;
@@ -36,9 +47,14 @@ struct viommu_ioas {
 
 	struct iommu_ops		*ops;
 	void				*priv;
+
+	bool				debug_enabled;
+	struct viommu_ioas_stats	stats;
 };
 
 struct viommu_dev {
+	u32				id;
+
 	struct virtio_device		vdev;
 	struct virtio_iommu_config	config;
 
@@ -49,29 +65,77 @@ struct viommu_dev {
 	struct thread_pool__job		job;
 
 	struct rb_root			address_spaces;
+	struct mutex			address_spaces_mutex;
 	struct kvm			*kvm;
+
+	struct list_head		list;
+
+	bool				debug_enabled;
+	struct viommu_stats		stats;
 };
 
 static int compat_id = -1;
 
+static long long viommu_ids;
+static LIST_HEAD(viommus);
+static DEFINE_MUTEX(viommus_mutex);
+
+#define ioas_debug(ioas, fmt, ...)					\
+	do {								\
+		if ((ioas)->debug_enabled)				\
+			pr_info("ioas[%d] " fmt, (ioas)->id, ##__VA_ARGS__); \
+	} while (0)
+
 static struct viommu_ioas *viommu_find_ioas(struct viommu_dev *viommu,
 					    u32 ioasid)
 {
 	struct rb_node *node;
-	struct viommu_ioas *ioas;
+	struct viommu_ioas *ioas, *found = NULL;
 
+	mutex_lock(&viommu->address_spaces_mutex);
 	node = viommu->address_spaces.rb_node;
 	while (node) {
 		ioas = container_of(node, struct viommu_ioas, node);
-		if (ioas->id > ioasid)
+		if (ioas->id > ioasid) {
 			node = node->rb_left;
-		else if (ioas->id < ioasid)
+		} else if (ioas->id < ioasid) {
 			node = node->rb_right;
-		else
-			return ioas;
+		} else {
+			found = ioas;
+			break;
+		}
 	}
+	mutex_unlock(&viommu->address_spaces_mutex);
 
-	return NULL;
+	return found;
+}
+
+static int viommu_for_each_ioas(struct viommu_dev *viommu,
+				int (*fun)(struct viommu_dev *viommu,
+					   struct viommu_ioas *ioas,
+					   void *data),
+				void *data)
+{
+	int ret = 0;
+	struct viommu_ioas *ioas;
+	struct rb_node *node, *next;
+
+	mutex_lock(&viommu->address_spaces_mutex);
+	node = rb_first(&viommu->address_spaces);
+	while (node) {
+		next = rb_next(node);
+		ioas = container_of(node, struct viommu_ioas, node);
+
+		ret = fun(viommu, ioas, data);
+		if (ret)
+			break;
+
+		node = next;
+	}
+
+	mutex_unlock(&viommu->address_spaces_mutex);
+
+	return ret;
 }
 
 static struct viommu_ioas *viommu_alloc_ioas(struct viommu_dev *viommu,
@@ -99,9 +163,12 @@ static struct viommu_ioas *viommu_alloc_ioas(struct viommu_dev *viommu,
 	new_ioas->id		= ioasid;
 	new_ioas->ops		= ops;
 	new_ioas->priv		= ops->alloc_address_space(device);
+	new_ioas->debug_enabled	= viommu->debug_enabled;
 
 	/* A NULL priv pointer is valid. */
 
+	mutex_lock(&viommu->address_spaces_mutex);
+
 	node = &viommu->address_spaces.rb_node;
 	while (*node) {
 		ioas = container_of(*node, struct viommu_ioas, node);
@@ -114,6 +181,7 @@ static struct viommu_ioas *viommu_alloc_ioas(struct viommu_dev *viommu,
 		} else {
 			pr_err("IOAS exists!");
 			free(new_ioas);
+			mutex_unlock(&viommu->address_spaces_mutex);
 			return NULL;
 		}
 	}
@@ -121,6 +189,8 @@ static struct viommu_ioas *viommu_alloc_ioas(struct viommu_dev *viommu,
 	rb_link_node(&new_ioas->node, parent, node);
 	rb_insert_color(&new_ioas->node, &viommu->address_spaces);
 
+	mutex_unlock(&viommu->address_spaces_mutex);
+
 	return new_ioas;
 }
 
@@ -130,7 +200,9 @@ static void viommu_free_ioas(struct viommu_dev *viommu,
 	if (ioas->priv)
 		ioas->ops->free_address_space(ioas->priv);
 
+	mutex_lock(&viommu->address_spaces_mutex);
 	rb_erase(&ioas->node, &viommu->address_spaces);
+	mutex_unlock(&viommu->address_spaces_mutex);
 	free(ioas);
 }
 
@@ -178,8 +250,7 @@ static int viommu_detach_device(struct viommu_dev *viommu,
 	if (!ioas)
 		return -EINVAL;
 
-	pr_debug("detaching device %#lx from IOAS %u",
-		 device_to_iommu_id(device), ioas->id);
+	ioas_debug(ioas, "detaching device %#lx", device_to_iommu_id(device));
 
 	ret = device->iommu_ops->detach(ioas->priv, device);
 	if (!ret)
@@ -208,8 +279,6 @@ static int viommu_handle_attach(struct viommu_dev *viommu,
 		return -ENODEV;
 	}
 
-	pr_debug("attaching device %#x to IOAS %u", device_id, ioasid);
-
 	vdev = device->iommu_data;
 	if (!vdev) {
 		vdev = viommu_alloc_device(device);
@@ -240,6 +309,9 @@ static int viommu_handle_attach(struct viommu_dev *viommu,
 	if (ret && ioas->nr_devices == 0)
 		viommu_free_ioas(viommu, ioas);
 
+	if (!ret)
+		ioas_debug(ioas, "attached device %#x", device_id);
+
 	return ret;
 }
 
@@ -267,6 +339,7 @@ static int viommu_handle_detach(struct viommu_dev *viommu,
 static int viommu_handle_map(struct viommu_dev *viommu,
 			     struct virtio_iommu_req_map *map)
 {
+	int ret;
 	int prot = 0;
 	struct viommu_ioas *ioas;
 
@@ -294,15 +367,21 @@ static int viommu_handle_map(struct viommu_dev *viommu,
 	if (flags & VIRTIO_IOMMU_MAP_F_EXEC)
 		prot |= IOMMU_PROT_EXEC;
 
-	pr_debug("map %#llx -> %#llx (%llu) to IOAS %u", virt_addr,
-		 phys_addr, size, ioasid);
+	ioas_debug(ioas, "map   %#llx -> %#llx (%llu)", virt_addr, phys_addr, size);
+
+	ret = ioas->ops->map(ioas->priv, virt_addr, phys_addr, size, prot);
+	if (!ret) {
+		ioas->stats.resident += size;
+		ioas->stats.map++;
+	}
 
-	return ioas->ops->map(ioas->priv, virt_addr, phys_addr, size, prot);
+	return ret;
 }
 
 static int viommu_handle_unmap(struct viommu_dev *viommu,
 			       struct virtio_iommu_req_unmap *unmap)
 {
+	int ret;
 	struct viommu_ioas *ioas;
 
 	u32 ioasid	= le32_to_cpu(unmap->address_space);
@@ -315,10 +394,15 @@ static int viommu_handle_unmap(struct viommu_dev *viommu,
 		return -ESRCH;
 	}
 
-	pr_debug("unmap %#llx (%llu) from IOAS %u", virt_addr, size,
-		 ioasid);
+	ioas_debug(ioas, "unmap %#llx (%llu)", virt_addr, size);
+
+	ret = ioas->ops->unmap(ioas->priv, virt_addr, size, 0);
+	if (!ret) {
+		ioas->stats.resident -= size;
+		ioas->stats.unmap++;
+	}
 
-	return ioas->ops->unmap(ioas->priv, virt_addr, size, 0);
+	return ret;
 }
 
 static size_t viommu_get_req_len(union virtio_iommu_req *req)
@@ -407,6 +491,8 @@ static ssize_t viommu_dispatch_commands(struct viommu_dev *viommu,
 			continue;
 		}
 
+		viommu->stats.requests++;
+
 		req = iov[i].iov_base;
 		op = req->head.type;
 		expected_len = viommu_get_req_len(req) - sizeof(*tail);
@@ -458,6 +544,8 @@ static void viommu_command(struct kvm *kvm, void *dev)
 
 	vq = &viommu->vq;
 
+	viommu->stats.kicks++;
+
 	while (virt_queue__available(vq)) {
 		head = virt_queue__get_iov(vq, iov, &out, &in, kvm);
 
@@ -594,6 +682,7 @@ void *viommu_register(struct kvm *kvm, struct iommu_properties *props)
 
 	viommu->queue_size		= VIOMMU_DEFAULT_QUEUE_SIZE;
 	viommu->address_spaces		= (struct rb_root)RB_ROOT;
+	viommu->address_spaces_mutex	= (struct mutex)MUTEX_INITIALIZER;
 	viommu->properties		= props;
 
 	viommu->config.page_sizes	= props->pgsize_mask ?: pgsize_mask;
@@ -607,6 +696,11 @@ void *viommu_register(struct kvm *kvm, struct iommu_properties *props)
 		return NULL;
 	}
 
+	mutex_lock(&viommus_mutex);
+	viommu->id = viommu_ids++;
+	list_add_tail(&viommu->list, &viommus);
+	mutex_unlock(&viommus_mutex);
+
 	pr_info("Loaded virtual IOMMU %s", props->name);
 
 	if (compat_id == -1)
@@ -616,21 +710,193 @@ void *viommu_register(struct kvm *kvm, struct iommu_properties *props)
 	return viommu;
 }
 
-void viommu_unregister(struct kvm *kvm, void *viommu)
+void viommu_unregister(struct kvm *kvm, void *dev)
 {
+	struct viommu_dev *viommu = dev;
+
+	mutex_lock(&viommus_mutex);
+	list_del(&viommu->list);
+	mutex_unlock(&viommus_mutex);
+
 	free(viommu);
 }
 
+static const char *debug_usage =
+"  list [iommu [ioas]]            list iommus and address spaces\n"
+"  stats [iommu [ioas]]           display statistics\n"
+"  dump  [iommu [ioas]]           dump mappings\n"
+"  print [iommu [ioas]]           enable debug print\n"
+"  noprint [iommu [ioas]]         disable debug print\n"
+;
+
 int viommu_parse_debug_string(const char *cmdline, struct iommu_debug_params *params)
 {
-	/* show instances numbers */
-	/* send command to instance */
-	/* - dump mappings */
-	/* - statistics */
-	return -ENOSYS;
+	int pos = 0;
+	int ret = -EINVAL;
+	char *cur, *args = strdup(cmdline);
+	params->action = IOMMU_DEBUG_NUM_ACTIONS;
+
+	if (!args)
+		return -ENOMEM;
+
+	params->selector[0] = IOMMU_DEBUG_SELECTOR_INVALID;
+	params->selector[1] = IOMMU_DEBUG_SELECTOR_INVALID;
+
+	cur = strtok(args, " ,:");
+	while (cur) {
+		if (pos > 2)
+			break;
+
+		if (pos > 0) {
+			errno = 0;
+			params->selector[pos - 1] = strtoul(cur, NULL, 0);
+			if (errno) {
+				ret = -errno;
+				pr_err("Invalid number '%s'", cur);
+				break;
+			}
+		} else if (strncmp(cur, "list", 4) == 0) {
+			params->action = IOMMU_DEBUG_LIST;
+		} else if (strncmp(cur, "stats", 5) == 0) {
+			params->action = IOMMU_DEBUG_STATS;
+		} else if (strncmp(cur, "dump", 4) == 0) {
+			params->action = IOMMU_DEBUG_DUMP;
+		} else if (strncmp(cur, "print", 5) == 0) {
+			params->action = IOMMU_DEBUG_SET_PRINT;
+			params->print_enabled = true;
+		} else if (strncmp(cur, "noprint", 7) == 0) {
+			params->action = IOMMU_DEBUG_SET_PRINT;
+			params->print_enabled = false;
+		} else {
+			pr_err("Invalid command '%s'", cur);
+			break;
+		}
+
+		cur = strtok(NULL, " ,:");
+		pos++;
+		ret = 0;
+	}
+
+	if (cur && cur[0])
+		pr_err("Ignoring argument '%s'", cur);
+
+	free(args);
+
+	if (ret)
+		pr_info("Usage:\n%s", debug_usage);
+
+	return ret;
+}
+
+struct viommu_debug_context {
+	int				sock;
+	struct iommu_debug_params	*params;
+	bool				disp;
+};
+
+static int viommu_debug_ioas(struct viommu_dev *viommu,
+			     struct viommu_ioas *ioas,
+			     void *data)
+{
+	int ret = 0;
+	struct viommu_endpoint *vdev;
+	struct viommu_debug_context *ctx = data;
+
+	if (ctx->disp)
+		dprintf(ctx->sock, "  ioas %u\n", ioas->id);
+
+	switch (ctx->params->action) {
+	case IOMMU_DEBUG_LIST:
+		mutex_lock(&ioas->devices_mutex);
+		list_for_each_entry(vdev, &ioas->devices, list) {
+			dprintf(ctx->sock, "    device 0x%lx\n",
+				device_to_iommu_id(vdev->dev));
+		}
+		mutex_unlock(&ioas->devices_mutex);
+		break;
+	case IOMMU_DEBUG_STATS:
+		dprintf(ctx->sock, "    maps                %llu\n",
+			ioas->stats.map);
+		dprintf(ctx->sock, "    unmaps              %llu\n",
+			ioas->stats.unmap);
+		dprintf(ctx->sock, "    resident            %llu\n",
+			ioas->stats.resident);
+		break;
+	case IOMMU_DEBUG_SET_PRINT:
+		ioas->debug_enabled = ctx->params->print_enabled;
+		break;
+	default:
+		ret = -ENOSYS;
+	}
+
+	if (ioas->ops->debug_address_space)
+		ret = ioas->ops->debug_address_space(ioas->priv, ctx->sock,
+						     ctx->params);
+
+	return ret;
+}
+
+static int viommu_debug_iommu(struct viommu_dev *viommu,
+			      struct viommu_debug_context *ctx)
+{
+	struct viommu_ioas *ioas;
+
+	if (ctx->disp)
+		dprintf(ctx->sock, "iommu %u \"%s\"\n", viommu->id,
+			viommu->properties->name);
+
+	if (ctx->params->selector[1] != IOMMU_DEBUG_SELECTOR_INVALID) {
+		ioas = viommu_find_ioas(viommu, ctx->params->selector[1]);
+		return ioas ? viommu_debug_ioas(viommu, ioas, ctx) : -ESRCH;
+	}
+
+	switch (ctx->params->action) {
+	case IOMMU_DEBUG_STATS:
+		dprintf(ctx->sock, "  kicks                 %llu\n",
+			viommu->stats.kicks);
+		dprintf(ctx->sock, "  requests              %llu\n",
+			viommu->stats.requests);
+		break;
+	case IOMMU_DEBUG_SET_PRINT:
+		viommu->debug_enabled = ctx->params->print_enabled;
+		break;
+	default:
+		break;
+	}
+
+	return viommu_for_each_ioas(viommu, viommu_debug_ioas, ctx);
 }
 
 int viommu_debug(int sock, struct iommu_debug_params *params)
 {
-	return -ENOSYS;
+	int ret = -ESRCH;
+	bool match;
+	struct viommu_dev *viommu;
+	bool any = (params->selector[0] == IOMMU_DEBUG_SELECTOR_INVALID);
+
+	struct viommu_debug_context ctx = {
+		.sock		= sock,
+		.params		= params,
+	};
+
+	if (params->action == IOMMU_DEBUG_LIST ||
+	    params->action == IOMMU_DEBUG_STATS)
+		ctx.disp = true;
+
+	mutex_lock(&viommus_mutex);
+	list_for_each_entry(viommu, &viommus, list) {
+		match = (params->selector[0] == viommu->id);
+		if (match || any) {
+			ret = viommu_debug_iommu(viommu, &ctx);
+			if (ret || match)
+				break;
+		}
+	}
+	mutex_unlock(&viommus_mutex);
+
+	if (ret)
+		dprintf(sock, "error: %s\n", strerror(-ret));
+
+	return ret;
 }
diff --git a/virtio/mmio.c b/virtio/mmio.c
index 699d4403..7d39120a 100644
--- a/virtio/mmio.c
+++ b/virtio/mmio.c
@@ -307,6 +307,7 @@ static struct iommu_ops virtio_mmio_iommu_ops = {
 	.get_properties		= virtio__iommu_get_properties,
 	.alloc_address_space	= iommu_alloc_address_space,
 	.free_address_space	= iommu_free_address_space,
+	.debug_address_space	= iommu_debug_address_space,
 	.attach			= virtio_mmio_iommu_attach,
 	.detach			= virtio_mmio_iommu_detach,
 	.map			= iommu_map,
diff --git a/virtio/pci.c b/virtio/pci.c
index c9f0e558..c5d30eb2 100644
--- a/virtio/pci.c
+++ b/virtio/pci.c
@@ -442,6 +442,7 @@ static struct iommu_ops virtio_pci_iommu_ops = {
 	.get_properties		= virtio__iommu_get_properties,
 	.alloc_address_space	= iommu_alloc_address_space,
 	.free_address_space	= iommu_free_address_space,
+	.debug_address_space	= iommu_debug_address_space,
 	.attach			= virtio_pci_iommu_attach,
 	.detach			= virtio_pci_iommu_detach,
 	.map			= iommu_map,
-- 
2.12.1

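For illustration, here is what the parser above produces for two typical
command strings. This is a hand-worked example, not output from the
tool, and uses only fields defined in this patch:

	struct iommu_debug_params p;

	viommu_parse_debug_string("stats:0:3", &p);
	/* p.action = IOMMU_DEBUG_STATS, p.selector = { 0, 3 } */

	viommu_parse_debug_string("noprint", &p);
	/* p.action = IOMMU_DEBUG_SET_PRINT, p.print_enabled = false,
	 * p.selector = { IOMMU_DEBUG_SELECTOR_INVALID, ... }, meaning
	 * the command applies to all IOMMUs and address spaces */
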
^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [RFC PATCH kvmtool 15/15] virtio: use virtio-iommu when available
  2017-04-07 19:24 ` [RFC PATCH kvmtool 00/15] Add virtio-iommu Jean-Philippe Brucker
                     ` (27 preceding siblings ...)
  2017-04-07 19:24   ` Jean-Philippe Brucker
@ 2017-04-07 19:24   ` Jean-Philippe Brucker
  2017-04-07 19:24   ` Jean-Philippe Brucker
                     ` (2 subsequent siblings)
  31 siblings, 0 replies; 99+ messages in thread
From: Jean-Philippe Brucker @ 2017-04-07 19:24 UTC (permalink / raw)
  To: iommu, kvm, virtualization, virtio-dev
  Cc: cdall, will.deacon, robin.murphy, lorenzo.pieralisi, joro, mst,
	jasowang, alex.williamson, marc.zyngier

This is for development only. Virtual devices might blow up unexpectedly.
In general it seems to work (slowing devices down by a factor of two, of
course). virtio-scsi, virtio-rng and virtio-balloon are still untested.

Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
---
 virtio/core.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/virtio/core.c b/virtio/core.c
index 66e0cecb..4ca632f9 100644
--- a/virtio/core.c
+++ b/virtio/core.c
@@ -1,4 +1,5 @@
 #include <linux/virtio_config.h>
+#include <linux/virtio_ids.h>
 #include <linux/virtio_ring.h>
 #include <linux/types.h>
 #include <sys/uio.h>
@@ -369,6 +370,8 @@ int virtio_init(struct kvm *kvm, void *dev, struct virtio_device *vdev,
 {
 	void *virtio;
 
+	vdev->use_iommu = kvm->cfg.viommu && subsys_id != VIRTIO_ID_IOMMU;
+
 	switch (trans) {
 	case VIRTIO_PCI:
 		virtio = calloc(sizeof(struct virtio_pci), 1);
-- 
2.12.1

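To illustrate the effect of the flag set above: when use_iommu is set, a
device model has to route guest addresses through the vIOMMU before
touching memory. The sketch below is illustrative only. iommu_access()
and guest_flat_to_host() exist in this series and in kvmtool
respectively, but the wrapper itself and its 'ioas' parameter are
assumptions, and error handling is omitted:

	static void *virtio_dma_to_host(struct kvm *kvm,
					struct virtio_device *vdev,
					void *ioas, u64 addr, size_t len)
	{
		size_t out_len;

		/* Translate IOVA to GPA through the virtual IOMMU */
		if (vdev->use_iommu)
			addr = iommu_access(ioas, addr, len, &out_len,
					    IOMMU_PROT_READ | IOMMU_PROT_WRITE);

		/* GPA to host VA, as for any non-translated device */
		return guest_flat_to_host(kvm, addr);
	}
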
^ permalink raw reply related	[flat|nested] 99+ messages in thread

* Re: [RFC 0/3] virtio-iommu: a paravirtualized IOMMU
  2017-04-07 19:17 [RFC 0/3] virtio-iommu: a paravirtualized IOMMU Jean-Philippe Brucker
                   ` (7 preceding siblings ...)
  2017-04-07 21:19 ` [RFC 0/3] virtio-iommu: a paravirtualized IOMMU Michael S. Tsirkin
@ 2017-04-07 21:19 ` Michael S. Tsirkin
  2017-04-10 18:39   ` [virtio-dev] " Jean-Philippe Brucker
  2017-04-10 18:39   ` Jean-Philippe Brucker
  2017-04-10  2:30 ` Need information on type 2 IOMMU valmiki
                   ` (4 subsequent siblings)
  13 siblings, 2 replies; 99+ messages in thread
From: Michael S. Tsirkin @ 2017-04-07 21:19 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: iommu, kvm, virtualization, virtio-dev, cdall, will.deacon,
	robin.murphy, lorenzo.pieralisi, joro, jasowang, alex.williamson,
	marc.zyngier

On Fri, Apr 07, 2017 at 08:17:44PM +0100, Jean-Philippe Brucker wrote:
> There are a number of advantages in a paravirtualized IOMMU over a full
> emulation. It is portable and could be reused on different architectures.
> It is easier to implement than a full emulation, with less state tracking.
> It might be more efficient in some cases, with less context switches to
> the host and the possibility of in-kernel emulation.

Thanks, this is very interesting. I am ready to read it all, but I really
would like you to expand some more on the motivation for this work.
Productising this would be quite a bit of work. Spending just 6 lines on
motivation seems somewhat disproportionate. In particular, do you have
any specific efficiency measurements or estimates that you can share?

-- 
MST

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Need information on type 2 IOMMU
  2017-04-07 19:17 [RFC 0/3] virtio-iommu: a paravirtualized IOMMU Jean-Philippe Brucker
                   ` (8 preceding siblings ...)
  2017-04-07 21:19 ` Michael S. Tsirkin
@ 2017-04-10  2:30 ` valmiki
  2017-04-12  9:06 ` [RFC 0/3] virtio-iommu: a paravirtualized IOMMU Jason Wang
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 99+ messages in thread
From: valmiki @ 2017-04-10  2:30 UTC (permalink / raw)
  To: iommu, kvm, virtualization, virtio-dev; +Cc: marc.zyngier, mst

Hi All,

We have drivers/vfio/vfio_iommu_type1.c. What is a "type1" IOMMU? Is the
name specific to the VFIO layer?

Is there a "type 2" IOMMU in VFIO? If so, what is it?

Regards,
Valmiki

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Need information on type 2 IOMMU
  2017-04-10  2:30   ` Need information on type 2 IOMMU valmiki
       [not found]     ` <1b48daab-c9e1-84d1-78a9-84d3e2001f32-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2017-04-10  4:19     ` Alex Williamson
  1 sibling, 0 replies; 99+ messages in thread
From: Alex Williamson @ 2017-04-10  4:19 UTC (permalink / raw)
  To: valmiki; +Cc: virtio-dev, kvm, mst, marc.zyngier, virtualization, iommu

On Mon, 10 Apr 2017 08:00:45 +0530
valmiki <valmikibow@gmail.com> wrote:

> Hi All,
> 
> We have drivers/vfio/vfio_iommu_type1.c. What is a "type1" IOMMU? Is the
> name specific to the VFIO layer?
> 
> Is there a "type 2" IOMMU in VFIO? If so, what is it?

type1 is the 1st type.  It's an arbitrary name.  There is no type2, yet.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: Re: [RFC 0/3] virtio-iommu: a paravirtualized IOMMU
  2017-04-07 21:19 ` Michael S. Tsirkin
  2017-04-10 18:39   ` [virtio-dev] " Jean-Philippe Brucker
@ 2017-04-10 18:39   ` Jean-Philippe Brucker
  2017-04-10 20:04     ` [virtio-dev] " Michael S. Tsirkin
  2017-04-10 20:04     ` Michael S. Tsirkin
  1 sibling, 2 replies; 99+ messages in thread
From: Jean-Philippe Brucker @ 2017-04-10 18:39 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: iommu, kvm, virtualization, virtio-dev, cdall, will.deacon,
	robin.murphy, lorenzo.pieralisi, joro, jasowang, alex.williamson,
	marc.zyngier

On 07/04/17 22:19, Michael S. Tsirkin wrote:
> On Fri, Apr 07, 2017 at 08:17:44PM +0100, Jean-Philippe Brucker wrote:
>> There are a number of advantages in a paravirtualized IOMMU over a full
>> emulation. It is portable and could be reused on different architectures.
>> It is easier to implement than a full emulation, with less state tracking.
>> It might be more efficient in some cases, with less context switches to
>> the host and the possibility of in-kernel emulation.
> 
> Thanks, this is very interesting. I am ready to read it all, but I really
> would like you to expand some more on the motivation for this work.
> Productising this would be quite a bit of work. Spending just 6 lines on
> motivation seems somewhat disproportionate. In particular, do you have
> any specific efficiency measurements or estimates that you can share?

The main motivation for this work is to bring IOMMU virtualization to the
ARM world. We don't have any at the moment, and a full ARM SMMU
virtualization solution would be counter-productive. We would have to do
it for SMMUv2, for the completely orthogonal SMMUv3, and for any future
version of the architecture. Doing so in userspace might be acceptable,
but then for performance reasons people will want in-kernel emulation of
every IOMMU variant out there, which is a maintenance and security
nightmare. A single generic vIOMMU is preferable because it reduces
maintenance cost and attack surface.

The transport code is the same as any virtio device, both for userspace
and in-kernel implementations. So instead of rewriting everything from
scratch (and the lot of bugs that go with it) for each IOMMU variation, we
reuse well-tested code for transport and write the emulation layer once
and for all.

Note that this work applies to any architecture with an IOMMU, not only
ARM and their partners'. Introducing an IOMMU specially designed for
virtualization allows us to get rid of complex state tracking inherent to
full IOMMU emulations. With a full emulation, all guest accesses to page
table and configuration structures have to be trapped and interpreted. A
Virtio interface provides well-defined semantics and doesn't need to guess
what the guest is trying to do. It transmits requests made from guest
device drivers to host IOMMU almost unaltered, removing the intermediate
layer of arch-specific configuration structures and page tables.

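As a concrete example of this "almost unaltered" path, the kvmtool
device from the companion series handles a map request with little more
than endianness conversion before calling into VFIO or the MMIO
emulation layer. Simplified from viommu_handle_map(), with error
handling and statistics stripped:

	u32 ioasid    = le32_to_cpu(map->address_space);
	u64 virt_addr = le64_to_cpu(map->virt_addr);
	u64 phys_addr = le64_to_cpu(map->phys_addr);
	u64 size      = le64_to_cpu(map->size);
	/* prot is decoded from the request's VIRTIO_IOMMU_MAP_F_* flags */

	ioas = viommu_find_ioas(viommu, ioasid);
	ret  = ioas->ops->map(ioas->priv, virt_addr, phys_addr, size, prot);
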
Using a portable standard like Virtio also allows for efficient IOMMU
virtualization when guest and host are built for different architectures
(for instance when using Qemu TCG.) In-kernel emulation would still work
with vhost-iommu, but platform-specific vIOMMUs would have to stay in
userspace.

I don't have any measurements at the moment, it is a bit early for that.
The kvmtool example was developed on a software model and is mostly here
for illustrative purpose, a Qemu implementation would be more suitable for
performance analysis. I wouldn't be able to give meaning to these numbers
anyway, since on ARM we don't have any existing solution to compare it
against. One could compare the complexity of handling guest accesses and
parsing page tables in Qemu's VT-d emulation with reading a chain of
buffers in Virtio, for a very rough estimate.

Thanks,
Jean-Philippe

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [virtio-dev] Re: [RFC 0/3] virtio-iommu: a paravirtualized IOMMU
  2017-04-10 18:39   ` Jean-Philippe Brucker
  2017-04-10 20:04     ` [virtio-dev] " Michael S. Tsirkin
@ 2017-04-10 20:04     ` Michael S. Tsirkin
  1 sibling, 0 replies; 99+ messages in thread
From: Michael S. Tsirkin @ 2017-04-10 20:04 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: iommu, kvm, virtualization, virtio-dev, cdall, will.deacon,
	robin.murphy, lorenzo.pieralisi, joro, jasowang, alex.williamson,
	marc.zyngier

On Mon, Apr 10, 2017 at 07:39:24PM +0100, Jean-Philippe Brucker wrote:
> On 07/04/17 22:19, Michael S. Tsirkin wrote:
> > On Fri, Apr 07, 2017 at 08:17:44PM +0100, Jean-Philippe Brucker wrote:
> >> There are a number of advantages in a paravirtualized IOMMU over a full
> >> emulation. It is portable and could be reused on different architectures.
> >> It is easier to implement than a full emulation, with less state tracking.
> >> It might be more efficient in some cases, with less context switches to
> >> the host and the possibility of in-kernel emulation.
> > 
> > Thanks, this is very interesting. I am ready to read it all, but I really
> > would like you to expand some more on the motivation for this work.
> > Productising this would be quite a bit of work. Spending just 6 lines on
> > motivation seems somewhat disproportionate. In particular, do you have
> > any specific efficiency measurements or estimates that you can share?
> 
> The main motivation for this work is to bring IOMMU virtualization to the
> ARM world. We don't have any at the moment, and a full ARM SMMU
> virtualization solution would be counter-productive. We would have to do
> it for SMMUv2, for the completely orthogonal SMMUv3, and for any future
> version of the architecture. Doing so in userspace might be acceptable,
> but then for performance reasons people will want in-kernel emulation of
> every IOMMU variant out there, which is a maintenance and security
> nightmare. A single generic vIOMMU is preferable because it reduces
> maintenance cost and attack surface.
> 
> The transport code is the same as any virtio device, both for userspace
> and in-kernel implementations. So instead of rewriting everything from
> scratch (and the lot of bugs that go with it) for each IOMMU variation, we
> reuse well-tested code for transport and write the emulation layer once
> and for all.
> 
> Note that this work applies to any architecture with an IOMMU, not only
> ARM and their partners'. Introducing an IOMMU specially designed for
> virtualization allows us to get rid of complex state tracking inherent to
> full IOMMU emulations. With a full emulation, all guest accesses to page
> table and configuration structures have to be trapped and interpreted. A
> Virtio interface provides well-defined semantics and doesn't need to guess
> what the guest is trying to do. It transmits requests made from guest
> device drivers to host IOMMU almost unaltered, removing the intermediate
> layer of arch-specific configuration structures and page tables.
> 
> Using a portable standard like Virtio also allows for efficient IOMMU
> virtualization when guest and host are built for different architectures
> (for instance when using Qemu TCG.) In-kernel emulation would still work
> with vhost-iommu, but platform-specific vIOMMUs would have to stay in
> userspace.
> 
> I don't have any measurements at the moment, it is a bit early for that.
> The kvmtool example was developed on a software model and is mostly here
> for illustrative purpose, a Qemu implementation would be more suitable for
> performance analysis. I wouldn't be able to give meaning to these numbers
> anyway, since on ARM we don't have any existing solution to compare it
> against. One could compare the complexity of handling guest accesses and
> parsing page tables in Qemu's VT-d emulation with reading a chain of
> buffers in Virtio, for a very rough estimate.
> 
> Thanks,
> Jean-Philippe

This last suggestion sounds very reasonable.

-- 
MST

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC 0/3] virtio-iommu: a paravirtualized IOMMU
  2017-04-07 19:17 [RFC 0/3] virtio-iommu: a paravirtualized IOMMU Jean-Philippe Brucker
                   ` (9 preceding siblings ...)
  2017-04-10  2:30 ` Need information on type 2 IOMMU valmiki
@ 2017-04-12  9:06 ` Jason Wang
  2017-04-13  8:16   ` Tian, Kevin
       [not found]   ` <a0920e37-a11e-784c-7d90-be6617ea7686-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2017-04-12  9:06 ` Jason Wang
                   ` (2 subsequent siblings)
  13 siblings, 2 replies; 99+ messages in thread
From: Jason Wang @ 2017-04-12  9:06 UTC (permalink / raw)
  To: Jean-Philippe Brucker, iommu, kvm, virtualization, virtio-dev
  Cc: cdall, will.deacon, robin.murphy, lorenzo.pieralisi, joro, mst,
	alex.williamson, marc.zyngier



On 2017-04-08 03:17, Jean-Philippe Brucker wrote:
> This is the initial proposal for a paravirtualized IOMMU device using
> virtio transport. It contains a description of the device, a Linux driver,
> and a toy implementation in kvmtool. With this prototype, you can
> translate DMA to guest memory from emulated (virtio), or passed-through
> (VFIO) devices.
>
> In its simplest form, implemented here, the device handles map/unmap
> requests from the guest. Future extensions proposed in "RFC 3/3" should
> allow to bind page tables to devices.
>
> There are a number of advantages in a paravirtualized IOMMU over a full
> emulation. It is portable and could be reused on different architectures.
> It is easier to implement than a full emulation, with less state tracking.
> It might be more efficient in some cases, with less context switches to
> the host and the possibility of in-kernel emulation.

I like the idea. Considering the complexity of IOMMU hardware, I believe
we don't want to maintain, and fight bugs in, three or more different
IOMMU implementations in either userspace or the kernel.

Thanks

>
> When designing it and writing the kvmtool device, I considered two main
> scenarios, illustrated below.
>
> Scenario 1: a hardware device passed through twice via VFIO
>
>     MEM____pIOMMU________PCI device________________________       HARDWARE
>              |     (2b)                                    \
>    ----------|-------------+-------------+------------------\-------------
>              |             :     KVM     :                   \
>              |             :             :                    \
>         pIOMMU drv         :         _______virtio-iommu drv   \    KERNEL
>              |             :        |    :          |           \
>            VFIO            :        |    :        VFIO           \
>              |             :        |    :          |             \
>              |             :        |    :          |             /
>    ----------|-------------+--------|----+----------|------------/--------
>              |                      |    :          |           /
>              | (1c)            (1b) |    :     (1a) |          / (2a)
>              |                      |    :          |         /
>              |                      |    :          |        /   USERSPACE
>              |___virtio-iommu dev___|    :        net drv___/
>                                          :
>    --------------------------------------+--------------------------------
>                   HOST                   :             GUEST
>
> (1) a. Guest userspace is running a net driver (e.g. DPDK). It allocates a
>        buffer with mmap, obtaining virtual address VA. It then sends a
>         VFIO_IOMMU_MAP_DMA request to map VA to an IOVA (possibly VA=IOVA).
>     b. The mapping request is relayed to the host through virtio
>         (VIRTIO_IOMMU_T_MAP).
>      c. The mapping request is relayed to the physical IOMMU through VFIO.
>
> (2) a. The guest userspace driver can now instruct the device to directly
>         access the buffer at IOVA
>      b. IOVA accesses from the device are translated into physical
>         addresses by the IOMMU.
>
> Scenario 2: a virtual net device behind a virtual IOMMU.
>
>    MEM__pIOMMU___PCI device                                     HARDWARE
>           |         |
>    -------|---------|------+-------------+-------------------------------
>           |         |      :     KVM     :
>           |         |      :             :
>      pIOMMU drv     |      :             :
>               \     |      :      _____________virtio-net drv      KERNEL
>                \_net drv   :     |       :          / (1a)
>                     |      :     |       :         /
>                    tap     :     |    ________virtio-iommu drv
>                     |      :     |   |   : (1b)
>    -----------------|------+-----|---|---+-------------------------------
>                     |            |   |   :
>                     |_virtio-net_|   |   :
>                           / (2)      |   :
>                          /           |   :                      USERSPACE
>                virtio-iommu dev______|   :
>                                          :
>    --------------------------------------+-------------------------------
>                   HOST                   :             GUEST
>
> (1) a. Guest virtio-net driver maps the virtio ring and a buffer
>      b. The mapping requests are relayed to the host through virtio.
> (2) The virtio-net device now needs to access any guest memory via the
>      IOMMU.
>
> Physical and virtual IOMMUs are completely dissociated. The net driver is
> mapping its own buffers via DMA/IOMMU API, and buffers are copied between
> virtio-net and tap.
>
>
> The description itself seemed too long for a single email, so I split it
> into three documents, and will attach Linux and kvmtool patches to this
> email.
>
> 	1. Firmware note,
> 	2. device operations (draft for the virtio specification),
> 	3. future work/possible improvements.
>
> Just to be clear on the terms I'm using:
>
> pIOMMU	physical IOMMU, controlling DMA accesses from physical devices
> vIOMMU	virtual IOMMU (virtio-iommu), controlling DMA accesses from
> 	physical and virtual devices to guest memory.
> GVA, GPA, HVA, HPA
> 	Guest/Host Virtual/Physical Address
> IOVA	I/O Virtual Address, the address accessed by a device doing DMA
> 	through an IOMMU. In the context of a guest OS, IOVA is GVA.
>
> Note: kvmtool is GPLv2. Linux patches are GPLv2, except for UAPI
> virtio-iommu.h header, which is BSD 3-clause. For the time being, the
> specification draft in RFC 2/3 is also BSD 3-clause.
>
>
> This proposal may be inadvertently centered around ARM architectures at
> times. Any feedback would be appreciated, especially regarding other IOMMU
> architectures.
>
> Thanks,
> Jean-Philippe

* RE: [RFC 0/3] virtio-iommu: a paravirtualized IOMMU
  2017-04-12  9:06 ` [RFC 0/3] virtio-iommu: a paravirtualized IOMMU Jason Wang
@ 2017-04-13  8:16   ` Tian, Kevin
       [not found]   ` <a0920e37-a11e-784c-7d90-be6617ea7686-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 0 replies; 99+ messages in thread
From: Tian, Kevin @ 2017-04-13  8:16 UTC (permalink / raw)
  To: Jason Wang, Jean-Philippe Brucker, iommu, kvm, virtualization,
	virtio-dev
  Cc: cdall, lorenzo.pieralisi, mst, marc.zyngier, joro, will.deacon,
	robin.murphy

> From: Jason Wang
> Sent: Wednesday, April 12, 2017 5:07 PM
> 
> On 2017年04月08日 03:17, Jean-Philippe Brucker wrote:
> > This is the initial proposal for a paravirtualized IOMMU device using
> > virtio transport. It contains a description of the device, a Linux driver,
> > and a toy implementation in kvmtool. With this prototype, you can
> > translate DMA to guest memory from emulated (virtio), or passed-through
> > (VFIO) devices.
> >
> > In its simplest form, implemented here, the device handles map/unmap
> > requests from the guest. Future extensions proposed in "RFC 3/3" should
> > allow binding page tables to devices.
> >
> > There are a number of advantages in a paravirtualized IOMMU over a full
> > emulation. It is portable and could be reused on different architectures.
> > It is easier to implement than a full emulation, with less state tracking.
> > It might be more efficient in some cases, with fewer context switches to
> > the host and the possibility of in-kernel emulation.
> 
> I like the idea. Consider the complexity of IOMMU hardware. I believe we
> don't want to have, and fight bugs in, three or more different IOMMU
> implementations in either userspace or kernel.
> 

Though there are definitely positive things about the pvIOMMU approach,
it also has some limitations:

- Existing IOMMU implementations have been in old distros for quite some
time, while the pvIOMMU driver will only land in future distros. Doing
pvIOMMU only means we completely drop support for old distros in VMs;

- There is a similar situation with other guest OSes, e.g. Windows. The
IOMMU is a key kernel component, and I'm not sure a pvIOMMU based on virtio
would be recognized in those OSes (unlike a virtio device driver);

I would imagine both fully-emulated IOMMUs and the pvIOMMU will co-exist
for some time due to the above reasons. Someday, when the pvIOMMU is
mature/widespread enough in the ecosystem (and feature-wise comparable to
fully-emulated IOMMUs for all vendors), we may make a call.

Thanks,
Kevin

* RE: [RFC 0/3] virtio-iommu: a paravirtualized IOMMU
  2017-04-07 19:17 [RFC 0/3] virtio-iommu: a paravirtualized IOMMU Jean-Philippe Brucker
                   ` (11 preceding siblings ...)
  2017-04-12  9:06 ` Jason Wang
@ 2017-04-13  8:41 ` Tian, Kevin
       [not found] ` <20170407191747.26618-1-jean-philippe.brucker-5wv7dgnIgG8@public.gmane.org>
  13 siblings, 0 replies; 99+ messages in thread
From: Tian, Kevin @ 2017-04-13  8:41 UTC (permalink / raw)
  To: Jean-Philippe Brucker, iommu, kvm, virtualization, virtio-dev
  Cc: cdall, lorenzo.pieralisi, mst, marc.zyngier, joro, will.deacon,
	robin.murphy

> From: Jean-Philippe Brucker
> Sent: Saturday, April 8, 2017 3:18 AM
> 
> This is the initial proposal for a paravirtualized IOMMU device using
> virtio transport. It contains a description of the device, a Linux driver,
> and a toy implementation in kvmtool. With this prototype, you can
> translate DMA to guest memory from emulated (virtio), or passed-through
> (VFIO) devices.
> 
> In its simplest form, implemented here, the device handles map/unmap
> requests from the guest. Future extensions proposed in "RFC 3/3" should
> allow binding page tables to devices.
> 
> There are a number of advantages in a paravirtualized IOMMU over a full
> emulation. It is portable and could be reused on different architectures.
> It is easier to implement than a full emulation, with less state tracking.
> It might be more efficient in some cases, with fewer context switches to
> the host and the possibility of in-kernel emulation.
> 
> When designing it and writing the kvmtool device, I considered two main
> scenarios, illustrated below.
> 
> Scenario 1: a hardware device passed through twice via VFIO
> 
>    MEM____pIOMMU________PCI device________________________      HARDWARE
>             |     (2b)                                    \
>   ----------|-------------+-------------+------------------\-------------
>             |             :     KVM     :                   \
>             |             :             :                    \
>        pIOMMU drv         :         _______virtio-iommu drv   \    KERNEL
>             |             :        |    :          |           \
>           VFIO            :        |    :        VFIO           \
>             |             :        |    :          |             \
>             |             :        |    :          |             /
>   ----------|-------------+--------|----+----------|------------/--------
>             |                      |    :          |           /
>             | (1c)            (1b) |    :     (1a) |          / (2a)
>             |                      |    :          |         /
>             |                      |    :          |        /   USERSPACE
>             |___virtio-iommu dev___|    :        net drv___/
>                                         :
>   --------------------------------------+--------------------------------
>                  HOST                   :             GUEST
> 

Usually people draw such layers in reverse order, e.g. hw at the
bottom, kernel in the middle, and user at the top. :-)

> (1) a. Guest userspace is running a net driver (e.g. DPDK). It allocates a
>        buffer with mmap, obtaining virtual address VA. It then sends a
>        VFIO_IOMMU_MAP_DMA request to map VA to an IOVA (possibly VA=IOVA).
>     b. The mapping request is relayed to the host through virtio
>        (VIRTIO_IOMMU_T_MAP).
>     c. The mapping request is relayed to the physical IOMMU through VFIO.
> 
> (2) a. The guest userspace driver can now instruct the device to directly
>        access the buffer at IOVA
>     b. IOVA accesses from the device are translated into physical
>        addresses by the IOMMU.
> 
> Scenario 2: a virtual net device behind a virtual IOMMU.
> 
>   MEM__pIOMMU___PCI device                                     HARDWARE
>          |         |
>   -------|---------|------+-------------+-------------------------------
>          |         |      :     KVM     :
>          |         |      :             :
>     pIOMMU drv     |      :             :
>              \     |      :      _____________virtio-net drv      KERNEL
>               \_net drv   :     |       :          / (1a)
>                    |      :     |       :         /
>                   tap     :     |    ________virtio-iommu drv
>                    |      :     |   |   : (1b)
>   -----------------|------+-----|---|---+-------------------------------
>                    |            |   |   :
>                    |_virtio-net_|   |   :
>                          / (2)      |   :
>                         /           |   :                      USERSPACE
>               virtio-iommu dev______|   :
>                                         :
>   --------------------------------------+-------------------------------
>                  HOST                   :             GUEST
> 
> (1) a. Guest virtio-net driver maps the virtio ring and a buffer
>     b. The mapping requests are relayed to the host through virtio.
> (2) The virtio-net device now needs to access any guest memory via the
>     IOMMU.
> 
> Physical and virtual IOMMUs are completely dissociated. The net driver is
> mapping its own buffers via DMA/IOMMU API, and buffers are copied between
> virtio-net and tap.
> 
> 
> The description itself seemed too long for a single email, so I split it
> into three documents, and will attach Linux and kvmtool patches to this
> email.
> 
> 	1. Firmware note,
> 	2. device operations (draft for the virtio specification),
> 	3. future work/possible improvements.
> 
> Just to be clear on the terms I'm using:
> 
> pIOMMU	physical IOMMU, controlling DMA accesses from physical devices
> vIOMMU	virtual IOMMU (virtio-iommu), controlling DMA accesses from
> 	physical and virtual devices to guest memory.

Maybe it would be clearer to say it controls 'virtual' DMA accesses, since
we're essentially doing DMA virtualization here. Otherwise I find it a bit
confusing, since DMA accesses from a physical device should be controlled
by the pIOMMU.

> GVA, GPA, HVA, HPA
> 	Guest/Host Virtual/Physical Address
> IOVA	I/O Virtual Address, the address accessed by a device doing DMA
> 	through an IOMMU. In the context of a guest OS, IOVA is GVA.

This statement is not accurate. For kernel DMA protection, it is a
per-device standalone address space (definitely nothing to do with GVA).
For user DMA protection, the user space driver decides how it wants to
construct the IOVA address space: it could be a standalone one, or reuse
GVA. In the virtualization case it is either GPA (w/o vIOMMU) or guest
IOVA (w/ vIOMMU, where the guest creates the IOVA space).

Anyway, the IOVA concept is clear; the text would possibly be just as
clear with the example removed. :-)

> 
> Note: kvmtool is GPLv2. Linux patches are GPLv2, except for UAPI
> virtio-iommu.h header, which is BSD 3-clause. For the time being, the
> specification draft in RFC 2/3 is also BSD 3-clause.
> 
> 
> This proposal may be inadvertently centered around ARM architectures at
> times. Any feedback would be appreciated, especially regarding other IOMMU
> architectures.
> 

Thanks for doing this. I will definitely look at them in detail and give feedback.

Thanks
Kevin

* Re: [RFC 0/3] virtio-iommu: a paravirtualized IOMMU
  2017-04-13  8:16     ` Tian, Kevin
       [not found]       ` <AADFC41AFE54684AB9EE6CBC0274A5D190CA990E-0J0gbvR4kThpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2017-04-13 13:12       ` Jean-Philippe Brucker
  1 sibling, 0 replies; 99+ messages in thread
From: Jean-Philippe Brucker @ 2017-04-13 13:12 UTC (permalink / raw)
  To: Tian, Kevin, Jason Wang, iommu, kvm, virtualization, virtio-dev
  Cc: cdall, lorenzo.pieralisi, mst, marc.zyngier, joro, will.deacon,
	robin.murphy

On 13/04/17 09:16, Tian, Kevin wrote:
>> From: Jason Wang
>> Sent: Wednesday, April 12, 2017 5:07 PM
>>
>> On 2017年04月08日 03:17, Jean-Philippe Brucker wrote:
>>> This is the initial proposal for a paravirtualized IOMMU device using
>>> virtio transport. It contains a description of the device, a Linux driver,
>>> and a toy implementation in kvmtool. With this prototype, you can
>>> translate DMA to guest memory from emulated (virtio), or passed-through
>>> (VFIO) devices.
>>>
>>> In its simplest form, implemented here, the device handles map/unmap
>>> requests from the guest. Future extensions proposed in "RFC 3/3" should
>>> allow binding page tables to devices.
>>>
>>> There are a number of advantages in a paravirtualized IOMMU over a full
>>> emulation. It is portable and could be reused on different architectures.
>>> It is easier to implement than a full emulation, with less state tracking.
>>> It might be more efficient in some cases, with fewer context switches to
>>> the host and the possibility of in-kernel emulation.
>>
>> I like the idea. Consider the complexity of IOMMU hardware. I believe we
>> don't want to have, and fight bugs in, three or more different IOMMU
>> implementations in either userspace or kernel.
>>
> 
> Though there are definitely positive things about the pvIOMMU approach,
> it also has some limitations:
> 
> - Existing IOMMU implementations have been in old distros for quite some
> time, while the pvIOMMU driver will only land in future distros. Doing
> pvIOMMU only means we completely drop support for old distros in VMs;
> 
> - There is a similar situation with other guest OSes, e.g. Windows. The
> IOMMU is a key kernel component, and I'm not sure a pvIOMMU based on virtio
> would be recognized in those OSes (unlike a virtio device driver);

I can't talk about other OSes, but on Linux virtio-iommu is implemented
the same way as other IOMMU drivers and doesn't require core modifications.
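
(Purely as an illustrative sketch of that registration path, under the 4.x
kernel API: an IOMMU driver fills in a struct iommu_ops and registers it
with the bus. The viommu_* names below are placeholders, not the actual
driver code.)

#include <linux/iommu.h>
#include <linux/platform_device.h>
#include <linux/sizes.h>

/* Placeholders for the driver's real callbacks, which would send
 * VIRTIO_IOMMU_T_MAP/UNMAP requests over the virtqueue. */
struct iommu_domain *viommu_domain_alloc(unsigned type);
void viommu_domain_free(struct iommu_domain *domain);
int viommu_attach_dev(struct iommu_domain *domain, struct device *dev);
int viommu_map(struct iommu_domain *domain, unsigned long iova,
	       phys_addr_t paddr, size_t size, int prot);
size_t viommu_unmap(struct iommu_domain *domain, unsigned long iova,
		    size_t size);

static const struct iommu_ops viommu_ops = {
	.domain_alloc	= viommu_domain_alloc,
	.domain_free	= viommu_domain_free,
	.attach_dev	= viommu_attach_dev,
	.map		= viommu_map,
	.unmap		= viommu_unmap,
	.pgsize_bitmap	= SZ_4K | SZ_2M | SZ_1G,
};

static int viommu_init(void)
{
	/* Registered like any other IOMMU driver; no core changes needed. */
	return bus_set_iommu(&platform_bus_type, &viommu_ops);
}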

> I would imagine both fully-emulated IOMMUs and the pvIOMMU will co-exist
> for some time due to the above reasons. Someday, when the pvIOMMU is
> mature/widespread enough in the ecosystem (and feature-wise comparable to
> fully-emulated IOMMUs for all vendors), we may make a call.

Agreed. The main inconvenience of any paravirtualized device is that it
needs additional support in the guest. It is not our intention to disrupt
all the work done on IOMMU virtualization for x86 and other architectures.
Even for ARM, people might want to provide SMMU emulations to unmodified
guests, implemented in userspace. What we intend to avoid, as detailed in
my other reply, is in-kernel emulation of all possible ARM-based IOMMU
variations for Linux. So we propose a generic alternative from the start,
that others can reuse later.

Thanks,
Jean-Philippe

* Re: [RFC 0/3] virtio-iommu: a paravirtualized IOMMU
  2017-04-13  8:41   ` [RFC 0/3] virtio-iommu: a paravirtualized IOMMU Tian, Kevin
@ 2017-04-13 13:12     ` Jean-Philippe Brucker
  2017-04-13 13:12     ` Jean-Philippe Brucker
  1 sibling, 0 replies; 99+ messages in thread
From: Jean-Philippe Brucker @ 2017-04-13 13:12 UTC (permalink / raw)
  To: Tian, Kevin, iommu, kvm, virtualization, virtio-dev
  Cc: cdall, will.deacon, robin.murphy, lorenzo.pieralisi, joro, mst,
	jasowang, alex.williamson, marc.zyngier

On 13/04/17 09:41, Tian, Kevin wrote:
>> From: Jean-Philippe Brucker
>> Sent: Saturday, April 8, 2017 3:18 AM
>>
>> This is the initial proposal for a paravirtualized IOMMU device using
>> virtio transport. It contains a description of the device, a Linux driver,
>> and a toy implementation in kvmtool. With this prototype, you can
>> translate DMA to guest memory from emulated (virtio), or passed-through
>> (VFIO) devices.
>>
>> In its simplest form, implemented here, the device handles map/unmap
>> requests from the guest. Future extensions proposed in "RFC 3/3" should
>> allow binding page tables to devices.
>>
>> There are a number of advantages in a paravirtualized IOMMU over a full
>> emulation. It is portable and could be reused on different architectures.
>> It is easier to implement than a full emulation, with less state tracking.
>> It might be more efficient in some cases, with fewer context switches to
>> the host and the possibility of in-kernel emulation.
>>
>> When designing it and writing the kvmtool device, I considered two main
>> scenarios, illustrated below.
>>
>> Scenario 1: a hardware device passed through twice via VFIO
>>
>>    MEM____pIOMMU________PCI device________________________      HARDWARE
>>             |     (2b)                                    \
>>   ----------|-------------+-------------+------------------\-------------
>>             |             :     KVM     :                   \
>>             |             :             :                    \
>>        pIOMMU drv         :         _______virtio-iommu drv   \    KERNEL
>>             |             :        |    :          |           \
>>           VFIO            :        |    :        VFIO           \
>>             |             :        |    :          |             \
>>             |             :        |    :          |             /
>>   ----------|-------------+--------|----+----------|------------/--------
>>             |                      |    :          |           /
>>             | (1c)            (1b) |    :     (1a) |          / (2a)
>>             |                      |    :          |         /
>>             |                      |    :          |        /   USERSPACE
>>             |___virtio-iommu dev___|    :        net drv___/
>>                                         :
>>   --------------------------------------+--------------------------------
>>                  HOST                   :             GUEST
>>
> 
> Usually people draw such layers in reverse order, e.g. hw at the
> bottom, kernel in the middle, and user at the top. :-)

Alright, I'll keep that in mind.

>> (1) a. Guest userspace is running a net driver (e.g. DPDK). It allocates a
>>        buffer with mmap, obtaining virtual address VA. It then sends a
>>        VFIO_IOMMU_MAP_DMA request to map VA to an IOVA (possibly VA=IOVA).
>>     b. The mapping request is relayed to the host through virtio
>>        (VIRTIO_IOMMU_T_MAP).
>>     c. The mapping request is relayed to the physical IOMMU through VFIO.
>>
>> (2) a. The guest userspace driver can now instruct the device to directly
>>        access the buffer at IOVA
>>     b. IOVA accesses from the device are translated into physical
>>        addresses by the IOMMU.
>>
>> Scenario 2: a virtual net device behind a virtual IOMMU.
>>
>>   MEM__pIOMMU___PCI device                                     HARDWARE
>>          |         |
>>   -------|---------|------+-------------+-------------------------------
>>          |         |      :     KVM     :
>>          |         |      :             :
>>     pIOMMU drv     |      :             :
>>              \     |      :      _____________virtio-net drv      KERNEL
>>               \_net drv   :     |       :          / (1a)
>>                    |      :     |       :         /
>>                   tap     :     |    ________virtio-iommu drv
>>                    |      :     |   |   : (1b)
>>   -----------------|------+-----|---|---+-------------------------------
>>                    |            |   |   :
>>                    |_virtio-net_|   |   :
>>                          / (2)      |   :
>>                         /           |   :                      USERSPACE
>>               virtio-iommu dev______|   :
>>                                         :
>>   --------------------------------------+-------------------------------
>>                  HOST                   :             GUEST
>>
>> (1) a. Guest virtio-net driver maps the virtio ring and a buffer
>>     b. The mapping requests are relayed to the host through virtio.
>> (2) The virtio-net device now needs to access any guest memory via the
>>     IOMMU.
>>
>> Physical and virtual IOMMUs are completely dissociated. The net driver is
>> mapping its own buffers via DMA/IOMMU API, and buffers are copied between
>> virtio-net and tap.
>>
>>
>> The description itself seemed too long for a single email, so I split it
>> into three documents, and will attach Linux and kvmtool patches to this
>> email.
>>
>> 	1. Firmware note,
>> 	2. device operations (draft for the virtio specification),
>> 	3. future work/possible improvements.
>>
>> Just to be clear on the terms I'm using:
>>
>> pIOMMU	physical IOMMU, controlling DMA accesses from physical devices
>> vIOMMU	virtual IOMMU (virtio-iommu), controlling DMA accesses from
>> 	physical and virtual devices to guest memory.
> 
> Maybe it would be clearer to say it controls 'virtual' DMA accesses, since
> we're essentially doing DMA virtualization here. Otherwise I find it a bit
> confusing, since DMA accesses from a physical device should be controlled
> by the pIOMMU.
> 
>> GVA, GPA, HVA, HPA
>> 	Guest/Host Virtual/Physical Address
>> IOVA	I/O Virtual Address, the address accessed by a device doing DMA
>> 	through an IOMMU. In the context of a guest OS, IOVA is GVA.
> 
> This statement is not accurate. For kernel DMA protection, it is a
> per-device standalone address space (definitely nothing to do with GVA).
> For user DMA protection, the user space driver decides how it wants to
> construct the IOVA address space: it could be a standalone one, or reuse
> GVA. In the virtualization case it is either GPA (w/o vIOMMU) or guest
> IOVA (w/ vIOMMU, where the guest creates the IOVA space).
> 
> Anyway, the IOVA concept is clear; the text would possibly be just as
> clear with the example removed. :-)

Ok, I dropped most IOVA references from the RFC to avoid ambiguity anyway.
I'll tidy up my so-called clarifications next time :)

Thanks,
Jean-Philippe

>>
>> Note: kvmtool is GPLv2. Linux patches are GPLv2, except for UAPI
>> virtio-iommu.h header, which is BSD 3-clause. For the time being, the
>> specification draft in RFC 2/3 is also BSD 3-clause.
>>
>>
>> This proposal may be inadvertently centered around ARM architectures at
>> times. Any feedback would be appreciated, especially regarding other IOMMU
>> architectures.
>>
> 
> Thanks for doing this. I will definitely look at them in detail and give feedback.
> 
> Thanks
> Kevin
> 
> 

* RE: [RFC 1/3] virtio-iommu: firmware description of the virtual topology
  2017-04-07 19:17   ` [RFC 1/3] virtio-iommu: firmware description of the virtual topology Jean-Philippe Brucker
       [not found]     ` <20170407191747.26618-2-jean-philippe.brucker-5wv7dgnIgG8@public.gmane.org>
@ 2017-04-18  9:51     ` Tian, Kevin
  1 sibling, 0 replies; 99+ messages in thread
From: Tian, Kevin @ 2017-04-18  9:51 UTC (permalink / raw)
  To: Jean-Philippe Brucker, iommu, kvm, virtualization, virtio-dev
  Cc: cdall, lorenzo.pieralisi, mst, marc.zyngier, joro, will.deacon,
	robin.murphy

> From: Jean-Philippe Brucker
> Sent: Saturday, April 8, 2017 3:18 AM
> 
> Unlike other virtio devices, the virtio-iommu doesn't work independently,
> it is linked to other virtual or assigned devices. So before jumping into
> device operations, we need to define a way for the guest to discover the
> virtual IOMMU and the devices it translates.
> 
> The host must describe the relation between IOMMU and devices to the guest
> using either device-tree or ACPI. The virtual IOMMU identifies each

Do you plan to support both device tree and ACPI?

> virtual device with a 32-bit ID, which we will call "Device ID" in this
> document. Device IDs are not necessarily unique system-wide, but they may
> not overlap within a single virtual IOMMU. Device IDs of passed-through
> devices do not need to match the IDs seen by the physical IOMMU.
> 
> The virtual IOMMU uses virtio-mmio transport exclusively, not virtio-pci,
> because with PCI the IOMMU interface would itself be an endpoint, and
> existing firmware interfaces don't allow describing IOMMU<->master
> relations between PCI endpoints.

I'm not familiar with the virtio-mmio mechanism. Curious, how are
virtio-mmio devices enumerated today? Could we use that mechanism to
identify vIOMMUs, and then invent a purely para-virtualized method to
enumerate the devices behind each vIOMMU?

I'm asking because each vendor has its own enumeration methods.
ARM has device tree and ACPI IORT. AMD has ACPI IVRS and device
tree (same format as ARM?). Intel has ACPI DMAR and sub-tables. Your
current proposal looks to follow the ARM definitions, which I'm not sure
are extensible enough to cover features defined only in other vendors'
structures.

Since the purpose of this series is to para-virtualize, why not also
para-virtualize and simplify the enumeration method? For example,
we could define a query interface through vIOMMU registers to let the
guest query whether a device belongs to that vIOMMU. Then we could
even remove the use of any enumeration structure completely... Just a
quick example; I may not have thought through all the pros and cons. :-)

> 
> The following diagram describes a situation where two virtual IOMMUs
> translate traffic from devices in the system. vIOMMU 1 translates two PCI
> domains, in which each function has a 16-bits requester ID. In order for
> the vIOMMU to differentiate guest requests targeted at devices in each
> domain, their Device ID ranges cannot overlap. vIOMMU 2 translates two PCI
> domains and a collection of platform devices.
> 
>                        Device ID    Requester ID
>                   /       0x0           0x0      \
>                  /         |             |        PCI domain 1
>                 /      0xffff           0xffff   /
>         vIOMMU 1
>                 \     0x10000           0x0      \
>                  \         |             |        PCI domain 2
>                   \   0x1ffff           0xffff   /
> 
>                   /       0x0                    \
>                  /         |                      platform devices
>                 /      0x1fff                    /
>         vIOMMU 2
>                 \      0x2000           0x0      \
>                  \         |             |        PCI domain 3
>                   \   0x11fff           0xffff   /
> 

isn't above be (0x30000, 3ffff) for PCI domain 3 giving device ID is 16bit?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 99+ messages in thread

* RE: [RFC 2/3] virtio-iommu: device probing and operations
  2017-04-07 19:17 ` [RFC 2/3] virtio-iommu: device probing and operations Jean-Philippe Brucker
@ 2017-04-18 10:26   ` Tian, Kevin
  2017-04-18 18:45     ` Jean-Philippe Brucker
  2017-04-18 18:45     ` Jean-Philippe Brucker
  2017-04-18 10:26   ` Tian, Kevin
  1 sibling, 2 replies; 99+ messages in thread
From: Tian, Kevin @ 2017-04-18 10:26 UTC (permalink / raw)
  To: Jean-Philippe Brucker, iommu, kvm, virtualization, virtio-dev
  Cc: cdall, will.deacon, robin.murphy, lorenzo.pieralisi, joro, mst,
	jasowang, alex.williamson, marc.zyngier

> From: Jean-Philippe Brucker
> Sent: Saturday, April 8, 2017 3:18 AM
> 
[...]
>   II. Feature bits
>   ================
> 
> VIRTIO_IOMMU_F_INPUT_RANGE (0)
>  Available range of virtual addresses is described in input_range

Usually only the maximum supported address bits are important.
I'm curious: do you see a situation where the low end of the address
space is not usable (since you have both start/end defined later)?
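
(For illustration, a minimal sketch of the kind of config layout this
feature bit implies, assuming the start/end pair described above; the
struct and field names are guesses, not the RFC's exact definition:)

	struct virtio_iommu_range {
		le64	start;	/* lowest usable IOVA */
		le64	end;	/* highest usable IOVA */
	};

	struct virtio_iommu_config {
		/* only valid when VIRTIO_IOMMU_F_INPUT_RANGE is negotiated */
		struct virtio_iommu_range	input_range;
		/* ... other fields (e.g. page_sizes, ioasid_bits) elided ... */
	};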

[...]
>   1. Attach device
>   ----------------
> 
> struct virtio_iommu_req_attach {
> 	le32	address_space;
> 	le32	device;
> 	le32	flags/reserved;
> };
> 
> Attach a device to an address space. 'address_space' is an identifier
> unique to the guest. If the address space doesn't exist in the IOMMU

Based on your description, this address space ID is per operation, right?
MAP/UNMAP and page-table sharing should have different ID spaces...

> device, it is created. 'device' is an identifier unique to the IOMMU. The
> host communicates unique device ID to the guest during boot. The method
> used to communicate this ID is outside the scope of this specification,
> but the following rules must apply:
> 
> * The device ID is unique from the IOMMU point of view. Multiple devices
>   whose DMA transactions are not translated by the same IOMMU may have the
>   same device ID. Devices whose DMA transactions may be translated by the
>   same IOMMU must have different device IDs.
> 
> * Sometimes the host cannot completely isolate two devices from each
>   other. For example on a legacy PCI bus, devices can snoop DMA
>   transactions from their neighbours. In this case, the host must
>   communicate to the guest that it cannot isolate these devices from each
>   other. The method used to communicate this is outside the scope of this
>   specification. The IOMMU device must ensure that devices that cannot be

"IOMMU device" -> "IOMMU driver"

>   isolated by the host have the same address spaces.
> 

Thanks
Kevin

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC 1/3] virtio-iommu: firmware description of the virtual topology
  2017-04-18  9:51       ` Tian, Kevin
@ 2017-04-18 18:41         ` Jean-Philippe Brucker
  2017-04-21  8:43           ` Tian, Kevin
  2017-04-18 18:41         ` Jean-Philippe Brucker
  1 sibling, 1 reply; 99+ messages in thread
From: Jean-Philippe Brucker @ 2017-04-18 18:41 UTC (permalink / raw)
  To: Tian, Kevin, iommu, kvm, virtualization, virtio-dev
  Cc: cdall, will.deacon, robin.murphy, lorenzo.pieralisi, joro, mst,
	jasowang, alex.williamson, marc.zyngier

On 18/04/17 10:51, Tian, Kevin wrote:
>> From: Jean-Philippe Brucker
>> Sent: Saturday, April 8, 2017 3:18 AM
>>
>> Unlike other virtio devices, the virtio-iommu doesn't work independently,
>> it is linked to other virtual or assigned devices. So before jumping into
>> device operations, we need to define a way for the guest to discover the
>> virtual IOMMU and the devices it translates.
>>
>> The host must describe the relation between IOMMU and devices to the
>> guest
>> using either device-tree or ACPI. The virtual IOMMU identifies each
> 
> Do you plan to support both device tree and ACPI?

Yes, with ACPI the topology would be described using IORT nodes. I didn't
include an example in my driver because DT is sufficient for a prototype
and is readily available (both in Linux and kvmtool), whereas IORT would
be quite easy to reuse in Linux, but isn't present in kvmtool at the
moment. However, both interfaces have to be supported for the virtio-iommu
to be portable.

>> virtual device with a 32-bit ID, that we will call "Device ID" in this
>> document. Device IDs are not necessarily unique system-wide, but they may
>> not overlap within a single virtual IOMMU. Device IDs of passed-through
>> devices do not need to match IDs seen by the physical IOMMU.
>>
>> The virtual IOMMU uses virtio-mmio transport exclusively, not virtio-pci,
>> because with PCI the IOMMU interface would itself be an endpoint, and
>> existing firmware interfaces don't allow to describe IOMMU<->master
>> relations between PCI endpoints.
> 
> I'm not familiar with virtio-mmio mechanism. Curious how devices in
> virtio-mmio are enumerated today? Could we use that mechanism to
> identify vIOMMUs and then invent a purely para-virtualized method to
> enumerate devices behind each vIOMMU?

Using DT, virtio-mmio devices are described with a "virtio-mmio" compatible
node, and with ACPI they use _HID LNRO0005. Since the host already
describes available devices to a guest using a firmware interface, I think
we should reuse the tools provided by that interface for describing
relations between DMA masters and the IOMMU.

> Asking this is because each vendor has its own enumeration methods.
> ARM has device tree and ACPI IORT. AMD has ACPI IVRS and device
> tree (same format as ARM?). Intel has ACPI DMAR and sub-tables. Your
> current proposal looks to follow ARM definitions, which I'm not sure are
> extensible enough to cover features defined only in other vendors'
> structures.

ACPI IORT can be extended to incorporate para-virtualized IOMMUs,
regardless of the underlying architecture. It isn't defined solely for the
ARM SMMU, but serves a more general purpose of describing a map of device
identifiers communicated from one component to another. Both DMAR and
IVRS have such a description (respectively DRHD and IVHD), but they are
designed for a specific IOMMU, whereas IORT could host other kinds.

It seems that all we really need is an interface that says "there is a
virtio-iommu at address X, here are the devices it translates and their
corresponding IDs", and both DT and ACPI IORT are able to fulfill this role.

> Since the purpose of this series is to go para-virtualize, why not also
> para-virtualize and simplify the enumeration method? For example, 
> we may define a query interface through vIOMMU registers to allow the
> guest to query whether a device belongs to that vIOMMU. Then we 
> can even remove use of any enumeration structure completely... 
> Just a quick example which I may not think through all the pros and 
> cons. :-)

I don't think adding a brand new topology description mechanism is worth
the effort, we're better off reusing what already exists and is
implemented by operating systems. Adding a query interface inside the
vIOMMU may work (though might be very painful to integrate with fwspec in
Linux), but would be redundant since the host has to provide a firmware
description of the system anyway.

>> The following diagram describes a situation where two virtual IOMMUs
>> translate traffic from devices in the system. vIOMMU 1 translates two PCI
>> domains, in which each function has a 16-bit requester ID. In order for
>> the vIOMMU to differentiate guest requests targeted at devices in each
>> domain, their Device ID ranges cannot overlap. vIOMMU 2 translates two PCI
>> domains and a collection of platform devices.
>>
>>                        Device ID    Requester ID
>>                   /       0x0           0x0      \
>>                  /         |             |        PCI domain 1
>>                 /      0xffff           0xffff   /
>>         vIOMMU 1
>>                 \     0x10000           0x0      \
>>                  \         |             |        PCI domain 2
>>                   \   0x1ffff           0xffff   /
>>
>>                   /       0x0                    \
>>                  /         |                      platform devices
>>                 /      0x1fff                    /
>>         vIOMMU 2
>>                 \      0x2000           0x0      \
>>                  \         |             |        PCI domain 3
>>                   \   0x11fff           0xffff   /
>>
> 
> Shouldn't the above be (0x30000, 0x3ffff) for PCI domain 3, given that
> the Device ID is 16-bit?

Unlike Requester IDs in PCI, there is no architected rule for IDs of
platform devices, it's an integration choice. The ID of a platform device
is used exclusively for interfacing with an IOMMU (or MSI controller); it
doesn't mean anything outside this context. Here the host allocates 13
bits to platform device IDs (0x0-0x1fff), which is legal; PCI domain 3
then occupies the next 64K IDs, from 0x2000 to 0x11fff.

Thanks,
Jean-Philippe

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC 2/3] virtio-iommu: device probing and operations
  2017-04-18 10:26   ` Tian, Kevin
  2017-04-18 18:45     ` Jean-Philippe Brucker
@ 2017-04-18 18:45     ` Jean-Philippe Brucker
  2017-04-21  9:02       ` Tian, Kevin
  1 sibling, 1 reply; 99+ messages in thread
From: Jean-Philippe Brucker @ 2017-04-18 18:45 UTC (permalink / raw)
  To: Tian, Kevin, iommu, kvm, virtualization, virtio-dev
  Cc: cdall, will.deacon, robin.murphy, lorenzo.pieralisi, joro, mst,
	jasowang, alex.williamson, marc.zyngier

On 18/04/17 11:26, Tian, Kevin wrote:
>> From: Jean-Philippe Brucker
>> Sent: Saturday, April 8, 2017 3:18 AM
>>
> [...]
>>   II. Feature bits
>>   ================
>>
>> VIRTIO_IOMMU_F_INPUT_RANGE (0)
>>  Available range of virtual addresses is described in input_range
> 
> Usually only the maximum supported address bits are important.
> I'm curious: do you see a situation where the low end of the address
> space is not usable (since you have both start/end defined later)?

A start address would allow providing something resembling a GART to the
guest: an IOMMU with one address space (ioasid_bits=0) and a small IOVA
aperture. I'm not sure how useful that would be in practice.

On a related note, the virtio-iommu itself doesn't provide a
per-address-space aperture as it stands. For example, attaching a device
to an address space might restrict the available IOVA range for the whole
AS if that device cannot write to high memory (above 32-bit). If the guest
attempts to map an IOVA outside this window into the device's address
space, it should expect the MAP request to fail. And when attaching, if
the address space already has mappings outside this window, then ATTACH
should fail.

This too seems to be something that ought to be communicated by firmware,
but bits are missing (I can't find anything equivalent to DT's dma-ranges
for PCI root bridges in ACPI tables, for example). In addition VFIO
doesn't communicate any DMA mask for devices, and doesn't check them
itself. I guess that the host could find out the DMA mask of devices one
way or another, but it is tricky to enforce, so I didn't make this a hard
requirement. Although I should probably add a few words about it.
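
(To make the two failure rules above concrete, a rough host-side sketch;
the structure and function names are illustrative, not part of the
proposal:)

	struct address_space {
		u64	aperture_end;	/* narrowed by each attached device */
		u64	highest_mapped;	/* highest IOVA currently mapped */
	};

	/* MAP fails when the requested range exceeds the aperture */
	int viommu_map(struct address_space *as, u64 iova, u64 size)
	{
		if (iova + size - 1 > as->aperture_end)
			return -ERANGE;
		/* ... create the mapping, update as->highest_mapped ... */
		return 0;
	}

	/* ATTACH fails when existing mappings are beyond the device's reach */
	int viommu_attach(struct address_space *as, u64 dev_dma_mask)
	{
		if (as->highest_mapped > dev_dma_mask)
			return -ERANGE;
		if (dev_dma_mask < as->aperture_end)
			as->aperture_end = dev_dma_mask;	/* shrink the window */
		return 0;
	}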

> [...]
>>   1. Attach device
>>   ----------------
>>
>> struct virtio_iommu_req_attach {
>> 	le32	address_space;
>> 	le32	device;
>> 	le32	flags/reserved;
>> };
>>
>> Attach a device to an address space. 'address_space' is an identifier
>> unique to the guest. If the address space doesn't exist in the IOMMU
> 
> Based on your description, this address space ID is per operation, right?
> MAP/UNMAP and page-table sharing should have different ID spaces...

I think it's simpler if we keep a single IOASID space per virtio-iommu
device, because the maximum number of address spaces (described by
ioasid_bits) might be a restriction of the pIOMMU. For page-table sharing
you still need to define which devices will share a page directory using
ATTACH requests, though that interface is not set in stone.

>> device, it is created. 'device' is an identifier unique to the IOMMU. The
>> host communicates unique device ID to the guest during boot. The method
>> used to communicate this ID is outside the scope of this specification,
>> but the following rules must apply:
>>
>> * The device ID is unique from the IOMMU point of view. Multiple devices
>>   whose DMA transactions are not translated by the same IOMMU may have the
>>   same device ID. Devices whose DMA transactions may be translated by the
>>   same IOMMU must have different device IDs.
>>
>> * Sometimes the host cannot completely isolate two devices from each
>>   other. For example on a legacy PCI bus, devices can snoop DMA
>>   transactions from their neighbours. In this case, the host must
>>   communicate to the guest that it cannot isolate these devices from each
>>   other. The method used to communicate this is outside the scope of this
>>   specification. The IOMMU device must ensure that devices that cannot be
> 
> "IOMMU device" -> "IOMMU driver"

Indeed

Thanks!
Jean-Philippe

>>   isolated by the host have the same address spaces.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* RE: [RFC 3/3] virtio-iommu: future work
  2017-04-07 19:17 ` [RFC 3/3] virtio-iommu: future work Jean-Philippe Brucker
@ 2017-04-21  8:31   ` Tian, Kevin
       [not found]   ` <20170407191747.26618-4-jean-philippe.brucker-5wv7dgnIgG8@public.gmane.org>
  2017-04-26 16:24   ` Michael S. Tsirkin
  2 siblings, 0 replies; 99+ messages in thread
From: Tian, Kevin @ 2017-04-21  8:31 UTC (permalink / raw)
  To: Jean-Philippe Brucker, iommu, kvm, virtualization, virtio-dev
  Cc: cdall, lorenzo.pieralisi, mst, marc.zyngier, joro, will.deacon,
	robin.murphy

> From: Jean-Philippe Brucker
> Sent: Saturday, April 8, 2017 3:18 AM
> 
> Here I propose a few ideas for extensions and optimizations. This is all
> very exploratory, feel free to correct mistakes and suggest more things.

[...]
> 
>   II. Page table sharing
>   ======================
> 
>   1. Sharing IOMMU page tables
>   ----------------------------
> 
> VIRTIO_IOMMU_F_PT_SHARING
> 
> This is independent of the nested mode described in I.2, but relies on a
> similar feature in the physical IOMMU: having two stages of page tables,
> one for the host and one for the guest.
> 
> When this is supported, the guest can manage its own s1 page directory, to
> avoid sending MAP/UNMAP requests. Feature VIRTIO_IOMMU_F_PT_SHARING
> allows a driver to give a page directory pointer (pgd) to the host and send
> invalidations when removing or changing a mapping. In this mode, three
> requests are used: probe, attach and invalidate. An address space cannot
> be using the MAP/UNMAP interface and PT_SHARING at the same time.
> 
> Device and driver first need to negotiate which page table format they
> will be using. This depends on the physical IOMMU, so the request contains
> a negotiation part to probe the device capabilities.
> 
> (1) Driver attaches devices to address spaces as usual, but a flag
>     VIRTIO_IOMMU_ATTACH_F_PRIVATE (working title) tells the device not to
>     create page tables for use with the MAP/UNMAP API. The driver intends
>     to manage the address space itself.
> 
> (2) Driver sends a PROBE_TABLE request. It sets len > 0 with the size of
>     pg_format array.
> 
> 	VIRTIO_IOMMU_T_PROBE_TABLE
> 
> 	struct virtio_iommu_req_probe_table {
> 		le32	address_space;
> 		le32	flags;
> 		le32	len;
> 
> 		le32	nr_contexts;
> 		struct {
> 			le32	model;
> 			u8	format[64];
> 		} pg_format[len];
> 	};
> 
> Introducing a probe request is more flexible than advertising those
> features in virtio config, because capabilities are dynamic, and depend on
> which devices are attached to an address space. Within a single address
> space, devices may support different numbers of contexts (PASIDs), and
> some may not support recoverable faults.
> 
> (3) Device responds success with all page table formats implemented by the
>     physical IOMMU in pg_format. 'model' 0 is invalid, so driver can
>     initialize the array to 0 and deduce from there which entries have
>     been filled by the device.
> 
> Using a probe method seems preferable over trying to attach every possible
> format until one sticks. For instance, with an ARM guest running on an x86
> host, PROBE_TABLE would return the Intel IOMMU page table format, and the
> guest could use that page table code to handle its mappings, hidden behind
> the IOMMU API. This requires that the page-table code is reasonably
> abstracted from the architecture, as is done with drivers/iommu/io-pgtable
> (an x86 guest could use any format implemented by io-pgtable for example.)

So essentially you need to modify all existing IOMMU drivers to support
page table sharing in pvIOMMU. After the abstraction is done, the core
pvIOMMU files can be kept vendor-agnostic. But if we talk about the whole
pvIOMMU module, it actually includes vendor-specific logic, unlike typical
para-virtualized virtio drivers which are completely vendor-agnostic. Is
this understanding accurate?

It also means the host-side pIOMMU driver needs to propagate all
supported formats through VFIO to the Qemu vIOMMU, meaning such format
definitions need to be consistently agreed upon across all those
components.
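
(As a concrete reading of steps (2) and (3) quoted above, a driver-side
sketch of format negotiation; driver_supports() is an assumed helper, not
part of the proposal:)

	/* assumed helper: does this driver implement the given model? */
	bool driver_supports(u32 model);

	int viommu_choose_format(struct virtio_iommu_req_probe_table *req)
	{
		u32 i;

		/* The driver zeroed pg_format[] before sending (2); entries
		 * the device filled in now have model != 0 (3). */
		for (i = 0; i < req->len; i++) {
			if (req->pg_format[i].model == 0)
				continue;	/* entry left empty by device */
			if (driver_supports(req->pg_format[i].model))
				return i;	/* negotiate this format */
		}
		return -ENOENT;		/* no common page table format */
	}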

[...]

> 
>   2. Sharing MMU page tables
>   --------------------------
> 
> The guest can share process page-tables with the physical IOMMU. To do
> that, it sends PROBE_TABLE with (F_INDIRECT | F_NATIVE | F_FAULT). The
> page table format is implicit, so the pg_format array can be empty (unless
> the guest wants to query some specific property, e.g. number of levels
> supported by the pIOMMU?). If the host answers with success, guest can
> send its MMU page table details with ATTACH_TABLE and (F_NATIVE |
> F_INDIRECT | F_FAULT) flags.
> 
> F_FAULT means that the host communicates page requests from device to the
> guest, and the guest can handle them by mapping virtual address in the
> fault to pages. It is only available with VIRTIO_IOMMU_F_FAULT_QUEUE (see
> below.)
> 
> F_NATIVE means that the pIOMMU pgtable format is the same as guest MMU
> pgtable format.
> 
> F_INDIRECT means that 'table' pointer is a context table, instead of a
> page directory. Each slot in the context table points to a page directory:
> 
>                        64              2 1 0
>           table ----> +---------------------+
>                       |       pgd       |0|1|<--- context 0
>                       |       ---       |0|0|<--- context 1
>                       |       pgd       |0|1|
>                       |       ---       |0|0|
>                       |       ---       |0|0|
>                       +---------------------+
>                                          | \___Entry is valid
>                                          |______reserved
> 
> Question: do we want per-context page table format, or can it stay global
> for the whole indirect table?

Are you defining this context table format in software, or following a
hardware definition? At least for VT-d there is a strict hardware-defined
structure (the PASID table) which must be used here.
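
(For concreteness, one possible software encoding of the entries drawn
above, with bit 0 as the valid flag and bit 1 reserved; this is purely
illustrative, deliberately not a hardware-defined layout such as VT-d's
PASID table:)

	#define CTX_ENTRY_VALID		(1ULL << 0)
	#define CTX_ENTRY_RESERVED	(1ULL << 1)

	/* Build a context table entry pointing to a page directory. The pgd
	 * must be aligned so that bits [1:0] are free to hold the flags. */
	static inline u64 ctx_table_entry(u64 pgd)
	{
		return (pgd & ~(CTX_ENTRY_VALID | CTX_ENTRY_RESERVED)) |
		       CTX_ENTRY_VALID;
	}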

[...]
> 
>   4. Host implementation with VFIO
>   --------------------------------
> 
> The VFIO interface for sharing page tables is being worked on at the
> moment by Intel. Other virtual IOMMU implementations will most likely let
> guest manage full context tables (PASID tables) themselves, giving the
> context table pointer to the pIOMMU via a VFIO ioctl.
> 
> For the architecture-agnostic virtio-iommu however, we shouldn't have to
> implement all possible formats of context table (they are at least
> different between ARM SMMU and Intel IOMMU, and will certainly be extended

Since you'll ultimately require vendor-specific page table logic anyway,
why not abstract this context table too, which would then not require the
host-side changes below?

> in future physical IOMMU architectures.) In addition, most users might
> only care about having one page directory per device, as SVM is a luxury
> at the moment and few devices support it. For these reasons, we should
> allow to pass single page directories via VFIO, using very similar
> structures as described above, whilst reusing the VFIO channel developed
> for Intel vIOMMU.
> 
> 	* VFIO_SVM_INFO: probe page table formats
> 	* VFIO_SVM_BIND: set pgd and arch-specific configuration
> 
> There is an inconvenience with letting the pIOMMU driver manage the guest's
> context table. During a page table walk, the pIOMMU translates the context
> table pointer using the stage-2 page tables. The context table must
> therefore be mapped in guest-physical space by the pIOMMU driver. One
> solution is to let the pIOMMU driver reserve some GPA space upfront using
> the iommu and sysfs resv API [1]. The host would then carve that region
> out of the guest-physical space using a firmware mechanism (for example DT
> reserved-memory node).

Can you elaborate on this flow? The pIOMMU driver doesn't directly manage
the GPA address space, so it's not reasonable for it to arbitrarily specify
a reserved range. It might make more sense for the GPA owner (e.g. Qemu)
to decide and then pass the information to the pIOMMU driver.
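
(A hypothetical user-space sketch of the two proposed ioctls; neither
VFIO_SVM_INFO nor VFIO_SVM_BIND exists in VFIO today, and the structure
layout, device_fd and pgd_gpa below are invented purely for illustration:)

	struct vfio_svm_bind {
		__u32	argsz;
		__u32	flags;	/* e.g. the F_NATIVE/F_INDIRECT/F_FAULT bits */
		__u64	table;	/* guest pgd, or context table if F_INDIRECT */
	};

	struct vfio_svm_bind bind = {
		.argsz	= sizeof(bind),
		.flags	= 0,		/* single page directory, no faults */
		.table	= pgd_gpa,	/* guest-physical address of the pgd */
	};

	if (ioctl(device_fd, VFIO_SVM_BIND, &bind))
		perror("VFIO_SVM_BIND");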

> 
> 
>   III. Relaxed operations
>   =======================
> 
> VIRTIO_IOMMU_F_RELAXED
> 
> Adding an IOMMU dramatically reduces performance of a device, because
> map/unmap operations are costly and produce a lot of TLB traffic. For
> significant performance improvements, the device might allow the driver to
> sacrifice safety for speed. In this mode, the driver does not need to send
> UNMAP requests. The semantics of MAP change and are more complex to
> implement. Given a MAP([start:end] -> phys, flags) request:
> 
> (1) If [start:end] isn't mapped, request succeeds as usual.
> (2) If [start:end] overlaps an existing mapping [old_start:old_end], we
>     unmap [max(start, old_start):min(end, old_end)] and replace it with
>     [start:end].
> (3) If [start:end] overlaps an existing mapping that matches the new map
>     request exactly (same flags, same phys address), the old mapping is
>     kept.
> 
> This squashing could be performed by the guest. The driver can catch unmap
> requests from the DMA layer, and only relay map requests for (1) and (2).
> A MAP request is therefore able to split and partially override an
> existing mapping, which isn't allowed in non-relaxed mode. UNMAP requests
> are unnecessary, but are now allowed to split or carve holes in mappings.
> 
> In this model, a MAP request may take longer, but we may have a net gain
> by removing a lot of redundant requests. Squashing series of map/unmap
> performed by the guest for the same mapping improves temporal reuse of
> IOVA mappings, which I can observe by simply dumping IOMMU activity of a
> virtio device. It reduces the number of TLB invalidations to the strict
> minimum while keeping correctness of DMA operations (provided the device
> obeys its driver). There is a good read on the subject of optimistic
> teardown in paper [2].
> 
> This model is completely unsafe. A stale DMA transaction might access a
> page long after the device driver in the guest unmapped it and
> decommissioned the page. The DMA transaction might hit into a completely
> different part of the system that is now reusing the page. Existing
> relaxed implementations attempt to mitigate the risk by setting a timeout
> on the teardown. Unmap requests from device drivers are not discarded
> entirely, but buffered and sent at a later time. Paper [2] reports good
> results with a 10ms delay.
> 
> We could add a way for device and driver to negotiate a vulnerability
> window to mitigate the risk of DMA attacks. Driver might not accept a
> window at all, since it requires more infrastructure to keep delayed
> mappings. In my opinion, it should be made clear that regardless of the
> duration of this window, any driver accepting F_RELAXED feature makes the
> guest completely vulnerable, and the choice boils down to either isolation
> or speed, not a bit of both.

Even with the above optimization I'd imagine the performance drop is still
significant for kernel map/unmap usages, not to mention when such
optimization is not possible because safety is required (actually I don't
know why an IOMMU is still required if safety can be compromised. Aren't
we using the IOMMU for security purposes?). I think we'd better focus on
higher-value usages, e.g. user-space DMA protection (DPDK) and
SVM, while leaving kernel protection at a lower priority (mostly for
functionality verification). Is this strategy aligned with your thoughts?
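
(Restating rules (1)-(3) quoted above in code, a guest-side sketch that
decides whether a MAP request must be relayed to the host; names are
illustrative:)

	struct mapping {
		u64	start, end;	/* IOVA range, inclusive */
		u64	phys;
		u32	flags;
	};

	/* Returns false when rule (3) applies: the overlapping mapping
	 * already matches exactly, so the old one is simply kept. */
	bool need_map_request(const struct mapping *old,
			      const struct mapping *req)
	{
		if (!old || req->end < old->start || req->start > old->end)
			return true;	/* (1) no overlap: plain MAP */
		if (req->start == old->start && req->end == old->end &&
		    req->phys == old->phys && req->flags == old->flags)
			return false;	/* (3) exact duplicate: keep the old */
		return true;		/* (2) overlap: MAP replaces the
					   overlapped part of the old range */
	}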

btw what about interrupt remapping/posting? Are they also in your
plan for pvIOMMU?

Lastly, thanks for the very informative write-up! It looks like a long
enabling path is required to get the pvIOMMU feature on par with a real
IOMMU. Starting with a minimal set is relatively easier. :-)

Thanks
Kevin

^ permalink raw reply	[flat|nested] 99+ messages in thread

* RE: [RFC 1/3] virtio-iommu: firmware description of the virtual topology
  2017-04-18 18:41         ` Jean-Philippe Brucker
@ 2017-04-21  8:43           ` Tian, Kevin
       [not found]             ` <AADFC41AFE54684AB9EE6CBC0274A5D190CB2570-0J0gbvR4kThpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  2017-04-24 15:05             ` Jean-Philippe Brucker
  0 siblings, 2 replies; 99+ messages in thread
From: Tian, Kevin @ 2017-04-21  8:43 UTC (permalink / raw)
  To: Jean-Philippe Brucker, iommu, kvm, virtualization, virtio-dev
  Cc: cdall, lorenzo.pieralisi, mst, marc.zyngier, joro, will.deacon,
	robin.murphy

> From: Jean-Philippe Brucker [mailto:jean-philippe.brucker@arm.com]
> Sent: Wednesday, April 19, 2017 2:41 AM
> 
> On 18/04/17 10:51, Tian, Kevin wrote:
> >> From: Jean-Philippe Brucker
> >> Sent: Saturday, April 8, 2017 3:18 AM
> >>
> >> Unlike other virtio devices, the virtio-iommu doesn't work independently,
> >> it is linked to other virtual or assigned devices. So before jumping into
> >> device operations, we need to define a way for the guest to discover the
> >> virtual IOMMU and the devices it translates.
> >>
> >> The host must describe the relation between IOMMU and devices to the
> >> guest
> >> using either device-tree or ACPI. The virtual IOMMU identifies each
> >
> > Do you plan to support both device tree and ACPI?
> 
> Yes, with ACPI the topology would be described using IORT nodes. I didn't
> include an example in my driver because DT is sufficient for a prototype
> and is readily available (both in Linux and kvmtool), whereas IORT would
> be quite easy to reuse in Linux, but isn't present in kvmtool at the
> moment. However, both interfaces have to be supported for the
> virtio-iommu to be portable.

Does 'portable' mean whether the guest enables ACPI?

> 
> >> virtual device with a 32-bit ID, that we will call "Device ID" in this
> >> document. Device IDs are not necessarily unique system-wide, but they may
> >> not overlap within a single virtual IOMMU. Device IDs of passed-through
> >> devices do not need to match IDs seen by the physical IOMMU.
> >>
> >> The virtual IOMMU uses virtio-mmio transport exclusively, not virtio-pci,
> >> because with PCI the IOMMU interface would itself be an endpoint, and
> >> existing firmware interfaces don't allow to describe IOMMU<->master
> >> relations between PCI endpoints.
> >
> > I'm not familiar with virtio-mmio mechanism. Curious how devices in
> > virtio-mmio are enumerated today? Could we use that mechanism to
> > identify vIOMMUs and then invent a purely para-virtualized method to
> > enumerate devices behind each vIOMMU?
> 
> Using DT, virtio-mmio devices are described with a "virtio-mmio" compatible
> node, and with ACPI they use _HID LNRO0005. Since the host already
> describes available devices to a guest using a firmware interface, I think
> we should reuse the tools provided by that interface for describing
> relations between DMA masters and the IOMMU.

OK, I didn't realize virtio-mmio is defined to rely on DT for enumeration.

> 
> > Asking this is because each vendor has its own enumeration methods.
> > ARM has device tree and ACPI IORT. AMD has ACPI IVRS and device
> > tree (same format as ARM?). Intel has ACPI DMAR and sub-tables. Your
> > current proposal looks to follow ARM definitions, which I'm not sure are
> > extensible enough to cover features defined only in other vendors'
> > structures.
> 
> ACPI IORT can be extended to incorporate para-virtualized IOMMUs,
> regardless of the underlying architecture. It isn't defined solely for the
> ARM SMMU, but serves a more general purpose of describing a map of device
> identifiers communicated from one component to another. Both DMAR and
> IVRS have such a description (respectively DRHD and IVHD), but they are
> designed for a specific IOMMU, whereas IORT could host other kinds.

I'll take a look at the IORT definition. DRHD includes more information
than just device mapping.

> 
> It seems that all we really need is an interface that says "there is a
> virtio-iommu at address X, here are the devices it translates and their
> corresponding IDs", and both DT and ACPI IORT are able to fulfill this role.
> 
> > Since the purpose of this series is to go para-virtualize, why not also
> > para-virtualize and simplify the enumeration method? For example,
> > we may define a query interface through vIOMMU registers to allow the
> > guest to query whether a device belongs to that vIOMMU. Then we
> > can even remove use of any enumeration structure completely...
> > Just a quick example which I may not think through all the pros and
> > cons. :-)
> 
> I don't think adding a brand new topology description mechanism is worth
> the effort, we're better off reusing what already exists and is
> implemented by operating systems. Adding a query interface inside the
> vIOMMU may work (though might be very painful to integrate with fwspec in
> Linux), but would be redundant since the host has to provide a firmware
> description of the system anyway.
> 
> >> The following diagram describes a situation where two virtual IOMMUs
> >> translate traffic from devices in the system. vIOMMU 1 translates two PCI
> >> domains, in which each function has a 16-bit requester ID. In order for
> >> the vIOMMU to differentiate guest requests targeted at devices in each
> >> domain, their Device ID ranges cannot overlap. vIOMMU 2 translates two PCI
> >> domains and a collection of platform devices.
> >>
> >>                        Device ID    Requester ID
> >>                   /       0x0           0x0      \
> >>                  /         |             |        PCI domain 1
> >>                 /      0xffff           0xffff   /
> >>         vIOMMU 1
> >>                 \     0x10000           0x0      \
> >>                  \         |             |        PCI domain 2
> >>                   \   0x1ffff           0xffff   /
> >>
> >>                   /       0x0                    \
> >>                  /         |                      platform devices
> >>                 /      0x1fff                    /
> >>         vIOMMU 2
> >>                 \      0x2000           0x0      \
> >>                  \         |             |        PCI domain 3
> >>                   \   0x11fff           0xffff   /
> >>
> >
> > shouldn't the above be (0x30000, 0x3ffff) for PCI domain 3, given that the Device ID is 16-bit?
> 
> Unlike Requester IDs in PCI, there is no architected rule for IDs of
> platform devices; it's an integration choice. The ID of a platform device
> is used exclusively for interfacing with an IOMMU (or MSI controller); it
> doesn't mean anything outside this context. Here the host allocates 13
> bits to platform device IDs, which is legal.
> 

Please add such an explanation to your next version. In the earlier text
a "16-bit requester ID" is mentioned for vIOMMU 1, which gave me
the impression that the same 16-bit limit applies to vIOMMU 2 too.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 99+ messages in thread

* RE: [RFC 2/3] virtio-iommu: device probing and operations
  2017-04-18 18:45     ` Jean-Philippe Brucker
@ 2017-04-21  9:02       ` Tian, Kevin
  2017-04-24 15:05         ` Jean-Philippe Brucker
       [not found]         ` <AADFC41AFE54684AB9EE6CBC0274A5D190CB262D-0J0gbvR4kThpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 2 replies; 99+ messages in thread
From: Tian, Kevin @ 2017-04-21  9:02 UTC (permalink / raw)
  To: Jean-Philippe Brucker, iommu, kvm, virtualization, virtio-dev
  Cc: cdall, lorenzo.pieralisi, mst, marc.zyngier, joro, will.deacon,
	robin.murphy

> From: Jean-Philippe Brucker [mailto:jean-philippe.brucker@arm.com]
> Sent: Wednesday, April 19, 2017 2:46 AM
> 
> On 18/04/17 11:26, Tian, Kevin wrote:
> >> From: Jean-Philippe Brucker
> >> Sent: Saturday, April 8, 2017 3:18 AM
> >>
> > [...]
> >>   II. Feature bits
> >>   ================
> >>
> >> VIRTIO_IOMMU_F_INPUT_RANGE (0)
> >>  Available range of virtual addresses is described in input_range
> >
> > Usually only the maximum supported address bits are important.
> > Curious whether you see a situation where the low end of the address
> > space is not usable (since you have both start/end defined later)?
> 
> A start address would make it possible to provide something resembling a
> GART to the guest: an IOMMU with one address space (ioasid_bits=0) and a
> small IOVA aperture. I'm not sure how useful that would be in practice.

Intel VT-d has no such limitation, as far as I can tell. :-)

> 
> On a related note, the virtio-iommu itself doesn't provide a
> per-address-space aperture as it stands. For example, attaching a device
> to an address space might restrict the available IOVA range for the whole
> AS if that device cannot write to high memory (above 32-bit). If the guest
> attempts to map an IOVA outside this window into the device's address
> space, it should expect the MAP request to fail. And when attaching, if
> the address space already has mappings outside this window, then ATTACH
> should fail.
> 
> This too seems to be something that ought to be communicated by firmware,
> but bits are missing (I can't find anything equivalent to DT's dma-ranges
> for PCI root bridges in ACPI tables, for example). In addition VFIO
> doesn't communicate any DMA mask for devices, and doesn't check them
> itself. I guess that the host could find out the DMA mask of devices one
> way or another, but it is tricky to enforce, so I didn't make this a hard
> requirement. Although I should probably add a few words about it.

If there is no such communication on bare metal, then the same applies to the pvIOMMU.

> 
> > [...]
> >>   1. Attach device
> >>   ----------------
> >>
> >> struct virtio_iommu_req_attach {
> >> 	le32	address_space;
> >> 	le32	device;
> >> 	le32	flags/reserved;
> >> };
> >>
> >> Attach a device to an address space. 'address_space' is an identifier
> >> unique to the guest. If the address space doesn't exist in the IOMMU
> >
> > Based on your description this address space ID is per operation right?
> > MAP/UNMAP and page-table sharing should have different ID spaces...
> 
> I think it's simpler if we keep a single IOASID space per virtio-iommu
> device, because the maximum number of address spaces (described by
> ioasid_bits) might be a restriction of the pIOMMU. For page-table sharing
> you still need to define which devices will share a page directory using
> ATTACH requests, though that interface is not set in stone.

Got you. Yes, a VM is supposed to consume fewer IOASIDs than are
physically available. It doesn't hurt to have one IOASID space for both
IOVA map/unmap usages (one IOASID per device) and SVM usages (multiple
IOASIDs per device). The former is digested by software and the latter
will be bound to hardware.

Thanks
Kevin


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC 1/3] virtio-iommu: firmware description of the virtual topology
  2017-04-21  8:43           ` Tian, Kevin
       [not found]             ` <AADFC41AFE54684AB9EE6CBC0274A5D190CB2570-0J0gbvR4kThpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2017-04-24 15:05             ` Jean-Philippe Brucker
  1 sibling, 0 replies; 99+ messages in thread
From: Jean-Philippe Brucker @ 2017-04-24 15:05 UTC (permalink / raw)
  To: Tian, Kevin, iommu, kvm, virtualization, virtio-dev
  Cc: cdall, lorenzo.pieralisi, mst, marc.zyngier, joro, will.deacon,
	robin.murphy

On 21/04/17 09:43, Tian, Kevin wrote:
>> From: Jean-Philippe Brucker [mailto:jean-philippe.brucker@arm.com]
>> Sent: Wednesday, April 19, 2017 2:41 AM
>>
>> On 18/04/17 10:51, Tian, Kevin wrote:
>>>> From: Jean-Philippe Brucker
>>>> Sent: Saturday, April 8, 2017 3:18 AM
>>>>
>>>> Unlike other virtio devices, the virtio-iommu doesn't work independently,
>>>> it is linked to other virtual or assigned devices. So before jumping into
>>>> device operations, we need to define a way for the guest to discover the
>>>> virtual IOMMU and the devices it translates.
>>>>
>>>> The host must describe the relation between IOMMU and devices to the
>>>> guest
>>>> using either device-tree or ACPI. The virtual IOMMU identifies each
>>>
>>> Do you plan to support both device tree and ACPI?
>>
>> Yes, with ACPI the topology would be described using IORT nodes. I didn't
>> include an example in my driver because DT is sufficient for a prototype
>> and is readily available (both in Linux and kvmtool), whereas IORT would
>> be quite easy to reuse in Linux, but isn't present in kvmtool at the
>> moment. However, both interfaces have to be supported for the
>> virtio-iommu to be portable.
> 
> 'portable' means whether guest enables ACPI?

Sorry, "supported" isn't the right term for what I meant. It is for the
firmware interface to accommodate devices, not the other way around, so
firmware considerations are outside the scope of the virtio-iommu
specification, and virtio-iommu itself doesn't need to "support" any interface.

For the purpose of this particular document however, both popular firmware
interfaces (ACPI and DT) must be taken into account. Those are the two
interfaces I know about; there might be others. But I figure that a VMM
implementing a virtual IOMMU is complex enough to be able to also
implement one of these two interfaces, so talking about DT and ACPI should
fit all use cases. It also provides two examples for other firmware
interfaces that wish to describe the IOMMU topology.

>>>> virtual device with a 32-bit ID, that we will call "Device ID" in this
>>>> document. Device IDs are not necessarily unique system-wide, but they may
>>>> not overlap within a single virtual IOMMU. Device IDs of passed-through
>>>> devices do not need to match IDs seen by the physical IOMMU.
>>>>
>>>> The virtual IOMMU uses virtio-mmio transport exclusively, not virtio-pci,
>>>> because with PCI the IOMMU interface would itself be an endpoint, and
>>>> existing firmware interfaces don't allow describing IOMMU<->master
>>>> relations between PCI endpoints.
>>>
>>> I'm not familiar with the virtio-mmio mechanism. Curious how devices in
>>> virtio-mmio are enumerated today? Could we use that mechanism to
>>> identify vIOMMUs and then invent a purely para-virtualized method to
>>> enumerate devices behind each vIOMMU?
>>
>> Using DT, virtio-mmio devices are described with "virtio-mmio" compatible
>> node, and with ACPI they use _HID LNRO0005. Since the host already
>> describes available devices to a guest using a firmware interface, I think
>> we should reuse the tools provided by that interface for describing
>> relations between DMA masters and IOMMU.
> 
> OK, I didn't realize virtio-mmio is defined to rely on DT for enumeration.

Not necessarily DT, you can have virtio-mmio devices in the ACPI namespace
as well. Qemu has an example of LNRO0005 with ACPI.

>>> Asking this is because each vendor has its own enumeration methods.
>>> ARM has device tree and ACPI IORT. AMD has ACPI IVRS and device
>>> tree (same format as ARM?). Intel has ACPI DMAR and sub-tables. Your
>>> current proposal looks to follow ARM definitions, which I'm not sure
>>> are extensible enough to cover features defined only in other vendors'
>>> structures.
>>
>> ACPI IORT can be extended to incorporate para-virtualized IOMMUs,
>> regardless of the underlying architecture. It isn't defined solely for the
>> ARM SMMU, but serves a more general purpose of describing a map of device
>> identifiers communicated from one component to another. Both DMAR and
>> IVRS have such a description (respectively DRHD and IVHD), but they are
>> designed for a specific IOMMU, whereas IORT could host other kinds.
> 
I'll take a look at the IORT definition. DRHD includes more information
than device mapping.

I guess that most information provided by DMAR and others is
IOMMU-specific, and the equivalent for virtio-iommu would fit in virtio
config space. But describing device mapping relative to IOMMUs is the same
problem for all systems. Doing it with a virtio-iommu probing mechanism
would require reinventing a way to identify devices every time a host
wants to add support for a new bus (RID for PCI, base address for MMIO,
others in the future), when firmware would have to provide this
information anyway for bare metal.
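
To illustrate (a purely hypothetical sketch, not something I'm proposing):
a probe-based interface would end up defining its own per-bus identifier
encoding, duplicating what DT and IORT already express. Something like:

struct viommu_probe_dev_id {
	le32	bus_type;		/* 0: PCI, 1: platform/MMIO, ... */
	union {
		le32	requester_id;	/* PCI requester ID */
		le64	base_address;	/* platform device MMIO base */
	};
	/* ...plus a new variant for every future bus type */
};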

>> It seems that all we really need is an interface that says "there is a
>> virtio-iommu at address X, here are the devices it translates and their
>> corresponding IDs", and both DT and ACPI IORT are able to fulfill this role.
>>
>>> Since the purpose of this series is to go para-virtualize, why not also
>>> para-virtualize and simplify the enumeration method? For example,
>>> we may define a query interface through vIOMMU registers to allow the
>>> guest to query whether a device belongs to that vIOMMU. Then we
>>> can even remove the use of any enumeration structure completely...
>>> Just a quick example which I may not think through all the pros and
>>> cons. :-)
>>
>> I don't think adding a brand new topology description mechanism is worth
>> the effort; we're better off reusing what already exists and is
>> implemented by operating systems. Adding a query interface inside the
>> vIOMMU may work (though might be very painful to integrate with fwspec in
>> Linux), but would be redundant since the host has to provide a firmware
>> description of the system anyway.
>>
>>>> The following diagram describes a situation where two virtual IOMMUs
>>>> translate traffic from devices in the system. vIOMMU 1 translates two PCI
>>>> domains, in which each function has a 16-bit requester ID. In order for
>>>> the vIOMMU to differentiate guest requests targeted at devices in each
>>>> domain, their Device ID ranges cannot overlap. vIOMMU 2 translates two PCI
>>>> domains and a collection of platform devices.
>>>>
>>>>                        Device ID    Requester ID
>>>>                   /       0x0           0x0      \
>>>>                  /         |             |        PCI domain 1
>>>>                 /      0xffff           0xffff   /
>>>>         vIOMMU 1
>>>>                 \     0x10000           0x0      \
>>>>                  \         |             |        PCI domain 2
>>>>                   \   0x1ffff           0xffff   /
>>>>
>>>>                   /       0x0                    \
>>>>                  /         |                      platform devices
>>>>                 /      0x1fff                    /
>>>>         vIOMMU 2
>>>>                 \      0x2000           0x0      \
>>>>                  \         |             |        PCI domain 3
>>>>                   \   0x11fff           0xffff   /
>>>>
>>>
>>> shouldn't the above be (0x30000, 0x3ffff) for PCI domain 3, given that the Device ID is 16-bit?
>>
>> Unlike Requester IDs in PCI, there is no architected rule for IDs of
>> platform devices; it's an integration choice. The ID of a platform device
>> is used exclusively for interfacing with an IOMMU (or MSI controller); it
>> doesn't mean anything outside this context. Here the host allocates 13
>> bits to platform device IDs, which is legal.
>>
> 
> Please add such an explanation to your next version. In the earlier text
> a "16-bit requester ID" is mentioned for vIOMMU 1, which gave me
> the impression that the same 16-bit limit applies to vIOMMU 2 too.

Sure, I will clarify this.

Thanks,
Jean-Philippe

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC 2/3] virtio-iommu: device probing and operations
  2017-04-21  9:02       ` Tian, Kevin
@ 2017-04-24 15:05         ` Jean-Philippe Brucker
       [not found]         ` <AADFC41AFE54684AB9EE6CBC0274A5D190CB262D-0J0gbvR4kThpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  1 sibling, 0 replies; 99+ messages in thread
From: Jean-Philippe Brucker @ 2017-04-24 15:05 UTC (permalink / raw)
  To: Tian, Kevin, iommu, kvm, virtualization, virtio-dev
  Cc: cdall, lorenzo.pieralisi, mst, marc.zyngier, joro, will.deacon,
	robin.murphy

On 21/04/17 10:02, Tian, Kevin wrote:
>> From: Jean-Philippe Brucker [mailto:jean-philippe.brucker@arm.com]
>> Sent: Wednesday, April 19, 2017 2:46 AM
>>
>> On 18/04/17 11:26, Tian, Kevin wrote:
>>>> From: Jean-Philippe Brucker
>>>> Sent: Saturday, April 8, 2017 3:18 AM
>>>>
>>> [...]
>>>>   II. Feature bits
>>>>   ================
>>>>
>>>> VIRTIO_IOMMU_F_INPUT_RANGE (0)
>>>>  Available range of virtual addresses is described in input_range
>>>
>>> Usually only the maximum supported address bits are important.
>>> Curious whether you see a situation where the low end of the address
>>> space is not usable (since you have both start/end defined later)?
>>
>> A start address would make it possible to provide something resembling a
>> GART to the guest: an IOMMU with one address space (ioasid_bits=0) and a
>> small IOVA aperture. I'm not sure how useful that would be in practice.
> 
> Intel VT-d has no such limitation, as far as I can tell. :-)
> 
>>
>> On a related note, the virtio-iommu itself doesn't provide a
>> per-address-space aperture as it stands. For example, attaching a device
>> to an address space might restrict the available IOVA range for the whole
>> AS if that device cannot write to high memory (above 32-bit). If the guest
>> attempts to map an IOVA outside this window into the device's address
>> space, it should expect the MAP request to fail. And when attaching, if
>> the address space already has mappings outside this window, then ATTACH
>> should fail.
>>
>> This too seems to be something that ought to be communicated by firmware,
>> but bits are missing (I can't find anything equivalent to DT's dma-ranges
>> for PCI root bridges in ACPI tables, for example). In addition VFIO
>> doesn't communicate any DMA mask for devices, and doesn't check them
>> itself. I guess that the host could find out the DMA mask of devices one
>> way or another, but it is tricky to enforce, so I didn't make this a hard
>> requirement. Although I should probably add a few words about it.
> 
> If there is no such communication on bare metal, then the same applies to the pvIOMMU.
> 
>>
>>> [...]
>>>>   1. Attach device
>>>>   ----------------
>>>>
>>>> struct virtio_iommu_req_attach {
>>>> 	le32	address_space;
>>>> 	le32	device;
>>>> 	le32	flags/reserved;
>>>> };
>>>>
>>>> Attach a device to an address space. 'address_space' is an identifier
>>>> unique to the guest. If the address space doesn't exist in the IOMMU
>>>
>>> Based on your description this address space ID is per operation right?
>>> MAP/UNMAP and page-table sharing should have different ID spaces...
>>
>> I think it's simpler if we keep a single IOASID space per virtio-iommu
>> device, because the maximum number of address spaces (described by
>> ioasid_bits) might be a restriction of the pIOMMU. For page-table sharing
>> you still need to define which devices will share a page directory using
>> ATTACH requests, though that interface is not set in stone.
> 
> Got you. Yes, a VM is supposed to consume fewer IOASIDs than are
> physically available. It doesn't hurt to have one IOASID space for both
> IOVA map/unmap usages (one IOASID per device) and SVM usages (multiple
> IOASIDs per device). The former is digested by software and the latter
> will be bound to hardware.
> 

Hmm, I'm using address spaces indexed by IOASID for the "classic" IOMMU
case, and contexts indexed by PASID when talking about SVM. So in my mind
an address space can have multiple sub-address-spaces (contexts). The
number of IOASIDs is a limitation of the pIOMMU, and the number of PASIDs
is a limitation of the device. Therefore attaching devices to address
spaces would update the number of available contexts in that address
space. The terminology is not ideal, and I'd be happy to change it for
something clearer.
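
For illustration, this is roughly the model I have in mind (sketch only,
these structures are not part of the proposal):

/* One context per PASID; the number available depends on the devices
 * attached to the address space. */
struct viommu_context {
	u32	pasid;
	void	*pgd;			/* process page directory */
};

/* One address space per IOASID; the total is limited by the pIOMMU's
 * ioasid_bits. */
struct viommu_address_space {
	u32			ioasid;
	u32			nr_contexts;	/* updated on attach */
	struct viommu_context	*contexts;
};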

Thanks,
Jean-Philippe

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC 3/3] virtio-iommu: future work
  2017-04-21  8:31     ` Tian, Kevin
  2017-04-24 15:05       ` Jean-Philippe Brucker
@ 2017-04-24 15:05       ` Jean-Philippe Brucker
  1 sibling, 0 replies; 99+ messages in thread
From: Jean-Philippe Brucker @ 2017-04-24 15:05 UTC (permalink / raw)
  To: Tian, Kevin, iommu, kvm, virtualization, virtio-dev
  Cc: cdall, will.deacon, robin.murphy, lorenzo.pieralisi, joro, mst,
	jasowang, alex.williamson, marc.zyngier

On 21/04/17 09:31, Tian, Kevin wrote:
>> From: Jean-Philippe Brucker
>> Sent: Saturday, April 8, 2017 3:18 AM
>>
>> Here I propose a few ideas for extensions and optimizations. This is all
>> very exploratory, feel free to correct mistakes and suggest more things.
> 
> [...]
>>
>>   II. Page table sharing
>>   ======================
>>
>>   1. Sharing IOMMU page tables
>>   ----------------------------
>>
>> VIRTIO_IOMMU_F_PT_SHARING
>>
>> This is independent of the nested mode described in I.2, but relies on a
>> similar feature in the physical IOMMU: having two stages of page tables,
>> one for the host and one for the guest.
>>
>> When this is supported, the guest can manage its own s1 page directory, to
>> avoid sending MAP/UNMAP requests. Feature
>> VIRTIO_IOMMU_F_PT_SHARING allows
>> a driver to give a page directory pointer (pgd) to the host and send
>> invalidations when removing or changing a mapping. In this mode, three
>> requests are used: probe, attach and invalidate. An address space cannot
>> be using the MAP/UNMAP interface and PT_SHARING at the same time.
>>
>> Device and driver first need to negotiate which page table format they
>> will be using. This depends on the physical IOMMU, so the request contains
>> a negotiation part to probe the device capabilities.
>>
>> (1) Driver attaches devices to address spaces as usual, but a flag
>>     VIRTIO_IOMMU_ATTACH_F_PRIVATE (working title) tells the device not to
>>     create page tables for use with the MAP/UNMAP API. The driver intends
>>     to manage the address space itself.
>>
>> (2) Driver sends a PROBE_TABLE request. It sets len > 0 with the size of
>>     pg_format array.
>>
>> 	VIRTIO_IOMMU_T_PROBE_TABLE
>>
>> 	struct virtio_iommu_req_probe_table {
>> 		le32	address_space;
>> 		le32	flags;
>> 		le32	len;
>>
>> 		le32	nr_contexts;
>> 		struct {
>> 			le32	model;
>> 			u8	format[64];
>> 		} pg_format[len];
>> 	};
>>
>> Introducing a probe request is more flexible than advertising those
>> features in virtio config, because capabilities are dynamic, and depend on
>> which devices are attached to an address space. Within a single address
>> space, devices may support different numbers of contexts (PASIDs), and
>> some may not support recoverable faults.
>>
>> (3) Device responds success with all page table formats implemented by the
>>     physical IOMMU in pg_format. 'model' 0 is invalid, so driver can
>>     initialize the array to 0 and deduce from there which entries have
>>     been filled by the device.
>>
>> Using a probe method seems preferable over trying to attach every possible
>> format until one sticks. For instance, with an ARM guest running on an x86
>> host, PROBE_TABLE would return the Intel IOMMU page table format, and the
>> guest could use that page table code to handle its mappings, hidden behind
>> the IOMMU API. This requires that the page-table code is reasonably
>> abstracted from the architecture, as is done with drivers/iommu/io-pgtable
>> (an x86 guest could use any format implemented by io-pgtable, for example.)
> 
> So essentially you need to modify all existing IOMMU drivers to support
> page table sharing in pvIOMMU. After the abstraction is done, the core
> pvIOMMU files can be kept vendor agnostic. But if we talk about the whole
> pvIOMMU module, it actually includes vendor-specific logic, so unlike
> typical para-virtualized virtio drivers it isn't completely vendor
> agnostic. Is this understanding accurate?

Yes, although kernel modules would be separate. For Linux on ARM we
already have the page-table logic abstracted in iommu/io-pgtable module,
because multiple IOMMUs share the same PT formats (SMMUv2, SMMUv3, Renesas
IPMMU, Qcom MSM, Mediatek). It offers a simple interface:

* When attaching devices to an IOMMU domain, the IOMMU driver registers
its page table format and provides invalidation callbacks.

* On iommu_map/unmap, the IOMMU driver calls into io_pgtable_ops, which
provide map, unmap and iova_to_phys functions.

* Page table operations call back into the driver via iommu_gather_ops
when they need to invalidate TLB entries.

Currently only the few flavors of ARM PT formats are implemented, but
other page table formats could be added if they fit this model.
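
For reference, a simplified sketch of these two interfaces (loosely based
on drivers/iommu/io-pgtable.h; the exact kernel signatures differ slightly):

/* TLB maintenance callbacks provided by the IOMMU driver */
struct iommu_gather_ops {
	void	(*tlb_flush_all)(void *cookie);
	void	(*tlb_add_flush)(unsigned long iova, size_t size,
				 size_t granule, bool leaf, void *cookie);
	void	(*tlb_sync)(void *cookie);
};

/* Operations handed back when a page table format is registered */
struct io_pgtable_ops {
	int		(*map)(struct io_pgtable_ops *ops, unsigned long iova,
			       phys_addr_t paddr, size_t size, int prot);
	size_t		(*unmap)(struct io_pgtable_ops *ops,
				 unsigned long iova, size_t size);
	phys_addr_t	(*iova_to_phys)(struct io_pgtable_ops *ops,
					unsigned long iova);
};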

> It also means the host-side pIOMMU driver needs to propagate all
> supported formats through VFIO to the Qemu vIOMMU, meaning such format
> definitions need to be agreed on consistently across all those
> components.

Yes, that's the icky part. We need to define a format that every OS and
hypervisor implementing virtio-iommu can understand (similarly to the
PASID table sharing interface that Yi L is working on for VFIO, although
that one is contained in Linux UAPI and doesn't require other OSes to know
about it).

>>   2. Sharing MMU page tables
>>   --------------------------
>>
>> The guest can share process page-tables with the physical IOMMU. To do
>> that, it sends PROBE_TABLE with (F_INDIRECT | F_NATIVE | F_FAULT). The
>> page table format is implicit, so the pg_format array can be empty (unless
>> the guest wants to query some specific property, e.g. number of levels
>> supported by the pIOMMU?). If the host answers with success, guest can
>> send its MMU page table details with ATTACH_TABLE and (F_NATIVE |
>> F_INDIRECT | F_FAULT) flags.
>>
>> F_FAULT means that the host communicates page requests from device to the
>> guest, and the guest can handle them by mapping the virtual address in the
>> fault to pages. It is only available with VIRTIO_IOMMU_F_FAULT_QUEUE (see
>> below.)
>>
>> F_NATIVE means that the pIOMMU pgtable format is the same as the guest
>> MMU pgtable format.
>>
>> F_INDIRECT means that 'table' pointer is a context table, instead of a
>> page directory. Each slot in the context table points to a page directory:
>>
>>                        64              2 1 0
>>           table ----> +---------------------+
>>                       |       pgd       |0|1|<--- context 0
>>                       |       ---       |0|0|<--- context 1
>>                       |       pgd       |0|1|
>>                       |       ---       |0|0|
>>                       |       ---       |0|0|
>>                       +---------------------+
>>                                          | \___Entry is valid
>>                                          |______reserved
>>
>> Question: do we want per-context page table format, or can it stay global
>> for the whole indirect table?
> 
> Are you defining this context table format in software, or following a
> hardware definition? At least for VT-d there is a strict hardware-defined
> structure (PASID table) which must be used here.

This definition is only for virtio-iommu, I didn't follow any hardware
definitions. For SMMUv3 the context tables are completely different. There
may be two levels of tables, and each context gets a 512-bit descriptor
(it has per-context page table format and other info).

To be honest I'm not sure where I was going with this indirect table. I
can't see any advantage in using an indirect table over sending a bunch of
individual ATTACH_TABLE requests, each with a pgd and a pasid. However the
indirect flag could be needed for sharing physical context tables (below).
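
For comparison, an individual request would simply carry the pair
(illustrative layout only, none of this is specified):

struct virtio_iommu_req_attach_table {
	le32	address_space;
	le32	flags;		/* F_NATIVE | F_FAULT, no F_INDIRECT */
	le32	pasid;		/* context within the address space */
	le64	table;		/* page directory (pgd) */
};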

>>   4. Host implementation with VFIO
>>   --------------------------------
>>
>> The VFIO interface for sharing page tables is being worked on at the
>> moment by Intel. Other virtual IOMMU implementation will most likely let
>> guest manage full context tables (PASID tables) themselves, giving the
>> context table pointer to the pIOMMU via a VFIO ioctl.
>>
>> For the architecture-agnostic virtio-iommu however, we shouldn't have to
>> implement all possible formats of context table (they are at least
>> different between ARM SMMU and Intel IOMMU, and will certainly be
>> extended
> 
> Since you'll ultimately require vendor-specific page table logic anyway,
> why not also abstract this context table, which would then avoid the
> host-side changes below?

I keep going back and forth on that question :) Some pIOMMUs won't have
context tables, so we need an ATTACH_TABLE interface for sharing a single
pgd anyway. Now for SVM, we could either create an additional interface for
vendor-specific context tables, or send individual ATTACH_TABLE requests.

The disadvantage of sharing context tables is that it requires more
specification work to enumerate all existing context table formats,
similarly to the work needed for defining all page table formats. As I
said earlier this work needs to be done anyway for VFIO, but this time it
would be an interface that needs to suit all OSes and hypervisors, not only
Linux. I think it's a lot more complicated to agree on that, since it's no
longer a matter of sending Linux patches to extend the interface; the scope
is much wider.

So we need to carefully consider whether this additional specification
effort is really needed. We certainly want to share page tables with the
guest to improve performance over the map/unmap interface, but I don't
see a similar performance concern on context tables. Supposedly binding a
device context to a task is a relatively rare event, much less frequent
than updating PT mappings.

In addition page table formats might be more common than context table
formats and therefore easier to abstract. With context tables you will
need one format per IOMMU variant, whereas (on ARM) multiple IOMMUs could
share the same page table format. I'm not sure whether the same argument
applies to x86 (similarity of page tables between Intel and AMD IOMMU
versus differences in PASID/GCR3 table formats).

On the other hand, the clear advantage of sharing context tables with the
guest is that we don't have to do the complicated memory reserve dance
described below.

>> in future physical IOMMU architectures.) In addition, most users might
>> only care about having one page directory per device, as SVM is a luxury
>> at the moment and few devices support it. For these reasons, we should
>> allow to pass single page directories via VFIO, using very similar
>> structures as described above, whilst reusing the VFIO channel developed
>> for Intel vIOMMU.
>>
>> 	* VFIO_SVM_INFO: probe page table formats
>> 	* VFIO_SVM_ATTACH_TABLE: set pgd and arch-specific configuration
>>
>> There is an inconvenience with letting the pIOMMU driver manage the guest's
>> context table. During a page table walk, the pIOMMU translates the context
>> table pointer using the stage-2 page tables. The context table must
>> therefore be mapped in guest-physical space by the pIOMMU driver. One
>> solution is to let the pIOMMU driver reserve some GPA space upfront using
>> the iommu and sysfs resv API [1]. The host would then carve that region
>> out of the guest-physical space using a firmware mechanism (for example DT
>> reserved-memory node).
> 
> Can you elaborate on this flow? The pIOMMU driver doesn't directly manage
> the GPA address space, so it's not reasonable for it to randomly specify a
> reserved range. It might make more sense for the GPA owner (e.g. Qemu) to
> decide and then pass the information to the pIOMMU driver.

I realized that it's actually more complicated than this, because I didn't
consider hotplugging devices into the VM. If you insert new devices at
runtime, you might need more GPA space for storing their context tables,
but only if they don't attach to an existing address space (otherwise on
ARM we could reuse the existing context table).

So GPA space cannot be reserved statically, but must be reclaimed at
runtime. In addition, context tables can become quite big, and with a static
reserve we'd have to reserve tonnes of GPA space upfront even if the guest
isn't planning on using context tables at all. And even without
considering SVM, some IOMMUs (namely SMMUv3) would still need a
single-entry table in GPA space for nested translation.

I don't have any pleasant solution so far. One way of doing it is to carry
memory reclaim in ATTACH_TABLE requests:

(1) Driver sends ATTACH_TABLE(pasid, pgd)
(2) Device relays BIND(pasid, pgd) to pIOMMU via VFIO
(3) pIOMMU needs, say, 512KiB of contiguous GPA for mapping a context
table. Returns this info via VFIO.
(4) Device replies to ATTACH_TABLE with "try again" and, somewhere in the
request buffer, stores the amount of contiguous GPA that the operation
will cost.
(5) Driver re-sends the ATTACH_TABLE request, but this time with a GPA
address that the host can use.

Note that each reclaim for a table should be accompanied by an identifier
for that table, so that if a second ATTACH_TABLE request reaches the
device between (4) and (5) and requires GPA space for the same table, the
device returns the same GPA reclaim with the same identifier and the
driver won't have to allocate GPA twice.
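
As a sketch, the "try again" reply in (4) could carry something like this
(hypothetical layout, all field names made up for illustration):

struct virtio_iommu_attach_table_reclaim {
	le32	status;		/* "try again": host needs GPA space */
	le32	table_ident;	/* identifies the context table, so
				 * concurrent requests for the same
				 * table reuse one allocation */
	le64	gpa_size;	/* contiguous GPA space required */
	le64	gpa_addr;	/* filled by the driver on re-send */
};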

If the pIOMMU needs N > 1 contiguous GPA chunks (for instance, two levels
of context tables) we could do N reclaims (requiring N + 1 ATTACH_TABLE
requests) or put an array in the ATTACH_TABLE request. I prefer the
former; there is little advantage to the latter.

Alternatively, this could be a job for something similar to
virtio-balloon, with contiguous chunks instead of pages. The ATTACH_TABLE
would block the primary request queue while the GPA reclaim is serviced by
the guest on an auxiliary queue (which may not be acceptable if the driver
expects MAP/UNMAP/INVALIDATE requests on the same queue to be fast).

In any case, I would greatly appreciate any proposal for a nicer
mechanism, because this feels very fragile.

>>   III. Relaxed operations
>>   =======================
>>
>> VIRTIO_IOMMU_F_RELAXED
>>
>> Adding an IOMMU dramatically reduces performance of a device, because
>> map/unmap operations are costly and produce a lot of TLB traffic. For
>> significant performance improvements, device might allow the driver to
>> sacrifice safety for speed. In this mode, the driver does not need to send
>> UNMAP requests. The semantics of MAP change and are more complex to
>> implement. Given a MAP([start:end] -> phys, flags) request:
>>
>> (1) If [start:end] isn't mapped, request succeeds as usual.
>> (2) If [start:end] overlaps an existing mapping [old_start:old_end], we
>>     unmap [max(start, old_start):min(end, old_end)] and replace it with
>>     [start:end].
>> (3) If [start:end] overlaps an existing mapping that matches the new map
>>     request exactly (same flags, same phys address), the old mapping is
>>     kept.
>>
>> This squashing could be performed by the guest. The driver can catch unmap
>> requests from the DMA layer, and only relay map requests for (1) and (2).
>> A MAP request is therefore able to split and partially override an
>> existing mapping, which isn't allowed in non-relaxed mode. UNMAP requests
>> are unnecessary, but are now allowed to split or carve holes in mappings.
>>
>> In this model, a MAP request may take longer, but we may have a net gain
>> by removing a lot of redundant requests. Squashing series of map/unmap
>> performed by the guest for the same mapping improves temporal reuse of
>> IOVA mappings, which I can observe by simply dumping IOMMU activity of a
>> virtio device. It reduces the number of TLB invalidations to the strict
>> minimum while keeping correctness of DMA operations (provided the device
>> obeys its driver). There is a good read on the subject of optimistic
>> teardown in paper [2].
>>
>> This model is completely unsafe. A stale DMA transaction might access a
>> page long after the device driver in the guest unmapped it and
>> decommissioned the page. The DMA transaction might hit into a completely
>> different part of the system that is now reusing the page. Existing
>> relaxed implementations attempt to mitigate the risk by setting a timeout
>> on the teardown. Unmap requests from device drivers are not discarded
>> entirely, but buffered and sent at a later time. Paper [2] reports good
>> results with a 10ms delay.
>>
>> We could add a way for device and driver to negotiate a vulnerability
>> window to mitigate the risk of DMA attacks. Driver might not accept a
>> window at all, since it requires more infrastructure to keep delayed
>> mappings. In my opinion, it should be made clear that regardless of the
>> duration of this window, any driver accepting F_RELAXED feature makes the
>> guest completely vulnerable, and the choice boils down to either isolation
>> or speed, not a bit of both.
> 
> Even with the above optimization I'd imagine the performance drop is still
> significant for kernel map/unmap usages, not to mention when such
> optimization is not possible because safety is required (actually I don't
> know why an IOMMU is still required if safety can be compromised. Aren't
> we using the IOMMU for security purposes?).

I guess apart from security concerns, a significant use case would be
scatter-gather, avoiding large contiguous (and pinned down) allocations in
guests. It's quite useful when you start doing DMA over MB or GB of
memory. It also allows pass-through to guest userspace, but for that there
are other ways (UIO or vfio-noiommu).

> I think we'd better focus on
> higher-value usages, e.g. user space DMA protection (DPDK) and 
> SVM, while leaving kernel protection at a lower priority (mostly for
> functionality verification). Is this strategy aligned with your thoughts?
> 
> btw what about interrupt remapping/posting? Are they also in your
> plan for pvIOMMU?

I hadn't thought about this so far, because we don't have a special region
reserved for MSIs in the ARM IOMMUs; all MSI doorbells are accessed with
IOVAs and translated similarly to other regions. In addition with KVM ARM,
MSI injection bypasses the IOMMU altogether, the host doesn't actually
write the MSI. I could take a look at what other hypervisors and
architectures do.

> Last, thanks for the very informative write-up! Looks like a long enabling
> path is required to get the pvIOMMU feature on par with a real IOMMU.
> Starting with a minimal set is relatively easier. :-)

Yes, I described possible improvements in 3/3 in order to see how they
would fit within the baseline device of 2/3. But apart from the vhost
prototype, these are a long way off, and I'd like to make sure that the
base is solid before tackling the rest.

Thanks,
Jean-Philippe

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC 3/3] virtio-iommu: future work
  2017-04-21  8:31     ` Tian, Kevin
@ 2017-04-24 15:05       ` Jean-Philippe Brucker
  2017-04-24 15:05       ` Jean-Philippe Brucker
  1 sibling, 0 replies; 99+ messages in thread
From: Jean-Philippe Brucker @ 2017-04-24 15:05 UTC (permalink / raw)
  To: Tian, Kevin, iommu, kvm, virtualization, virtio-dev
  Cc: cdall, lorenzo.pieralisi, mst, marc.zyngier, joro, will.deacon,
	robin.murphy

On 21/04/17 09:31, Tian, Kevin wrote:
>> From: Jean-Philippe Brucker
>> Sent: Saturday, April 8, 2017 3:18 AM
>>
>> Here I propose a few ideas for extensions and optimizations. This is all
>> very exploratory, feel free to correct mistakes and suggest more things.
> 
> [...]
>>
>>   II. Page table sharing
>>   ======================
>>
>>   1. Sharing IOMMU page tables
>>   ----------------------------
>>
>> VIRTIO_IOMMU_F_PT_SHARING
>>
>> This is independent of the nested mode described in I.2, but relies on a
>> similar feature in the physical IOMMU: having two stages of page tables,
>> one for the host and one for the guest.
>>
>> When this is supported, the guest can manage its own s1 page directory, to
>> avoid sending MAP/UNMAP requests. Feature
>> VIRTIO_IOMMU_F_PT_SHARING allows
>> a driver to give a page directory pointer (pgd) to the host and send
>> invalidations when removing or changing a mapping. In this mode, three
>> requests are used: probe, attach and invalidate. An address space cannot
>> be using the MAP/UNMAP interface and PT_SHARING at the same time.
>>
>> Device and driver first need to negotiate which page table format they
>> will be using. This depends on the physical IOMMU, so the request contains
>> a negotiation part to probe the device capabilities.
>>
>> (1) Driver attaches devices to address spaces as usual, but a flag
>>     VIRTIO_IOMMU_ATTACH_F_PRIVATE (working title) tells the device not to
>>     create page tables for use with the MAP/UNMAP API. The driver intends
>>     to manage the address space itself.
>>
>> (2) Driver sends a PROBE_TABLE request. It sets len > 0 with the size of
>>     pg_format array.
>>
>> 	VIRTIO_IOMMU_T_PROBE_TABLE
>>
>> 	struct virtio_iommu_req_probe_table {
>> 		le32	address_space;
>> 		le32	flags;
>> 		le32	len;
>>
>> 		le32	nr_contexts;
>> 		struct {
>> 			le32	model;
>> 			u8	format[64];
>> 		} pg_format[len];
>> 	};
>>
>> Introducing a probe request is more flexible than advertising those
>> features in virtio config, because capabilities are dynamic, and depend on
>> which devices are attached to an address space. Within a single address
>> space, devices may support different numbers of contexts (PASIDs), and
>> some may not support recoverable faults.
>>
>> (3) Device responds success with all page table formats implemented by the
>>     physical IOMMU in pg_format. 'model' 0 is invalid, so driver can
>>     initialize the array to 0 and deduce from there which entries have
>>     been filled by the device.
>>
>> Using a probe method seems preferable over trying to attach every possible
>> format until one sticks. For instance, with an ARM guest running on an x86
>> host, PROBE_TABLE would return the Intel IOMMU page table format, and
>> the
>> guest could use that page table code to handle its mappings, hidden behind
>> the IOMMU API. This requires that the page-table code is reasonably
>> abstracted from the architecture, as is done with drivers/iommu/io-pgtable
>> (an x86 guest could use any format implement by io-pgtable for example.)
> 
> So essentially you need modify all existing IOMMU drivers to support page 
> table sharing in pvIOMMU. After abstraction is done the core pvIOMMU files 
> can be kept vendor agnostic. But if we talk about the whole pvIOMMU 
> module, it actually includes vendor specific logic thus unlike typical 
> para-virtualized virtio drivers being completely vendor agnostic. Is this 
> understanding accurate?

Yes, although kernel modules would be separate. For Linux on ARM we
already have the page-table logic abstracted in iommu/io-pgtable module,
because multiple IOMMUs share the same PT formats (SMMUv2, SMMUv3, Renesas
IPMMU, Qcom MSM, Mediatek). It offers a simple interface:

* When attaching devices to an IOMMU domain, the IOMMU driver registers
its page table format and provides invalidation callbacks.

* On iommu_map/unmap, the IOMMU driver calls into io_pgtable_ops, which
provide map, unmap and iova_to_phys functions.

* Page table operations call back into the driver via iommu_gather_ops
when they need to invalidate TLB entries.

Currently only the few flavors of ARM PT formats are implemented, but
other page table formats could be added if they fit this model.

> It also means in the host-side pIOMMU driver needs to propagate all
> supported formats through VFIO to Qemu vIOMMU, meaning
> such format definitions need be consistently agreed across all those 
> components.

Yes, that's the icky part. We need to define a format that every OS and
hypervisor implementing virtio-iommu can understand (similarly to the
PASID table sharing interface that Yi L is working on for VFIO, although
that one is contained in Linux UAPI and doesn't require other OSes to know
about it).

>>   2. Sharing MMU page tables
>>   --------------------------
>>
>> The guest can share process page-tables with the physical IOMMU. To do
>> that, it sends PROBE_TABLE with (F_INDIRECT | F_NATIVE | F_FAULT). The
>> page table format is implicit, so the pg_format array can be empty (unless
>> the guest wants to query some specific property, e.g. number of levels
>> supported by the pIOMMU?). If the host answers with success, guest can
>> send its MMU page table details with ATTACH_TABLE and (F_NATIVE |
>> F_INDIRECT | F_FAULT) flags.
>>
>> F_FAULT means that the host communicates page requests from device to
>> the
>> guest, and the guest can handle them by mapping virtual address in the
>> fault to pages. It is only available with VIRTIO_IOMMU_F_FAULT_QUEUE (see
>> below.)
>>
>> F_NATIVE means that the pIOMMU pgtable format is the same as guest
>> MMU
>> pgtable format.
>>
>> F_INDIRECT means that 'table' pointer is a context table, instead of a
>> page directory. Each slot in the context table points to a page directory:
>>
>>                        64              2 1 0
>>           table ----> +---------------------+
>>                       |       pgd       |0|1|<--- context 0
>>                       |       ---       |0|0|<--- context 1
>>                       |       pgd       |0|1|
>>                       |       ---       |0|0|
>>                       |       ---       |0|0|
>>                       +---------------------+
>>                                          | \___Entry is valid
>>                                          |______reserved
>>
>> Question: do we want a per-context page table format, or can it stay global
>> for the whole indirect table?
> 
> Are you defining this context table format in software, or following
> hardware definition? At least for VT-d there is a strict hardware-defined
> structure (PASID table) which must be used here.

This definition is only for virtio-iommu, I didn't follow any hardware
definitions. For SMMUv3 the context tables are completely different. There
may be two levels of tables, and each context gets a 512-bit descriptor
(it has a per-context page table format and other info).

To be honest I'm not sure where I was going with this indirect table. I
can't see any advantage in using an indirect table over sending a bunch of
individual ATTACH_TABLE requests, each with a pgd and a pasid. However the
indirect flag could be needed for sharing physical context tables (below).

>>   4. Host implementation with VFIO
>>   --------------------------------
>>
>> The VFIO interface for sharing page tables is being worked on at the
>> moment by Intel. Other virtual IOMMU implementations will most likely let
>> the guest manage full context tables (PASID tables) itself, giving the
>> context table pointer to the pIOMMU via a VFIO ioctl.
>>
>> For the architecture-agnostic virtio-iommu however, we shouldn't have to
>> implement all possible formats of context table (they are at least
>> different between ARM SMMU and Intel IOMMU, and will certainly be extended
> 
> Since you'll ultimately require vendor-specific page table logic anyway,
> why not also abstract the context table, which would then not require
> the host-side changes below?

I keep going back and forth on that question :) Some pIOMMUs won't have
context tables, so we need an ATTACH_TABLE interface for sharing a single
pgd anyway. Now for SVM, we could either create an additional interface
for vendor-specific context tables, or send individual ATTACH_TABLE
requests.

The disadvantage of sharing context tables is that it requires more
specification work to enumerate all existing context table formats,
similarly to the work needed for defining all page table formats. As I
said earlier this work needs to be done anyway for VFIO, but this time it
would be an interface that needs to suit all OSes and hypervisors, not
only Linux. I think it's a lot more complicated to agree on that, since
extending the interface is no longer just a matter of sending Linux
patches; the scope is much wider.

So we need to carefully consider whether this additional specification
effort is really needed. We certainly want to share page tables with the
guest to improve performance over the map/unmap interface, but I don't
see a similar performance concern on context tables. Supposedly binding a
device context to a task is a relatively rare event, much less frequent
than updating PT mappings.

In addition page table formats might be more common than context table
formats and therefore easier to abstract. With context tables you will
need one format per IOMMU variant, whereas (on ARM) multiple IOMMUs could
share the same page table format. I'm not sure whether the same argument
applies to x86 (similarity of page tables between Intel and AMD IOMMU
versus differences in PASID/GCR3 table formats).

On the other hand, the clear advantage of sharing context tables with the
guest is that we don't have to do the complicated memory reserve dance
described below.

>> in future physical IOMMU architectures.) In addition, most users might
>> only care about having one page directory per device, as SVM is a luxury
>> at the moment and few devices support it. For these reasons, we should
>> allow passing single page directories via VFIO, using structures very
>> similar to those described above, whilst reusing the VFIO channel developed
>> for Intel vIOMMU.
>>
>> 	* VFIO_SVM_INFO: probe page table formats
>> 	* VFIO_SVM_ATTACH_TABLE: set pgd and arch-specific configuration
>>
>> There is an inconvenience in letting the pIOMMU driver manage the guest's
>> context table. During a page table walk, the pIOMMU translates the context
>> table pointer using the stage-2 page tables. The context table must
>> therefore be mapped in guest-physical space by the pIOMMU driver. One
>> solution is to let the pIOMMU driver reserve some GPA space upfront using
>> the iommu and sysfs resv API [1]. The host would then carve that region
>> out of the guest-physical space using a firmware mechanism (for example DT
>> reserved-memory node).
> 
> Can you elaborate this flow? pIOMMU driver doesn't directly manage GPA
> address space thus it's not reasonable for it to randomly specify a reserved
> range. It might make more sense for GPA owner (e.g. Qemu) to decide and
> then pass information to pIOMMU driver.

I realized that it's actually more complicated than this, because I didn't
consider hotplugging devices into a VM. If you insert new devices at
runtime, you might need more GPA space for storing their context tables,
but only if they don't attach to an existing address space (otherwise on
ARM we could reuse the existing context table).

So GPA space cannot be reserved statically, but must be reclaimed at
runtime. In addition, context tables can become quite big, and with static
reserve we'd have to reserve tonnes of GPA space upfront even if the guest
isn't planning on using context tables at all. And even without
considering SVM, some IOMMUs (namely SMMUv3) would still need a
single-entry table in GPA space for nested translation.

I don't have any pleasant solution so far. One way of doing it is to carry
memory reclaim in ATTACH_TABLE requests:

(1) Driver sends ATTACH_TABLE(pasid, pgd)
(2) Device relays BIND(pasid, pgd) to pIOMMU via VFIO
(3) pIOMMU needs, say, 512KiB of contiguous GPA for mapping a context
table. Returns this info via VFIO.
(4) Device replies to ATTACH_TABLE with "try again" and, somewhere in the
request buffer, stores the amount of contiguous GPA that the operation
will cost.
(5) Driver re-sends the ATTACH_TABLE request, but this time with a GPA
address that the host can use.

Note that each reclaim for a table should be accompanied by an identifier
for that table, so that if a second ATTACH_TABLE request reaches the
device between (4) and (5) and requires GPA space for the same table, the
device returns the same GPA reclaim with the same identifier and the
driver won't have to allocate GPA twice.

If the pIOMMU needs N > 1 contiguous GPA chunks (for instance, two levels
of context tables) we could do N reclaims (requiring N + 1 ATTACH_TABLE
requests) or put an array in the ATTACH_TABLE request. I prefer the
former; there is little advantage to the latter.
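
To make the flow concrete, here is a driver-side sketch of the retry in
(1)-(5). Everything is hypothetical, in particular the resv_* fields and
the helpers:

	int viommu_attach_table(struct viommu_dev *dev,
				struct virtio_iommu_req_attach_table *req)
	{
		int ret = viommu_send_req_sync(dev, req);

		while (ret == -EAGAIN) {
			/*
			 * The device stored the size of the contiguous
			 * GPA region it needs, and an identifier for the
			 * table, in the request buffer. Allocate the
			 * region, or reuse a previous allocation made
			 * for the same identifier.
			 */
			req->resv_gpa = viommu_gpa_alloc(dev, req->resv_id,
							 req->resv_size);
			ret = viommu_send_req_sync(dev, req);
		}
		return ret;
	}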

Alternatively, this could be a job for something similar to
virtio-balloon, with contiguous chunks instead of pages. The ATTACH_TABLE
would block the primary request queue while the GPA reclaim is serviced by
the guest on an auxiliary queue (which may not be acceptable if the driver
expects MAP/UNMAP/INVALIDATE requests on the same queue to be fast).

In any case, I would greatly appreciate any proposal for a nicer
mechanism, because this feels very fragile.

>>   III. Relaxed operations
>>   =======================
>>
>> VIRTIO_IOMMU_F_RELAXED
>>
>> Adding an IOMMU dramatically reduces performance of a device, because
>> map/unmap operations are costly and produce a lot of TLB traffic. For
>> significant performance improvements, the device might allow the driver to
>> sacrifice safety for speed. In this mode, the driver does not need to send
>> UNMAP requests. The semantics of MAP change and are more complex to
>> implement. Given a MAP([start:end] -> phys, flags) request:
>>
>> (1) If [start:end] isn't mapped, request succeeds as usual.
>> (2) If [start:end] overlaps an existing mapping [old_start:old_end], we
>>     unmap [max(start, old_start):min(end, old_end)] and replace it with
>>     [start:end].
>> (3) If [start:end] overlaps an existing mapping that matches the new map
>>     request exactly (same flags, same phys address), the old mapping is
>>     kept.
>>
>> This squashing could be performed by the guest. The driver can catch unmap
>> requests from the DMA layer, and only relay map requests for (1) and (2).
>> A MAP request is therefore able to split and partially override an
>> existing mapping, which isn't allowed in non-relaxed mode. UNMAP requests
>> are unnecessary, but are now allowed to split or carve holes in mappings.
>>
>> In this model, a MAP request may take longer, but we may have a net gain
>> by removing a lot of redundant requests. Squashing series of map/unmap
>> performed by the guest for the same mapping improves temporal reuse of
>> IOVA mappings, which I can observe by simply dumping IOMMU activity of a
>> virtio device. It reduces the number of TLB invalidations to the strict
>> minimum while keeping correctness of DMA operations (provided the device
>> obeys its driver). There is a good read on the subject of optimistic
>> teardown in paper [2].
>>
>> This model is completely unsafe. A stale DMA transaction might access a
>> page long after the device driver in the guest unmapped it and
>> decommissioned the page. The DMA transaction might hit into a completely
>> different part of the system that is now reusing the page. Existing
>> relaxed implementations attempt to mitigate the risk by setting a timeout
>> on the teardown. Unmap requests from device drivers are not discarded
>> entirely, but buffered and sent at a later time. Paper [2] reports good
>> results with a 10ms delay.
>>
>> We could add a way for device and driver to negotiate a vulnerability
>> window to mitigate the risk of DMA attacks. Driver might not accept a
>> window at all, since it requires more infrastructure to keep delayed
>> mappings. In my opinion, it should be made clear that regardless of the
>> duration of this window, any driver accepting F_RELAXED feature makes the
>> guest completely vulnerable, and the choice boils down to either isolation
>> or speed, not a bit of both.
> 
> Even with the above optimization I'd imagine the performance drop is
> still significant for kernel map/unmap usages, not to mention when such
> optimization is not possible because safety is required (actually I
> don't know why an IOMMU is still required if safety can be compromised.
> Aren't we using the IOMMU for security purposes?).

I guess apart from security concerns, a significant use case would be
scatter-gather, avoiding large contiguous (and pinned down) allocations in
guests. It's quite useful when you start doing DMA over MB or GB of
memory. It also allows pass-through to guest userspace, but for that there
are other ways (UIO or vfio-noiommu).

> I think we'd better focus on
> higher-value usages, e.g. user space DMA protection (DPDK) and 
> SVM, while leaving kernel protection at a lower priority (mostly for
> functionality verification). Is this strategy aligned with your thinking?
> 
> btw what about interrupt remapping/posting? Are they also in your
> plan for pvIOMMU?

I haven't thought about this so far, because we don't have a special region
reserved for MSIs in the ARM IOMMUs; all MSI doorbells are accessed with
IOVAs and translated similarly to other regions. In addition with KVM ARM,
MSI injection bypasses the IOMMU altogether, the host doesn't actually
write the MSI. I could take a look at what other hypervisors and
architectures do.

> Last, thanks for the very informative write-up! Looks like a long
> enabling path is required to get the pvIOMMU feature on par with a real
> IOMMU. Starting with a minimal set is relatively easier. :-)

Yes, I described possible improvements in 3/3 in order to see how they
would fit within the baseline device of 2/3. But apart from the vhost
prototype, these are a long way off, and I'd like to make sure that the
base is solid before tackling the rest.

Thanks,
Jean-Philippe

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC 3/3] virtio-iommu: future work
       [not found]   ` <20170407191747.26618-4-jean-philippe.brucker-5wv7dgnIgG8@public.gmane.org>
  2017-04-21  8:31     ` Tian, Kevin
@ 2017-04-26 16:24     ` Michael S. Tsirkin
  1 sibling, 0 replies; 99+ messages in thread
From: Michael S. Tsirkin @ 2017-04-26 16:24 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: virtio-dev-sDuHXQ4OtrM4h7I2RyI4rWD2FQJk+8+b,
	kvm-u79uwXL29TY76Z2rM5mHXA, cdall-QSEj5FYQhm4dnm+yROfE0A,
	marc.zyngier-5wv7dgnIgG8, will.deacon-5wv7dgnIgG8,
	virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	jasowang-H+wXaHxf7aLQT0dZR+AlfA

On Fri, Apr 07, 2017 at 08:17:47PM +0100, Jean-Philippe Brucker wrote:
> Here I propose a few ideas for extensions and optimizations. This is all
> very exploratory, feel free to correct mistakes and suggest more things.
> 
> 	I.   Linux host
> 	     1. vhost-iommu

A qemu based implementation would be a first step.
Would allow validating the claim that it's much
simpler to support than e.g. VTD.

> 	     2. VFIO nested translation
> 	II.  Page table sharing
> 	     1. Sharing IOMMU page tables
> 	     2. Sharing MMU page tables (SVM)
> 	     3. Fault reporting
> 	     4. Host implementation with VFIO
> 	III. Relaxed operations
> 	IV.  Misc
> 
> 
>   I. Linux host
>   =============
> 
>   1. vhost-iommu
>   --------------
> 
> An advantage of virtualizing an IOMMU using virtio is that it allows
> hoisting a lot of the emulation code into the kernel using vhost, and
> avoids returning to userspace for each request. The mainline kernel already
> implements vhost-net, vhost-scsi and vhost-vsock, and a lot of core code
> could be reused.
> 
> Introducing vhost in a simplified scenario 1 (removed guest userspace
> pass-through, irrelevant to this example) gives us the following:
> 
>   MEM____pIOMMU________PCI device____________                    HARDWARE
>             |                                \
>   ----------|-------------+-------------+-----\--------------------------
>             |             :     KVM     :      \
>        pIOMMU drv         :             :       \                  KERNEL
>             |             :             :     net drv
>           VFIO            :             :       /
>             |             :             :      /
>        vhost-iommu_________________________virtio-iommu-drv
>                           :             :
>   --------------------------------------+-------------------------------
>                  HOST                   :             GUEST
> 
> 
> Introducing vhost in scenario 2, userspace now only handles the device
> initialisation part, and most runtime communication is handled in kernel:
> 
>   MEM__pIOMMU___PCI device                                     HARDWARE
>          |         |
>   -------|---------|------+-------------+-------------------------------
>          |         |      :     KVM     :
>     pIOMMU drv     |      :             :                         KERNEL
>              \__net drv   :             :
>                    |      :             :
>                   tap     :             :
>                    |      :             :
>               _vhost-net________________________virtio-net drv
>          (2) /            :             :           / (1a)
>             /             :             :          /
>    vhost-iommu________________________________virtio-iommu drv
>                           :             : (1b)
>   ------------------------+-------------+-------------------------------
>                  HOST                   :             GUEST
> 
> (1) a. Guest virtio driver maps ring and buffers
>     b. Map requests are relayed to the host the same way.
> (2) To access any guest memory, vhost-net must query the IOMMU. We can
>     reuse the existing TLB protocol for this. TLB commands are written to
>     and read from the vhost-net fd.
> 
> As defined in Linux/include/uapi/linux/vhost.h, the vhost msg structure
> has everything needed for map/unmap operations:
> 
> 	struct vhost_iotlb_msg {
> 		__u64	iova;
> 		__u64	size;
> 		__u64	uaddr;
> 		__u8	perm; /* R/W */
> 		__u8	type;
> 	#define VHOST_IOTLB_MISS
> 	#define VHOST_IOTLB_UPDATE	/* MAP */
> 	#define VHOST_IOTLB_INVALIDATE	/* UNMAP */
> 	#define VHOST_IOTLB_ACCESS_FAIL
> 	};
> 
> 	struct vhost_msg {
> 		int type;
> 		union {
> 			struct vhost_iotlb_msg iotlb;
> 			__u8 padding[64];
> 		};
> 	};
> 
> The vhost-iommu device associates a virtual device ID with a TLB fd. We
> should be able to use the same commands for [vhost-net <-> virtio-iommu]
> and [virtio-net <-> vhost-iommu] communication. A virtio-net device
> would open a socketpair and hand one side to vhost-iommu.
> 
> If vhost_msg is ever used for a purpose other than TLB, we'll have some
> trouble, as there will be multiple clients that want to read/write the
> vhost fd. A multicast transport method will be needed. Until then, this
> can work.
> 
> Details of operations would be:
> 
> (1) Userspace sets up vhost-iommu as with other vhost devices, by using
> standard vhost ioctls. Userspace starts by describing the system topology
> via ioctl:
> 
> 	ioctl(iommu_fd, VHOST_IOMMU_ADD_DEVICE, struct
> 	      vhost_iommu_add_device)
> 
> 	#define VHOST_IOMMU_DEVICE_TYPE_VFIO
> 	#define VHOST_IOMMU_DEVICE_TYPE_TLB
> 
> 	struct vhost_iommu_add_device {
> 		__u8 type;
> 		__u32 devid;
> 		union {
> 			struct vhost_iommu_device_vfio {
> 				int vfio_group_fd;
> 			};
> 			struct vhost_iommu_device_tlb {
> 				int fd;
> 			};
> 		};
> 	};
> 
> (2) VIRTIO_IOMMU_T_ATTACH(address space, devid)
> 
> vhost-iommu creates an address space if necessary, finds the device along
> with the relevant operations. If type is VFIO, operations are done on a
> container, otherwise they are done on single devices.
> 
> (3) VIRTIO_IOMMU_T_MAP(address space, virt, phys, size, flags)
> 
> Turn phys into an hva using the vhost mem table (see the sketch after
> this list).
> 
> - If type is TLB, either preload with VHOST_IOTLB_UPDATE or store the
>   mapping locally and wait for the TLB to ask for it with a
>   VHOST_IOTLB_MISS.
> - If type is VFIO, turn it into a VFIO_IOMMU_MAP_DMA (might need to
>   introduce a shortcut in the external user API of VFIO).
> 
> (4) VIRTIO_IOMMU_T_UNMAP(address space, virt, phys, size, flags)
> 
> - If type is TLB, send a VHOST_IOTLB_INVALIDATE.
> - If type is VFIO, turn it into VFIO_IOMMU_UNMAP_DMA.
> 
> (5) VIRTIO_IOMMU_T_DETACH(address space, devid)
> 
> Undo whatever was done in (2).
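> 
> As a sketch of step (3) for the TLB type (gpa_to_hva() and the fds are
> hypothetical; vhost_msg, VHOST_IOTLB_MSG and VHOST_ACCESS_RW are the
> existing vhost UAPI):
> 
> 	struct vhost_msg msg = {
> 		.type = VHOST_IOTLB_MSG,
> 		.iotlb = {
> 			.iova	= virt,
> 			.size	= size,
> 			/* phys is a GPA, resolve it via the mem table */
> 			.uaddr	= gpa_to_hva(mem_table, phys),
> 			.perm	= VHOST_ACCESS_RW,
> 			.type	= VHOST_IOTLB_UPDATE,
> 		},
> 	};
> 
> 	/* preload the mapping through the vhost-net fd */
> 	write(vhost_net_fd, &msg, sizeof(msg));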
> 
> 
>   2. VFIO nested translation
>   --------------------------
> 
> For my current kvmtool implementation, I am putting each VFIO group in a
> different container during initialization. We cannot detach a group from a
> container at runtime without first resetting all devices in that group. So
> the best way to provide dynamic address spaces right now is one container
> per group. The drawback is that we need to maintain multiple sets of page
> tables even if the guest wants to put all devices in the same address
> space. Another disadvantage is that when implementing bypass mode, we
> need to map the whole address space at the beginning, then unmap
> everything on attach. Adding nested support would be a nice way to
> provide dynamic address spaces while keeping groups tied to a container
> at all times.
> 
> A physical IOMMU may offer nested translation. In this case, address
> spaces are managed by two page directories instead of one. A guest-
> virtual address is translated into a guest-physical one using what we'll
> call here "stage-1" (s1) page tables, and the guest-physical address is
> translated into a host-physical one using "stage-2" (s2) page tables.
> 
>                              s1      s2
>                          GVA --> GPA --> HPA
> 
> There isn't a lot of support in Linux for nesting IOMMU page directories
> at the moment (though SVM support is coming, see II). VFIO does have a
> "nesting" IOMMU type, which doesn't mean much at the moment. The ARM SMMU
> code uses this to decide whether to manage the container with s2 page
> tables instead of s1, but even then we still only have a single stage and
> it is assumed that IOVA=GPA.
> 
> Another model that would help with dynamically changing address spaces is
> nesting VFIO containers:
> 
>                            Parent  <---------- map/unmap
>                           container
>                          /   |     \
>                         /   group   \
>                      Child         Child  <--- map/unmap
>                    container     container
>                     |   |             |
>                  group group        group
> 
> At the beginning all groups are attached to the parent container, and
> there is no child container. Doing map/unmap on the parent container maps
> stage-2 page tables (map GPA -> HVA and pin the page -> HPA). The user should
> be able to choose whether they want all devices attached to this container
> to be able to access GPAs (bypass mode, as it currently is) or simply
> block all DMA (in which case there is no need to pin pages here).
> 
> At some point the guest wants to create an address space and attaches
> children to it. Using an ioctl (to be defined), we can derive a child
> container from the parent container, and move groups from parent to child.
> 
> This returns a child fd. When the guest maps something in this new address
> space, we can do a map ioctl on the child container, which maps stage-1
> page tables (map GVA -> GPA).
> 
> A page table walk may access multiple levels of tables (pgd, p4d, pud,
> pmd, pt). With nested translation, each access to a table during the
> stage-1 walk requires a stage-2 walk. This makes a full translation costly
> so it is preferable to use a single stage of translation when possible.
> Folding two stages into one is simple with a single container, as shown in
> the kvmtool example. The host keeps track of GPA->HVA mappings, so it can
> fold the full GVA->HVA mapping before sending the VFIO request. With
> nested containers however, the IOMMU driver would have to do the folding
> work itself. Keeping a copy of the stage-2 mappings created on the parent
> container, it would fold them into the actual stage-2 page tables when
> receiving a map request on the child container (note that software folding
> is not possible when stage-1 pgd is managed by the guest, as described in
> next section).
> 
> I don't know if nested VFIO containers are a desirable feature at all. I
> find the concept cute on paper, and it would make it easier for userspace
> to juggle address spaces, but it might require some invasive changes
> in VFIO, and people have been able to use the current API for IOMMU
> virtualization so far.
> 
> 
>   II. Page table sharing
>   ======================
> 
>   1. Sharing IOMMU page tables
>   ----------------------------
> 
> VIRTIO_IOMMU_F_PT_SHARING
> 
> This is independent of the nested mode described in I.2, but relies on a
> similar feature in the physical IOMMU: having two stages of page tables,
> one for the host and one for the guest.
> 
> When this is supported, the guest can manage its own s1 page directory, to
> avoid sending MAP/UNMAP requests. Feature VIRTIO_IOMMU_F_PT_SHARING allows
> a driver to give a page directory pointer (pgd) to the host and send
> invalidations when removing or changing a mapping. In this mode, three
> requests are used: probe, attach and invalidate. An address space cannot
> use the MAP/UNMAP interface and PT_SHARING at the same time.
> 
> Device and driver first need to negotiate which page table format they
> will be using. This depends on the physical IOMMU, so the request contains
> a negotiation part to probe the device capabilities.
> 
> (1) Driver attaches devices to address spaces as usual, but a flag
>     VIRTIO_IOMMU_ATTACH_F_PRIVATE (working title) tells the device not to
>     create page tables for use with the MAP/UNMAP API. The driver intends
>     to manage the address space itself.
> 
> (2) Driver sends a PROBE_TABLE request. It sets len (> 0) to the size
>     of the pg_format array.
> 
> 	VIRTIO_IOMMU_T_PROBE_TABLE
> 
> 	struct virtio_iommu_req_probe_table {
> 		le32	address_space;
> 		le32	flags;
> 		le32	len;
> 	
> 		le32	nr_contexts;
> 		struct {
> 			le32	model;
> 			u8	format[64];
> 		} pg_format[len];
> 	};
> 
> Introducing a probe request is more flexible than advertising those
> features in virtio config, because capabilities are dynamic, and depend on
> which devices are attached to an address space. Within a single address
> space, devices may support different numbers of contexts (PASIDs), and
> some may not support recoverable faults.
> 
> (3) Device responds success with all page table formats implemented by the
>     physical IOMMU in pg_format. 'model' 0 is invalid, so the driver can
>     initialize the array to 0 and deduce from there which entries have
>     been filled by the device.
> 
> Using a probe method seems preferable over trying to attach every possible
> format until one sticks. For instance, with an ARM guest running on an x86
> host, PROBE_TABLE would return the Intel IOMMU page table format, and the
> guest could use that page table code to handle its mappings, hidden behind
> the IOMMU API. This requires that the page-table code is reasonably
> abstracted from the architecture, as is done with drivers/iommu/io-pgtable
> (an x86 guest could use any format implemented by io-pgtable, for example.)
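> 
> For illustration, the driver-side scan after the reply could look like
> this (sketch, the helper name is made up):
> 
> 	/* pg_format was zero-initialized before sending the request */
> 	for (i = 0; i < len; i++) {
> 		if (!le32_to_cpu(req->pg_format[i].model))
> 			break;	/* the device only filled entries 0..i-1 */
> 		viommu_add_pgtable_format(&req->pg_format[i]);
> 	}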
> 
> (4) If the driver is able to use this format, it sends the ATTACH_TABLE
>     request.
> 
> 	VIRTIO_IOMMU_T_ATTACH_TABLE
> 
> 	struct virtio_iommu_req_attach_table {
> 		le32	address_space;
> 		le32	flags;
> 		le64	table;
> 	
> 		le32	nr_contexts;
> 		/* Page-table format description */
> 	
> 		le32	model;
> 		u8	config[64];
> 	};
> 
> 
>     'table' is a pointer to the page directory. 'nr_contexts' isn't used
>     here.
> 
>     For both ATTACH and PROBE, 'flags' are the following (and will be
>     explained later):
> 
> 	VIRTIO_IOMMU_ATTACH_TABLE_F_INDIRECT	(1 << 0)
> 	VIRTIO_IOMMU_ATTACH_TABLE_F_NATIVE	(1 << 1)
> 	VIRTIO_IOMMU_ATTACH_TABLE_F_FAULT	(1 << 2)
> 
> Now 'model' is a bit tricky. We need to specify all possible page table
> formats and their parameters. I'm not well-versed in x86, s390 or other
> IOMMUs, so I'll just focus on the ARM world for this example. We basically
> have two page table models, with a multitude of configuration bits:
> 
> 	* ARM LPAE
> 	* ARM short descriptor
> 
> We could define a high-level identifier per page-table model, such as:
> 
> 	#define PG_TABLE_ARM	0x1
> 	#define PG_TABLE_X86	0x2
> 	...
> 
> And each model would define its own structure. On ARM 'format' could be a
> simple u32 defining a variant, LPAE 32/64 or short descriptor. It could
> also contain additional capabilities. Then depending on the variant,
> 'config' would be:
> 
> 	struct pg_config_v7s {
> 		le32	tcr;
> 		le32	prrr;
> 		le32	nmrr;
> 		le32	asid;
> 	};
> 	
> 	struct pg_config_lpae {
> 		le64	tcr;
> 		le64	mair;
> 		le32	asid;
> 	
> 		/* And maybe TTB1? */
> 	};
> 
> 	struct pg_config_arm {
> 		le32	variant;
> 		union ...;
> 	};
> 
> I am really uneasy with describing all those nasty architectural details
> in the virtio-iommu specification. We certainly won't start describing
> the content of tcr or mair bit by bit here, but just declaring these fields
> might be sufficient.
> 
> (5) Once the table is attached, the driver can simply write the page
>     tables and expect the physical IOMMU to observe the mappings without
>     any additional request. When changing or removing a mapping, however,
>     the driver must send an invalidate request.
> 
> 	VIRTIO_IOMMU_T_INVALIDATE
> 
> 	struct virtio_iommu_req_invalidate {
> 		le32	address_space;
> 		le32	context;
> 		le32	flags;
> 		le64	virt_addr;
> 		le64	range_size;
> 	
> 		u8	opaque[64];
> 	};
> 
>     'flags' may be:
> 
>     VIRTIO_IOMMU_INVALIDATE_T_VADDR: invalidate a single VA range
>       from 'context' (context is 0 when !F_INDIRECT).
> 
>     And with context tables only (explained below):
> 
>     VIRTIO_IOMMU_INVALIDATE_T_SINGLE: invalidate all mappings from
>       'context' (context is 0 when !F_INDIRECT). virt_addr and range_size
>       are ignored.
> 
>     VIRTIO_IOMMU_INVALIDATE_T_TABLE: with F_INDIRECT, invalidate entries
>       in the table that changed. Device reads the table again, compares it
>       to previous values, and invalidates all mappings for contexts that
>       changed. context, virt_addr and range_size are ignored.
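> 
> For example, after changing a PTE, the driver would invalidate the
> affected VA range like this (sketch, viommu_send_req_sync() is a
> made-up helper):
> 
> 	struct virtio_iommu_req_invalidate inv = {
> 		.address_space	= cpu_to_le32(as),
> 		.context	= cpu_to_le32(context),
> 		.flags		= cpu_to_le32(VIRTIO_IOMMU_INVALIDATE_T_VADDR),
> 		.virt_addr	= cpu_to_le64(iova),
> 		.range_size	= cpu_to_le64(size),
> 	};
> 
> 	viommu_send_req_sync(dev, &inv);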
> 
> IOMMUs may offer hints and quirks in their invalidation packets. The
> opaque structure in invalidate would allow transporting those. This
> depends on the page table format and as with architectural page-table
> definitions, I really don't want to have those details in the spec itself.
> 
> 
>   2. Sharing MMU page tables
>   --------------------------
> 
> The guest can share process page-tables with the physical IOMMU. To do
> that, it sends PROBE_TABLE with (F_INDIRECT | F_NATIVE | F_FAULT). The
> page table format is implicit, so the pg_format array can be empty (unless
> the guest wants to query some specific property, e.g. number of levels
> supported by the pIOMMU?). If the host answers with success, guest can
> send its MMU page table details with ATTACH_TABLE and (F_NATIVE |
> F_INDIRECT | F_FAULT) flags.
> 
> F_FAULT means that the host communicates page requests from the device
> to the guest, and the guest can handle them by mapping the virtual
> address in the fault to pages. It is only available with
> VIRTIO_IOMMU_F_FAULT_QUEUE (see below.)
> 
> F_NATIVE means that the pIOMMU pgtable format is the same as the guest
> MMU pgtable format.
> 
> F_INDIRECT means that 'table' pointer is a context table, instead of a
> page directory. Each slot in the context table points to a page directory:
> 
>                        64              2 1 0
>           table ----> +---------------------+
>                       |       pgd       |0|1|<--- context 0
>                       |       ---       |0|0|<--- context 1
>                       |       pgd       |0|1|
>                       |       ---       |0|0|
>                       |       ---       |0|0|
>                       +---------------------+
>                                          | \___Entry is valid
>                                          |______reserved
> 
> Question: do we want a per-context page table format, or can it stay global
> for the whole indirect table?
> 
> Having a context table allows providing multiple address spaces for a
> single device. In the simplest form, without F_INDIRECT we have a single
> address space per device, but some devices may implement more, for
> instance devices with the PCI PASID extension.
> 
> A slot's position in the context table gives an ID, between 0 and
> nr_contexts. The guest can use this ID to have the device target a
> specific address space with DMA. The mechanism to do that is
> device-specific. For a PCI device, the ID is a PASID, and PCI doesn't
> define a specific way of using them for DMA, it's the device driver's
> concern.
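> 
> With this layout, writing a slot boils down to (sketch, assuming 64-bit
> slots):
> 
> 	#define CTX_ENTRY_VALID		(1ULL << 0)
> 
> 	static void set_context(__le64 *table, u32 ctx, u64 pgd)
> 	{
> 		/* bit 0 is the valid bit, bit 1 is reserved */
> 		table[ctx] = cpu_to_le64(pgd | CTX_ENTRY_VALID);
> 	}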
> 
> 
>   3. Fault reporting
>   ------------------
> 
> VIRTIO_IOMMU_F_EVENT_QUEUE
> 
> With this feature, an event virtqueue (1) is available. For now it will
> only be used for fault handling, but I'm calling it eventq so that other
> asynchronous features can piggy-back on it. The device may report faults and
> page requests by sending buffers via the used ring.
> 
> 	#define VIRTIO_IOMMU_T_FAULT	0x05
> 
> 	struct virtio_iommu_evt_fault {
> 		struct virtio_iommu_evt_head {
> 			u8 type;
> 			u8 reserved[3];
> 		};
> 	
> 		u32 address_space;
> 		u32 context;
> 	
> 		u64 vaddr;
> 		u32 flags;	/* Access details: R/W/X */
> 	
> 		/* In the reply: */
> 		u32 reply;	/* Fault handled, or failure */
> 		u64 paddr;
> 	};
> 
> Driver must send the reply via the request queue, with the fault status
> in 'reply', and the mapped page in 'paddr' on success.
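> 
> Guest-side handling could roughly look like this (sketch; the helper
> and the reply values are made up):
> 
> 	/* evt was popped from the event queue */
> 	u64 paddr = viommu_handle_fault(evt->address_space, evt->context,
> 					evt->vaddr, evt->flags);
> 
> 	evt->reply = paddr ? VIRTIO_IOMMU_FAULT_HANDLED
> 			   : VIRTIO_IOMMU_FAULT_FAILED;
> 	evt->paddr = paddr;
> 
> 	/* the reply goes back on the request queue */
> 	viommu_send_req_sync(dev, evt);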
> 
> Existing fault handling interfaces such as PRI have a tag (PRG) that
> identifies a page request (or group thereof) when sending a reply. I
> wonder if this would be useful to us, but it seems like the
> (address_space, context, vaddr) tuple is sufficient to identify a page
> fault, provided the device doesn't send duplicate faults. Duplicate faults
> could be required if they have a side effect, for instance implementing a
> poor man's doorbell. If this is desirable, we could add a fault_id field.
> 
> 
>   4. Host implementation with VFIO
>   --------------------------------
> 
> The VFIO interface for sharing page tables is being worked on at the
> moment by Intel. Other virtual IOMMU implementations will most likely let
> the guest manage full context tables (PASID tables) itself, giving the
> context table pointer to the pIOMMU via a VFIO ioctl.
> 
> For the architecture-agnostic virtio-iommu however, we shouldn't have to
> implement all possible formats of context table (they are at least
> different between ARM SMMU and Intel IOMMU, and will certainly be extended
> in future physical IOMMU architectures.) In addition, most users might
> only care about having one page directory per device, as SVM is a luxury
> at the moment and few devices support it. For these reasons, we should
> allow passing single page directories via VFIO, using structures very
> similar to those described above, whilst reusing the VFIO channel developed
> for Intel vIOMMU.
> 
> 	* VFIO_SVM_INFO: probe page table formats
> 	* VFIO_SVM_BIND: set pgd and arch-specific configuration
> 
> There is an inconvenience in letting the pIOMMU driver manage the guest's
> context table. During a page table walk, the pIOMMU translates the context
> table pointer using the stage-2 page tables. The context table must
> therefore be mapped in guest-physical space by the pIOMMU driver. One
> solution is to let the pIOMMU driver reserve some GPA space upfront using
> the iommu and sysfs resv API [1]. The host would then carve that region
> out of the guest-physical space using a firmware mechanism (for example DT
> reserved-memory node).
> 
> 
>   III. Relaxed operations
>   =======================
> 
> VIRTIO_IOMMU_F_RELAXED
> 
> Adding an IOMMU dramatically reduces performance of a device, because
> map/unmap operations are costly and produce a lot of TLB traffic. For
> significant performance improvements, the device might allow the driver to
> sacrifice safety for speed. In this mode, the driver does not need to send
> UNMAP requests. The semantics of MAP change and are more complex to
> implement. Given a MAP([start:end] -> phys, flags) request:
> 
> (1) If [start:end] isn't mapped, request succeeds as usual.
> (2) If [start:end] overlaps an existing mapping [old_start:old_end], we
>     unmap [max(start, old_start):min(end, old_end)] and replace it with
>     [start:end].
> (3) If [start:end] overlaps an existing mapping that matches the new map
>     request exactly (same flags, same phys address), the old mapping is
>     kept.
> 
> This squashing could be performed by the guest. The driver can catch unmap
> requests from the DMA layer, and only relay map requests for (1) and (2).
> A MAP request is therefore able to split and partially override an
> existing mapping, which isn't allowed in non-relaxed mode. UNMAP requests
> are unnecessary, but are now allowed to split or carve holes in mappings.
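> 
> A sketch of that driver-side filtering (mapping_find() and the tree it
> searches are hypothetical):
> 
> 	/* intercepted unmap: record the hole locally, send nothing */
> 
> 	/* intercepted map: */
> 	old = mapping_find(as, start, end);
> 	if (old && old->start == start && old->end == end &&
> 	    old->phys == phys && old->flags == flags)
> 		return 0;		/* case (3): keep the old mapping */
> 
> 	/* cases (1) and (2): the device squashes any overlap */
> 	return viommu_send_map(as, start, end, phys, flags);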
> 
> In this model, a MAP request may take longer, but we may have a net gain
> by removing a lot of redundant requests. Squashing series of map/unmap
> performed by the guest for the same mapping improves temporal reuse of
> IOVA mappings, which I can observe by simply dumping IOMMU activity of a
> virtio device. It reduces the number of TLB invalidations to the strict
> minimum while keeping correctness of DMA operations (provided the device
> obeys its driver). There is a good read on the subject of optimistic
> teardown in paper [2].
> 
> This model is completely unsafe. A stale DMA transaction might access a
> page long after the device driver in the guest unmapped it and
> decommissioned the page. The DMA transaction might hit into a completely
> different part of the system that is now reusing the page. Existing
> relaxed implementations attempt to mitigate the risk by setting a timeout
> on the teardown. Unmap requests from device drivers are not discarded
> entirely, but buffered and sent at a later time. Paper [2] reports good
> results with a 10ms delay.
> 
> We could add a way for device and driver to negotiate a vulnerability
> window to mitigate the risk of DMA attacks. Driver might not accept a
> window at all, since it requires more infrastructure to keep delayed
> mappings. In my opinion, it should be made clear that regardless of the
> duration of this window, any driver accepting F_RELAXED feature makes the
> guest completely vulnerable, and the choice boils down to either isolation
> or speed, not a bit of both.
> 
> 
>   IV. Misc
>   ========
> 
> I think we have enough to go on for a while. To improve MAP throughput, I
> considered adding a MAP_SG request depending on a feature bit, with
> variable size:
> 
> 	struct virtio_iommu_req_map_sg {
> 		struct virtio_iommu_req_head;
> 		u32	address_space;
> 		u32	nr_elems;
> 		u64	virt_addr;
> 		u64	size;
> 		u64	phys_addr[nr_elems];
> 	};
> 
> Would create the following mappings:
> 
> 	virt_addr		-> phys_addr[0]
> 	virt_addr + size	-> phys_addr[1]
> 	virt_addr + 2 * size	-> phys_addr[2]
> 	...
> 
> This would avoid the overhead of multiple map commands. We could try to
> find a more cunning format to compress virtually-contiguous mappings with
> different (phys, size) pairs as well. But Linux drivers rarely prefer
> map_sg() functions over regular map(), so I don't know if the whole map_sg
> feature is worth the effort. All we would gain is a few bytes anyway.
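> 
> Device-side, unrolling such a request would simply be (sketch):
> 
> 	for (i = 0; i < nr_elems; i++) {
> 		ret = viommu_do_map(as, virt_addr + i * size,
> 				    le64_to_cpu(req->phys_addr[i]), size);
> 		if (ret)
> 			break;	/* undo entries 0..i-1, report the error */
> 	}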
> 
> My current map_sg implementation in the virtio-iommu driver adds a batch
> of map requests to the queue and kicks the host once. That might be enough
> of an optimization.
> 
> 
> Another invasive optimization would be adding grouped requests. By adding
> two flags in the header, L and G, we can group sequences of requests
> together, and have one status at the end, either 0 if all requests in the
> group succeeded, or the status of the first request that failed. This is
> all in-order. Requests in a group follow each others, there is no sequence
> identifier.
> 
> 	                       ___ L: request is last in the group
> 	                      /  _ G: request is part of a group
> 	                     |  /
> 	                     v v
> 	31                   9 8 7      0
> 	+--------------------------------+ <------- RO descriptor
> 	|        res0       |0|1|  type  |
> 	+--------------------------------+
> 	|            payload             |
> 	+--------------------------------+
> 	|        res0       |0|1|  type  |
> 	+--------------------------------+
> 	|            payload             |
> 	+--------------------------------+
> 	|        res0       |0|1|  type  |
> 	+--------------------------------+
> 	|            payload             |
> 	+--------------------------------+
> 	|        res0       |1|1|  type  |
> 	+--------------------------------+
> 	|            payload             |
> 	+--------------------------------+ <------- WO descriptor
> 	|        res0           | status |
> 	+--------------------------------+
> 
> This adds some complexity on the device, since it must unroll whatever was
> done by successful requests in a group as soon as one fails, and reject
> all subsequent ones. A group of requests is an atomic operation. As with
> map_sg, this change mostly saves space and virtio descriptors.
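> 
> Device-side handling of a group could be (sketch, hypothetical helpers):
> 
> 	nr_done = 0;
> 	list_for_each_entry(req, &group->requests, list) {
> 		status = viommu_handle_req(req);
> 		if (status)
> 			break;
> 		nr_done++;
> 	}
> 	if (status)		/* atomic group: undo what succeeded */
> 		viommu_unroll_reqs(group, nr_done);
> 
> 	/* a single status descriptor for the whole group */
> 	viommu_write_status(group, status);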
> 
> 
> [1] https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-kernel-iommu_groups
> [2] vIOMMU: Efficient IOMMU Emulation
>     N. Amit, M. Ben-Yehuda, D. Tsafrir, A. Schuster

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC 3/3] virtio-iommu: future work
  2017-04-07 19:17 ` [RFC 3/3] virtio-iommu: future work Jean-Philippe Brucker
  2017-04-21  8:31   ` Tian, Kevin
       [not found]   ` <20170407191747.26618-4-jean-philippe.brucker-5wv7dgnIgG8@public.gmane.org>
@ 2017-04-26 16:24   ` Michael S. Tsirkin
  2 siblings, 0 replies; 99+ messages in thread
From: Michael S. Tsirkin @ 2017-04-26 16:24 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: virtio-dev, lorenzo.pieralisi, kvm, cdall, marc.zyngier, joro,
	will.deacon, virtualization, iommu, robin.murphy

On Fri, Apr 07, 2017 at 08:17:47PM +0100, Jean-Philippe Brucker wrote:
> Here I propose a few ideas for extensions and optimizations. This is all
> very exploratory, feel free to correct mistakes and suggest more things.
> 
> 	I.   Linux host
> 	     1. vhost-iommu

A qemu based implementation would be a first step.
Would allow validating the claim that it's much
simpler to support than e.g. VTD.

> 	     2. VFIO nested translation
> 	II.  Page table sharing
> 	     1. Sharing IOMMU page tables
> 	     2. Sharing MMU page tables (SVM)
> 	     3. Fault reporting
> 	     4. Host implementation with VFIO
> 	III. Relaxed operations
> 	IV.  Misc
> 
> 
>   I. Linux host
>   =============
> 
>   1. vhost-iommu
>   --------------
> 
> An advantage of virtualizing an IOMMU using virtio is that it allows to
> hoist a lot of the emulation code into the kernel using vhost, and avoid
> returning to userspace for each request. The mainline kernel already
> implements vhost-net, vhost-scsi and vhost-vsock, and a lot of core code
> could be reused.
> 
> Introducing vhost in a simplified scenario 1 (removed guest userspace
> pass-through, irrelevant to this example) gives us the following:
> 
>   MEM____pIOMMU________PCI device____________                    HARDWARE
>             |                                \
>   ----------|-------------+-------------+-----\--------------------------
>             |             :     KVM     :      \
>        pIOMMU drv         :             :       \                  KERNEL
>             |             :             :     net drv
>           VFIO            :             :       /
>             |             :             :      /
>        vhost-iommu_________________________virtio-iommu-drv
>                           :             :
>   --------------------------------------+-------------------------------
>                  HOST                   :             GUEST
> 
> 
> Introducing vhost in scenario 2, userspace now only handles the device
> initialisation part, and most runtime communication is handled in kernel:
> 
>   MEM__pIOMMU___PCI device                                     HARDWARE
>          |         |
>   -------|---------|------+-------------+-------------------------------
>          |         |      :     KVM     :
>     pIOMMU drv     |      :             :                         KERNEL
>              \__net drv   :             :
>                    |      :             :
>                   tap     :             :
>                    |      :             :
>               _vhost-net________________________virtio-net drv
>          (2) /            :             :           / (1a)
>             /             :             :          /
>    vhost-iommu________________________________virtio-iommu drv
>                           :             : (1b)
>   ------------------------+-------------+-------------------------------
>                  HOST                   :             GUEST
> 
> (1) a. Guest virtio driver maps ring and buffers
>     b. Map requests are relayed to the host the same way.
> (2) To access any guest memory, vhost-net must query the IOMMU. We can
>     reuse the existing TLB protocol for this. TLB commands are written to
>     and read from the vhost-net fd.
> 
> As defined in Linux/include/uapi/linux/vhost.h, the vhost msg structure
> has everything needed for map/unmap operations:
> 
> 	struct vhost_iotlb_msg {
> 		__u64	iova;
> 		__u64	size;
> 		__u64	uaddr;
> 		__u8	perm; /* R/W */
> 		__u8	type;
> 	#define VHOST_IOTLB_MISS
> 	#define VHOST_IOTLB_UPDATE	/* MAP */
> 	#define VHOST_IOTLB_INVALIDATE	/* UNMAP */
> 	#define VHOST_IOTLB_ACCESS_FAIL
> 	};
> 
> 	struct vhost_msg {
> 		int type;
> 		union {
> 			struct vhost_iotlb_msg iotlb;
> 			__u8 padding[64];
> 		};
> 	};
> 
> The vhost-iommu device associates a virtual device ID to a TLB fd. We
> should be able to use the same commands for [vhost-net <-> virtio-iommu]
> and [virtio-net <-> vhost-iommu] communication. A virtio-net device
> would open a socketpair and hand one side to vhost-iommu.
> 
> If vhost_msg is ever used for another purpose than TLB, we'll have some
> trouble, as there will be multiple clients that want to read/write the
> vhost fd. A multicast transport method will be needed. Until then, this
> can work.
> 
> Details of operations would be:
> 
> (1) Userspace sets up vhost-iommu as with other vhost devices, by using
> standard vhost ioctls. Userspace starts by describing the system topology
> via ioctl:
> 
> 	ioctl(iommu_fd, VHOST_IOMMU_ADD_DEVICE, struct
> 	      vhost_iommu_add_device)
> 
> 	#define VHOST_IOMMU_DEVICE_TYPE_VFIO
> 	#define VHOST_IOMMU_DEVICE_TYPE_TLB
> 
> 	struct vhost_iommu_add_device {
> 		__u8 type;
> 		__u32 devid;
> 		union {
> 			struct vhost_iommu_device_vfio {
> 				int vfio_group_fd;
> 			};
> 			struct vhost_iommu_device_tlb {
> 				int fd;
> 			};
> 		};
> 	};
> 
> (2) VIRTIO_IOMMU_T_ATTACH(address space, devid)
> 
> vhost-iommu creates an address space if necessary, finds the device along
> with the relevant operations. If type is VFIO, operations are done on a
> container, otherwise they are done on single devices.
> 
> (3) VIRTIO_IOMMU_T_MAP(address space, virt, phys, size, flags)
> 
> Turn phys into an hva using the vhost mem table.
> 
> - If type is TLB, either preload with VHOST_IOTLB_UPDATE or store the
>   mapping locally and wait for the TLB to ask for it with a
>   VHOST_IOTLB_MISS.
> - If type is VFIO, turn it into a VFIO_IOMMU_MAP_DMA (might need to
>   introduce a shortcut in the external user API of VFIO).
> 
> (4) VIRTIO_IOMMU_T_UNMAP(address space, virt, phys, size, flags)
> 
> - If type is TLB, send a VHOST_IOTLB_INVALIDATE.
> - If type is VFIO, turn it into VFIO_IOMMU_UNMAP_DMA.
> 
> (5) VIRTIO_IOMMU_T_DETACH(address space, devid)
> 
> Undo whatever was done in (2).
> 
> 
>   2. VFIO nested translation
>   --------------------------
> 
> For my current kvmtool implementation, I am putting each VFIO group in a
> different container during initialization. We cannot detach a group from a
> container at runtime without first resetting all devices in that group. So
> the best way to provide dynamic address spaces right now is one container
> per group. The drawback is that we need to maintain multiple sets of page
> tables even if the guest wants to put all devices in the same address
> space. Another disadvantage is when implementing bypass mode, we need to
> map the whole address space at the beginning, then unmap everything on
> attach. Adding nested support would be a nice way to provide dynamic
> address spaces while keeping groups tied to a container at all times.
> 
> A physical IOMMU may offer nested translation. In this case, address
> spaces are managed by two page directories instead of one. A guest-
> virtual address is translated into a guest-physical one using what we'll
> call here "stage-1" (s1) page tables, and the guest-physical address is
> translated into a host-physical one using "stage-2" (s2) page tables.
> 
>                              s1      s2
>                          GVA --> GPA --> HPA
> 
> There isn't a lot of support in Linux for nesting IOMMU page directories
> at the moment (though SVM support is coming, see II). VFIO does have a
> "nesting" IOMMU type, which doesn't mean much at the moment. The ARM SMMU
> code uses this to decide whether to manage the container with s2 page
> tables instead of s1, but even then we still only have a single stage and
> it is assumed that IOVA=GPA.
> 
> Another model that would help with dynamically changing address spaces is
> nesting VFIO containers:
> 
>                            Parent  <---------- map/unmap
>                           container
>                          /   |     \
>                         /   group   \
>                      Child         Child  <--- map/unmap
>                    container     container
>                     |   |             |
>                  group group        group
> 
> At the beginning all groups are attached to the parent container, and
> there is no child container. Doing map/unmap on the parent container maps
> stage-2 page tables (map GPA -> HVA and pin the page -> HPA). User should
> be able to choose whether they want all devices attached to this container
> to be able to access GPAs (bypass mode, as it currently is) or simply
> block all DMA (in which case there is no need to pin pages here).
> 
> At some point the guest wants to create an address space and attaches
> children to it. Using an ioctl (to be defined), we can derive a child
> container from the parent container, and move groups from parent to child.
> 
> This returns a child fd. When the guest maps something in this new address
> space, we can do a map ioctl on the child container, which maps stage-1
> page tables (map GVA -> GPA).
> 
> A page table walk may access multiple levels of tables (pgd, p4d, pud,
> pmd, pt). With nested translation, each access to a table during the
> stage-1 walk requires a stage-2 walk. This makes a full translation costly
> so it is preferable to use a single stage of translation when possible.
> Folding two stages into one is simple with a single container, as shown in
> the kvmtool example. The host keeps track of GPA->HVA mappings, so it can
> fold the full GVA->HVA mapping before sending the VFIO request. With
> nested containers however, the IOMMU driver would have to do the folding
> work itself. Keeping a copy of stage-2 mapping created on the parent
> container, it would fold them into the actual stage-2 page tables when
> receiving a map request on the child container (note that software folding
> is not possible when stage-1 pgd is managed by the guest, as described in
> next section).
> 
> I don't know if nested VFIO containers are a desirable feature at all. I
> find the concept cute on paper, and it would make it easier for userspace
> to juggle with address spaces, but it might require some invasive changes
> in VFIO, and people have been able to use the current API for IOMMU
> virtualization so far.
> 
> 
>   II. Page table sharing
>   ======================
> 
>   1. Sharing IOMMU page tables
>   ----------------------------
> 
> VIRTIO_IOMMU_F_PT_SHARING
> 
> This is independent of the nested mode described in I.2, but relies on a
> similar feature in the physical IOMMU: having two stages of page tables,
> one for the host and one for the guest.
> 
> When this is supported, the guest can manage its own s1 page directory, to
> avoid sending MAP/UNMAP requests. Feature VIRTIO_IOMMU_F_PT_SHARING allows
> a driver to give a page directory pointer (pgd) to the host and send
> invalidations when removing or changing a mapping. In this mode, three
> requests are used: probe, attach and invalidate. An address space cannot
> be using the MAP/UNMAP interface and PT_SHARING at the same time.
> 
> Device and driver first need to negotiate which page table format they
> will be using. This depends on the physical IOMMU, so the request contains
> a negotiation part to probe the device capabilities.
> 
> (1) Driver attaches devices to address spaces as usual, but a flag
>     VIRTIO_IOMMU_ATTACH_F_PRIVATE (working title) tells the device not to
>     create page tables for use with the MAP/UNMAP API. The driver intends
>     to manage the address space itself.
> 
> (2) Driver sends a PROBE_TABLE request. It sets len > 0 with the size of
>     pg_format array.
> 
> 	VIRTIO_IOMMU_T_PROBE_TABLE
> 
> 	struct virtio_iommu_req_probe_table {
> 		le32	address_space;
> 		le32	flags;
> 		le32	len;
> 	
> 		le32	nr_contexts;
> 		struct {
> 			le32	model;
> 			u8	format[64];
> 		} pg_format[len];
> 	};
> 
> Introducing a probe request is more flexible than advertising those
> features in virtio config, because capabilities are dynamic, and depend on
> which devices are attached to an address space. Within a single address
> space, devices may support different numbers of contexts (PASIDs), and
> some may not support recoverable faults.
> 
> (3) Device responds success with all page table formats implemented by the
>     physical IOMMU in pg_format. 'model' 0 is invalid, so driver can
>     initialize the array to 0 and deduce from there which entries have
>     been filled by the device.
> 
> Using a probe method seems preferable over trying to attach every possible
> format until one sticks. For instance, with an ARM guest running on an x86
> host, PROBE_TABLE would return the Intel IOMMU page table format, and the
> guest could use that page table code to handle its mappings, hidden behind
> the IOMMU API. This requires that the page-table code is reasonably
> abstracted from the architecture, as is done with drivers/iommu/io-pgtable
> (an x86 guest could use any format implement by io-pgtable for example.)
> 
> (4) If the driver is able to use this format, it sends the ATTACH_TABLE
>     request.
> 
> 	VIRTIO_IOMMU_T_ATTACH_TABLE
> 
> 	struct virtio_iommu_req_attach_table {
> 		le32	address_space;
> 		le32	flags;
> 		le64	table;
> 	
> 		le32	nr_contexts;
> 		/* Page-table format description */
> 	
> 		le32	model;
> 		u8	config[64]
> 	};
> 
> 
>     'table' is a pointer to the page directory. 'nr_contexts' isn't used
>     here.
> 
>     For both ATTACH and PROBE, 'flags' are the following (and will be
>     explained later):
> 
> 	VIRTIO_IOMMU_ATTACH_TABLE_F_INDIRECT	(1 << 0)
> 	VIRTIO_IOMMU_ATTACH_TABLE_F_NATIVE	(1 << 1)
> 	VIRTIO_IOMMU_ATTACH_TABLE_F_FAULT	(1 << 2)
> 
> Now 'model' is a bit tricky. We need to specify all possible page table
> formats and their parameters. I'm not well-versed in x86, s390 or other
> IOMMUs, so I'll just focus on the ARM world for this example. We basically
> have two page table models, with a multitude of configuration bits:
> 
> 	* ARM LPAE
> 	* ARM short descriptor
> 
> We could define a high-level identifier per page-table model, such as:
> 
> 	#define PG_TABLE_ARM	0x1
> 	#define PG_TABLE_X86	0x2
> 	...
> 
> And each model would define its own structure. On ARM 'format' could be a
> simple u32 defining a variant, LPAE 32/64 or short descriptor. It could
> also contain additional capabilities. Then depending on the variant,
> 'config' would be:
> 
> 	struct pg_config_v7s {
> 		le32	tcr;
> 		le32	prrr;
> 		le32	nmrr;
> 		le32	asid;
> 	};
> 	
> 	struct pg_config_lpae {
> 		le64	tcr;
> 		le64	mair;
> 		le32	asid;
> 	
> 		/* And maybe TTB1? */
> 	};
> 
> 	struct pg_config_arm {
> 		le32	variant;
> 		union ...;
> 	};
> 
> I am really uneasy with describing all those nasty architectural details
> in the virtio-iommu specification. We certainly won't start describing the
> content bit-by-bit of tcr or mair here, but just declaring these fields
> might be sufficient.
> 
> (5) Once the table is attached, the driver can simply write the page
>     tables and expect the physical IOMMU to observe the mappings without
>     any additional request. When changing or removing a mapping, however,
>     the driver must send an invalidate request.
> 
> 	VIRTIO_IOMMU_T_INVALIDATE
> 
> 	struct virtio_iommu_req_invalidate {
> 		le32	address_space;
> 		le32	context;
> 		le32	flags;
> 		le64	virt_addr;
> 		le64	range_size;
> 	
> 		u8	opaque[64];
> 	};
> 
>     'flags' may be:
> 
>     VIRTIO_IOMMU_INVALIDATE_T_VADDR: invalidate a single VA range
>       from 'context' (context is 0 when !F_INDIRECT).
> 
>     And with context tables only (explained below):
> 
>     VIRTIO_IOMMU_INVALIDATE_T_SINGLE: invalidate all mappings from
>       'context' (context is 0 when !F_INDIRECT). virt_addr and range_size
>       are ignored.
> 
>     VIRTIO_IOMMU_INVALIDATE_T_TABLE: with F_INDIRECT, invalidate entries
>       in the table that changed. Device reads the table again, compares it
>       to previous values, and invalidate all mappings for contexts that
>       changed. context, virt_addr and range_size are ignored.
> 
> IOMMUs may offer hints and quirks in their invalidation packets. The
> opaque structure in invalidate would let us transport those. This depends
> on the page table format, and as with architectural page-table
> definitions, I really don't want to have those details in the spec itself.
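> 
> For illustration, removing a single mapping would then look like this
> sketch (as->id and the request helper are assumed):
> 
> 	struct virtio_iommu_req_invalidate inv = {
> 		.address_space	= cpu_to_le32(as->id),
> 		.context	= 0,	/* ignored without F_INDIRECT */
> 		.flags		= cpu_to_le32(VIRTIO_IOMMU_INVALIDATE_T_VADDR),
> 		.virt_addr	= cpu_to_le64(iova),
> 		.range_size	= cpu_to_le64(size),
> 	};
> 
> 	clear_page_tables(pgd, iova, size);	/* remove the mapping */
> 	viommu_send_req_sync(viommu, &inv);	/* then invalidate */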
> 
> 
>   2. Sharing MMU page tables
>   --------------------------
> 
> The guest can share process page-tables with the physical IOMMU. To do
> that, it sends PROBE_TABLE with (F_INDIRECT | F_NATIVE | F_FAULT). The
> page table format is implicit, so the pg_format array can be empty (unless
> the guest wants to query some specific property, e.g. number of levels
> supported by the pIOMMU?). If the host answers with success, the guest can
> send its MMU page table details with ATTACH_TABLE and (F_NATIVE |
> F_INDIRECT | F_FAULT) flags.
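> 
> A sketch of this negotiation, with hypothetical helpers wrapping the two
> requests above:
> 
> 	u32 flags = VIRTIO_IOMMU_ATTACH_TABLE_F_INDIRECT |
> 		    VIRTIO_IOMMU_ATTACH_TABLE_F_NATIVE |
> 		    VIRTIO_IOMMU_ATTACH_TABLE_F_FAULT;
> 
> 	/* Empty pg_format array: the page-table format is implicit */
> 	ret = viommu_probe_table(viommu, flags, NULL, 0);
> 	if (!ret)
> 		ret = viommu_attach_table(viommu, as, flags, ctx_table_gpa,
> 					  nr_contexts);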
> 
> F_FAULT means that the host forwards page requests from the device to the
> guest, and the guest can handle them by mapping the faulting virtual
> address to a page. It is only available with VIRTIO_IOMMU_F_FAULT_QUEUE
> (see below.)
> 
> F_NATIVE means that the pIOMMU pgtable format is the same as the guest
> MMU pgtable format.
> 
> F_INDIRECT means that the 'table' pointer is a context table, instead of a
> page directory. Each slot in the context table points to a page directory:
> 
>                        64              2 1 0
>           table ----> +---------------------+
>                       |       pgd       |0|1|<--- context 0
>                       |       ---       |0|0|<--- context 1
>                       |       pgd       |0|1|
>                       |       ---       |0|0|
>                       |       ---       |0|0|
>                       +---------------------+
>                                          | \___Entry is valid
>                                          |______reserved
> 
> Question: do we want per-context page table format, or can it stay global
> for the whole indirect table?
> 
> A context table makes it possible to provide multiple address spaces for a
> single device. In the simplest form, without F_INDIRECT, we have a single
> address space per device, but some devices may implement more, for
> instance devices with the PCI PASID extension.
> 
> A slot's position in the context table gives an ID, between 0 and
> nr_contexts. The guest can use this ID to have the device target a
> specific address space with DMA. The mechanism to do that is
> device-specific. For a PCI device, the ID is a PASID, and PCI doesn't
> define a specific way of using them for DMA; that is the device driver's
> concern.
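> 
> One possible encoding of a slot, matching the layout above (bit 0 is the
> valid bit, bit 1 is reserved, and the upper bits hold the page directory
> address):
> 
> 	#define CTX_VALID	(1ULL << 0)
> 
> 	static inline u64 ctx_table_entry(phys_addr_t pgd, bool valid)
> 	{
> 		/* pgd is aligned, so bits [1:0] are free for flags */
> 		return ((u64)pgd & ~3ULL) | (valid ? CTX_VALID : 0);
> 	}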
> 
> 
>   3. Fault reporting
>   ------------------
> 
> VIRTIO_IOMMU_F_EVENT_QUEUE
> 
> With this feature, an event virtqueue (1) is available. For now it will
> only be used for fault handling, but I'm calling it eventq so that other
> asynchronous features can piggy-back on it. The device may report faults and
> page requests by sending buffers via the used ring.
> 
> 	#define VIRTIO_IOMMU_T_FAULT	0x05
> 
> 	struct virtio_iommu_evt_fault {
> 		struct virtio_iommu_evt_head {
> 			u8 type;
> 			u8 reserved[3];
> 		};
> 	
> 		u32 address_space;
> 		u32 context;
> 	
> 		u64 vaddr;
> 		u32 flags;	/* Access details: R/W/X */
> 	
> 		/* In the reply: */
> 		u32 reply;	/* Fault handled, or failure */
> 		u64 paddr;
> 	};
> 
> The driver must send the reply via the request queue, with the fault status
> in 'reply', and the mapped page in 'paddr' on success.
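> 
> A sketch of the guest-side handling, with hypothetical helpers for
> resolving the fault and posting the reply on the request queue:
> 
> 	static void viommu_handle_fault(struct viommu_dev *viommu,
> 					struct virtio_iommu_evt_fault *evt)
> 	{
> 		u64 paddr;
> 
> 		evt->reply = handle_page_request(evt->address_space,
> 						 evt->context, evt->vaddr,
> 						 evt->flags, &paddr);
> 		if (!evt->reply)	/* fault handled, return the page */
> 			evt->paddr = paddr;
> 
> 		viommu_send_reply(viommu, evt);
> 	}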
> 
> Existing fault handling interfaces such as PRI have a tag (PRG) that
> identifies a page request (or group thereof) when sending a reply. I
> wonder if this would be useful to us, but it seems like the
> (address_space, context, vaddr) tuple is sufficient to identify a page
> fault, provided the device doesn't send duplicate faults. Duplicate faults
> could be required if they have a side effect, for instance implementing a
> poor man's doorbell. If this is desirable, we could add a fault_id field.
> 
> 
>   4. Host implementation with VFIO
>   --------------------------------
> 
> The VFIO interface for sharing page tables is being worked on at the
> moment by Intel. Other virtual IOMMU implementations will most likely let
> the guest manage full context tables (PASID tables) itself, giving the
> context table pointer to the pIOMMU via a VFIO ioctl.
> 
> For the architecture-agnostic virtio-iommu however, we shouldn't have to
> implement all possible formats of context table (they are at least
> different between ARM SMMU and Intel IOMMU, and will certainly be extended
> in future physical IOMMU architectures.) In addition, most users might
> only care about having one page directory per device, as SVM is a luxury
> at the moment and few devices support it. For these reasons, we should
> allow passing single page directories via VFIO, using structures very
> similar to those described above, whilst reusing the VFIO channel developed
> for Intel vIOMMU.
> 
> 	* VFIO_SVM_INFO: probe page table formats
> 	* VFIO_SVM_BIND: set pgd and arch-specific configuration
> 
> There is an inconvenience in letting the pIOMMU driver manage the guest's
> context table. During a page table walk, the pIOMMU translates the context
> table pointer using the stage-2 page tables. The context table must
> therefore be mapped in guest-physical space by the pIOMMU driver. One
> solution is to let the pIOMMU driver reserve some GPA space upfront using
> the iommu and sysfs resv API [1]. The host would then carve that region
> out of the guest-physical space using a firmware mechanism (for example DT
> reserved-memory node).
> 
> 
>   III. Relaxed operations
>   =======================
> 
> VIRTIO_IOMMU_F_RELAXED
> 
> Adding an IOMMU dramatically reduces the performance of a device, because
> map/unmap operations are costly and produce a lot of TLB traffic. For
> significant performance improvements, the device might allow the driver to
> sacrifice safety for speed. In this mode, the driver does not need to send
> UNMAP requests. The semantics of MAP change and are more complex to
> implement. Given a MAP([start:end] -> phys, flags) request:
> 
> (1) If [start:end] isn't mapped, request succeeds as usual.
> (2) If [start:end] overlaps an existing mapping [old_start:old_end], we
>     unmap [max(start, old_start):min(end, old_end)] and replace it with
>     [start:end].
> (3) If [start:end] overlaps an existing mapping that matches the new map
>     request exactly (same flags, same phys address), the old mapping is
>     kept.
> 
> This squashing could be performed by the guest. The driver can catch unmap
> requests from the DMA layer, and only relay map requests for (1) and (2).
> A MAP request is therefore able to split and partially override an
> existing mapping, which isn't allowed in non-relaxed mode. UNMAP requests
> are unnecessary, but are now allowed to split or carve holes in mappings.
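> 
> A sketch of the guest-side squashing, assuming a tree of live mappings
> and hypothetical helpers:
> 
> 	old = find_overlapping_mapping(as, start, end);
> 	if (old && old->start == start && old->end == end &&
> 	    old->phys == phys && old->flags == flags)
> 		return 0;	/* case (3): keep the existing mapping */
> 
> 	/* cases (1) and (2): the device replaces any overlap itself */
> 	record_mapping(as, start, end, phys, flags);
> 	return viommu_send_map(as, start, end, phys, flags);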
> 
> In this model, a MAP request may take longer, but we may have a net gain
> by removing a lot of redundant requests. Squashing series of map/unmap
> performed by the guest for the same mapping improves temporal reuse of
> IOVA mappings, which I can observe by simply dumping IOMMU activity of a
> virtio device. It reduces the number of TLB invalidations to the strict
> minimum while keeping correctness of DMA operations (provided the device
> obeys its driver). There is a good read on the subject of optimistic
> teardown in paper [2].
> 
> This model is completely unsafe. A stale DMA transaction might access a
> page long after the device driver in the guest unmapped it and
> decommissioned the page. The DMA transaction might hit a completely
> different part of the system that is now reusing the page. Existing
> relaxed implementations attempt to mitigate the risk by setting a timeout
> on the teardown. Unmap requests from device drivers are not discarded
> entirely, but buffered and sent at a later time. Paper [2] reports good
> results with a 10ms delay.
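> 
> A sketch of such a deferred teardown, with a hypothetical per-address-space
> list that a timer flushes as one batch of UNMAP requests:
> 
> 	struct deferred_unmap {
> 		struct list_head	list;
> 		u64			iova, size;
> 		ktime_t			deadline;
> 	};
> 
> 	static void viommu_defer_unmap(struct viommu_as *as, u64 iova, u64 size)
> 	{
> 		struct deferred_unmap *d = kmalloc(sizeof(*d), GFP_ATOMIC);
> 
> 		if (!d) {
> 			viommu_send_unmap(as, iova, size);	/* fall back */
> 			return;
> 		}
> 
> 		d->iova = iova;
> 		d->size = size;
> 		d->deadline = ktime_add_ms(ktime_get(), 10);	/* as in [2] */
> 		list_add_tail(&d->list, &as->deferred_unmaps);
> 	}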
> 
> We could add a way for device and driver to negotiate a vulnerability
> window to mitigate the risk of DMA attacks. The driver might not accept a
> window at all, since keeping delayed mappings requires more infrastructure.
> In my opinion, it should be made clear that regardless of the
> duration of this window, any driver accepting F_RELAXED feature makes the
> guest completely vulnerable, and the choice boils down to either isolation
> or speed, not a bit of both.
> 
> 
>   IV. Misc
>   ========
> 
> I think we have enough to go on for a while. To improve MAP throughput, I
> considered adding a MAP_SG request depending on a feature bit, with
> variable size:
> 
> 	struct virtio_iommu_req_map_sg {
> 		struct virtio_iommu_req_head;
> 		u32	address_space;
> 		u32	nr_elems;
> 		u64	virt_addr;
> 		u64	size;
> 		u64	phys_addr[nr_elems];
> 	};
> 
> This would create the following mappings:
> 
> 	virt_addr		-> phys_addr[0]
> 	virt_addr + size	-> phys_addr[1]
> 	virt_addr + 2 * size	-> phys_addr[2]
> 	...
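> 
> On the device side, expanding such a request would be a simple loop
> (sketch, with a hypothetical map_one() helper):
> 
> 	for (i = 0; i < nr_elems; i++)
> 		map_one(as, virt_addr + i * size, phys_addr[i], size, flags);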
> 
> This would avoid the overhead of multiple map commands. We could try to
> find a more cunning format to compress virtually-contiguous mappings with
> different (phys, size) pairs as well. But Linux drivers rarely prefer
> map_sg() functions over regular map(), so I don't know if the whole map_sg
> feature is worth the effort. All we would gain is a few bytes anyway.
> 
> My current map_sg implementation in the virtio-iommu driver adds a batch
> of map requests to the queue and kicks the host once. That might be enough
> of an optimization.
> 
> 
> Another invasive optimization would be adding grouped requests. By adding
> two flags in the header, L and G, we can group sequences of requests
> together, and have one status at the end, either 0 if all requests in the
> group succeeded, or the status of the first request that failed. This is
> all in-order. Requests in a group follow each other; there is no sequence
> identifier.
> 
> 	                       ___ L: request is last in the group
> 	                      /  _ G: request is part of a group
> 	                     |  /
> 	                     v v
> 	31                   9 8 7      0
> 	+--------------------------------+ <------- RO descriptor
> 	|        res0       |0|1|  type  |
> 	+--------------------------------+
> 	|            payload             |
> 	+--------------------------------+
> 	|        res0       |0|1|  type  |
> 	+--------------------------------+
> 	|            payload             |
> 	+--------------------------------+
> 	|        res0       |0|1|  type  |
> 	+--------------------------------+
> 	|            payload             |
> 	+--------------------------------+
> 	|        res0       |1|1|  type  |
> 	+--------------------------------+
> 	|            payload             |
> 	+--------------------------------+ <------- WO descriptor
> 	|        res0           | status |
> 	+--------------------------------+
> 
> This adds some complexity to the device, since it must unroll whatever was
> done by successful requests in a group as soon as one fails, and reject
> all subsequent ones. A group of requests is an atomic operation. As with
> map_sg, this change mostly saves space and virtio descriptors.
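> 
> Device-side handling could look like this sketch, with hypothetical
> handle_req()/undo_req() helpers:
> 
> 	for (i = 0; i < nr_reqs; i++) {
> 		status = handle_req(&reqs[i]);
> 		if (status)
> 			break;
> 	}
> 
> 	if (status)
> 		while (i--)
> 			undo_req(&reqs[i]);	/* unroll the group */
> 
> 	write_status(desc, status);	/* single status for the group */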
> 
> 
> [1] https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-kernel-iommu_groups
> [2] vIOMMU: Efficient IOMMU Emulation
>     N. Amit, M. Ben-Yehuda, D. Tsafrir, A. Schuster

^ permalink raw reply	[flat|nested] 99+ messages in thread

* RE: [RFC PATCH kvmtool 00/15] Add virtio-iommu
  2017-04-07 19:24 ` [RFC PATCH kvmtool 00/15] Add virtio-iommu Jean-Philippe Brucker
                     ` (29 preceding siblings ...)
  2017-04-07 19:24   ` Jean-Philippe Brucker
@ 2017-05-22  8:26   ` Bharat Bhushan
       [not found]   ` <20170407192455.26814-1-jean-philippe.brucker-5wv7dgnIgG8@public.gmane.org>
  31 siblings, 0 replies; 99+ messages in thread
From: Bharat Bhushan @ 2017-05-22  8:26 UTC (permalink / raw)
  To: Jean-Philippe Brucker, iommu, kvm, virtualization, virtio-dev
  Cc: cdall, lorenzo.pieralisi, mst, marc.zyngier, joro, will.deacon,
	robin.murphy

Hi Jean,

I am trying to run and review this on my side, but I see the Linux patches are not based on the latest kernel version.
Would it be possible for you to share references to your Linux and kvmtool git repositories?

Thanks
-Bharat

> -----Original Message-----
> From: virtualization-bounces@lists.linux-foundation.org
> [mailto:virtualization-bounces@lists.linux-foundation.org] On Behalf Of Jean-
> Philippe Brucker
> Sent: Saturday, April 08, 2017 12:55 AM
> To: iommu@lists.linux-foundation.org; kvm@vger.kernel.org;
> virtualization@lists.linux-foundation.org; virtio-dev@lists.oasis-open.org
> Cc: cdall@linaro.org; lorenzo.pieralisi@arm.com; mst@redhat.com;
> marc.zyngier@arm.com; joro@8bytes.org; will.deacon@arm.com;
> robin.murphy@arm.com
> Subject: [RFC PATCH kvmtool 00/15] Add virtio-iommu
> 
> Implement a virtio-iommu device and translate DMA traffic from vfio and
> virtio devices. Virtio needed some rework to support scatter-gather accesses
> to vring and buffers at page granularity. Patch 3 implements the actual virtio-
> iommu device.
> 
> Adding --viommu on the command-line now inserts a virtual IOMMU in front
> of all virtio and vfio devices:
> 
> 	$ lkvm run -k Image --console virtio -p console=hvc0 \
> 	           --viommu --vfio 0 --vfio 4 --irqchip gicv3-its
> 	...
> 	[    2.998949] virtio_iommu virtio0: probe successful
> 	[    3.007739] virtio_iommu virtio1: probe successful
> 	...
> 	[    3.165023] iommu: Adding device 0000:00:00.0 to group 0
> 	[    3.536480] iommu: Adding device 10200.virtio to group 1
> 	[    3.553643] iommu: Adding device 10600.virtio to group 2
> 	[    3.570687] iommu: Adding device 10800.virtio to group 3
> 	[    3.627425] iommu: Adding device 10a00.virtio to group 4
> 	[    7.823689] iommu: Adding device 0000:00:01.0 to group 5
> 	...
> 
> Patches 13 and 14 add debug facilities. Some statistics are gathered for each
> address space and can be queried via the debug builtin:
> 
> 	$ lkvm debug -n guest-1210 --iommu stats
> 	iommu 0 "viommu-vfio"
> 	  kicks                 1255
> 	  requests              1256
> 	  ioas 1
> 	    maps                7
> 	    unmaps              4
> 	    resident            2101248
> 	  ioas 6
> 	    maps                623
> 	    unmaps              620
> 	    resident            16384
> 	iommu 1 "viommu-virtio"
> 	  kicks                 11426
> 	  requests              11431
> 	  ioas 2
> 	    maps                2836
> 	    unmaps              2835
> 	    resident            8192
> 	    accesses            2836
> 	...
> 
> This is based on the VFIO patchset[1], itself based on Andre's ITS work.
> The VFIO bits have only been tested on a software model and are unlikely to
> work on actual hardware, but I also tested virtio on an ARM Juno.
> 
> [1] http://www.spinics.net/lists/kvm/msg147624.html
> 
> Jean-Philippe Brucker (15):
>   virtio: synchronize virtio-iommu headers with Linux
>   FDT: (re)introduce a dynamic phandle allocator
>   virtio: add virtio-iommu
>   Add a simple IOMMU
>   iommu: describe IOMMU topology in device-trees
>   irq: register MSI doorbell addresses
>   virtio: factor virtqueue initialization
>   virtio: add vIOMMU instance for virtio devices
>   virtio: access vring and buffers through IOMMU mappings
>   virtio-pci: translate MSIs with the virtual IOMMU
>   virtio: set VIRTIO_F_IOMMU_PLATFORM when necessary
>   vfio: add support for virtual IOMMU
>   virtio-iommu: debug via IPC
>   virtio-iommu: implement basic debug commands
>   virtio: use virtio-iommu when available
> 
>  Makefile                          |   3 +
>  arm/gic.c                         |   4 +
>  arm/include/arm-common/fdt-arch.h |   2 +-
>  arm/pci.c                         |  49 ++-
>  builtin-debug.c                   |   8 +-
>  builtin-run.c                     |   2 +
>  fdt.c                             |  35 ++
>  include/kvm/builtin-debug.h       |   6 +
>  include/kvm/devices.h             |   4 +
>  include/kvm/fdt.h                 |  20 +
>  include/kvm/iommu.h               | 105 +++++
>  include/kvm/irq.h                 |   3 +
>  include/kvm/kvm-config.h          |   1 +
>  include/kvm/vfio.h                |   2 +
>  include/kvm/virtio-iommu.h        |  15 +
>  include/kvm/virtio-mmio.h         |   1 +
>  include/kvm/virtio-pci.h          |   2 +
>  include/kvm/virtio.h              | 137 +++++-
>  include/linux/virtio_config.h     |  74 ++++
>  include/linux/virtio_ids.h        |   4 +
>  include/linux/virtio_iommu.h      | 142 ++++++
>  iommu.c                           | 240 ++++++++++
>  irq.c                             |  35 ++
>  kvm-ipc.c                         |  43 +-
>  mips/include/kvm/fdt-arch.h       |   2 +-
>  powerpc/include/kvm/fdt-arch.h    |   2 +-
>  vfio.c                            | 281 +++++++++++-
>  virtio/9p.c                       |   7 +-
>  virtio/balloon.c                  |   7 +-
>  virtio/blk.c                      |  10 +-
>  virtio/console.c                  |   7 +-
>  virtio/core.c                     | 240 ++++++++--
>  virtio/iommu.c                    | 902 ++++++++++++++++++++++++++++++++++++++
>  virtio/mmio.c                     |  44 +-
>  virtio/net.c                      |   8 +-
>  virtio/pci.c                      |  61 ++-
>  virtio/rng.c                      |   6 +-
>  virtio/scsi.c                     |   6 +-
>  x86/include/kvm/fdt-arch.h        |   2 +-
>  39 files changed, 2389 insertions(+), 133 deletions(-)
>  create mode 100644 fdt.c
>  create mode 100644 include/kvm/iommu.h
>  create mode 100644 include/kvm/virtio-iommu.h
>  create mode 100644 include/linux/virtio_config.h
>  create mode 100644 include/linux/virtio_iommu.h
>  create mode 100644 iommu.c
>  create mode 100644 virtio/iommu.c
> 
> --
> 2.12.1
> 
> _______________________________________________
> Virtualization mailing list
> Virtualization@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC PATCH kvmtool 00/15] Add virtio-iommu
  2017-05-22  8:26     ` Bharat Bhushan
@ 2017-05-22 14:01       ` Jean-Philippe Brucker
       [not found]       ` <AM5PR0401MB2545FADDF2A7649DF0DB68309AF80-oQ3wXcTHOqrg6d/1FbYcvI3W/0Ik+aLCnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
  1 sibling, 0 replies; 99+ messages in thread
From: Jean-Philippe Brucker @ 2017-05-22 14:01 UTC (permalink / raw)
  To: Bharat Bhushan, iommu, kvm, virtualization, virtio-dev
  Cc: cdall, lorenzo.pieralisi, mst, marc.zyngier, joro, will.deacon,
	robin.murphy

Hi Bharat,

On 22/05/17 09:26, Bharat Bhushan wrote:
> Hi Jean,
> 
> I am trying to run and review this on my side, but I see the Linux patches are not based on the latest kernel version.
> Would it be possible for you to share references to your Linux and kvmtool git repositories?

Please find linux and kvmtool patches at the following repos:

git://linux-arm.org/kvmtool-jpb.git virtio-iommu/base
git://linux-arm.org/linux-jpb.git virtio-iommu/base

Note that these branches are unstable, subject to fixes and rebase. I'll
try to keep them in sync with upstream.

Thanks,
Jean

^ permalink raw reply	[flat|nested] 99+ messages in thread

* RE: [virtio-dev] [RFC PATCH linux] iommu: Add virtio-iommu driver
  2017-04-07 19:23 ` [RFC PATCH linux] iommu: Add virtio-iommu driver Jean-Philippe Brucker
@ 2017-06-16  8:48   ` Bharat Bhushan
  2017-06-16 11:36     ` Jean-Philippe Brucker
  2017-06-16 11:36     ` Jean-Philippe Brucker
  2017-06-16  8:48   ` Bharat Bhushan
  1 sibling, 2 replies; 99+ messages in thread
From: Bharat Bhushan @ 2017-06-16  8:48 UTC (permalink / raw)
  To: Jean-Philippe Brucker, iommu, kvm, virtualization, virtio-dev
  Cc: cdall, will.deacon, robin.murphy, lorenzo.pieralisi, joro, mst,
	jasowang, alex.williamson, marc.zyngier

Hi Jean

> -----Original Message-----
> From: virtio-dev@lists.oasis-open.org [mailto:virtio-dev@lists.oasis-
> open.org] On Behalf Of Jean-Philippe Brucker
> Sent: Saturday, April 08, 2017 12:53 AM
> To: iommu@lists.linux-foundation.org; kvm@vger.kernel.org;
> virtualization@lists.linux-foundation.org; virtio-dev@lists.oasis-open.org
> Cc: cdall@linaro.org; will.deacon@arm.com; robin.murphy@arm.com;
> lorenzo.pieralisi@arm.com; joro@8bytes.org; mst@redhat.com;
> jasowang@redhat.com; alex.williamson@redhat.com;
> marc.zyngier@arm.com
> Subject: [virtio-dev] [RFC PATCH linux] iommu: Add virtio-iommu driver
> 
> The virtio IOMMU is a para-virtualized device that allows sending IOMMU
> requests such as map/unmap over the virtio-mmio transport. This driver
> should illustrate the initial proposal for virtio-iommu, which you hopefully
> received with it. It handles attach, detach, map and unmap requests.
> 
> The bulk of the code is to create requests and send them through virtio.
> Implementing the IOMMU API is fairly straightforward since the virtio-iommu
> MAP/UNMAP interface is almost identical. I threw in a custom
> map_sg() function which takes up some space, but is optional. The core
> function would send a sequence of map requests, waiting for a reply
> between each mapping. This optimization avoids yielding to the host after
> each map, and instead prepares a batch of requests in the virtio ring and
> kicks the host once.
> 
> It must be applied on top of the probe deferral work for IOMMU, currently
> under discussion. This dissociates early driver detection from device
> probing: device-tree or ACPI is parsed early to find which devices are
> translated by the IOMMU, but the IOMMU itself cannot be probed until the
> core virtio module is loaded.
> 
> Enabling DEBUG makes it extremely verbose at the moment, but it should be
> calmer in next versions.
> 
> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> ---
>  drivers/iommu/Kconfig             |  11 +
>  drivers/iommu/Makefile            |   1 +
>  drivers/iommu/virtio-iommu.c      | 980 ++++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/Kbuild         |   1 +
>  include/uapi/linux/virtio_ids.h   |   1 +
>  include/uapi/linux/virtio_iommu.h | 142 ++++++
>  6 files changed, 1136 insertions(+)
>  create mode 100644 drivers/iommu/virtio-iommu.c
>  create mode 100644 include/uapi/linux/virtio_iommu.h
> 
> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig index
> 37e204f3d9be..8cd56ee9a93a 100644
> --- a/drivers/iommu/Kconfig
> +++ b/drivers/iommu/Kconfig
> @@ -359,4 +359,15 @@ config MTK_IOMMU_V1
> 
>  	  if unsure, say N here.
> 
> +config VIRTIO_IOMMU
> +	tristate "Virtio IOMMU driver"
> +	depends on VIRTIO_MMIO
> +	select IOMMU_API
> +	select INTERVAL_TREE
> +	select ARM_DMA_USE_IOMMU if ARM
> +	help
> +	  Para-virtualised IOMMU driver with virtio.
> +
> +	  Say Y here if you intend to run this kernel as a guest.
> +
>  endif # IOMMU_SUPPORT
> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile index
> 195f7b997d8e..1199d8475802 100644
> --- a/drivers/iommu/Makefile
> +++ b/drivers/iommu/Makefile
> @@ -27,3 +27,4 @@ obj-$(CONFIG_TEGRA_IOMMU_SMMU) += tegra-smmu.o
>  obj-$(CONFIG_EXYNOS_IOMMU) += exynos-iommu.o
>  obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o
>  obj-$(CONFIG_S390_IOMMU) += s390-iommu.o
> +obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o
> diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
> new file mode 100644 index 000000000000..1cf4f57b7817
> --- /dev/null
> +++ b/drivers/iommu/virtio-iommu.c
> @@ -0,0 +1,980 @@
> +/*
> + * Virtio driver for the paravirtualized IOMMU
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307,
> + * USA.
> + *
> + * Copyright (C) 2017 ARM Limited
> + *
> + * Author: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> + */
> +
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include <linux/amba/bus.h>
> +#include <linux/delay.h>
> +#include <linux/dma-iommu.h>
> +#include <linux/freezer.h>
> +#include <linux/interval_tree.h>
> +#include <linux/iommu.h>
> +#include <linux/module.h>
> +#include <linux/of_iommu.h>
> +#include <linux/of_platform.h>
> +#include <linux/platform_device.h>
> +#include <linux/virtio.h>
> +#include <linux/virtio_config.h>
> +#include <linux/virtio_ids.h>
> +#include <linux/wait.h>
> +
> +#include <uapi/linux/virtio_iommu.h>
> +
> +struct viommu_dev {
> +	struct iommu_device		iommu;
> +	struct device			*dev;
> +	struct virtio_device		*vdev;
> +
> +	struct virtqueue		*vq;
> +	struct list_head		pending_requests;
> +	/* Serialize anything touching the vq and the request list */
> +	spinlock_t			vq_lock;
> +
> +	struct list_head		list;
> +
> +	/* Device configuration */
> +	u64				pgsize_bitmap;
> +	u64				aperture_start;
> +	u64				aperture_end;
> +};
> +
> +struct viommu_mapping {
> +	phys_addr_t			paddr;
> +	struct interval_tree_node	iova;
> +};
> +
> +struct viommu_domain {
> +	struct iommu_domain		domain;
> +	struct viommu_dev		*viommu;
> +	struct mutex			mutex;
> +	u64				id;
> +
> +	spinlock_t			mappings_lock;
> +	struct rb_root			mappings;
> +
> +	/* Number of devices attached to this domain */
> +	unsigned long			attached;
> +};
> +
> +struct viommu_endpoint {
> +	struct viommu_dev		*viommu;
> +	struct viommu_domain		*vdomain;
> +};
> +
> +struct viommu_request {
> +	struct scatterlist		head;
> +	struct scatterlist		tail;
> +
> +	int				written;
> +	struct list_head		list;
> +};
> +
> +/* TODO: use an IDA */
> +static atomic64_t viommu_domain_ids_gen;
> +
> +#define to_viommu_domain(domain) container_of(domain, struct
> +viommu_domain, domain)
> +
> +/* Virtio transport */
> +
> +static int viommu_status_to_errno(u8 status) {
> +	switch (status) {
> +	case VIRTIO_IOMMU_S_OK:
> +		return 0;
> +	case VIRTIO_IOMMU_S_UNSUPP:
> +		return -ENOSYS;
> +	case VIRTIO_IOMMU_S_INVAL:
> +		return -EINVAL;
> +	case VIRTIO_IOMMU_S_RANGE:
> +		return -ERANGE;
> +	case VIRTIO_IOMMU_S_NOENT:
> +		return -ENOENT;
> +	case VIRTIO_IOMMU_S_FAULT:
> +		return -EFAULT;
> +	case VIRTIO_IOMMU_S_IOERR:
> +	case VIRTIO_IOMMU_S_DEVERR:
> +	default:
> +		return -EIO;
> +	}
> +}
> +
> +static int viommu_get_req_size(struct virtio_iommu_req_head *req, size_t *head,
> +			       size_t *tail)
> +{
> +	size_t size;
> +	union virtio_iommu_req r;
> +
> +	*tail = sizeof(struct virtio_iommu_req_tail);
> +
> +	switch (req->type) {
> +	case VIRTIO_IOMMU_T_ATTACH:
> +		size = sizeof(r.attach);
> +		break;
> +	case VIRTIO_IOMMU_T_DETACH:
> +		size = sizeof(r.detach);
> +		break;
> +	case VIRTIO_IOMMU_T_MAP:
> +		size = sizeof(r.map);
> +		break;
> +	case VIRTIO_IOMMU_T_UNMAP:
> +		size = sizeof(r.unmap);
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	*head = size - *tail;
> +	return 0;
> +}
> +
> +static int viommu_receive_resp(struct viommu_dev *viommu, int nr_expected)
> +{
> +	unsigned int len;
> +	int nr_received = 0;
> +	struct viommu_request *req, *pending, *next;
> +
> +	pending = list_first_entry_or_null(&viommu->pending_requests,
> +					   struct viommu_request, list);
> +	if (WARN_ON(!pending))
> +		return 0;
> +
> +	while ((req = virtqueue_get_buf(viommu->vq, &len)) != NULL) {
> +		if (req != pending) {
> +			dev_warn(viommu->dev, "discarding stale request\n");
> +			continue;
> +		}
> +
> +		pending->written = len;
> +
> +		if (++nr_received == nr_expected) {
> +			list_del(&pending->list);
> +			/*
> +			 * In an ideal world, we'd wake up the waiter for this
> +			 * group of requests here. But everything is painfully
> +			 * synchronous, so waiter is the caller.
> +			 */
> +			break;
> +		}
> +
> +		next = list_next_entry(pending, list);
> +		list_del(&pending->list);
> +
> +		if (WARN_ON(list_empty(&viommu->pending_requests)))
> +			return 0;
> +
> +		pending = next;
> +	}
> +
> +	return nr_received;
> +}
> +
> +/* Must be called with vq_lock held */
> +static int _viommu_send_reqs_sync(struct viommu_dev *viommu,
> +				  struct viommu_request *req, int nr,
> +				  int *nr_sent)
> +{
> +	int i, ret;
> +	ktime_t timeout;
> +	int nr_received = 0;
> +	struct scatterlist *sg[2];
> +	/*
> +	 * FIXME: as it stands, 1s timeout per request. This is a voluntary
> +	 * exaggeration because I have no idea how real our ktime is. Are we
> +	 * using a RTC? Are we aware of steal time? I don't know much about
> +	 * this, need to do some digging.
> +	 */
> +	unsigned long timeout_ms = 1000;
> +
> +	*nr_sent = 0;
> +
> +	for (i = 0; i < nr; i++, req++) {
> +		/*
> +		 * The backend will allocate one indirect descriptor for each
> +		 * request, which allows to double the ring consumption, but
> +		 * might be slower.
> +		 */
> +		req->written = 0;
> +
> +		sg[0] = &req->head;
> +		sg[1] = &req->tail;
> +
> +		ret = virtqueue_add_sgs(viommu->vq, sg, 1, 1, req,
> +					GFP_ATOMIC);
> +		if (ret)
> +			break;
> +
> +		list_add_tail(&req->list, &viommu->pending_requests);
> +	}
> +
> +	if (i && !virtqueue_kick(viommu->vq))
> +		return -EPIPE;
> +
> +	/*
> +	 * Absolutely no wiggle room here. We're not allowed to sleep as
> +	 * callers might be holding spinlocks, so we have to poll like savages
> +	 * until something appears. Hopefully the host already handled the
> +	 * request during the above kick and returned it to us.
> +	 *
> +	 * A nice improvement would be for the caller to tell us if we can
> +	 * sleep whilst mapping, but this has to go through the IOMMU/DMA API.
> +	 */
> +	timeout = ktime_add_ms(ktime_get(), timeout_ms * i);
> +	while (nr_received < i && ktime_before(ktime_get(), timeout)) {
> +		nr_received += viommu_receive_resp(viommu, i - nr_received);
> +		if (nr_received < i) {
> +			/*
> +			 * FIXME: what's a good way to yield to host? A second
> +			 * virtqueue_kick won't have any effect since we
> +			 * haven't added any descriptor.
> +			 */
> +			udelay(10);
> +		}
> +	}
> +	dev_dbg(viommu->dev, "request took %lld us\n",
> +		ktime_us_delta(ktime_get(),
> +			       ktime_sub_ms(timeout, timeout_ms * i)));
> +
> +	if (nr_received != i)
> +		ret = -ETIMEDOUT;
> +
> +	if (ret == -ENOSPC && nr_received)
> +		/*
> +		 * We've freed some space since virtio told us that the ring is
> +		 * full, tell the caller to come back later (after releasing the
> +		 * lock first, to be fair to other threads)
> +		 */
> +		ret = -EAGAIN;
> +
> +	*nr_sent = nr_received;
> +
> +	return ret;
> +}
> +
> +/**
> + * viommu_send_reqs_sync - add a batch of requests, kick the host and
> + *                         wait for them to return
> + *
> + * @req: array of requests
> + * @nr: size of the array
> + * @nr_sent: contains the number of requests actually sent after this
> + *           function returns
> + *
> + * Return 0 on success, or an error if we failed to send some of the
> + * requests.
> + */
> +static int viommu_send_reqs_sync(struct viommu_dev *viommu,
> +				 struct viommu_request *req, int nr,
> +				 int *nr_sent)
> +{
> +	int ret;
> +	int sent = 0;
> +	unsigned long flags;
> +
> +	*nr_sent = 0;
> +	do {
> +		spin_lock_irqsave(&viommu->vq_lock, flags);
> +		ret = _viommu_send_reqs_sync(viommu, req, nr, &sent);
> +		spin_unlock_irqrestore(&viommu->vq_lock, flags);
> +
> +		*nr_sent += sent;
> +		req += sent;
> +		nr -= sent;
> +	} while (ret == -EAGAIN);
> +
> +	return ret;
> +}
> +
> +/**
> + * viommu_send_req_sync - send one request and wait for reply
> + *
> + * @head_ptr: pointer to a virtio_iommu_req_* structure
> + *
> + * Returns 0 if the request was successful, or an error number otherwise.
> + * No distinction is done between transport and request errors.
> + */
> +static int viommu_send_req_sync(struct viommu_dev *viommu, void *head_ptr)
> +{
> +	int ret;
> +	int nr_sent;
> +	struct viommu_request req;
> +	size_t head_size, tail_size;
> +	struct virtio_iommu_req_tail *tail;
> +	struct virtio_iommu_req_head *head = head_ptr;
> +
> +	ret = viommu_get_req_size(head, &head_size, &tail_size);
> +	if (ret)
> +		return ret;
> +
> +	dev_dbg(viommu->dev, "Sending request 0x%x, %zu bytes\n",
> +		head->type, head_size + tail_size);
> +
> +	tail = head_ptr + head_size;
> +
> +	sg_init_one(&req.head, head, head_size);
> +	sg_init_one(&req.tail, tail, tail_size);
> +
> +	ret = viommu_send_reqs_sync(viommu, &req, 1, &nr_sent);
> +	if (ret || !req.written || nr_sent != 1) {
> +		dev_err(viommu->dev, "failed to send command\n");
> +		return -EIO;
> +	}
> +
> +	ret = viommu_status_to_errno(tail->status);
> +
> +	if (ret)
> +		dev_dbg(viommu->dev, " completed with %d\n", ret);
> +
> +	return ret;
> +}
> +
> +static int viommu_tlb_map(struct viommu_domain *vdomain, unsigned long iova,
> +			  phys_addr_t paddr, size_t size)
> +{
> +	unsigned long flags;
> +	struct viommu_mapping *mapping;
> +
> +	mapping = kzalloc(sizeof(*mapping), GFP_ATOMIC);
> +	if (!mapping)
> +		return -ENOMEM;
> +
> +	mapping->paddr = paddr;
> +	mapping->iova.start = iova;
> +	mapping->iova.last = iova + size - 1;
> +
> +	spin_lock_irqsave(&vdomain->mappings_lock, flags);
> +	interval_tree_insert(&mapping->iova, &vdomain->mappings);
> +	spin_unlock_irqrestore(&vdomain->mappings_lock, flags);
> +
> +	return 0;
> +}
> +
> +static size_t viommu_tlb_unmap(struct viommu_domain *vdomain,
> +			       unsigned long iova, size_t size) {
> +	size_t unmapped = 0;
> +	unsigned long flags;
> +	unsigned long last = iova + size - 1;
> +	struct viommu_mapping *mapping = NULL;
> +	struct interval_tree_node *node, *next;
> +
> +	spin_lock_irqsave(&vdomain->mappings_lock, flags);
> +	next = interval_tree_iter_first(&vdomain->mappings, iova, last);
> +	while (next) {
> +		node = next;
> +		mapping = container_of(node, struct viommu_mapping, iova);
> +
> +		next = interval_tree_iter_next(node, iova, last);
> +
> +		/*
> +		 * Note that for a partial range, this will return the full
> +		 * mapping so we avoid sending split requests to the device.
> +		 */
> +		unmapped += mapping->iova.last - mapping->iova.start + 1;
> +
> +		interval_tree_remove(node, &vdomain->mappings);
> +		kfree(mapping);
> +	}
> +	spin_unlock_irqrestore(&vdomain->mappings_lock, flags);
> +
> +	return unmapped;
> +}
> +
> +/* IOMMU API */
> +
> +static bool viommu_capable(enum iommu_cap cap) {
> +	return false; /* :( */
> +}
> +
> +static struct iommu_domain *viommu_domain_alloc(unsigned type) {
> +	struct viommu_domain *vdomain;
> +
> +	if (type != IOMMU_DOMAIN_UNMANAGED && type != IOMMU_DOMAIN_DMA)
> +		return NULL;
> +
> +	vdomain = kzalloc(sizeof(struct viommu_domain), GFP_KERNEL);
> +	if (!vdomain)
> +		return NULL;
> +
> +	vdomain->id = atomic64_inc_return_relaxed(&viommu_domain_ids_gen);
> +
> +	mutex_init(&vdomain->mutex);
> +	spin_lock_init(&vdomain->mappings_lock);
> +	vdomain->mappings = RB_ROOT;
> +
> +	pr_debug("alloc domain of type %d -> %llu\n", type, vdomain->id);
> +
> +	if (type == IOMMU_DOMAIN_DMA &&
> +	    iommu_get_dma_cookie(&vdomain->domain)) {
> +		kfree(vdomain);
> +		return NULL;
> +	}
> +
> +	return &vdomain->domain;
> +}
> +
> +static void viommu_domain_free(struct iommu_domain *domain) {
> +	struct viommu_domain *vdomain = to_viommu_domain(domain);
> +
> +	pr_debug("free domain %llu\n", vdomain->id);
> +
> +	iommu_put_dma_cookie(domain);
> +
> +	/* Free all remaining mappings (size 2^64) */
> +	viommu_tlb_unmap(vdomain, 0, 0);
> +
> +	kfree(vdomain);
> +}
> +
> +static int viommu_attach_dev(struct iommu_domain *domain, struct device *dev)
> +{
> +	int i;
> +	int ret = 0;
> +	struct iommu_fwspec *fwspec = dev->iommu_fwspec;
> +	struct viommu_endpoint *vdev = fwspec->iommu_priv;
> +	struct viommu_domain *vdomain = to_viommu_domain(domain);
> +	struct virtio_iommu_req_attach req = {
> +		.head.type	= VIRTIO_IOMMU_T_ATTACH,
> +		.address_space	= cpu_to_le32(vdomain->id),
> +	};
> +
> +	mutex_lock(&vdomain->mutex);
> +	if (!vdomain->viommu) {
> +		struct viommu_dev *viommu = vdev->viommu;
> +
> +		vdomain->viommu = viommu;
> +
> +		domain->pgsize_bitmap		= viommu->pgsize_bitmap;
> +		domain->geometry.aperture_start	= viommu->aperture_start;
> +		domain->geometry.aperture_end	= viommu->aperture_end;
> +		domain->geometry.force_aperture	= true;
> +
> +	} else if (vdomain->viommu != vdev->viommu) {
> +		dev_err(dev, "cannot attach to foreign VIOMMU\n");
> +		ret = -EXDEV;
> +	}
> +	mutex_unlock(&vdomain->mutex);
> +
> +	if (ret)
> +		return ret;
> +
> +	/*
> +	 * When attaching the device to a new domain, it will be detached
> +	 * from the old one and, if as a result the old domain isn't attached
> +	 * to any device, all mappings are removed from the old domain and it
> +	 * is freed. (Note that we can't use get_domain_for_dev here, it
> +	 * returns the default domain during initial attach.)
> +	 *
> +	 * Take note of the device disappearing, so we can ignore unmap
> +	 * requests on stale domains (that is, between this detach and the
> +	 * upcoming free.)
> +	 *
> +	 * vdev->vdomain is protected by group->mutex
> +	 */
> +	if (vdev->vdomain) {
> +		dev_dbg(dev, "detach from domain %llu\n", vdev->vdomain->id);
> +		vdev->vdomain->attached--;
> +	}
> +
> +	dev_dbg(dev, "attach to domain %llu\n", vdomain->id);
> +
> +	for (i = 0; i < fwspec->num_ids; i++) {
> +		req.device = cpu_to_le32(fwspec->ids[i]);
> +
> +		ret = viommu_send_req_sync(vdomain->viommu, &req);
> +		if (ret)
> +			break;
> +	}
> +
> +	vdomain->attached++;
> +	vdev->vdomain = vdomain;
> +
> +	return ret;
> +}
> +
> +static int viommu_map(struct iommu_domain *domain, unsigned long iova,
> +		      phys_addr_t paddr, size_t size, int prot) {
> +	int ret;
> +	struct viommu_domain *vdomain = to_viommu_domain(domain);
> +	struct virtio_iommu_req_map req = {
> +		.head.type	= VIRTIO_IOMMU_T_MAP,
> +		.address_space	= cpu_to_le32(vdomain->id),
> +		.virt_addr	= cpu_to_le64(iova),
> +		.phys_addr	= cpu_to_le64(paddr),
> +		.size		= cpu_to_le64(size),
> +	};
> +
> +	pr_debug("map %llu 0x%lx -> 0x%llx (%zu)\n", vdomain->id, iova,
> +		 paddr, size);

A query: when tracing the above prints, I see the same physical address mapped at two different virtual addresses. Do you know why the kernel does this?

Thanks
-Bharat

> +
> +	if (!vdomain->attached)
> +		return -ENODEV;
> +
> +	if (prot & IOMMU_READ)
> +		req.flags |= cpu_to_le32(VIRTIO_IOMMU_MAP_F_READ);
> +
> +	if (prot & IOMMU_WRITE)
> +		req.flags |= cpu_to_le32(VIRTIO_IOMMU_MAP_F_WRITE);
> +
> +	ret = viommu_tlb_map(vdomain, iova, paddr, size);
> +	if (ret)
> +		return ret;
> +
> +	ret = viommu_send_req_sync(vdomain->viommu, &req);
> +	if (ret)
> +		viommu_tlb_unmap(vdomain, iova, size);
> +
> +	return ret;
> +}
> +
> +static size_t viommu_unmap(struct iommu_domain *domain, unsigned long iova,
> +			   size_t size)
> +{
> +	int ret;
> +	size_t unmapped;
> +	struct viommu_domain *vdomain = to_viommu_domain(domain);
> +	struct virtio_iommu_req_unmap req = {
> +		.head.type	= VIRTIO_IOMMU_T_UNMAP,
> +		.address_space	= cpu_to_le32(vdomain->id),
> +		.virt_addr	= cpu_to_le64(iova),
> +	};
> +
> +	pr_debug("unmap %llu 0x%lx (%zu)\n", vdomain->id, iova, size);
> +
> +	/* Callers may unmap after detach, but device already took care of it. */
> +	if (!vdomain->attached)
> +		return size;
> +
> +	unmapped = viommu_tlb_unmap(vdomain, iova, size);
> +	if (unmapped < size)
> +		return 0;
> +
> +	req.size = cpu_to_le64(unmapped);
> +
> +	ret = viommu_send_req_sync(vdomain->viommu, &req);
> +	if (ret)
> +		return 0;
> +
> +	return unmapped;
> +}
> +
> +static size_t viommu_map_sg(struct iommu_domain *domain, unsigned long iova,
> +			    struct scatterlist *sg, unsigned int nents, int prot)
> +{
> +	int i, ret;
> +	int nr_sent;
> +	size_t mapped;
> +	size_t min_pagesz;
> +	size_t total_size;
> +	struct scatterlist *s;
> +	unsigned int flags = 0;
> +	unsigned long cur_iova;
> +	unsigned long mapped_iova;
> +	size_t head_size, tail_size;
> +	struct viommu_request reqs[nents];
> +	struct virtio_iommu_req_map map_reqs[nents];
> +	struct viommu_domain *vdomain = to_viommu_domain(domain);
> +
> +	if (!vdomain->attached)
> +		return 0;
> +
> +	pr_debug("map_sg %llu %u 0x%lx\n", vdomain->id, nents, iova);
> +
> +	if (prot & IOMMU_READ)
> +		flags |= VIRTIO_IOMMU_MAP_F_READ;
> +
> +	if (prot & IOMMU_WRITE)
> +		flags |= VIRTIO_IOMMU_MAP_F_WRITE;
> +
> +	min_pagesz = 1 << __ffs(domain->pgsize_bitmap);
> +	tail_size = sizeof(struct virtio_iommu_req_tail);
> +	head_size = sizeof(*map_reqs) - tail_size;
> +
> +	cur_iova = iova;
> +
> +	for_each_sg(sg, s, nents, i) {
> +		size_t size = s->length;
> +		phys_addr_t paddr = sg_phys(s);
> +		void *tail = (void *)&map_reqs[i] + head_size;
> +
> +		if (!IS_ALIGNED(paddr | size, min_pagesz)) {
> +			ret = -EFAULT;
> +			break;
> +		}
> +
> +		/* TODO: merge physically-contiguous mappings if any */
> +		map_reqs[i] = (struct virtio_iommu_req_map) {
> +			.head.type	= VIRTIO_IOMMU_T_MAP,
> +			.address_space	= cpu_to_le32(vdomain->id),
> +			.flags		= cpu_to_le32(flags),
> +			.virt_addr	= cpu_to_le64(cur_iova),
> +			.phys_addr	= cpu_to_le64(paddr),
> +			.size		= cpu_to_le64(size),
> +		};
> +
> +		ret = viommu_tlb_map(vdomain, cur_iova, paddr, size);
> +		if (ret)
> +			break;
> +
> +		sg_init_one(&reqs[i].head, &map_reqs[i], head_size);
> +		sg_init_one(&reqs[i].tail, tail, tail_size);
> +
> +		cur_iova += size;
> +	}
> +
> +	total_size = cur_iova - iova;
> +
> +	if (ret) {
> +		viommu_tlb_unmap(vdomain, iova, total_size);
> +		return 0;
> +	}
> +
> +	ret = viommu_send_reqs_sync(vdomain->viommu, reqs, i, &nr_sent);
> +
> +	if (nr_sent != nents)
> +		goto err_rollback;
> +
> +	for (i = 0; i < nents; i++) {
> +		if (!reqs[i].written || map_reqs[i].tail.status)
> +			goto err_rollback;
> +	}
> +
> +	return total_size;
> +
> +err_rollback:
> +	/*
> +	 * Any request in the range might have failed. Unmap what was
> +	 * successful.
> +	 */
> +	cur_iova = iova;
> +	mapped_iova = iova;
> +	mapped = 0;
> +	for_each_sg(sg, s, nents, i) {
> +		size_t size = s->length;
> +
> +		cur_iova += size;
> +
> +		if (!reqs[i].written || map_reqs[i].tail.status) {
> +			if (mapped)
> +				viommu_unmap(domain, mapped_iova, mapped);
> +
> +			mapped_iova = cur_iova;
> +			mapped = 0;
> +		} else {
> +			mapped += size;
> +		}
> +	}
> +
> +	viommu_tlb_unmap(vdomain, iova, total_size);
> +
> +	return 0;
> +}
> +
> +static phys_addr_t viommu_iova_to_phys(struct iommu_domain *domain,
> +				       dma_addr_t iova)
> +{
> +	u64 paddr = 0;
> +	unsigned long flags;
> +	struct viommu_mapping *mapping;
> +	struct interval_tree_node *node;
> +	struct viommu_domain *vdomain = to_viommu_domain(domain);
> +
> +	spin_lock_irqsave(&vdomain->mappings_lock, flags);
> +	node = interval_tree_iter_first(&vdomain->mappings, iova, iova);
> +	if (node) {
> +		mapping = container_of(node, struct viommu_mapping, iova);
> +		paddr = mapping->paddr + (iova - mapping->iova.start);
> +	}
> +	spin_unlock_irqrestore(&vdomain->mappings_lock, flags);
> +
> +	pr_debug("iova_to_phys %llu 0x%llx->0x%llx\n", vdomain->id, iova,
> +		 paddr);
> +
> +	return paddr;
> +}
> +
> +static struct iommu_ops viommu_ops;
> +static struct virtio_driver virtio_iommu_drv;
> +
> +static int viommu_match_node(struct device *dev, void *data) {
> +	return dev->parent->fwnode == data;
> +}
> +
> +static struct viommu_dev *viommu_get_by_fwnode(struct fwnode_handle *fwnode)
> +{
> +	struct device *dev = driver_find_device(&virtio_iommu_drv.driver, NULL,
> +						fwnode, viommu_match_node);
> +	put_device(dev);
> +
> +	return dev ? dev_to_virtio(dev)->priv : NULL; }
> +
> +static int viommu_add_device(struct device *dev) {
> +	struct iommu_group *group;
> +	struct viommu_endpoint *vdev;
> +	struct viommu_dev *viommu = NULL;
> +	struct iommu_fwspec *fwspec = dev->iommu_fwspec;
> +
> +	if (!fwspec || fwspec->ops != &viommu_ops)
> +		return -ENODEV;
> +
> +	viommu = viommu_get_by_fwnode(fwspec->iommu_fwnode);
> +	if (!viommu)
> +		return -ENODEV;
> +
> +	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
> +	if (!vdev)
> +		return -ENOMEM;
> +
> +	vdev->viommu = viommu;
> +	fwspec->iommu_priv = vdev;
> +
> +	/*
> +	 * Last step creates a default domain and attaches to it. Everything
> +	 * must be ready.
> +	 */
> +	group = iommu_group_get_for_dev(dev);
> +
> +	return PTR_ERR_OR_ZERO(group);
> +}
> +
> +static void viommu_remove_device(struct device *dev) {
> +	kfree(dev->iommu_fwspec->iommu_priv);
> +}
> +
> +static struct iommu_group *
> +viommu_device_group(struct device *dev) {
> +	if (dev_is_pci(dev))
> +		return pci_device_group(dev);
> +	else
> +		return generic_device_group(dev);
> +}
> +
> +static int viommu_of_xlate(struct device *dev, struct of_phandle_args *args)
> +{
> +	u32 *id = args->args;
> +
> +	dev_dbg(dev, "of_xlate 0x%x\n", *id);
> +	return iommu_fwspec_add_ids(dev, args->args, 1); }
> +
> +/*
> + * (Maybe) temporary hack for device pass-through into guest userspace. On
> + * ARM with an ITS, VFIO will look for a region where to map the doorbell,
> + * even though the virtual doorbell is never written to by the device, and
> + * instead the host injects interrupts directly. TODO: sort this out in VFIO.
> + */
> +#define MSI_IOVA_BASE			0x8000000
> +#define MSI_IOVA_LENGTH			0x100000
> +
> +static void viommu_get_resv_regions(struct device *dev, struct list_head *head)
> +{
> +	struct iommu_resv_region *region;
> +	int prot = IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_MMIO;
> +
> +	region = iommu_alloc_resv_region(MSI_IOVA_BASE, MSI_IOVA_LENGTH, prot,
> +					 IOMMU_RESV_MSI);
> +	if (!region)
> +		return;
> +
> +	list_add_tail(&region->list, head);
> +}
> +
> +static void viommu_put_resv_regions(struct device *dev, struct list_head *head)
> +{
> +	struct iommu_resv_region *entry, *next;
> +
> +	list_for_each_entry_safe(entry, next, head, list)
> +		kfree(entry);
> +}
> +
> +static struct iommu_ops viommu_ops = {
> +	.capable		= viommu_capable,
> +	.domain_alloc		= viommu_domain_alloc,
> +	.domain_free		= viommu_domain_free,
> +	.attach_dev		= viommu_attach_dev,
> +	.map			= viommu_map,
> +	.unmap			= viommu_unmap,
> +	.map_sg			= viommu_map_sg,
> +	.iova_to_phys		= viommu_iova_to_phys,
> +	.add_device		= viommu_add_device,
> +	.remove_device		= viommu_remove_device,
> +	.device_group		= viommu_device_group,
> +	.of_xlate		= viommu_of_xlate,
> +	.get_resv_regions	= viommu_get_resv_regions,
> +	.put_resv_regions	= viommu_put_resv_regions,
> +};
> +
> +static int viommu_init_vq(struct viommu_dev *viommu) {
> +	struct virtio_device *vdev = dev_to_virtio(viommu->dev);
> +	vq_callback_t *callback = NULL;
> +	const char *name = "request";
> +	int ret;
> +
> +	ret = vdev->config->find_vqs(vdev, 1, &viommu->vq, &callback,
> +				     &name, NULL);
> +	if (ret)
> +		dev_err(viommu->dev, "cannot find VQ\n");
> +
> +	return ret;
> +}
> +
> +static int viommu_probe(struct virtio_device *vdev) {
> +	struct device *parent_dev = vdev->dev.parent;
> +	struct viommu_dev *viommu = NULL;
> +	struct device *dev = &vdev->dev;
> +	int ret;
> +
> +	viommu = kzalloc(sizeof(*viommu), GFP_KERNEL);
> +	if (!viommu)
> +		return -ENOMEM;
> +
> +	spin_lock_init(&viommu->vq_lock);
> +	INIT_LIST_HEAD(&viommu->pending_requests);
> +	viommu->dev = dev;
> +	viommu->vdev = vdev;
> +
> +	ret = viommu_init_vq(viommu);
> +	if (ret)
> +		goto err_free_viommu;
> +
> +	virtio_cread(vdev, struct virtio_iommu_config, page_sizes,
> +		     &viommu->pgsize_bitmap);
> +
> +	viommu->aperture_end = -1UL;
> +
> +	virtio_cread_feature(vdev, VIRTIO_IOMMU_F_INPUT_RANGE,
> +			     struct virtio_iommu_config, input_range.start,
> +			     &viommu->aperture_start);
> +
> +	virtio_cread_feature(vdev, VIRTIO_IOMMU_F_INPUT_RANGE,
> +			     struct virtio_iommu_config, input_range.end,
> +			     &viommu->aperture_end);
> +
> +	if (!viommu->pgsize_bitmap) {
> +		ret = -EINVAL;
> +		goto err_free_viommu;
> +	}
> +
> +	viommu_ops.pgsize_bitmap = viommu->pgsize_bitmap;
> +
> +	/*
> +	 * Not strictly necessary, virtio would enable it later. This lets us
> +	 * start using the request queue early.
> +	 */
> +	virtio_device_ready(vdev);
> +
> +	ret = iommu_device_sysfs_add(&viommu->iommu, dev, NULL, "%s",
> +				     virtio_bus_name(vdev));
> +	if (ret)
> +		goto err_free_viommu;
> +
> +	iommu_device_set_ops(&viommu->iommu, &viommu_ops);
> +	iommu_device_set_fwnode(&viommu->iommu, parent_dev->fwnode);
> +
> +	iommu_device_register(&viommu->iommu);
> +
> +#ifdef CONFIG_PCI
> +	if (pci_bus_type.iommu_ops != &viommu_ops) {
> +		pci_request_acs();
> +		ret = bus_set_iommu(&pci_bus_type, &viommu_ops);
> +		if (ret)
> +			goto err_unregister;
> +	}
> +#endif
> +#ifdef CONFIG_ARM_AMBA
> +	if (amba_bustype.iommu_ops != &viommu_ops) {
> +		ret = bus_set_iommu(&amba_bustype, &viommu_ops);
> +		if (ret)
> +			goto err_unregister;
> +	}
> +#endif
> +	if (platform_bus_type.iommu_ops != &viommu_ops) {
> +		ret = bus_set_iommu(&platform_bus_type, &viommu_ops);
> +		if (ret)
> +			goto err_unregister;
> +	}
> +
> +	vdev->priv = viommu;
> +
> +	dev_info(viommu->dev, "probe successful\n");
> +
> +	return 0;
> +
> +err_unregister:
> +	iommu_device_unregister(&viommu->iommu);
> +
> +err_free_viommu:
> +	kfree(viommu);
> +
> +	return ret;
> +}
> +
> +static void viommu_remove(struct virtio_device *vdev)
> +{
> +	struct viommu_dev *viommu = vdev->priv;
> +
> +	iommu_device_unregister(&viommu->iommu);
> +	kfree(viommu);
> +
> +	dev_info(&vdev->dev, "device removed\n"); }
> +
> +static void viommu_config_changed(struct virtio_device *vdev)
> +{
> +	dev_warn(&vdev->dev, "config changed\n");
> +}
> +
> +static unsigned int features[] = {
> +	VIRTIO_IOMMU_F_INPUT_RANGE,
> +};
> +
> +static struct virtio_device_id id_table[] = {
> +	{ VIRTIO_ID_IOMMU, VIRTIO_DEV_ANY_ID },
> +	{ 0 },
> +};
> +
> +static struct virtio_driver virtio_iommu_drv = {
> +	.driver.name		= KBUILD_MODNAME,
> +	.driver.owner		= THIS_MODULE,
> +	.id_table		= id_table,
> +	.feature_table		= features,
> +	.feature_table_size	= ARRAY_SIZE(features),
> +	.probe			= viommu_probe,
> +	.remove			= viommu_remove,
> +	.config_changed		= viommu_config_changed,
> +};
> +
> +module_virtio_driver(virtio_iommu_drv);
> +
> +IOMMU_OF_DECLARE(viommu, "virtio,mmio", NULL);
> +
> +MODULE_DESCRIPTION("virtio-iommu driver"); MODULE_AUTHOR("Jean-
> Philippe
> +Brucker <jean-philippe.brucker@arm.com>");
> +MODULE_LICENSE("GPL v2");
> diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
> index 1f25c86374ad..c0cb0f173258 100644
> --- a/include/uapi/linux/Kbuild
> +++ b/include/uapi/linux/Kbuild
> @@ -467,6 +467,7 @@ header-y += virtio_console.h
>  header-y += virtio_gpu.h
>  header-y += virtio_ids.h
>  header-y += virtio_input.h
> +header-y += virtio_iommu.h
>  header-y += virtio_mmio.h
>  header-y += virtio_net.h
>  header-y += virtio_pci.h
> diff --git a/include/uapi/linux/virtio_ids.h b/include/uapi/linux/virtio_ids.h
> index 6d5c3b2d4f4d..934ed3d3cd3f 100644
> --- a/include/uapi/linux/virtio_ids.h
> +++ b/include/uapi/linux/virtio_ids.h
> @@ -43,5 +43,6 @@
>  #define VIRTIO_ID_INPUT        18 /* virtio input */
>  #define VIRTIO_ID_VSOCK        19 /* virtio vsock transport */
>  #define VIRTIO_ID_CRYPTO       20 /* virtio crypto */
> +#define VIRTIO_ID_IOMMU	    61216 /* virtio IOMMU (temporary) */
> 
>  #endif /* _LINUX_VIRTIO_IDS_H */
> diff --git a/include/uapi/linux/virtio_iommu.h b/include/uapi/linux/virtio_iommu.h
> new file mode 100644
> index 000000000000..ec74c9a727d4
> --- /dev/null
> +++ b/include/uapi/linux/virtio_iommu.h
> @@ -0,0 +1,142 @@
> +/*
> + * Copyright (C) 2017 ARM Ltd.
> + *
> + * This header is BSD licensed so anyone can use the definitions
> + * to implement compatible drivers/servers:
> + *
> + * Redistribution and use in source and binary forms, with or without
> + * modification, are permitted provided that the following conditions
> + * are met:
> + * 1. Redistributions of source code must retain the above copyright
> + *    notice, this list of conditions and the following disclaimer.
> + * 2. Redistributions in binary form must reproduce the above copyright
> + *    notice, this list of conditions and the following disclaimer in the
> + *    documentation and/or other materials provided with the distribution.
> + * 3. Neither the name of ARM Ltd. nor the names of its contributors
> + *    may be used to endorse or promote products derived from this software
> + *    without specific prior written permission.
> + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
> + * FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL IBM OR
> + * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
> + * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
> + * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
> + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
> + * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
> + * SUCH DAMAGE.
> + */
> +#ifndef _UAPI_LINUX_VIRTIO_IOMMU_H
> +#define _UAPI_LINUX_VIRTIO_IOMMU_H
> +
> +/* Feature bits */
> +#define VIRTIO_IOMMU_F_INPUT_RANGE		0
> +#define VIRTIO_IOMMU_F_IOASID_BITS		1
> +#define VIRTIO_IOMMU_F_MAP_UNMAP		2
> +#define VIRTIO_IOMMU_F_BYPASS			3
> +
> +struct virtio_iommu_config {
> +	/* Supported page sizes */
> +	__u64					page_sizes;
> +	struct virtio_iommu_range {
> +		__u64				start;
> +		__u64				end;
> +	} input_range;
> +	__u8					ioasid_bits;
> +} __packed;
> +
> +/* Request types */
> +#define VIRTIO_IOMMU_T_ATTACH			0x01
> +#define VIRTIO_IOMMU_T_DETACH			0x02
> +#define VIRTIO_IOMMU_T_MAP			0x03
> +#define VIRTIO_IOMMU_T_UNMAP			0x04
> +
> +/* Status types */
> +#define VIRTIO_IOMMU_S_OK			0x00
> +#define VIRTIO_IOMMU_S_IOERR			0x01
> +#define VIRTIO_IOMMU_S_UNSUPP			0x02
> +#define VIRTIO_IOMMU_S_DEVERR			0x03
> +#define VIRTIO_IOMMU_S_INVAL			0x04
> +#define VIRTIO_IOMMU_S_RANGE			0x05
> +#define VIRTIO_IOMMU_S_NOENT			0x06
> +#define VIRTIO_IOMMU_S_FAULT			0x07
> +
> +struct virtio_iommu_req_head {
> +	__u8					type;
> +	__u8					reserved[3];
> +} __packed;
> +
> +struct virtio_iommu_req_tail {
> +	__u8					status;
> +	__u8					reserved[3];
> +} __packed;
> +
> +struct virtio_iommu_req_attach {
> +	struct virtio_iommu_req_head		head;
> +
> +	__le32					address_space;
> +	__le32					device;
> +	__le32					reserved;
> +
> +	struct virtio_iommu_req_tail		tail;
> +} __packed;
> +
> +struct virtio_iommu_req_detach {
> +	struct virtio_iommu_req_head		head;
> +
> +	__le32					device;
> +	__le32					reserved;
> +
> +	struct virtio_iommu_req_tail		tail;
> +} __packed;
> +
> +#define VIRTIO_IOMMU_MAP_F_READ			(1 << 0)
> +#define VIRTIO_IOMMU_MAP_F_WRITE		(1 << 1)
> +#define VIRTIO_IOMMU_MAP_F_EXEC			(1 << 2)
> +
> +#define VIRTIO_IOMMU_MAP_F_MASK		(VIRTIO_IOMMU_MAP_F_READ |	\
> +						 VIRTIO_IOMMU_MAP_F_WRITE |	\
> +						 VIRTIO_IOMMU_MAP_F_EXEC)
> +
> +struct virtio_iommu_req_map {
> +	struct virtio_iommu_req_head		head;
> +
> +	__le32					address_space;
> +	__le32					flags;
> +	__le64					virt_addr;
> +	__le64					phys_addr;
> +	__le64					size;
> +
> +	struct virtio_iommu_req_tail		tail;
> +} __packed;
> +
> +struct virtio_iommu_req_unmap {
> +	struct virtio_iommu_req_head		head;
> +
> +	__le32					address_space;
> +	__le32					flags;
> +	__le64					virt_addr;
> +	__le64					size;
> +
> +	struct virtio_iommu_req_tail		tail;
> +} __packed;
> +
> +union virtio_iommu_req {
> +	struct virtio_iommu_req_head		head;
> +
> +	struct virtio_iommu_req_attach		attach;
> +	struct virtio_iommu_req_detach		detach;
> +	struct virtio_iommu_req_map		map;
> +	struct virtio_iommu_req_unmap		unmap;
> +};
> +
> +#endif
> --
> 2.12.1
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org

^ permalink raw reply	[flat|nested] 99+ messages in thread
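
For readers who want to poke at the UAPI layout quoted above, here is a
minimal standalone sketch (not part of the patch: the kernel __le32/__le64
types are replaced with fixed-width integers and a little-endian host is
assumed, so no byte-swapping is shown) that builds a MAP request and splits
it into the device-readable head and device-written tail, the way the
driver posts them as two descriptors:

#include <stdint.h>
#include <stdio.h>

#define VIRTIO_IOMMU_T_MAP		0x03
#define VIRTIO_IOMMU_MAP_F_READ		(1 << 0)
#define VIRTIO_IOMMU_MAP_F_WRITE	(1 << 1)

struct req_head { uint8_t type; uint8_t reserved[3]; };
struct req_tail { uint8_t status; uint8_t reserved[3]; };

struct req_map {
	struct req_head	head;
	uint32_t	address_space;
	uint32_t	flags;
	uint64_t	virt_addr;
	uint64_t	phys_addr;
	uint64_t	size;
	struct req_tail	tail;
} __attribute__((packed));

int main(void)
{
	struct req_map req = {
		.head.type	= VIRTIO_IOMMU_T_MAP,
		.address_space	= 1,
		.flags		= VIRTIO_IOMMU_MAP_F_READ |
				  VIRTIO_IOMMU_MAP_F_WRITE,
		.virt_addr	= 0xfffffff3000ULL,
		.phys_addr	= 0x8faa0000,
		.size		= 0x1000,
	};
	/* The driver posts the request as two descriptors: the head is
	 * device-readable, the tail is written back by the device. */
	size_t tail_size = sizeof(struct req_tail);
	size_t head_size = sizeof(req) - tail_size;

	printf("head %zu bytes, tail %zu bytes, status %u\n",
	       head_size, tail_size, req.tail.status);
	return 0;
}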

* Re: [virtio-dev] [RFC PATCH linux] iommu: Add virtio-iommu driver
  2017-06-16  8:48   ` [virtio-dev] " Bharat Bhushan
@ 2017-06-16 11:36     ` Jean-Philippe Brucker
  2017-06-16 11:36     ` Jean-Philippe Brucker
  1 sibling, 0 replies; 99+ messages in thread
From: Jean-Philippe Brucker @ 2017-06-16 11:36 UTC (permalink / raw)
  To: Bharat Bhushan, iommu, kvm, virtualization, virtio-dev
  Cc: cdall, will.deacon, robin.murphy, lorenzo.pieralisi, joro, mst,
	jasowang, alex.williamson, marc.zyngier

On 16/06/17 09:48, Bharat Bhushan wrote:
> Hi Jean
>> +static int viommu_map(struct iommu_domain *domain, unsigned long iova,
>> +		      phys_addr_t paddr, size_t size, int prot)
>> +{
>> +	int ret;
>> +	struct viommu_domain *vdomain = to_viommu_domain(domain);
>> +	struct virtio_iommu_req_map req = {
>> +		.head.type	= VIRTIO_IOMMU_T_MAP,
>> +		.address_space	= cpu_to_le32(vdomain->id),
>> +		.virt_addr	= cpu_to_le64(iova),
>> +		.phys_addr	= cpu_to_le64(paddr),
>> +		.size		= cpu_to_le64(size),
>> +	};
>> +
>> +	pr_debug("map %llu 0x%lx -> 0x%llx (%zu)\n", vdomain->id, iova,
>> +		 paddr, size);
> 
> A query: when I am tracing the above prints, I see the same physical address mapped at two different virtual addresses. Do you know why the kernel does this?

That really depends on which driver is calling into viommu. iommu_map is
called from the DMA API, which can be used by any device driver. Within
an address space, multiple IOVAs pointing to the same PA aren't forbidden.

For example, looking at MAP requests for a virtio-net device, I get the
following trace:

ioas[1] map   0xfffffff3000 -> 0x8faa0000 (4096)
ioas[1] map   0xfffffff2000 -> 0x8faa0000 (4096)
ioas[1] map   0xfffffff1000 -> 0x8faa0000 (4096)
ioas[1] map   0xfffffff0000 -> 0x8faa0000 (4096)
ioas[1] map   0xffffffef000 -> 0x8faa0000 (4096)
ioas[1] map   0xffffffee000 -> 0x8faa0000 (4096)
ioas[1] map   0xffffffed000 -> 0x8faa0000 (4096)
ioas[1] map   0xffffffec000 -> 0x8faa0000 (4096)
ioas[1] map   0xffffffeb000 -> 0x8faa0000 (4096)
ioas[1] map   0xffffffea000 -> 0x8faa0000 (4096)
ioas[1] map   0xffffffe8000 -> 0x8faa0000 (8192)
...

During initialization, the virtio-net driver primes the rx queue with
receive buffers, which the host will then fill with network packets. It
calls virtqueue_add_inbuf_ctx to create descriptors on the rx virtqueue
for each buffer. Each buffer is 0x180 bytes here, so one 4k page can
contain around ten (actually eleven, with the last one crossing the page
boundary).

I guess the call trace goes like this:
 virtnet_open
 try_fill_recv
 add_recvbuf_mergeable
 virtqueue_add_inbuf_ctx
 vring_map_one_sg
 dma_map_page
 __iommu_dma_map

But the IOMMU cannot map fragments of pages, since the granule is 0x1000.
Therefore when virtqueue_add_inbuf_ctx maps the buffer, __iommu_dma_map
aligns the address and size to full pages. Someone motivated could probably
optimize this by caching mapped pages and reusing IOVAs, but currently
that's how it goes.
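
To make the rounding concrete, here is a minimal standalone sketch of that
alignment (not from the patch; the helper name and addresses are invented
for illustration):

#include <stdint.h>
#include <stdio.h>

#define GRANULE 0x1000ULL	/* smallest page size in pgsize_bitmap */

/* Round a buffer [addr, addr + size) outwards to whole IOMMU pages. */
static void align_to_granule(uint64_t addr, uint64_t size,
			     uint64_t *iova, uint64_t *len)
{
	*iova = addr & ~(GRANULE - 1);
	*len  = ((addr + size + GRANULE - 1) & ~(GRANULE - 1)) - *iova;
}

int main(void)
{
	uint64_t iova, len;

	/* a 0x180-byte rx buffer sitting in the middle of a page... */
	align_to_granule(0x8faa0300, 0x180, &iova, &len);

	/* ...is mapped as the whole 4k page that contains it */
	printf("map 0x%llx (0x%llx bytes)\n",
	       (unsigned long long)iova, (unsigned long long)len);
	return 0;
}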

Thanks,
Jean

^ permalink raw reply	[flat|nested] 99+ messages in thread

* RE: [RFC 2/3] virtio-iommu: device probing and operations
  2017-04-24 15:05           ` Jean-Philippe Brucker
@ 2017-08-21  7:59             ` Tian, Kevin
  2017-08-21 12:00               ` Jean-Philippe Brucker
  2017-08-21 12:00                 ` [virtio-dev] " Jean-Philippe Brucker
  0 siblings, 2 replies; 99+ messages in thread
From: Tian, Kevin @ 2017-08-21  7:59 UTC (permalink / raw)
  To: Jean-Philippe Brucker, iommu, kvm, virtualization, virtio-dev
  Cc: cdall, lorenzo.pieralisi, mst, marc.zyngier, joro, will.deacon,
	robin.murphy

> From: Jean-Philippe Brucker [mailto:jean-philippe.brucker@arm.com]
> Sent: Monday, April 24, 2017 11:06 PM
> >>>>   1. Attach device
> >>>>   ----------------
> >>>>
> >>>> struct virtio_iommu_req_attach {
> >>>> 	le32	address_space;
> >>>> 	le32	device;
> >>>> 	le32	flags/reserved;
> >>>> };
> >>>>
> >>>> Attach a device to an address space. 'address_space' is an identifier
> >>>> unique to the guest. If the address space doesn't exist in the IOMMU
> >>>
> >>> Based on your description this address space ID is per operation right?
> >>> MAP/UNMAP and page-table sharing should have different ID spaces...
> >>
> >> I think it's simpler if we keep a single IOASID space per virtio-iommu
> >> device, because the maximum number of address spaces (described by
> >> ioasid_bits) might be a restriction of the pIOMMU. For page-table
> sharing
> >> you still need to define which devices will share a page directory using
> >> ATTACH requests, though that interface is not set in stone.
> >
> > got you. yes VM is supposed to consume less IOASIDs than physically
> > available. It doesn’t hurt to have one IOASID space for both IOVA
> > map/unmap usages (one IOASID per device) and SVM usages (multiple
> > IOASIDs per device). The former is digested by software and the latter
> > will be bound to hardware.
> >
> 
> Hmm, I'm using address space indexed by IOASID for "classic" IOMMU, and
> then contexts indexed by PASID when talking about SVM. So in my mind an
> address space can have multiple sub-address-spaces (contexts). Number of
> IOASIDs is a limitation of the pIOMMU, and number of PASIDs is a
> limitation of the device. Therefore attaching devices to address spaces
> would update the number of available contexts in that address space. The
> terminology is not ideal, and I'd be happy to change it for something more
> clear.
> 

(sorry to pick up this old thread, as the .tex one is not good for review
and this thread provides necessary background for IOASID).

Hi, Jean,

I'd like to hear more clarification regarding the relationship between
IOASID and PASID. Reading the explanation above again, it looks
confusing to me now (though I might have gotten the meaning months ago :/).
At least Intel VT-d only understands PASID (or you can think IOASID
=PASID). There is no such layered address space concept. Then for
map/unmap type requests, do you intend to steal some PASIDs for
that purpose on such an architecture (since IOASID is a mandatory field
in map/unmap requests)?

Thanks
Kevin
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [RFC 2/3] virtio-iommu: device probing and operations
  2017-08-21  7:59             ` Tian, Kevin
@ 2017-08-21 12:00                 ` Jean-Philippe Brucker
  2017-08-21 12:00                 ` [virtio-dev] " Jean-Philippe Brucker
  1 sibling, 0 replies; 99+ messages in thread
From: Jean-Philippe Brucker @ 2017-08-21 12:00 UTC (permalink / raw)
  To: Tian, Kevin, iommu, kvm, virtualization, virtio-dev
  Cc: cdall, will.deacon, robin.murphy, lorenzo.pieralisi, joro, mst,
	jasowang, alex.williamson, marc.zyngier

On 21/08/17 08:59, Tian, Kevin wrote:
>> From: Jean-Philippe Brucker [mailto:jean-philippe.brucker@arm.com]
>> Sent: Monday, April 24, 2017 11:06 PM
>>>>>>   1. Attach device
>>>>>>   ----------------
>>>>>>
>>>>>> struct virtio_iommu_req_attach {
>>>>>> 	le32	address_space;
>>>>>> 	le32	device;
>>>>>> 	le32	flags/reserved;
>>>>>> };
>>>>>>
>>>>>> Attach a device to an address space. 'address_space' is an identifier
>>>>>> unique to the guest. If the address space doesn't exist in the IOMMU
>>>>>
>>>>> Based on your description this address space ID is per operation right?
>>>>> MAP/UNMAP and page-table sharing should have different ID spaces...
>>>>
>>>> I think it's simpler if we keep a single IOASID space per virtio-iommu
>>>> device, because the maximum number of address spaces (described by
>>>> ioasid_bits) might be a restriction of the pIOMMU. For page-table
>> sharing
>>>> you still need to define which devices will share a page directory using
>>>> ATTACH requests, though that interface is not set in stone.
>>>
>>> got you. yes VM is supposed to consume less IOASIDs than physically
>>> available. It doesn’t hurt to have one IOASID space for both IOVA
>>> map/unmap usages (one IOASID per device) and SVM usages (multiple
>>> IOASIDs per device). The former is digested by software and the latter
>>> will be bound to hardware.
>>>
>>
>> Hmm, I'm using address space indexed by IOASID for "classic" IOMMU, and
>> then contexts indexed by PASID when talking about SVM. So in my mind an
>> address space can have multiple sub-address-spaces (contexts). The
>> number of IOASIDs is a limitation of the pIOMMU, and the number of
>> PASIDs is a limitation of the device. Therefore attaching devices to
>> address spaces would update the number of available contexts in that
>> address space. The terminology is not ideal, and I'd be happy to change
>> it to something clearer.
>>
> 
> (Sorry to pick up this old thread: the .tex one is not well suited to
> review, and this thread provides the necessary background for IOASID.)
> 
> Hi Jean,
> 
> I'd like some more clarification regarding the relationship between
> IOASID and PASID. Reading the above explanation again, it looks
> confusing to me now (though I might have gotten the meaning months
> ago :/). At least Intel VT-d only understands PASID (or you can think
> of IOASID = PASID); there is no such layered address-space concept.
> For map/unmap-type requests, do you then intend to steal some PASIDs
> for that purpose on such an architecture (since IOASID is a mandatory
> field in map/unmap requests)?

IOASID is a logical ID; it isn't used by hardware. The address space
concept in virtio-iommu allows grouping endpoints together so that they
share the same address space. I thought it was pretty much the same as
"domains" in VT-d? In any case, it is the same as domains in Linux. An
IOASID provides a handle for communication between the virtio-iommu
device and driver, but unlike a PASID, the IOASID number doesn't mean
anything outside of virtio-iommu.
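
To illustrate: the IOASID only ever appears as a plain field in
requests. Roughly (a sketch that elides the exact v0.4 layout, so take
the field list as illustrative; le32/le64 are little-endian fields, as
elsewhere in the spec):

	struct virtio_iommu_req_map {
		le32	address_space;	/* guest-chosen IOASID, a pure handle */
		le64	virt_addr;	/* IOVA */
		le64	phys_addr;	/* guest-physical address */
		le64	size;
		le32	flags;
	};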

I haven't introduced PASIDs in public virtio-iommu documents yet, but the
way I intend it, PASID != IOASID. We will still have a logical address
space identified by an IOASID, which can contain multiple contexts
identified by PASIDs. At the moment, after the ATTACH request, an address
space contains a single anonymous context (NO PASID) that can be managed
with MAP/UNMAP requests. With virtio-iommu v0.4, the structures look like
the following; the NO PASID context is implicit.

                    address space      context
    endpoint ----.                                  .- mapping
    endpoint ----+---- IOASID -------- NO PASID ----+- mapping
    endpoint ----'                                  '- mapping

I'd like to add a flag to ATTACH that says "don't create a default
anonymous context, I'll handle contexts myself", and a new "ADD_TABLE"
request to handle contexts. When creating a context, the guest decides
whether it wants to manage it via MAP/UNMAP requests (and a new "context"
field), or instead manage mappings itself by allocating a page directory
and using INVALIDATE requests.

                    address space      context
    endpoint ----.                                  .- mapping
    endpoint ----+---- IOASID ----+--- NO PASID ----+- mapping
    endpoint ----'                |                 '- mapping
                                  +--- PASID 0  ---- pgd
                                  |     ...
                                  '--- PASID N  ---- pgd

In this example the guest chose to keep an anonymous context that uses
MAP/UNMAP, along with a few PASID contexts that have their own page
tables.
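
To make this concrete, the new interface could look roughly like the
following (a sketch only: the names, flags and fields are tentative and
not part of any published draft):

	/* Tentative ATTACH flag: don't create the anonymous context */
	#define VIRTIO_IOMMU_ATTACH_F_PRIVATE_CONTEXTS	(1 << 0)

	struct virtio_iommu_req_add_table {
		le32	address_space;	/* IOASID */
		le32	pasid;		/* context ID within the address space */
		le32	flags;		/* managed via MAP/UNMAP, or guest pgd */
		le64	pgd;		/* page directory base, if guest-managed */
	};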

Thanks,
Jean

* RE: [RFC 2/3] virtio-iommu: device probing and operations
  2017-08-21 12:00                 ` [virtio-dev] " Jean-Philippe Brucker
@ 2017-08-22  6:24                 ` Tian, Kevin
  -1 siblings, 0 replies; 99+ messages in thread
From: Tian, Kevin @ 2017-08-22  6:24 UTC (permalink / raw)
  To: Jean-Philippe Brucker, iommu, kvm, virtualization, virtio-dev
  Cc: cdall, lorenzo.pieralisi, mst, marc.zyngier, joro, will.deacon,
	robin.murphy

> From: Jean-Philippe Brucker [mailto:jean-philippe.brucker@arm.com]
> Sent: Monday, August 21, 2017 8:00 PM
> 
> On 21/08/17 08:59, Tian, Kevin wrote:
> >> From: Jean-Philippe Brucker [mailto:jean-philippe.brucker@arm.com]
> >> Sent: Monday, April 24, 2017 11:06 PM
> >>>>>>   1. Attach device
> >>>>>>   ----------------
> >>>>>>
> >>>>>> struct virtio_iommu_req_attach {
> >>>>>> 	le32	address_space;
> >>>>>> 	le32	device;
> >>>>>> 	le32	flags/reserved;
> >>>>>> };
> >>>>>>
> >>>>>> Attach a device to an address space. 'address_space' is an identifier
> >>>>>> unique to the guest. If the address space doesn't exist in the IOMMU
> >>>>>
> >>>>> Based on your description this address space ID is per operation right?
> >>>>> MAP/UNMAP and page-table sharing should have different ID spaces...
> >>>>
> >>>> I think it's simpler if we keep a single IOASID space per virtio-iommu
> >>>> device, because the maximum number of address spaces (described by
> >>>> ioasid_bits) might be a restriction of the pIOMMU. For page-table
> >>>> sharing you still need to define which devices will share a page
> >>>> directory using ATTACH requests, though that interface is not set in
> >>>> stone.
> >>>
> >>> Got you. Yes, a VM is supposed to consume fewer IOASIDs than
> >>> physically available. It doesn't hurt to have one IOASID space for
> >>> both IOVA map/unmap usages (one IOASID per device) and SVM usages
> >>> (multiple IOASIDs per device). The former is digested by software and
> >>> the latter will be bound to hardware.
> >>>
> >>
> >> Hmm, I'm using address space indexed by IOASID for "classic" IOMMU,
> >> and then contexts indexed by PASID when talking about SVM. So in my
> >> mind an address space can have multiple sub-address-spaces (contexts).
> >> The number of IOASIDs is a limitation of the pIOMMU, and the number of
> >> PASIDs is a limitation of the device. Therefore attaching devices to
> >> address spaces would update the number of available contexts in that
> >> address space. The terminology is not ideal, and I'd be happy to
> >> change it to something clearer.
> >>
> >
> > (Sorry to pick up this old thread: the .tex one is not well suited to
> > review, and this thread provides the necessary background for IOASID.)
> >
> > Hi Jean,
> >
> > I'd like some more clarification regarding the relationship between
> > IOASID and PASID. Reading the above explanation again, it looks
> > confusing to me now (though I might have gotten the meaning months
> > ago :/). At least Intel VT-d only understands PASID (or you can think
> > of IOASID = PASID); there is no such layered address-space concept.
> > For map/unmap-type requests, do you then intend to steal some PASIDs
> > for that purpose on such an architecture (since IOASID is a mandatory
> > field in map/unmap requests)?
> 
> IOASID is a logical ID; it isn't used by hardware. The address space
> concept in virtio-iommu allows grouping endpoints together so that they
> share the same address space. I thought it was pretty much the same as
> "domains" in VT-d? In any case, it is the same as domains in Linux. An
> IOASID provides a handle for communication between the virtio-iommu
> device and driver, but unlike a PASID, the IOASID number doesn't mean
> anything outside of virtio-iommu.

Thanks. It's clear to me then.

By the way, does it make more sense to use "domain ID" instead of "IO
address space ID"? For one, with layered address spaces the parent
address space is usually a superset of all its child address spaces,
which doesn't apply here, since the anonymous and PASID-tagged address
spaces are completely isolated; 'domain' is a more inclusive term that
embraces multiple address spaces. For two, 'domain' aligns better with
software terminology (e.g. iommu_domain), so it is easier for people to
pick up. :-)

> 
> I haven't introduced PASIDs in public virtio-iommu documents yet, but the
> way I intend it, PASID != IOASID. We will still have a logical address
> space identified by an IOASID, which can contain multiple contexts
> identified by PASIDs. At the moment, after the ATTACH request, an address
> space contains a single anonymous context (NO PASID) that can be managed
> with MAP/UNMAP requests. With virtio-iommu v0.4, the structures look like
> the following; the NO PASID context is implicit.
> 
>                     address space      context
>     endpoint ----.                                  .- mapping
>     endpoint ----+---- IOASID -------- NO PASID ----+- mapping
>     endpoint ----'                                  '- mapping
> 
> I'd like to add a flag to ATTACH that says "don't create a default
> anonymous context, I'll handle contexts myself", and a new "ADD_TABLE"
> request to handle contexts. When creating a context, the guest decides
> whether it wants to manage it via MAP/UNMAP requests (and a new "context"
> field), or instead manage mappings itself by allocating a page directory
> and using INVALIDATE requests.
> 
>                     address space      context
>     endpoint ----.                                  .- mapping
>     endpoint ----+---- IOASID ----+--- NO PASID ----+- mapping
>     endpoint ----'                |                 '- mapping
>                                   +--- PASID 0  ---- pgd
>                                   |     ...
>                                   '--- PASID N  ---- pgd
> 
> In this example the guest chose to keep an anonymous context that uses
> MAP/UNMAP, along with a few PASID contexts that have their own page
> tables.
> 

The above explanation is good background. Would it be useful to include
it in the current spec? Though SVM support is not planned for now,
adding such background could help build a full story for the IOASID
concept.

Thanks
Kevin

* Re: [RFC 2/3] virtio-iommu: device probing and operations
  2017-08-22  6:24                   ` Tian, Kevin
@ 2017-08-22 14:19                       ` Jean-Philippe Brucker
  2017-08-22 14:19                       ` [virtio-dev] " Jean-Philippe Brucker
  1 sibling, 0 replies; 99+ messages in thread
From: Jean-Philippe Brucker @ 2017-08-22 14:19 UTC (permalink / raw)
  To: Tian, Kevin, iommu, kvm, virtualization, virtio-dev
  Cc: cdall, will.deacon, robin.murphy, lorenzo.pieralisi, joro, mst,
	jasowang, alex.williamson, marc.zyngier, Auger Eric,
	Bharat Bhushan

On 22/08/17 07:24, Tian, Kevin wrote:
>>> (Sorry to pick up this old thread: the .tex one is not well suited to
>>> review, and this thread provides the necessary background for IOASID.)
>>>
>>> Hi Jean,
>>>
>>> I'd like some more clarification regarding the relationship between
>>> IOASID and PASID. Reading the above explanation again, it looks
>>> confusing to me now (though I might have gotten the meaning months
>>> ago :/). At least Intel VT-d only understands PASID (or you can think
>>> of IOASID = PASID); there is no such layered address-space concept.
>>> For map/unmap-type requests, do you then intend to steal some PASIDs
>>> for that purpose on such an architecture (since IOASID is a mandatory
>>> field in map/unmap requests)?
>>
>> IOASID is a logical ID; it isn't used by hardware. The address space
>> concept in virtio-iommu allows grouping endpoints together so that they
>> share the same address space. I thought it was pretty much the same as
>> "domains" in VT-d? In any case, it is the same as domains in Linux. An
>> IOASID provides a handle for communication between the virtio-iommu
>> device and driver, but unlike a PASID, the IOASID number doesn't mean
>> anything outside of virtio-iommu.
> 
> Thanks. It's clear to me then.
> 
> By the way, does it make more sense to use "domain ID" instead of "IO
> address space ID"? For one, with layered address spaces the parent
> address space is usually a superset of all its child address spaces,
> which doesn't apply here, since the anonymous and PASID-tagged address
> spaces are completely isolated; 'domain' is a more inclusive term that
> embraces multiple address spaces. For two, 'domain' aligns better with
> software terminology (e.g. iommu_domain), so it is easier for people to
> pick up. :-)

I do agree that the naming isn't great. I didn't use "domain" for various
reasons (it also has a different meaning on ARM), but I keep regretting
it. As there is no virtio-iommu code upstream yet, it is still possible
to change this one.

I find that "address space" was a good fit for the baseline device, but
the name doesn't scale. When introducing PASIDs, the address space moves
one level down in the programming model: it is the contexts, anonymous or
PASID-tagged, that should be called address spaces. I was considering
replacing it with "domain", "container", "partition"...

Even though I don't want to use too much Linux terminology (virtio isn't
just Linux), "domain" is the better fit, somewhat neutral, and gets the
point across. A domain has one or more input address spaces and a single
output address space.

When introducing nested translation to virtio-iommu (for the guest to run
virtual machines itself), there will be one or more intermediate address
spaces, and domains will be nested, with the terminology "parent domain"
and "child domain". I have only briefly looked at a programming model for
this, but I think we can nest virtio-iommus without much hassle.

If there is no objection, the next version will use "domain" in place of
"address_space". The change is quite invasive at this point, but I
believe it will make things clearer down the road.
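
As a sketch of what the rename amounts to (assuming a mechanical
s/address_space/domain/ on the existing requests, with nothing else
changed):

	struct virtio_iommu_req_attach {
		le32	domain;
		le32	device;
		le32	flags/reserved;
	};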

>> I haven't introduced PASIDs in public virtio-iommu documents yet, but the
>> way I intend it, PASID != IOASID. We will still have a logical address
>> space identified by an IOASID, which can contain multiple contexts
>> identified by PASIDs. At the moment, after the ATTACH request, an address
>> space contains a single anonymous context (NO PASID) that can be managed
>> with MAP/UNMAP requests. With virtio-iommu v0.4, the structures look like
>> the following; the NO PASID context is implicit.
>>
>>                     address space      context
>>     endpoint ----.                                  .- mapping
>>     endpoint ----+---- IOASID -------- NO PASID ----+- mapping
>>     endpoint ----'                                  '- mapping
>>
>> I'd like to add a flag to ATTACH that says "don't create a default
>> anonymous context, I'll handle contexts myself", and a new "ADD_TABLE"
>> request to handle contexts. When creating a context, the guest decides
>> whether it wants to manage it via MAP/UNMAP requests (and a new "context"
>> field), or instead manage mappings itself by allocating a page directory
>> and using INVALIDATE requests.
>>
>>                     address space      context
>>     endpoint ----.                                  .- mapping
>>     endpoint ----+---- IOASID ----+--- NO PASID ----+- mapping
>>     endpoint ----'                |                 '- mapping
>>                                   +--- PASID 0  ---- pgd
>>                                   |     ...
>>                                   '--- PASID N  ---- pgd
>>
>> In this example the guest chose to keep an anonymous context that uses
>> MAP/UNMAP, along with a few PASID contexts that have their own page
>> tables.
>>
> 
> The above explanation is good background. Would it be useful to include
> it in the current spec? Though SVM support is not planned for now,
> adding such background could help build a full story for the IOASID
> concept.

I think introducing this explanation when PASIDs are added to the spec is
good enough. Right now it would look like clutter.

Thanks,
Jean

* RE: [RFC 2/3] virtio-iommu: device probing and operations
  2017-08-22 14:19                       ` [virtio-dev] " Jean-Philippe Brucker
@ 2017-08-23  2:23                         ` Tian, Kevin
  -1 siblings, 0 replies; 99+ messages in thread
From: Tian, Kevin @ 2017-08-23  2:23 UTC (permalink / raw)
  To: Jean-Philippe Brucker, iommu, kvm, virtualization, virtio-dev
  Cc: cdall, will.deacon, robin.murphy, lorenzo.pieralisi, joro, mst,
	jasowang, alex.williamson, marc.zyngier, Auger Eric,
	Bharat Bhushan

> From: Jean-Philippe Brucker [mailto:jean-philippe.brucker@arm.com]
> Sent: Tuesday, August 22, 2017 10:19 PM
> 
> On 22/08/17 07:24, Tian, Kevin wrote:
> >>> (Sorry to pick up this old thread: the .tex one is not well suited to
> >>> review, and this thread provides the necessary background for IOASID.)
> >>>
> >>> Hi Jean,
> >>>
> >>> I'd like some more clarification regarding the relationship between
> >>> IOASID and PASID. Reading the above explanation again, it looks
> >>> confusing to me now (though I might have gotten the meaning months
> >>> ago :/). At least Intel VT-d only understands PASID (or you can think
> >>> of IOASID = PASID); there is no such layered address-space concept.
> >>> For map/unmap-type requests, do you then intend to steal some PASIDs
> >>> for that purpose on such an architecture (since IOASID is a mandatory
> >>> field in map/unmap requests)?
> >>
> >> IOASID is a logical ID; it isn't used by hardware. The address space
> >> concept in virtio-iommu allows grouping endpoints together so that
> >> they share the same address space. I thought it was pretty much the
> >> same as "domains" in VT-d? In any case, it is the same as domains in
> >> Linux. An IOASID provides a handle for communication between the
> >> virtio-iommu device and driver, but unlike a PASID, the IOASID number
> >> doesn't mean anything outside of virtio-iommu.
> >
> > Thanks. It's clear to me then.
> >
> > By the way, does it make more sense to use "domain ID" instead of "IO
> > address space ID"? For one, with layered address spaces the parent
> > address space is usually a superset of all its child address spaces,
> > which doesn't apply here, since the anonymous and PASID-tagged address
> > spaces are completely isolated; 'domain' is a more inclusive term that
> > embraces multiple address spaces. For two, 'domain' aligns better with
> > software terminology (e.g. iommu_domain), so it is easier for people
> > to pick up. :-)
> 
> I do agree that the naming isn't great. I didn't use "domain" for various
> reasons (it also has a different meaning on ARM), but I keep regretting
> it. As there is no virtio-iommu code upstream yet, it is still possible
> to change this one.
> 
> I find that "address space" was a good fit for the baseline device, but
> the name doesn't scale. When introducing PASIDs, the address space moves
> one level down in the programming model: it is the contexts, anonymous or
> PASID-tagged, that should be called address spaces. I was considering
> replacing it with "domain", "container", "partition"...
> 
> Even though I don't want to use too much Linux terminology (virtio isn't
> just Linux), "domain" is the better fit, somewhat neutral, and gets the
> point across. A domain has one or more input address spaces and a single
> output address space.
> 
> When introducing nested translation to virtio-iommu (for the guest to run
> virtual machines itself), there will be one or more intermediate address
> spaces, and domains will be nested, with the terminology "parent domain"
> and "child domain". I have only briefly looked at a programming model for
> this, but I think we can nest virtio-iommus without much hassle.
> 
> If there is no objection, the next version will use "domain" in place of
> "address_space". The change is quite invasive at this point, but I
> believe it will make things clearer down the road.
> 

Sounds good to me. Thanks.

end of thread (newest: 2017-08-23  2:23 UTC)

Thread overview: 99+ messages
2017-04-07 19:17 [RFC 0/3] virtio-iommu: a paravirtualized IOMMU Jean-Philippe Brucker
2017-04-07 19:17 ` [RFC 1/3] virtio-iommu: firmware description of the virtual topology Jean-Philippe Brucker
2017-04-18  9:51   ` Tian, Kevin
2017-04-18 18:41     ` Jean-Philippe Brucker
2017-04-21  8:43       ` Tian, Kevin
2017-04-24 15:05         ` Jean-Philippe Brucker
2017-04-07 19:17 ` [RFC 2/3] virtio-iommu: device probing and operations Jean-Philippe Brucker
2017-04-18 10:26   ` Tian, Kevin
2017-04-18 18:45     ` Jean-Philippe Brucker
2017-04-21  9:02       ` Tian, Kevin
2017-04-24 15:05         ` Jean-Philippe Brucker
2017-08-21  7:59           ` Tian, Kevin
2017-08-21 12:00             ` Jean-Philippe Brucker
2017-08-22  6:24               ` Tian, Kevin
2017-08-22 14:19                 ` Jean-Philippe Brucker
2017-08-23  2:23                   ` Tian, Kevin
2017-04-07 19:17 ` [RFC 3/3] virtio-iommu: future work Jean-Philippe Brucker
2017-04-21  8:31   ` Tian, Kevin
2017-04-24 15:05     ` Jean-Philippe Brucker
2017-04-26 16:24   ` Michael S. Tsirkin
2017-04-07 19:23 ` [RFC PATCH linux] iommu: Add virtio-iommu driver Jean-Philippe Brucker
2017-06-16  8:48   ` Bharat Bhushan
2017-06-16 11:36     ` Jean-Philippe Brucker
2017-04-07 19:24 ` [RFC PATCH kvmtool 00/15] Add virtio-iommu Jean-Philippe Brucker
2017-04-07 19:24   ` [RFC PATCH kvmtool 01/15] virtio: synchronize virtio-iommu headers with Linux Jean-Philippe Brucker
2017-04-07 19:24   ` [RFC PATCH kvmtool 02/15] FDT: (re)introduce a dynamic phandle allocator Jean-Philippe Brucker
2017-04-07 19:24   ` [RFC PATCH kvmtool 03/15] virtio: add virtio-iommu Jean-Philippe Brucker
2017-04-07 19:24   ` [RFC PATCH kvmtool 04/15] Add a simple IOMMU Jean-Philippe Brucker
2017-04-07 19:24   ` [RFC PATCH kvmtool 05/15] iommu: describe IOMMU topology in device-trees Jean-Philippe Brucker
2017-04-07 19:24   ` [RFC PATCH kvmtool 06/15] irq: register MSI doorbell addresses Jean-Philippe Brucker
2017-04-07 19:24   ` [RFC PATCH kvmtool 07/15] virtio: factor virtqueue initialization Jean-Philippe Brucker
2017-04-07 19:24   ` [RFC PATCH kvmtool 08/15] virtio: add vIOMMU instance for virtio devices Jean-Philippe Brucker
2017-04-07 19:24   ` [RFC PATCH kvmtool 09/15] virtio: access vring and buffers through IOMMU mappings Jean-Philippe Brucker
2017-04-07 19:24   ` [RFC PATCH kvmtool 10/15] virtio-pci: translate MSIs with the virtual IOMMU Jean-Philippe Brucker
2017-04-07 19:24   ` [RFC PATCH kvmtool 11/15] virtio: set VIRTIO_F_IOMMU_PLATFORM when necessary Jean-Philippe Brucker
2017-04-07 19:24   ` [RFC PATCH kvmtool 12/15] vfio: add support for virtual IOMMU Jean-Philippe Brucker
2017-04-07 19:24   ` [RFC PATCH kvmtool 13/15] virtio-iommu: debug via IPC Jean-Philippe Brucker
2017-04-07 19:24   ` [RFC PATCH kvmtool 14/15] virtio-iommu: implement basic debug commands Jean-Philippe Brucker
2017-04-07 19:24   ` [RFC PATCH kvmtool 15/15] virtio: use virtio-iommu when available Jean-Philippe Brucker
2017-05-22  8:26   ` [RFC PATCH kvmtool 00/15] Add virtio-iommu Bharat Bhushan
2017-05-22 14:01     ` Jean-Philippe Brucker
2017-04-07 21:19 ` [RFC 0/3] virtio-iommu: a paravirtualized IOMMU Michael S. Tsirkin
2017-04-10 18:39   ` Jean-Philippe Brucker
2017-04-10 20:04     ` Michael S. Tsirkin
2017-04-10  2:30 ` Need information on type 2 IOMMU valmiki
2017-04-10  4:19   ` Alex Williamson
2017-04-12  9:06 ` [RFC 0/3] virtio-iommu: a paravirtualized IOMMU Jason Wang
2017-04-13  8:16   ` Tian, Kevin
2017-04-13 13:12     ` Jean-Philippe Brucker
2017-04-13  8:41 ` Tian, Kevin
2017-04-13 13:12   ` Jean-Philippe Brucker
