* [PATCH v6 kvmtool 00/13] Add vfio-pci support
From: Jean-Philippe Brucker @ 2018-06-18 18:41 UTC
  To: kvm
  Cc: lorenzo.pieralisi, marc.zyngier, punit.agrawal, will.deacon,
	alex.williamson, robin.murphy, kvmarm

This is version six of the VFIO support in kvmtool. It addresses
Will's comments for v5.

Implement vfio-pci support in kvmtool. Devices are assigned to the guest
by passing "--vfio-pci [domain:]bus:dev.fn" to lkvm run, after binding
the device to the VFIO driver (see Documentation/vfio.txt).

This time around I also tested assignment of the x540 NIC on my old x86
desktop (previously only on AMD Seattle, an arm64 host). It worked
smoothly.

v6: git://linux-arm.org/kvmtool-jpb.git vfio/v6
v5: https://www.spinics.net/lists/kvm/msg166119.html

Jean-Philippe Brucker (13):
  pci: add config operations callbacks on the PCI header
  pci: allow to specify IRQ type for PCI devices
  irq: add irqfd helpers
  Extend memory bank API with memory types
  pci: add capability helpers
  Import VFIO headers
  Add fls_long and roundup_pow_of_two helpers
  Add PCI device passthrough using VFIO
  vfio-pci: add MSI-X support
  vfio-pci: add MSI support
  vfio: Support non-mmappable regions
  Introduce reserved memory regions
  vfio: check reserved regions before mapping DMA

 Makefile                     |    2 +
 arm/gic.c                    |   76 ++-
 arm/include/arm-common/gic.h |    6 +
 arm/kvm.c                    |    2 +-
 arm/pci.c                    |    4 +-
 builtin-run.c                |    5 +
 hw/pci-shmem.c               |   12 +-
 hw/vesa.c                    |    2 +-
 include/kvm/irq.h            |   17 +
 include/kvm/kvm-config.h     |    3 +
 include/kvm/kvm.h            |   54 +-
 include/kvm/pci.h            |  118 +++-
 include/kvm/util.h           |   14 +
 include/kvm/vfio.h           |  127 ++++
 include/linux/vfio.h         |  719 ++++++++++++++++++++
 irq.c                        |   31 +
 kvm.c                        |   99 ++-
 mips/kvm.c                   |    6 +-
 pci.c                        |  105 +--
 powerpc/kvm.c                |    2 +-
 vfio/core.c                  |  676 +++++++++++++++++++
 vfio/pci.c                   | 1193 ++++++++++++++++++++++++++++++++++
 virtio/net.c                 |    9 +-
 virtio/scsi.c                |   10 +-
 x86/kvm.c                    |    6 +-
 25 files changed, 3175 insertions(+), 123 deletions(-)
 create mode 100644 include/kvm/vfio.h
 create mode 100644 include/linux/vfio.h
 create mode 100644 vfio/core.c
 create mode 100644 vfio/pci.c

-- 
2.17.0

* [PATCH v6 kvmtool 01/13] pci: add config operations callbacks on the PCI header
From: Jean-Philippe Brucker @ 2018-06-18 18:41 UTC
  To: kvm
  Cc: lorenzo.pieralisi, marc.zyngier, punit.agrawal, will.deacon,
	alex.williamson, robin.murphy, kvmarm

When implementing PCI device passthrough, we will need to forward config
accesses from a guest to the VFIO driver. Add a private cfg_ops structure
to the PCI header, and use it in the PCI config access functions.

A read from the guest first calls into the device's cfg_ops.read, to let
the backend update the local header before filling the guest register.
The same happens for a write: we let the backend perform the write and
replace the guest-provided register with whatever sticks, before updating
the local header.

Try to untangle the PCI config access logic while we're at it.
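
As an illustration, a backend is expected to hook into the config space
accessors along these lines (a sketch only; the "foo" device is
hypothetical and not part of this patch):

#include "kvm/pci.h"

static void foo_pci__cfg_read(struct kvm *kvm,
			      struct pci_device_header *pci_hdr,
			      u8 offset, void *data, int sz)
{
	/* Refresh the cached header; the generic code then copies it
	 * into the guest-visible register. */
}

static void foo_pci__cfg_write(struct kvm *kvm,
			       struct pci_device_header *pci_hdr,
			       u8 offset, void *data, int sz)
{
	/* Apply the write to the backend and update *data with the
	 * value that sticks; the generic code then copies it into the
	 * cached header. */
}

static struct pci_device_header foo_pci_hdr = {
	.cfg_ops = {
		.read	= foo_pci__cfg_read,
		.write	= foo_pci__cfg_write,
	},
};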

Reviewed-by: Punit Agrawal <punit.agrawal@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
[JPB: moved to a separate patch]
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
---
 include/kvm/pci.h | 72 +++++++++++++++++++++++++-------------
 pci.c             | 89 ++++++++++++++++++++++-------------------------
 2 files changed, 89 insertions(+), 72 deletions(-)

diff --git a/include/kvm/pci.h b/include/kvm/pci.h
index b0c28a10a..56649d87d 100644
--- a/include/kvm/pci.h
+++ b/include/kvm/pci.h
@@ -57,33 +57,55 @@ struct msix_cap {
 	u32 pba_offset;
 };
 
+#define PCI_BAR_OFFSET(b)	(offsetof(struct pci_device_header, bar[b]))
+#define PCI_DEV_CFG_SIZE	256
+#define PCI_DEV_CFG_MASK	(PCI_DEV_CFG_SIZE - 1)
+
+struct pci_device_header;
+
+struct pci_config_operations {
+	void (*write)(struct kvm *kvm, struct pci_device_header *pci_hdr,
+		      u8 offset, void *data, int sz);
+	void (*read)(struct kvm *kvm, struct pci_device_header *pci_hdr,
+		     u8 offset, void *data, int sz);
+};
+
 struct pci_device_header {
-	u16		vendor_id;
-	u16		device_id;
-	u16		command;
-	u16		status;
-	u8		revision_id;
-	u8		class[3];
-	u8		cacheline_size;
-	u8		latency_timer;
-	u8		header_type;
-	u8		bist;
-	u32		bar[6];
-	u32		card_bus;
-	u16		subsys_vendor_id;
-	u16		subsys_id;
-	u32		exp_rom_bar;
-	u8		capabilities;
-	u8		reserved1[3];
-	u32		reserved2;
-	u8		irq_line;
-	u8		irq_pin;
-	u8		min_gnt;
-	u8		max_lat;
-	struct msix_cap msix;
-	u8		empty[136]; /* Rest of PCI config space */
+	/* Configuration space, as seen by the guest */
+	union {
+		struct {
+			u16		vendor_id;
+			u16		device_id;
+			u16		command;
+			u16		status;
+			u8		revision_id;
+			u8		class[3];
+			u8		cacheline_size;
+			u8		latency_timer;
+			u8		header_type;
+			u8		bist;
+			u32		bar[6];
+			u32		card_bus;
+			u16		subsys_vendor_id;
+			u16		subsys_id;
+			u32		exp_rom_bar;
+			u8		capabilities;
+			u8		reserved1[3];
+			u32		reserved2;
+			u8		irq_line;
+			u8		irq_pin;
+			u8		min_gnt;
+			u8		max_lat;
+			struct msix_cap msix;
+		} __attribute__((packed));
+		/* Pad to PCI config space size */
+		u8	__pad[PCI_DEV_CFG_SIZE];
+	};
+
+	/* Private to lkvm */
 	u32		bar_size[6];
-} __attribute__((packed));
+	struct pci_config_operations	cfg_ops;
+};
 
 int pci__init(struct kvm *kvm);
 int pci__exit(struct kvm *kvm);
diff --git a/pci.c b/pci.c
index 3a6696c54..e48e24b8c 100644
--- a/pci.c
+++ b/pci.c
@@ -8,8 +8,6 @@
 #include <linux/err.h>
 #include <assert.h>
 
-#define PCI_BAR_OFFSET(b)		(offsetof(struct pci_device_header, bar[b]))
-
 static u32 pci_config_address_bits;
 
 /* This is within our PCI gap - in an unused area.
@@ -131,59 +129,56 @@ static struct ioport_operations pci_config_data_ops = {
 
 void pci__config_wr(struct kvm *kvm, union pci_config_address addr, void *data, int size)
 {
-	u8 dev_num;
-
-	dev_num	= addr.device_number;
-
-	if (pci_device_exists(0, dev_num, 0)) {
-		unsigned long offset;
-
-		offset = addr.w & 0xff;
-		if (offset < sizeof(struct pci_device_header)) {
-			void *p = device__find_dev(DEVICE_BUS_PCI, dev_num)->data;
-			struct pci_device_header *hdr = p;
-			u8 bar = (offset - PCI_BAR_OFFSET(0)) / (sizeof(u32));
-			u32 sz = cpu_to_le32(PCI_IO_SIZE);
-
-			if (bar < 6 && hdr->bar_size[bar])
-				sz = hdr->bar_size[bar];
-
-			/*
-			 * If the kernel masks the BAR it would expect to find the
-			 * size of the BAR there next time it reads from it.
-			 * When the kernel got the size it would write the address
-			 * back.
-			 */
-			if (*(u32 *)(p + offset)) {
-				/* See if kernel tries to mask one of the BARs */
-				if ((offset >= PCI_BAR_OFFSET(0)) &&
-				    (offset <= PCI_BAR_OFFSET(6)) &&
-				    (ioport__read32(data)  == 0xFFFFFFFF))
-					memcpy(p + offset, &sz, sizeof(sz));
-				    else
-					memcpy(p + offset, data, size);
-			}
-		}
+	void *base;
+	u8 bar, offset;
+	struct pci_device_header *pci_hdr;
+	u8 dev_num = addr.device_number;
+
+	if (!pci_device_exists(addr.bus_number, dev_num, 0))
+		return;
+
+	offset = addr.w & PCI_DEV_CFG_MASK;
+	base = pci_hdr = device__find_dev(DEVICE_BUS_PCI, dev_num)->data;
+
+	if (pci_hdr->cfg_ops.write)
+		pci_hdr->cfg_ops.write(kvm, pci_hdr, offset, data, size);
+
+	/*
+	 * legacy hack: ignore writes to uninitialized regions (e.g. ROM BAR).
+	 * Not very nice but has been working so far.
+	 */
+	if (*(u32 *)(base + offset) == 0)
+		return;
+
+	bar = (offset - PCI_BAR_OFFSET(0)) / sizeof(u32);
+
+	/*
+	 * If the kernel masks the BAR it would expect to find the size of the
+	 * BAR there next time it reads from it. When the kernel got the size it
+	 * would write the address back.
+	 */
+	if (bar < 6 && ioport__read32(data) == 0xFFFFFFFF) {
+		u32 sz = pci_hdr->bar_size[bar];
+		memcpy(base + offset, &sz, sizeof(sz));
+	} else {
+		memcpy(base + offset, data, size);
 	}
 }
 
 void pci__config_rd(struct kvm *kvm, union pci_config_address addr, void *data, int size)
 {
-	u8 dev_num;
-
-	dev_num	= addr.device_number;
+	u8 offset;
+	struct pci_device_header *pci_hdr;
+	u8 dev_num = addr.device_number;
 
-	if (pci_device_exists(0, dev_num, 0)) {
-		unsigned long offset;
+	if (pci_device_exists(addr.bus_number, dev_num, 0)) {
+		pci_hdr = device__find_dev(DEVICE_BUS_PCI, dev_num)->data;
+		offset = addr.w & PCI_DEV_CFG_MASK;
 
-		offset = addr.w & 0xff;
-		if (offset < sizeof(struct pci_device_header)) {
-			void *p = device__find_dev(DEVICE_BUS_PCI, dev_num)->data;
+		if (pci_hdr->cfg_ops.read)
+			pci_hdr->cfg_ops.read(kvm, pci_hdr, offset, data, size);
 
-			memcpy(data, p + offset, size);
-		} else {
-			memset(data, 0x00, size);
-		}
+		memcpy(data, (void *)pci_hdr + offset, size);
 	} else {
 		memset(data, 0xff, size);
 	}
-- 
2.17.0

* [PATCH v6 kvmtool 02/13] pci: allow to specify IRQ type for PCI devices
From: Jean-Philippe Brucker @ 2018-06-18 18:42 UTC
  To: kvm
  Cc: lorenzo.pieralisi, marc.zyngier, punit.agrawal, will.deacon,
	alex.williamson, robin.murphy, kvmarm

Currently all our virtual device interrupts are edge-triggered, but we're
going to need level-triggered interrupts when passing through physical
devices. Let each device configure its interrupt kind, keeping edge as the
default to avoid changing existing users.
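
A passthrough device will request level-triggered INTx# roughly like this
(a sketch; "foo" is hypothetical, IRQ_TYPE_LEVEL_HIGH comes from the fdt
helpers pulled in by kvm/pci.h):

#include "kvm/pci.h"

static struct pci_device_header foo_pci_hdr = {
	/* Physical INTx# lines are level-triggered */
	.irq_type = IRQ_TYPE_LEVEL_HIGH,
};

Since pci__assign_irq() only fills in irq_type when it is still unset,
this choice survives device registration.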

Reviewed-by: Punit Agrawal <punit.agrawal@arm.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
---
 arm/pci.c         | 3 ++-
 include/kvm/pci.h | 6 ++++++
 pci.c             | 3 +++
 3 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/arm/pci.c b/arm/pci.c
index 813df26a8..744b14c26 100644
--- a/arm/pci.c
+++ b/arm/pci.c
@@ -77,6 +77,7 @@ void pci__generate_fdt_nodes(void *fdt)
 		u8 dev_num = dev_hdr->dev_num;
 		u8 pin = pci_hdr->irq_pin;
 		u8 irq = pci_hdr->irq_line;
+		u32 irq_flags = pci_hdr->irq_type;
 
 		*entry = (struct of_interrupt_map_entry) {
 			.pci_irq_mask = {
@@ -93,7 +94,7 @@ void pci__generate_fdt_nodes(void *fdt)
 			.gic_irq = {
 				.type	= cpu_to_fdt32(GIC_FDT_IRQ_TYPE_SPI),
 				.num	= cpu_to_fdt32(irq - GIC_SPI_IRQ_BASE),
-				.flags	= cpu_to_fdt32(IRQ_TYPE_EDGE_RISING),
+				.flags	= cpu_to_fdt32(irq_flags),
 			},
 		};
 
diff --git a/include/kvm/pci.h b/include/kvm/pci.h
index 56649d87d..5d9c0f3b0 100644
--- a/include/kvm/pci.h
+++ b/include/kvm/pci.h
@@ -9,6 +9,7 @@
 #include "kvm/devices.h"
 #include "kvm/kvm.h"
 #include "kvm/msi.h"
+#include "kvm/fdt.h"
 
 /*
  * PCI Configuration Mechanism #1 I/O ports. See Section 3.7.4.1.
@@ -105,6 +106,11 @@ struct pci_device_header {
 	/* Private to lkvm */
 	u32		bar_size[6];
 	struct pci_config_operations	cfg_ops;
+	/*
+	 * PCI INTx# are level-triggered, but virtual devices often feature
+	 * edge-triggered INTx# for convenience.
+	 */
+	enum irq_type	irq_type;
 };
 
 int pci__init(struct kvm *kvm);
diff --git a/pci.c b/pci.c
index e48e24b8c..5a8c2ef41 100644
--- a/pci.c
+++ b/pci.c
@@ -39,6 +39,9 @@ void pci__assign_irq(struct device_header *dev_hdr)
 	 */
 	pci_hdr->irq_pin	= 1;
 	pci_hdr->irq_line	= irq__alloc_line();
+
+	if (!pci_hdr->irq_type)
+		pci_hdr->irq_type = IRQ_TYPE_EDGE_RISING;
 }
 
 static void *pci_config_address_ptr(u16 port)
-- 
2.17.0

* [PATCH v6 kvmtool 03/13] irq: add irqfd helpers
From: Jean-Philippe Brucker @ 2018-06-18 18:42 UTC
  To: kvm
  Cc: lorenzo.pieralisi, marc.zyngier, punit.agrawal, will.deacon,
	alex.williamson, robin.murphy, kvmarm

Add helpers to add and remove IRQFD routing for both irqchips and MSIs.
We have to make a special case of IRQ lines on ARM where the
initialisation order goes like this:

 (1) Devices reserve their IRQ lines
 (2) VGIC is setup with VGIC_CTRL_INIT (in a late_init call)
 (3) MSIs are reserved lazily, when the guest needs them

Since we cannot set up IRQFD before (2), store the IRQFD routing for IRQ
lines temporarily until we're ready to submit them.
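
Users on all architectures go through the same helper (a sketch; the
"foo" device code is hypothetical):

#include <sys/eventfd.h>
#include "kvm/irq.h"

static int foo__enable_irqfd(struct kvm *kvm, unsigned int gsi)
{
	int trigger_fd = eventfd(0, 0);
	int resample_fd = eventfd(0, 0);

	/*
	 * With a resample_fd, KVM notifies us on guest EOI, which suits
	 * level-triggered interrupts. Pass resample_fd = -1 for a plain
	 * edge-triggered irqfd. On ARM, the routing is queued until the
	 * VGIC is initialised (step (2) above).
	 */
	return irq__add_irqfd(kvm, gsi, trigger_fd, resample_fd);
}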

Reviewed-by: Punit Agrawal <punit.agrawal@arm.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
---
 arm/gic.c                    | 76 +++++++++++++++++++++++++++++++++++-
 arm/include/arm-common/gic.h |  6 +++
 hw/pci-shmem.c               |  8 +---
 include/kvm/irq.h            | 17 ++++++++
 irq.c                        | 31 +++++++++++++++
 virtio/net.c                 |  9 +----
 virtio/scsi.c                | 10 ++---
 7 files changed, 135 insertions(+), 22 deletions(-)

diff --git a/arm/gic.c b/arm/gic.c
index 238a75c70..abcbcc091 100644
--- a/arm/gic.c
+++ b/arm/gic.c
@@ -17,6 +17,16 @@ static u64 gic_redists_base;
 static u64 gic_redists_size;
 static u64 gic_msi_base;
 static u64 gic_msi_size = 0;
+static bool vgic_is_init = false;
+
+struct kvm_irqfd_line {
+	unsigned int		gsi;
+	int			trigger_fd;
+	int			resample_fd;
+	struct list_head	list;
+};
+
+static LIST_HEAD(irqfd_lines);
 
 int irqchip_parser(const struct option *opt, const char *arg, int unset)
 {
@@ -38,6 +48,26 @@ int irqchip_parser(const struct option *opt, const char *arg, int unset)
 	return 0;
 }
 
+static int irq__setup_irqfd_lines(struct kvm *kvm)
+{
+	int ret;
+	struct kvm_irqfd_line *line, *tmp;
+
+	list_for_each_entry_safe(line, tmp, &irqfd_lines, list) {
+		ret = irq__common_add_irqfd(kvm, line->gsi, line->trigger_fd,
+					    line->resample_fd);
+		if (ret < 0) {
+			pr_err("Failed to register IRQFD");
+			return ret;
+		}
+
+		list_del(&line->list);
+		free(line);
+	}
+
+	return 0;
+}
+
 static int irq__routing_init(struct kvm *kvm)
 {
 	int r;
@@ -292,7 +322,9 @@ static int gic__init_gic(struct kvm *kvm)
 	kvm->msix_needs_devid = kvm__supports_vm_extension(kvm,
 							   KVM_CAP_MSI_DEVID);
 
-	return 0;
+	vgic_is_init = true;
+
+	return irq__setup_irqfd_lines(kvm);
 }
 late_init(gic__init_gic)
 
@@ -372,3 +404,45 @@ void kvm__irq_trigger(struct kvm *kvm, int irq)
 	kvm__irq_line(kvm, irq, VIRTIO_IRQ_HIGH);
 	kvm__irq_line(kvm, irq, VIRTIO_IRQ_LOW);
 }
+
+int gic__add_irqfd(struct kvm *kvm, unsigned int gsi, int trigger_fd,
+		   int resample_fd)
+{
+	struct kvm_irqfd_line *line;
+
+	if (vgic_is_init)
+		return irq__common_add_irqfd(kvm, gsi, trigger_fd, resample_fd);
+
+	/* Postpone the routing setup until we have a distributor */
+	line = malloc(sizeof(*line));
+	if (!line)
+		return -ENOMEM;
+
+	*line = (struct kvm_irqfd_line) {
+		.gsi		= gsi,
+		.trigger_fd	= trigger_fd,
+		.resample_fd	= resample_fd,
+	};
+	list_add(&line->list, &irqfd_lines);
+
+	return 0;
+}
+
+void gic__del_irqfd(struct kvm *kvm, unsigned int gsi, int trigger_fd)
+{
+	struct kvm_irqfd_line *line;
+
+	if (vgic_is_init) {
+		irq__common_del_irqfd(kvm, gsi, trigger_fd);
+		return;
+	}
+
+	list_for_each_entry(line, &irqfd_lines, list) {
+		if (line->gsi != gsi)
+			continue;
+
+		list_del(&line->list);
+		free(line);
+		break;
+	}
+}
diff --git a/arm/include/arm-common/gic.h b/arm/include/arm-common/gic.h
index ae253c059..1125d601f 100644
--- a/arm/include/arm-common/gic.h
+++ b/arm/include/arm-common/gic.h
@@ -37,4 +37,10 @@ int gic__create(struct kvm *kvm, enum irqchip_type type);
 int gic__create_gicv2m_frame(struct kvm *kvm, u64 msi_frame_addr);
 void gic__generate_fdt_nodes(void *fdt, enum irqchip_type type);
 
+int gic__add_irqfd(struct kvm *kvm, unsigned int gsi, int trigger_fd,
+		   int resample_fd);
+void gic__del_irqfd(struct kvm *kvm, unsigned int gsi, int trigger_fd);
+#define irq__add_irqfd gic__add_irqfd
+#define irq__del_irqfd gic__del_irqfd
+
 #endif /* ARM_COMMON__GIC_H */
diff --git a/hw/pci-shmem.c b/hw/pci-shmem.c
index 512b5b069..107043e9d 100644
--- a/hw/pci-shmem.c
+++ b/hw/pci-shmem.c
@@ -127,7 +127,6 @@ static void callback_mmio_msix(struct kvm_cpu *vcpu, u64 addr, u8 *data, u32 len
 int pci_shmem__get_local_irqfd(struct kvm *kvm)
 {
 	int fd, gsi, r;
-	struct kvm_irqfd irqfd;
 
 	if (local_fd == 0) {
 		fd = eventfd(0, 0);
@@ -143,12 +142,7 @@ int pci_shmem__get_local_irqfd(struct kvm *kvm)
 			gsi = pci_shmem_pci_device.irq_line;
 		}
 
-		irqfd = (struct kvm_irqfd) {
-			.fd = fd,
-			.gsi = gsi,
-		};
-
-		r = ioctl(kvm->vm_fd, KVM_IRQFD, &irqfd);
+		r = irq__add_irqfd(kvm, gsi, fd, -1);
 		if (r < 0)
 			return r;
 
diff --git a/include/kvm/irq.h b/include/kvm/irq.h
index 8ba8b7405..2a3f8c9dc 100644
--- a/include/kvm/irq.h
+++ b/include/kvm/irq.h
@@ -7,6 +7,7 @@
 #include <linux/list.h>
 #include <linux/kvm.h>
 
+#include "kvm/kvm-arch.h"
 #include "kvm/msi.h"
 
 struct kvm;
@@ -35,4 +36,20 @@ void irq__update_msix_route(struct kvm *kvm, u32 gsi, struct msi_msg *msg);
 bool irq__can_signal_msi(struct kvm *kvm);
 int irq__signal_msi(struct kvm *kvm, struct kvm_msi *msi);
 
+/*
+ * The function takes two eventfd arguments, trigger_fd and resample_fd. If
+ * resample_fd is <= 0, resampling is disabled and the IRQ is edge-triggered
+ */
+int irq__common_add_irqfd(struct kvm *kvm, unsigned int gsi, int trigger_fd,
+			   int resample_fd);
+void irq__common_del_irqfd(struct kvm *kvm, unsigned int gsi, int trigger_fd);
+
+#ifndef irq__add_irqfd
+#define irq__add_irqfd irq__common_add_irqfd
+#endif
+
+#ifndef irq__del_irqfd
+#define irq__del_irqfd irq__common_del_irqfd
+#endif
+
 #endif
diff --git a/irq.c b/irq.c
index c89604cc1..cdcf99233 100644
--- a/irq.c
+++ b/irq.c
@@ -170,6 +170,37 @@ void irq__update_msix_route(struct kvm *kvm, u32 gsi, struct msi_msg *msg)
 		die_perror("KVM_SET_GSI_ROUTING");
 }
 
+int irq__common_add_irqfd(struct kvm *kvm, unsigned int gsi, int trigger_fd,
+			   int resample_fd)
+{
+	struct kvm_irqfd irqfd = {
+		.fd		= trigger_fd,
+		.gsi		= gsi,
+		.flags		= resample_fd > 0 ? KVM_IRQFD_FLAG_RESAMPLE : 0,
+		.resamplefd	= resample_fd,
+	};
+
+	/* If we emulate MSI routing, translate the MSI to the corresponding IRQ */
+	if (msi_routing_ops->translate_gsi)
+		irqfd.gsi = msi_routing_ops->translate_gsi(kvm, gsi);
+
+	return ioctl(kvm->vm_fd, KVM_IRQFD, &irqfd);
+}
+
+void irq__common_del_irqfd(struct kvm *kvm, unsigned int gsi, int trigger_fd)
+{
+	struct kvm_irqfd irqfd = {
+		.fd		= trigger_fd,
+		.gsi		= gsi,
+		.flags		= KVM_IRQFD_FLAG_DEASSIGN,
+	};
+
+	if (msi_routing_ops->translate_gsi)
+		irqfd.gsi = msi_routing_ops->translate_gsi(kvm, gsi);
+
+	ioctl(kvm->vm_fd, KVM_IRQFD, &irqfd);
+}
+
 int __attribute__((weak)) irq__exit(struct kvm *kvm)
 {
 	free(irq_routing);
diff --git a/virtio/net.c b/virtio/net.c
index 419a5e301..f95258caa 100644
--- a/virtio/net.c
+++ b/virtio/net.c
@@ -602,23 +602,18 @@ static int init_vq(struct kvm *kvm, void *dev, u32 vq, u32 page_size, u32 align,
 static void notify_vq_gsi(struct kvm *kvm, void *dev, u32 vq, u32 gsi)
 {
 	struct net_dev *ndev = dev;
-	struct kvm_irqfd irq;
 	struct vhost_vring_file file;
 	int r;
 
 	if (ndev->vhost_fd == 0)
 		return;
 
-	irq = (struct kvm_irqfd) {
-		.gsi	= gsi,
-		.fd	= eventfd(0, 0),
-	};
 	file = (struct vhost_vring_file) {
 		.index	= vq,
-		.fd	= irq.fd,
+		.fd	= eventfd(0, 0),
 	};
 
-	r = ioctl(kvm->vm_fd, KVM_IRQFD, &irq);
+	r = irq__add_irqfd(kvm, gsi, file.fd, -1);
 	if (r < 0)
 		die_perror("KVM_IRQFD failed");
 
diff --git a/virtio/scsi.c b/virtio/scsi.c
index 58d2353a1..a429ac85a 100644
--- a/virtio/scsi.c
+++ b/virtio/scsi.c
@@ -1,6 +1,7 @@
 #include "kvm/virtio-scsi.h"
 #include "kvm/virtio-pci-dev.h"
 #include "kvm/disk-image.h"
+#include "kvm/irq.h"
 #include "kvm/kvm.h"
 #include "kvm/pci.h"
 #include "kvm/ioeventfd.h"
@@ -97,22 +98,17 @@ static void notify_vq_gsi(struct kvm *kvm, void *dev, u32 vq, u32 gsi)
 {
 	struct vhost_vring_file file;
 	struct scsi_dev *sdev = dev;
-	struct kvm_irqfd irq;
 	int r;
 
 	if (sdev->vhost_fd == 0)
 		return;
 
-	irq = (struct kvm_irqfd) {
-		.gsi	= gsi,
-		.fd	= eventfd(0, 0),
-	};
 	file = (struct vhost_vring_file) {
 		.index	= vq,
-		.fd	= irq.fd,
+		.fd	= eventfd(0, 0),
 	};
 
-	r = ioctl(kvm->vm_fd, KVM_IRQFD, &irq);
+	r = irq__add_irqfd(kvm, gsi, file.fd, -1);
 	if (r < 0)
 		die_perror("KVM_IRQFD failed");
 
-- 
2.17.0

* [PATCH v6 kvmtool 04/13] Extend memory bank API with memory types
From: Jean-Philippe Brucker @ 2018-06-18 18:42 UTC
  To: kvm
  Cc: lorenzo.pieralisi, marc.zyngier, punit.agrawal, will.deacon,
	alex.williamson, robin.murphy, kvmarm

Introduce memory types RAM and DEVICE, along with a way for subsystems to
query the global memory banks. This is required by VFIO, which will need
to pin and map guest RAM so that assigned devices can safely do DMA to it.
Depending on the architecture, the physical map is made of either one or
two RAM regions. In addition, this new memory types API paves the way for
the reserved memory regions introduced in a subsequent patch.

For the moment we put vesa and ivshmem memory into the DEVICE category, so
they don't have to be pinned. This means that physical devices assigned
with VFIO won't be able to DMA to the vesa frame buffer or ivshmem. In
order to do that, simply changing the type to "RAM" would work. But to
keep the types consistent, it would be better to introduce flags such as
KVM_MEM_TYPE_DMA that would complement both the RAM and DEVICE types. We
could then reuse the API for generating firmware information (that is,
for the x86 BIOS; DT already supports a reserved-memory description).
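
As an illustration, the VFIO code added later in this series will walk
guest RAM with the new iterator, along these lines (a sketch; the "foo"
callback is hypothetical):

#include "kvm/kvm.h"
#include "kvm/util.h"

static int foo__map_bank(struct kvm *kvm, struct kvm_mem_bank *bank,
			 void *data)
{
	pr_info("would pin and map %s bank 0x%llx-0x%llx for DMA",
		kvm_mem_type_to_string(bank->type),
		(unsigned long long)bank->guest_phys_addr,
		(unsigned long long)(bank->guest_phys_addr + bank->size - 1));

	return 0;
}

static int foo__map_guest_ram(struct kvm *kvm)
{
	return kvm__for_each_mem_bank(kvm, KVM_MEM_TYPE_RAM,
				      foo__map_bank, NULL);
}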

Reviewed-by: Punit Agrawal <punit.agrawal@arm.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
---
 arm/kvm.c         |  2 +-
 hw/pci-shmem.c    |  4 ++--
 hw/vesa.c         |  2 +-
 include/kvm/kvm.h | 44 +++++++++++++++++++++++++++++++++++++++++++-
 kvm.c             | 31 ++++++++++++++++++++++++++++++-
 mips/kvm.c        |  6 +++---
 powerpc/kvm.c     |  2 +-
 x86/kvm.c         |  6 +++---
 8 files changed, 84 insertions(+), 13 deletions(-)

diff --git a/arm/kvm.c b/arm/kvm.c
index 2ab436e8d..b824f637e 100644
--- a/arm/kvm.c
+++ b/arm/kvm.c
@@ -34,7 +34,7 @@ void kvm__init_ram(struct kvm *kvm)
 	phys_size	= kvm->ram_size;
 	host_mem	= kvm->ram_start;
 
-	err = kvm__register_mem(kvm, phys_start, phys_size, host_mem);
+	err = kvm__register_ram(kvm, phys_start, phys_size, host_mem);
 	if (err)
 		die("Failed to register %lld bytes of memory at physical "
 		    "address 0x%llx [err %d]", phys_size, phys_start, err);
diff --git a/hw/pci-shmem.c b/hw/pci-shmem.c
index 107043e9d..f92bc7554 100644
--- a/hw/pci-shmem.c
+++ b/hw/pci-shmem.c
@@ -387,8 +387,8 @@ int pci_shmem__init(struct kvm *kvm)
 	if (mem == NULL)
 		return -EINVAL;
 
-	kvm__register_mem(kvm, shmem_region->phys_addr, shmem_region->size,
-			  mem);
+	kvm__register_dev_mem(kvm, shmem_region->phys_addr, shmem_region->size,
+			      mem);
 	return 0;
 }
 dev_init(pci_shmem__init);
diff --git a/hw/vesa.c b/hw/vesa.c
index a9a1d3e2c..f3c5114cf 100644
--- a/hw/vesa.c
+++ b/hw/vesa.c
@@ -73,7 +73,7 @@ struct framebuffer *vesa__init(struct kvm *kvm)
 	if (mem == MAP_FAILED)
 		ERR_PTR(-errno);
 
-	kvm__register_mem(kvm, VESA_MEM_ADDR, VESA_MEM_SIZE, mem);
+	kvm__register_dev_mem(kvm, VESA_MEM_ADDR, VESA_MEM_SIZE, mem);
 
 	vesafb = (struct framebuffer) {
 		.width			= VESA_WIDTH,
diff --git a/include/kvm/kvm.h b/include/kvm/kvm.h
index 90463b8c3..19f7d265c 100644
--- a/include/kvm/kvm.h
+++ b/include/kvm/kvm.h
@@ -34,6 +34,14 @@ enum {
 	KVM_VMSTATE_PAUSED,
 };
 
+enum kvm_mem_type {
+	KVM_MEM_TYPE_RAM	= 1 << 0,
+	KVM_MEM_TYPE_DEVICE	= 1 << 1,
+
+	KVM_MEM_TYPE_ALL	= KVM_MEM_TYPE_RAM
+				| KVM_MEM_TYPE_DEVICE
+};
+
 struct kvm_ext {
 	const char *name;
 	int code;
@@ -44,6 +52,7 @@ struct kvm_mem_bank {
 	u64			guest_phys_addr;
 	void			*host_addr;
 	u64			size;
+	enum kvm_mem_type	type;
 };
 
 struct kvm {
@@ -90,7 +99,22 @@ void kvm__irq_line(struct kvm *kvm, int irq, int level);
 void kvm__irq_trigger(struct kvm *kvm, int irq);
 bool kvm__emulate_io(struct kvm_cpu *vcpu, u16 port, void *data, int direction, int size, u32 count);
 bool kvm__emulate_mmio(struct kvm_cpu *vcpu, u64 phys_addr, u8 *data, u32 len, u8 is_write);
-int kvm__register_mem(struct kvm *kvm, u64 guest_phys, u64 size, void *userspace_addr);
+int kvm__register_mem(struct kvm *kvm, u64 guest_phys, u64 size, void *userspace_addr,
+		      enum kvm_mem_type type);
+static inline int kvm__register_ram(struct kvm *kvm, u64 guest_phys, u64 size,
+				    void *userspace_addr)
+{
+	return kvm__register_mem(kvm, guest_phys, size, userspace_addr,
+				 KVM_MEM_TYPE_RAM);
+}
+
+static inline int kvm__register_dev_mem(struct kvm *kvm, u64 guest_phys,
+					u64 size, void *userspace_addr)
+{
+	return kvm__register_mem(kvm, guest_phys, size, userspace_addr,
+				 KVM_MEM_TYPE_DEVICE);
+}
+
 int kvm__register_mmio(struct kvm *kvm, u64 phys_addr, u64 phys_addr_len, bool coalesce,
 		       void (*mmio_fn)(struct kvm_cpu *vcpu, u64 addr, u8 *data, u32 len, u8 is_write, void *ptr),
 			void *ptr);
@@ -117,6 +141,24 @@ u64 host_to_guest_flat(struct kvm *kvm, void *ptr);
 bool kvm__arch_load_kernel_image(struct kvm *kvm, int fd_kernel, int fd_initrd,
 				 const char *kernel_cmdline);
 
+static inline const char *kvm_mem_type_to_string(enum kvm_mem_type type)
+{
+	switch (type) {
+	case KVM_MEM_TYPE_ALL:
+		return "(all)";
+	case KVM_MEM_TYPE_RAM:
+		return "RAM";
+	case KVM_MEM_TYPE_DEVICE:
+		return "device";
+	}
+
+	return "???";
+}
+
+int kvm__for_each_mem_bank(struct kvm *kvm, enum kvm_mem_type type,
+			   int (*fun)(struct kvm *kvm, struct kvm_mem_bank *bank, void *data),
+			   void *data);
+
 /*
  * Debugging
  */
diff --git a/kvm.c b/kvm.c
index f8f2fdc2c..e9c3c5fcb 100644
--- a/kvm.c
+++ b/kvm.c
@@ -182,7 +182,8 @@ core_exit(kvm__exit);
  * memory regions to it. Therefore, be careful if you use this function for
  * registering memory regions for emulating hardware.
  */
-int kvm__register_mem(struct kvm *kvm, u64 guest_phys, u64 size, void *userspace_addr)
+int kvm__register_mem(struct kvm *kvm, u64 guest_phys, u64 size,
+		      void *userspace_addr, enum kvm_mem_type type)
 {
 	struct kvm_userspace_memory_region mem;
 	struct kvm_mem_bank *bank;
@@ -196,6 +197,7 @@ int kvm__register_mem(struct kvm *kvm, u64 guest_phys, u64 size, void *userspace
 	bank->guest_phys_addr		= guest_phys;
 	bank->host_addr			= userspace_addr;
 	bank->size			= size;
+	bank->type			= type;
 
 	mem = (struct kvm_userspace_memory_region) {
 		.slot			= kvm->mem_slots++,
@@ -245,6 +247,33 @@ u64 host_to_guest_flat(struct kvm *kvm, void *ptr)
 	return 0;
 }
 
+/*
+ * Iterate over each registered memory bank. Call @fun for each bank with @data
+ * as argument. @type is a bitmask that allows to filter banks according to
+ * their type.
+ *
+ * If one call to @fun returns a non-zero value, stop iterating and return the
+ * value. Otherwise, return zero.
+ */
+int kvm__for_each_mem_bank(struct kvm *kvm, enum kvm_mem_type type,
+			   int (*fun)(struct kvm *kvm, struct kvm_mem_bank *bank, void *data),
+			   void *data)
+{
+	int ret = 0;
+	struct kvm_mem_bank *bank;
+
+	list_for_each_entry(bank, &kvm->mem_banks, list) {
+		if (type != KVM_MEM_TYPE_ALL && !(bank->type & type))
+			continue;
+
+		ret = fun(kvm, bank, data);
+		if (ret)
+			break;
+	}
+
+	return ret;
+}
+
 int kvm__recommended_cpus(struct kvm *kvm)
 {
 	int ret;
diff --git a/mips/kvm.c b/mips/kvm.c
index 24bd65076..211770da0 100644
--- a/mips/kvm.c
+++ b/mips/kvm.c
@@ -28,21 +28,21 @@ void kvm__init_ram(struct kvm *kvm)
 		phys_size  = kvm->ram_size;
 		host_mem   = kvm->ram_start;
 
-		kvm__register_mem(kvm, phys_start, phys_size, host_mem);
+		kvm__register_ram(kvm, phys_start, phys_size, host_mem);
 	} else {
 		/* one region for memory that fits below MMIO range */
 		phys_start = 0;
 		phys_size  = KVM_MMIO_START;
 		host_mem   = kvm->ram_start;
 
-		kvm__register_mem(kvm, phys_start, phys_size, host_mem);
+		kvm__register_ram(kvm, phys_start, phys_size, host_mem);
 
 		/* one region for rest of memory */
 		phys_start = KVM_MMIO_START + KVM_MMIO_SIZE;
 		phys_size  = kvm->ram_size - KVM_MMIO_START;
 		host_mem   = kvm->ram_start + KVM_MMIO_START;
 
-		kvm__register_mem(kvm, phys_start, phys_size, host_mem);
+		kvm__register_ram(kvm, phys_start, phys_size, host_mem);
 	}
 }
 
diff --git a/powerpc/kvm.c b/powerpc/kvm.c
index c738c1d61..702d67dca 100644
--- a/powerpc/kvm.c
+++ b/powerpc/kvm.c
@@ -79,7 +79,7 @@ void kvm__init_ram(struct kvm *kvm)
 		    "overlaps MMIO!\n",
 		    phys_size);
 
-	kvm__register_mem(kvm, phys_start, phys_size, host_mem);
+	kvm__register_ram(kvm, phys_start, phys_size, host_mem);
 }
 
 void kvm__arch_set_cmdline(char *cmdline, bool video)
diff --git a/x86/kvm.c b/x86/kvm.c
index d8751e9a5..3e0f0b743 100644
--- a/x86/kvm.c
+++ b/x86/kvm.c
@@ -98,7 +98,7 @@ void kvm__init_ram(struct kvm *kvm)
 		phys_size  = kvm->ram_size;
 		host_mem   = kvm->ram_start;
 
-		kvm__register_mem(kvm, phys_start, phys_size, host_mem);
+		kvm__register_ram(kvm, phys_start, phys_size, host_mem);
 	} else {
 		/* First RAM range from zero to the PCI gap: */
 
@@ -106,7 +106,7 @@ void kvm__init_ram(struct kvm *kvm)
 		phys_size  = KVM_32BIT_GAP_START;
 		host_mem   = kvm->ram_start;
 
-		kvm__register_mem(kvm, phys_start, phys_size, host_mem);
+		kvm__register_ram(kvm, phys_start, phys_size, host_mem);
 
 		/* Second RAM range from 4GB to the end of RAM: */
 
@@ -114,7 +114,7 @@ void kvm__init_ram(struct kvm *kvm)
 		phys_size  = kvm->ram_size - phys_start;
 		host_mem   = kvm->ram_start + phys_start;
 
-		kvm__register_mem(kvm, phys_start, phys_size, host_mem);
+		kvm__register_ram(kvm, phys_start, phys_size, host_mem);
 	}
 }
 
-- 
2.17.0

* [PATCH v6 kvmtool 05/13] pci: add capability helpers
From: Jean-Philippe Brucker @ 2018-06-18 18:42 UTC
  To: kvm
  Cc: lorenzo.pieralisi, marc.zyngier, punit.agrawal, will.deacon,
	alex.williamson, robin.murphy, kvmarm

Add a way to iterate over all capabilities in a config space, along with
a search function for finding a specific capability.
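
For example, finding the MSI-X capability in a device's cached config
space becomes (a sketch; PCI_CAP_ID_MSIX is the standard capability ID
from <linux/pci_regs.h>, and "foo" is hypothetical):

#include <linux/pci_regs.h>
#include "kvm/pci.h"

static struct msix_cap *foo__find_msix(struct pci_device_header *hdr)
{
	/* Returns NULL when the capability list has no MSI-X entry */
	return pci_find_cap(hdr, PCI_CAP_ID_MSIX);
}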

Reviewed-by: Punit Agrawal <punit.agrawal@arm.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
---
 include/kvm/pci.h | 14 ++++++++++++++
 pci.c             | 13 +++++++++++++
 2 files changed, 27 insertions(+)

diff --git a/include/kvm/pci.h b/include/kvm/pci.h
index 5d9c0f3b0..01c244bcf 100644
--- a/include/kvm/pci.h
+++ b/include/kvm/pci.h
@@ -58,6 +58,11 @@ struct msix_cap {
 	u32 pba_offset;
 };
 
+struct pci_cap_hdr {
+	u8	type;
+	u8	next;
+};
+
 #define PCI_BAR_OFFSET(b)	(offsetof(struct pci_device_header, bar[b]))
 #define PCI_DEV_CFG_SIZE	256
 #define PCI_DEV_CFG_MASK	(PCI_DEV_CFG_SIZE - 1)
@@ -113,6 +118,13 @@ struct pci_device_header {
 	enum irq_type	irq_type;
 };
 
+#define PCI_CAP(pci_hdr, pos) ((void *)(pci_hdr) + (pos))
+
+#define pci_for_each_cap(pos, cap, hdr)				\
+	for ((pos) = (hdr)->capabilities & ~3;			\
+	     (cap) = PCI_CAP(hdr, pos), (pos) != 0;		\
+	     (pos) = ((struct pci_cap_hdr *)(cap))->next & ~3)
+
 int pci__init(struct kvm *kvm);
 int pci__exit(struct kvm *kvm);
 struct pci_device_header *pci__find_dev(u8 dev_num);
@@ -121,4 +133,6 @@ void pci__assign_irq(struct device_header *dev_hdr);
 void pci__config_wr(struct kvm *kvm, union pci_config_address addr, void *data, int size);
 void pci__config_rd(struct kvm *kvm, union pci_config_address addr, void *data, int size);
 
+void *pci_find_cap(struct pci_device_header *hdr, u8 cap_type);
+
 #endif /* KVM__PCI_H */
diff --git a/pci.c b/pci.c
index 5a8c2ef41..689869cb7 100644
--- a/pci.c
+++ b/pci.c
@@ -27,6 +27,19 @@ u32 pci_get_io_space_block(u32 size)
 	return block;
 }
 
+void *pci_find_cap(struct pci_device_header *hdr, u8 cap_type)
+{
+	u8 pos;
+	struct pci_cap_hdr *cap;
+
+	pci_for_each_cap(pos, cap, hdr) {
+		if (cap->type == cap_type)
+			return cap;
+	}
+
+	return NULL;
+}
+
 void pci__assign_irq(struct device_header *dev_hdr)
 {
 	struct pci_device_header *pci_hdr = dev_hdr->data;
-- 
2.17.0

* [PATCH v6 kvmtool 06/13] Import VFIO headers
From: Jean-Philippe Brucker @ 2018-06-18 18:42 UTC
  To: kvm
  Cc: lorenzo.pieralisi, marc.zyngier, punit.agrawal, will.deacon,
	alex.williamson, robin.murphy, kvmarm

To ensure consistency between kvmtool and the kernel, import the UAPI
headers of the VFIO version we implement. This is from Linux v4.12.
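
All of these ioctls follow the argsz convention described in the header:
the caller announces the size of the structure it passes, and the kernel
reports the size it actually needs back in argsz when the buffer is too
small. A typical call looks like this (a sketch; "foo" is hypothetical
and device_fd is a VFIO device file descriptor obtained with
VFIO_GROUP_GET_DEVICE_FD):

#include <sys/ioctl.h>
#include <linux/vfio.h>

static int foo__device_info(int device_fd, struct vfio_device_info *info)
{
	info->argsz = sizeof(*info);

	return ioctl(device_fd, VFIO_DEVICE_GET_INFO, info);
}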

Reviewed-by: Punit Agrawal <punit.agrawal@arm.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
---
 include/linux/vfio.h | 719 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 719 insertions(+)
 create mode 100644 include/linux/vfio.h

diff --git a/include/linux/vfio.h b/include/linux/vfio.h
new file mode 100644
index 000000000..4e7ab4c52
--- /dev/null
+++ b/include/linux/vfio.h
@@ -0,0 +1,719 @@
+/*
+ * VFIO API definition
+ *
+ * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
+ *     Author: Alex Williamson <alex.williamson@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#ifndef VFIO_H
+#define VFIO_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+#define VFIO_API_VERSION	0
+
+
+/* Kernel & User level defines for VFIO IOCTLs. */
+
+/* Extensions */
+
+#define VFIO_TYPE1_IOMMU		1
+#define VFIO_SPAPR_TCE_IOMMU		2
+#define VFIO_TYPE1v2_IOMMU		3
+/*
+ * IOMMU enforces DMA cache coherence (ex. PCIe NoSnoop stripping).  This
+ * capability is subject to change as groups are added or removed.
+ */
+#define VFIO_DMA_CC_IOMMU		4
+
+/* Check if EEH is supported */
+#define VFIO_EEH			5
+
+/* Two-stage IOMMU */
+#define VFIO_TYPE1_NESTING_IOMMU	6	/* Implies v2 */
+
+#define VFIO_SPAPR_TCE_v2_IOMMU		7
+
+/*
+ * The No-IOMMU IOMMU offers no translation or isolation for devices and
+ * supports no ioctls outside of VFIO_CHECK_EXTENSION.  Use of VFIO's No-IOMMU
+ * code will taint the host kernel and should be used with extreme caution.
+ */
+#define VFIO_NOIOMMU_IOMMU		8
+
+/*
+ * The IOCTL interface is designed for extensibility by embedding the
+ * structure length (argsz) and flags into structures passed between
+ * kernel and userspace.  We therefore use the _IO() macro for these
+ * defines to avoid implicitly embedding a size into the ioctl request.
+ * As structure fields are added, argsz will increase to match and flag
+ * bits will be defined to indicate additional fields with valid data.
+ * It's *always* the caller's responsibility to indicate the size of
+ * the structure passed by setting argsz appropriately.
+ */
+
+#define VFIO_TYPE	(';')
+#define VFIO_BASE	100
+
+/*
+ * For extension of INFO ioctls, VFIO makes use of a capability chain
+ * designed after PCI/e capabilities.  A flag bit indicates whether
+ * this capability chain is supported and a field defined in the fixed
+ * structure defines the offset of the first capability in the chain.
+ * This field is only valid when the corresponding bit in the flags
+ * bitmap is set.  This offset field is relative to the start of the
+ * INFO buffer, as is the next field within each capability header.
+ * The id within the header is a shared address space per INFO ioctl,
+ * while the version field is specific to the capability id.  The
+ * contents following the header are specific to the capability id.
+ */
+struct vfio_info_cap_header {
+	__u16	id;		/* Identifies capability */
+	__u16	version;	/* Version specific to the capability ID */
+	__u32	next;		/* Offset of next capability */
+};
+
+/*
+ * Callers of INFO ioctls passing insufficiently sized buffers will see
+ * the capability chain flag bit set, a zero value for the first capability
+ * offset (if available within the provided argsz), and argsz will be
+ * updated to report the necessary buffer size.  For compatibility, the
+ * INFO ioctl will not report error in this case, but the capability chain
+ * will not be available.
+ */
+
+/* -------- IOCTLs for VFIO file descriptor (/dev/vfio/vfio) -------- */
+
+/**
+ * VFIO_GET_API_VERSION - _IO(VFIO_TYPE, VFIO_BASE + 0)
+ *
+ * Report the version of the VFIO API.  This allows us to bump the entire
+ * API version should we later need to add or change features in incompatible
+ * ways.
+ * Return: VFIO_API_VERSION
+ * Availability: Always
+ */
+#define VFIO_GET_API_VERSION		_IO(VFIO_TYPE, VFIO_BASE + 0)
+
+/**
+ * VFIO_CHECK_EXTENSION - _IOW(VFIO_TYPE, VFIO_BASE + 1, __u32)
+ *
+ * Check whether an extension is supported.
+ * Return: 0 if not supported, 1 (or some other positive integer) if supported.
+ * Availability: Always
+ */
+#define VFIO_CHECK_EXTENSION		_IO(VFIO_TYPE, VFIO_BASE + 1)
+
+/**
+ * VFIO_SET_IOMMU - _IOW(VFIO_TYPE, VFIO_BASE + 2, __s32)
+ *
+ * Set the iommu to the given type.  The type must be supported by an
+ * iommu driver as verified by calling CHECK_EXTENSION using the same
+ * type.  A group must be set to this file descriptor before this
+ * ioctl is available.  The IOMMU interfaces enabled by this call are
+ * specific to the value set.
+ * Return: 0 on success, -errno on failure
+ * Availability: When VFIO group attached
+ */
+#define VFIO_SET_IOMMU			_IO(VFIO_TYPE, VFIO_BASE + 2)
+
+/* -------- IOCTLs for GROUP file descriptors (/dev/vfio/$GROUP) -------- */
+
+/**
+ * VFIO_GROUP_GET_STATUS - _IOR(VFIO_TYPE, VFIO_BASE + 3,
+ *						struct vfio_group_status)
+ *
+ * Retrieve information about the group.  Fills in provided
+ * struct vfio_group_info.  Caller sets argsz.
+ * Return: 0 on success, -errno on failure.
+ * Availability: Always
+ */
+struct vfio_group_status {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_GROUP_FLAGS_VIABLE		(1 << 0)
+#define VFIO_GROUP_FLAGS_CONTAINER_SET	(1 << 1)
+};
+#define VFIO_GROUP_GET_STATUS		_IO(VFIO_TYPE, VFIO_BASE + 3)
+
+/**
+ * VFIO_GROUP_SET_CONTAINER - _IOW(VFIO_TYPE, VFIO_BASE + 4, __s32)
+ *
+ * Set the container for the VFIO group to the open VFIO file
+ * descriptor provided.  Groups may only belong to a single
+ * container.  Containers may, at their discretion, support multiple
+ * groups.  Only when a container is set are all of the interfaces
+ * of the VFIO file descriptor and the VFIO group file descriptor
+ * available to the user.
+ * Return: 0 on success, -errno on failure.
+ * Availability: Always
+ */
+#define VFIO_GROUP_SET_CONTAINER	_IO(VFIO_TYPE, VFIO_BASE + 4)
+
+/**
+ * VFIO_GROUP_UNSET_CONTAINER - _IO(VFIO_TYPE, VFIO_BASE + 5)
+ *
+ * Remove the group from the attached container.  This is the
+ * opposite of the SET_CONTAINER call and returns the group to
+ * an initial state.  All device file descriptors must be released
+ * prior to calling this interface.  When removing the last group
+ * from a container, the IOMMU will be disabled and all state lost,
+ * effectively also returning the VFIO file descriptor to an initial
+ * state.
+ * Return: 0 on success, -errno on failure.
+ * Availability: When attached to container
+ */
+#define VFIO_GROUP_UNSET_CONTAINER	_IO(VFIO_TYPE, VFIO_BASE + 5)
+
+/**
+ * VFIO_GROUP_GET_DEVICE_FD - _IOW(VFIO_TYPE, VFIO_BASE + 6, char)
+ *
+ * Return a new file descriptor for the device object described by
+ * the provided string.  The string should match a device listed in
+ * the devices subdirectory of the IOMMU group sysfs entry.  The
+ * group containing the device must already be added to this context.
+ * Return: new file descriptor on success, -errno on failure.
+ * Availability: When attached to container
+ */
+#define VFIO_GROUP_GET_DEVICE_FD	_IO(VFIO_TYPE, VFIO_BASE + 6)
+
+/* --------------- IOCTLs for DEVICE file descriptors --------------- */
+
+/**
+ * VFIO_DEVICE_GET_INFO - _IOR(VFIO_TYPE, VFIO_BASE + 7,
+ *						struct vfio_device_info)
+ *
+ * Retrieve information about the device.  Fills in provided
+ * struct vfio_device_info.  Caller sets argsz.
+ * Return: 0 on success, -errno on failure.
+ */
+struct vfio_device_info {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_DEVICE_FLAGS_RESET	(1 << 0)	/* Device supports reset */
+#define VFIO_DEVICE_FLAGS_PCI	(1 << 1)	/* vfio-pci device */
+#define VFIO_DEVICE_FLAGS_PLATFORM (1 << 2)	/* vfio-platform device */
+#define VFIO_DEVICE_FLAGS_AMBA  (1 << 3)	/* vfio-amba device */
+#define VFIO_DEVICE_FLAGS_CCW	(1 << 4)	/* vfio-ccw device */
+	__u32	num_regions;	/* Max region index + 1 */
+	__u32	num_irqs;	/* Max IRQ index + 1 */
+};
+#define VFIO_DEVICE_GET_INFO		_IO(VFIO_TYPE, VFIO_BASE + 7)
+
+/*
+ * Vendor driver using Mediated device framework should provide device_api
+ * attribute in supported type attribute groups. Device API string should be one
+ * of the following corresponding to device flags in vfio_device_info structure.
+ */
+
+#define VFIO_DEVICE_API_PCI_STRING		"vfio-pci"
+#define VFIO_DEVICE_API_PLATFORM_STRING		"vfio-platform"
+#define VFIO_DEVICE_API_AMBA_STRING		"vfio-amba"
+#define VFIO_DEVICE_API_CCW_STRING		"vfio-ccw"
+
+/**
+ * VFIO_DEVICE_GET_REGION_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 8,
+ *				       struct vfio_region_info)
+ *
+ * Retrieve information about a device region.  Caller provides
+ * struct vfio_region_info with index value set.  Caller sets argsz.
+ * Implementation of region mapping is bus driver specific.  This is
+ * intended to describe MMIO, I/O port, as well as bus specific
+ * regions (ex. PCI config space).  Zero sized regions may be used
+ * to describe unimplemented regions (ex. unimplemented PCI BARs).
+ * Return: 0 on success, -errno on failure.
+ */
+struct vfio_region_info {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_REGION_INFO_FLAG_READ	(1 << 0) /* Region supports read */
+#define VFIO_REGION_INFO_FLAG_WRITE	(1 << 1) /* Region supports write */
+#define VFIO_REGION_INFO_FLAG_MMAP	(1 << 2) /* Region supports mmap */
+#define VFIO_REGION_INFO_FLAG_CAPS	(1 << 3) /* Info supports caps */
+	__u32	index;		/* Region index */
+	__u32	cap_offset;	/* Offset within info struct of first cap */
+	__u64	size;		/* Region size (bytes) */
+	__u64	offset;		/* Region offset from start of device fd */
+};
+#define VFIO_DEVICE_GET_REGION_INFO	_IO(VFIO_TYPE, VFIO_BASE + 8)
+
+/*
+ * The sparse mmap capability allows finer granularity of specifying areas
+ * within a region with mmap support.  When specified, the user should only
+ * mmap the offset ranges specified by the areas array.  mmaps outside of the
+ * areas specified may fail (such as the range covering a PCI MSI-X table) or
+ * may result in improper device behavior.
+ *
+ * The structures below define version 1 of this capability.
+ */
+#define VFIO_REGION_INFO_CAP_SPARSE_MMAP	1
+
+struct vfio_region_sparse_mmap_area {
+	__u64	offset;	/* Offset of mmap'able area within region */
+	__u64	size;	/* Size of mmap'able area */
+};
+
+struct vfio_region_info_cap_sparse_mmap {
+	struct vfio_info_cap_header header;
+	__u32	nr_areas;
+	__u32	reserved;
+	struct vfio_region_sparse_mmap_area areas[];
+};
+
+/*
+ * The device specific type capability allows regions unique to a specific
+ * device or class of devices to be exposed.  This helps solve the problem for
+ * vfio bus drivers of defining which region indexes correspond to which region
+ * on the device, without needing to resort to static indexes, as done by
+ * vfio-pci.  For instance, if we were to go back in time, we might remove
+ * VFIO_PCI_VGA_REGION_INDEX and let vfio-pci simply define that all indexes
+ * greater than or equal to VFIO_PCI_NUM_REGIONS are device specific and we'd
+ * make a "VGA" device specific type to describe the VGA access space.  This
+ * means that non-VGA devices wouldn't need to waste this index, and thus the
+ * address space associated with it due to implementation of device file
+ * descriptor offsets in vfio-pci.
+ *
+ * The current implementation is now part of the user ABI, so we can't use this
+ * for VGA, but there are other upcoming use cases, such as opregions for Intel
+ * IGD devices and framebuffers for vGPU devices.  We missed VGA, but we'll
+ * use this for future additions.
+ *
+ * The structure below defines version 1 of this capability.
+ */
+#define VFIO_REGION_INFO_CAP_TYPE	2
+
+struct vfio_region_info_cap_type {
+	struct vfio_info_cap_header header;
+	__u32 type;	/* global per bus driver */
+	__u32 subtype;	/* type specific */
+};
+
+#define VFIO_REGION_TYPE_PCI_VENDOR_TYPE	(1 << 31)
+#define VFIO_REGION_TYPE_PCI_VENDOR_MASK	(0xffff)
+
+/* 8086 Vendor sub-types */
+#define VFIO_REGION_SUBTYPE_INTEL_IGD_OPREGION	(1)
+#define VFIO_REGION_SUBTYPE_INTEL_IGD_HOST_CFG	(2)
+#define VFIO_REGION_SUBTYPE_INTEL_IGD_LPC_CFG	(3)
+
+/**
+ * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
+ *				    struct vfio_irq_info)
+ *
+ * Retrieve information about a device IRQ.  Caller provides
+ * struct vfio_irq_info with index value set.  Caller sets argsz.
+ * Implementation of IRQ mapping is bus driver specific.  Indexes
+ * using multiple IRQs are primarily intended to support MSI-like
+ * interrupt blocks.  Zero count irq blocks may be used to describe
+ * unimplemented interrupt types.
+ *
+ * The EVENTFD flag indicates the interrupt index supports eventfd based
+ * signaling.
+ *
+ * The MASKABLE flags indicates the index supports MASK and UNMASK
+ * actions described below.
+ *
+ * AUTOMASKED indicates that after signaling, the interrupt line is
+ * automatically masked by VFIO and the user needs to unmask the line
+ * to receive new interrupts.  This is primarily intended to distinguish
+ * level triggered interrupts.
+ *
+ * The NORESIZE flag indicates that the interrupt lines within the index
+ * are setup as a set and new subindexes cannot be enabled without first
+ * disabling the entire index.  This is used for interrupts like PCI MSI
+ * and MSI-X where the driver may only use a subset of the available
+ * indexes, but VFIO needs to enable a specific number of vectors
+ * upfront.  In the case of MSI-X, where the user can enable MSI-X and
+ * then add and unmask vectors, it's up to userspace to make the decision
+ * whether to allocate the maximum supported number of vectors or tear
+ * down setup and incrementally increase the vectors as each is enabled.
+ */
+struct vfio_irq_info {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_IRQ_INFO_EVENTFD		(1 << 0)
+#define VFIO_IRQ_INFO_MASKABLE		(1 << 1)
+#define VFIO_IRQ_INFO_AUTOMASKED	(1 << 2)
+#define VFIO_IRQ_INFO_NORESIZE		(1 << 3)
+	__u32	index;		/* IRQ index */
+	__u32	count;		/* Number of IRQs within this index */
+};
+#define VFIO_DEVICE_GET_IRQ_INFO	_IO(VFIO_TYPE, VFIO_BASE + 9)
+
+/**
+ * VFIO_DEVICE_SET_IRQS - _IOW(VFIO_TYPE, VFIO_BASE + 10, struct vfio_irq_set)
+ *
+ * Set signaling, masking, and unmasking of interrupts.  Caller provides
+ * struct vfio_irq_set with all fields set.  'start' and 'count' indicate
+ * the range of subindexes being specified.
+ *
+ * The DATA flags specify the type of data provided.  If DATA_NONE, the
+ * operation performs the specified action immediately on the specified
+ * interrupt(s).  For example, to unmask AUTOMASKED interrupt [0,0]:
+ * flags = (DATA_NONE|ACTION_UNMASK), index = 0, start = 0, count = 1.
+ *
+ * DATA_BOOL allows sparse support for the same on arrays of interrupts.
+ * For example, to mask interrupts [0,1] and [0,3] (but not [0,2]):
+ * flags = (DATA_BOOL|ACTION_MASK), index = 0, start = 1, count = 3,
+ * data = {1,0,1}
+ *
+ * DATA_EVENTFD binds the specified ACTION to the provided __s32 eventfd.
+ * A value of -1 can be used to either de-assign interrupts if already
+ * assigned or skip un-assigned interrupts.  For example, to set an eventfd
+ * to be trigger for interrupts [0,0] and [0,2]:
+ * flags = (DATA_EVENTFD|ACTION_TRIGGER), index = 0, start = 0, count = 3,
+ * data = {fd1, -1, fd2}
+ * If index [0,1] is previously set, two count = 1 ioctls calls would be
+ * required to set [0,0] and [0,2] without changing [0,1].
+ *
+ * Once a signaling mechanism is set, DATA_BOOL or DATA_NONE can be used
+ * with ACTION_TRIGGER to perform kernel level interrupt loopback testing
+ * from userspace (ie. simulate hardware triggering).
+ *
+ * Setting of an event triggering mechanism to userspace for ACTION_TRIGGER
+ * enables the interrupt index for the device.  Individual subindex interrupts
+ * can be disabled using the -1 value for DATA_EVENTFD or the index can be
+ * disabled as a whole with: flags = (DATA_NONE|ACTION_TRIGGER), count = 0.
+ *
+ * Note that ACTION_[UN]MASK specify user->kernel signaling (irqfds) while
+ * ACTION_TRIGGER specifies kernel->user signaling.
+ */
+struct vfio_irq_set {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_IRQ_SET_DATA_NONE		(1 << 0) /* Data not present */
+#define VFIO_IRQ_SET_DATA_BOOL		(1 << 1) /* Data is bool (u8) */
+#define VFIO_IRQ_SET_DATA_EVENTFD	(1 << 2) /* Data is eventfd (s32) */
+#define VFIO_IRQ_SET_ACTION_MASK	(1 << 3) /* Mask interrupt */
+#define VFIO_IRQ_SET_ACTION_UNMASK	(1 << 4) /* Unmask interrupt */
+#define VFIO_IRQ_SET_ACTION_TRIGGER	(1 << 5) /* Trigger interrupt */
+	__u32	index;
+	__u32	start;
+	__u32	count;
+	__u8	data[];
+};
+#define VFIO_DEVICE_SET_IRQS		_IO(VFIO_TYPE, VFIO_BASE + 10)
+
+#define VFIO_IRQ_SET_DATA_TYPE_MASK	(VFIO_IRQ_SET_DATA_NONE | \
+					 VFIO_IRQ_SET_DATA_BOOL | \
+					 VFIO_IRQ_SET_DATA_EVENTFD)
+#define VFIO_IRQ_SET_ACTION_TYPE_MASK	(VFIO_IRQ_SET_ACTION_MASK | \
+					 VFIO_IRQ_SET_ACTION_UNMASK | \
+					 VFIO_IRQ_SET_ACTION_TRIGGER)
+/**
+ * VFIO_DEVICE_RESET - _IO(VFIO_TYPE, VFIO_BASE + 11)
+ *
+ * Reset a device.
+ */
+#define VFIO_DEVICE_RESET		_IO(VFIO_TYPE, VFIO_BASE + 11)
+
+/*
+ * The VFIO-PCI bus driver makes use of the following fixed region and
+ * IRQ index mapping.  Unimplemented regions return a size of zero.
+ * Unimplemented IRQ types return a count of zero.
+ */
+
+enum {
+	VFIO_PCI_BAR0_REGION_INDEX,
+	VFIO_PCI_BAR1_REGION_INDEX,
+	VFIO_PCI_BAR2_REGION_INDEX,
+	VFIO_PCI_BAR3_REGION_INDEX,
+	VFIO_PCI_BAR4_REGION_INDEX,
+	VFIO_PCI_BAR5_REGION_INDEX,
+	VFIO_PCI_ROM_REGION_INDEX,
+	VFIO_PCI_CONFIG_REGION_INDEX,
+	/*
+	 * Expose VGA regions defined for PCI base class 03, subclass 00.
+	 * This includes I/O port ranges 0x3b0 to 0x3bb and 0x3c0 to 0x3df
+	 * as well as the MMIO range 0xa0000 to 0xbffff.  Each implemented
+	 * range is found at its identity mapped offset from the region
+	 * offset, for example 0x3b0 is region_info.offset + 0x3b0.  Areas
+	 * between described ranges are unimplemented.
+	 */
+	VFIO_PCI_VGA_REGION_INDEX,
+	VFIO_PCI_NUM_REGIONS = 9 /* Fixed user ABI, region indexes >=9 use */
+				 /* device specific cap to define content. */
+};
+
+enum {
+	VFIO_PCI_INTX_IRQ_INDEX,
+	VFIO_PCI_MSI_IRQ_INDEX,
+	VFIO_PCI_MSIX_IRQ_INDEX,
+	VFIO_PCI_ERR_IRQ_INDEX,
+	VFIO_PCI_REQ_IRQ_INDEX,
+	VFIO_PCI_NUM_IRQS
+};
+
+/*
+ * The vfio-ccw bus driver makes use of the following fixed region and
+ * IRQ index mapping. Unimplemented regions return a size of zero.
+ * Unimplemented IRQ types return a count of zero.
+ */
+
+enum {
+	VFIO_CCW_CONFIG_REGION_INDEX,
+	VFIO_CCW_NUM_REGIONS
+};
+
+enum {
+	VFIO_CCW_IO_IRQ_INDEX,
+	VFIO_CCW_NUM_IRQS
+};
+
+/**
+ * VFIO_DEVICE_GET_PCI_HOT_RESET_INFO - _IORW(VFIO_TYPE, VFIO_BASE + 12,
+ *					      struct vfio_pci_hot_reset_info)
+ *
+ * Return: 0 on success, -errno on failure:
+ *	-enospc = insufficient buffer, -enodev = unsupported for device.
+ */
+struct vfio_pci_dependent_device {
+	__u32	group_id;
+	__u16	segment;
+	__u8	bus;
+	__u8	devfn; /* Use PCI_SLOT/PCI_FUNC */
+};
+
+struct vfio_pci_hot_reset_info {
+	__u32	argsz;
+	__u32	flags;
+	__u32	count;
+	struct vfio_pci_dependent_device	devices[];
+};
+
+#define VFIO_DEVICE_GET_PCI_HOT_RESET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
+
+/**
+ * VFIO_DEVICE_PCI_HOT_RESET - _IOW(VFIO_TYPE, VFIO_BASE + 13,
+ *				    struct vfio_pci_hot_reset)
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+struct vfio_pci_hot_reset {
+	__u32	argsz;
+	__u32	flags;
+	__u32	count;
+	__s32	group_fds[];
+};
+
+#define VFIO_DEVICE_PCI_HOT_RESET	_IO(VFIO_TYPE, VFIO_BASE + 13)
+
+/* -------- API for Type1 VFIO IOMMU -------- */
+
+/**
+ * VFIO_IOMMU_GET_INFO - _IOR(VFIO_TYPE, VFIO_BASE + 12, struct vfio_iommu_info)
+ *
+ * Retrieve information about the IOMMU object. Fills in provided
+ * struct vfio_iommu_info. Caller sets argsz.
+ *
+ * XXX Should we do these by CHECK_EXTENSION too?
+ */
+struct vfio_iommu_type1_info {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_IOMMU_INFO_PGSIZES (1 << 0)	/* supported page sizes info */
+	__u64	iova_pgsizes;		/* Bitmap of supported page sizes */
+};
+
+#define VFIO_IOMMU_GET_INFO _IO(VFIO_TYPE, VFIO_BASE + 12)
+
+/**
+ * VFIO_IOMMU_MAP_DMA - _IOW(VFIO_TYPE, VFIO_BASE + 13, struct vfio_dma_map)
+ *
+ * Map process virtual addresses to IO virtual addresses using the
+ * provided struct vfio_dma_map. Caller sets argsz. READ &/ WRITE required.
+ */
+struct vfio_iommu_type1_dma_map {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_DMA_MAP_FLAG_READ (1 << 0)		/* readable from device */
+#define VFIO_DMA_MAP_FLAG_WRITE (1 << 1)	/* writable from device */
+	__u64	vaddr;				/* Process virtual address */
+	__u64	iova;				/* IO virtual address */
+	__u64	size;				/* Size of mapping (bytes) */
+};
+
+#define VFIO_IOMMU_MAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 13)
+
+/**
+ * VFIO_IOMMU_UNMAP_DMA - _IOWR(VFIO_TYPE, VFIO_BASE + 14,
+ *							struct vfio_dma_unmap)
+ *
+ * Unmap IO virtual addresses using the provided struct vfio_dma_unmap.
+ * Caller sets argsz.  The actual unmapped size is returned in the size
+ * field.  No guarantee is made to the user that arbitrary unmaps of iova
+ * or size different from those used in the original mapping call will
+ * succeed.
+ */
+struct vfio_iommu_type1_dma_unmap {
+	__u32	argsz;
+	__u32	flags;
+	__u64	iova;				/* IO virtual address */
+	__u64	size;				/* Size of mapping (bytes) */
+};
+
+#define VFIO_IOMMU_UNMAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 14)
+
+/*
+ * IOCTLs to enable/disable IOMMU container usage.
+ * No parameters are supported.
+ */
+#define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
+#define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
+
+/* -------- Additional API for SPAPR TCE (Server POWERPC) IOMMU -------- */
+
+/*
+ * The SPAPR TCE DDW info struct provides the information about
+ * the details of Dynamic DMA window capability.
+ *
+ * @pgsizes contains a page size bitmask, 4K/64K/16M are supported.
+ * @max_dynamic_windows_supported tells the maximum number of windows
+ * which the platform can create.
+ * @levels tells the maximum number of levels in multi-level IOMMU tables;
+ * this allows splitting a table into smaller chunks which reduces
+ * the amount of physically contiguous memory required for the table.
+ */
+struct vfio_iommu_spapr_tce_ddw_info {
+	__u64 pgsizes;			/* Bitmap of supported page sizes */
+	__u32 max_dynamic_windows_supported;
+	__u32 levels;
+};
+
+/*
+ * The SPAPR TCE info struct provides the information about the PCI bus
+ * address ranges available for DMA, these values are programmed into
+ * the hardware so the guest has to know that information.
+ *
+ * The DMA 32 bit window start is an absolute PCI bus address.
+ * The IOVA address passed via map/unmap ioctls are absolute PCI bus
+ * addresses too so the window works as a filter rather than an offset
+ * for IOVA addresses.
+ *
+ * Flags supported:
+ * - VFIO_IOMMU_SPAPR_INFO_DDW: informs the userspace that dynamic DMA windows
+ *   (DDW) support is present. @ddw is only supported when DDW is present.
+ */
+struct vfio_iommu_spapr_tce_info {
+	__u32 argsz;
+	__u32 flags;
+#define VFIO_IOMMU_SPAPR_INFO_DDW	(1 << 0)	/* DDW supported */
+	__u32 dma32_window_start;	/* 32 bit window start (bytes) */
+	__u32 dma32_window_size;	/* 32 bit window size (bytes) */
+	struct vfio_iommu_spapr_tce_ddw_info ddw;
+};
+
+#define VFIO_IOMMU_SPAPR_TCE_GET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
+
+/*
+ * EEH PE operation struct provides ways to:
+ * - enable/disable EEH functionality;
+ * - unfreeze IO/DMA for frozen PE;
+ * - read PE state;
+ * - reset PE;
+ * - configure PE;
+ * - inject EEH error.
+ */
+struct vfio_eeh_pe_err {
+	__u32 type;
+	__u32 func;
+	__u64 addr;
+	__u64 mask;
+};
+
+struct vfio_eeh_pe_op {
+	__u32 argsz;
+	__u32 flags;
+	__u32 op;
+	union {
+		struct vfio_eeh_pe_err err;
+	};
+};
+
+#define VFIO_EEH_PE_DISABLE		0	/* Disable EEH functionality */
+#define VFIO_EEH_PE_ENABLE		1	/* Enable EEH functionality  */
+#define VFIO_EEH_PE_UNFREEZE_IO		2	/* Enable IO for frozen PE   */
+#define VFIO_EEH_PE_UNFREEZE_DMA	3	/* Enable DMA for frozen PE  */
+#define VFIO_EEH_PE_GET_STATE		4	/* PE state retrieval        */
+#define  VFIO_EEH_PE_STATE_NORMAL	0	/* PE in functional state    */
+#define  VFIO_EEH_PE_STATE_RESET	1	/* PE reset in progress      */
+#define  VFIO_EEH_PE_STATE_STOPPED	2	/* Stopped DMA and IO        */
+#define  VFIO_EEH_PE_STATE_STOPPED_DMA	4	/* Stopped DMA only          */
+#define  VFIO_EEH_PE_STATE_UNAVAIL	5	/* State unavailable         */
+#define VFIO_EEH_PE_RESET_DEACTIVATE	5	/* Deassert PE reset         */
+#define VFIO_EEH_PE_RESET_HOT		6	/* Assert hot reset          */
+#define VFIO_EEH_PE_RESET_FUNDAMENTAL	7	/* Assert fundamental reset  */
+#define VFIO_EEH_PE_CONFIGURE		8	/* PE configuration          */
+#define VFIO_EEH_PE_INJECT_ERR		9	/* Inject EEH error          */
+
+#define VFIO_EEH_PE_OP			_IO(VFIO_TYPE, VFIO_BASE + 21)
+
+/**
+ * VFIO_IOMMU_SPAPR_REGISTER_MEMORY - _IOW(VFIO_TYPE, VFIO_BASE + 17, struct vfio_iommu_spapr_register_memory)
+ *
+ * Registers user space memory where DMA is allowed. It pins
+ * user pages and does the locked memory accounting so
+ * subsequent VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA calls
+ * get faster.
+ */
+struct vfio_iommu_spapr_register_memory {
+	__u32	argsz;
+	__u32	flags;
+	__u64	vaddr;				/* Process virtual address */
+	__u64	size;				/* Size of mapping (bytes) */
+};
+#define VFIO_IOMMU_SPAPR_REGISTER_MEMORY	_IO(VFIO_TYPE, VFIO_BASE + 17)
+
+/**
+ * VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY - _IOW(VFIO_TYPE, VFIO_BASE + 18, struct vfio_iommu_spapr_register_memory)
+ *
+ * Unregisters user space memory registered with
+ * VFIO_IOMMU_SPAPR_REGISTER_MEMORY.
+ * Uses vfio_iommu_spapr_register_memory for parameters.
+ */
+#define VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY	_IO(VFIO_TYPE, VFIO_BASE + 18)
+
+/**
+ * VFIO_IOMMU_SPAPR_TCE_CREATE - _IOWR(VFIO_TYPE, VFIO_BASE + 19, struct vfio_iommu_spapr_tce_create)
+ *
+ * Creates an additional TCE table and programs it (sets a new DMA window)
+ * to every IOMMU group in the container. It receives page shift, window
+ * size and number of levels in the TCE table being created.
+ *
+ * It allocates and returns an offset on a PCI bus of the new DMA window.
+ */
+struct vfio_iommu_spapr_tce_create {
+	__u32 argsz;
+	__u32 flags;
+	/* in */
+	__u32 page_shift;
+	__u32 __resv1;
+	__u64 window_size;
+	__u32 levels;
+	__u32 __resv2;
+	/* out */
+	__u64 start_addr;
+};
+#define VFIO_IOMMU_SPAPR_TCE_CREATE	_IO(VFIO_TYPE, VFIO_BASE + 19)
+
+/**
+ * VFIO_IOMMU_SPAPR_TCE_REMOVE - _IOW(VFIO_TYPE, VFIO_BASE + 20, struct vfio_iommu_spapr_tce_remove)
+ *
+ * Unprograms a TCE table from all groups in the container and destroys it.
+ * It receives a PCI bus offset as a window id.
+ */
+struct vfio_iommu_spapr_tce_remove {
+	__u32 argsz;
+	__u32 flags;
+	/* in */
+	__u64 start_addr;
+};
+#define VFIO_IOMMU_SPAPR_TCE_REMOVE	_IO(VFIO_TYPE, VFIO_BASE + 20)
+
+/* ***************************************************************** */
+
+#endif /* VFIO_H */
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v6 kvmtool 07/13] Add fls_long and roundup_pow_of_two helpers
  2018-06-18 18:41 [PATCH v6 kvmtool 00/13] Add vfio-pci support Jean-Philippe Brucker
                   ` (5 preceding siblings ...)
  2018-06-18 18:42 ` [PATCH v6 kvmtool 06/13] Import VFIO headers Jean-Philippe Brucker
@ 2018-06-18 18:42 ` Jean-Philippe Brucker
  2018-06-18 18:42 ` [PATCH v6 kvmtool 08/13] Add PCI device passthrough using VFIO Jean-Philippe Brucker
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Jean-Philippe Brucker @ 2018-06-18 18:42 UTC (permalink / raw)
  To: kvm
  Cc: lorenzo.pieralisi, marc.zyngier, punit.agrawal, will.deacon,
	alex.williamson, robin.murphy, kvmarm

It's always nice to have a log2 handy, and the vfio-pci code will need to
perform power-of-two allocations from arbitrary sizes. Add fls_long and
roundup_pow_of_two, based on the GCC builtin.
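
As a quick sanity check, the helpers behave as follows (a standalone
sketch, not part of the patch -- the two functions are copied here so it
compiles on its own; the 1UL << 63 case assumes an LP64 host):

  #include <assert.h>

  static inline int fls_long(unsigned long x)
  {
  	return x ? sizeof(x) * 8 - __builtin_clzl(x) : 0;
  }

  static inline unsigned long roundup_pow_of_two(unsigned long x)
  {
  	return x ? 1UL << fls_long(x - 1) : 0;
  }

  int main(void)
  {
  	/* fls_long: 1-based index of the most significant set bit */
  	assert(fls_long(0) == 0);
  	assert(fls_long(1) == 1);
  	assert(fls_long(1UL << 63) == 64);

  	/* roundup_pow_of_two: smallest power of two >= x */
  	assert(roundup_pow_of_two(1) == 1);
  	assert(roundup_pow_of_two(24) == 32);
  	assert(roundup_pow_of_two(4096) == 4096);
  	return 0;
  }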

Reviewed-by: Punit Agrawal <punit.agrawal@arm.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
---
 include/kvm/util.h | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/include/kvm/util.h b/include/kvm/util.h
index 0df9f0dfd..4ca7aa939 100644
--- a/include/kvm/util.h
+++ b/include/kvm/util.h
@@ -90,6 +90,20 @@ static inline void msleep(unsigned int msecs)
 	usleep(MSECS_TO_USECS(msecs));
 }
 
+/*
+ * Find last (most significant) bit set. Same implementation as Linux:
+ * fls(0) = 0, fls(1) = 1, fls(1UL << 63) = 64
+ */
+static inline int fls_long(unsigned long x)
+{
+	return x ? sizeof(x) * 8 - __builtin_clzl(x) : 0;
+}
+
+static inline unsigned long roundup_pow_of_two(unsigned long x)
+{
+	return x ? 1UL << fls_long(x - 1) : 0;
+}
+
 struct kvm;
 void *mmap_hugetlbfs(struct kvm *kvm, const char *htlbfs_path, u64 size);
 void *mmap_anon_or_hugetlbfs(struct kvm *kvm, const char *hugetlbfs_path, u64 size);
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v6 kvmtool 08/13] Add PCI device passthrough using VFIO
  2018-06-18 18:41 [PATCH v6 kvmtool 00/13] Add vfio-pci support Jean-Philippe Brucker
                   ` (6 preceding siblings ...)
  2018-06-18 18:42 ` [PATCH v6 kvmtool 07/13] Add fls_long and roundup_pow_of_two helpers Jean-Philippe Brucker
@ 2018-06-18 18:42 ` Jean-Philippe Brucker
  2018-06-18 18:42 ` [PATCH v6 kvmtool 09/13] vfio-pci: add MSI-X support Jean-Philippe Brucker
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Jean-Philippe Brucker @ 2018-06-18 18:42 UTC (permalink / raw)
  To: kvm
  Cc: lorenzo.pieralisi, marc.zyngier, punit.agrawal, will.deacon,
	alex.williamson, robin.murphy, kvmarm

Assigning devices using VFIO allows the guest to have direct access to the
device, whilst filtering accesses to sensitive areas by trapping config
space accesses and mapping DMA with an IOMMU.

This patch adds a new option to lkvm run: --vfio-pci=<BDF>. Before
assigning a device to a VM, some preparation is required. As described in
Linux Documentation/vfio.txt, the device driver needs to be changed to
vfio-pci:

  $ dev=0000:00:00.0

  $ echo $dev > /sys/bus/pci/devices/$dev/driver/unbind
  $ echo vfio-pci > /sys/bus/pci/devices/$dev/driver_override
  $ echo $dev > /sys/bus/pci/drivers_probe

Adding --vfio-pci=$dev to lkvm run will pass the device to the guest.
Multiple devices can be passed to the guest by adding more --vfio-pci
parameters.
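
For example, assigning both functions of a (hypothetical) dual-function
device at 0000:01:00.x would look something like:

  $ lkvm run --kernel bzImage \
             --vfio-pci 0000:01:00.0 --vfio-pci 0000:01:00.1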

This patch only implements PCI with INTx. MSI-X routing will be added in a
subsequent patch, and at some point we might add support for passing
platform devices to guests.
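
For reference, the minimal VFIO userspace sequence that vfio/core.c
implements looks roughly like this (a sketch with error handling omitted;
the group number and BDF are hypothetical):

  #include <fcntl.h>
  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <linux/vfio.h>

  static int assign_device(void *host_addr, uint64_t iova, uint64_t size)
  {
  	int container = open("/dev/vfio/vfio", O_RDWR);
  	int group = open("/dev/vfio/26", O_RDWR);  /* hypothetical group */
  	struct vfio_iommu_type1_dma_map map = {
  		.argsz	= sizeof(map),
  		.flags	= VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
  		.vaddr	= (uintptr_t)host_addr,
  		.iova	= iova,
  		.size	= size,
  	};

  	ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
  	ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

  	/* Map guest RAM so the device can DMA into it */
  	ioctl(container, VFIO_IOMMU_MAP_DMA, &map);

  	/* Finally, get a device fd for region/IRQ configuration */
  	return ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:01:00.0");
  }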

Reviewed-by: Punit Agrawal <punit.agrawal@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
---
 Makefile                 |   2 +
 arm/pci.c                |   1 +
 builtin-run.c            |   5 +
 include/kvm/kvm-config.h |   3 +
 include/kvm/pci.h        |   3 +-
 include/kvm/vfio.h       |  71 ++++++
 vfio/core.c              | 494 +++++++++++++++++++++++++++++++++++++++
 vfio/pci.c               | 395 +++++++++++++++++++++++++++++++
 8 files changed, 973 insertions(+), 1 deletion(-)
 create mode 100644 include/kvm/vfio.h
 create mode 100644 vfio/core.c
 create mode 100644 vfio/pci.c

diff --git a/Makefile b/Makefile
index 5453199b3..e3088a466 100644
--- a/Makefile
+++ b/Makefile
@@ -59,6 +59,8 @@ OBJS	+= main.o
 OBJS	+= mmio.o
 OBJS	+= pci.o
 OBJS	+= term.o
+OBJS	+= vfio/core.o
+OBJS	+= vfio/pci.o
 OBJS	+= virtio/blk.o
 OBJS	+= virtio/scsi.o
 OBJS	+= virtio/console.o
diff --git a/arm/pci.c b/arm/pci.c
index 744b14c26..557cfa989 100644
--- a/arm/pci.c
+++ b/arm/pci.c
@@ -1,5 +1,6 @@
 #include "kvm/devices.h"
 #include "kvm/fdt.h"
+#include "kvm/kvm.h"
 #include "kvm/of_pci.h"
 #include "kvm/pci.h"
 #include "kvm/util.h"
diff --git a/builtin-run.c b/builtin-run.c
index b56aea7d1..443c10ba4 100644
--- a/builtin-run.c
+++ b/builtin-run.c
@@ -146,6 +146,11 @@ void kvm_run_set_wrapper_sandbox(void)
 	OPT_BOOLEAN('\0', "no-dhcp", &(cfg)->no_dhcp, "Disable kernel"	\
 			" DHCP in rootfs mode"),			\
 									\
+	OPT_GROUP("VFIO options:"),					\
+	OPT_CALLBACK('\0', "vfio-pci", NULL, "[domain:]bus:dev.fn",	\
+		     "Assign a PCI device to the virtual machine",	\
+		     vfio_device_parser, kvm),				\
+									\
 	OPT_GROUP("Debug options:"),					\
 	OPT_BOOLEAN('\0', "debug", &do_debug_print,			\
 			"Enable debug messages"),			\
diff --git a/include/kvm/kvm-config.h b/include/kvm/kvm-config.h
index 386fa8c59..a052b0bc7 100644
--- a/include/kvm/kvm-config.h
+++ b/include/kvm/kvm-config.h
@@ -2,6 +2,7 @@
 #define KVM_CONFIG_H_
 
 #include "kvm/disk-image.h"
+#include "kvm/vfio.h"
 #include "kvm/kvm-config-arch.h"
 
 #define DEFAULT_KVM_DEV		"/dev/kvm"
@@ -20,9 +21,11 @@
 struct kvm_config {
 	struct kvm_config_arch arch;
 	struct disk_image_params disk_image[MAX_DISK_IMAGES];
+	struct vfio_device_params *vfio_devices;
 	u64 ram_size;
 	u8  image_count;
 	u8 num_net_devices;
+	u8 num_vfio_devices;
 	bool virtio_rng;
 	int active_console;
 	int debug_iodelay;
diff --git a/include/kvm/pci.h b/include/kvm/pci.h
index 01c244bcf..274b77ea6 100644
--- a/include/kvm/pci.h
+++ b/include/kvm/pci.h
@@ -7,7 +7,6 @@
 #include <endian.h>
 
 #include "kvm/devices.h"
-#include "kvm/kvm.h"
 #include "kvm/msi.h"
 #include "kvm/fdt.h"
 
@@ -22,6 +21,8 @@
 #define PCI_IO_SIZE		0x100
 #define PCI_CFG_SIZE		(1ULL << 24)
 
+struct kvm;
+
 union pci_config_address {
 	struct {
 #if __BYTE_ORDER == __LITTLE_ENDIAN
diff --git a/include/kvm/vfio.h b/include/kvm/vfio.h
new file mode 100644
index 000000000..c434703ab
--- /dev/null
+++ b/include/kvm/vfio.h
@@ -0,0 +1,71 @@
+#ifndef KVM__VFIO_H
+#define KVM__VFIO_H
+
+#include "kvm/parse-options.h"
+#include "kvm/pci.h"
+
+#include <linux/vfio.h>
+
+#define vfio_dev_err(vdev, fmt, ...) \
+	pr_err("%s: " fmt, (vdev)->params->name, ##__VA_ARGS__)
+#define vfio_dev_warn(vdev, fmt, ...) \
+	pr_warning("%s: " fmt, (vdev)->params->name, ##__VA_ARGS__)
+#define vfio_dev_info(vdev, fmt, ...) \
+	pr_info("%s: " fmt, (vdev)->params->name, ##__VA_ARGS__)
+#define vfio_dev_dbg(vdev, fmt, ...) \
+	pr_debug("%s: " fmt, (vdev)->params->name, ##__VA_ARGS__)
+#define vfio_dev_die(vdev, fmt, ...) \
+	die("%s: " fmt, (vdev)->params->name, ##__VA_ARGS__)
+
+/* Currently limited by num_vfio_devices, a u8 (so at most 255 devices) */
+#define MAX_VFIO_DEVICES		255
+
+enum vfio_device_type {
+	VFIO_DEVICE_PCI,
+};
+
+struct vfio_pci_device {
+	struct pci_device_header	hdr;
+};
+
+struct vfio_region {
+	struct vfio_region_info		info;
+	u64				guest_phys_addr;
+	void				*host_addr;
+};
+
+struct vfio_device {
+	struct device_header		dev_hdr;
+	struct vfio_device_params	*params;
+	struct vfio_group		*group;
+
+	int				fd;
+	struct vfio_device_info		info;
+	struct vfio_region		*regions;
+
+	char				*sysfs_path;
+
+	struct vfio_pci_device		pci;
+};
+
+struct vfio_device_params {
+	char				*name;
+	const char			*bus;
+	enum vfio_device_type		type;
+};
+
+struct vfio_group {
+	unsigned long			id; /* iommu_group number in sysfs */
+	int				fd;
+	int				refs;
+	struct list_head		list;
+};
+
+int vfio_device_parser(const struct option *opt, const char *arg, int unset);
+int vfio_map_region(struct kvm *kvm, struct vfio_device *vdev,
+		    struct vfio_region *region);
+void vfio_unmap_region(struct kvm *kvm, struct vfio_region *region);
+int vfio_pci_setup_device(struct kvm *kvm, struct vfio_device *device);
+void vfio_pci_teardown_device(struct kvm *kvm, struct vfio_device *vdev);
+
+#endif /* KVM__VFIO_H */
diff --git a/vfio/core.c b/vfio/core.c
new file mode 100644
index 000000000..6c78e89f1
--- /dev/null
+++ b/vfio/core.c
@@ -0,0 +1,494 @@
+#include "kvm/kvm.h"
+#include "kvm/vfio.h"
+
+#include <linux/list.h>
+
+#define VFIO_DEV_DIR		"/dev/vfio"
+#define VFIO_DEV_NODE		VFIO_DEV_DIR "/vfio"
+#define IOMMU_GROUP_DIR		"/sys/kernel/iommu_groups"
+
+static int vfio_container;
+static LIST_HEAD(vfio_groups);
+static struct vfio_device *vfio_devices;
+
+static int vfio_device_pci_parser(const struct option *opt, char *arg,
+				  struct vfio_device_params *dev)
+{
+	unsigned int domain, bus, devnr, fn;
+
+	int nr = sscanf(arg, "%4x:%2x:%2x.%1x", &domain, &bus, &devnr, &fn);
+	if (nr < 4) {
+		domain = 0;
+		nr = sscanf(arg, "%2x:%2x.%1x", &bus, &devnr, &fn);
+		if (nr < 3) {
+			pr_err("Invalid device identifier %s", arg);
+			return -EINVAL;
+		}
+	}
+
+	dev->type = VFIO_DEVICE_PCI;
+	dev->bus = "pci";
+	dev->name = malloc(13);
+	if (!dev->name)
+		return -ENOMEM;
+
+	snprintf(dev->name, 13, "%04x:%02x:%02x.%x", domain, bus, devnr, fn);
+
+	return 0;
+}
+
+int vfio_device_parser(const struct option *opt, const char *arg, int unset)
+{
+	int ret = -EINVAL;
+	static int idx = 0;
+	struct kvm *kvm = opt->ptr;
+	struct vfio_device_params *dev, *devs;
+	char *cur, *buf = strdup(arg);
+
+	if (!buf)
+		return -ENOMEM;
+
+	if (idx >= MAX_VFIO_DEVICES) {
+		pr_warning("Too many VFIO devices");
+		goto out_free_buf;
+	}
+
+	devs = realloc(kvm->cfg.vfio_devices, sizeof(*dev) * (idx + 1));
+	if (!devs) {
+		ret = -ENOMEM;
+		goto out_free_buf;
+	}
+
+	kvm->cfg.vfio_devices = devs;
+	dev = &devs[idx];
+
+	cur = strtok(buf, ",");
+	if (!cur)
+		goto out_free_buf;
+
+	if (!strcmp(opt->long_name, "vfio-pci"))
+		ret = vfio_device_pci_parser(opt, cur, dev);
+	else
+		ret = -EINVAL;
+
+	if (!ret)
+		kvm->cfg.num_vfio_devices = ++idx;
+
+out_free_buf:
+	free(buf);
+
+	return ret;
+}
+
+int vfio_map_region(struct kvm *kvm, struct vfio_device *vdev,
+		    struct vfio_region *region)
+{
+	void *base;
+	int ret, prot = 0;
+	/* KVM needs page-aligned regions */
+	u64 map_size = ALIGN(region->info.size, PAGE_SIZE);
+
+	/*
+	 * We don't want to mess about trapping config accesses, so require that
+	 * they can be mmap'd. Note that for PCI, this precludes the use of I/O
+	 * BARs in the guest (we will hide them from Configuration Space, which
+	 * is trapped).
+	 */
+	if (!(region->info.flags & VFIO_REGION_INFO_FLAG_MMAP)) {
+		vfio_dev_info(vdev, "ignoring region %u, as it can't be mmap'd",
+			      region->info.index);
+		return 0;
+	}
+
+	if (region->info.flags & VFIO_REGION_INFO_FLAG_READ)
+		prot |= PROT_READ;
+	if (region->info.flags & VFIO_REGION_INFO_FLAG_WRITE)
+		prot |= PROT_WRITE;
+
+	base = mmap(NULL, region->info.size, prot, MAP_SHARED, vdev->fd,
+		    region->info.offset);
+	if (base == MAP_FAILED) {
+		ret = -errno;
+		vfio_dev_err(vdev, "failed to mmap region %u (0x%llx bytes)",
+			     region->info.index, region->info.size);
+		return ret;
+	}
+	region->host_addr = base;
+
+	ret = kvm__register_dev_mem(kvm, region->guest_phys_addr, map_size,
+				    region->host_addr);
+	if (ret) {
+		vfio_dev_err(vdev, "failed to register region with KVM");
+		return ret;
+	}
+
+	return 0;
+}
+
+void vfio_unmap_region(struct kvm *kvm, struct vfio_region *region)
+{
+	munmap(region->host_addr, region->info.size);
+}
+
+static int vfio_configure_device(struct kvm *kvm, struct vfio_device *vdev)
+{
+	int ret;
+	struct vfio_group *group = vdev->group;
+
+	vdev->fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD,
+			 vdev->params->name);
+	if (vdev->fd < 0) {
+		vfio_dev_warn(vdev, "failed to get fd");
+
+		/* The device might be a bridge without an fd */
+		return 0;
+	}
+
+	vdev->info.argsz = sizeof(vdev->info);
+	if (ioctl(vdev->fd, VFIO_DEVICE_GET_INFO, &vdev->info)) {
+		ret = -errno;
+		vfio_dev_err(vdev, "failed to get info");
+		goto err_close_device;
+	}
+
+	if (vdev->info.flags & VFIO_DEVICE_FLAGS_RESET &&
+	    ioctl(vdev->fd, VFIO_DEVICE_RESET) < 0)
+		vfio_dev_warn(vdev, "failed to reset device");
+
+	vdev->regions = calloc(vdev->info.num_regions, sizeof(*vdev->regions));
+	if (!vdev->regions) {
+		ret = -ENOMEM;
+		goto err_close_device;
+	}
+
+	/* Now for the bus-specific initialization... */
+	switch (vdev->params->type) {
+	case VFIO_DEVICE_PCI:
+		BUG_ON(!(vdev->info.flags & VFIO_DEVICE_FLAGS_PCI));
+		ret = vfio_pci_setup_device(kvm, vdev);
+		break;
+	default:
+		BUG_ON(1);
+		ret = -EINVAL;
+	}
+
+	if (ret)
+		goto err_free_regions;
+
+	vfio_dev_info(vdev, "assigned to device number 0x%x in group %lu",
+		      vdev->dev_hdr.dev_num, group->id);
+
+	return 0;
+
+err_free_regions:
+	free(vdev->regions);
+err_close_device:
+	close(vdev->fd);
+
+	return ret;
+}
+
+static int vfio_configure_devices(struct kvm *kvm)
+{
+	int i, ret;
+
+	for (i = 0; i < kvm->cfg.num_vfio_devices; ++i) {
+		ret = vfio_configure_device(kvm, &vfio_devices[i]);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+static int vfio_get_iommu_type(void)
+{
+	if (ioctl(vfio_container, VFIO_CHECK_EXTENSION, VFIO_TYPE1v2_IOMMU))
+		return VFIO_TYPE1v2_IOMMU;
+
+	if (ioctl(vfio_container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU))
+		return VFIO_TYPE1_IOMMU;
+
+	return -ENODEV;
+}
+
+static int vfio_map_mem_bank(struct kvm *kvm, struct kvm_mem_bank *bank, void *data)
+{
+	int ret = 0;
+	struct vfio_iommu_type1_dma_map dma_map = {
+		.argsz	= sizeof(dma_map),
+		.flags	= VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
+		.vaddr	= (unsigned long)bank->host_addr,
+		.iova	= (u64)bank->guest_phys_addr,
+		.size	= bank->size,
+	};
+
+	/* Map the guest memory for DMA (i.e. provide isolation) */
+	if (ioctl(vfio_container, VFIO_IOMMU_MAP_DMA, &dma_map)) {
+		ret = -errno;
+		pr_err("Failed to map 0x%llx -> 0x%llx (%llu) for DMA",
+		       dma_map.iova, dma_map.vaddr, dma_map.size);
+	}
+
+	return ret;
+}
+
+static int vfio_unmap_mem_bank(struct kvm *kvm, struct kvm_mem_bank *bank, void *data)
+{
+	struct vfio_iommu_type1_dma_unmap dma_unmap = {
+		.argsz = sizeof(dma_unmap),
+		.size = bank->size,
+		.iova = bank->guest_phys_addr,
+	};
+
+	ioctl(vfio_container, VFIO_IOMMU_UNMAP_DMA, &dma_unmap);
+
+	return 0;
+}
+
+static struct vfio_group *vfio_group_create(struct kvm *kvm, unsigned long id)
+{
+	int ret;
+	struct vfio_group *group;
+	char group_node[PATH_MAX];
+	struct vfio_group_status group_status = {
+		.argsz = sizeof(group_status),
+	};
+
+	group = calloc(1, sizeof(*group));
+	if (!group)
+		return NULL;
+
+	group->id	= id;
+	group->refs	= 1;
+
+	ret = snprintf(group_node, PATH_MAX, VFIO_DEV_DIR "/%lu", id);
+	if (ret < 0 || ret >= PATH_MAX)
+		goto err_free_group;
+
+	group->fd = open(group_node, O_RDWR);
+	if (group->fd < 0) {
+		pr_err("Failed to open IOMMU group %s", group_node);
+		goto err_free_group;
+	}
+
+	if (ioctl(group->fd, VFIO_GROUP_GET_STATUS, &group_status)) {
+		pr_err("Failed to determine status of IOMMU group %lu", id);
+		goto err_close_group;
+	}
+
+	if (!(group_status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
+		pr_err("IOMMU group %lu is not viable", id);
+		goto err_close_group;
+	}
+
+	if (ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &vfio_container)) {
+		pr_err("Failed to add IOMMU group %lu to VFIO container", id);
+		goto err_close_group;
+	}
+
+	list_add(&group->list, &vfio_groups);
+
+	return group;
+
+err_close_group:
+	close(group->fd);
+err_free_group:
+	free(group);
+
+	return NULL;
+}
+
+static void vfio_group_exit(struct kvm *kvm, struct vfio_group *group)
+{
+	if (--group->refs != 0)
+		return;
+
+	ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER);
+
+	list_del(&group->list);
+	close(group->fd);
+	free(group);
+}
+
+static struct vfio_group *
+vfio_group_get_for_dev(struct kvm *kvm, struct vfio_device *vdev)
+{
+	int dirfd;
+	ssize_t ret;
+	char *group_name;
+	unsigned long group_id;
+	char group_path[PATH_MAX];
+	struct vfio_group *group = NULL;
+
+	/* Find IOMMU group for this device */
+	dirfd = open(vdev->sysfs_path, O_DIRECTORY | O_PATH | O_RDONLY);
+	if (dirfd < 0) {
+		vfio_dev_err(vdev, "failed to open '%s'", vdev->sysfs_path);
+		return NULL;
+	}
+
+	ret = readlinkat(dirfd, "iommu_group", group_path, PATH_MAX);
+	if (ret < 0) {
+		vfio_dev_err(vdev, "no iommu_group");
+		goto out_close;
+	}
+	if (ret == PATH_MAX)
+		goto out_close;
+
+	group_path[ret] = '\0';
+
+	group_name = basename(group_path);
+	errno = 0;
+	group_id = strtoul(group_name, NULL, 10);
+	if (errno)
+		goto out_close;
+
+	list_for_each_entry(group, &vfio_groups, list) {
+		if (group->id == group_id) {
+			group->refs++;
+			return group;
+		}
+	}
+
+	group = vfio_group_create(kvm, group_id);
+
+out_close:
+	close(dirfd);
+	return group;
+}
+
+static int vfio_device_init(struct kvm *kvm, struct vfio_device *vdev)
+{
+	int ret;
+	char dev_path[PATH_MAX];
+	struct vfio_group *group;
+
+	ret = snprintf(dev_path, PATH_MAX, "/sys/bus/%s/devices/%s",
+		       vdev->params->bus, vdev->params->name);
+	if (ret < 0 || ret >= PATH_MAX)
+		return -EINVAL;
+
+	vdev->sysfs_path = strndup(dev_path, PATH_MAX);
+	if (!vdev->sysfs_path)
+		return -errno;
+
+	group = vfio_group_get_for_dev(kvm, vdev);
+	if (!group) {
+		free(vdev->sysfs_path);
+		return -EINVAL;
+	}
+
+	vdev->group = group;
+
+	return 0;
+}
+
+static void vfio_device_exit(struct kvm *kvm, struct vfio_device *vdev)
+{
+	vfio_group_exit(kvm, vdev->group);
+
+	switch (vdev->params->type) {
+	case VFIO_DEVICE_PCI:
+		vfio_pci_teardown_device(kvm, vdev);
+		break;
+	default:
+		vfio_dev_warn(vdev, "no teardown function for device");
+	}
+
+	close(vdev->fd);
+
+	free(vdev->regions);
+	free(vdev->sysfs_path);
+}
+
+static int vfio_container_init(struct kvm *kvm)
+{
+	int api, i, ret, iommu_type;
+
+	/* Create a container for our IOMMU groups */
+	vfio_container = open(VFIO_DEV_NODE, O_RDWR);
+	if (vfio_container == -1) {
+		ret = -errno;
+		pr_err("Failed to open %s", VFIO_DEV_NODE);
+		return ret;
+	}
+
+	api = ioctl(vfio_container, VFIO_GET_API_VERSION);
+	if (api != VFIO_API_VERSION) {
+		pr_err("Unknown VFIO API version %d", api);
+		return -ENODEV;
+	}
+
+	iommu_type = vfio_get_iommu_type();
+	if (iommu_type < 0) {
+		pr_err("VFIO type-1 IOMMU not supported on this platform");
+		return iommu_type;
+	}
+
+	/* Create groups for our devices and add them to the container */
+	for (i = 0; i < kvm->cfg.num_vfio_devices; ++i) {
+		vfio_devices[i].params = &kvm->cfg.vfio_devices[i];
+
+		ret = vfio_device_init(kvm, &vfio_devices[i]);
+		if (ret)
+			return ret;
+	}
+
+	/* Finalise the container */
+	if (ioctl(vfio_container, VFIO_SET_IOMMU, iommu_type)) {
+		ret = -errno;
+		pr_err("Failed to set IOMMU type %d for VFIO container",
+		       iommu_type);
+		return ret;
+	} else {
+		pr_info("Using IOMMU type %d for VFIO container", iommu_type);
+	}
+
+	return kvm__for_each_mem_bank(kvm, KVM_MEM_TYPE_RAM, vfio_map_mem_bank,
+				      NULL);
+}
+
+static int vfio__init(struct kvm *kvm)
+{
+	int ret;
+
+	if (!kvm->cfg.num_vfio_devices)
+		return 0;
+
+	vfio_devices = calloc(kvm->cfg.num_vfio_devices, sizeof(*vfio_devices));
+	if (!vfio_devices)
+		return -ENOMEM;
+
+	ret = vfio_container_init(kvm);
+	if (ret)
+		return ret;
+
+	ret = vfio_configure_devices(kvm);
+	if (ret)
+		return ret;
+
+	return 0;
+}
+dev_base_init(vfio__init);
+
+static int vfio__exit(struct kvm *kvm)
+{
+	int i;
+
+	if (!kvm->cfg.num_vfio_devices)
+		return 0;
+
+	for (i = 0; i < kvm->cfg.num_vfio_devices; i++)
+		vfio_device_exit(kvm, &vfio_devices[i]);
+
+	free(vfio_devices);
+
+	kvm__for_each_mem_bank(kvm, KVM_MEM_TYPE_RAM, vfio_unmap_mem_bank, NULL);
+	close(vfio_container);
+
+	free(kvm->cfg.vfio_devices);
+
+	return 0;
+}
+dev_base_exit(vfio__exit);
diff --git a/vfio/pci.c b/vfio/pci.c
new file mode 100644
index 000000000..6b157cfa3
--- /dev/null
+++ b/vfio/pci.c
@@ -0,0 +1,395 @@
+#include "kvm/irq.h"
+#include "kvm/kvm.h"
+#include "kvm/kvm-cpu.h"
+#include "kvm/vfio.h"
+
+#include <sys/ioctl.h>
+#include <sys/eventfd.h>
+
+/* Wrapper around UAPI vfio_irq_set */
+struct vfio_irq_eventfd {
+	struct vfio_irq_set	irq;
+	int			fd;
+};
+
+static void vfio_pci_cfg_read(struct kvm *kvm, struct pci_device_header *pci_hdr,
+			      u8 offset, void *data, int sz)
+{
+	struct vfio_region_info *info;
+	struct vfio_pci_device *pdev;
+	struct vfio_device *vdev;
+	char base[sz];
+
+	pdev = container_of(pci_hdr, struct vfio_pci_device, hdr);
+	vdev = container_of(pdev, struct vfio_device, pci);
+	info = &vdev->regions[VFIO_PCI_CONFIG_REGION_INDEX].info;
+
+	/* Dummy read in case of side-effects */
+	if (pread(vdev->fd, base, sz, info->offset + offset) != sz)
+		vfio_dev_warn(vdev, "failed to read %d bytes from Configuration Space at 0x%x",
+			      sz, offset);
+}
+
+static void vfio_pci_cfg_write(struct kvm *kvm, struct pci_device_header *pci_hdr,
+			       u8 offset, void *data, int sz)
+{
+	struct vfio_region_info *info;
+	struct vfio_pci_device *pdev;
+	struct vfio_device *vdev;
+	void *base = pci_hdr;
+
+	pdev = container_of(pci_hdr, struct vfio_pci_device, hdr);
+	vdev = container_of(pdev, struct vfio_device, pci);
+	info = &vdev->regions[VFIO_PCI_CONFIG_REGION_INDEX].info;
+
+	if (pwrite(vdev->fd, data, sz, info->offset + offset) != sz)
+		vfio_dev_warn(vdev, "Failed to write %d bytes to Configuration Space at 0x%x",
+			      sz, offset);
+
+	if (pread(vdev->fd, base + offset, sz, info->offset + offset) != sz)
+		vfio_dev_warn(vdev, "Failed to read %d bytes from Configuration Space at 0x%x",
+			      sz, offset);
+}
+
+static int vfio_pci_parse_caps(struct vfio_device *vdev)
+{
+	struct vfio_pci_device *pdev = &vdev->pci;
+
+	if (!(pdev->hdr.status & PCI_STATUS_CAP_LIST))
+		return 0;
+
+	pdev->hdr.status &= ~PCI_STATUS_CAP_LIST;
+	pdev->hdr.capabilities = 0;
+
+	/* TODO: install virtual capabilities */
+
+	return 0;
+}
+
+static int vfio_pci_parse_cfg_space(struct vfio_device *vdev)
+{
+	ssize_t sz = PCI_STD_HEADER_SIZEOF;
+	struct vfio_region_info *info;
+	struct vfio_pci_device *pdev = &vdev->pci;
+
+	if (vdev->info.num_regions <= VFIO_PCI_CONFIG_REGION_INDEX) {
+		vfio_dev_err(vdev, "Config Space not found");
+		return -ENODEV;
+	}
+
+	info = &vdev->regions[VFIO_PCI_CONFIG_REGION_INDEX].info;
+	*info = (struct vfio_region_info) {
+			.argsz = sizeof(*info),
+			.index = VFIO_PCI_CONFIG_REGION_INDEX,
+	};
+
+	ioctl(vdev->fd, VFIO_DEVICE_GET_REGION_INFO, info);
+	if (!info->size) {
+		vfio_dev_err(vdev, "Config Space has size zero?!");
+		return -EINVAL;
+	}
+
+	if (pread(vdev->fd, &pdev->hdr, sz, info->offset) != sz) {
+		vfio_dev_err(vdev, "failed to read %zd bytes of Config Space", sz);
+		return -EIO;
+	}
+
+	/* Strip bit 7, which indicates a multifunction device */
+	pdev->hdr.header_type &= 0x7f;
+
+	if (pdev->hdr.header_type != PCI_HEADER_TYPE_NORMAL) {
+		vfio_dev_err(vdev, "unsupported header type %u",
+			     pdev->hdr.header_type);
+		return -EOPNOTSUPP;
+	}
+
+	vfio_pci_parse_caps(vdev);
+
+	return 0;
+}
+
+static int vfio_pci_fixup_cfg_space(struct vfio_device *vdev)
+{
+	int i;
+	ssize_t hdr_sz;
+	struct vfio_region_info *info;
+	struct vfio_pci_device *pdev = &vdev->pci;
+
+	/* Enable exclusively MMIO and bus mastering */
+	pdev->hdr.command &= ~PCI_COMMAND_IO;
+	pdev->hdr.command |= PCI_COMMAND_MEMORY | PCI_COMMAND_MASTER;
+
+	/* Initialise the BARs */
+	for (i = VFIO_PCI_BAR0_REGION_INDEX; i <= VFIO_PCI_BAR5_REGION_INDEX; ++i) {
+		struct vfio_region *region = &vdev->regions[i];
+		u64 base = region->guest_phys_addr;
+
+		if (!base)
+			continue;
+
+		pdev->hdr.bar_size[i] = region->info.size;
+
+		/* Construct a fake reg to match what we've mapped. */
+		pdev->hdr.bar[i] = (base & PCI_BASE_ADDRESS_MEM_MASK) |
+					PCI_BASE_ADDRESS_SPACE_MEMORY |
+					PCI_BASE_ADDRESS_MEM_TYPE_32;
+	}
+
+	/* I really can't be bothered to support cardbus. */
+	pdev->hdr.card_bus = 0;
+
+	/*
+	 * Nuke the expansion ROM for now. If we want to do this properly,
+	 * we need to save its size somewhere and map into the guest.
+	 */
+	pdev->hdr.exp_rom_bar = 0;
+
+	/* Install our fake Configuration Space */
+	info = &vdev->regions[VFIO_PCI_CONFIG_REGION_INDEX].info;
+	hdr_sz = PCI_DEV_CFG_SIZE;
+	if (pwrite(vdev->fd, &pdev->hdr, hdr_sz, info->offset) != hdr_sz) {
+		vfio_dev_err(vdev, "failed to write %zd bytes to Config Space",
+			     hdr_sz);
+		return -EIO;
+	}
+
+	/* Register callbacks for cfg accesses */
+	pdev->hdr.cfg_ops = (struct pci_config_operations) {
+		.read	= vfio_pci_cfg_read,
+		.write	= vfio_pci_cfg_write,
+	};
+
+	pdev->hdr.irq_type = IRQ_TYPE_LEVEL_HIGH;
+
+	return 0;
+}
+
+static int vfio_pci_configure_bar(struct kvm *kvm, struct vfio_device *vdev,
+				  size_t nr)
+{
+	int ret;
+	size_t map_size;
+	struct vfio_region *region = &vdev->regions[nr];
+
+	if (nr >= vdev->info.num_regions)
+		return 0;
+
+	region->info = (struct vfio_region_info) {
+		.argsz = sizeof(region->info),
+		.index = nr,
+	};
+
+	ret = ioctl(vdev->fd, VFIO_DEVICE_GET_REGION_INFO, &region->info);
+	if (ret) {
+		ret = -errno;
+		vfio_dev_err(vdev, "cannot get info for BAR %zu", nr);
+		return ret;
+	}
+
+	/* Ignore invalid or unimplemented regions */
+	if (!region->info.size)
+		return 0;
+
+	/* Grab some MMIO space in the guest */
+	map_size = ALIGN(region->info.size, PAGE_SIZE);
+	region->guest_phys_addr = pci_get_io_space_block(map_size);
+
+	/*
+	 * Map the BARs into the guest. We'll later need to update
+	 * configuration space to reflect our allocation.
+	 */
+	ret = vfio_map_region(kvm, vdev, region);
+	if (ret)
+		return ret;
+
+	return 0;
+}
+
+static int vfio_pci_configure_dev_regions(struct kvm *kvm,
+					  struct vfio_device *vdev)
+{
+	int ret;
+	u32 bar;
+	size_t i;
+	bool is_64bit = false;
+	struct vfio_pci_device *pdev = &vdev->pci;
+
+	ret = vfio_pci_parse_cfg_space(vdev);
+	if (ret)
+		return ret;
+
+	for (i = VFIO_PCI_BAR0_REGION_INDEX; i <= VFIO_PCI_BAR5_REGION_INDEX; ++i) {
+		/* Ignore top half of 64-bit BAR */
+		if (i % 2 && is_64bit)
+			continue;
+
+		ret = vfio_pci_configure_bar(kvm, vdev, i);
+		if (ret)
+			return ret;
+
+		bar = pdev->hdr.bar[i];
+		is_64bit = (bar & PCI_BASE_ADDRESS_SPACE) ==
+			   PCI_BASE_ADDRESS_SPACE_MEMORY &&
+			   bar & PCI_BASE_ADDRESS_MEM_TYPE_64;
+	}
+
+	/* We've configured the BARs, fake up a Configuration Space */
+	return vfio_pci_fixup_cfg_space(vdev);
+}
+
+static int vfio_pci_enable_intx(struct kvm *kvm, struct vfio_device *vdev)
+{
+	int ret;
+	int trigger_fd, unmask_fd;
+	struct vfio_irq_eventfd	trigger;
+	struct vfio_irq_eventfd	unmask;
+	struct vfio_pci_device *pdev = &vdev->pci;
+	int gsi = pdev->hdr.irq_line - KVM_IRQ_OFFSET;
+
+	struct vfio_irq_info irq_info = {
+		.argsz = sizeof(irq_info),
+		.index = VFIO_PCI_INTX_IRQ_INDEX,
+	};
+
+	ret = ioctl(vdev->fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info);
+	if (ret || irq_info.count == 0) {
+		vfio_dev_err(vdev, "no INTx reported by VFIO");
+		return -ENODEV;
+	}
+
+	if (!(irq_info.flags & VFIO_IRQ_INFO_EVENTFD)) {
+		vfio_dev_err(vdev, "interrupt not eventfd capable");
+		return -EINVAL;
+	}
+
+	if (!(irq_info.flags & VFIO_IRQ_INFO_AUTOMASKED)) {
+		vfio_dev_err(vdev, "INTx interrupt not AUTOMASKED");
+		return -EINVAL;
+	}
+
+	/*
+	 * PCI IRQ is level-triggered, so we use two eventfds. trigger_fd
+	 * signals an interrupt from host to guest, and unmask_fd signals the
+	 * deassertion of the line from guest to host.
+	 */
+	trigger_fd = eventfd(0, 0);
+	if (trigger_fd < 0) {
+		vfio_dev_err(vdev, "failed to create trigger eventfd");
+		return trigger_fd;
+	}
+
+	unmask_fd = eventfd(0, 0);
+	if (unmask_fd < 0) {
+		vfio_dev_err(vdev, "failed to create unmask eventfd");
+		close(trigger_fd);
+		return unmask_fd;
+	}
+
+	ret = irq__add_irqfd(kvm, gsi, trigger_fd, unmask_fd);
+	if (ret)
+		goto err_close;
+
+	trigger.irq = (struct vfio_irq_set) {
+		.argsz	= sizeof(trigger),
+		.flags	= VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER,
+		.index	= VFIO_PCI_INTX_IRQ_INDEX,
+		.start	= 0,
+		.count	= 1,
+	};
+	trigger.fd = trigger_fd;
+
+	ret = ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &trigger);
+	if (ret < 0) {
+		vfio_dev_err(vdev, "failed to setup VFIO IRQ");
+		goto err_delete_line;
+	}
+
+	unmask.irq = (struct vfio_irq_set) {
+		.argsz	= sizeof(unmask),
+		.flags	= VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_UNMASK,
+		.index	= VFIO_PCI_INTX_IRQ_INDEX,
+		.start	= 0,
+		.count	= 1,
+	};
+	unmask.fd = unmask_fd;
+
+	ret = ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &unmask);
+	if (ret < 0) {
+		vfio_dev_err(vdev, "failed to setup unmask IRQ");
+		goto err_remove_event;
+	}
+
+	return 0;
+
+err_remove_event:
+	/* Remove trigger event */
+	trigger.irq.flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_TRIGGER;
+	trigger.irq.count = 0;
+	ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &trigger);
+
+err_delete_line:
+	irq__del_irqfd(kvm, gsi, trigger_fd);
+
+err_close:
+	close(trigger_fd);
+	close(unmask_fd);
+	return ret;
+}
+
+static int vfio_pci_configure_dev_irqs(struct kvm *kvm, struct vfio_device *vdev)
+{
+	struct vfio_pci_device *pdev = &vdev->pci;
+
+	struct vfio_irq_info irq_info = {
+		.argsz = sizeof(irq_info),
+		.index = VFIO_PCI_INTX_IRQ_INDEX,
+	};
+
+	if (!pdev->hdr.irq_pin) {
+		/* TODO: add MSI support */
+		vfio_dev_err(vdev, "INTx not available, MSI-X not implemented");
+		return -ENOSYS;
+	}
+
+	return vfio_pci_enable_intx(kvm, vdev);
+}
+
+int vfio_pci_setup_device(struct kvm *kvm, struct vfio_device *vdev)
+{
+	int ret;
+
+	ret = vfio_pci_configure_dev_regions(kvm, vdev);
+	if (ret) {
+		vfio_dev_err(vdev, "failed to configure regions");
+		return ret;
+	}
+
+	vdev->dev_hdr = (struct device_header) {
+		.bus_type	= DEVICE_BUS_PCI,
+		.data		= &vdev->pci.hdr,
+	};
+
+	ret = device__register(&vdev->dev_hdr);
+	if (ret) {
+		vfio_dev_err(vdev, "failed to register VFIO device");
+		return ret;
+	}
+
+	ret = vfio_pci_configure_dev_irqs(kvm, vdev);
+	if (ret) {
+		vfio_dev_err(vdev, "failed to configure IRQs");
+		return ret;
+	}
+
+	return 0;
+}
+
+void vfio_pci_teardown_device(struct kvm *kvm, struct vfio_device *vdev)
+{
+	size_t i;
+
+	for (i = 0; i < vdev->info.num_regions; i++)
+		vfio_unmap_region(kvm, &vdev->regions[i]);
+
+	device__unregister(&vdev->dev_hdr);
+}
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v6 kvmtool 09/13] vfio-pci: add MSI-X support
  2018-06-18 18:41 [PATCH v6 kvmtool 00/13] Add vfio-pci support Jean-Philippe Brucker
                   ` (7 preceding siblings ...)
  2018-06-18 18:42 ` [PATCH v6 kvmtool 08/13] Add PCI device passthrough using VFIO Jean-Philippe Brucker
@ 2018-06-18 18:42 ` Jean-Philippe Brucker
  2018-06-18 18:42 ` [PATCH v6 kvmtool 10/13] vfio-pci: add MSI support Jean-Philippe Brucker
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Jean-Philippe Brucker @ 2018-06-18 18:42 UTC (permalink / raw)
  To: kvm
  Cc: lorenzo.pieralisi, marc.zyngier, punit.agrawal, will.deacon,
	alex.williamson, robin.murphy, kvmarm

Add virtual MSI-X tables for PCI devices, and create IRQFD routes to let
the kernel inject MSIs from a physical device directly into the guest.

It would be tempting to create the MSI routes at init time before starting
vCPUs, when we can afford to exit gracefully. But some of it must be
initialized when the guest requests it.

* On the KVM side, MSIs must be enabled after devices allocate their IRQ
  lines and irqchips are operational, which may not happen until late_init.

* On the VFIO side, hardware state of devices may be updated when setting
  up MSIs. For example, when passing a virtio-pci-legacy device to the
  guest:

  (1) The device-specific configuration layout (in BAR0) depends on
      whether MSIs are enabled or not in the device. If they are enabled,
      the device-specific configuration starts at offset 24, otherwise it
      starts at offset 20.
  (2) A Linux guest assumes that MSIs are initially disabled (it doesn't
      actually check the capability), so it reads the device config at
      offset 20.
  (3) Had we enabled MSIs early, the host would have enabled the MSI-X
      capability and the device would return the config at offset 24.
  (4) The guest would read junk and explode.

Therefore we have to create MSI-X routes when the guest requests MSIs, and
enable/disable them in VFIO when the guest pokes the MSI-X capability. We
have to follow both physical and virtual state of the capability, which
makes the state machine a bit complex, but I think it works.
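
To make the transitions concrete, here is a standalone toy walk-through of
the common Linux bring-up sequence (the state bits mirror, but do not
reuse, the VFIO_PCI_MSI_STATE_* flags introduced by this patch):

  #include <stdint.h>
  #include <stdio.h>

  #define STATE_ENABLED	(1 << 0)	/* capability enabled */
  #define STATE_MASKED	(1 << 1)	/* cap or vector masked */
  #define STATE_EMPTY	(1 << 2)	/* no vector registered */

  int main(void)
  {
  	uint8_t virt = 0, phys = 0;

  	/* 1. Guest writes Enable|MASKALL to the MSI-X capability */
  	virt |= STATE_ENABLED | STATE_MASKED;
  	/* ... kvmtool issues one SET_IRQS: cap enabled, no vectors yet */
  	phys |= STATE_ENABLED | STATE_EMPTY;

  	/* 2. Guest fills the vector table; masked, so no ioctl here */

  	/* 3. Guest clears MASKALL: one SET_IRQS registers all eventfds */
  	virt &= ~STATE_MASKED;
  	phys &= ~STATE_EMPTY;

  	printf("virt=%#x phys=%#x\n", virt, phys);
  	return 0;
  }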

An important missing feature is the absence of pending MSI handling. When
a vector or the function is masked, we should rewire the IRQFD to a
special thread that keeps note of pending interrupts (or just poll the
IRQFD before recreating the route?). And when the vector is unmasked, one
MSI should be injected if it was pending. At the moment no MSI is
injected: we simply disconnect the IRQFD and all messages are lost.
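
A minimal sketch of the "poll the IRQFD before recreating the route" idea
(assuming the trigger eventfd is kept open while the vector is masked;
this is not implemented by the patch):

  #include <poll.h>
  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <sys/eventfd.h>
  #include <unistd.h>

  /* Non-blocking probe: did the device fire while the vector was masked? */
  static bool msi_eventfd_pending(int efd)
  {
  	uint64_t count;
  	struct pollfd pfd = {
  		.fd	= efd,
  		.events	= POLLIN,
  	};

  	/* A zero timeout turns poll() into a pure readiness check */
  	if (poll(&pfd, 1, 0) <= 0 || !(pfd.revents & POLLIN))
  		return false;

  	/* Consume the counter so the event isn't reported twice */
  	return read(efd, &count, sizeof(count)) == sizeof(count);
  }

  int main(void)
  {
  	int efd = eventfd(0, 0);
  	uint64_t one = 1;

  	write(efd, &one, sizeof(one));	/* simulate a device trigger */
  	printf("pending: %d\n", msi_eventfd_pending(efd));	/* 1 */
  	printf("pending: %d\n", msi_eventfd_pending(efd));	/* 0 */
  	close(efd);
  	return 0;
  }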

Reviewed-by: Punit Agrawal <punit.agrawal@arm.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
---
 include/kvm/vfio.h |  52 ++++
 vfio/pci.c         | 651 ++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 691 insertions(+), 12 deletions(-)

diff --git a/include/kvm/vfio.h b/include/kvm/vfio.h
index c434703ab..483ba7e42 100644
--- a/include/kvm/vfio.h
+++ b/include/kvm/vfio.h
@@ -1,6 +1,7 @@
 #ifndef KVM__VFIO_H
 #define KVM__VFIO_H
 
+#include "kvm/mutex.h"
 #include "kvm/parse-options.h"
 #include "kvm/pci.h"
 
@@ -24,8 +25,59 @@ enum vfio_device_type {
 	VFIO_DEVICE_PCI,
 };
 
+/* MSI/MSI-X capability enabled */
+#define VFIO_PCI_MSI_STATE_ENABLED	(1 << 0)
+/* MSI/MSI-X capability or individual vector masked */
+#define VFIO_PCI_MSI_STATE_MASKED	(1 << 1)
+/* MSI-X capability has no vector enabled yet */
+#define VFIO_PCI_MSI_STATE_EMPTY	(1 << 2)
+
+struct vfio_pci_msi_entry {
+	struct msix_table		config;
+	int				gsi;
+	int				eventfd;
+	u8				phys_state;
+	u8				virt_state;
+};
+
+struct vfio_pci_msix_table {
+	size_t				size;
+	unsigned int			bar;
+	u32				guest_phys_addr;
+};
+
+struct vfio_pci_msix_pba {
+	size_t				size;
+	off_t				offset; /* in VFIO device fd */
+	unsigned int			bar;
+	u32				guest_phys_addr;
+};
+
+/* Common data for MSI and MSI-X */
+struct vfio_pci_msi_common {
+	off_t				pos;
+	u8				virt_state;
+	u8				phys_state;
+	struct mutex			mutex;
+	struct vfio_irq_info		info;
+	struct vfio_irq_set		*irq_set;
+	size_t				nr_entries;
+	struct vfio_pci_msi_entry	*entries;
+};
+
+#define VFIO_PCI_IRQ_MODE_INTX		(1 << 0)
+#define VFIO_PCI_IRQ_MODE_MSI		(1 << 1)
+#define VFIO_PCI_IRQ_MODE_MSIX		(1 << 2)
+
 struct vfio_pci_device {
 	struct pci_device_header	hdr;
+
+	unsigned long			irq_modes;
+	int				intx_fd;
+	unsigned int			intx_gsi;
+	struct vfio_pci_msi_common	msix;
+	struct vfio_pci_msix_table	msix_table;
+	struct vfio_pci_msix_pba	msix_pba;
 };
 
 struct vfio_region {
diff --git a/vfio/pci.c b/vfio/pci.c
index 6b157cfa3..b27de8548 100644
--- a/vfio/pci.c
+++ b/vfio/pci.c
@@ -5,6 +5,8 @@
 
 #include <sys/ioctl.h>
 #include <sys/eventfd.h>
+#include <sys/resource.h>
+#include <sys/time.h>
 
 /* Wrapper around UAPI vfio_irq_set */
 struct vfio_irq_eventfd {
@@ -12,6 +14,318 @@ struct vfio_irq_eventfd {
 	int			fd;
 };
 
+#define msi_is_enabled(state)		((state) & VFIO_PCI_MSI_STATE_ENABLED)
+#define msi_is_masked(state)		((state) & VFIO_PCI_MSI_STATE_MASKED)
+#define msi_is_empty(state)		((state) & VFIO_PCI_MSI_STATE_EMPTY)
+
+#define msi_update_state(state, val, bit)				\
+	(state) = (val) ? (state) | (bit) : (state) & ~(bit);
+#define msi_set_enabled(state, val)					\
+	msi_update_state(state, val, VFIO_PCI_MSI_STATE_ENABLED)
+#define msi_set_masked(state, val)					\
+	msi_update_state(state, val, VFIO_PCI_MSI_STATE_MASKED)
+#define msi_set_empty(state, val)					\
+	msi_update_state(state, val, VFIO_PCI_MSI_STATE_EMPTY)
+
+static void vfio_pci_disable_intx(struct kvm *kvm, struct vfio_device *vdev);
+
+static int vfio_pci_enable_msis(struct kvm *kvm, struct vfio_device *vdev)
+{
+	size_t i;
+	int ret = 0;
+	int *eventfds;
+	struct vfio_pci_device *pdev = &vdev->pci;
+	struct vfio_pci_msi_common *msis = &pdev->msix;
+	struct vfio_irq_eventfd single = {
+		.irq = {
+			.argsz	= sizeof(single),
+			.flags	= VFIO_IRQ_SET_DATA_EVENTFD |
+				  VFIO_IRQ_SET_ACTION_TRIGGER,
+			.index	= msis->info.index,
+			.count	= 1,
+		},
+	};
+
+	if (!msi_is_enabled(msis->virt_state))
+		return 0;
+
+	if (pdev->irq_modes & VFIO_PCI_IRQ_MODE_INTX) {
+		/*
+		 * PCI (and VFIO) forbids enabling INTx, MSI or MSIX at the same
+		 * time. Since INTx has to be enabled from the start (we don't
+		 * have a reliable way to know when the user starts using it),
+		 * disable it now.
+		 */
+		vfio_pci_disable_intx(kvm, vdev);
+		/* Permanently disable INTx */
+		pdev->irq_modes &= ~VFIO_PCI_IRQ_MODE_INTX;
+	}
+
+	eventfds = (void *)msis->irq_set + sizeof(struct vfio_irq_set);
+
+	/*
+	 * Initial registration of the full range. This enables the physical
+	 * MSI/MSI-X capability, which might have desired side effects. For
+	 * instance when assigning virtio legacy devices, enabling the MSI
+	 * capability modifies the config space layout!
+	 *
+	 * As an optimization, only update MSIs when the guest unmasks the
+	 * capability. This greatly reduces init time for a Linux guest
+	 * with 2048+ MSIs. A Linux guest starts by enabling the MSI-X cap
+	 * masked, then fills individual vectors, then unmasks the whole
+	 * function. So we only do one VFIO ioctl when enabling for the first
+	 * time, and then one when unmasking.
+	 *
+	 * phys_state is empty when it is enabled but no vector has been
+	 * registered via SET_IRQS yet.
+	 */
+	if (!msi_is_enabled(msis->phys_state) ||
+	    (!msi_is_masked(msis->virt_state) &&
+	     msi_is_empty(msis->phys_state))) {
+		bool empty = true;
+
+		for (i = 0; i < msis->nr_entries; i++) {
+			eventfds[i] = msis->entries[i].gsi >= 0 ?
+				      msis->entries[i].eventfd : -1;
+
+			if (eventfds[i] >= 0)
+				empty = false;
+		}
+
+		ret = ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, msis->irq_set);
+		if (ret < 0) {
+			perror("VFIO_DEVICE_SET_IRQS(multi)");
+			return ret;
+		}
+
+		msi_set_enabled(msis->phys_state, true);
+		msi_set_empty(msis->phys_state, empty);
+
+		return 0;
+	}
+
+	if (msi_is_masked(msis->virt_state)) {
+		/* TODO: if phys_state is neither empty nor masked, mask all vectors */
+		return 0;
+	}
+
+	/* Update individual vectors to avoid breaking those in use */
+	for (i = 0; i < msis->nr_entries; i++) {
+		struct vfio_pci_msi_entry *entry = &msis->entries[i];
+		int fd = entry->gsi >= 0 ? entry->eventfd : -1;
+
+		if (fd == eventfds[i])
+			continue;
+
+		single.irq.start = i;
+		single.fd = fd;
+
+		ret = ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &single);
+		if (ret < 0) {
+			perror("VFIO_DEVICE_SET_IRQS(single)");
+			break;
+		}
+
+		eventfds[i] = fd;
+
+		if (msi_is_empty(msis->phys_state) && fd >= 0)
+			msi_set_empty(msis->phys_state, false);
+	}
+
+	return ret;
+}
+
+static int vfio_pci_disable_msis(struct kvm *kvm, struct vfio_device *vdev)
+{
+	int ret;
+	struct vfio_pci_device *pdev = &vdev->pci;
+	struct vfio_pci_msi_common *msis = &pdev->msix;
+	struct vfio_irq_set irq_set = {
+		.argsz	= sizeof(irq_set),
+		.flags 	= VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_TRIGGER,
+		.index 	= msis->info.index,
+		.start 	= 0,
+		.count	= 0,
+	};
+
+	if (!msi_is_enabled(msis->phys_state))
+		return 0;
+
+	ret = ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
+	if (ret < 0) {
+		perror("VFIO_DEVICE_SET_IRQS(NONE)");
+		return ret;
+	}
+
+	msi_set_enabled(msis->phys_state, false);
+	msi_set_empty(msis->phys_state, true);
+
+	return 0;
+}
+
+static int vfio_pci_update_msi_entry(struct kvm *kvm, struct vfio_device *vdev,
+				     struct vfio_pci_msi_entry *entry)
+{
+	int ret;
+
+	if (entry->eventfd < 0) {
+		entry->eventfd = eventfd(0, 0);
+		if (entry->eventfd < 0) {
+			ret = -errno;
+			vfio_dev_err(vdev, "cannot create eventfd");
+			return ret;
+		}
+	}
+
+	/* Allocate IRQ if necessary */
+	if (entry->gsi < 0) {
+		int ret = irq__add_msix_route(kvm, &entry->config.msg,
+					      vdev->dev_hdr.dev_num << 3);
+		if (ret < 0) {
+			vfio_dev_err(vdev, "cannot create MSI-X route");
+			return ret;
+		}
+		entry->gsi = ret;
+	} else {
+		irq__update_msix_route(kvm, entry->gsi, &entry->config.msg);
+	}
+
+	/*
+	 * MSI masking is unimplemented in VFIO, so we have to handle it by
+	 * disabling/enabling IRQ route instead. We do it on the KVM side rather
+	 * than VFIO, because:
+	 * - it is 8x faster
+	 * - it decouples masking logic from capability state.
+	 * - in masked state, after removing irqfd route, we could easily plug
+	 *   the eventfd in a local handler, in order to serve Pending Bit reads
+	 *   to the guest.
+	 *
+	 * So entry->phys_state is masked when there is no active irqfd route.
+	 */
+	if (msi_is_masked(entry->virt_state) == msi_is_masked(entry->phys_state))
+		return 0;
+
+	if (msi_is_masked(entry->phys_state)) {
+		ret = irq__add_irqfd(kvm, entry->gsi, entry->eventfd, -1);
+		if (ret < 0) {
+			vfio_dev_err(vdev, "cannot setup irqfd");
+			return ret;
+		}
+	} else {
+		irq__del_irqfd(kvm, entry->gsi, entry->eventfd);
+	}
+
+	msi_set_masked(entry->phys_state, msi_is_masked(entry->virt_state));
+
+	return 0;
+}
+
+static void vfio_pci_msix_pba_access(struct kvm_cpu *vcpu, u64 addr, u8 *data,
+				     u32 len, u8 is_write, void *ptr)
+{
+	struct vfio_pci_device *pdev = ptr;
+	struct vfio_pci_msix_pba *pba = &pdev->msix_pba;
+	u64 offset = addr - pba->guest_phys_addr;
+	struct vfio_device *vdev = container_of(pdev, struct vfio_device, pci);
+
+	if (is_write)
+		return;
+
+	/*
+	 * TODO: emulate PBA. Hardware MSI-X is never masked, so reading the PBA
+	 * is completely useless here. Note that Linux doesn't use PBA.
+	 */
+	if (pread(vdev->fd, data, len, pba->offset + offset) != (ssize_t)len)
+		vfio_dev_err(vdev, "cannot access MSIX PBA\n");
+}
+
+static void vfio_pci_msix_table_access(struct kvm_cpu *vcpu, u64 addr, u8 *data,
+				       u32 len, u8 is_write, void *ptr)
+{
+	struct kvm *kvm = vcpu->kvm;
+	struct vfio_pci_msi_entry *entry;
+	struct vfio_pci_device *pdev = ptr;
+	struct vfio_device *vdev = container_of(pdev, struct vfio_device, pci);
+
+	u64 offset = addr - pdev->msix_table.guest_phys_addr;
+
+	size_t vector = offset / PCI_MSIX_ENTRY_SIZE;
+	off_t field = offset % PCI_MSIX_ENTRY_SIZE;
+
+	/*
+	 * The PCI spec says that software must use aligned 4- or 8-byte accesses
+	 * for the MSI-X tables.
+	 */
+	if ((len != 4 && len != 8) || addr & (len - 1)) {
+		vfio_dev_warn(vdev, "invalid MSI-X table access");
+		return;
+	}
+
+	entry = &pdev->msix.entries[vector];
+
+	mutex_lock(&pdev->msix.mutex);
+
+	if (!is_write) {
+		memcpy(data, (void *)&entry->config + field, len);
+		goto out_unlock;
+	}
+
+	memcpy((void *)&entry->config + field, data, len);
+
+	/*
+	 * Check if access touched the vector control register, which is at the
+	 * end of the MSI-X entry.
+	 */
+	if (field + len <= PCI_MSIX_ENTRY_VECTOR_CTRL)
+		goto out_unlock;
+
+	msi_set_masked(entry->virt_state, entry->config.ctrl &
+		       PCI_MSIX_ENTRY_CTRL_MASKBIT);
+
+	if (vfio_pci_update_msi_entry(kvm, vdev, entry) < 0)
+		/* Not much we can do here. */
+		vfio_dev_err(vdev, "failed to configure MSIX vector %zu", vector);
+
+	/* Update the physical capability if necessary */
+	if (vfio_pci_enable_msis(kvm, vdev))
+		vfio_dev_err(vdev, "cannot enable MSIX");
+
+out_unlock:
+	mutex_unlock(&pdev->msix.mutex);
+}
+
+static void vfio_pci_msix_cap_write(struct kvm *kvm,
+				    struct vfio_device *vdev, u8 off,
+				    void *data, int sz)
+{
+	struct vfio_pci_device *pdev = &vdev->pci;
+	off_t enable_pos = PCI_MSIX_FLAGS + 1;
+	bool enable;
+	u16 flags;
+
+	off -= pdev->msix.pos;
+
+	/* Check if access intersects with the MSI-X Enable bit */
+	if (off > enable_pos || off + sz <= enable_pos)
+		return;
+
+	/* Read byte that contains the Enable bit */
+	flags = *(u8 *)(data + enable_pos - off) << 8;
+
+	mutex_lock(&pdev->msix.mutex);
+
+	msi_set_masked(pdev->msix.virt_state, flags & PCI_MSIX_FLAGS_MASKALL);
+	enable = flags & PCI_MSIX_FLAGS_ENABLE;
+	msi_set_enabled(pdev->msix.virt_state, enable);
+
+	if (enable && vfio_pci_enable_msis(kvm, vdev))
+		vfio_dev_err(vdev, "cannot enable MSIX");
+	else if (!enable && vfio_pci_disable_msis(kvm, vdev))
+		vfio_dev_err(vdev, "cannot disable MSIX");
+
+	mutex_unlock(&pdev->msix.mutex);
+}
+
 static void vfio_pci_cfg_read(struct kvm *kvm, struct pci_device_header *pci_hdr,
 			      u8 offset, void *data, int sz)
 {
@@ -46,29 +360,102 @@ static void vfio_pci_cfg_write(struct kvm *kvm, struct pci_device_header *pci_hd
 		vfio_dev_warn(vdev, "Failed to write %d bytes to Configuration Space at 0x%x",
 			      sz, offset);
 
+	/* Handle MSI write now, since it might update the hardware capability */
+	if (pdev->irq_modes & VFIO_PCI_IRQ_MODE_MSIX)
+		vfio_pci_msix_cap_write(kvm, vdev, offset, data, sz);
+
 	if (pread(vdev->fd, base + offset, sz, info->offset + offset) != sz)
 		vfio_dev_warn(vdev, "Failed to read %d bytes from Configuration Space at 0x%x",
 			      sz, offset);
 }
 
+static ssize_t vfio_pci_cap_size(struct pci_cap_hdr *cap_hdr)
+{
+	switch (cap_hdr->type) {
+	case PCI_CAP_ID_MSIX:
+		return PCI_CAP_MSIX_SIZEOF;
+	default:
+		pr_err("unknown PCI capability 0x%x", cap_hdr->type);
+		return 0;
+	}
+}
+
+static int vfio_pci_add_cap(struct vfio_device *vdev, u8 *virt_hdr,
+			    struct pci_cap_hdr *cap, off_t pos)
+{
+	struct pci_cap_hdr *last;
+	struct pci_device_header *hdr = &vdev->pci.hdr;
+
+	cap->next = 0;
+
+	if (!hdr->capabilities) {
+		hdr->capabilities = pos;
+		hdr->status |= PCI_STATUS_CAP_LIST;
+	} else {
+		last = PCI_CAP(virt_hdr, hdr->capabilities);
+
+		while (last->next)
+			last = PCI_CAP(virt_hdr, last->next);
+
+		last->next = pos;
+	}
+
+	memcpy(virt_hdr + pos, cap, vfio_pci_cap_size(cap));
+
+	return 0;
+}
+
 static int vfio_pci_parse_caps(struct vfio_device *vdev)
 {
+	int ret;
+	size_t size;
+	u8 pos, next;
+	struct pci_cap_hdr *cap;
+	u8 virt_hdr[PCI_DEV_CFG_SIZE];
 	struct vfio_pci_device *pdev = &vdev->pci;
 
 	if (!(pdev->hdr.status & PCI_STATUS_CAP_LIST))
 		return 0;
 
+	memset(virt_hdr, 0, PCI_DEV_CFG_SIZE);
+
+	pos = pdev->hdr.capabilities & ~3;
+
 	pdev->hdr.status &= ~PCI_STATUS_CAP_LIST;
 	pdev->hdr.capabilities = 0;
 
-	/* TODO: install virtual capabilities */
+	for (; pos; pos = next) {
+		if (pos >= PCI_DEV_CFG_SIZE) {
+			vfio_dev_warn(vdev, "ignoring cap outside of config space");
+			return -EINVAL;
+		}
+
+		cap = PCI_CAP(&pdev->hdr, pos);
+		next = cap->next;
+
+		switch (cap->type) {
+		case PCI_CAP_ID_MSIX:
+			ret = vfio_pci_add_cap(vdev, virt_hdr, cap, pos);
+			if (ret)
+				return ret;
+
+			pdev->msix.pos = pos;
+			pdev->irq_modes |= VFIO_PCI_IRQ_MODE_MSIX;
+			break;
+		}
+	}
+
+	/* Wipe remaining capabilities */
+	pos = PCI_STD_HEADER_SIZEOF;
+	size = PCI_DEV_CFG_SIZE - PCI_STD_HEADER_SIZEOF;
+	memcpy((void *)&pdev->hdr + pos, virt_hdr + pos, size);
 
 	return 0;
 }
 
 static int vfio_pci_parse_cfg_space(struct vfio_device *vdev)
 {
-	ssize_t sz = PCI_STD_HEADER_SIZEOF;
+	ssize_t sz = PCI_DEV_CFG_SIZE;
 	struct vfio_region_info *info;
 	struct vfio_pci_device *pdev = &vdev->pci;
 
@@ -89,6 +476,7 @@ static int vfio_pci_parse_cfg_space(struct vfio_device *vdev)
 		return -EINVAL;
 	}
 
+	/* Read standard headers and capabilities */
 	if (pread(vdev->fd, &pdev->hdr, sz, info->offset) != sz) {
 		vfio_dev_err(vdev, "failed to read %zd bytes of Config Space", sz);
 		return -EIO;
@@ -103,6 +491,9 @@ static int vfio_pci_parse_cfg_space(struct vfio_device *vdev)
 		return -EOPNOTSUPP;
 	}
 
+	if (pdev->hdr.irq_pin)
+		pdev->irq_modes |= VFIO_PCI_IRQ_MODE_INTX;
+
 	vfio_pci_parse_caps(vdev);
 
 	return 0;
@@ -112,6 +503,7 @@ static int vfio_pci_fixup_cfg_space(struct vfio_device *vdev)
 {
 	int i;
 	ssize_t hdr_sz;
+	struct msix_cap *msix;
 	struct vfio_region_info *info;
 	struct vfio_pci_device *pdev = &vdev->pci;
 
@@ -144,6 +536,22 @@ static int vfio_pci_fixup_cfg_space(struct vfio_device *vdev)
 	 */
 	pdev->hdr.exp_rom_bar = 0;
 
+	/* Plumb in our fake MSI-X capability, if we have it. */
+	msix = pci_find_cap(&pdev->hdr, PCI_CAP_ID_MSIX);
+	if (msix) {
+		/* Add a shortcut to the PBA region for the MMIO handler */
+		int pba_index = VFIO_PCI_BAR0_REGION_INDEX + pdev->msix_pba.bar;
+		pdev->msix_pba.offset = vdev->regions[pba_index].info.offset +
+					(msix->pba_offset & PCI_MSIX_PBA_OFFSET);
+
+		/* Tidy up the capability */
+		msix->table_offset &= PCI_MSIX_TABLE_BIR;
+		msix->pba_offset &= PCI_MSIX_PBA_BIR;
+		if (pdev->msix_table.bar == pdev->msix_pba.bar)
+			msix->pba_offset |= pdev->msix_table.size &
+					    PCI_MSIX_PBA_OFFSET;
+	}
+
 	/* Install our fake Configuration Space */
 	info = &vdev->regions[VFIO_PCI_CONFIG_REGION_INDEX].info;
 	hdr_sz = PCI_DEV_CFG_SIZE;
@@ -164,11 +572,84 @@ static int vfio_pci_fixup_cfg_space(struct vfio_device *vdev)
 	return 0;
 }
 
+static int vfio_pci_create_msix_table(struct kvm *kvm,
+				      struct vfio_pci_device *pdev)
+{
+	int ret;
+	size_t i;
+	size_t mmio_size;
+	size_t nr_entries;
+	struct vfio_pci_msi_entry *entries;
+	struct vfio_pci_msix_pba *pba = &pdev->msix_pba;
+	struct vfio_pci_msix_table *table = &pdev->msix_table;
+	struct msix_cap *msix = PCI_CAP(&pdev->hdr, pdev->msix.pos);
+
+	table->bar = msix->table_offset & PCI_MSIX_TABLE_BIR;
+	pba->bar = msix->pba_offset & PCI_MSIX_TABLE_BIR;
+
+	/*
+	 * KVM needs memory regions to be multiple of and aligned on PAGE_SIZE.
+	 */
+	nr_entries = (msix->ctrl & PCI_MSIX_FLAGS_QSIZE) + 1;
+	table->size = ALIGN(nr_entries * PCI_MSIX_ENTRY_SIZE, PAGE_SIZE);
+	pba->size = ALIGN(DIV_ROUND_UP(nr_entries, 64), PAGE_SIZE);
+
+	entries = calloc(nr_entries, sizeof(struct vfio_pci_msi_entry));
+	if (!entries)
+		return -ENOMEM;
+
+	for (i = 0; i < nr_entries; i++)
+		entries[i].config.ctrl = PCI_MSIX_ENTRY_CTRL_MASKBIT;
+
+	/*
+	 * To ease MSI-X cap configuration in case they share the same BAR,
+	 * collapse table and pending array. The size of the BAR regions must be
+	 * powers of two.
+	 */
+	mmio_size = roundup_pow_of_two(table->size + pba->size);
+	table->guest_phys_addr = pci_get_io_space_block(mmio_size);
+	if (!table->guest_phys_addr) {
+		pr_err("cannot allocate IO space");
+		ret = -ENOMEM;
+		goto out_free;
+	}
+	pba->guest_phys_addr = table->guest_phys_addr + table->size;
+
+	ret = kvm__register_mmio(kvm, table->guest_phys_addr, table->size,
+				 false, vfio_pci_msix_table_access, pdev);
+	if (ret < 0)
+		goto out_free;
+
+	/*
+	 * We could map the physical PBA directly into the guest, but it's
+	 * likely smaller than a page, and we can only hand full pages to the
+	 * guest. Even though the PCI spec disallows sharing a page used for
+ * MSI-X with any other resource, it allows sharing the same page between
+ * the MSI-X table and the PBA. For the sake of isolation, create a
+	 * virtual PBA.
+	 */
+	ret = kvm__register_mmio(kvm, pba->guest_phys_addr, pba->size, false,
+				 vfio_pci_msix_pba_access, pdev);
+	if (ret < 0)
+		goto out_free;
+
+	pdev->msix.entries = entries;
+	pdev->msix.nr_entries = nr_entries;
+
+	return 0;
+
+out_free:
+	free(entries);
+
+	return ret;
+}
+
 static int vfio_pci_configure_bar(struct kvm *kvm, struct vfio_device *vdev,
 				  size_t nr)
 {
 	int ret;
 	size_t map_size;
+	struct vfio_pci_device *pdev = &vdev->pci;
 	struct vfio_region *region = &vdev->regions[nr];
 
 	if (nr >= vdev->info.num_regions)
@@ -190,6 +671,17 @@ static int vfio_pci_configure_bar(struct kvm *kvm, struct vfio_device *vdev,
 	if (!region->info.size)
 		return 0;
 
+	if (pdev->irq_modes & VFIO_PCI_IRQ_MODE_MSIX) {
+		/* Trap and emulate MSI-X table */
+		if (nr == pdev->msix_table.bar) {
+			region->guest_phys_addr = pdev->msix_table.guest_phys_addr;
+			return 0;
+		} else if (nr == pdev->msix_pba.bar) {
+			region->guest_phys_addr = pdev->msix_pba.guest_phys_addr;
+			return 0;
+		}
+	}
+
 	/* Grab some MMIO space in the guest */
 	map_size = ALIGN(region->info.size, PAGE_SIZE);
 	region->guest_phys_addr = pci_get_io_space_block(map_size);
@@ -218,6 +710,12 @@ static int vfio_pci_configure_dev_regions(struct kvm *kvm,
 	if (ret)
 		return ret;
 
+	if (pdev->irq_modes & VFIO_PCI_IRQ_MODE_MSIX) {
+		ret = vfio_pci_create_msix_table(kvm, pdev);
+		if (ret)
+			return ret;
+	}
+
 	for (i = VFIO_PCI_BAR0_REGION_INDEX; i <= VFIO_PCI_BAR5_REGION_INDEX; ++i) {
 		/* Ignore top half of 64-bit BAR */
 		if (i % 2 && is_64bit)
@@ -237,6 +735,122 @@ static int vfio_pci_configure_dev_regions(struct kvm *kvm,
 	return vfio_pci_fixup_cfg_space(vdev);
 }
 
+/*
+ * Attempt to update the FD limit, if opening an eventfd for each IRQ vector
+ * would hit the limit, which is likely to happen when a device uses 2048 MSIs.
+ */
+static int vfio_pci_reserve_irq_fds(size_t num)
+{
+	/*
+	 * I counted around 27 fds under normal load. Let's add 100 for good
+	 * measure.
+	 */
+	static size_t needed = 128;
+	struct rlimit fd_limit, new_limit;
+
+	needed += num;
+
+	if (getrlimit(RLIMIT_NOFILE, &fd_limit)) {
+		perror("getrlimit(RLIMIT_NOFILE)");
+		return 0;
+	}
+
+	if (fd_limit.rlim_cur >= needed)
+		return 0;
+
+	new_limit.rlim_cur = needed;
+
+	if (fd_limit.rlim_max < needed)
+		/* Try to bump hard limit (root only) */
+		new_limit.rlim_max = needed;
+	else
+		new_limit.rlim_max = fd_limit.rlim_max;
+
+	if (setrlimit(RLIMIT_NOFILE, &new_limit)) {
+		perror("setrlimit(RLIMIT_NOFILE)");
+		pr_warning("not enough FDs for full MSI-X support (estimated shortfall: %zu)",
+			   (size_t)(needed - fd_limit.rlim_cur));
+	}
+
+	return 0;
+}
+
+static int vfio_pci_init_msis(struct kvm *kvm, struct vfio_device *vdev,
+			     struct vfio_pci_msi_common *msis)
+{
+	int ret;
+	size_t i;
+	int *eventfds;
+	size_t irq_set_size;
+	struct vfio_pci_msi_entry *entry;
+	size_t nr_entries = msis->nr_entries;
+
+	ret = ioctl(vdev->fd, VFIO_DEVICE_GET_IRQ_INFO, &msis->info);
+	if (ret || msis->info.count == 0) {
+		vfio_dev_err(vdev, "no MSI reported by VFIO");
+		return -ENODEV;
+	}
+
+	if (!(msis->info.flags & VFIO_IRQ_INFO_EVENTFD)) {
+		vfio_dev_err(vdev, "interrupt not EVENTFD capable");
+		return -EINVAL;
+	}
+
+	if (msis->info.count != nr_entries) {
+		vfio_dev_err(vdev, "invalid number of MSIs reported by VFIO");
+		return -EINVAL;
+	}
+
+	mutex_init(&msis->mutex);
+
+	vfio_pci_reserve_irq_fds(nr_entries);
+
+	irq_set_size = sizeof(struct vfio_irq_set) + nr_entries * sizeof(int);
+	msis->irq_set = malloc(irq_set_size);
+	if (!msis->irq_set)
+		return -ENOMEM;
+
+	*msis->irq_set = (struct vfio_irq_set) {
+		.argsz	= irq_set_size,
+		.flags 	= VFIO_IRQ_SET_DATA_EVENTFD |
+			  VFIO_IRQ_SET_ACTION_TRIGGER,
+		.index 	= msis->info.index,
+		.start 	= 0,
+		.count 	= nr_entries,
+	};
+
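+	/* The eventfd array immediately follows the vfio_irq_set header */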
+	eventfds = (void *)msis->irq_set + sizeof(struct vfio_irq_set);
+
+	for (i = 0; i < nr_entries; i++) {
+		entry = &msis->entries[i];
+		entry->gsi = -1;
+		entry->eventfd = -1;
+		msi_set_masked(entry->virt_state, true);
+		msi_set_masked(entry->phys_state, true);
+		eventfds[i] = -1;
+	}
+
+	return 0;
+}
+
+static void vfio_pci_disable_intx(struct kvm *kvm, struct vfio_device *vdev)
+{
+	struct vfio_pci_device *pdev = &vdev->pci;
+	int gsi = pdev->intx_gsi;
+	struct vfio_irq_set irq_set = {
+		.argsz	= sizeof(irq_set),
+		.flags	= VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_TRIGGER,
+		.index	= VFIO_PCI_INTX_IRQ_INDEX,
+	};
+
+	pr_debug("user requested MSI, disabling INTx %d", gsi);
+
+	ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
+	irq__del_irqfd(kvm, gsi, pdev->intx_fd);
+
+	close(pdev->intx_fd);
+}
+
 static int vfio_pci_enable_intx(struct kvm *kvm, struct vfio_device *vdev)
 {
 	int ret;
@@ -251,6 +865,8 @@ static int vfio_pci_enable_intx(struct kvm *kvm, struct vfio_device *vdev)
 		.index = VFIO_PCI_INTX_IRQ_INDEX,
 	};
 
+	vfio_pci_reserve_irq_fds(2);
+
 	ret = ioctl(vdev->fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info);
 	if (ret || irq_info.count == 0) {
 		vfio_dev_err(vdev, "no INTx reported by VFIO");
@@ -319,6 +935,10 @@ static int vfio_pci_enable_intx(struct kvm *kvm, struct vfio_device *vdev)
 		goto err_remove_event;
 	}
 
+	pdev->intx_fd = trigger_fd;
+	/* Guest is going to overwrite our irq_line... */
+	pdev->intx_gsi = gsi;
+
 	return 0;
 
 err_remove_event:
@@ -338,20 +958,23 @@ err_close:
 
 static int vfio_pci_configure_dev_irqs(struct kvm *kvm, struct vfio_device *vdev)
 {
+	int ret = 0;
 	struct vfio_pci_device *pdev = &vdev->pci;
 
-	struct vfio_irq_info irq_info = {
-		.argsz = sizeof(irq_info),
-		.index = VFIO_PCI_INTX_IRQ_INDEX,
-	};
-
-	if (!pdev->hdr.irq_pin) {
-		/* TODO: add MSI support */
-		vfio_dev_err(vdev, "INTx not available, MSI-X not implemented");
-		return -ENOSYS;
+	if (pdev->irq_modes & VFIO_PCI_IRQ_MODE_MSIX) {
+		pdev->msix.info = (struct vfio_irq_info) {
+			.argsz = sizeof(pdev->msix.info),
+			.index = VFIO_PCI_MSIX_IRQ_INDEX,
+		};
+		ret = vfio_pci_init_msis(kvm, vdev, &pdev->msix);
+		if (ret)
+			return ret;
 	}
 
-	return vfio_pci_enable_intx(kvm, vdev);
+	if (pdev->irq_modes & VFIO_PCI_IRQ_MODE_INTX)
+		ret = vfio_pci_enable_intx(kvm, vdev);
+
+	return ret;
 }
 
 int vfio_pci_setup_device(struct kvm *kvm, struct vfio_device *vdev)
@@ -387,9 +1010,13 @@ int vfio_pci_setup_device(struct kvm *kvm, struct vfio_device *vdev)
 void vfio_pci_teardown_device(struct kvm *kvm, struct vfio_device *vdev)
 {
 	size_t i;
+	struct vfio_pci_device *pdev = &vdev->pci;
 
 	for (i = 0; i < vdev->info.num_regions; i++)
 		vfio_unmap_region(kvm, &vdev->regions[i]);
 
 	device__unregister(&vdev->dev_hdr);
+
+	free(pdev->msix.irq_set);
+	free(pdev->msix.entries);
 }
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v6 kvmtool 10/13] vfio-pci: add MSI support
  2018-06-18 18:41 [PATCH v6 kvmtool 00/13] Add vfio-pci support Jean-Philippe Brucker
                   ` (8 preceding siblings ...)
  2018-06-18 18:42 ` [PATCH v6 kvmtool 09/13] vfio-pci: add MSI-X support Jean-Philippe Brucker
@ 2018-06-18 18:42 ` Jean-Philippe Brucker
  2018-06-18 18:42 ` [PATCH v6 kvmtool 11/13] vfio: Support non-mmappable regions Jean-Philippe Brucker
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Jean-Philippe Brucker @ 2018-06-18 18:42 UTC (permalink / raw)
  To: kvm
  Cc: lorenzo.pieralisi, marc.zyngier, punit.agrawal, will.deacon,
	alex.williamson, robin.murphy, kvmarm

Allow guests to use the MSI capability in devices that support it. Emulate
the MSI capability, which is simpler than MSI-X as it doesn't rely on
external tables. Reuse most of the MSI-X code. Guests may choose between
MSI and MSI-X at runtime since we present both capabilities, but they
cannot enable MSI and MSI-X at the same time (forbidden by PCI).

Reviewed-by: Punit Agrawal <punit.agrawal@arm.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
---
 include/kvm/pci.h  |  23 ++++++
 include/kvm/vfio.h |   1 +
 vfio/pci.c         | 178 +++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 195 insertions(+), 7 deletions(-)

diff --git a/include/kvm/pci.h b/include/kvm/pci.h
index 274b77ea6..a86c15a70 100644
--- a/include/kvm/pci.h
+++ b/include/kvm/pci.h
@@ -59,6 +59,29 @@ struct msix_cap {
 	u32 pba_offset;
 };
 
+struct msi_cap_64 {
+	u8 cap;
+	u8 next;
+	u16 ctrl;
+	u32 address_lo;
+	u32 address_hi;
+	u16 data;
+	u16 _align;
+	u32 mask_bits;
+	u32 pend_bits;
+};
+
+struct msi_cap_32 {
+	u8 cap;
+	u8 next;
+	u16 ctrl;
+	u32 address_lo;
+	u16 data;
+	u16 _align;
+	u32 mask_bits;
+	u32 pend_bits;
+};
+
 struct pci_cap_hdr {
 	u8	type;
 	u8	next;
diff --git a/include/kvm/vfio.h b/include/kvm/vfio.h
index 483ba7e42..2c621ec75 100644
--- a/include/kvm/vfio.h
+++ b/include/kvm/vfio.h
@@ -75,6 +75,7 @@ struct vfio_pci_device {
 	unsigned long			irq_modes;
 	int				intx_fd;
 	unsigned int			intx_gsi;
+	struct vfio_pci_msi_common	msi;
 	struct vfio_pci_msi_common	msix;
 	struct vfio_pci_msix_table	msix_table;
 	struct vfio_pci_msix_pba	msix_pba;
diff --git a/vfio/pci.c b/vfio/pci.c
index b27de8548..3ed07fb43 100644
--- a/vfio/pci.c
+++ b/vfio/pci.c
@@ -29,13 +29,14 @@ struct vfio_irq_eventfd {
 
 static void vfio_pci_disable_intx(struct kvm *kvm, struct vfio_device *vdev);
 
-static int vfio_pci_enable_msis(struct kvm *kvm, struct vfio_device *vdev)
+static int vfio_pci_enable_msis(struct kvm *kvm, struct vfio_device *vdev,
+				bool msix)
 {
 	size_t i;
 	int ret = 0;
 	int *eventfds;
 	struct vfio_pci_device *pdev = &vdev->pci;
-	struct vfio_pci_msi_common *msis = &pdev->msix;
+	struct vfio_pci_msi_common *msis = msix ? &pdev->msix : &pdev->msi;
 	struct vfio_irq_eventfd single = {
 		.irq = {
 			.argsz	= sizeof(single),
@@ -135,11 +136,12 @@ static int vfio_pci_enable_msis(struct kvm *kvm, struct vfio_device *vdev)
 	return ret;
 }
 
-static int vfio_pci_disable_msis(struct kvm *kvm, struct vfio_device *vdev)
+static int vfio_pci_disable_msis(struct kvm *kvm, struct vfio_device *vdev,
+				 bool msix)
 {
 	int ret;
 	struct vfio_pci_device *pdev = &vdev->pci;
-	struct vfio_pci_msi_common *msis = &pdev->msix;
+	struct vfio_pci_msi_common *msis = msix ? &pdev->msix : &pdev->msi;
 	struct vfio_irq_set irq_set = {
 		.argsz	= sizeof(irq_set),
 		.flags 	= VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_TRIGGER,
@@ -287,7 +289,7 @@ static void vfio_pci_msix_table_access(struct kvm_cpu *vcpu, u64 addr, u8 *data,
 		vfio_dev_err(vdev, "failed to configure MSIX vector %zu", vector);
 
 	/* Update the physical capability if necessary */
-	if (vfio_pci_enable_msis(kvm, vdev))
+	if (vfio_pci_enable_msis(kvm, vdev, true))
 		vfio_dev_err(vdev, "cannot enable MSIX");
 
 out_unlock:
@@ -318,14 +320,120 @@ static void vfio_pci_msix_cap_write(struct kvm *kvm,
 	enable = flags & PCI_MSIX_FLAGS_ENABLE;
 	msi_set_enabled(pdev->msix.virt_state, enable);
 
-	if (enable && vfio_pci_enable_msis(kvm, vdev))
+	if (enable && vfio_pci_enable_msis(kvm, vdev, true))
 		vfio_dev_err(vdev, "cannot enable MSIX");
-	else if (!enable && vfio_pci_disable_msis(kvm, vdev))
+	else if (!enable && vfio_pci_disable_msis(kvm, vdev, true))
 		vfio_dev_err(vdev, "cannot disable MSIX");
 
 	mutex_unlock(&pdev->msix.mutex);
 }
 
+static int vfio_pci_msi_vector_write(struct kvm *kvm, struct vfio_device *vdev,
+				     u8 off, u8 *data, u32 sz)
+{
+	size_t i;
+	u32 mask = 0;
+	size_t mask_pos, start, limit;
+	struct vfio_pci_msi_entry *entry;
+	struct vfio_pci_device *pdev = &vdev->pci;
+	struct msi_cap_64 *msi_cap_64 = PCI_CAP(&pdev->hdr, pdev->msi.pos);
+
+	if (!(msi_cap_64->ctrl & PCI_MSI_FLAGS_MASKBIT))
+		return 0;
+
+	if (msi_cap_64->ctrl & PCI_MSI_FLAGS_64BIT)
+		mask_pos = PCI_MSI_MASK_64;
+	else
+		mask_pos = PCI_MSI_MASK_32;
+
+	if (off >= mask_pos + 4 || off + sz <= mask_pos)
+		return 0;
+
+	/* Set mask to current state */
+	for (i = 0; i < pdev->msi.nr_entries; i++) {
+		entry = &pdev->msi.entries[i];
+		mask |= !!msi_is_masked(entry->virt_state) << i;
+	}
+
+	/* Update mask following the intersection of access and register */
+	start = max_t(size_t, off, mask_pos);
+	limit = min_t(size_t, off + sz, mask_pos + 4);
+
+	memcpy((void *)&mask + start - mask_pos, data + start - off,
+	       limit - start);
+
+	/* Update states if necessary */
+	for (i = 0; i < pdev->msi.nr_entries; i++) {
+		bool masked = mask & (1 << i);
+
+		entry = &pdev->msi.entries[i];
+		if (masked != msi_is_masked(entry->virt_state)) {
+			msi_set_masked(entry->virt_state, masked);
+			vfio_pci_update_msi_entry(kvm, vdev, entry);
+		}
+	}
+
+	return 1;
+}
+
+static void vfio_pci_msi_cap_write(struct kvm *kvm, struct vfio_device *vdev,
+				   u8 off, u8 *data, u32 sz)
+{
+	u8 ctrl;
+	struct msi_msg msg;
+	size_t i, nr_vectors;
+	struct vfio_pci_msi_entry *entry;
+	struct vfio_pci_device *pdev = &vdev->pci;
+	struct msi_cap_64 *msi_cap_64 = PCI_CAP(&pdev->hdr, pdev->msi.pos);
+
+	off -= pdev->msi.pos;
+
+	mutex_lock(&pdev->msi.mutex);
+
+	/* Check if the guest is trying to update mask bits */
+	if (vfio_pci_msi_vector_write(kvm, vdev, off, data, sz))
+		goto out_unlock;
+
+	/* Only modify routes when guest pokes the enable bit */
+	if (off > PCI_MSI_FLAGS || off + sz <= PCI_MSI_FLAGS)
+		goto out_unlock;
+
+	ctrl = *(u8 *)(data + PCI_MSI_FLAGS - off);
+
+	msi_set_enabled(pdev->msi.virt_state, ctrl & PCI_MSI_FLAGS_ENABLE);
+
+	if (!msi_is_enabled(pdev->msi.virt_state)) {
+		vfio_pci_disable_msis(kvm, vdev, false);
+		goto out_unlock;
+	}
+
+	/* Create routes for the requested vectors */
+	nr_vectors = 1 << ((ctrl & PCI_MSI_FLAGS_QSIZE) >> 4);
+
+	msg.address_lo = msi_cap_64->address_lo;
+	if (msi_cap_64->ctrl & PCI_MSI_FLAGS_64BIT) {
+		msg.address_hi = msi_cap_64->address_hi;
+		msg.data = msi_cap_64->data;
+	} else {
+		struct msi_cap_32 *msi_cap_32 = (void *)msi_cap_64;
+		msg.address_hi = 0;
+		msg.data = msi_cap_32->data;
+	}
+
+	for (i = 0; i < nr_vectors; i++) {
+		entry = &pdev->msi.entries[i];
+		entry->config.msg = msg;
+		vfio_pci_update_msi_entry(kvm, vdev, entry);
+	}
+
+	/* Update the physical capability if necessary */
+	if (vfio_pci_enable_msis(kvm, vdev, false))
+		vfio_dev_err(vdev, "cannot enable MSI");
+
+out_unlock:
+	mutex_unlock(&pdev->msi.mutex);
+}
+
 static void vfio_pci_cfg_read(struct kvm *kvm, struct pci_device_header *pci_hdr,
 			      u8 offset, void *data, int sz)
 {
@@ -364,16 +472,33 @@ static void vfio_pci_cfg_write(struct kvm *kvm, struct pci_device_header *pci_hd
 	if (pdev->irq_modes & VFIO_PCI_IRQ_MODE_MSIX)
 		vfio_pci_msix_cap_write(kvm, vdev, offset, data, sz);
 
+	if (pdev->irq_modes & VFIO_PCI_IRQ_MODE_MSI)
+		vfio_pci_msi_cap_write(kvm, vdev, offset, data, sz);
+
 	if (pread(vdev->fd, base + offset, sz, info->offset + offset) != sz)
 		vfio_dev_warn(vdev, "Failed to read %d bytes from Configuration Space at 0x%x",
 			      sz, offset);
 }
 
+static ssize_t vfio_pci_msi_cap_size(struct msi_cap_64 *cap_hdr)
+{
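+	/*
+	 * The base capability (cap, next, ctrl, address_lo, data) is 10
+	 * bytes. A 64-bit address adds 4 bytes; per-vector masking adds
+	 * mask_bits, pend_bits and 2 bytes of padding.
+	 */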
+	size_t size = 10;
+
+	if (cap_hdr->ctrl & PCI_MSI_FLAGS_64BIT)
+		size += 4;
+	if (cap_hdr->ctrl & PCI_MSI_FLAGS_MASKBIT)
+		size += 10;
+
+	return size;
+}
+
 static ssize_t vfio_pci_cap_size(struct pci_cap_hdr *cap_hdr)
 {
 	switch (cap_hdr->type) {
 	case PCI_CAP_ID_MSIX:
 		return PCI_CAP_MSIX_SIZEOF;
+	case PCI_CAP_ID_MSI:
+		return vfio_pci_msi_cap_size((void *)cap_hdr);
 	default:
 		pr_err("unknown PCI capability 0x%x", cap_hdr->type);
 		return 0;
@@ -442,6 +567,14 @@ static int vfio_pci_parse_caps(struct vfio_device *vdev)
 			pdev->msix.pos = pos;
 			pdev->irq_modes |= VFIO_PCI_IRQ_MODE_MSIX;
 			break;
+		case PCI_CAP_ID_MSI:
+			ret = vfio_pci_add_cap(vdev, virt_hdr, cap, pos);
+			if (ret)
+				return ret;
+
+			pdev->msi.pos = pos;
+			pdev->irq_modes |= VFIO_PCI_IRQ_MODE_MSI;
+			break;
 		}
 	}
 
@@ -644,6 +777,19 @@ out_free:
 	return ret;
 }
 
+static int vfio_pci_create_msi_cap(struct kvm *kvm, struct vfio_pci_device *pdev)
+{
+	struct msi_cap_64 *cap = PCI_CAP(&pdev->hdr, pdev->msi.pos);
+
+	pdev->msi.nr_entries = 1 << ((cap->ctrl & PCI_MSI_FLAGS_QMASK) >> 1);
+	pdev->msi.entries = calloc(pdev->msi.nr_entries,
+				   sizeof(struct vfio_pci_msi_entry));
+	if (!pdev->msi.entries)
+		return -ENOMEM;
+
+	return 0;
+}
+
 static int vfio_pci_configure_bar(struct kvm *kvm, struct vfio_device *vdev,
 				  size_t nr)
 {
@@ -716,6 +862,12 @@ static int vfio_pci_configure_dev_regions(struct kvm *kvm,
 			return ret;
 	}
 
+	if (pdev->irq_modes & VFIO_PCI_IRQ_MODE_MSI) {
+		ret = vfio_pci_create_msi_cap(kvm, pdev);
+		if (ret)
+			return ret;
+	}
+
 	for (i = VFIO_PCI_BAR0_REGION_INDEX; i <= VFIO_PCI_BAR5_REGION_INDEX; ++i) {
 		/* Ignore top half of 64-bit BAR */
 		if (i % 2 && is_64bit)
@@ -971,6 +1123,16 @@ static int vfio_pci_configure_dev_irqs(struct kvm *kvm, struct vfio_device *vdev
 			return ret;
 	}
 
+	if (pdev->irq_modes & VFIO_PCI_IRQ_MODE_MSI) {
+		pdev->msi.info = (struct vfio_irq_info) {
+			.argsz = sizeof(pdev->msi.info),
+			.index = VFIO_PCI_MSI_IRQ_INDEX,
+		};
+		ret = vfio_pci_init_msis(kvm, vdev, &pdev->msi);
+		if (ret)
+			return ret;
+	}
+
 	if (pdev->irq_modes & VFIO_PCI_IRQ_MODE_INTX)
 		ret = vfio_pci_enable_intx(kvm, vdev);
 
@@ -1019,4 +1181,6 @@ void vfio_pci_teardown_device(struct kvm *kvm, struct vfio_device *vdev)
 
 	free(pdev->msix.irq_set);
 	free(pdev->msix.entries);
+	free(pdev->msi.irq_set);
+	free(pdev->msi.entries);
 }
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v6 kvmtool 11/13] vfio: Support non-mmappable regions
  2018-06-18 18:41 [PATCH v6 kvmtool 00/13] Add vfio-pci support Jean-Philippe Brucker
                   ` (9 preceding siblings ...)
  2018-06-18 18:42 ` [PATCH v6 kvmtool 10/13] vfio-pci: add MSI support Jean-Philippe Brucker
@ 2018-06-18 18:42 ` Jean-Philippe Brucker
  2018-06-18 18:42 ` [PATCH v6 kvmtool 12/13] Introduce reserved memory regions Jean-Philippe Brucker
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 15+ messages in thread
From: Jean-Philippe Brucker @ 2018-06-18 18:42 UTC (permalink / raw)
  To: kvm
  Cc: lorenzo.pieralisi, marc.zyngier, punit.agrawal, will.deacon,
	alex.williamson, robin.murphy, kvmarm

In some cases device regions don't support mmap. They can still be made
available to the guest by trapping all accesses and forwarding reads or
writes to VFIO. Such regions may be:

* PCI I/O port BARs.
* Sub-page regions, for example a 4kB region on a host with 64k pages.
* Similarly, sparse mmap regions. For example when VFIO allows mmapping
  fragments of a PCI BAR but forbids accessing things like MSI-X tables.
  We don't support the sparse capability at the moment, so trap these
  regions instead (if VFIO rejects the mmap); see the sketch below.
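
A condensed view of the decision this patch adds to vfio_map_region()
(names as in the diff below):

    /* Trap when VFIO says the region can't be mmap'd at all */
    if (!(region->info.flags & VFIO_REGION_INFO_FLAG_MMAP))
            return vfio_setup_trap_region(kvm, vdev, region);

    base = mmap(NULL, region->info.size, prot, MAP_SHARED, vdev->fd,
                region->info.offset);
    if (base == MAP_FAILED)
            /* e.g. a sparse-mmap BAR: fall back to trapping accesses */
            return vfio_setup_trap_region(kvm, vdev, region);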

Reviewed-by: Punit Agrawal <punit.agrawal@arm.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
---
 include/kvm/vfio.h |   3 +
 vfio/core.c        | 165 ++++++++++++++++++++++++++++++++++++++++-----
 vfio/pci.c         |  41 ++++++-----
 3 files changed, 176 insertions(+), 33 deletions(-)

diff --git a/include/kvm/vfio.h b/include/kvm/vfio.h
index 2c621ec75..60e6c54d6 100644
--- a/include/kvm/vfio.h
+++ b/include/kvm/vfio.h
@@ -83,8 +83,11 @@ struct vfio_pci_device {
 
 struct vfio_region {
 	struct vfio_region_info		info;
+	struct vfio_device		*vdev;
 	u64				guest_phys_addr;
 	void				*host_addr;
+	u32				port_base;
+	int				is_ioport	:1;
 };
 
 struct vfio_device {
diff --git a/vfio/core.c b/vfio/core.c
index 6c78e89f1..a4a257a70 100644
--- a/vfio/core.c
+++ b/vfio/core.c
@@ -1,5 +1,6 @@
 #include "kvm/kvm.h"
 #include "kvm/vfio.h"
+#include "kvm/ioport.h"
 
 #include <linux/list.h>
 
@@ -80,6 +81,141 @@ out_free_buf:
 	return ret;
 }
 
+static bool vfio_ioport_in(struct ioport *ioport, struct kvm_cpu *vcpu,
+			   u16 port, void *data, int len)
+{
+	u32 val;
+	ssize_t nr;
+	struct vfio_region *region = ioport->priv;
+	struct vfio_device *vdev = region->vdev;
+
+	u32 offset = port - region->port_base;
+
+	if (!(region->info.flags & VFIO_REGION_INFO_FLAG_READ))
+		return false;
+
+	nr = pread(vdev->fd, &val, len, region->info.offset + offset);
+	if (nr != len) {
+		vfio_dev_err(vdev, "could not read %d bytes from I/O port 0x%x",
+			     len, port);
+		return false;
+	}
+
+	switch (len) {
+	case 1:
+		ioport__write8(data, val);
+		break;
+	case 2:
+		ioport__write16(data, val);
+		break;
+	case 4:
+		ioport__write32(data, val);
+		break;
+	default:
+		return false;
+	}
+
+	return true;
+}
+
+static bool vfio_ioport_out(struct ioport *ioport, struct kvm_cpu *vcpu,
+			    u16 port, void *data, int len)
+{
+	u32 val;
+	ssize_t nr;
+	struct vfio_region *region = ioport->priv;
+	struct vfio_device *vdev = region->vdev;
+
+	u32 offset = port - region->port_base;
+
+	if (!(region->info.flags & VFIO_REGION_INFO_FLAG_WRITE))
+		return false;
+
+	switch (len) {
+	case 1:
+		val = ioport__read8(data);
+		break;
+	case 2:
+		val = ioport__read16(data);
+		break;
+	case 4:
+		val = ioport__read32(data);
+		break;
+	default:
+		return false;
+	}
+
+	nr = pwrite(vdev->fd, &val, len, region->info.offset + offset);
+	if (nr != len)
+		vfio_dev_err(vdev, "could not write %d bytes to I/O port 0x%x",
+			     len, port);
+
+	return nr == len;
+}
+
+static struct ioport_operations vfio_ioport_ops = {
+	.io_in	= vfio_ioport_in,
+	.io_out	= vfio_ioport_out,
+};
+
+static void vfio_mmio_access(struct kvm_cpu *vcpu, u64 addr, u8 *data, u32 len,
+			     u8 is_write, void *ptr)
+{
+	u64 val;
+	ssize_t nr;
+	struct vfio_region *region = ptr;
+	struct vfio_device *vdev = region->vdev;
+
+	u32 offset = addr - region->guest_phys_addr;
+
+	if (len < 1 || len > 8)
+		goto err_report;
+
+	if (is_write) {
+		if (!(region->info.flags & VFIO_REGION_INFO_FLAG_WRITE))
+			goto err_report;
+
+		memcpy(&val, data, len);
+
+		nr = pwrite(vdev->fd, &val, len, region->info.offset + offset);
+		if ((u32)nr != len)
+			goto err_report;
+	} else {
+		if (!(region->info.flags & VFIO_REGION_INFO_FLAG_READ))
+			goto err_report;
+
+		nr = pread(vdev->fd, &val, len, region->info.offset + offset);
+		if ((u32)nr != len)
+			goto err_report;
+
+		memcpy(data, &val, len);
+	}
+
+	return;
+
+err_report:
+	vfio_dev_err(vdev, "could not %s %u bytes at 0x%x (0x%llx)", is_write ?
+		     "write" : "read", len, offset, addr);
+}
+
+static int vfio_setup_trap_region(struct kvm *kvm, struct vfio_device *vdev,
+				  struct vfio_region *region)
+{
+	if (region->is_ioport) {
+		int port = ioport__register(kvm, IOPORT_EMPTY, &vfio_ioport_ops,
+					    region->info.size, region);
+		if (port < 0)
+			return port;
+
+		region->port_base = port;
+		return 0;
+	}
+
+	return kvm__register_mmio(kvm, region->guest_phys_addr,
+				  region->info.size, false, vfio_mmio_access,
+				  region);
+}
+
 int vfio_map_region(struct kvm *kvm, struct vfio_device *vdev,
 		    struct vfio_region *region)
 {
@@ -88,17 +224,8 @@ int vfio_map_region(struct kvm *kvm, struct vfio_device *vdev,
 	/* KVM needs page-aligned regions */
 	u64 map_size = ALIGN(region->info.size, PAGE_SIZE);
 
-	/*
-	 * We don't want to mess about trapping config accesses, so require that
-	 * they can be mmap'd. Note that for PCI, this precludes the use of I/O
-	 * BARs in the guest (we will hide them from Configuration Space, which
-	 * is trapped).
-	 */
-	if (!(region->info.flags & VFIO_REGION_INFO_FLAG_MMAP)) {
-		vfio_dev_info(vdev, "ignoring region %u, as it can't be mmap'd",
-			      region->info.index);
-		return 0;
-	}
+	if (!(region->info.flags & VFIO_REGION_INFO_FLAG_MMAP))
+		return vfio_setup_trap_region(kvm, vdev, region);
 
 	if (region->info.flags & VFIO_REGION_INFO_FLAG_READ)
 		prot |= PROT_READ;
@@ -108,10 +235,10 @@ int vfio_map_region(struct kvm *kvm, struct vfio_device *vdev,
 	base = mmap(NULL, region->info.size, prot, MAP_SHARED, vdev->fd,
 		    region->info.offset);
 	if (base == MAP_FAILED) {
-		ret = -errno;
-		vfio_dev_err(vdev, "failed to mmap region %u (0x%llx bytes)",
-			     region->info.index, region->info.size);
-		return ret;
+		/* TODO: support sparse mmap */
+		vfio_dev_warn(vdev, "failed to mmap region %u (0x%llx bytes), falling back to trapping",
+			 region->info.index, region->info.size);
+		return vfio_setup_trap_region(kvm, vdev, region);
 	}
 	region->host_addr = base;
 
@@ -127,7 +254,13 @@ int vfio_map_region(struct kvm *kvm, struct vfio_device *vdev,
 
 void vfio_unmap_region(struct kvm *kvm, struct vfio_region *region)
 {
-	munmap(region->host_addr, region->info.size);
+	if (region->host_addr) {
+		munmap(region->host_addr, region->info.size);
+	} else if (region->is_ioport) {
+		ioport__unregister(kvm, region->port_base);
+	} else {
+		kvm__deregister_mmio(kvm, region->guest_phys_addr);
+	}
 }
 
 static int vfio_configure_device(struct kvm *kvm, struct vfio_device *vdev)
diff --git a/vfio/pci.c b/vfio/pci.c
index 3ed07fb43..03de3c113 100644
--- a/vfio/pci.c
+++ b/vfio/pci.c
@@ -640,24 +640,27 @@ static int vfio_pci_fixup_cfg_space(struct vfio_device *vdev)
 	struct vfio_region_info *info;
 	struct vfio_pci_device *pdev = &vdev->pci;
 
-	/* Enable exclusively MMIO and bus mastering */
-	pdev->hdr.command &= ~PCI_COMMAND_IO;
-	pdev->hdr.command |= PCI_COMMAND_MEMORY | PCI_COMMAND_MASTER;
-
 	/* Initialise the BARs */
 	for (i = VFIO_PCI_BAR0_REGION_INDEX; i <= VFIO_PCI_BAR5_REGION_INDEX; ++i) {
+		u64 base;
 		struct vfio_region *region = &vdev->regions[i];
-		u64 base = region->guest_phys_addr;
+
+		/* Construct a fake reg to match what we've mapped. */
+		if (region->is_ioport) {
+			base = (region->port_base & PCI_BASE_ADDRESS_IO_MASK) |
+				PCI_BASE_ADDRESS_SPACE_IO;
+		} else {
+			base = (region->guest_phys_addr &
+				PCI_BASE_ADDRESS_MEM_MASK) |
+				PCI_BASE_ADDRESS_SPACE_MEMORY;
+		}
+
+		pdev->hdr.bar[i] = base;
 
 		if (!base)
 			continue;
 
 		pdev->hdr.bar_size[i] = region->info.size;
-
-		/* Construct a fake reg to match what we've mapped. */
-		pdev->hdr.bar[i] = (base & PCI_BASE_ADDRESS_MEM_MASK) |
-					PCI_BASE_ADDRESS_SPACE_MEMORY |
-					PCI_BASE_ADDRESS_MEM_TYPE_32;
 	}
 
 	/* I really can't be bothered to support cardbus. */
@@ -794,6 +797,7 @@ static int vfio_pci_configure_bar(struct kvm *kvm, struct vfio_device *vdev,
 				  size_t nr)
 {
 	int ret;
+	u32 bar;
 	size_t map_size;
 	struct vfio_pci_device *pdev = &vdev->pci;
 	struct vfio_region *region = &vdev->regions[nr];
@@ -801,6 +805,10 @@ static int vfio_pci_configure_bar(struct kvm *kvm, struct vfio_device *vdev,
 	if (nr >= vdev->info.num_regions)
 		return 0;
 
+	bar = pdev->hdr.bar[nr];
+
+	region->vdev = vdev;
+	region->is_ioport = !!(bar & PCI_BASE_ADDRESS_SPACE_IO);
 	region->info = (struct vfio_region_info) {
 		.argsz = sizeof(region->info),
 		.index = nr,
@@ -828,14 +836,13 @@ static int vfio_pci_configure_bar(struct kvm *kvm, struct vfio_device *vdev,
 		}
 	}
 
-	/* Grab some MMIO space in the guest */
-	map_size = ALIGN(region->info.size, PAGE_SIZE);
-	region->guest_phys_addr = pci_get_io_space_block(map_size);
+	if (!region->is_ioport) {
+		/* Grab some MMIO space in the guest */
+		map_size = ALIGN(region->info.size, PAGE_SIZE);
+		region->guest_phys_addr = pci_get_io_space_block(map_size);
+	}
 
-	/*
-	 * Map the BARs into the guest. We'll later need to update
-	 * configuration space to reflect our allocation.
-	 */
+	/* Map the BARs into the guest or setup a trap region. */
 	ret = vfio_map_region(kvm, vdev, region);
 	if (ret)
 		return ret;
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v6 kvmtool 12/13] Introduce reserved memory regions
  2018-06-18 18:41 [PATCH v6 kvmtool 00/13] Add vfio-pci support Jean-Philippe Brucker
                   ` (10 preceding siblings ...)
  2018-06-18 18:42 ` [PATCH v6 kvmtool 11/13] vfio: Support non-mmappable regions Jean-Philippe Brucker
@ 2018-06-18 18:42 ` Jean-Philippe Brucker
  2018-06-18 18:42 ` [PATCH v6 kvmtool 13/13] vfio: check reserved regions before mapping DMA Jean-Philippe Brucker
  2018-06-19 12:19 ` [PATCH v6 kvmtool 00/13] Add vfio-pci support Will Deacon
  13 siblings, 0 replies; 15+ messages in thread
From: Jean-Philippe Brucker @ 2018-06-18 18:42 UTC (permalink / raw)
  To: kvm
  Cc: lorenzo.pieralisi, marc.zyngier, punit.agrawal, will.deacon,
	alex.williamson, robin.murphy, kvmarm

When passing devices to the guest, there might be address ranges
unavailable to the device. For instance, if address 0x10000000 corresponds
to an MSI doorbell, any transaction from a device to that address will be
directed to the MSI controller and might not even reach the IOMMU. In that
case 0x10000000 is reserved by the physical IOMMU in the guest's physical
space.

This patch introduces a simple API to register reserved ranges of
addresses that should not or cannot be provided to the guest. For the
moment it only checks that a reserved range does not overlap any user
memory (we don't consider MMIO) and aborts otherwise.

It should be possible instead to poke holes in the guest-physical memory
map and report them via the architecture's preferred route:
* ARM and PowerPC can add reserved-memory nodes to the DT they provide to
  the guest.
* x86 could poke holes in the memory map reported with e820. This requires
  postponing creation of the memory map until at least VFIO is initialized.
* MIPS could describe the reserved ranges with the "memmap=mm$ss" kernel
  parameter.

This would also require calling KVM_SET_USER_MEMORY_REGION for all memory
regions at the end of kvmtool initialisation. Extra care should be taken
to ensure we don't break any architecture, since they currently rely on
having a linear address space with at most two memory blocks.

This patch doesn't implement any address space carving. If an abort is
encountered, the user can try rebuilding kvmtool with different addresses
or, where possible, changing the device's IOMMU reserved regions.
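
A minimal usage sketch of the new helper (the doorbell address below is
hypothetical):

    /* Carve a hardware MSI doorbell out of the guest-physical space */
    int ret = kvm__reserve_mem(kvm, 0x10000000, 0x10000);
    if (ret)        /* a range overlapping user memory returns -EINVAL */
            pr_err("cannot reserve MSI doorbell region");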

Reviewed-by: Punit Agrawal <punit.agrawal@arm.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
---
 include/kvm/kvm.h | 10 +++++++
 kvm.c             | 68 +++++++++++++++++++++++++++++++++++++----------
 2 files changed, 64 insertions(+), 14 deletions(-)

diff --git a/include/kvm/kvm.h b/include/kvm/kvm.h
index 19f7d265c..1edacfdfc 100644
--- a/include/kvm/kvm.h
+++ b/include/kvm/kvm.h
@@ -37,9 +37,11 @@ enum {
 enum kvm_mem_type {
 	KVM_MEM_TYPE_RAM	= 1 << 0,
 	KVM_MEM_TYPE_DEVICE	= 1 << 1,
+	KVM_MEM_TYPE_RESERVED	= 1 << 2,
 
 	KVM_MEM_TYPE_ALL	= KVM_MEM_TYPE_RAM
 				| KVM_MEM_TYPE_DEVICE
+				| KVM_MEM_TYPE_RESERVED
 };
 
 struct kvm_ext {
@@ -115,6 +117,12 @@ static inline int kvm__register_dev_mem(struct kvm *kvm, u64 guest_phys,
 				 KVM_MEM_TYPE_DEVICE);
 }
 
+static inline int kvm__reserve_mem(struct kvm *kvm, u64 guest_phys, u64 size)
+{
+	return kvm__register_mem(kvm, guest_phys, size, NULL,
+				 KVM_MEM_TYPE_RESERVED);
+}
+
 int kvm__register_mmio(struct kvm *kvm, u64 phys_addr, u64 phys_addr_len, bool coalesce,
 		       void (*mmio_fn)(struct kvm_cpu *vcpu, u64 addr, u8 *data, u32 len, u8 is_write, void *ptr),
 			void *ptr);
@@ -150,6 +158,8 @@ static inline const char *kvm_mem_type_to_string(enum kvm_mem_type type)
 		return "RAM";
 	case KVM_MEM_TYPE_DEVICE:
 		return "device";
+	case KVM_MEM_TYPE_RESERVED:
+		return "reserved";
 	}
 
 	return "???";
diff --git a/kvm.c b/kvm.c
index e9c3c5fcb..7de825a9d 100644
--- a/kvm.c
+++ b/kvm.c
@@ -177,18 +177,55 @@ int kvm__exit(struct kvm *kvm)
 }
 core_exit(kvm__exit);
 
-/*
- * Note: KVM_SET_USER_MEMORY_REGION assumes that we don't pass overlapping
- * memory regions to it. Therefore, be careful if you use this function for
- * registering memory regions for emulating hardware.
- */
 int kvm__register_mem(struct kvm *kvm, u64 guest_phys, u64 size,
 		      void *userspace_addr, enum kvm_mem_type type)
 {
 	struct kvm_userspace_memory_region mem;
+	struct kvm_mem_bank *merged = NULL;
 	struct kvm_mem_bank *bank;
 	int ret;
 
+	/* Check for overlap */
+	list_for_each_entry(bank, &kvm->mem_banks, list) {
+		u64 bank_end = bank->guest_phys_addr + bank->size - 1;
+		u64 end = guest_phys + size - 1;
+		if (guest_phys > bank_end || end < bank->guest_phys_addr)
+			continue;
+
+		/* Merge overlapping reserved regions */
+		if (bank->type == KVM_MEM_TYPE_RESERVED &&
+		    type == KVM_MEM_TYPE_RESERVED) {
+			bank->guest_phys_addr = min(bank->guest_phys_addr, guest_phys);
+			bank->size = max(bank_end, end) - bank->guest_phys_addr + 1;
+
+			if (merged) {
+				/*
+				 * This is at least the second merge, remove
+				 * previous result.
+				 */
+				list_del(&merged->list);
+				free(merged);
+			}
+
+			guest_phys = bank->guest_phys_addr;
+			size = bank->size;
+			merged = bank;
+
+			/* Keep checking that we don't overlap another region */
+			continue;
+		}
+
+		pr_err("%s region [%llx-%llx] would overlap %s region [%llx-%llx]",
+		       kvm_mem_type_to_string(type), guest_phys, guest_phys + size - 1,
+		       kvm_mem_type_to_string(bank->type), bank->guest_phys_addr,
+		       bank->guest_phys_addr + bank->size - 1);
+
+		return -EINVAL;
+	}
+
+	if (merged)
+		return 0;
+
 	bank = malloc(sizeof(*bank));
 	if (!bank)
 		return -ENOMEM;
@@ -199,18 +236,21 @@ int kvm__register_mem(struct kvm *kvm, u64 guest_phys, u64 size,
 	bank->size			= size;
 	bank->type			= type;
 
-	mem = (struct kvm_userspace_memory_region) {
-		.slot			= kvm->mem_slots++,
-		.guest_phys_addr	= guest_phys,
-		.memory_size		= size,
-		.userspace_addr		= (unsigned long)userspace_addr,
-	};
+	if (type != KVM_MEM_TYPE_RESERVED) {
+		mem = (struct kvm_userspace_memory_region) {
+			.slot			= kvm->mem_slots++,
+			.guest_phys_addr	= guest_phys,
+			.memory_size		= size,
+			.userspace_addr		= (unsigned long)userspace_addr,
+		};
 
-	ret = ioctl(kvm->vm_fd, KVM_SET_USER_MEMORY_REGION, &mem);
-	if (ret < 0)
-		return -errno;
+		ret = ioctl(kvm->vm_fd, KVM_SET_USER_MEMORY_REGION, &mem);
+		if (ret < 0)
+			return -errno;
+	}
 
 	list_add(&bank->list, &kvm->mem_banks);
+
 	return 0;
 }
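
For example (hypothetical values), registering the reserved range
[0x1000, 0x1fff] and then the overlapping reserved range [0x1800, 0x2fff]
leaves a single merged bank covering [0x1000, 0x2fff]:

    kvm__reserve_mem(kvm, 0x1000, 0x1000);  /* creates a new bank */
    kvm__reserve_mem(kvm, 0x1800, 0x1800);  /* merged into the first */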
 
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v6 kvmtool 13/13] vfio: check reserved regions before mapping DMA
  2018-06-18 18:41 [PATCH v6 kvmtool 00/13] Add vfio-pci support Jean-Philippe Brucker
                   ` (11 preceding siblings ...)
  2018-06-18 18:42 ` [PATCH v6 kvmtool 12/13] Introduce reserved memory regions Jean-Philippe Brucker
@ 2018-06-18 18:42 ` Jean-Philippe Brucker
  2018-06-19 12:19 ` [PATCH v6 kvmtool 00/13] Add vfio-pci support Will Deacon
  13 siblings, 0 replies; 15+ messages in thread
From: Jean-Philippe Brucker @ 2018-06-18 18:42 UTC (permalink / raw)
  To: kvm
  Cc: lorenzo.pieralisi, marc.zyngier, punit.agrawal, will.deacon,
	alex.williamson, robin.murphy, kvmarm

Use the new reserved_regions API to ensure that RAM doesn't overlap any
reserved region. This prevents, for instance, mapping an MSI doorbell
into the guest IPA space. For the moment we reject any overlap. In the
future, we might carve reserved regions out of the guest physical
space.
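
For illustration, the sysfs file parsed here contains one
"<start> <end> <type>" triplet per line (addresses below are made up):

    $ cat /sys/kernel/iommu_groups/0/reserved_regions
    0x0000000008000000 0x00000000080fffff msi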

Reviewed-by: Punit Agrawal <punit.agrawal@arm.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
---
 vfio/core.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)

diff --git a/vfio/core.c b/vfio/core.c
index a4a257a70..17b5b0cfc 100644
--- a/vfio/core.c
+++ b/vfio/core.c
@@ -379,6 +379,51 @@ static int vfio_unmap_mem_bank(struct kvm *kvm, struct kvm_mem_bank *bank, void
 	return 0;
 }
 
+static int vfio_configure_reserved_regions(struct kvm *kvm,
+					   struct vfio_group *group)
+{
+	FILE *file;
+	int ret = 0;
+	char type[9];
+	char filename[PATH_MAX];
+	unsigned long long start, end;
+
+	snprintf(filename, PATH_MAX, IOMMU_GROUP_DIR "/%lu/reserved_regions",
+		 group->id);
+
+	/* reserved_regions might not be present on older systems */
+	if (access(filename, F_OK))
+		return 0;
+
+	file = fopen(filename, "r");
+	if (!file)
+		return -errno;
+
+	while (fscanf(file, "0x%llx 0x%llx %8s\n", &start, &end, type) == 3) {
+		ret = kvm__reserve_mem(kvm, start, end - start + 1);
+		if (ret)
+			break;
+	}
+
+	fclose(file);
+
+	return ret;
+}
+
+static int vfio_configure_groups(struct kvm *kvm)
+{
+	int ret;
+	struct vfio_group *group;
+
+	list_for_each_entry(group, &vfio_groups, list) {
+		ret = vfio_configure_reserved_regions(kvm, group);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
 static struct vfio_group *vfio_group_create(struct kvm *kvm, unsigned long id)
 {
 	int ret;
@@ -597,6 +642,10 @@ static int vfio__init(struct kvm *kvm)
 	if (ret)
 		return ret;
 
+	ret = vfio_configure_groups(kvm);
+	if (ret)
+		return ret;
+
 	ret = vfio_configure_devices(kvm);
 	if (ret)
 		return ret;
-- 
2.17.0

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v6 kvmtool 00/13] Add vfio-pci support
  2018-06-18 18:41 [PATCH v6 kvmtool 00/13] Add vfio-pci support Jean-Philippe Brucker
                   ` (12 preceding siblings ...)
  2018-06-18 18:42 ` [PATCH v6 kvmtool 13/13] vfio: check reserved regions before mapping DMA Jean-Philippe Brucker
@ 2018-06-19 12:19 ` Will Deacon
  13 siblings, 0 replies; 15+ messages in thread
From: Will Deacon @ 2018-06-19 12:19 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: lorenzo.pieralisi, kvm, marc.zyngier, punit.agrawal,
	alex.williamson, robin.murphy, kvmarm

On Mon, Jun 18, 2018 at 07:41:58PM +0100, Jean-Philippe Brucker wrote:
> This is version six of the VFIO support in kvmtool. It addresses
> Will's comments for v5.
> 
> Implement vfio-pci support in kvmtool. Devices are assigned to the guest
> by passing "--vfio-pci [domain:]bus:dev.fn" to lkvm run, after having
> bound the device to the VFIO driver (see Documentation/vfio.txt)
> 
> This time around I also tested assignment of the x540 NIC on my old x86
> desktop as well (previously only on AMD Seattle, an arm64 host). It
> worked smoothly.

Thanks. Applied and pushed out.

Will

^ permalink raw reply	[flat|nested] 15+ messages in thread
